Data detox: training yourself for the messy, noisy, real world

Image by author

Introduction

We've all spent hours debugging a model, only to discover that the problem was not the algorithm but an erroneous null value in row 47,832 quietly skewing the results. Kaggle competitions give the impression that data arrives as clean, well-labeled CSVs with no class imbalance problems, but real-world data rarely looks like that.

In this article, we will use a real-life data project to explore four practical exercises that prepare you to deal with messy, unstructured, real-world datasets.

NoBroker Data Project: A practical test of real-world chaos

NoBroker is an Indian property technology (prop-tech) company that directly connects property owners and tenants in a broker-free marketplace.


This data project was used during the recruitment process for data science positions at NoBroker.

In this data project, NoBroker asks you to build a model that predicts how many interactions a property will receive within a certain time frame. We won't complete the entire project here, but it gives us a realistic way to practice on messy, real-world data.

It has three datasets, which we load in a short sketch after this list:

  • property_data_set.csv
    • Property details such as type, location, size, rent, and other amenities.
  • property_photos.tsv
    • Photo URLs for each property.
  • property_interactions.csv
    • Timestamps of user interactions with each property.
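
Here is a minimal loading sketch, assuming the three files sit in the working directory. Note that property_photos.tsv is tab-separated, so it needs an explicit separator.

import pandas as pd

# Property details: one row per listing
data = pd.read_csv('property_data_set.csv')

# The photo metadata is a TSV, so the tab separator must be passed explicitly
pics = pd.read_csv('property_photos.tsv', sep='\t')

# Interaction timestamps: one row per user request on a property
interaction = pd.read_csv('property_interactions.csv')

print(data.shape, pics.shape, interaction.shape)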

Comparing Clean Interview Data vs. Real Production Data: Reality Check

Interview datasets are polished, balanced, and boring. Actual production data? It's a dumpster fire of missing values, duplicate rows, inconsistent formats, and silent errors that wait until 5 p.m. on Friday to break your pipeline.

Take the NoBroker property dataset, which is a real-world mess with 28,888 properties in three tables. At first glance it looks fine. But dig deeper, and you’ll find 11,022 missing photo uniform resource locators (URLs), corrupted JSON strings with rogue backslashes, and more.

That is the line between clean and messy. Clean data trains you to build models; production data trains you to survive.

We will walk through four exercises you can use to train yourself.


Practice #1: Handling Missing Data

Missing data isn't just annoying; it is a decision point. Delete the row? Fill it with the mean? Mark it as unknown? The answer depends on why the data is missing and how much loss you can tolerate.

Missing values showed up in three different columns of the NoBroker dataset. The photo_urls column alone was missing 11,022 values out of 28,888 rows, which is 38% of the dataset.
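
The notebook's own inspection code isn't reproduced here, so the following is a minimal sketch of the same check, assuming the photos table is loaded as pics.

# Count missing photo URLs and express them as a share of all rows
missing = pics['photo_urls'].isnull().sum()
total = len(pics)
print(f"Missing photo_urls: {missing} of {total} rows ({missing / total:.0%})")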


Deleting these rows would destroy valuable property records. Instead, the solution was to treat a missing photo URL as zero photos and move on.

import json

import pandas as pd


def correction(x):
    # A missing or literal 'NaN' entry means the listing has no photos
    if pd.isna(x) or x == 'NaN':
        return 0
    # Repair the malformed JSON string, then count the photo entries
    return len(json.loads(x.replace('\\', '').replace('{title', '{"title')))


pics['photo_count'] = pics['photo_urls'].apply(correction)

For numeric columns like total_floor (23 missing) and categorical columns like building_type (38 missing), the strategy was imputation: fill numeric columns with the mean and categorical columns with the mode.

# Impute numeric columns with the column mean
for col in x_remain_withNull.columns:
    x_remain[col] = x_remain_withNull[col].fillna(x_remain_withNull[col].mean())

# Impute categorical columns with the most frequent value
for col in x_cat_withNull.columns:
    x_cat[col] = x_cat_withNull[col].fillna(x_cat_withNull[col].mode()[0])

The first rule: do not delete anything without questioning it.

Understand the pattern. The missing photo URLs were not random.
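
One way to probe that pattern is to check whether missingness lines up with other columns after joining the tables. The sketch below is illustrative rather than the project's code; the join key property_id is an assumed column name, so adjust it to the real identifier.

# Illustrative check: does the share of missing photos vary by building type?
merged = data.merge(pics, on='property_id', how='left')  # 'property_id' is an assumed key
missing_rate = merged['photo_urls'].isnull().groupby(merged['building_type']).mean()
print(missing_rate.sort_values(ascending=False))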

Practice #2: Detecting Outliers

An outlier is not always an error, but it is always suspicious.

Can you imagine an 800-year-old property with 21 bathrooms or 40,000 square feet of space? Either you found your dream home or someone made a data entry error.

The NoBroker dataset was full of these red flags. Box plots revealed extreme values in several columns: property ages greater than 100, sizes greater than 10,000 square feet (sq. ft.), and deposit amounts greater than $3.5 million. A few were legitimate luxury properties; most were data entry mistakes.

import matplotlib.pyplot as plt

# Box plots expose extreme values in every numeric column at a glance
df_num.plot(kind='box', subplots=True, figsize=(22, 10))
plt.show()


The solution was interquartile range (IQR)-based outlier removal, a simple statistical method that flags values falling more than 2 times the IQR below the first quartile or above the third quartile.

To handle this, we first write a function that removes those outliers.

def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1
    # A multiplier of 2 gives a looser fence than the conventional 1.5
    fence_low = q1 - 2 * iqr
    fence_high = q3 + 2 * iqr
    df_out = df_in.loc[(df_in[col_name] <= fence_high) & (df_in[col_name] >= fence_low)]
    return df_out

And we run this code on numeric columns.

df = dataset.copy()
for col in df_num.columns:
    if col in ('gym', 'lift', 'swimming_pool', 'request_day_within_3d', 'request_day_within_7d'):
        continue  # Skip binary and target columns
    df = remove_outlier(df, col)

print(f"Before: {dataset.shape[0]} rows")
print(f"After: {df.shape[0]} rows")
removed = dataset.shape[0] - df.shape[0]
print(f"Removed: {removed} rows ({removed / dataset.shape[0] * 100:.1f}% reduction)")


After removing outliers, the dataset shrank from 17,386 rows to 15,170, losing 12.7% of the data. The trade-off was worth it.

For the target variable request_day_within_3d, capping was used instead of removal. Values above 10 were capped at 10 to prevent extreme outliers from distorting the predictions. In the following code, we also compare the results before and after capping.

def capping_for_3days(x):
    num = 10
    return num if x > num else x

df['request_day_within_3d_capping'] = df['request_day_within_3d'].apply(capping_for_3days)

before_count = (df['request_day_within_3d'] > 10).sum()
after_count = (df['request_day_within_3d_capping'] > 10).sum()
total_rows = len(df)
change_count = before_count - after_count
percent_change = (change_count / total_rows) * 100

print(f"Before capping (>10): {before_count}")
print(f"After capping (>10): {after_count}")
print(f"Reduced by: {change_count} ({percent_change:.2f}% of total rows affected)")

The outcome?


A cleaner target, better model performance, and fewer debugging sessions.

Practice #3: Dealing with Duplicates and Inconsistencies

Duplicates are easy; inconsistencies are hard. An exact duplicate row is just a df.drop_duplicates() call away. An inconsistent format, such as a JSON string that has been scrambled by three different systems, requires detective work.
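
For the easy half, a minimal duplicate sweep can look like the sketch below, assuming the property table is loaded as data.

# Count exact duplicate rows, then drop them and reset the index
dup_count = data.duplicated().sum()
print(f"Exact duplicate rows: {dup_count}")

data = data.drop_duplicates().reset_index(drop=True)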

The NoBroker dataset had one of the worst JSON inconsistencies I've seen. The photo_urls column should have contained valid JSON arrays, but instead it was filled with malformed strings, missing quotes, escaped backslashes, and random trailing characters.

# Inspect the first raw photo_urls value before any correction
text_before = pics['photo_urls'][0]
print('Before Correction:\n\n', text_before)

This prints the first raw value, so you can see how mangled it is.


The fix required multiple string replacements to correct the formatting before parsing. Here is the code.

# Strip stray backslashes, restore missing quotes, and fix broken array brackets
text_after = (text_before.replace('\\', '')
                         .replace('{title', '{"title')
                         .replace(']"', ']')
                         .replace('],"', ']","'))
parsed_json = json.loads(text_after)


After these replacements, the JSON was valid and parsable. Chained string replacements are not the cleanest way to handle this kind of cleanup, but they work.

You see inconsistent formats everywhere: dates saved as strings, typos in categorical values, and numeric IDs saved as floats.

The solution is standardization, just as we standardized the JSON formatting.
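
As a rough illustration of what that standardization can look like in pandas, here is a sketch; the city and property_id columns are hypothetical stand-ins for whatever your dataset actually contains.

# Dates saved as strings -> proper datetimes
data['activation_date'] = pd.to_datetime(data['activation_date'], dayfirst=True)

# Whitespace and casing noise in categorical values -> one canonical form
data['city'] = data['city'].str.strip().str.lower()  # hypothetical column

# Numeric IDs saved as floats -> nullable integers
data['property_id'] = data['property_id'].astype('Int64')  # hypothetical column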

Practice #4: Data Type Validation and Schema Checking

It all starts when you load your data. Discovering later that your dates are strings or your numbers are objects is a waste of time.

In the NoBroker project, types were validated right at the CSV read by passing the correct parsing parameters to pandas. Here is the code.

# Without parse_dates, the date column is read as plain strings (object dtype)
data = pd.read_csv('property_data_set.csv')
print(data['activation_date'].dtype)

# Parsing dates at read time yields a proper datetime64 column
data = pd.read_csv('property_data_set.csv',
                   parse_dates=['activation_date'],
                   infer_datetime_format=True,
                   dayfirst=True)
print(data['activation_date'].dtype)

The first print shows object; after date parsing, the dtype becomes datetime64[ns].

The same validation was applied to the interaction dataset.

interaction = pd.read_csv('property_interactions.csv',
                          parse_dates=['request_date'],
                          infer_datetime_format=True,
                          dayfirst=True)

This was not just good practice; it was necessary for everything downstream. The project required calculating the time difference between activation and request dates.

If the dates were still strings, the following code would raise an error.

num_req['request_day'] = (num_req['request_date'] - num_req['activation_date']) / np.timedelta64(1, 'D')

Schema checking ensures that the structure does not change, but in reality the data itself will also drift as its distribution shifts over time. You can simulate this drift by slightly perturbing the inputs and checking whether your model or its validation catches the change.
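
Neither check needs heavy tooling. The sketch below shows both ideas in their simplest form: a hard assertion on expected dtypes, and a crude drift probe that perturbs one input and compares a summary statistic. The rent column and its dtype are assumptions, not the project's actual schema.

# 1) Schema check: fail fast if a column is missing or has the wrong dtype
expected = {'activation_date': 'datetime64[ns]', 'rent': 'float64'}  # 'rent' is an assumed column
for col, dtype in expected.items():
    assert col in data.columns, f"Missing column: {col}"
    assert str(data[col].dtype) == dtype, f"{col} is {data[col].dtype}, expected {dtype}"

# 2) Drift simulation: nudge one feature and compare a summary statistic
drifted = data.copy()
drifted['rent'] = drifted['rent'] * 1.10  # simulate a 10% upward shift
print("Mean rent before drift:", data['rent'].mean())
print("Mean rent after drift: ", drifted['rent'].mean())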

Documenting Your Cleaning Steps

In three months, you won't remember why you capped request_day_within_3d at 10. In six months, a teammate will break the pipeline by removing your outlier filter. In a year, the model will be in production, and no one will understand why it fails.

Documentation is not optional. It is the difference between a reproducible pipeline and a voodoo script that works until it doesn't.

The NoBroker project documented each change in code comments and structured notebook sections with explanations and tables of contents.

# Assignment
# Read and Explore All Datasets
# Data Engineering
Handling Pics Data
Number of Interactions Within 3 Days
Number of Interactions Within 7 Days
Merge Data
# Exploratory Data Analysis and Processing
# Feature Engineering
Remove Outliers
One-Hot Encoding
MinMaxScaler
Classical Machine Learning
Predicting Interactions Within 3 Days
Deep Learning
# Try to correct the first Json
# Try to replace corrupted values then convert to json
# Function to correct corrupted json and get count of photos

Version control also matters. Track changes to your cleanup logic. Save the intermediate dataset. Keep a changelog of what you tried and what worked.
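
A small sketch of what that can look like in practice; the file names here are arbitrary, and the logged details simply echo the cleaning decisions made earlier.

from datetime import date

version = f"v{date.today():%Y%m%d}"

# Snapshot the cleaned dataset so the exact training inputs are reproducible
df.to_csv(f'property_data_clean_{version}.csv', index=False)

# Append a human-readable note about what changed and why
with open('cleaning_changelog.md', 'a') as log:
    log.write(f"- {version}: capped request_day_within_3d at 10; "
              f"removed IQR outliers (2x fence); rows now {len(df)}\n")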

The goal is not perfection. The goal is clarity. If you can’t explain why you made a decision, you can’t defend it if the model fails.

Final Thoughts

Clean data is a myth. The best data scientists are not the ones who shy away from messy datasets; they are the ones who know how to tame them. They discover missing values before training.

They are able to identify outliers before they affect the predictions. They check the schema before joining tables. And they write everything down so the next person doesn’t have to start from scratch.

Real impact does not come from perfect data. It comes from the ability to handle imperfect data and still build something that works.

So when you open a dataset and see null values, broken strings, and outliers, don't panic. What you're looking at is not a problem; it is an opportunity to sharpen your skills on a real-world dataset.

Nate Rosidi is a data scientist and works in product strategy. He is also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.
