From Dirty to Clean: 8 Python Tricks for Intuitive Data Preprocessing

# Introduction

While data preprocessing is highly relevant in data science and machine learning workflows, it is often not carried out properly, largely because it is perceived as overly complex, time-consuming, or dependent on extensive custom code. As a result, practitioners may delay essential tasks such as data cleaning, rely on brittle ad-hoc solutions that are unsustainable in the long run, or over-engineer solutions to problems that are simple at their core.

This article presents 8 Python tricks to turn raw, messy data into clean, neatly preprocessed data with minimal effort.

Before looking at the specific tricks and their accompanying code examples, the following introductory code imports the necessary libraries and defines a toy dataset used to illustrate each trick:

import pandas as pd
import numpy as np

# A tiny, intentionally messy dataset
df = pd.DataFrame({
    " User Name ": [" Alice ", "bob", "Bob", "alice", None],
    "Age": ["25", "30", "?", "120", "28"],
    "Income$": ["50000", "60000", None, "1000000", "55000"],
    "Join Date": ["2023-01-01", "01/15/2023", "not a date", None, "2023-02-01"],
    "City": ["New York", "new york ", "NYC", "New York", "nyc"],
})

# 1. Normalizing Column Names Quickly

This is a very useful one-liner: a single line of code normalizes the names of all columns in a dataset. The specifics depend on how exactly you want to normalize your attribute names, but the following example strips leading and trailing whitespace, lowercases everything, and replaces the remaining spaces with underscores, yielding a consistent, standardized naming convention. This helps prevent annoying bugs and typos in downstream tasks. No need to rename column by column!

df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
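As a minimal sketch of the effect, the snippet below applies the same chain to the column names of the toy dataset. The `demo` frame and the regex-based variant are illustrative assumptions rather than part of the trick itself; the variant simply collapses runs of whitespace into a single underscore.

```python
import pandas as pd

# Hypothetical empty frame that only reuses the toy dataset's column names
demo = pd.DataFrame(columns=[" User Name ", "Age", "Income$", "Join Date", "City"])

# Same chain as above: trim, lowercase, then replace spaces with underscores
demo.columns = demo.columns.str.strip().str.lower().str.replace(" ", "_")
print(list(demo.columns))

# Assumed variant: a regex collapses repeated whitespace into one underscore
messy = pd.Index(["First   Name", " Zip\tCode "])
tidy = messy.str.strip().str.lower().str.replace(r"\s+", "_", regex=True)
print(list(tidy))
```

Note that symbols such as `$` are left untouched by both versions, which is why the income column is later referred to as `income$`.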

# 2. Stripping Whitespace from Strings at Scale

Sometimes you just want to make sure that junk invisible to the human eye, such as spaces at the beginning or end of string (categorical) values, is systematically removed across the entire dataset. This trick does so neatly for all columns containing strings, leaving other columns, such as numeric ones, unchanged.

df = df.apply(lambda s: s.str.strip() if s.dtype == "object" else s)
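The dtype check above assumes text columns are stored with the `object` dtype, which may not hold for every pandas version or configuration. A hedged sketch of a slightly more defensive variant follows; `strip_text_columns` and the `demo` frame are hypothetical names introduced only for illustration.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype

# Hypothetical mini-frame: one text column with stray spaces, one numeric column
demo = pd.DataFrame({
    "city": [" New York", "new york ", " NYC "],
    "age": [25, 30, 28],
})

def strip_text_columns(frame):
    """Strip leading/trailing whitespace in text columns only (illustrative helper)."""
    def strip_if_text(s):
        # Accept both the classic object dtype and dedicated string dtypes
        if is_object_dtype(s) or is_string_dtype(s):
            return s.str.strip()
        return s
    return frame.apply(strip_if_text)

clean = strip_text_columns(demo)
print(clean["city"].tolist())
print(clean["age"].tolist())
```

Only the surrounding whitespace is removed, so `"new york"` and `"NYC"` remain distinct values until categories are standardized.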

# 3. Converting Numeric Columns Safely

If we are not 100% sure that all values in a numeric column follow the same format, it is generally a good idea to convert them explicitly to a numeric type, turning strings that merely look like numbers into actual numbers and marking anything unparseable as NaN. One line does what would otherwise require try-except blocks and a more manual cleaning process.

df["age"] = pd.to_numeric(df["age"], errors="coerce")
df["income$"] = pd.to_numeric(df["income$"], errors="coerce")

Note that other classical approaches such as `df["column"].astype(float)` can raise an error if they encounter invalid raw values that cannot be trivially converted to numeric.
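To make the contrast concrete, here is a small sketch on a standalone series that mirrors the toy age column. The variable names are illustrative assumptions.

```python
import pandas as pd

# Mirrors the toy "age" values, including the unparseable "?"
raw_age = pd.Series(["25", "30", "?", "120", "28"])

# errors="coerce" turns "?" into NaN instead of raising
age = pd.to_numeric(raw_age, errors="coerce")
print(age.tolist())

# The stricter conversion fails on the same input
try:
    raw_age.astype(float)
    strict_failed = False
except ValueError:
    strict_failed = True
print("astype(float) failed:", strict_failed)
```

Coercion hides the invalid entries rather than fixing them, so counting the resulting missing values with `age.isna().sum()` is a cheap sanity check after the conversion.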

# 4. Parsing Dates with `errors="coerce"`

This is a similar validation-oriented procedure, applied to a different data type. This trick converts date-time values that are valid and nullifies those that are not. Using `errors="coerce"` is the key to telling Pandas that invalid, unconvertible values should be converted to `NaT` (Not a Time) instead of raising an error and crashing the program during execution.

df["join_date"] = pd.to_datetime(df["join_date"], errors="coerce")
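The following sketch shows the coercion on a standalone series with a single, consistent date format. The series and variable names are assumptions made for illustration.

```python
import pandas as pd

# Valid ISO dates mixed with an invalid string and a missing value
raw_dates = pd.Series(["2023-01-01", "not a date", None, "2023-02-01"])

# Invalid or missing entries become NaT; valid ones become timestamps
parsed = pd.to_datetime(raw_dates, errors="coerce")
print(parsed.isna().tolist())
print(parsed.iloc[0])
```

One caveat worth checking on your own setup: when a column mixes formats, as the toy `join_date` column does with `01/15/2023`, some pandas versions infer a single format from the first values and may coerce the differently formatted ones to `NaT` as well. pandas 2.0 and later accept `format="mixed"` to parse each value individually.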

# 5. Fixing Missing Values with Smart Defaults

For those unfamiliar with strategies for handling missing values other than dropping entire rows, this trick imputes those values, filling in the gaps, using statistically driven defaults such as the median or the mode. It is an efficient, one-liner-based strategy that can be adjusted with different default aggregates. The `[0]` index accompanying the mode is used to obtain a single value in case of ties between two or more "most frequent" values.

df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])
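A brief sketch of why these defaults behave sensibly, using made-up standalone series: the median is barely affected by the extreme value 120, and `mode()` returns every tied value, so one of them has to be picked explicitly.

```python
import numpy as np
import pandas as pd

# Illustrative data: one missing age, and a tie between two cities
age = pd.Series([25, 30, np.nan, 120, 28])
city = pd.Series(["NYC", "LA", "NYC", "LA", None])

# Median ignores NaN and resists the outlier better than the mean
print(age.median(), age.mean())
age_filled = age.fillna(age.median())

# mode() returns all tied values in sorted order; take the first one
modes = city.mode()
print(list(modes))
city_filled = city.fillna(modes.iloc[0])
print(city_filled.tolist())
```

Here `modes.iloc[0]` plays the same role as the `[0]` index in the one-liner above.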

# 6. Standardizing Categories with Maps

In categorical columns with diverse values, such as cities, it is also necessary to standardize names and collapse possible inconsistencies in order to obtain clean group names and make downstream group aggregations like `groupby()` reliable and effective. Aided by a dictionary, this example applies a one-to-one mapping on lowercased string values related to New York City, ensuring all of them are uniformly denoted by "NYC". Values not covered by the dictionary keep their original form thanks to the trailing `fillna`.

city_map = {"new york": "NYC", "nyc": "NYC"}
df["city"] = df["city"].str.lower().map(city_map).fillna(df["city"])
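The sketch below separates the two steps of that one-liner on a standalone series, to show what `map` alone produces and what the fallback restores. The extra "Boston" value is an assumption added to exercise the unmapped case.

```python
import pandas as pd

# Illustrative values; "Boston" is deliberately absent from the mapping
city = pd.Series(["New York", "new york", "NYC", "nyc", "Boston"])
city_map = {"new york": "NYC", "nyc": "NYC"}

# Step 1: map() alone leaves unmapped values as missing
mapped_only = city.str.lower().map(city_map)
print(mapped_only.isna().tolist())

# Step 2: fall back to the original value wherever no mapping applied
standardized = mapped_only.fillna(city)
print(standardized.value_counts().to_dict())
```

Without the fallback, every category missing from the dictionary would silently turn into a missing value.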

# 7. Removing Duplicates Intelligently and Flexibly

The key to this highly customizable duplicate removal strategy is the use of `subset=["user_name"]`. In this example, it tells Pandas to deem a row a duplicate by looking only at the `"user_name"` column, checking whether its value is identical to the one in another row. This is a great way to ensure every unique user is represented only once in the dataset, preventing double counting, all in a single instruction.

df = df.drop_duplicates(subset=["user_name"])
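Because the comparison is exact, names that differ only in case, like those in the toy dataset, are not treated as duplicates. The sketch below illustrates this along with the `keep` parameter; the lowercase helper column `name_key` is a hypothetical extra step, not part of the trick above.

```python
import pandas as pd

# Illustrative users whose names differ only by letter case
users = pd.DataFrame({
    "user_name": ["Alice", "bob", "Bob", "alice"],
    "age": [25, 30, 31, 26],
})

# Exact matching: "bob" and "Bob" differ, so nothing is dropped
exact = users.drop_duplicates(subset=["user_name"])
print(len(exact))

# Assumed extra step: build a lowercase key so case variants collide
users["name_key"] = users["user_name"].str.lower()

# keep controls which occurrence survives
first_seen = users.drop_duplicates(subset=["name_key"], keep="first")
last_seen = users.drop_duplicates(subset=["name_key"], keep="last")
print(first_seen.index.tolist(), last_seen.index.tolist())
```

Choosing between `keep="first"` and `keep="last"` matters whenever the duplicated rows disagree on other columns, as the ages do here.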

# 8. Clipping Quantiles for Outlier Removal

The last trick consists of capping extreme values or outliers automatically, instead of removing them entirely. This is especially useful when outliers are assumed to stem from errors manually introduced in the data, for instance. Clipping sets values falling below (or above) two percentiles (1 and 99 in the example) to those percentile values, keeping original values lying between the two percentiles unchanged. In simple terms, it is like keeping overly large or small values within limits.

q_low, q_high = df["income$"].quantile([0.01, 0.99])
df["income$"] = df["income$"].clip(q_low, q_high)
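A small sketch of the same idea on a standalone series follows, so the effect on each value can be inspected. The income figures mirror the non-missing toy values, and the variable names are illustrative.

```python
import pandas as pd

# Non-missing toy incomes, stored as floats
income = pd.Series([50000.0, 60000.0, 1000000.0, 55000.0])

# Percentile bounds are interpolated from the data itself
q_low, q_high = income.quantile([0.01, 0.99])

# Values outside the bounds are pulled to the nearest bound
clipped = income.clip(lower=q_low, upper=q_high)
print(q_low, q_high)
print(clipped.tolist())
```

No rows are dropped: the series keeps its length, and only the smallest and largest values move. With so few rows the bounds sit very close to the extremes themselves, so the capping is mild; on larger datasets the percentiles become more meaningful.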

# Wrapping Up

This article presented eight useful tricks, tips, and strategies to boost your data preprocessing pipelines in Python, making them more efficient, effective, and robust, all at the same time.

Ivan Palomares Carrascosa is a leader, author, speaker, and consultant in AI, machine learning, deep learning, and LLMs. He trains and guides others in using AI in the real world.
