# Introduction
Working intensively with data in Python teaches us all an important lesson: data cleaning usually doesn't feel like doing data science, but like acting as a digital janitor. Most projects follow the same pattern: you load a messy dataset, fix inconsistent column names, deal with missing values, and end up with a trail of temporary variables, only the last of which contains your final, clean dataset.
pyjanitor provides a clean approach to these steps. The library builds on the concept of method chaining to transform cumbersome data cleaning processes into pipelines that are elegant, efficient, and readable.
This article breaks down method chaining in the context of pyjanitor and data cleaning.
# Understanding Method Chaining
Method chaining is nothing new in programming: in fact, it is a well-established coding pattern. It involves calling multiple methods on an object in sequence, all in a single statement. This way, you don't need to reassign a variable after each step, because each method returns an object on which the next method is invoked, and so on.
The following example illustrates the concept at its core. Here is how we would apply several simple modifications to a small piece of text (a string) using "standard" Python:
```python
text = " Hello World! "
text = text.strip()
text = text.lower()
text = text.replace("world", "python")
```
The resulting value in `text` will be: `"hello python!"`.
Now, with method chaining, the same process looks like this:
```python
text = " Hello World! "
cleaned_text = text.strip().lower().replace("world", "python")
```
Note that the logical flow of the operations applied goes from left to right: all in a single, unified chain of thought!
If you followed that, you now understand the essence of method chaining. Let us now translate this approach into the context of data science using pandas. A standard multi-step data cleanup on a DataFrame typically looks like this without chaining:
```python
import pandas as pd

# Traditional, step-by-step pandas approach
df = pd.read_csv("data.csv")
df.columns = df.columns.str.lower().str.replace(' ', '_')
df = df.dropna(subset=['id'])
df = df.drop_duplicates()
```
As we will see shortly, method chaining lets us build a unified pipeline in which DataFrame operations are wrapped in a single pair of parentheses. We also no longer need intermediate variables holding non-final DataFrames, which makes for cleaner, more bug-resilient code. And on top of all that, pyjanitor makes this process seamless.
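Even before bringing in pyjanitor, the four steps above can be expressed as one chained pandas pipeline. Here is a minimal, self-contained sketch; the inline CSV and its columns stand in for the hypothetical `data.csv`:

```python
import io

import pandas as pd

# Inline CSV standing in for "data.csv" (illustrative data only)
raw_csv = io.StringIO("ID,Full Name\n1,Alice\n1,Alice\n,Bob\n2,Carol\n")

# The same four steps as above, written as a single pandas chain
df = (
    pd.read_csv(raw_csv)
    .rename(columns=lambda c: c.lower().replace(' ', '_'))  # normalize column names
    .dropna(subset=['id'])    # drop rows with a missing id
    .drop_duplicates()        # remove exact duplicate rows
)
print(df)
```

The chain reads top to bottom as a sequence of cleaning steps, with no throwaway variables in between.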
# Enter Pyjanitor: An Application Example
Pandas itself natively supports method chaining to some extent. However, some of its essential functionality was not designed with this pattern in mind. This was the main inspiration behind pyjanitor, which is based on the R package of almost the same name: janitor.
In short, pyjanitor can be viewed as an extension of pandas that bundles a collection of custom data-cleaning procedures in a chaining-friendly fashion. Examples of its application programming interface (API) method names include clean_names(), rename_column(), remove_empty(), and so on. These intuitive method names take code expressiveness to a whole new level. Furthermore, pyjanitor is entirely open source and free, and runs seamlessly in cloud and notebook environments like Google Colab.
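To get a feel for what a helper like clean_names() does, here is a rough pure-pandas approximation of that style of column-name normalization. This is a sketch of the idea, not pyjanitor's actual implementation, and the function name is made up:

```python
import pandas as pd

def clean_names_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Roughly mimic clean_names(): strip, lowercase, spaces -> underscores."""
    return df.rename(columns=lambda c: c.strip().lower().replace(' ', '_'))

df = pd.DataFrame({'First Name ': ['Alice'], ' Last_Name': ['Smith']})
print(clean_names_sketch(df).columns.tolist())  # ['first_name', 'last_name']
```

The real clean_names() offers more options (handling special characters, underscore stripping, and so on), but the spirit is the same: one chainable call replaces several lines of manual column fiddling.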
Let's see how method chaining works in pyjanitor through an example. We first create a small, synthetic dataset that is intentionally messy and load it into a pandas DataFrame object.
Important: to avoid common, yet somewhat nasty, errors caused by incompatibilities between library versions, make sure you have the latest available versions of both pandas and pyjanitor: run `!pip install --upgrade pyjanitor pandas` first.
```python
import numpy as np
import pandas as pd
import janitor  # registers pyjanitor methods on DataFrame

messy_data = {
    'First Name ': ['Alice', 'Bob', 'Charlie', 'Alice', None],
    ' Last_Name': ['Smith', 'Jones', 'Brown', 'Smith', 'Doe'],
    'Age': [25, np.nan, 30, 25, 40],
    'Date_Of_Birth': ['1998-01-01', '1995-05-05', '1993-08-08', '1998-01-01', '1983-12-12'],
    'Salary ($)': [50000, 60000, 70000, 50000, 80000],
    'Empty_Col': [np.nan, np.nan, np.nan, np.nan, np.nan]
}
df = pd.DataFrame(messy_data)
print("--- Messy Original Data ---")
print(df.head(), "\n")
```
We now define a pyjanitor method chain that applies a series of processing steps to both the column names and the data:
```python
cleaned_df = (
    df
    .rename_column('Salary ($)', 'Salary')  # 1. Manually fix tricky names BEFORE they get mangled
    .clean_names()                          # 2. Standardize everything (makes it 'salary')
    .remove_empty()                         # 3. Drop empty columns/rows
    .drop_duplicates()                      # 4. Remove duplicate rows
    .fill_empty(                            # 5. Impute missing values
        column_names='age',                 # CAUTION: after previous steps, the name is lowercase: 'age'
        value=df['Age'].median()            # Pull the median from the original raw df
    )
    .assign(                                # 6. Create a new column using assign
        salary_k=lambda d: d['salary'] / 1000
    )
)
print("--- Cleaned Pyjanitor Data ---")
print(cleaned_df)
```
The code above is largely self-explanatory, with inline comments describing each method called at each step of the chain.
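When you need a custom step that no built-in pandas or pyjanitor method covers, pandas' .pipe() lets you slot your own function into the chain without breaking it. Here is a small self-contained sketch; the flag_high_earners helper, its threshold, and the data are made up for illustration:

```python
import pandas as pd

def flag_high_earners(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Hypothetical custom step: add a boolean column based on a salary threshold."""
    return df.assign(high_earner=df['salary'] >= threshold)

df = pd.DataFrame({'salary': [50000, 60000, 80000]})

result = (
    df
    .pipe(flag_high_earners, threshold=65000)  # custom function joins the chain via .pipe()
)
print(result['high_earner'].tolist())  # [False, False, True]
```

Because .pipe() passes the DataFrame as the first argument, any function of this shape becomes a first-class step in the pipeline.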
This is the output of our example, comparing the original dirty data to the cleaned version:
```text
--- Messy Original Data ---
  First Name  Last_Name   Age Date_Of_Birth  Salary ($)  Empty_Col
0      Alice      Smith  25.0    1998-01-01       50000        NaN
1        Bob      Jones   NaN    1995-05-05       60000        NaN
2    Charlie      Brown  30.0    1993-08-08       70000        NaN
3      Alice      Smith  25.0    1998-01-01       50000        NaN
4        NaN        Doe  40.0    1983-12-12       80000        NaN

--- Cleaned Pyjanitor Data ---
  first_name_  _last_name   age date_of_birth  salary  salary_k
0       Alice       Smith  25.0    1998-01-01   50000      50.0
1         Bob       Jones  27.5    1995-05-05   60000      60.0
2     Charlie       Brown  30.0    1993-08-08   70000      70.0
4         NaN         Doe  40.0    1983-12-12   80000      80.0
```
# Wrapping Up
Throughout this article, we have learned how to use the pyjanitor library to implement method chaining and simplify otherwise tedious data cleaning processes. This makes the code clean, expressive, and, in a manner of speaking, self-documenting, so that other developers or your future self can read the pipeline and easily understand the journey from raw to finished dataset.
Nice work!
Iván Palomares Carrascosa is a leader, author, speaker, and consultant in AI, machine learning, deep learning, and LLMs. He trains and guides others in using AI in the real world.