# Introduction
Working intensively with data in Python teaches us all an important lesson: data cleaning usually doesn't feel like doing data science, but like acting as a digital janitor. Most projects follow the same pattern: you load a messy dataset, fix inconsistent column names, deal with missing values, and end up with a trail of temporary variables, only the last of which contains your final, clean dataset.
pyjanitor provides a clean approach to these steps. The library builds on the concept of method chaining to transform cumbersome data cleaning processes into pipelines that are elegant, efficient, and readable.
This article breaks down method chaining in the context of pyjanitor and data cleaning.
# Understanding Method Chaining
Method chaining is nothing new in programming: in fact, it is a well-established coding pattern. It involves calling multiple methods on an object in sequence, all in a single statement. This way, you don't need to reassign a variable after each step, because each method returns an object on which the next method is invoked, and so on.
The following example illustrates the concept at its core. Here is how we would apply several simple modifications to a small piece of text (a string) using "standard" Python:
```python
text = " Hello World! "
text = text.strip()
text = text.lower()
text = text.replace("world", "python")
```
The resulting value in `text` will be: `"hello python!"`.
Now, with method chaining, the same process looks like this:
```python
text = " Hello World! "
cleaned_text = text.strip().lower().replace("world", "python")
```
Note that the logical flow of the operations applied goes from left to right: all in a single, unified chain of thought!
If you followed that, you now understand the essence of method chaining. Let us now translate this approach into the context of data science using pandas. A standard multi-step data cleanup on a DataFrame typically looks like this without chaining:
```python
import pandas as pd

# Traditional, step-by-step pandas approach
df = pd.read_csv("data.csv")
df.columns = df.columns.str.lower().str.replace(' ', '_')
df = df.dropna(subset=['id'])
df = df.drop_duplicates()
```
As we will see shortly, method chaining lets us build a unified pipeline in which DataFrame operations are wrapped in a single pair of parentheses. We also no longer need intermediate variables holding non-final DataFrames, which makes for cleaner, more bug-resilient code. And on top of all that, pyjanitor makes this process seamless.
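Even before bringing in pyjanitor, the four steps above can be expressed as one chained pandas pipeline. Here is a minimal, self-contained sketch; the inline CSV and its columns stand in for the hypothetical `data.csv`:

```python
import io

import pandas as pd

# Inline CSV standing in for "data.csv" (illustrative data only)
raw_csv = io.StringIO("ID,Full Name\n1,Alice\n1,Alice\n,Bob\n2,Carol\n")

# The same four steps as above, written as a single pandas chain
df = (
    pd.read_csv(raw_csv)
    .rename(columns=lambda c: c.lower().replace(' ', '_'))  # normalize column names
    .dropna(subset=['id'])    # drop rows with a missing id
    .drop_duplicates()        # remove exact duplicate rows
)
print(df)
```

The chain reads top to bottom as a sequence of cleaning steps, with no throwaway variables in between.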
# Enter Pyjanitor: An Application Example
Pandas itself natively supports method chaining to some extent. However, some of its essential functionality was not designed with this pattern in mind. This was the main inspiration behind pyjanitor, which is based on the R package of almost the same name: janitor.
In short, pyjanitor can be viewed as an extension of pandas that bundles a collection of custom data-cleaning procedures in a chaining-friendly fashion. Examples of its application programming interface (API) method names include clean_names(), rename_column(), remove_empty(), and so on. These intuitive method names take code expressiveness to a whole new level. Furthermore, pyjanitor is entirely open source and free, and runs seamlessly in cloud and notebook environments like Google Colab.
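To get a feel for what a helper like clean_names() does, here is a rough pure-pandas approximation of that style of column-name normalization. This is a sketch of the idea, not pyjanitor's actual implementation, and the function name is made up:

```python
import pandas as pd

def clean_names_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Roughly mimic clean_names(): strip, lowercase, spaces -> underscores."""
    return df.rename(columns=lambda c: c.strip().lower().replace(' ', '_'))

df = pd.DataFrame({'First Name ': ['Alice'], ' Last_Name': ['Smith']})
print(clean_names_sketch(df).columns.tolist())  # ['first_name', 'last_name']
```

The real clean_names() offers more options (handling special characters, underscore stripping, and so on), but the spirit is the same: one chainable call replaces several lines of manual column fiddling.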
Let's see how method chaining works in pyjanitor through an example. We first create a small, synthetic dataset that is intentionally messy and load it into a pandas DataFrame object.
Important: to avoid common, yet somewhat nasty, errors caused by incompatibilities between library versions, make sure you have the latest available versions of both pandas and pyjanitor: run `!pip install --upgrade pyjanitor pandas` first.
```python
import numpy as np
import pandas as pd
import janitor  # registers pyjanitor methods on DataFrame

messy_data = {
    'First Name ': ['Alice', 'Bob', 'Charlie', 'Alice', None],
    ' Last_Name': ['Smith', 'Jones', 'Brown', 'Smith', 'Doe'],
    'Age': [25, np.nan, 30, 25, 40],
    'Date_Of_Birth': ['1998-01-01', '1995-05-05', '1993-08-08', '1998-01-01', '1983-12-12'],
    'Salary ($)': [50000, 60000, 70000, 50000, 80000],
    'Empty_Col': [np.nan, np.nan, np.nan, np.nan, np.nan]
}
df = pd.DataFrame(messy_data)
print("--- Messy Original Data ---")
print(df.head(), "\n")
```
We now define a pyjanitor method chain that applies a series of processing steps to both the column names and the data:
```python
cleaned_df = (
    df
    .rename_column('Salary ($)', 'Salary')  # 1. Manually fix tricky names BEFORE they get mangled
    .clean_names()                          # 2. Standardize everything (makes it 'salary')
    .remove_empty()                         # 3. Drop empty columns/rows
    .drop_duplicates()                      # 4. Remove duplicate rows
    .fill_empty(                            # 5. Impute missing values
        column_names='age',                 # CAUTION: after previous steps, the name is lowercase: 'age'
        value=df['Age'].median()            # Pull the median from the original raw df
    )
    .assign(                                # 6. Create a new column using assign
        salary_k=lambda d: d['salary'] / 1000
    )
)
print("--- Cleaned Pyjanitor Data ---")
print(cleaned_df)
```
The code above is largely self-explanatory, with inline comments describing each method called at each step of the chain.
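When you need a custom step that no built-in pandas or pyjanitor method covers, pandas' .pipe() lets you slot your own function into the chain without breaking it. Here is a small self-contained sketch; the flag_high_earners helper, its threshold, and the data are made up for illustration:

```python
import pandas as pd

def flag_high_earners(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Hypothetical custom step: add a boolean column based on a salary threshold."""
    return df.assign(high_earner=df['salary'] >= threshold)

df = pd.DataFrame({'salary': [50000, 60000, 80000]})

result = (
    df
    .pipe(flag_high_earners, threshold=65000)  # custom function joins the chain via .pipe()
)
print(result['high_earner'].tolist())  # [False, False, True]
```

Because .pipe() passes the DataFrame as the first argument, any function of this shape becomes a first-class step in the pipeline.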
This is the output of our example, comparing the original dirty data to the cleaned version:
```text
--- Messy Original Data ---
  First Name  Last_Name   Age Date_Of_Birth  Salary ($)  Empty_Col
0      Alice      Smith  25.0    1998-01-01       50000        NaN
1        Bob      Jones   NaN    1995-05-05       60000        NaN
2    Charlie      Brown  30.0    1993-08-08       70000        NaN
3      Alice      Smith  25.0    1998-01-01       50000        NaN
4        NaN        Doe  40.0    1983-12-12       80000        NaN

--- Cleaned Pyjanitor Data ---
  first_name_  _last_name   age date_of_birth  salary  salary_k
0       Alice       Smith  25.0    1998-01-01   50000      50.0
1         Bob       Jones  27.5    1995-05-05   60000      60.0
2     Charlie       Brown  30.0    1993-08-08   70000      70.0
4         NaN         Doe  40.0    1983-12-12   80000      80.0
```
# Wrapping Up
Throughout this article, we have learned how to use the pyjanitor library to implement method chaining and simplify otherwise tedious data cleaning processes. This makes the code clean, expressive, and, in a manner of speaking, self-documenting, so that other developers or your future self can read the pipeline and easily understand the journey from raw to finished dataset.
Nice work!
Iván Palomares Carrascosa is a leader, author, speaker, and consultant in AI, machine learning, deep learning, and LLMs. He trains and guides others in using AI in the real world.