Building a Simple Data Quality DSL in Python


Introduction

Maintaining data validation code in Python is often difficult. Business rules get buried in nested if statements, validation logic gets mixed with error handling, and adding new checks often means sifting through procedural functions to find the right place for the code. Yes, there are data validation frameworks you could use, but here we will focus on building something extremely simple yet useful in plain Python.

Let’s create a simple domain-specific language (DSL): a small vocabulary designed specifically for data validation. Instead of writing generic Python code, you create functions and classes that express validation rules in terms that match the way you think about the problem.

For data validation, this means rules that read like business requirements: “Subscriber age must be between 18 and 120” or “Email address must contain an @ symbol and a valid domain.” The DSL handles checking the data and reporting violations, while you focus on expressing what valid data looks like. The result is validation logic that is readable, easy to maintain and test, and easy to extend. So, let’s start coding!

Link to code on GitHub

Why Build a DSL?

Consider validating customer data with Python:

def validate_customers(df):
    errors = []
    if df['customer_id'].duplicated().any():
        errors.append("Duplicate IDs")
    if (df['age'] < 0).any():
        errors.append("Negative ages")
    if not df['email'].str.contains('@').all():
        errors.append("Invalid emails")
    return errors

This approach hardcodes the validation logic, mixes business rules with error handling, and becomes unwieldy as the rules grow. Instead, we will write a DSL that separates these concerns and creates reusable validation components.

Instead of writing procedural validation functions, a DSL lets you express rules that read like business requirements:

# Traditional approach
if df['age'].min() < 0 or df['age'].max() > 120:
    raise ValueError("Invalid ages found")

# DSL approach  
validator.add_rule(Rule("Valid ages", between('age', 0, 120), "Ages must be 0-120"))

The DSL approach separates what you are validating (business rules) from how violations are handled (error reporting). This makes the validation logic testable, reusable, and readable even by non-programmers.

Creating a Sample Dataset

Start by preparing realistic sample e-commerce customer data containing common quality issues:

import pandas as pd

customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 103, 105],
    'email': ['john@gmail.com', 'invalid-email', '', 'sarah@yahoo.com', 'mike@domain.co'],
    'age': [25, -5, 35, 200, 28],
    'total_spent': [250.50, 1200.00, 0.00, -50.00, 899.99],
    'join_date': ['2023-01-15', '2023-13-45', '2023-02-20', '2023-02-20', '']
}) # Note: 2023-13-45 is an intentionally malformed date.

This dataset contains duplicate customer IDs, invalid email formats, impossible ages, negative spend amounts, and malformed dates. This should work very well for testing validation rules.

Writing the Validation Logic

Creating the Rule Class

Let’s start by writing a simple Rule class that wraps the validation logic:

class Rule:
    def __init__(self, name, condition, error_msg):
        self.name = name
        self.condition = condition
        self.error_msg = error_msg
    
    def check(self, df):
        # The condition function returns True for VALID rows.
        # We use ~ (bitwise NOT) to select the rows that VIOLATE the condition.
        violations = df[~self.condition(df)]
        if not violations.empty:
            return {
                'rule': self.name,
                'message': self.error_msg,
                'violations': len(violations),
                'sample_rows': violations.head(3).index.tolist()
            }
        return None

The condition parameter accepts any function that takes a DataFrame and returns a boolean Series indicating which rows are valid. The tilde operator (~) inverts this boolean Series, identifying the violations. When violations exist, the check method returns detailed information including the rule name, error message, violation count, and sample row indexes for debugging.

This design separates validation logic from error reporting: the condition function focuses solely on the business rule, while the Rule class handles error details consistently.
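
To see this in isolation, here is a quick sketch of a single rule checked directly against the sample DataFrame, without any validator involved (the rule name and condition are purely illustrative):

# A standalone check: the condition is a plain function (DataFrame in, boolean Series out)
emails_present = Rule(
    "Emails present",
    lambda df: df['email'].notna() & (df['email'] != ''),
    "Every customer record needs an email address"
)

result = emails_present.check(customers)
if result:
    print(result['rule'], '->', result['violations'], 'violation(s)')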

Adding Multiple Rules

Next, let’s create a DataValidator class that manages a collection of rules:

class DataValidator:
    def __init__(self):
        self.rules = []
    
    def add_rule(self, rule):
        self.rules.append(rule)
        return self # Enables method chaining
    
    def validate(self, df):
        results = []
        for rule in self.rules:
            violation = rule.check(df)
            if violation:
                results.append(violation)
        return results

The add_rule method returns self to enable method chaining. The validate method executes all rules independently and collects the violation reports, so a single failing rule does not prevent the others from running.
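
Since add_rule returns self, several rules can be registered in one fluent expression. A small sketch (the rule names and conditions here are purely illustrative):

quick_validator = (
    DataValidator()
    .add_rule(Rule("Non-null IDs", lambda df: df['customer_id'].notna(),
                   "Customer ID is required"))
    .add_rule(Rule("Non-negative spend", lambda df: df['total_spent'] >= 0,
                   "Spend cannot be negative"))
)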

Creating Readable Conditions

Remember that when instantiating a Rule, we need to pass a condition function. This can be any function that takes a DataFrame and returns a boolean Series. While simple lambda functions work, they are not very readable, so let’s write helper functions that build a readable validation vocabulary:

def not_null(column):
    return lambda df: df[column].notna()

def unique_values(column):
    return lambda df: ~df.duplicated(subset=[column], keep=False)

def between(column, min_val, max_val):
    return lambda df: df[column].between(min_val, max_val)

Each helper function returns a lambda that works with pandas boolean operations; a short usage example follows the list below.

  • not_null uses pandas’ notna() method to identify non-null values.
  • unique_values uses duplicated(..., keep=False) with a subset parameter to flag all duplicate occurrences, ensuring accurate violation counts.
  • between uses pandas’ between() method, which handles range checking automatically.
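
For example, calling a helper directly on the sample DataFrame shows the boolean Series it produces (True marks valid rows):

mask = between('age', 0, 120)(customers)
print(mask.tolist())     # [True, False, True, False, True]
print(customers[~mask])  # the rows that violate the range check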

For pattern matching, regular expressions are straightforward:

import re

def matches_pattern(column, pattern):
    return lambda df: df[column].str.match(pattern, na=False)

The na=False parameter ensures that missing values are treated as validation failures rather than matches, which is typically the desired behavior for required fields.
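
A quick way to see the na=False behavior is to run matches_pattern on a throwaway Series containing a missing value (this tiny frame exists only for the demonstration):

import numpy as np

demo = pd.DataFrame({'email': ['a@b.com', np.nan]})
check = matches_pattern('email', r'^[^@\s]+@[^@\s]+\.[^@\s]+$')
print(check(demo).tolist())  # [True, False] -- the missing value counts as a failure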

Building a Data Validator for the Sample Dataset

Let’s now create a validator for the customer dataset and see how this DSL works:

validator = DataValidator()

validator.add_rule(Rule(
   "Unique customer IDs", 
   unique_values('customer_id'),
   "Customer IDs must be unique across all records"
))

validator.add_rule(Rule(
   "Valid email format",
   matches_pattern('email', r'^[^@\s]+@[^@\s]+\.[^@\s]+$'),
   "Email addresses must contain @ symbol and domain"
))

validator.add_rule(Rule(
   "Reasonable customer age",
   between('age', 13, 120),
   "Customer age must be between 13 and 120 years"
))

validator.add_rule(Rule(
   "Non-negative spending",
   lambda df: df['total_spent'] >= 0,
   "Total spending amount cannot be negative"
))

Each rule follows the same pattern: a descriptive name, a validation condition, and an error message.

  • The first rule uses the unique_values helper to check for duplicate customer IDs.
  • The second rule applies regular expression pattern matching to validate email formats. The pattern requires at least one character before and after the @ symbol, as well as a domain extension.
  • The third rule uses the between helper for range validation, setting sensible age limits for customers.
  • The last rule uses an inline lambda function to check that total_spent values are non-negative.

Notice how each rule looks almost like a business requirement. The validator collects these rules and can execute them all against any DataFrame with matching column names:

issues = validator.validate(customers)

print("Validation Results:")
for issue in issues:
    print(f"❌ Rule: {issue['rule']}")
    print(f"   Problem: {issue['message']}")
    print(f"   Violations: {issue['violations']}")
    print(f"   Affected rows: {issue['sample_rows']}")
    print()

The output clearly identifies specific problems and their locations in the dataset, making debugging simple. For the sample data, you will get the following output:

Validation Results:
❌ Rule: Unique customer IDs
   Problem: Customer IDs must be unique across all records
   Violations: 2
   Affected rows: [2, 3]

❌ Rule: Valid email format
   Problem: Email addresses must contain @ symbol and domain
   Violations: 3
   Affected rows: [1, 2, 4]

❌ Rule: Reasonable customer age
   Problem: Customer age must be between 13 and 120 years
   Violations: 2
   Affected rows: [1, 3]

❌ Rule: Non-negative spending
   Problem: Total spending amount cannot be negative
   Violations: 1
   Affected rows: [3]

Adding Cross-Column Validation

Real business rules often involve relationships between columns. Custom condition functions handle this more complex validation logic:

def high_spender_email_required(df):
    high_spenders = df['total_spent'] > 500
    has_valid_email = df['email'].str.contains('@', na=False)
    # Passes if: (Not a high spender) OR (Has a valid email)
    return ~high_spenders | has_valid_email

validator.add_rule(Rule(
    "High Spenders Need Valid Email",
    high_spender_email_required,
    "Customers spending over $500 must have valid email addresses"
))

This rule uses boolean logic: high-spending customers must have valid emails, while low-spending customers may have missing contact information. The expression ~high_spenders | has_valid_email translates to “is not a high spender OR has a valid email”, allowing low spenders to pass validation regardless of their email status.
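
Applying the condition function directly to the sample data makes the implication pattern concrete; only a high spender whose email lacks an @ should be flagged:

mask = high_spender_email_required(customers)
print(customers[~mask])  # expected: the row with total_spent of 1200.00 and email 'invalid-email'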

Handling Date Validation

Date validation requires careful handling because date parsing can fail:

def valid_date_format(column, date_format="%Y-%m-%d"):
    def check_dates(df):
        # pd.to_datetime with errors="coerce" turns invalid dates into NaT (Not a Time)
        parsed_dates = pd.to_datetime(df[column], format=date_format, errors="coerce")
        # A row is valid if the original value is not null AND the parsed date is not NaT
        return df[column].notna() & parsed_dates.notna()
    return check_dates

validator.add_rule(Rule(
    "Valid Join Dates",
    valid_date_format('join_date'),
    "Join dates must follow YYYY-MM-DD format"
))

Validation passes only if the original value is not null and the parsed date is valid (i.e., not NaT). There is no need for a try-except block: errors="coerce" in pd.to_datetime converts malformed strings into NaT, which is then caught by parsed_dates.notna().
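
To see the coercion in action, you can parse the sample column directly:

print(pd.to_datetime(customers['join_date'], format="%Y-%m-%d", errors="coerce"))
# '2023-13-45' and the empty string both come back as NaT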

Writing a Decorator Integration Pattern

For production pipelines, you can write a decorator that provides clean integration:

def validate_dataframe(validator):
    def decorator(func):
        def wrapper(df, *args, **kwargs):
            issues = validator.validate(df)
            if issues:
                error_details = [f"{issue['rule']}: {issue['violations']} violations" for issue in issues]
                raise ValueError(f"Data validation failed: {'; '.join(error_details)}")
            return func(df, *args, **kwargs)
        return wrapper
    return decorator

# Assuming 'validator' is the DataValidator instance we built earlier
@validate_dataframe(validator)
def process_customer_data(df):
    return df.groupby('age').agg({'total_spent': 'sum'})

This decorator ensures that data is validated before processing begins, preventing corrupted data from spreading through the pipeline. When validation fails, it raises a descriptive error that summarizes the failing rules and their violation counts. Note that the validator instance must be accessible where the decorated function is defined.
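
With the decorator applied, calling the processing function on the flawed sample data fails fast; a rough sketch of what that looks like:

try:
    summary = process_customer_data(customers)
except ValueError as exc:
    print(exc)  # e.g. "Data validation failed: Unique customer IDs: 2 violations; ..."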

Extending the Pattern

You can extend the DSL to include other validation rules as needed:

# Statistical outlier detection
def within_standard_deviations(column, std_devs=3):
    # Valid if absolute difference from mean is within N standard deviations
    return lambda df: abs(df[column] - df[column].mean()) <= std_devs * df[column].std()

# Referential integrity across datasets
def foreign_key_exists(column, reference_df, reference_column):
    # Valid if value in column is present in the reference_column of the reference_df
    return lambda df: df[column].isin(reference_df[reference_column])

# Custom business logic
def profit_margin_reasonable(df):
    # Ensures 0 <= margin <= 1
    margin = (df['revenue'] - df['cost']) / df['revenue']
    return (margin >= 0) & (margin <= 1)

This way, you can keep expressing validation logic as composable functions that return boolean Series.
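
These extensions plug straight into the same validator. For example, the outlier check can be registered like any other rule (the three-standard-deviation default is just illustrative):

validator.add_rule(Rule(
    "Spending within 3 standard deviations",
    within_standard_deviations('total_spent'),
    "Total spent should not be an extreme outlier"
))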

Here’s an example of how you can use the data validation DSL we created on sample data, assuming the classes and helper functions live in a module called data_quality_dsl:

import pandas as pd
from data_quality_dsl import DataValidator, Rule, unique_values, between, matches_pattern

# Sample data
df = pd.DataFrame({
    'user_id': [1, 2, 2, 3],
    'email': ['user@test.com', 'invalid', 'user@real.com', ''],
    'age': [25, -5, 30, 150]
})

# Build validator
validator = DataValidator()
validator.add_rule(Rule("Unique users", unique_values('user_id'), "User IDs must be unique"))
validator.add_rule(Rule("Valid emails", matches_pattern('email', r'^(^@)+@(^@)+.(^@)+$'), "Invalid email format"))
validator.add_rule(Rule("Reasonable ages", between('age', 0, 120), "Age must be 0-120"))

# Run validation
issues = validator.validate(df)
for issue in issues:
    print(f"❌ {issue('rule')}: {issue('violations')} violations")

Conclusion

This DSL, although simple, works because it matches the way data professionals think about validation. Rules express business logic as easy-to-understand requirements, while pandas provides both performance and flexibility under the hood.

Separation of concerns makes the validation logic testable and maintainable. This approach requires no external dependencies beyond pandas and presents no learning curve for those who are already familiar with pandas operations.
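
Because each condition is just a function from DataFrame to boolean Series, unit tests stay small. A minimal pytest-style sketch, assuming the helpers live in the data_quality_dsl module mentioned above (the tiny DataFrame is made up for the test):

import pandas as pd
from data_quality_dsl import between

def test_between_flags_out_of_range_ages():
    df = pd.DataFrame({'age': [10, 50, 300]})
    mask = between('age', 0, 120)(df)
    assert mask.tolist() == [True, True, False]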

This is something I worked on in the evening over a few coding sprints and several cups of coffee (of course!). But you can use this version as a starting point and create something even fancier. Happy coding!

Bala Priya C is a developer and technical writer from India. She likes to work in the fields of mathematics, programming, data science, and content creation. Her areas of interest and expertise include DevOps, Data Science, and Natural Language Processing. She loves reading, writing, coding, and coffee! Currently, she is learning and sharing her knowledge with the developer community by writing tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.
