# Introduction
Data quality problems are everywhere. Missing values where there should be none. Dates in the wrong format. Duplicate records that slip through. Outliers that distort your analysis. Text fields with inconsistent capitalization and spelling variations. These issues can break your analytics and pipelines, and they often lead to wrong business decisions.
Manual data validation is tedious. You end up investigating the same problems across multiple datasets, and it is easy to miss subtle issues. This article covers five practical Python scripts that handle the most common data quality problems.
# 1. Analyzing missing data
// pain point
You receive a dataset with expected complete records, but with empty cells, null values, empty strings, and placeholder text such as “N/A” or “Unknown” scattered throughout. Some columns are mostly empty, others have just a few gaps. Before you can fix it, you need to understand the extent of the problem.
// what does the script do
Scans the dataset for missing data in all its forms. Identifies patterns in missingness (random vs. systematic), calculates a completeness score for each column, and flags columns with excessive missing data. It also produces visual reports that show where your data is missing.
// how it works
The script reads data from CSV, Excel, or JSON files, detecting various representations of missing values such as None, NaN, empty strings, and common placeholders. It then calculates missing data percentages on a column and row basis, identifying correlations between missing values across columns. Finally, it produces both summary statistics and detailed reports with recommendations for dealing with each type of missingness.
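The core idea can be sketched in a few lines of pandas. This is a minimal illustration, not the full script: the `PLACEHOLDERS` set and the `missing_report` helper below are hypothetical names chosen for the example, and the placeholder list would need to be tuned to your data.

```python
import pandas as pd
import numpy as np

# Hypothetical set of placeholder strings treated as missing, in addition to real NaNs
PLACEHOLDERS = ["", "N/A", "n/a", "NA", "Unknown", "unknown", "null"]

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    # Normalize placeholder strings to NaN so they count as missing
    normalized = df.replace(PLACEHOLDERS, np.nan)
    missing = normalized.isna()
    return pd.DataFrame({
        "missing_count": missing.sum(),
        "missing_pct": (missing.mean() * 100).round(1),
        "completeness": ((1 - missing.mean()) * 100).round(1),
    })

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "email": ["a@x.com", "N/A", None, "d@x.com"],
})
print(missing_report(df))
```

A per-row version is the same pattern with `missing.sum(axis=1)`; correlations between missing-value indicators can be computed with `missing.astype(int).corr()`.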
⏩ Get Missing Data Analyzer Script
# 2. Validating data types
// pain point
Your dataset claims to have numeric IDs, but some are text. Date fields contain dates, times, or sometimes just random strings. The email column includes entries that are not valid email addresses. These discrepancies cause scripts to crash or produce incorrect calculations.
// what does the script do
Verifies that each column has the expected data type. Checks numeric columns for non-numeric values, date columns for invalid dates, email and URL columns for proper formatting, and categorical columns for unexpected values. The script also provides detailed reports on type violations with line numbers and examples.
// how it works
The script accepts a schema definition specifying the expected types for each column, uses regex patterns and validation libraries to check format compliance, identifies and reports rows that violate type expectations, calculates violation rates per column, and suggests appropriate data type conversion or cleanup steps.
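A stripped-down version of the schema-driven approach might look like this. The `SCHEMA` dictionary and `validate` function are illustrative stand-ins, assuming validators expressed as per-value predicates; the real script would use a richer schema format and more robust validation libraries.

```python
import re
import pandas as pd

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Hypothetical schema: column name -> predicate returning True for valid values
SCHEMA = {
    "user_id": lambda v: str(v).isdigit(),
    "email": lambda v: bool(EMAIL_RE.match(str(v))),
}

def validate(df: pd.DataFrame, schema=SCHEMA) -> list[dict]:
    violations = []
    for col, check in schema.items():
        for idx, value in df[col].items():
            if not check(value):
                # Record row index, column, and offending value for the report
                violations.append({"row": idx, "column": col, "value": value})
    return violations

df = pd.DataFrame({
    "user_id": ["101", "abc", "103"],
    "email": ["a@x.com", "b@x.com", "not-an-email"],
})
print(validate(df))
```

Violation rates per column follow directly by dividing the count of violations per column by `len(df)`.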
⏩ Get Data Type Validator Script
# 3. Detecting duplicate records
// pain point
There should be unique records in your database, but duplicate entries keep appearing. Sometimes they are exact duplicates, sometimes only some fields match. It may be the same customer with a slightly different name spelling, or the transaction may have been submitted twice by mistake. Finding these manually is extremely challenging.
// what does the script do
Identifies duplicate and near-duplicate records using multiple strategies: exact matching, fuzzy matching based on a similarity threshold, and duplicate detection within specific column combinations. It groups similar records together and calculates confidence scores for possible matches.
// how it works
The script uses hash-based exact matching to detect exact duplicates and fuzzy string matching algorithms such as Levenshtein distance for near-duplicates. It allows you to specify key columns for partial matching, generates duplicate clusters with similarity scores, and exports detailed reports showing all possible duplicates with recommendations for deduplication.
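The fuzzy-matching step can be sketched with the standard library's `difflib.SequenceMatcher` as the similarity measure (swapped in here for Levenshtein distance, which needs a third-party package). The `near_duplicates` helper is a hypothetical name, and the pairwise loop is O(n²), so this is only suitable as an illustration on small data.

```python
from difflib import SequenceMatcher
import pandas as pd

def near_duplicates(df: pd.DataFrame, column: str, threshold: float = 0.85):
    # Compare every pair of values in the column; return (i, j, score) tuples
    values = df[column].astype(str).tolist()
    pairs = []
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            score = SequenceMatcher(None, values[i].lower(), values[j].lower()).ratio()
            if score >= threshold:
                pairs.append((i, j, round(score, 2)))
    return pairs

df = pd.DataFrame({"name": ["Jon Smith", "John Smith", "Alice Lee"]})
print(near_duplicates(df, "name"))
```

Exact duplicates, by contrast, are cheap to find with `df.duplicated(subset=key_columns)`.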
⏩ Get Duplicate Record Detector Script
# 4. Detecting outliers
// pain point
The results of your analysis appear incorrect. You search and find that someone entered 999 for the age, the transaction amount is negative when it should be positive, or the measurement is three orders of magnitude larger than the rest. Outliers skew statistics, break models, and are often difficult to identify in large datasets.
// what does the script do
Automatically detects statistical outliers using several methods: z-score analysis, the interquartile range (IQR) method, and domain-specific rules. Identifies extreme values, impossible values, and values that fall outside expected limits. Provides context for each outlier and suggests whether it is likely an error or a legitimate extreme value.
// how it works
The script analyzes numeric columns using configurable statistical thresholds, applies domain-specific validation rules, visualizes distributions with outliers highlighted, calculates outlier scores and confidence levels, and produces prioritized reports flagging the most likely data errors first.
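The IQR method mentioned above can be shown in a few lines. This is a minimal sketch with a hypothetical `iqr_outliers` helper; the real script would combine it with z-scores and domain rules.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    # Values outside [Q1 - k*IQR, Q3 + k*IQR] are flagged as outliers
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

ages = pd.Series([25, 31, 28, 45, 37, 29, 999])
print(iqr_outliers(ages))  # flags the impossible age of 999
```

The multiplier `k` is the configurable threshold: 1.5 is the conventional default, while 3.0 flags only extreme outliers.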
⏩ Get Outlier Detector Script
# 5. Checking cross-field consistency
// pain point
Individual fields appear fine, but relationships between fields are broken. Start dates after end dates. Shipping addresses in a different country than the billing address's country code. Child records without associated parent records. Order totals that do not match the line item totals. These logical inconsistencies are difficult to spot and highly damaging.
// what does the script do
Validates logical relationships between fields based on business rules. Checks temporal consistency, referential integrity, mathematical relationships, and custom business logic. Flags violations with specific details about what is inconsistent.
// how it works
The script accepts a rule definition file that specifies the relationships to validate, evaluates conditional logic and cross-field comparisons, performs lookups to verify referential integrity, calculates derived values and compares stored values, and generates detailed violation reports with row references and specific rule failures.
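A simplified version of the rule-evaluation step might look like the following. The `RULES` dictionary and `check_rules` function are illustrative assumptions, with rules expressed as per-row predicates; the actual script would load rules from a definition file and add referential-integrity lookups.

```python
import pandas as pd

# Hypothetical rules: rule name -> predicate that must hold for each row
RULES = {
    "start_before_end": lambda r: r["start_date"] <= r["end_date"],
    "total_matches_items": lambda r: abs(r["order_total"] - r["items_total"]) < 0.01,
}

def check_rules(df: pd.DataFrame, rules=RULES) -> list[dict]:
    violations = []
    for idx, row in df.iterrows():
        for name, predicate in rules.items():
            if not predicate(row):
                violations.append({"row": idx, "rule": name})
    return violations

df = pd.DataFrame({
    "start_date": pd.to_datetime(["2024-01-01", "2024-05-10"]),
    "end_date": pd.to_datetime(["2024-02-01", "2024-04-01"]),
    "order_total": [100.0, 59.99],
    "items_total": [100.0, 49.99],
})
print(check_rules(df))
```

The comparison rule for totals uses a small tolerance rather than `==`, since floating-point arithmetic makes exact equality unreliable for monetary sums.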
⏩ Get Cross-Field Consistency Checker Script
# wrapping up
These five scripts help you catch data quality issues early, before they harm your analyses or systems. Data validation should be automated, comprehensive, and fast, and these scripts help with exactly that.
So how do you get started? Download the script that solves your biggest data quality issue and install the required dependencies. Next, configure validation rules for your specific data and run the script on a sample dataset to verify the setup. Then, integrate it into your data pipeline to catch issues automatically.
Clean data is the foundation of everything else. Start validating systematically, and you'll spend less time fixing problems. Good luck!
Bala Priya C is a developer and technical writer from India. She likes to work at the intersection of mathematics, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She loves reading, writing, coding, and coffee! Currently, she is learning and sharing her knowledge with the developer community by writing tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.
