The “Robust” Data Scientist: Winning with Messy Data and Pingouin


# Introduction

A bitter truth to get started: textbook data science usually turns out to be a lie in the real world. Concepts and techniques are taught on finely curated, beautifully bell-shaped variables, but as soon as we venture into the jungle of real projects, we are hit with lots of outliers, unreasonably skewed distributions, and unequal variances.

A previous article on building an exploratory data analysis (EDA) pipeline with Pingouin showed how to detect, through statistical tests, cases where data violate assumptions such as homoscedasticity and normality. But what if those tests fail? Throwing away data is not the solution: becoming robust is the solution.

This article highlights the craft of using robust statistics in data science workflows. These are statistical methods specifically designed to yield reliable, valid results even when the data do not meet classical assumptions or are rife with outliers and noise. Taking a “choose your own adventure” approach, we’ll walk through a trio of scenarios using Python’s Pingouin library to handle the worst aspects of the data you’ll encounter in your daily work.

# Initial Setup

Let’s start by installing (if necessary) and importing Pingouin and Pandas, after which we will load the wine quality dataset available here.

!pip install pingouin pandas

import pandas as pd
import pingouin as pg

# Loading our messy, real-world-like dataset, containing red and white wine samples
url = "https://raw.githubusercontent.com/gakudo-ai/open-datasets/refs/heads/main/wine-quality-white-and-red.csv"
df = pd.read_csv(url)

# Take a small peek at what we are about to deal with
df.head()

If you’ve read the previous Pingouin article, you already know that this is a notoriously messy dataset that fails to meet many common assumptions. We will now embark on three different “adventures”, each highlighting a scenario, a core problem, and a proposed robust solution to address it.

# Adventure 1: When the Normality Test Fails

Suppose we run a normality test on two groups: white wine samples and red wine samples.

white_wine_alcohol = df[df['type'] == 'white']['alcohol']
red_wine_alcohol = df[df['type'] == 'red']['alcohol']

print("Normality test for White Wine Alcohol content:")
print(pg.normality(white_wine_alcohol))
print("\nNormality test for Red Wine Alcohol content:")
print(pg.normality(red_wine_alcohol))

You will find that both distributions yield extremely low p-values, meaning neither is normal. Although non-normality does not directly prove the presence of outliers or skewness, a strong deviation from normality often suggests that such characteristics are present in the data. In this situation, a comparison via the t-test would be risky and likely to yield unreliable results.

The robust solution for this kind of scenario is the Mann-Whitney U test. Instead of comparing means, it compares ranks: for example, ordering all the wines in a group from lowest to highest alcohol content. This rank-based approach is the master trick that strips outliers of their sometimes alarming magnitude. Like this:

# Separating our two groups
red_wine = df[df['type'] == 'red']['alcohol']
white_wine = df[df['type'] == 'white']['alcohol']

# Running the robust Mann-Whitney U test
mwu_results = pg.mwu(x=red_wine, y=white_wine)
print(mwu_results)

Output:

         U-val alternative     p-val       RBC      CLES
MWU  3829043.5   two-sided  0.181845 -0.022193  0.488903

Since the p-value is not below 0.05, there is no statistically significant difference in alcohol content between the two wine types; and because the test is rank-based, this conclusion is robust to both outliers and skewness.

# Adventure 2: When the Paired t-Test Fails

Let’s say you now want to compare two measurements taken from the same subject: for example, a patient’s blood sugar before and after a drug prototype, or two properties measured in the same bottle of wine. The focus here is on how the differences between paired measurements are distributed. When those differences are not normally distributed, a standard paired t-test will yield unreliable confidence intervals.

The ideal solution in this scenario is the Wilcoxon signed-rank test: the robust sibling of the paired t-test, which works by taking the differences between the two columns and ranking their absolute values. In Pingouin this test is available as pg.wilcoxon(), to which we pass two columns containing paired measurements on the same subjects, for example two types of wine acidity.

# Run the robust Wilcoxon signed-rank test for paired data
wilcoxon_results = pg.wilcoxon(x=df['fixed acidity'], y=df['volatile acidity'])
print(wilcoxon_results)

Result:

          W-val alternative  p-val  RBC  CLES
Wilcoxon    0.0   two-sided    0.0  1.0   1.0

The above result shows a statistically significant difference, or “complete separation”, between the two measurements. Not only are the two wine attributes different, but they also operate at completely different levels of magnitude across the dataset.

# Adventure 3: When ANOVA Fails

In this third and final adventure, we want to test whether residual sugar levels in wines vary significantly across quality ratings. Note that the ratings range from 3 to 9, take integer values only, and can therefore be treated as separate categories.

For example, if Pingouin’s Levene test dramatically rejects homoscedasticity (because the variance in sugar is very large among mediocre wines but very small among top-quality wines), then a classical one-way ANOVA may produce misleading results, because that test assumes equal variances between groups.

The fix is Welch’s ANOVA, which down-weights groups with high variance, thereby balancing the scales across categories and making comparisons fair. Here’s how to run this robust alternative to traditional ANOVA using Pingouin:

# Run Welch's ANOVA to compare sugar across quality ratings
welch_results = pg.welch_anova(data=df, dv='residual sugar', between='quality')
print(welch_results)

Result:

    Source  ddof1      ddof2          F         p-unc       np2
0  quality      6  54.507934  10.918282  5.937951e-08  0.008353

Even where one-way ANOVA might have struggled due to unequal variances, Welch’s ANOVA gives a solid conclusion. The very small p-value is clear evidence that residual sugar levels vary significantly across wine quality ratings. However, keep in mind that sugar is only a small part of the puzzle affecting wine quality – a point underlined by the low eta-squared value of 0.008.

# Wrapping Up

Through three example scenarios, each pairing a messy-data problem with a robust statistical strategy, we learned that being a skilled data scientist doesn’t mean having perfect data or tuning it to perfection: it means knowing what to do when the data gets tricky for different reasons. Pingouin provides a variety of robust tests that help avoid the failed-assumptions trap and obtain statistically sound insights with little extra effort.

Iván Palomares Carrascosa is a leader, author, speaker, and consultant in AI, machine learning, deep learning, and LLMs. He trains and guides others in harnessing AI in the real world.
