We used 5 external validation methods on a real dataset: they disagreed on 96% of flagged samples

by ai-intensify


Image by author

# Introduction

Every data science tutorial makes outlier detection look easy: remove all values greater than three standard deviations from the mean, and that's it. But once you start working with real datasets, where distributions are skewed and a stakeholder asks, "Why did you remove that data point?", you suddenly realize you don't have a good answer.

So we ran an experiment. We tested five of the most commonly used outlier detection methods on a real dataset (6,497 Portuguese wines) to find out: Do these methods give consistent results?

They didn't. What we learned from the disagreements proved more valuable than anything we learned from a textbook.

Outlier detection methods
Image by author

We've created this analysis as an interactive notebook, a format you can use for your own experiments via Data Projects on StratScratch. You can view and run the full code there.

# The Setup

Our data comes from the Wine Quality dataset, publicly available through the UCI Machine Learning Repository. It includes physicochemical measurements of 6,497 Portuguese "vinho verde" wines (1,599 red, 4,898 white), along with quality ratings from expert tasters.

We chose it for several reasons. It is production data, not something artificially generated. The distributions are skewed (6 of the 11 features have skewness > 1), so the data do not meet textbook assumptions. And the quality ratings let us examine whether detected "outliers" cluster at unusual ratings.

Below are the five methods we tested:

Outlier detection methods

# The First Surprise: Inflated Results from Multiple Testing

Before we could compare the methods, we hit a wall. With 11 features, the naive approach (marking a sample based on an extreme value in at least one feature) produced highly inflated results.

The IQR method flagged approximately 23% of the wines as outliers. The Z-score method flagged around 26%.

When approximately 1 in 4 wines is flagged as an outlier, something is wrong. Real datasets do not contain 25% outliers. The problem was that we were testing all 11 features independently, which inflated the results.

The math is simple. If each feature independently has a 5% chance of showing a "random" extreme value, then with 11 independent features:

\( P(\text{at least one extreme}) = 1 - (0.95)^{11} \approx 43\% \)

In plain words: even if every sample were completely normal, you would expect about 43% of your samples to show at least one extreme value by random chance alone.
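The inflation is easy to verify numerically (a minimal sketch of the formula above, using the illustrative 5% per-feature rate):

```python
# Probability that at least one of 11 independent features
# shows a "random" extreme, given a 5% per-feature rate
p_extreme = 0.05
n_features = 11

p_any = 1 - (1 - p_extreme) ** n_features
print(f"{p_any:.1%}")  # → 43.1%
```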

To fix this, we changed the requirement: Mark a sample only if at least 2 features are simultaneously extreme.

Outlier detection methods
Changing min_features from 1 to 2 changed the definition from "the sample is extreme in at least one feature" to "the sample is extreme in more than one feature".

Here is the improved code:

# Count how many features are extreme for each sample
# (z_scores: samples × features array of robust Z-scores)
outlier_counts = (np.abs(z_scores) > 3.5).sum(axis=1)
# Flag only samples that are extreme in at least 2 features
outliers = outlier_counts >= 2

# Comparing 5 methods on 1 dataset

Once the multi-test solution was implemented, we counted how many samples each method marked:

Outlier detection methods
Here's how we set up the ML methods:

from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

iforest = IsolationForest(contamination=0.05, random_state=42)
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)

Why do all the ML methods flag exactly 5%? Because of the contamination parameter. It requires them to mark exactly that percentage. It's a quota, not a threshold. In other words, Isolation Forest will flag 5% whether your data contains 1% or 20% true outliers.
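You can see the quota behavior on pure noise (a sketch with synthetic Gaussian data, not the wine dataset): even with no true outliers at all, the flagged fraction tracks the contamination setting.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))  # pure Gaussian noise: no "true" outliers

iforest = IsolationForest(contamination=0.05, random_state=42)
labels = iforest.fit_predict(X)  # -1 = outlier, 1 = inlier

print((labels == -1).mean())  # ~0.05, dictated by the quota
```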

# Discovering real differences: they identify different things

Here's what surprised us most. When we measured how much the methods agreed, the Jaccard similarity ranged from 0.10 to 0.30. That is poor agreement.

Of 6,497 wines:

  • Only 32 samples (0.5%) were marked by all 4 primary methods
  • 143 samples (2.2%) were identified by 3+ methods
  • The remaining "outliers" were flagged by only 1 or 2 methods.
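For reference, the Jaccard similarity between two methods' boolean outlier masks is just intersection over union (a minimal helper; the masks below are made-up toy data, not our results):

```python
import numpy as np

def jaccard(a, b):
    """Jaccard similarity of two boolean outlier masks."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = (a | b).sum()
    return (a & b).sum() / union if union else 1.0

# Toy masks: two methods that overlap on a single sample
m1 = np.array([True, True, False, False])
m2 = np.array([True, False, True, False])
print(jaccard(m1, m2))  # 1 shared flag / 3 total flags ≈ 0.33
```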

You might think this is a bug, but that’s the point. Each method has its own definition of “abnormal”:

Outlier detection methods
If a wine has far more residual sugar than average, it is a global outlier (the Z-score/IQR will catch it). But if it sits among other wines with similarly high sugar levels, LOF will not flag it: it is normal in its local context.
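This global-versus-local distinction is easy to reproduce with synthetic data (a sketch assuming a made-up dense "dessert wine" cluster, not the actual dataset):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
typical = rng.normal(6, 1, size=(200, 1))  # typical residual sugar
sweet = rng.normal(45, 1, size=(20, 1))    # dense cluster of very sweet wines
X = np.vstack([typical, sweet])

# Global view: a robust Z-score flags the entire sweet cluster
med = np.median(X)
mad = np.median(np.abs(X - med))
robust_z = 0.6745 * (X - med) / mad
print((np.abs(robust_z[200:]) > 3.5).mean())  # ~1.0: all flagged

# Local view: LOF sees the cluster as its own dense neighborhood
lof = LocalOutlierFactor(n_neighbors=10)
labels = lof.fit_predict(X)
print((labels[200:] == -1).mean())  # near 0: mostly unflagged
```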

So the real question is not “Which method is best?” It’s a question of “What kind of unusual am I looking for?”

# Sanity check: Are outliers related to wine quality?

The dataset includes expert quality ratings (3-9). We wanted to know: do detected outliers appear more frequently in wines with extreme quality ratings?

Outlier detection methods
Extreme-quality wines were twice as likely to be unanimous outliers. This is a good sanity check. In some cases, the connection is clear: wines with too much volatile acidity taste like vinegar, receive poor ratings, and get flagged as outliers. The chemistry drives both outcomes. But we cannot assume that it explains every case. There may be patterns we aren't seeing, or confounding factors we haven't noticed.

# Making three decisions that shaped our results

Outlier detection methods

// 1. Using a Robust Z-Score instead of the Standard Z-Score

A standard Z-score uses the mean and standard deviation of the data, both of which are distorted by the very outliers we are trying to detect. A robust Z-score instead uses the median and median absolute deviation (MAD), neither of which is affected by outliers.

As a result, the standard Z-score identified 0.8% of the data as outliers, while the robust Z-score identified 3.5%.

# Robust Z-score using median and MAD
median = np.median(data, axis=0)
mad = np.median(np.abs(data - median), axis=0)
# 0.6745 scales the MAD to match the standard deviation for normal data
robust_z = 0.6745 * (data - median) / mad
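The difference is easy to demonstrate on synthetic skewed data (a sketch using a lognormal sample, not the wine features themselves): the long tail inflates the mean and standard deviation, so the standard Z-score flags fewer points than the robust one.

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # heavily right-skewed

# Standard Z-score: mean and std are pulled upward by the long tail
standard_z = (data - data.mean()) / data.std()

# Robust Z-score: median and MAD ignore the tail
med = np.median(data)
mad = np.median(np.abs(data - med))
robust_z = 0.6745 * (data - med) / mad

print((np.abs(standard_z) > 3.5).mean())  # smaller fraction flagged
print((np.abs(robust_z) > 3.5).mean())    # larger fraction flagged
```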

// 2. Scaling red and white wines separately

Red and white wines have different baseline chemistry. When they are combined into a single dataset, a red wine with exactly average chemistry relative to other reds could be flagged as an outlier purely because its sulfur levels differ from the combined red-and-white average. So we scaled each wine type separately, using each type's median and interquartile range (IQR), and then recombined them.

# Scale each wine type separately, then recombine
from sklearn.preprocessing import RobustScaler
scaled_parts = []
for wine_type in ('red', 'white'):
    subset = df.loc[df['type'] == wine_type, features]
    scaled_parts.append(RobustScaler().fit_transform(subset))

// 3. Knowing when to exclude a method

The elliptic envelope assumes your data follows a multivariate normal distribution. Ours did not: six of the eleven features had skewness above 1, and one reached 5.4. We kept the elliptic envelope in the final comparison but excluded it from the consensus vote.

# Determining which method performs best for this wine dataset

Outlier detection methods
Image by author

Can we pick a "winner" given the characteristics of our data (strong skewness, a mixed red/white population, no known ground truth)?

Robust Z-score, IQR, Isolation Forest, and LOF all handle skewed data reasonably well. If forced to choose one, we’d go with Isolation Forest: no distribution assumptions, considers all features at once, and behaves decently with mixed populations.

But no one method does everything:

  • Isolation Forest may miss outliers that are extreme on only one feature (Z-score/IQR catches them)
  • Z-score/IQR may miss outliers that are only unusual across several features at once (multidimensional outliers)

A better approach: use multiple methods and rely on consensus. The 143 wines flagged by 3 or more methods are far more reliable than anything flagged by any one method alone.

Here’s how we calculated the consensus:

# Count how many methods flagged each sample (boolean masks)
consensus = zscore_out + iqr_out + iforest_out + lof_out
high_confidence = df[consensus >= 3]  # flagged by 3+ methods

Without ground truth (the situation in most real-world projects), agreement between methods is the closest thing you have to a measure of confidence.

# Understanding what this means for your own projects

Define your problem before choosing your method. What kind of “unusual” are you really looking for? Data entry errors look different from measurement discrepancies, and both look different from actual rare cases. The type of problem suggests different approaches.

Check your assumptions. If your data is heavily skewed, the standard Z-score and elliptic envelope will steer you in the wrong direction. Look at your distributions before committing to a method.

Use multiple methods. Samples flagged by three or more methods with different definitions of "outlier" are more reliable than samples flagged by only one.

Don't assume that all outliers should be removed. An outlier may be an error. It may also be your most interesting data point. Domain knowledge makes that call, not the algorithm.

# Concluding Remarks

The issue here is not that outlier detection is broken. It's that "outlier" means different things depending on who is asking. Z-score and IQR catch values that are extreme on a single dimension. Isolation Forest and LOF find samples that stand out in the overall pattern. The elliptic envelope works well when your data is truly Gaussian (ours was not).

Before choosing a method, figure out exactly what you’re looking for. And if you’re not sure? Run multiple methods and go with consensus.

# Common Questions

// 1. Deciding which method to start with

A good place to start is Isolation Forest. It makes no assumptions about how your data is distributed and considers all of your features at once. However, if you want to catch extreme values of one particular measurement (such as a very high blood pressure reading), the Z-score or IQR may be more appropriate.

// 2. Choosing contamination rates for scikit-learn methods

It depends on the problem you are trying to solve. A commonly used value is 5% (or 0.05). But keep in mind that contamination is a quota: it means 5% of your samples will be flagged as outliers, regardless of whether your data actually contains 1% or 20% true outliers. Set the contamination rate based on what you know about the proportion of outliers in your data.

// 3. Removing outliers before splitting train/test data

No. Fit the outlier-detection model on your training set, then apply the fitted model to your test set. Doing it the other way around lets your test data influence your preprocessing, which introduces leakage.
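A minimal sketch of the leakage-safe pattern (synthetic data; the key point is that the model is fitted only on the training split):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Fit on training data only...
iforest = IsolationForest(contamination=0.05, random_state=42).fit(X_train)

# ...then apply the already-fitted model to the test data
train_flags = iforest.predict(X_train) == -1
test_flags = iforest.predict(X_test) == -1  # no refitting on test
```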

// 4. Handling categorical attributes

The techniques covered here work on numerical data. For categorical attributes, you have three options:

  • Encode your categorical variables and continue;
  • Use a method designed for mixed-type data (such as HBOS);
  • Run outlier detection separately on numeric columns and use frequency-based methods for categorical columns.

// 5. Knowing whether a flagged outlier is an error or simply unusual

The algorithm alone cannot tell you whether a flagged outlier is an error or simply unusual: it marks what is unusual, not what is wrong. For example, a wine with an extremely high residual sugar content may be a data entry error, or it may be a dessert wine that is meant to be that sweet. Ultimately, only your domain expertise can answer. If you're unsure, mark the point for review instead of deleting it automatically.

Nate Rosidi is a data scientist working in product strategy. He is also an adjunct professor teaching analytics, and the founder of StratScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.
