
Image by author
## Introduction
Outliers are extreme values in a dataset. They are so extreme that they can ruin your analysis by heavily distorting statistics like the mean. For example, in a player height dataset, a 12-foot height is beyond the range even for NBA players and would pull the mean significantly upward.
How will we handle them? We will answer this question by working through a real-life data project given by Physician Partners during its data scientist recruitment process.
First, we will explore detection methods, define outliers, and finally write prompts to execute the process.
## What Are the Methods for Outlier Detection and Removal?
Outlier detection depends on the dataset you have. How?
For example, if your dataset distribution is normal, you can use standard deviation or z-score to detect them. However, if your dataset does not follow a normal distribution, you can use the percentile method, principal component analysis (PCA), or interquartile range (IQR) method.
You can check this article to see how to detect outliers using a box plot.
In this section, we will explore the methodology and Python code to implement these techniques.
### Standard Deviation Method
In this method, we can define outliers by measuring how much each value deviates from the mean.
For example, in the graph below, you can see the normal distribution and ±3 standard deviations from the mean.

To use this method, first calculate the mean and the standard deviation. Next, determine the range by adding and subtracting three standard deviations from the mean, and filter the dataset to keep only values within this range. Here is the pandas code that performs this operation.
import pandas as pd

col = df['column']
mean = col.mean()
std = col.std()
lower = mean - 3 * std
upper = mean + 3 * std
# Keep only values within the 3-standard-deviation range
filtered_df = df[(col >= lower) & (col <= upper)]
We make one assumption: the dataset should follow a normal distribution. What is a normal distribution? It means the data follows a balanced, bell-shaped curve. Here is an example:
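Since this method relies on the normality assumption, it can help to sanity-check it before filtering. Here is a minimal sketch of such a check; the skewness and kurtosis tolerances are illustrative choices, not standard values, and the simulated columns are invented for the example.

```python
import numpy as np
import pandas as pd

def roughly_normal(series: pd.Series, skew_tol: float = 0.5, kurt_tol: float = 1.0) -> bool:
    """Crude normality check: near-zero skewness and excess kurtosis.

    The tolerances are illustrative; a formal test such as
    scipy.stats.normaltest is preferable when SciPy is available.
    """
    return bool(abs(series.skew()) < skew_tol and abs(series.kurt()) < kurt_tol)

rng = np.random.default_rng(0)
normal_col = pd.Series(rng.normal(170, 10, 5_000))   # bell-shaped heights
skewed_col = pd.Series(rng.exponential(10, 5_000))   # right-skewed costs

print(roughly_normal(normal_col))   # True  -> std-dev method is reasonable
print(roughly_normal(skewed_col))   # False -> prefer IQR or percentiles
```

If the check fails, switch to one of the distribution-free methods covered next.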

Using this method, you will flag approximately 0.3% of the data as outliers, because 3 standard deviations from the mean cover approximately 99.7% of normally distributed data.
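The ~0.3% figure is easy to verify empirically. A quick sketch on simulated data (the column name and distribution parameters are invented for illustration):

```python
import numpy as np
import pandas as pd

# Simulated "height" column to check the ~0.3% figure.
rng = np.random.default_rng(42)
df = pd.DataFrame({"height": rng.normal(175, 7, 100_000)})

col = df["height"]
mean, std = col.mean(), col.std()
outliers = df[(col < mean - 3 * std) | (col > mean + 3 * std)]

print(f"Flagged {len(outliers) / len(df):.2%} of rows")  # close to 0.27%
```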
### IQR Method
The interquartile range (IQR) represents the middle 50% of your data and shows the most common values in your dataset, as shown in the graph below.

To detect outliers using IQR, first calculate the IQR itself. In the following code, we compute the first and third quartiles (the 0.25 and 0.75 quantiles) and subtract the first from the third to get the IQR.
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
Once you have the IQR, you need to create filters defining the ranges.
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
Any values outside these ranges will be marked as outliers.
filtered_df = df[(df['column'] >= lower) & (df['column'] <= upper)]
As you can see from the image below, the IQR is the box in the middle, and you can clearly see the limits we defined (Q1 - 1.5 × IQR and Q3 + 1.5 × IQR).
You can apply IQR to any distribution, but it works best if the distribution is not overly skewed.
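Putting the IQR steps above together, here is a self-contained sketch on a toy salary column (the data is invented for illustration):

```python
import pandas as pd

# Toy salary column (values in thousands, invented): one extreme entry.
df = pd.DataFrame({"salary": [42, 45, 47, 50, 52, 55, 58, 60, 300]})

Q1 = df["salary"].quantile(0.25)
Q3 = df["salary"].quantile(0.75)
IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
filtered_df = df[(df["salary"] >= lower) & (df["salary"] <= upper)]

print(filtered_df["salary"].tolist())  # [42, 45, 47, 50, 52, 55, 58, 60]
```

The extreme 300 entry falls well above Q3 + 1.5 × IQR and is dropped.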
### Percentile Method
The percentile method involves removing values outside a chosen percentile range.
Cutoffs at the most extreme 1% to 5% of the data are commonly used, because that tail range typically contains the outliers.
We did the same in the previous section while calculating IQR, like this:
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
For example, let’s define values above the 99th percentile and below the 1st percentile as outliers.
lower_p = df['column'].quantile(0.01)
upper_p = df['column'].quantile(0.99)
Finally, filter the dataset based on these limits.
filtered_df = df[(df['column'] >= lower_p) & (df['column'] <= upper_p)]
Unlike the standard deviation method (which assumes a normal distribution) and the IQR method (which assumes the distribution is not highly skewed), this method does not depend on distributional assumptions.
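To see why the method is distribution-free, here is a sketch applying the 1st/99th percentile cutoffs to simulated right-skewed data (the cost column and its parameters are invented for illustration):

```python
import numpy as np
import pandas as pd

# Heavily right-skewed "cost" column, where the std-dev
# method's normality assumption clearly does not hold.
rng = np.random.default_rng(7)
df = pd.DataFrame({"cost": rng.lognormal(mean=3, sigma=1, size=10_000)})

lower_p = df["cost"].quantile(0.01)
upper_p = df["cost"].quantile(0.99)
filtered_df = df[(df["cost"] >= lower_p) & (df["cost"] <= upper_p)]

# By construction, about 2% of rows are trimmed regardless of shape.
print("Removed:", len(df) - len(filtered_df))
```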
## Outlier Detection Data Project from Physician Partners
Physician Partners is a health care group that helps doctors coordinate patient care more effectively. In this data project, they asked us to create an algorithm that can detect outliers in one or multiple columns.
First, let’s explore the dataset using this code.
sfrs = pd.read_csv('sfr_test.csv')
sfrs.head()
Here is the output:
| member_unique_id | gender | date_of_birth | eligible_year | eligible_month | affiliation_type | pbp_group | plan_name | npi | line_of_business |
|---|---|---|---|---|---|---|---|---|---|
| 1 | F | 21/06/1990 | 2020 | 202006 | associated | Non-SNP | Medicare – Carefree | 1 | HMO |
| 2 | M | 02/01/1948 | 2020 | 202006 | associated | Non-SNP | NaN | 1 | HMO |
| 3 | M | 14/06/1948 | 2020 | 202006 | associated | Non-SNP | Medicare – Carefree | 1 | HMO |
| 4 | M | 10/02/1954 | 2020 | 202006 | associated | D-SNP | Medicare – CareNeeds | 1 | HMO |
| 5 | M | 31/12/1953 | 2020 | 202006 | associated | Non-SNP | NaN | 1 | HMO |
However, there are more columns than the head() method shows. To see them, let’s use the info() method and look at the output.
This dataset contains synthetic health care and financial information, including demographics, plan details, clinical flags, and financial columns used to identify members with unusually high spending.
Here are those columns and their explanations.
| Column | Explanation |
|---|---|
| member_unique_id | Member ID |
| gender | Member’s gender |
| date_of_birth | Member’s date of birth |
| eligible_year | Eligibility year |
| eligible_month | Eligibility month |
| affiliation_type | Physician affiliation type |
| pbp_group | Health plan group |
| plan_name | Name of the health plan |
| npi | Physician’s ID (National Provider Identifier) |
| line_of_business | Health plan type |
| ESRD | True if the patient is on dialysis |
| Hospice | True if the patient is in hospice |
As the project data description notes, there is a problem: some data points include a dollar sign (“$”), so this needs attention.

Let’s take a closer look at this column.
Here is the output.

We need to handle the dollar signs and these commas so that we can do proper numeric analysis.
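This kind of cleanup is straightforward in pandas. A sketch on invented values in the style of the project’s financial columns (the column name is taken from the financial column list; the values are made up):

```python
import pandas as pd

# Invented values mimicking the "$1,234.56" formatting in the dataset.
df = pd.DataFrame({"ipa_funding": ["$1,250.00", "$980.50", "$12,400.75"]})

# Strip the dollar sign and comma separators, then convert to float.
df["ipa_funding"] = (
    df["ipa_funding"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

print(df["ipa_funding"].tolist())  # [1250.0, 980.5, 12400.75]
```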
## Prompt Crafting for Outlier Detection
Now that we are aware of the dataset’s quirks, it’s time to write two separate prompts: one to detect outliers and one to remove them.
### Prompt for Detecting Outliers
We’ve learned three different techniques, so we should include them in the prompt.
Furthermore, as the info() output shows, the dataset contains NaNs (missing values): most columns have 10,530 entries, but some have missing values (for example, the plan_name column has only 6,606 non-null values). Keep this in mind.
Here is the prompt:
You are a data analysis assistant. I have attached a dataset. Your task is to detect outliers using three methods: standard deviation, IQR, and percentile.
Follow these steps:
1. Load the attached dataset and remove the “$” sign and any comma separators (“,”) from the financial columns, then convert them to numeric.
2. Handle missing values by removing rows that contain NA in the numeric columns we analyze.
3. Apply all three methods to the financial columns:
– Standard deviation method: flag values outside the mean +/- 3 * std
– IQR method: flag values outside Q1 – 1.5 * IQR and Q3 + 1.5 * IQR
– Percentile method: use the 1st and 99th percentiles as cutoffs
4. Instead of listing all the results for each column, just calculate and output:
– Total number of outliers found across all financial columns for each method
– Average number of outliers per column for each method
Additionally, save the row indices of detected outliers into three separate CSV files:
– sd_outlier_indices.csv
– iqr_outlier_indices.csv
– percentile_outlier_indices.csv
The output should contain only the summary; save the indices to the CSV files.
financial_columns = [
"ipa_funding",
"ma_premium",
"ma_risk_score",
"mbr_with_rx_rebates",
"partd_premium",
"pcp_cap",
"PCP_FFS",
"plan_premium",
"Professor",
"Reinsurance",
"risk_score_part",
"rx",
"rx_rebates",
"rx_with_rebates",
"rx_without_rebates",
"spec_cap",
]
The prompt above will first load the dataset and handle the missing values by removing them. Next, it will output the number of outliers in the financial columns and create three CSV files containing the outlier indices for each technique.
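If you prefer to run the detection yourself rather than through an LLM, the three methods from the prompt can be sketched in a few lines of pandas. The toy column below is invented; the real project would loop over all financial columns after cleaning them, and also save the flagged indices to the three CSV files.

```python
import pandas as pd

def outlier_indices(col: pd.Series) -> dict:
    """Row indices flagged as outliers by each of the three methods."""
    mean, std = col.mean(), col.std()
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    p1, p99 = col.quantile(0.01), col.quantile(0.99)
    return {
        "sd": col[(col < mean - 3 * std) | (col > mean + 3 * std)].index,
        "iqr": col[(col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)].index,
        "percentile": col[(col < p1) | (col > p99)].index,
    }

# Toy frame standing in for one cleaned financial column.
df = pd.DataFrame({"pcp_cap": [10] * 5 + [11] * 5 + [12] * 5 + [13] * 4 + [500]})
result = outlier_indices(df["pcp_cap"].dropna())

print({k: list(v) for k, v in result.items()})  # each method flags only index 19
```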
### Prompt for Removing Outliers
After finding the indices, the next step is to delete those rows. For this, we will write another prompt.
You are a data analysis assistant. I have attached a dataset along with a CSV file that contains the indices of the outlier rows.
Your task is to remove these outliers and return a clean version of the dataset.
1. Load the dataset.
2. Remove all the given outliers using the given indices.
3. Confirm how many values were removed.
4. Return the cleaned dataset.
This prompt first loads the dataset and then removes the outliers at the given indices.
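For reference, the removal step the prompt describes amounts to a single drop call in pandas. A sketch with invented data; in practice the index list would be loaded from one of the saved CSV files produced by the detection prompt.

```python
import pandas as pd

# Invented data; in practice outlier_idx would come from, e.g.,
# pd.read_csv("sd_outlier_indices.csv").
df = pd.DataFrame({"value": [10, 11, 500, 12, 9]})
outlier_idx = [2]  # row index of the 500 entry

cleaned = df.drop(index=outlier_idx)
print("Removed rows:", len(df) - len(cleaned))  # Removed rows: 1
print(cleaned["value"].tolist())                # [10, 11, 12, 9]
```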
## Testing the Prompts
Let’s examine how these prompts work. First, download the dataset.
### Outlier Detection Prompt
Now, attach the dataset to ChatGPT (or the large language model (LLM) of your choice). After attaching the dataset, paste the outlier detection prompt. Let’s see the output.

The output shows how many outliers each method detected, the average per column, and, as requested, CSV files containing the indices of these outliers.
We then ask it to make all the CSVs downloadable with this prompt:
Prepare cleaned CSV for download
Here is the output with links.


### Outlier Removal Prompt
This is the final step. Select the method you want to use to remove outliers, then copy the outlier removal prompt. Attach the corresponding CSV and send it with this prompt.

We removed the outliers. Now, let’s verify this with Python. The following code reads the cleaned dataset and compares shapes to show the before and after.
cleaned = pd.read_csv("/cleaned_dataset.csv")
print("Before:", sfrs.shape)
print("After :", cleaned.shape)
print("Removed rows:", sfrs.shape[0] - cleaned.shape[0])
Here is the output.

This confirms that we have removed 791 outliers using the standard deviation method with ChatGPT.
## Final Thoughts
Removing outliers not only improves the performance of your machine learning model but also makes your analysis more robust. Extreme values can ruin your analysis. Where do these outliers come from? They may be simple typing mistakes, or they may be values that appear in the dataset but do not represent the actual population, such as a 7-foot guy like Shaquille O’Neal.
To remove outliers, you can apply these techniques yourself in Python, or go a step further and bring AI into the process with your prompts. Always be careful, as your dataset may contain peculiarities that the AI cannot understand at first glance, such as the “$” sign.
Nate Rosidi is a data scientist and works in product strategy. He is also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.