Expedited engineering for external testing

by
0 comments
Expedited engineering for external testing

Expedited engineering for external testing
Image by author

, Introduction

Outliers represent extreme values ​​in a given dataset. They are so extreme that they can ruin your analysis by heavily distorting statistics like the mean. For example, in a player height dataset, 12 feet is over a range even for NBA players and will pull the mean significantly upward.

How will we handle them? We will answer this question by demonstrating a real-life data project requested by Physician Partners during the data scientist recruitment process.

First, we will explore detection methods, define outliers, and finally create signals to execute the process.

, What are the methods of external investigation and removal?

Outlier detection depends on the dataset you have. How?

For example, if your dataset distribution is normal, you can use standard deviation or z-score to detect them. However, if your dataset does not follow a normal distribution, you can use the percentile method, principal component analysis (PCA), or interquartile range (IQR) method.

you can check this article To see how to detect outliers using a box plot.

In this section, we will explore the methodology and Python code to implement these techniques.

, standard deviation method

In this method, we can define outliers by measuring how much each value deviates from the mean.

For example, in the graph below, you can see the normal distribution and ( pm3 ) standard deviation from the mean.

Expedited engineering for external testingExpedited engineering for external testing

To use this method, first measure the mean and calculate the standard deviation. Next, determine the range by adding and subtracting three standard deviations from the mean, and filter the dataset to keep only values ​​within this range. here it is Panda The code that performs this operation.

import pandas as pd
import numpy as np

col = df('column')

mean = col.mean()
std = col.std()

lower = mean - 3 * std
upper = mean + 3 * std

# Keep values within the 3 std dev range
filtered_df = df((col >= lower) & (col <= upper))

We make an assumption: the dataset should follow a normal distribution. what is a normal distributionThis means that the data follows a balanced, bell-shaped distribution. Here is an example:

Expedited engineering for external testingExpedited engineering for external testing

Using this method, you will mark approximately 0.3% of the data as outliers, because 3 standard deviations from the mean cover approximately 99.7% of the data.
Expedited engineering for external testingExpedited engineering for external testing

, IQR

The interquartile range (IQR) represents the middle 50% of your data and shows the most common values ​​in your dataset, as shown in the graph below.

Expedited engineering for external testingExpedited engineering for external testing

To detect outliers using IQR, first calculate the IQR. In the following code, we define the first and third quartiles and subtract the first quartile from the third to find the IQR (( 0.75 – 0.25 = 0.5 )).

Q1 = df('column').quantile(0.25)
Q3 = df('column').quantile(0.75)

IQR = Q3 - Q1

Once you have the IQR, you need to create filters defining the ranges.

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

Any values ​​outside these ranges will be marked as outliers.

filtered_df = df((df('column') >= lower) & (df('column') <= upper))

As you can see from the image below, IQR represents the box in the middle. You can clearly see the limits we have defined (( pm1.5 text{ IQR} )).
Expedited engineering for external testingExpedited engineering for external testing

You can apply IQR to any distribution, but it works best if the distribution is not overly skewed.

, percentage

The percentage method involves removing values ​​based on the range chosen.

This threshold is commonly used because it removes the most extreme 1% to 5% of the data, which typically contain outliers.

We did the same in the previous section while calculating IQR, like this:

Q1 = df('column').quantile(0.25)
Q3 = df('column').quantile(0.75)

For example, let’s define the upper 99% and lower 1% of the dataset as outliers.

lower_p = df('column').quantile(0.01)
upper_p = df('column').quantile(0.99)

Finally, filter the dataset based on these limits.

filtered_df = df((df('column') >= lower_p) & (df('column') <= upper_p))

This method does not depend on assumptions, unlike the standard deviation (normal distribution) and IQR methods (non-highly skewed distribution).

, Outliers Detection Data Project from Physician Partners

Physician Partners is a health care group that helps doctors coordinate patient care more effectively. In This data projectThey asked us to create an algorithm that could detect outliers in data in one or multiple columns.

First, let’s explore the dataset using this code.

sfrs = pd.read_csv('sfr_test.csv')
sfrs.head()

Here is the output:

member_unique_id gender date of birth eligible_year eligible_month affiliation_type pbp_group scheme_name npi line of business
1 F 21/06/1990 2020 202006 associated Non-SNP Medicare – Carefree 1 HMO
2 m 02/01/1948 2020 202006 associated Non-SNP Nan 1 HMO
3 m 14/06/1948 2020 202006 associated Non-SNP Medicare – Carefree 1 HMO
4 m 10/02/1954 2020 202006 associated D-SNP Medicare – CareNeeds 1 HMO
5 m 31/12/1953 2020 202006 associated Non-SNP Nan 1 HMO

However, there are more columns that we did not see head() Method To see them, let’s use info() Method

And let’s see the output.
Expedited engineering for external testingExpedited engineering for external testing

This dataset contains synthetic health care and financial information, including demographics, plan details, clinical flags, and financial columns used to identify members with unusually high spending.

Here are those columns and their explanations.

pillar Explanation
member_unique_id member id
gender gender of member
date of birth Member’s date of birth
eligible_year Year
eligible_month month
affiliation_type type of doctor
pbp_group health planning group
scheme_name name of health plan
npi doctor’s id
line of business health plan type
ESRD This is true if the patient is on dialysis
Hospice This is true if the patient is in hospice

As you can see from the project data details, there is a problem: some data points include a dollar sign (“$”), so this needs attention.

Expedited engineering for external testingExpedited engineering for external testing

Let’s take a closer look at this column.

Here is the output.

Expedited engineering for external testingExpedited engineering for external testing

We need to pay attention to the dollar signs and these commas so that we can do proper data analysis.

, Quick crafting for external detection

Now we are aware of the specifications of the dataset. Now it’s time to write two separate signals: one to detect outliers and one to remove them.

, Hints for detecting outliers

We’ve learned three different techniques, so we should include them in the prompt.

Furthermore, as you can see info() Method output, dataset contains NaNs (missing values): Most columns have 10,530 entries, but some columns have missing values ​​(for example, plan_name Column with 6,606 non-null values). This thing should be kept in mind.

Here is the hint:

You are a data analysis assistant. I have attached a dataset. Your task is to detect outliers using three methods: standard deviation, IQR, and percentage.

follow these steps:

1. Load the attached dataset and remove the “$” sign and any comma separators (“”,””) from the financial columns, then convert them to numeric.

2. Handle missing values ​​by removing rows that contain NA in the numeric columns we analyze.

3. Apply three methods to financial columns:

Standard deviation method: flag value outside the mean +/- 3 * class

IQR method: Flag values ​​outside Q1 – 1.5 * IQR and Q3 + 1.5 * IQR

Percentile Method: Use 1st and 99th percentile as cutoff

4. Instead of listing all the results for each column, just calculate and output:

– Total number of outliers found across all financial columns for each method
– Average number of outliers per column for each method

Additionally, save the row indices of detected outliers into three separate CSV files:
– sd_outlier_indices.csv
– iqr_outlier_indices.csv
– percentile_outlier_indices.csv

The output only calculates the summary and saves the indices to CSV.

financial_columns = (
“ipa_funding”,
“ma_premium”,
“ma_risk_score”,
“mbr_with_rx_rebates”,
“partd_premium”,
“pcp_cap”,
“PCP_FFS”,
“plan_premium”,
“Professor”,
“Reinsurance”,
“risk_score_part”,
“rx”,
“rx_rebates”,
“rx_with_rebates”,
“rx_without_rebates”,
“spec_cap”
,

This signal above will first load the dataset and handle the missing values ​​by removing them. Next, it will output the number of outliers using the financial column and create three CSV files. They will include indices of missing values ​​for each of these techniques.

, Hint to remove outliers

After finding the indices, the next step is to delete them. To do this, we will also write a hint.

You are a data analysis assistant. I have attached a dataset with CSV that contains indices that are outliers.

Your task is to remove these outliers and return a clean version of the dataset.

1. Load the dataset.
2. Remove all the given outliers using the given indices.
3. Confirm how many values ​​were removed.
4. Return the cleaned dataset.

This prompt first loads the dataset and removes outliers using the given indices.

, test signal

Let’s examine how those signals work. First, download the dataset.

, outlier detection prompt

Now, attach the dataset you have to ChatGPT (or large language model (LLM) of your choice). Paste hints to detect outliers after attaching the dataset. Let’s see the output.

Expedited engineering for external testingExpedited engineering for external testing

The output shows how many outliers each method detected, the average per column, and, as requested, CSV files containing the IDs of these outliers.

We then ask him to make all CSVs downloadable with this prompt:

Prepare cleaned CSV for download

Here is the output with links.

Expedited engineering for external testingExpedited engineering for external testing

, external removal signal

This is the final step. Select the method you want to use to remove outliers, then copy the outliers removal prompt. Attach and send the CSV with this prompt.

Expedited engineering for external testingExpedited engineering for external testing

We removed outliers. Now, let’s verify this using Python. The following code will read the cleaned dataset and compare shapes to show before and after.

cleaned = pd.read_csv("/cleaned_dataset.csv")

print("Before:", sfrs.shape)
print("After :", cleaned.shape)
print("Removed rows:", sfrs.shape(0) - cleaned.shape(0))

Here is the output.

Expedited engineering for external testingExpedited engineering for external testing

This confirms that we have removed 791 outliers using the standard deviation method with ChatGPT.

, final thoughts

Removing outliers not only increases the efficiency of your machine learning model but also makes your analysis more robust. Excessive values ​​can ruin your analysis. The reason for these outliers? They may be simple typing mistakes, or they may be values ​​that appear in the dataset but do not represent the actual population, such as a 7-foot guy like Shaquille O’Neal.

To remove outliers, you can use those techniques using Python or go a step further and incorporate AI into the process using your signals. Always be very careful as your dataset may contain peculiarities that the AI ​​cannot understand at first glance, such as the “$” sign.

Nate Rosidi Is a data scientist and is into product strategy. He is also an adjunct professor teaching analytics, and is the founder of StratScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.

Related Articles

Leave a Comment