
# Introduction to the Experiment
Hyperparameter tuning is often touted as a magic pill for machine learning. The promise is simple: tweak a few parameters for a few hours, run a grid search, and watch your model’s performance soar.
But does it really work in practice?


We tested this premise on Portuguese student performance data using four different classifiers and rigorous statistical validation. Our approach used nested cross-validation (CV), robust preprocessing pipelines, and statistical significance testing – the whole nine yards.
The outcome? Average performance dropped by 0.0005. That’s right: tuning actually made the results slightly worse, although the difference was not statistically significant.
However, this is not a failure story. This is something even more valuable: evidence that in many cases, default settings work remarkably well. Sometimes the best move is to know when to stop tuning and focus your efforts elsewhere.
Want to see the full experiment? Check out the complete Jupyter Notebook with all the code and analysis.
# Setting Up the Dataset


We used the dataset from the StrataScratch “Student Performance Analysis” project. It contains records of 649 students with 30 attributes covering demographics, family background, social factors, and school-related information. The objective was to predict whether students pass their final Portuguese class (a final grade of ≥ 10).
An important decision in this setup was to exclude G1 and G2 grades. These are first and second period grades that correlate 0.83–0.92 with the final grade, G3. Including them makes predictions too easy and defeats the purpose of the experiment. We wanted to identify what is predictive of success beyond prior performance in the same course.
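That kind of correlation is easy to check with pandas. The mini-frame below is purely hypothetical (the real check runs on the full CSV), but it shows the computation behind the exclusion decision:

```python
import pandas as pd

# Hypothetical stand-in for the real grade columns -- the actual values
# come from student-por.csv; these are made up for illustration.
grades = pd.DataFrame({
    'G1': [8, 10, 12, 14, 16],
    'G2': [7, 10, 13, 14, 17],
    'G3': [6, 10, 12, 15, 18],
})

# Pearson correlation of the period grades with the final grade G3
print(grades.corr().loc['G3', ['G1', 'G2']])
```

When the period grades track the final grade this closely, they dominate the model and mask every other feature.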
We used the pandas library to load and prepare the data:
```python
import pandas as pd

# Load and prepare data
df = pd.read_csv('student-por.csv', sep=';')

# Create pass/fail target (grade >= 10)
PASS_THRESHOLD = 10
y = (df['G3'] >= PASS_THRESHOLD).astype(int)

# Exclude G1, G2, G3 to prevent data leakage
features_to_exclude = ['G1', 'G2', 'G3']
X = df.drop(columns=features_to_exclude)
```
Class distribution revealed that 100 students failed (15.4%) while 549 passed (84.6%). Since the data is imbalanced, we optimized for F1-score rather than simple accuracy.
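To see why accuracy misleads here, consider a degenerate classifier on hypothetical labels with the same 85/15 split: it scores high accuracy while being useless on the minority class.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels mirroring the ~85/15 pass/fail split (not the real data)
y_true = [1] * 85 + [0] * 15
y_pred = [1] * 100  # a "classifier" that always predicts "pass"

acc = accuracy_score(y_true, y_pred)             # 0.85 -- looks strong
f1_fail = f1_score(y_true, y_pred, pos_label=0)  # 0.0 -- never catches a failure
print(acc, f1_fail)
```

F1 on the minority class exposes what accuracy hides, which is why it was the tuning objective.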
# Evaluating the Classifiers
We selected four classifiers representing different learning approaches:


Each model was initially run with default parameters, followed by tuning via grid search with 5-fold CV.
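A minimal sketch of that tuning step, using synthetic data and an illustrative grid (not the actual search space from the experiment):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the student dataset
X_demo, y_demo = make_classification(n_samples=300, weights=[0.85], random_state=42)

# Illustrative grid -- the real experiment searched per-model grids
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,          # 5-fold CV, as in the experiment
    scoring='f1',  # tune for F1, not accuracy
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```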
# Establishing a Strong Methodology
Many machine learning tutorials demonstrate impressive tuning results because they omit important validation steps. We maintained high standards to ensure that our findings were reliable.
Our methodology includes:
- No data leakage: All preprocessing was done inside pipelines and fit only on the training data
- Nested cross-validation: We used an inner loop for hyperparameter tuning and an outer loop for the final evaluation
- Appropriate train/test split: We used an 80/20 split with stratification, keeping the test set separate until the end (i.e., no “peeking”)
- Statistical validation: We applied McNemar’s test to verify whether the difference in performance was statistically significant
- Metric selection: We prioritized F1-score for imbalanced classes rather than accuracy
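The nested-CV idea in miniature (synthetic data; the estimator and grid are placeholders, not the experiment's actual configuration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X_demo, y_demo = make_classification(n_samples=300, random_state=0)

# Inner loop: GridSearchCV picks hyperparameters on each training fold
inner = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {'C': [0.1, 1, 10]},
    cv=3,
    scoring='f1',
)

# Outer loop: cross_val_score measures generalization of the whole
# tuning procedure, so the score is never inflated by the search itself
outer_scores = cross_val_score(inner, X_demo, y_demo, cv=5, scoring='f1')
print(outer_scores.mean())
```

Because the outer folds never see the data the inner search tuned on, the reported score estimates how the tuned pipeline performs on genuinely unseen data.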


The pipeline structure was as follows:
```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Preprocessing pipeline - fit only on training folds
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine transformers
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, X.select_dtypes(include=['int64', 'float64']).columns),
    ('cat', categorical_transformer, X.select_dtypes(include=['object']).columns)
])

# Full pipeline with model
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', model)
])
```
# Analysis of the Results
After completing the tuning process, the results were surprising:

The average improvement across all models was -0.0005.
Three of the four models actually performed slightly worse after tuning. XGBoost showed an improvement of about 1%, which seemed promising until we applied statistical tests. When evaluated on the hold-out test set, none of the models demonstrated a statistically significant difference.
We ran McNemar’s test comparing the two best-performing models (Random Forest among them). The p-value was 1.0, meaning there is no significant difference between the default and tuned versions.
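For reference, McNemar's test needs only the 2×2 table of where the two models agree and disagree on the same test set. The counts below are made up for illustration, not the experiment's actual table:

```python
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical counts: rows = model A correct/wrong, cols = model B correct/wrong.
# Only the off-diagonal disagreements (5 and 6) drive the test.
table = [[110, 5],
         [6, 9]]

result = mcnemar(table, exact=True)  # exact binomial test on the disagreements
print(result.pvalue)
```

With nearly symmetric disagreements like these, the p-value comes out near 1, i.e., no evidence that either model is better.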
# Explaining Why the Tuning Failed


Several factors explain these results:
- Strong defaults. scikit-learn and XGBoost ship with highly optimized default parameters. Library maintainers have refined these values over the years to ensure they work well across a wide variety of datasets.
- Limited signal. After removing the G1 and G2 grades (which would cause data leakage), the remaining features had low predictive power. There was not enough signal left for hyperparameter optimization to exploit.
- Small dataset size. With only 649 samples divided into training folds, there was insufficient data for the grid search to identify truly meaningful patterns. Grid search requires enough data to reliably distinguish between different parameter sets.
- Performance ceiling. Most baseline models already scored between 92% and 93% F1. Without better features or more data, the scope for improvement is naturally limited.
- Strict methodology. When you eliminate data leakage and use nested CV, the inflated improvements often seen with improper validation disappear.
# Learning from the Results


This experiment offers several valuable lessons for any practitioner:
- Methodology matters more than metrics. Fixing data leakage and using proper validation changed the outcome of the experiment. When the process is handled correctly, the impressive gains produced by improper validation vanish.
- Statistical validation is required. Without McNemar’s test, we would have incorrectly deployed XGBoost based on a nominal 1% improvement. Testing revealed that it was mere noise.
- Negative results are of utmost importance. Not every experiment needs to show massive improvements. Knowing when tuning doesn’t help saves time on future projects and is a sign of a mature workflow.
- The default hyperparameters are underestimated. The defaults are often sufficient for standard datasets. Don’t assume you need to tune every parameter right from the start.
# Summary of Findings
We attempted to boost model performance by performing thorough hyperparameter tuning, following industry best practices, and applying statistical validation to four different models.
Results: No statistically significant improvement.


This is *not* a failure. Instead, it represents the kind of honest result that lets you make better choices in real-world projects. It tells you when to stop hyperparameter tuning and shift your attention to other important aspects, such as data quality, feature engineering, or collecting additional samples.
Machine learning is not about getting the highest possible number by any means necessary; it’s about building models you can trust. That trust comes from the methodological rigor used to create the model, not from chasing marginal gains. The hardest skill in machine learning is knowing when to stop.


Nate Rosidi is a data scientist and product strategist. He is also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.