
Image by author
# Introduction
Machine learning models of medium to high complexity have a wide range of parameters that are not learned from the data but must instead be set a priori: these are known as hyperparameters. Models such as random forest ensembles and neural networks have a variety of hyperparameters to adjust, each of which can take one of many different values. As a result, the number of possible ways to configure even a small subset of hyperparameters quickly becomes enormous. This creates a problem: identifying the optimal configuration of these hyperparameters – i.e. the one providing the best model performance – can be like trying to find a needle in a haystack, or worse, in the ocean.
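To make the distinction concrete: in scikit-learn, for example, a random forest's hyperparameters are fixed when the model is instantiated, before it ever sees data, while the model's actual parameters (the trees) are learned by fit(). A minimal, self-contained sketch (the values shown are arbitrary, not tuned):
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)

# Hyperparameters: chosen by us before training (arbitrary example values)
model = RandomForestClassifier(n_estimators=100, max_depth=10)

# Parameters (the individual decision trees) are learned from the data
model.fit(X, y)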
This article builds on a previous Machine Learning Mastery guide, taking a practical approach to the art of hyperparameter tuning and illustrating the use of intermediate to advanced hyperparameter tuning techniques in practice.
Specifically, you will learn how to apply these three hyperparameter tuning techniques:
- Random search
- Bayesian optimization
- Successive halving
# Performing Initial Setup
Before we begin, we’ll import the required libraries and dependencies – if you hit a “module not found” error for any of these, be sure to pip install the library in question first. We will be working with numpy, scikit-learn, and optuna:
import numpy as np
import time
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
import optuna
import warnings
warnings.filterwarnings('ignore')
We will also load the dataset used in all three examples: the Modified National Institute of Standards and Technology (MNIST) dataset, for classification of low-resolution images of handwritten digits.
print("=" * 70)
print("LOADING MNIST DATASET FOR IMAGE CLASSIFICATION")
print("=" * 70)
# Load digits dataset (lightweight version of MNIST: 8x8 images, 1797 samples)
digits = load_digits()
X, y = digits.data, digits.target
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Training instances: {X_train.shape(0)}")
print(f"Test instances: {X_test.shape(0)}")
print(f"Features: {X_train.shape(1)}")
print(f"Classes: {len(np.unique(y))}")
print()
Next, we define a hyperparameter search space; that is, we identify each hyperparameter and the range of values we want to try in combination.
print("=" * 70)
print("HYPERPARAMETER SEARCH SPACE")
print("=" * 70)
# Typical hyperparameters to explore in a random forest ensemble
param_space = {
    'n_estimators': (10, 200),       # Number of trees
    'max_depth': (5, 50),            # Maximum tree depth
    'min_samples_split': (2, 20),    # Min samples to split node
    'min_samples_leaf': (1, 10),     # Min samples in leaf node
    'max_features': (0.1, 1.0)       # Fraction of features to consider
}
print("Search space:")
for param, bounds in param_space.items():
    print(f"  {param}: {bounds}")
print()
As a final preparatory step, we define a reusable function. It encapsulates the process of training and evaluating a random forest ensemble under a specific hyperparameter configuration, using classification accuracy with cross-validation (CV) to determine the quality of the model. Note that this function may be called a large number of times by each of the three techniques we implement – once for every hyperparameter value combination that is tried.
def evaluate_model(params, X_train, y_train, cv=3):
    # Instantiate a random forest model with given hyperparameters
    model = RandomForestClassifier(
        n_estimators=int(params['n_estimators']),
        max_depth=int(params['max_depth']),
        min_samples_split=int(params['min_samples_split']),
        min_samples_leaf=int(params['min_samples_leaf']),
        max_features=float(params['max_features']),
        random_state=42,
        n_jobs=-1  # Use all CPU cores for speed
    )
    # Use CV to measure performance
    # This gives us a more robust estimate than a single train/val split
    scores = cross_val_score(model, X_train, y_train, cv=cv,
                             scoring='accuracy', n_jobs=-1)
    # Return the average cross-validation accuracy
    return np.mean(scores)
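As an optional sanity check, the helper can be called directly on a single, arbitrarily chosen (untuned) configuration before launching any search:
# Optional sanity check with an arbitrary, untuned configuration
sample_params = {
    'n_estimators': 100,
    'max_depth': 20,
    'min_samples_split': 2,
    'min_samples_leaf': 1,
    'max_features': 0.5
}
print(f"CV accuracy for sample config: {evaluate_model(sample_params, X_train, y_train):.4f}")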
Now we’re ready to try out three techniques!
# Implementing Random Search
As its name suggests, random search samples hyperparameter combinations randomly from the search space, rather than exhaustively trying all possible combinations in a pre-defined grid, as grid search does. Each trial is independent, with no knowledge carried over from previous trials. Nevertheless, it is a highly effective method in many situations, usually finding high-quality solutions more quickly than grid search.
Here is how random search can be implemented and used on random forest ensembles to classify MNIST data:
def randomized_search(n_trials=30):
    start_time = time.time()  # Optional: used to measure execution time
    results = []
    print(f"\nRunning {n_trials} random trials...")
    for i in range(n_trials):
        # RANDOM SAMPLING: hyperparameters are sampled independently using numpy's random number generation
        params = {
            'n_estimators': np.random.randint(param_space['n_estimators'][0],
                                              param_space['n_estimators'][1]),
            'max_depth': np.random.randint(param_space['max_depth'][0],
                                           param_space['max_depth'][1]),
            'min_samples_split': np.random.randint(param_space['min_samples_split'][0],
                                                   param_space['min_samples_split'][1]),
            'min_samples_leaf': np.random.randint(param_space['min_samples_leaf'][0],
                                                  param_space['min_samples_leaf'][1]),
            'max_features': np.random.uniform(param_space['max_features'][0],
                                              param_space['max_features'][1])
        }
        # Evaluate a randomly defined configuration
        score = evaluate_model(params, X_train, y_train)
        results.append({'params': params, 'score': score})
        # Provide a progress update every 10 trials, for informative purposes
        if (i + 1) % 10 == 0:
            best_so_far = max(results, key=lambda x: x['score'])
            print(f"  Trial {i+1}/{n_trials}: Best score so far = {best_so_far['score']:.4f}")
    # Measure total time taken
    elapsed_time = time.time() - start_time
    # Identify best configuration found
    best_result = max(results, key=lambda x: x['score'])
    print(f"\n✓ Completed in {elapsed_time:.2f} seconds")
    print(f"Best validation accuracy: {best_result['score']:.4f}")
    print(f"Best parameters: {best_result['params']}")
    return best_result, results
# Call the method to perform randomized search over 30 trials
random_best, random_results = randomized_search(n_trials=30)
Comments are provided alongside the code to facilitate understanding. The results obtained will be similar to the following:
Running 30 random trials...
Trial 10/30: Best score so far = 0.9617
Trial 20/30: Best score so far = 0.9617
Trial 30/30: Best score so far = 0.9617
✓ Completed in 64.59 seconds
Best validation accuracy: 0.9617
Best parameters: {'n_estimators': 195, 'max_depth': 16, 'min_samples_split': 8, 'min_samples_leaf': 2, 'max_features': 0.28306570555707966}
Note the time taken to run the hyperparameter search process, as well as the best validation accuracy achieved. In this case, it appears that the first 10 trials were already enough to find the best configuration reported.
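Also note that the held-out test split created earlier is not touched during the search. If you want a final unbiased estimate, one option – sketched here for the random search result, and equally applicable to the other two techniques – is to refit a model with the winning configuration and score it once on the test set:
# Refit on the full training set using the best configuration found,
# then evaluate once on the held-out test split
best_params = random_best['params']
final_model = RandomForestClassifier(
    n_estimators=int(best_params['n_estimators']),
    max_depth=int(best_params['max_depth']),
    min_samples_split=int(best_params['min_samples_split']),
    min_samples_leaf=int(best_params['min_samples_leaf']),
    max_features=float(best_params['max_features']),
    random_state=42,
    n_jobs=-1
)
final_model.fit(X_train, y_train)
print(f"Test accuracy: {final_model.score(X_test, y_test):.4f}")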
# Applying Bayesian Optimization
This method employs an auxiliary or surrogate model – typically a probabilistic model based on Gaussian processes or tree-based structures – to predict which hyperparameter settings are likely to perform best. Trials are not independent; each trial “learns” from previous ones. Additionally, the method attempts to balance exploration (trying new areas of the search space) and exploitation (refining promising areas). In short, it is usually a more sample-efficient approach than grid or random search.
The optuna library provides a specific implementation of Bayesian optimization for hyperparameter tuning that uses a tree-structured Parzen estimator (TPE). It splits past trials into “good” and “bad” groups, models a probability distribution over each, and samples new candidates from the promising regions.
The whole process can be implemented as follows:
def bayesian_optimization(n_trials=30):
    """
    Implementation of Bayesian optimization using the Optuna library.
    """
    start_time = time.time()

    def objective(trial):
        """
        Optuna objective function: given a trial, returns a score.
        """
        # Optuna can suggest values based on past performance
        params = {
            'n_estimators': trial.suggest_int('n_estimators',
                                              param_space['n_estimators'][0],
                                              param_space['n_estimators'][1]),
            'max_depth': trial.suggest_int('max_depth',
                                           param_space['max_depth'][0],
                                           param_space['max_depth'][1]),
            'min_samples_split': trial.suggest_int('min_samples_split',
                                                   param_space['min_samples_split'][0],
                                                   param_space['min_samples_split'][1]),
            'min_samples_leaf': trial.suggest_int('min_samples_leaf',
                                                  param_space['min_samples_leaf'][0],
                                                  param_space['min_samples_leaf'][1]),
            'max_features': trial.suggest_float('max_features',
                                                param_space['max_features'][0],
                                                param_space['max_features'][1])
        }
        # Evaluate and return the score (to be maximized, as set in the study below)
        return evaluate_model(params, X_train, y_train)

    # The create_study() function is used in Optuna to manage and run
    # the overall optimization process
    print(f"\nRunning {n_trials} Bayesian optimization trials...")
    study = optuna.create_study(
        direction='maximize',  # We want to maximize accuracy
        sampler=optuna.samplers.TPESampler(seed=42)  # Bayesian (TPE) sampler
    )

    # Perform optimization process with progress callback
    def callback(study, trial):
        if trial.number % 10 == 9:
            print(f"  Trial {trial.number + 1}/{n_trials}: Best score = {study.best_value:.4f}")

    study.optimize(objective, n_trials=n_trials, callbacks=[callback], show_progress_bar=False)

    elapsed_time = time.time() - start_time
    print(f"\n✓ Completed in {elapsed_time:.2f} seconds")
    print(f"Best validation accuracy: {study.best_value:.4f}")
    print(f"Best parameters: {study.best_params}")
    return study.best_params, study.best_value, study
bayesian_best_params, bayesian_best_score, bayesian_study = bayesian_optimization(n_trials=30)
Output (summary):
✓ Completed in 62.66 seconds
Best validation accuracy: 0.9673
Best parameters: {'n_estimators': 150, 'max_depth': 33, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 0.19145126698170384}
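If you want to inspect how the TPE sampler explored the space, the study object returned above keeps the full trial history. A minimal sketch – note that Optuna's trials_dataframe() assumes pandas is installed:
# Optional: inspect the trial history stored in the returned study object
trials_df = bayesian_study.trials_dataframe()
print(trials_df[['number', 'value']].sort_values('value', ascending=False).head())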
# Using Successive Halving
The last of the three methods, successive halving, balances the breadth of the search against the computing resources allocated to each candidate configuration. It starts with a large number of configurations but limited resources per configuration (e.g. a fraction of the training data), then gradually removes poor performers and allocates more resources to the promising configurations – similar to a real-world tournament where the stronger competitors “survive.”
The following implementation performs successive halving by progressively increasing the size of the training set used in each round:
def successive_halving(n_initial=32, min_resource=0.25, max_resource=1.0):
    start_time = time.time()

    # Step 1: Define initial hyperparameter configurations at random
    print(f"\nGenerating {n_initial} initial random configurations...")
    configs = []
    for _ in range(n_initial):
        config = {
            'n_estimators': np.random.randint(param_space['n_estimators'][0],
                                              param_space['n_estimators'][1]),
            'max_depth': np.random.randint(param_space['max_depth'][0],
                                           param_space['max_depth'][1]),
            'min_samples_split': np.random.randint(param_space['min_samples_split'][0],
                                                   param_space['min_samples_split'][1]),
            'min_samples_leaf': np.random.randint(param_space['min_samples_leaf'][0],
                                                  param_space['min_samples_leaf'][1]),
            'max_features': np.random.uniform(param_space['max_features'][0],
                                              param_space['max_features'][1])
        }
        configs.append(config)

    # Step 2: Apply tournament-like successive rounds of elimination
    current_configs = configs
    current_resource = min_resource
    round_num = 1
    while len(current_configs) > 1 and current_resource <= max_resource:
        # Determine the number of training instances to use in the current round
        n_samples = int(len(X_train) * current_resource)
        print(f"\n--- Round {round_num}: Evaluating {len(current_configs)} configs ---")
        print(f"  Using {current_resource*100:.0f}% of training data ({n_samples} samples)")
        # Subsample training instances
        indices = np.random.choice(len(X_train), size=n_samples, replace=False)
        X_subset = X_train[indices]
        y_subset = y_train[indices]
        # Evaluate all current configs with the current resources
        scores = []
        for i, config in enumerate(current_configs):
            score = evaluate_model(config, X_subset, y_subset, cv=2)  # Use cv=2 (minimum)
            scores.append(score)
            if (i + 1) % 10 == 0 or (i + 1) == len(current_configs):
                print(f"  Evaluated {i+1}/{len(current_configs)} configs...")
        # Elimination policy: keep the top-performing half only
        n_keep = max(1, len(current_configs) // 2)
        sorted_indices = np.argsort(scores)[::-1]  # Descending order
        current_configs = [current_configs[i] for i in sorted_indices[:n_keep]]
        best_score = scores[sorted_indices[0]]
        print(f"  → Keeping top {n_keep} configs. Best score: {best_score:.4f}")
        # Update resources, doubling them for the next round
        current_resource = min(current_resource * 2, max_resource)
        round_num += 1

    # Final evaluation of the best config found, given the full training set
    best_config = current_configs[0]
    final_score = evaluate_model(best_config, X_train, y_train, cv=3)

    elapsed_time = time.time() - start_time
    print(f"\n✓ Completed in {elapsed_time:.2f} seconds")
    print(f"Best validation accuracy: {final_score:.4f}")
    print(f"Best parameters: {best_config}")
    return best_config, final_score
halving_best, halving_score = successive_halving(n_initial=32, min_resource=0.25, max_resource=1.0)
The final result obtained may look like the following:
✓ Completed in 56.18 seconds
Best validation accuracy: 0.9645
Best parameters: {'n_estimators': 158, 'max_depth': 39, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 0.2269785516325355}
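As an aside, scikit-learn ships its own (still experimental) successive halving search, HalvingRandomSearchCV, if you prefer a built-in API over the manual loop above. A rough sketch, not a drop-in replacement for the code in this article:
# Optional alternative: scikit-learn's experimental successive-halving search
from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (enables the class below)
from sklearn.model_selection import HalvingRandomSearchCV
from scipy.stats import randint, uniform

halving_cv = HalvingRandomSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_distributions={
        'n_estimators': randint(10, 200),
        'max_depth': randint(5, 50),
        'min_samples_split': randint(2, 20),
        'min_samples_leaf': randint(1, 10),
        'max_features': uniform(0.1, 0.9),  # uniform over [0.1, 1.0]
    },
    factor=2,  # keep roughly the top half each round, as in the manual version
    cv=3,
    random_state=42,
)
halving_cv.fit(X_train, y_train)
print(f"Best CV accuracy: {halving_cv.best_score_:.4f}")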
# Comparing Final Results
In summary, all three methods found strong configurations, with validation accuracies between 96% and 97%, and Bayesian optimization achieving the best result by a small margin. The differences are more pronounced in terms of efficiency, with successive halving finishing fastest in just 56 seconds, compared to the 62-64 seconds taken by the other two techniques.
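If the result variables from the earlier snippets are still in scope, a small convenience snippet prints the best cross-validation accuracies side by side:
# Side-by-side summary of the best CV accuracies found by each technique
print(f"Random search:         {random_best['score']:.4f}")
print(f"Bayesian optimization: {bayesian_best_score:.4f}")
print(f"Successive halving:    {halving_score:.4f}")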
Iván Palomares Carrascosa is a leader, author, speaker, and consultant in AI, machine learning, deep learning, and LLMs. He trains and guides others in using AI in the real world.