# Introduction
XGBoost (Extreme Gradient Boosting) is a powerful implementation of gradient-boosted decision trees that aggregates multiple weak estimators into a single strong prediction model. These ensembles are highly popular due to their accuracy, efficiency, and strong performance on structured (tabular) data. While scikit-learn is the most widely used machine learning library, it does not provide a native XGBoost implementation; instead, a separate library, appropriately called xgboost, provides an API compatible with scikit-learn.
All you need to do is import it like this:
```python
from xgboost import XGBClassifier
```
Below, we outline 7 Python tricks that can help you get the most out of this standalone implementation of XGBoost, especially when aiming to build more accurate predictive models.
To illustrate these tricks, we will use the freely available breast cancer dataset bundled with scikit-learn and define a baseline model with largely default settings. Make sure to run this code before trying the following seven tricks:
```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline model
model = XGBClassifier(eval_metric="logloss", random_state=42)
model.fit(X_train, y_train)
print("Baseline accuracy:", accuracy_score(y_test, model.predict(X_test)))
```
# 1. Tuning the Learning Rate and Number of Estimators
Although this is not a universal rule, explicitly reducing the learning rate while increasing the number of estimators (trees) in an XGBoost ensemble often improves accuracy. The smaller learning rate allows the model to learn slowly, while the additional trees compensate for the reduced step size.
Here is an example. Try it yourself and compare the resulting accuracy to the initial baseline:
```python
model = XGBClassifier(
    learning_rate=0.01,
    n_estimators=5000,
    eval_metric="logloss",
    random_state=42
)
model.fit(X_train, y_train)
print("Model accuracy:", accuracy_score(y_test, model.predict(X_test)))
```
For brevity, the final `print()` statement will be omitted in the remaining examples. Combine it with any of the snippets below when testing them yourself.
# 2. Adjusting the Maximum Depth of Trees
The `max_depth` argument is an important hyperparameter inherited from classic decision trees. It limits how deep each tree in the ensemble can grow. Limiting tree depth may seem simplistic, but surprisingly, shallow trees often generalize better than deep ones.
This example limits trees to a maximum depth of 2:
```python
model = XGBClassifier(
    max_depth=2,
    eval_metric="logloss",
    random_state=42
)
model.fit(X_train, y_train)
```
# 3. Reducing Overfitting by Subsampling
The `subsample` hyperparameter randomly samples a proportion of the training data (for example, 80%) before growing each tree in the ensemble. This simple technique serves as an effective regularization strategy and helps prevent overfitting.
If not specified, this hyperparameter defaults to 1.0, meaning that 100% of the training examples are used. The related `colsample_bytree` parameter applies the same idea to features, sampling a fraction of the columns for each tree:
```python
model = XGBClassifier(
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric="logloss",
    random_state=42
)
model.fit(X_train, y_train)
```
Keep in mind that this approach is most effective for datasets of reasonable size. If the dataset is already small, aggressive subsampling may lead to underfitting.
# 4. Adding Regularization Terms
To further control overfitting, complex trees can be penalized using traditional regularization strategies such as L1 (Lasso) and L2 (Ridge). In XGBoost, these are controlled by the `reg_alpha` and `reg_lambda` parameters, respectively.
```python
model = XGBClassifier(
    reg_alpha=0.2,   # L1
    reg_lambda=0.5,  # L2
    eval_metric="logloss",
    random_state=42
)
model.fit(X_train, y_train)
```
# 5. Using Early Stopping
Early stopping is an efficiency-oriented mechanism that stops training when performance on a validation set stops improving over a specified number of rounds.
Depending on your coding environment and the version of the XGBoost library you are using, you may need to upgrade to the latest version to run the implementation shown below. Also note that `early_stopping_rounds` must be specified during model initialization rather than passed to the `fit()` method.
```python
model = XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    eval_metric="logloss",
    early_stopping_rounds=20,
    random_state=42
)
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=False
)
```
To upgrade the library, run:
```
!pip uninstall -y xgboost
!pip install xgboost --upgrade
```
# 6. Searching Hyperparameters
For a more systematic approach, hyperparameter search can help identify combinations of settings that maximize model performance. Below is an example of using grid search to find a combination of the three key hyperparameters introduced earlier:
```python
param_grid = {
    "max_depth": (3, 4, 5),
    "learning_rate": (0.01, 0.05, 0.1),
    "n_estimators": (200, 500)
}

grid = GridSearchCV(
    XGBClassifier(eval_metric="logloss", random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy"
)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)

best_model = XGBClassifier(
    **grid.best_params_,
    eval_metric="logloss",
    random_state=42
)
best_model.fit(X_train, y_train)
print("Tuned accuracy:", accuracy_score(y_test, best_model.predict(X_test)))
```
# 7. Adjusting for Class Imbalance
This last tip is especially useful when working with highly class-imbalanced datasets (the breast cancer dataset is relatively balanced, so don't be concerned if you see minimal changes). The `scale_pos_weight` parameter is especially helpful when the class ratio is highly skewed, such as 90/10, 95/5, or 99/1.
Here’s how to calculate and implement it based on training data:
```python
ratio = np.sum(y_train == 0) / np.sum(y_train == 1)

model = XGBClassifier(
    scale_pos_weight=ratio,
    eval_metric="logloss",
    random_state=42
)
model.fit(X_train, y_train)
```
# Wrapping Up
In this article, we explored seven practical tricks for enhancing XGBoost ensemble models using the dedicated Python library. Thoughtful tuning of the learning rate, tree depth, sampling strategies, regularization, and class weights, combined with systematic hyperparameter search, often makes the difference between a decent model and a highly accurate one.
Iván Palomares Carrascosa is a leader, author, speaker, and adviser in AI, machine learning, deep learning, and LLMs. He trains and guides others in applying AI in the real world.
