5 Useful Python Scripts for Effective Feature Engineering


# Introduction

As a machine learning practitioner, you know that feature engineering is laborious, manual work. You need to create interaction terms between features, properly encode categorical variables, extract temporal patterns from dates, generate aggregations, and transform distributions. For each potential feature, you test whether it improves model performance, iterate on variations, and track what you’ve tried.

This becomes more challenging as your dataset grows. With dozens of features, you will need a systematic approach to generating candidate features, evaluating their usefulness, and selecting the best features. Without automation, you’ll likely miss valuable feature additions that could significantly increase your model’s performance.

This article covers five Python scripts specifically designed to automate the most impactful feature engineering tasks. These scripts help you systematically generate high-quality features, evaluate them objectively, and create optimized feature sets that maximize model performance.

You can find the code on GitHub.

# 1. Encoding Categorical Features

// pain point

Categorical variables are everywhere in real-world data. You need to encode them, and choosing the right encoding method matters:

  • One-hot encoding works for low-cardinality features but creates dimensionality problems with high-cardinality categories.
  • Label encoding is memory-efficient but implies an ordinal relationship that may not exist.
  • Target encoding is powerful but carries a risk of data leakage.

Implementing these encodings correctly, handling unseen categories in test data, and maintaining consistency across train, validation, and test splits requires careful code that is easy to get wrong.

// what does the script do

The script automatically selects and applies appropriate encoding strategies based on feature characteristics: cardinality, target correlation, and data type.

It handles one-hot encoding for low-cardinality features, target encoding for target-related features, frequency encoding for high-cardinality features, and label encoding for ordinal variables. It automatically groups rare categories, handles unseen categories in test data gracefully, and maintains encoding consistency across all data partitions.

// how it works

The script analyzes each categorical feature to determine its cardinality and relationship with the target variable.

  • For features with fewer than 10 unique values, it applies one-hot encoding.
  • For high-cardinality features with more than 50 unique values, it uses frequency encoding to avoid dimensionality explosion.
  • For features that show correlation with the target, it applies target encoding with smoothing to prevent overfitting.
  • Rare categories that appear in fewer than 1% of rows are grouped into an “Other” category.

All encoding mappings are stored and can be applied consistently to new data, with unseen categories handled by falling back to the rare-category encoding or the global mean.
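The rules above can be sketched in a few lines. This is a minimal illustration, assuming pandas; the function names (`choose_encoding`, `frequency_encode`, `target_encode`) and thresholds follow the description above and are not the actual script’s API.

```python
import pandas as pd

def choose_encoding(series: pd.Series, correlated_with_target: bool = False) -> str:
    """Pick a strategy from cardinality, mirroring the thresholds above."""
    n_unique = series.nunique()
    if n_unique < 10:
        return "one-hot"
    if n_unique > 50:
        return "frequency"
    return "target" if correlated_with_target else "frequency"

def frequency_encode(train: pd.Series, test: pd.Series):
    """Map each category to its training-set frequency; unseen categories get 0."""
    freq = train.value_counts(normalize=True)
    return train.map(freq), test.map(freq).fillna(0.0)

def target_encode(train: pd.Series, y: pd.Series, smoothing: float = 10.0) -> pd.Series:
    """Blend per-category target means with the global mean to limit overfitting."""
    global_mean = y.mean()
    grp = y.groupby(train).agg(["mean", "count"])
    smoothed = (grp["count"] * grp["mean"] + smoothing * global_mean) / (grp["count"] + smoothing)
    return train.map(smoothed)
```

The smoothing term pulls small categories toward the global mean, so a category seen twice barely moves the encoding while a category seen thousands of times dominates it.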

Get the Categorical Feature Encoder Script

# 2. Transforming Numerical Features

// pain point

Raw numerical features often require transformation before modeling. Skewed distributions must be normalized, outliers must be handled, features on different scales require standardization, and nonlinear relationships may call for polynomial or logarithmic transformations. Manually testing different transformation strategies is tedious: the process must be repeated for every numerical column and validated to ensure it actually improves model performance.

// what does the script do

The script automatically tests several transformation strategies for numerical features: log transformation, Box-Cox transformation, square root, cube root, standardization, normalization, robust scaling, and power transformations.

It evaluates the impact of each transformation on distribution normality and model performance, selects the best transformation for each feature, and applies it consistently to train and test data. It also handles zero and negative values appropriately, avoiding transformation errors.

// how it works

For each numerical feature, the script tests several transformations and evaluates them using normality tests (e.g. Shapiro-Wilk and Anderson-Darling) and distribution metrics such as skewness and kurtosis. For features with skewness greater than 1, it prefers log and Box-Cox transformations.

For features with outliers, it applies robust scaling. The script stores the transformation parameters fitted on the training data and applies them consistently to the validation and test sets. Features with zero or negative values are handled with shifted transformations or the Yeo-Johnson transformation, which works with any real values.
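As an illustration of that selection loop, here is a minimal sketch using NumPy and SciPy. The candidate set and the skewness criterion follow the description above; `best_transform` is an illustrative name, not the script’s actual interface.

```python
import numpy as np
from scipy import stats

def best_transform(x: np.ndarray):
    """Try a few candidate transforms and keep the one with the lowest |skewness|.
    Non-positive data is shifted before log/sqrt, as described above."""
    shift = max(0.0, -x.min())  # make values non-negative for log1p/sqrt
    candidates = {
        "identity": x,
        "log1p": np.log1p(x + shift),
        "sqrt": np.sqrt(x + shift),
        "yeo-johnson": stats.yeojohnson(x)[0],  # works with any real values
    }
    name, transformed = min(candidates.items(),
                            key=lambda kv: abs(stats.skew(kv[1])))
    return name, transformed
```

In the real script the fitted parameters (the shift, the Yeo-Johnson lambda, and so on) would be stored and reused on the validation and test sets rather than refitted.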

Get Numeric Feature Transformer Script

# 3. Generating Feature Interactions

// pain point

Interactions between features often contain valuable signals that are missed by individual features. Revenue may matter differently across customer segments, advertising spending may have different impacts by season, or the combination of product price and category may be more predictive than either alone. But with dozens of features, testing all possible pairwise interactions means evaluating thousands of candidates.

// what does the script do

This script generates feature interactions using mathematical operations, polynomial features, ratio features, and hierarchical combinations. It evaluates the predictive strength of each candidate interaction using mutual information or model-based importance scores. It returns only the top N most valuable interactions, avoiding feature explosion while capturing the most influential combinations. It also supports custom interaction functions for domain-specific feature engineering.

// how it works

The script generates candidate interactions between all feature pairs:

  • For numerical features, it creates products, ratios, sums, and differences.
  • For categorical features, it creates combined category encodings.

Each candidate is scored using mutual information with the target or feature importance from a random forest. Only interactions exceeding a significance threshold or ranking in the top N are retained. The script handles edge cases like division by zero, infinite values, and high correlation between generated and original features. The results include clear feature names that show which original features were combined and how.
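A stripped-down version of that search, assuming pandas and scikit-learn, might look like the following; `top_interactions` and its signature are illustrative.

```python
import numpy as np
import pandas as pd
from itertools import combinations
from sklearn.feature_selection import mutual_info_regression

def top_interactions(X: pd.DataFrame, y: pd.Series, top_n: int = 5) -> pd.DataFrame:
    """Generate product/ratio/sum/difference candidates for every numeric pair
    and keep the top_n by mutual information with the target."""
    candidates = {}
    for a, b in combinations(X.select_dtypes("number").columns, 2):
        candidates[f"{a}_x_{b}"] = X[a] * X[b]
        # guard against division by zero and infinities
        ratio = (X[a] / X[b].replace(0, np.nan)).replace([np.inf, -np.inf], np.nan)
        candidates[f"{a}_div_{b}"] = ratio.fillna(0.0)
        candidates[f"{a}_plus_{b}"] = X[a] + X[b]
        candidates[f"{a}_minus_{b}"] = X[a] - X[b]
    cand = pd.DataFrame(candidates)
    scores = pd.Series(mutual_info_regression(cand, y, random_state=0),
                       index=cand.columns)
    return cand[scores.nlargest(top_n).index]
```

Note the naming convention (`a_x_b`, `a_div_b`): each generated column records which original features were combined and with which operation.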

Get the Feature Interaction Generator Script

# 4. Extracting Datetime Features

// pain point

Datetime columns contain useful temporal information, but using them effectively requires extensive manual feature engineering. You need to do the following:

  • Extract components like year, month, day, and hour
  • Create derived features like weekday, quarter, and weekend flags
  • Calculate time differences between events, such as days since a reference date
  • Handle cyclical patterns

Writing this extraction code for each datetime column is repetitive and time consuming, and practitioners often forget valuable temporal features that can improve their models.

// what does the script do

The script automatically extracts comprehensive datetime features from timestamp columns, including basic components, calendar features, Boolean indicators, cyclic encoding using sine and cosine transformations, season indicators, and time differences from reference dates. It also detects and marks holidays, handles multiple datetime columns, and calculates time differences between datetime pairs.

// how it works

The script takes a datetime column and systematically extracts all relevant temporal patterns.

For cyclic attributes like month or hour, the script creates sine and cosine transformations:

    month_sin = sin(2π × month / 12)
    month_cos = cos(2π × month / 12)

This ensures that December and January are close in the feature space. The script also calculates time deltas from a reference point (days since the epoch, or days since a specific date) to capture trends.

For datasets with multiple datetime columns (e.g. order_date and ship_date), it calculates the difference between them to derive features like processing_time. Boolean flags are created for holidays, weekends, and specific time ranges. All features use clear naming conventions reflecting their source and meaning.
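Putting those pieces together, a minimal sketch with pandas might look like this; the column names and the 2020-01-01 reference date are illustrative choices, not the script’s defaults.

```python
import numpy as np
import pandas as pd

def extract_datetime_features(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Expand one datetime column into components, cyclic encodings, and flags."""
    dt = pd.to_datetime(df[col])
    out = pd.DataFrame(index=df.index)
    out[f"{col}_year"] = dt.dt.year
    out[f"{col}_month"] = dt.dt.month
    out[f"{col}_dayofweek"] = dt.dt.dayofweek
    out[f"{col}_is_weekend"] = (dt.dt.dayofweek >= 5).astype(int)
    # cyclic encoding: month 12 and month 1 land close together
    out[f"{col}_month_sin"] = np.sin(2 * np.pi * dt.dt.month / 12)
    out[f"{col}_month_cos"] = np.cos(2 * np.pi * dt.dt.month / 12)
    # trend feature: days elapsed since a fixed reference date
    out[f"{col}_days_since_ref"] = (dt - pd.Timestamp("2020-01-01")).dt.days
    return out
```

With two such columns, a duration feature is just their difference, e.g. `(pd.to_datetime(df["ship_date"]) - pd.to_datetime(df["order_date"])).dt.days` (column names assumed for illustration).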

Get Datetime Feature Extractor Script

# 5. Selecting features automatically

// pain point

After feature engineering, you usually have many features, many of which are redundant, irrelevant, or lead to overfitting. You need to identify which features really help your model and which should be removed. Manual feature selection means repeatedly training the model with different feature subsets, tracking the results in a spreadsheet, and trying to understand complex feature importance scores. The process is slow and subjective, and you never know if you’ve found the optimal feature set or just got lucky in your tests.

// what does the script do

The script automatically selects the most valuable features using several selection methods:

  • Variance-based filtering removes constant or near-constant features
  • Correlation-based filtering removes redundant features
  • Statistical tests such as analysis of variance (ANOVA), chi-square, and mutual information
  • Tree-based feature importance
  • L1 regularization
  • Recursive feature elimination

It then combines the results of multiple methods into a composite score, ranks all features by importance, and identifies the optimal feature subset that maximizes model performance while minimizing dimensionality.

// how it works

The script implements a multi-stage selection pipeline. Here’s what each step does:

  1. Remove features with zero or near-zero variance because they provide no information
  2. Remove one feature of each highly correlated pair, keeping the one more correlated with the target
  3. Calculate feature importance using multiple methods, such as random forest importance, mutual information scores, statistical tests, and L1 regularization coefficients
  4. Create an overall ranking by normalizing and combining the scores from the different methods
  5. Use recursive feature elimination with cross-validation to determine the optimal number of features

The result is a ranked list of features and a recommended subset for model training along with detailed importance scores from each method.
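Here is a compressed sketch of stages 1–4, assuming scikit-learn and combining just two importance methods (random-forest importance and mutual information). The function name, thresholds, and tie-breaking (dropping the later feature of a correlated pair rather than the less target-correlated one) are simplifications of the pipeline described above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif

def rank_features(X: pd.DataFrame, y) -> pd.Series:
    """Variance filter, correlation filter, then a combined normalized ranking."""
    # stage 1: drop zero / near-zero variance features
    vt = VarianceThreshold(threshold=1e-6).fit(X)
    X = X.loc[:, vt.get_support()]
    # stage 2: drop the later feature of each highly correlated pair (|r| > 0.95)
    corr = X.corr().abs()
    drop = {b for i, a in enumerate(corr.columns)
            for b in corr.columns[i + 1:] if corr.loc[a, b] > 0.95}
    X = X.drop(columns=list(drop))
    # stage 3: score the surviving features with two methods
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    imp = pd.Series(rf.feature_importances_, index=X.columns)
    mi = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
    # stage 4: normalize each score to sum to 1 and average into one ranking
    combined = (imp / imp.sum() + mi / max(mi.sum(), 1e-12)) / 2
    return combined.sort_values(ascending=False)
```

The returned Series is the ranked list: take its head as the recommended subset, or feed the survivors into recursive feature elimination (stage 5) to pick the final count.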

Get Automated Feature Selector Script

# Conclusion

These five scripts address the main feature engineering challenges that take up most of the time in machine learning projects. Here’s a quick recap:

  • Categorical encoder intelligently chooses an encoding based on cardinality and target correlation
  • Numerical transformer automatically finds the optimal transformation for each numerical feature
  • Interaction generator systematically searches for valuable feature combinations
  • Datetime extractor captures broad temporal patterns and cyclical features
  • Feature selector identifies the most predictive features by combining multiple methods

Each script can be used independently for specific feature engineering tasks or combined into a complete pipeline. Start with the encoder and transformer to prepare your base features, use the interaction generator to discover complex patterns, extract temporal features from datetime columns, and finish with feature selection to optimize your feature set.

Happy feature engineering!

Bala Priya C is a developer and technical writer from India. She likes to work at the intersection of mathematics, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She loves reading, writing, coding, and coffee! Currently, she is working on learning and sharing her knowledge with the developer community by writing tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.
