Feature engineering is the process of selecting, transforming, and creating features from raw data to improve machine learning model performance. It transforms data into formats that help models learn patterns more effectively.

Table of contents

  1. Best practices
  2. Feature engineering workflow
  3. Transformation selection guide
  4. Core feature engineering techniques

1. Best practices

1.1. General guidelines

  1. Understand Your Data
    • Visualize distributions before transformations
    • Check for missing values and outliers
    • Understand feature types and relationships
  2. Handle Missing Values First
    • Impute before scaling or transforming
    • Consider creating indicator variables
  3. Document Everything
    • Track all transformations for reproducibility
    • Maintain data dictionaries
    • Use version control
  4. Test Multiple Approaches
    • Try different transformations
    • Compare model performance
    • Use cross-validation
  5. Use Pipelines
    • Chain preprocessing steps so transformations stay consistent
    • Apply identical preprocessing to training and test data
    • Prevent data leakage during cross-validation
  6. Check Feature Importance
    • Remove low-importance features
    • Reduce dimensionality
    • Improve model interpretability
  7. Validate Assumptions
    • Verify transformations achieve desired effect
    • Check for normality when needed
    • Ensure proper scaling

1.2. Common pitfalls

| Pitfall | Problem | Solution |
| --- | --- | --- |
| Log of zero/negative | Mathematical error | Use np.log1p() or Yeo-Johnson |
| High-cardinality one-hot | Too many features | Use feature hashing or target encoding |
| Ordinal encoding for nominal data | Implies false ordering | Use one-hot encoding instead |
| Outliers before scaling | Compresses distribution | Handle outliers first or use robust scalers |
| Keeping all dummy columns | Multicollinearity | Use drop_first=True |
| Not documenting steps | Irreproducible | Document all transformations |
| Feature leakage | Overfitting | Use only past/available information |
| Forgetting to transform test data | Inconsistent preprocessing | Apply same transformations to test set |
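
Two of these pitfalls can be illustrated in a few lines. The sketch below assumes a small hypothetical DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "visits": [0, 1, 10, 250],                    # counts that include zero
    "city": ["Paris", "Lyon", "Paris", "Nice"],   # nominal category
})

# np.log(0) is -inf; log1p(x) = log(1 + x) handles zeros safely
df["visits_log"] = np.log1p(df["visits"])

# drop_first=True removes one redundant dummy column and avoids multicollinearity
df = pd.concat([df, pd.get_dummies(df["city"], prefix="city", drop_first=True)], axis=1)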

2. Feature engineering workflow

2.1. Standard pipeline

1. Understand the Data
   ↓
2. Create New Features (mathematical combinations)
   ↓
3. Transform Features (log, sqrt, power, quantile)
   ↓
4. Scale Features (min-max or standard scaling)
   ↓
5. Encode Categories (ordinal, one-hot, or hashing)
   ↓
6. Group & Aggregate (create group-based features)
   ↓
7. Validate & Test (check feature importance, correlations)
   ↓
8. Ready for Model Training!
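
Steps 3-5 can be wired together with a sklearn Pipeline and ColumnTransformer so the same preprocessing is learned on the training data and reused on the test set. A minimal sketch, assuming hypothetical numeric and categorical column names:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PowerTransformer, StandardScaler

# Hypothetical column groups for illustration
numeric_cols = ["price", "area"]
categorical_cols = ["city"]

preprocess = ColumnTransformer([
    # Steps 3 + 4: transform, then scale numeric features
    ("num", Pipeline([
        ("power", PowerTransformer(standardize=False)),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    # Step 5: encode categorical features
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Fit on training data only, then reuse the fitted transformer on the test set
# X_train_prep = preprocess.fit_transform(X_train)
# X_test_prep = preprocess.transform(X_test)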

2.2. Key principles

  1. Visualize First: Understand distributions before transforming
  2. Domain Knowledge: Let subject matter expertise guide feature creation
  3. Iterative Process: Test transformations and evaluate impact
  4. Document Everything: Track all transformations for reproducibility
  5. Avoid Leakage: Don’t use future information in feature creation
  6. Test Impact: Validate that engineered features improve model performance

3. Transformation selection guide

3.1. By data characteristics

| Data Characteristic | Recommended Transformation | Reason |
| --- | --- | --- |
| Right-skewed | Log or power transformation | Compresses large values, spreads small values |
| Moderate skew | Square root | Gentler compression than log |
| Count data | Square root or log(x+1) | Stabilizes variance |
| Unknown distribution | Power transformation (Yeo-Johnson) | Automatically finds optimal transformation |
| Negative values | Power transformation (Yeo-Johnson) | Handles negative values without a constant shift |
| Heavy outliers | Quantile transformation | Most robust to extreme values |
| Need uniform distribution | Quantile transformation (uniform) | Maps to [0, 1] with uniform spread |
| Normally distributed | No transformation needed | Already suitable for most algorithms |
| Bounded range needed | Min-max scaling | Maps to [0, 1] or custom range |
| Unbounded, mixed scale | Standard scaling | Centers and standardizes |

3.2. By transformation type

| Transformation | When to Use | Advantages | Limitations |
| --- | --- | --- | --- |
| Log | Right-skewed, positive values | Simple, interpretable | Cannot handle zero/negative |
| Square root | Moderate skew, count data | Gentler than log | Only non-negative values |
| Power | Unknown optimal transformation | Automated, handles negatives (Yeo-Johnson) | Standardizes output by default |
| Quantile | Heavy outliers, unknown distribution | Most robust, no assumptions | Distorts linear relationships |
| Min-Max | Neural networks, bounded data | Preserves shape, bounded output | Sensitive to outliers |
| Standard | Normal data, linear models | Less sensitive to outliers | Unbounded output |
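
A rough sketch of how these selection rules could be applied programmatically; the skewness thresholds are illustrative assumptions, not fixed rules:

from scipy.stats import skew

def suggest_transformation(series):
    """Illustrative heuristic mapping data characteristics to a transformation."""
    values = series.dropna()
    if (values < 0).any():
        return "power transformation (Yeo-Johnson)"   # handles negative values
    s = skew(values)
    if abs(s) < 0.5:
        return "no transformation"                    # roughly symmetric already
    if s < 1.0:
        return "square root"                          # moderate right skew
    return "log or log1p"                             # strong right skew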

4. Core feature engineering techniques

4.1. Feature transformations

Modify scale, distribution, or nature of features to improve model suitability.

Sklearn transformers API

All sklearn preprocessing transformers follow a consistent pattern:

  • fit(X): Learn parameters from data
  • transform(X): Apply learned transformation
  • fit_transform(X): Combine both steps

import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Apply a power transform to a DataFrame (returns a NumPy array)
transformer = PowerTransformer()
X_transformed = transformer.fit_transform(df)

# Convert back to a DataFrame with column names
df_transformed = pd.DataFrame(
    X_transformed,
    columns=transformer.get_feature_names_out(),
    index=df.index
)

Transformation methods

| Method | Implementation | Use Case | Key Features |
| --- | --- | --- | --- |
| Log | np.log() or np.log1p() | Right-skewed data | Simple, interpretable; use log1p() for zeros |
| Square Root | np.sqrt() | Moderately skewed, count data | Gentler than log |
| Power | PowerTransformer() | Unknown distribution, negative values | Box-Cox (positive) or Yeo-Johnson (any values) |
| Quantile | QuantileTransformer() | Heavy outliers, non-parametric | Maps to uniform or normal distribution |
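
A minimal sketch applying each method to a hypothetical right-skewed income column:

import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

# Hypothetical right-skewed data
rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.lognormal(mean=10, sigma=1, size=1000)})

df["income_log"] = np.log1p(df["income"])     # log, safe for zeros
df["income_sqrt"] = np.sqrt(df["income"])     # square root
df["income_power"] = PowerTransformer().fit_transform(df[["income"]]).ravel()   # Yeo-Johnson by default
df["income_quantile"] = QuantileTransformer(
    output_distribution="normal", n_quantiles=100
).fit_transform(df[["income"]]).ravel()       # rank-based mapping to a normal shape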

4.2. Feature scaling

Standardize feature ranges to ensure equal contribution to model training.

| Method | Formula | Output Range | Outlier Sensitivity | Distribution | Best For | Implementation |
| --- | --- | --- | --- | --- | --- | --- |
| Min-Max | (X - X_min) / (X_max - X_min) | [0, 1] or custom | High | Preserves shape | Neural networks, bounded data, known bounds, no extreme outliers | MinMaxScaler() |
| Standard | (X - μ) / σ | Unbounded (typically -3 to +3) | Moderate | Centers around 0 | Normal data, linear models, PCA, unknown bounds, data with outliers | StandardScaler() |
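
Both scalers follow the fit/transform pattern shown earlier; a short sketch on a hypothetical column with one large value:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"area": [30.0, 55.0, 80.0, 120.0, 400.0]})

# Min-Max: bounded to [0, 1]; the large value (400) compresses the rest of the range
df["area_minmax"] = MinMaxScaler().fit_transform(df[["area"]]).ravel()

# Standard: centered around 0 with unit variance, unbounded output
df["area_standard"] = StandardScaler().fit_transform(df[["area"]]).ravel()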

4.3. Feature encoding

Convert categorical variables into numeric format for machine learning algorithms.

| Method | Output | Dimensionality | Ordinal Assumption | Memory | Reversible | Best For | Implementation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ordinal | Single integer column | No increase | Yes | Efficient | Yes | Ordered categories, tree models, binary categories | OrdinalEncoder() |
| One-Hot | Multiple binary columns | +n_categories | No | Can be large | Yes | Nominal data, linear models, few categories | pd.get_dummies() |
| Feature Hashing | Fixed-size sparse matrix | Fixed size | No | Very efficient | No | High-cardinality features, many categories | FeatureHasher() |
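
A minimal sketch of the three encoders on a hypothetical DataFrame with an ordered size column and a nominal city column:

import pandas as pd
from sklearn.feature_extraction import FeatureHasher
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],   # ordered categories
    "city": ["Paris", "Lyon", "Nice", "Paris"],      # nominal categories
})

# Ordinal: single integer column with an explicit category order
df["size_ord"] = OrdinalEncoder(
    categories=[["small", "medium", "large"]]
).fit_transform(df[["size"]]).ravel()

# One-hot: one binary column per city; drop_first avoids multicollinearity
df = pd.concat([df, pd.get_dummies(df["city"], prefix="city", drop_first=True)], axis=1)

# Feature hashing: fixed number of output columns regardless of cardinality (not reversible)
hashed = FeatureHasher(n_features=8, input_type="string").transform([[c] for c in df["city"]])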

4.4. Grouping operations

Create aggregate features by splitting data into groups, applying functions, and combining results.

Common Usage: df.groupby()

# Basic aggregation
df.groupby('city')['price'].mean()

# Multiple functions
df.groupby('category').agg(['mean', 'sum', 'count'])

# Convert to DataFrame
result = df.groupby('city')['price'].mean().reset_index()

Common Functions: sum(), mean(), max(), min(), count(), std(), median()

Applications (the ratio, deviation, and target-encoding patterns are sketched after this list):

  • Aggregate features (customer total purchases)
  • Group statistics (average price by city)
  • Ratio features (value / group_mean)
  • Deviation features (value - group_median)
  • Target encoding (mean target by category)
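
A minimal sketch of those three patterns using groupby().transform(); column names are placeholders:

import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Paris", "Lyon", "Lyon"],
    "price": [300.0, 380.0, 190.0, 230.0],
    "sold": [1, 0, 1, 1],
})

group_mean = df.groupby("city")["price"].transform("mean")
group_median = df.groupby("city")["price"].transform("median")

df["price_ratio"] = df["price"] / group_mean                            # ratio feature
df["price_dev"] = df["price"] - group_median                            # deviation feature
df["city_target_enc"] = df.groupby("city")["sold"].transform("mean")   # target encoding (fit on training folds only to avoid leakage)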

Additional resources

Python libraries

  • pandas: Data manipulation, get_dummies(), groupby()
  • numpy: Mathematical transformations (log, sqrt)
  • scikit-learn: Preprocessing, scaling, encoding, feature extraction
  • scipy: Advanced statistical transformations

Key sklearn modules