Feature engineering is the process of selecting, transforming, and creating features from raw data to improve machine learning model performance. It transforms data into formats that help models learn patterns more effectively.

Table of contents

  1. Best practices
  2. Feature engineering workflow
  3. Transformation selection guide
  4. Core feature engineering techniques

1. Best practices

1.1. General guidelines

  1. Understand Your Data
    • Visualize distributions before transformations
    • Check for missing values and outliers
    • Understand feature types and relationships
  2. Handle Missing Values First
    • Impute before scaling or transforming
    • Consider creating indicator variables
  3. Document Everything
    • Track all transformations for reproducibility
    • Maintain data dictionaries
    • Use version control
  4. Test Multiple Approaches
    • Try different transformations
    • Compare model performance
    • Use cross-validation
  5. Use Pipelines
    • Chain preprocessing steps so transformations stay consistent
    • Apply identical preprocessing to training and test data
    • Prevent data leakage during cross-validation
  6. Check Feature Importance
    • Remove low-importance features
    • Reduce dimensionality
    • Improve model interpretability
  7. Validate Assumptions
    • Verify transformations achieve desired effect
    • Check for normality when needed
    • Ensure proper scaling

1.2. Common pitfalls

| Pitfall | Problem | Solution |
| --- | --- | --- |
| Log of zero/negative | Mathematical error | Use np.log1p() or Yeo-Johnson |
| High-cardinality one-hot | Too many features | Use feature hashing or target encoding |
| Ordinal encoding for nominal data | Implies false ordering | Use one-hot encoding instead |
| Outliers before scaling | Compresses distribution | Handle outliers first or use robust scalers |
| Keeping all dummy columns | Multicollinearity | Use drop_first=True |
| Not documenting steps | Irreproducible | Document all transformations |
| Feature leakage | Overfitting | Use only past/available information |
| Forgetting to transform test data | Inconsistent preprocessing | Apply same transformations to test set |
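
Two of these pitfalls can be illustrated in a few lines. The sketch below assumes a small hypothetical DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "visits": [0, 1, 10, 250],                    # counts that include zero
    "city": ["Paris", "Lyon", "Paris", "Nice"],   # nominal category
})

# np.log(0) is -inf; log1p(x) = log(1 + x) handles zeros safely
df["visits_log"] = np.log1p(df["visits"])

# drop_first=True removes one redundant dummy column and avoids multicollinearity
df = pd.concat([df, pd.get_dummies(df["city"], prefix="city", drop_first=True)], axis=1)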

2. Feature engineering workflow

2.1. Standard pipeline

1. Understand the Data
   ↓
2. Create New Features (mathematical combinations)
   ↓
3. Transform Features (log, sqrt, power, quantile)
   ↓
4. Scale Features (min-max or standard scaling)
   ↓
5. Encode Categories (ordinal, one-hot, or hashing)
   ↓
6. Group & Aggregate (create group-based features)
   ↓
7. Validate & Test (check feature importance, correlations)
   ↓
8. Ready for Model Training!
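
Steps 3-5 can be wired together with a sklearn Pipeline and ColumnTransformer so the same preprocessing is learned on the training data and reused on the test set. A minimal sketch, assuming hypothetical numeric and categorical column names:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PowerTransformer, StandardScaler

# Hypothetical column groups for illustration
numeric_cols = ["price", "area"]
categorical_cols = ["city"]

preprocess = ColumnTransformer([
    # Steps 3 + 4: transform, then scale numeric features
    ("num", Pipeline([
        ("power", PowerTransformer(standardize=False)),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    # Step 5: encode categorical features
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Fit on training data only, then reuse the fitted transformer on the test set
# X_train_prep = preprocess.fit_transform(X_train)
# X_test_prep = preprocess.transform(X_test)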

2.2. Key principles

  1. Visualize First: Understand distributions before transforming
  2. Domain Knowledge: Let subject matter expertise guide feature creation
  3. Iterative Process: Test transformations and evaluate impact
  4. Document Everything: Track all transformations for reproducibility
  5. Avoid Leakage: Don’t use future information in feature creation
  6. Test Impact: Validate that engineered features improve model performance

3. Transformation selection guide

3.1. By data characteristics

| Data Characteristic | Recommended Transformation | Reason |
| --- | --- | --- |
| Right-skewed | Log or power transformation | Compresses large values, spreads small values |
| Moderate skew | Square root | Gentler compression than log |
| Count data | Square root or log(x+1) | Stabilizes variance |
| Unknown distribution | Power transformation (Yeo-Johnson) | Automatically finds optimal transformation |
| Negative values | Power transformation (Yeo-Johnson) | Handles negative values without a constant shift |
| Heavy outliers | Quantile transformation | Most robust to extreme values |
| Need uniform distribution | Quantile transformation (uniform) | Maps to [0, 1] with uniform spread |
| Normally distributed | No transformation needed | Already suitable for most algorithms |
| Bounded range needed | Min-max scaling | Maps to [0, 1] or custom range |
| Unbounded, mixed scale | Standard scaling | Centers and standardizes |

3.2. By transformation type

| Transformation | When to Use | Advantages | Limitations |
| --- | --- | --- | --- |
| Log | Right-skewed, positive values | Simple, interpretable | Cannot handle zero/negative |
| Square root | Moderate skew, count data | Gentler than log | Only non-negative values |
| Power | Unknown optimal transformation | Automated, handles negatives (Yeo-Johnson) | Standardizes output by default |
| Quantile | Heavy outliers, unknown distribution | Most robust, no assumptions | Distorts linear relationships |
| Min-Max | Neural networks, bounded data | Preserves shape, bounded output | Sensitive to outliers |
| Standard | Normal data, linear models | Less sensitive to outliers | Unbounded output |
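
A rough sketch of how these selection rules could be applied programmatically; the skewness thresholds are illustrative assumptions, not fixed rules:

from scipy.stats import skew

def suggest_transformation(series):
    """Illustrative heuristic mapping data characteristics to a transformation."""
    values = series.dropna()
    if (values < 0).any():
        return "power transformation (Yeo-Johnson)"   # handles negative values
    s = skew(values)
    if abs(s) < 0.5:
        return "no transformation"                    # roughly symmetric already
    if s < 1.0:
        return "square root"                          # moderate right skew
    return "log or log1p"                             # strong right skew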

4. Core feature engineering techniques

4.1. Feature transformations

Modify scale, distribution, or nature of features to improve model suitability.

Sklearn transformers API

All sklearn preprocessing transformers follow a consistent pattern:

  • fit(X): Learn parameters from data
  • transform(X): Apply learned transformation
  • fit_transform(X): Combine both steps

import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Apply a power transform to a DataFrame (returns a NumPy array)
transformer = PowerTransformer()
X_transformed = transformer.fit_transform(df)

# Convert back to a DataFrame with column names
df_transformed = pd.DataFrame(
    X_transformed,
    columns=transformer.get_feature_names_out(),
    index=df.index
)

Transformation methods

| Method | Implementation | Use Case | Key Features |
| --- | --- | --- | --- |
| Log | np.log() or np.log1p() | Right-skewed data | Simple, interpretable; use log1p() for zeros |
| Square Root | np.sqrt() | Moderately skewed, count data | Gentler than log |
| Power | PowerTransformer() | Unknown distribution, negative values | Box-Cox (positive) or Yeo-Johnson (any values) |
| Quantile | QuantileTransformer() | Heavy outliers, non-parametric | Maps to uniform or normal distribution |
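
A minimal sketch applying each method to a hypothetical right-skewed income column:

import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

# Hypothetical right-skewed data
rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.lognormal(mean=10, sigma=1, size=1000)})

df["income_log"] = np.log1p(df["income"])     # log, safe for zeros
df["income_sqrt"] = np.sqrt(df["income"])     # square root
df["income_power"] = PowerTransformer().fit_transform(df[["income"]]).ravel()   # Yeo-Johnson by default
df["income_quantile"] = QuantileTransformer(
    output_distribution="normal", n_quantiles=100
).fit_transform(df[["income"]]).ravel()       # rank-based mapping to a normal shape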

4.2. Feature scaling

Standardize feature ranges to ensure equal contribution to model training.

| Method | Formula | Output Range | Outlier Sensitivity | Distribution | Best For | Implementation |
| --- | --- | --- | --- | --- | --- | --- |
| Min-Max | (X - X_min) / (X_max - X_min) | [0, 1] or custom | High | Preserves shape | Neural networks, bounded data, known bounds, no extreme outliers | MinMaxScaler() |
| Standard | (X - μ) / σ | Unbounded (typically -3 to +3) | Moderate | Centers around 0 | Normal data, linear models, PCA, unknown bounds, data with outliers | StandardScaler() |
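
Both scalers follow the fit/transform pattern shown earlier; a short sketch on a hypothetical column with one large value:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"area": [30.0, 55.0, 80.0, 120.0, 400.0]})

# Min-Max: bounded to [0, 1]; the large value (400) compresses the rest of the range
df["area_minmax"] = MinMaxScaler().fit_transform(df[["area"]]).ravel()

# Standard: centered around 0 with unit variance, unbounded output
df["area_standard"] = StandardScaler().fit_transform(df[["area"]]).ravel()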

4.3. Feature encoding

Convert categorical variables into numeric format for machine learning algorithms.

| Method | Output | Dimensionality | Ordinal Assumption | Memory | Reversible | Best For | Implementation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ordinal | Single integer column | No increase | Yes | Efficient | Yes | Ordered categories, tree models, binary categories | OrdinalEncoder() |
| One-Hot | Multiple binary columns | +n_categories | No | Can be large | Yes | Nominal data, linear models, few categories | pd.get_dummies() |
| Feature Hashing | Fixed-size sparse matrix | Fixed size | No | Very efficient | No | High-cardinality features, many categories | FeatureHasher() |
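
A minimal sketch of the three encoders on a hypothetical DataFrame with an ordered size column and a nominal city column:

import pandas as pd
from sklearn.feature_extraction import FeatureHasher
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],   # ordered categories
    "city": ["Paris", "Lyon", "Nice", "Paris"],      # nominal categories
})

# Ordinal: single integer column with an explicit category order
df["size_ord"] = OrdinalEncoder(
    categories=[["small", "medium", "large"]]
).fit_transform(df[["size"]]).ravel()

# One-hot: one binary column per city; drop_first avoids multicollinearity
df = pd.concat([df, pd.get_dummies(df["city"], prefix="city", drop_first=True)], axis=1)

# Feature hashing: fixed number of output columns regardless of cardinality (not reversible)
hashed = FeatureHasher(n_features=8, input_type="string").transform([[c] for c in df["city"]])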

4.4. Grouping operations

Create aggregate features by splitting data into groups, applying functions, and combining results.

Common Usage: df.groupby()

# Basic aggregation
df.groupby('city')['price'].mean()

# Multiple functions
df.groupby('category').agg(['mean', 'sum', 'count'])

# Convert to DataFrame
result = df.groupby('city')['price'].mean().reset_index()

Common Functions: sum(), mean(), max(), min(), count(), std(), median()

Applications (the ratio, deviation, and target-encoding patterns are sketched after this list):

  • Aggregate features (customer total purchases)
  • Group statistics (average price by city)
  • Ratio features (value / group_mean)
  • Deviation features (value - group_median)
  • Target encoding (mean target by category)
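
A minimal sketch of those three patterns using groupby().transform(); column names are placeholders:

import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Paris", "Lyon", "Lyon"],
    "price": [300.0, 380.0, 190.0, 230.0],
    "sold": [1, 0, 1, 1],
})

group_mean = df.groupby("city")["price"].transform("mean")
group_median = df.groupby("city")["price"].transform("median")

df["price_ratio"] = df["price"] / group_mean                            # ratio feature
df["price_dev"] = df["price"] - group_median                            # deviation feature
df["city_target_enc"] = df.groupby("city")["sold"].transform("mean")   # target encoding (fit on training folds only to avoid leakage)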

Additional resources

Python libraries

  • pandas: Data manipulation, get_dummies(), groupby()
  • numpy: Mathematical transformations (log, sqrt)
  • scikit-learn: Preprocessing, scaling, encoding, feature extraction
  • scipy: Advanced statistical transformations

Key sklearn modules