Feature engineering overview
Feature engineering is the process of selecting, transforming, and creating features from raw data to improve machine learning model performance. It transforms data into formats that help models learn patterns more effectively.
Table of contents
- Best practices
- Feature engineering workflow
- Transformation selection guide
- Core feature engineering techniques
1. Best practices
1.1. General guidelines
- Understand Your Data
  - Visualize distributions before transformations
  - Check for missing values and outliers
  - Understand feature types and relationships
- Handle Missing Values First
  - Impute before scaling or transforming
  - Consider creating indicator variables (see the sketch after this list)
- Document Everything
  - Track all transformations for reproducibility
  - Maintain data dictionaries
  - Use version control
- Test Multiple Approaches
  - Try different transformations
  - Compare model performance
  - Use cross-validation
- Use Pipelines
  - Sklearn pipelines ensure consistency
  - Prevent data leakage
  - Simplify deployment
- Check Feature Importance
  - Remove low-importance features
  - Reduce dimensionality
  - Improve model interpretability
- Validate Assumptions
  - Verify transformations achieve desired effect
  - Check for normality when needed
  - Ensure proper scaling
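For instance, the "Handle Missing Values First" guideline can be followed with scikit-learn's SimpleImputer, which can also emit the indicator column mentioned above. A minimal sketch, assuming a hypothetical income column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical feature with missing values
df = pd.DataFrame({"income": [42_000, np.nan, 58_000, np.nan, 75_000]})

# Impute with the median and append a binary "was missing" indicator column
imputer = SimpleImputer(strategy="median", add_indicator=True)
imputed = imputer.fit_transform(df)

df_imputed = pd.DataFrame(
    imputed,
    columns=imputer.get_feature_names_out(),  # e.g. income, missingindicator_income
    index=df.index,
)
```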
1.2. Common pitfalls
| Pitfall | Problem | Solution |
|---|---|---|
| Log of zero/negative | Mathematical error | Use np.log1p() or Yeo-Johnson |
| High-cardinality one-hot | Too many features | Use feature hashing or target encoding |
| Ordinal for nominal | Implies false ordering | Use one-hot encoding instead |
| Outliers before scaling | Compresses distribution | Handle outliers first or use robust scalers |
| Keeping all dummy columns | Multicollinearity | Use drop_first=True |
| Not documenting steps | Irreproducible | Document all transformations |
| Feature leakage | Overfitting | Use only past/available information |
| Forgetting to transform test data | Inconsistent preprocessing | Fit on training data, then apply the same fitted transformer to the test set (see the sketch after this table) |
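The last three pitfalls (log of zero, feature leakage, and test-set preprocessing) can be illustrated together. A minimal sketch on synthetic data, assuming a single numeric feature:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic right-skewed feature that may contain values at or near zero
rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(200, 1))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# np.log1p handles zeros safely; np.log(0) would be -inf
X_train_log = np.log1p(X_train)
X_test_log = np.log1p(X_test)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_log)
X_test_scaled = scaler.transform(X_test_log)  # transform only: no leakage
```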
2. Feature engineering workflow
2.1. Standard pipeline
1. Understand the Data
↓
2. Create New Features (mathematical combinations)
↓
3. Transform Features (log, sqrt, power, quantile)
↓
4. Scale Features (min-max or standard scaling)
↓
5. Encode Categories (ordinal, one-hot, or hashing)
↓
6. Group & Aggregate (create group-based features)
↓
7. Validate & Test (check feature importance, correlations)
↓
8. Ready for Model Training!
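Steps 3–5 of this workflow are typically chained so the whole preprocessing can be fit once on training data and reused downstream. A minimal sklearn sketch, assuming hypothetical numeric columns (price, area) and a categorical column (city):

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PowerTransformer, StandardScaler

# Hypothetical column groups
numeric_cols = ["price", "area"]
categorical_cols = ["city"]

numeric_pipeline = Pipeline([
    ("transform", PowerTransformer(standardize=False)),  # step 3: transform
    ("scale", StandardScaler()),                          # step 4: scale
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # step 5: encode
])

# Fit on the training frame only, then reuse on validation/test data:
# X_train_prepared = preprocessor.fit_transform(df_train)
# X_test_prepared = preprocessor.transform(df_test)
```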
2.2. Key principles
- Visualize First: Understand distributions before transforming
- Domain Knowledge: Let subject matter expertise guide feature creation
- Iterative Process: Test transformations and evaluate impact
- Document Everything: Track all transformations for reproducibility
- Avoid Leakage: Don’t use future information in feature creation
- Test Impact: Validate that engineered features improve model performance
3. Transformation selection guide
3.1. By data characteristics
| Data Characteristic | Recommended Transformation | Reason |
|---|---|---|
| Right-skewed | Log or Power Transformation | Compresses large values, spreads small values |
| Moderate skew | Square root | Gentler compression than log |
| Count data | Square root or log(x+1) | Stabilizes variance |
| Unknown distribution | Power Transformation (Yeo-Johnson) | Automatically finds optimal transformation |
| Negative values | Power Transformation (Yeo-Johnson) | Handles negative values without constant |
| Heavy outliers | Quantile Transformation | Most robust to extreme values |
| Need uniform distribution | Quantile Transformation (uniform) | Maps to [0,1] with uniform spread |
| Normally distributed | No transformation needed | Already suitable for most algorithms |
| Bounded range needed | Min-Max scaling | Maps to [0,1] or custom range |
| Unbounded, mixed scale | Standard scaling | Centers and standardizes |
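A quick skewness check is one way to decide which row of the table above applies. A minimal sketch on synthetic right-skewed data:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed feature
s = pd.Series(np.random.default_rng(0).lognormal(mean=0.0, sigma=1.0, size=1_000))

print(s.skew())            # well above 1: strongly right-skewed -> log or power
print(np.log1p(s).skew())  # much closer to 0 after a log transform
```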
3.2. By transformation type
| Transformation | When to Use | Advantages | Limitations |
|---|---|---|---|
| Log | Right-skewed, positive values | Simple, interpretable | Cannot handle zero/negative |
| Square Root | Moderate skew, count data | Gentler than log | Only non-negative values |
| Power | Unknown optimal transformation | Automated, handles negatives (Yeo-Johnson) | Standardizes output |
| Quantile | Heavy outliers, unknown distribution | Most robust, no assumptions | Distorts linear relationships |
| Min-Max | Neural networks, bounded data | Preserves shape, bounded output | Sensitive to outliers |
| Standard | Normal data, linear models | Less sensitive to outliers | Unbounded output |
4. Core feature engineering techniques
4.1. Feature transformations
Modify scale, distribution, or nature of features to improve model suitability.
Sklearn transformers API
All sklearn preprocessing transformers follow a consistent pattern:
- fit(X): Learn parameters from data
- transform(X): Apply learned transformation
- fit_transform(X): Combine both steps
```python
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Apply a power transform to a DataFrame (returns a NumPy array)
transformer = PowerTransformer()
X_transformed = transformer.fit_transform(df)

# Convert back to a DataFrame with column names
df_transformed = pd.DataFrame(
    X_transformed,
    columns=transformer.get_feature_names_out(),
    index=df.index,
)
```
Transformation methods
| Method | Implementation | Use Case | Key Features |
|---|---|---|---|
| Log | np.log() or np.log1p() | Right-skewed data | Simple, interpretable; use log1p() for zeros |
| Square Root | np.sqrt() | Moderately skewed, count data | Gentler than log |
| Power | PowerTransformer() | Unknown distribution, negative values | Box-Cox (positive) or Yeo-Johnson (any values) |
| Quantile | QuantileTransformer() | Heavy outliers, non-parametric | Maps to uniform or normal distribution |
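The four methods can be compared side by side on the same synthetic right-skewed feature. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(500, 1))  # synthetic right-skewed data

X_log = np.log1p(X)                                                  # Log
X_sqrt = np.sqrt(X)                                                  # Square Root
X_power = PowerTransformer(method="yeo-johnson").fit_transform(X)    # Power
X_quantile = QuantileTransformer(output_distribution="normal",
                                 n_quantiles=500).fit_transform(X)   # Quantile
```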
4.2. Feature scaling
Standardize feature ranges to ensure equal contribution to model training.
| Method | Formula | Output Range | Outlier Sensitivity | Distribution | Best For | Implementation |
|---|---|---|---|---|---|---|
| Min-Max | (X - X_min) / (X_max - X_min) | [0, 1] or custom | High | Preserves shape | Neural networks, bounded data, known bounds, no extreme outliers | MinMaxScaler() |
| Standard | (X - μ) / σ | Unbounded (typically -3 to +3) | Moderate | Centers around 0 | Normal data, linear models, PCA, unknown bounds, data with outliers | StandardScaler() |
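A minimal sketch contrasting the two scalers on a small array with one extreme outlier:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # one extreme outlier

X_minmax = MinMaxScaler().fit_transform(X)      # [0, 1]; outlier squeezes the rest near 0
X_standard = StandardScaler().fit_transform(X)  # zero mean, unit variance; unbounded
```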
4.3. Feature encoding
Convert categorical variables into numeric format for machine learning algorithms.
| Method | Output | Dimensionality | Ordinal Assumption | Memory | Reversible | Best For | Implementation |
|---|---|---|---|---|---|---|---|
| Ordinal | Single integer column | No increase | Yes | Efficient | Yes | Ordered categories, tree models, binary categories | OrdinalEncoder() |
| One-Hot | Multiple binary columns | +n_categories | No | Can be large | Yes | Nominal data, linear models, few categories | pd.get_dummies() |
| Feature Hashing | Fixed-size sparse matrix | Fixed size | No | Very efficient | No | High-cardinality features, many categories | FeatureHasher() |
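A minimal sketch of the three encoders on a hypothetical DataFrame (the column names and category order are assumptions):

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],  # ordered -> ordinal
    "city": ["Paris", "Lyon", "Paris", "Nice"],     # nominal -> one-hot
    "user_id": ["u1", "u2", "u3", "u4"],            # high cardinality -> hashing
})

# Ordinal encoding with an explicit category order
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_encoded"] = ordinal.fit_transform(df[["size"]])

# One-hot encoding; drop the first level to avoid multicollinearity
dummies = pd.get_dummies(df["city"], prefix="city", drop_first=True)

# Feature hashing: fixed output width regardless of how many categories appear
hasher = FeatureHasher(n_features=8, input_type="string")
hashed = hasher.transform([[uid] for uid in df["user_id"]])  # sparse matrix
```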
4.4. Grouping operations
Create aggregate features by splitting data into groups, applying functions, and combining results.
Common Usage: df.groupby()
```python
# Basic aggregation
df.groupby('city')['price'].mean()

# Multiple functions
df.groupby('category').agg(['mean', 'sum', 'count'])

# Convert the result back to a DataFrame
result = df.groupby('city')['price'].mean().reset_index()
```
Common Functions: sum(), mean(), max(), min(), count(), std(), median()
Applications:
- Aggregate features (customer total purchases)
- Group statistics (average price by city)
- Ratio features (value / group_mean)
- Deviation features (value - group_median)
- Target encoding (mean target by category)
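The ratio, deviation, and target-encoding applications above rely on groupby().transform(), which broadcasts group statistics back onto the original rows. A minimal sketch with hypothetical city, price, and sold columns (note that naive target encoding like this leaks the target unless computed on training folds only):

```python
import pandas as pd

# Hypothetical transactions table
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B"],
    "price": [100, 150, 80, 120, 400],
    "sold": [1, 0, 1, 1, 0],
})

# transform() broadcasts the group statistic back onto every row
city_mean = df.groupby("city")["price"].transform("mean")
city_median = df.groupby("city")["price"].transform("median")

df["price_ratio_to_city"] = df["price"] / city_mean    # ratio feature
df["price_dev_from_city"] = df["price"] - city_median  # deviation feature

# Naive target encoding; in practice compute this on training folds only
df["city_target_mean"] = df.groupby("city")["sold"].transform("mean")
```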
Additional resources
Python libraries
- pandas: Data manipulation, get_dummies(), groupby()
- numpy: Mathematical transformations (log, sqrt)
- scikit-learn: Preprocessing, scaling, encoding, feature extraction
- scipy: Advanced statistical transformations
Key sklearn modules
- sklearn.preprocessing: Transformers, scalers, encoders
- sklearn.feature_extraction: Feature hashing, text features
- sklearn.pipeline: Chaining transformations
Recommended reading
- Scikit-learn Preprocessing Guide: Comprehensive guide to data preprocessing
- “Feature Engineering for Machine Learning” by Alice Zheng & Amanda Casari
- Kaggle Feature Engineering Course: Hands-on tutorials and examples