Regression overview
Regression models predict continuous target variables by learning relationships between input features and outputs. This guide covers essential regression techniques, evaluation metrics, and best practices for building effective predictive models.
Table of contents
- Best practices
- Regression workflow
- Model selection guide
- Performance metrics
- Core regression techniques
1. Best practices
1.1. General guidelines
- Start Simple
  - Begin with linear regression
  - Add complexity only when justified
  - Compare models systematically
- Always Split Your Data
  - Use a train-test split (70-30 or 80-20)
  - Set `random_state` for reproducibility
  - Never fit scalers/transformers on test data
- Scale Your Features
  - Essential for regularized regression
  - Use `StandardScaler` after the train-test split
  - Apply the same transformation to test data
- Use Cross-Validation
  - Provides robust performance estimates
  - K-Fold (k=5 or k=10) is standard
  - Use `cross_val_score()` for quick evaluation
- Monitor Overfitting
  - Compare training vs test performance
  - A large gap indicates overfitting
  - Use regularization if needed
- Leverage Pipelines
  - Combine preprocessing and modeling
  - Prevents data leakage
  - Simplifies deployment
  - Use `sklearn.pipeline.Pipeline` (see the sketch after this list)
- Document Everything
  - Track preprocessing steps
  - Record hyperparameters
  - Note performance metrics
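A minimal sketch of these guidelines in scikit-learn; the synthetic data from `make_regression` stands in for a real dataset:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data in place of a real dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# 80-20 split with a fixed random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler inside each training fold, preventing leakage
model = Pipeline([("scaler", StandardScaler()), ("reg", LinearRegression())])

# 5-fold cross-validation on the training set (scores are R² for regressors)
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"CV R²: {scores.mean():.3f} ± {scores.std():.3f}")

# Fit on the full training set, then compare train vs test to spot overfitting
model.fit(X_train, y_train)
print(f"Train R²: {model.score(X_train, y_train):.3f}")
print(f"Test R²:  {model.score(X_test, y_test):.3f}")
```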
1.2. Common pitfalls
| Pitfall | Problem | Solution |
|---|---|---|
| Not splitting data | Can’t evaluate generalization | Always use train-test split |
| Data leakage | Overly optimistic results | Fit transformers only on training data |
| Ignoring overfitting | Poor test performance | Monitor train vs test metrics |
| Wrong metric | Misleading conclusions | Use multiple metrics (MSE, MAE, R²) |
| Skipping cross-validation | Unreliable estimates | Use K-Fold cross-validation |
| Not scaling | Regularization ineffective | Standardize features |
| Categorical encoding errors | Model can’t learn | Use one-hot or ordinal encoding |
| Missing values | Training fails or biased | Impute before modeling |
| High multicollinearity | Unstable coefficients | Use Ridge or ElasticNet |
2. Regression workflow
```
1. Data Loading & Inspection
        ↓
2. Train-Test Split
        ↓
3. Exploratory Data Analysis
        ↓
4. Data Preprocessing
   - Handle missing values
   - Encode categorical variables
   - Scale numerical features
        ↓
5. Model Selection & Training
   - Linear regression
   - Regularized regression
        ↓
6. Cross-Validation
        ↓
7. Model Evaluation (RSS, MSE, RMSE, MAE, R²)
        ↓
8. Hyperparameter Tuning (if using regularization)
        ↓
9. Final Model Selection
        ↓
10. Predictions on Test Set
```
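A condensed, end-to-end sketch of this workflow; the DataFrame, column names, and alpha grid below are illustrative assumptions, not a prescribed setup:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 1. Hypothetical dataset with numeric and categorical features
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "sqft": rng.normal(1500, 300, n),
    "age": rng.integers(0, 50, n).astype(float),
    "neighborhood": rng.choice(["A", "B", "C"], n),
})
df["price"] = 0.3 * df["sqft"] - 2.0 * df["age"] + rng.normal(0, 30, n)

X, y = df.drop(columns="price"), df["price"]

# 2. Split before fitting anything
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 4. Preprocessing: impute, encode categoricals, scale numerics
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["sqft", "age"]),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]),
     ["neighborhood"]),
])

# 5-8. Regularized model with cross-validated hyperparameter tuning
pipe = Pipeline([("prep", preprocess), ("reg", Ridge())])
search = GridSearchCV(pipe, {"reg__alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# 9-10. Final model predicts on the held-out test set
pred = search.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))
print("MAE:", mean_absolute_error(y_test, pred))
print("R²: ", r2_score(y_test, pred))
```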
3. Model selection guide
| Algorithm | Data Considerations | Regularization | Strengths | Weaknesses |
|---|---|---|---|---|
| Linear Regression | Remove or handle outliers; check for multicollinearity (VIF); may need feature scaling for regularized versions | Lasso (L1), Ridge (L2), ElasticNet | Simple, fast, interpretable; works well with linear relationships; provides feature importance via coefficients | Assumes linearity; sensitive to outliers; poor with non-linear patterns; affected by multicollinearity |
| K-Nearest Neighbors | Feature scaling required (StandardScaler, MinMaxScaler); remove irrelevant features; handle missing values | None (non-parametric) | No training phase; simple concept; naturally handles multi-class; non-parametric (no assumptions) | Slow predictions; memory intensive; sensitive to feature scaling; struggles with high dimensions (curse of dimensionality) |
| Decision Trees | Minimal preprocessing needed; handles missing values; no scaling required; can handle mixed data types | Pruning (max_depth, min_samples_split, min_samples_leaf) | Interpretable; handles non-linear relationships; no feature scaling needed; captures interactions | Prone to overfitting; unstable (small data changes cause big tree changes); biased toward features with more levels |
| Support Vector Machines | Feature scaling critical (StandardScaler); remove outliers; ensure balanced classes for classification | C parameter (controls margin), kernel parameters | Effective in high dimensions; memory efficient; works well with clear margins; robust to outliers (with proper kernel) | Slow with large datasets; sensitive to kernel choice; requires feature scaling; difficult to interpret |
| Neural Networks | Feature scaling required; handle missing values; may need normalization; consider data augmentation for small datasets | L1, L2, Dropout, Early stopping, Batch normalization | Highly flexible; captures complex non-linear patterns; scales well with large data; automatic feature learning | Computationally expensive; requires large datasets; black box (hard to interpret); sensitive to hyperparameters; prone to overfitting |
4. Performance metrics
Performance metrics quantify the difference between model predictions and the true values of the label. When a metric is used as the training objective, it is referred to as the ‘loss’.
| Metric | Formula | Units | Outlier Sensitivity | Best For |
|---|---|---|---|---|
| RSS | $\sum(y_i - \hat{y}_i)^2$ | Squared | Highest | OLS optimization |
| MSE | $\frac{1}{n}\sum(y_i - \hat{y}_i)^2$ | Squared | Higher | Higher penalty for larger errors |
| RMSE | $\sqrt{\text{MSE}}$ | Same as y | High | Interpretable magnitude; penalizes large errors less than MSE but more than MAE |
| MAE | $\frac{1}{n}\sum|y_i - \hat{y}_i|$ | Same as y | Low | Robust to outliers |
| R² | $1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}}$ | Unitless (≤ 1) | Moderate | Variance explained |
Best practice: Report multiple metrics for comprehensive evaluation
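A quick sketch computing these metrics with `sklearn.metrics`; the toy arrays are purely illustrative:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.4, 7.0, 11.0])

rss = np.sum((y_true - y_pred) ** 2)       # residual sum of squares
mse = mean_squared_error(y_true, y_pred)   # mean squared error
rmse = np.sqrt(mse)                        # back in the units of y
mae = mean_absolute_error(y_true, y_pred)  # robust to outliers
r2 = r2_score(y_true, y_pred)              # fraction of variance explained

print(f"RSS={rss:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R²={r2:.3f}")
```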
5. Core regression techniques
These techniques are crucial for successful regression modeling with any algorithm: proper application of cross-validation, regularization, and hyperparameter tuning significantly improves model performance and generalization.
5.1. Cross-validation
Cross-validation trains and evaluates the model on multiple train-validation splits, called ‘folds’, to estimate generalization performance without overusing the test set.
Cross-validation functions:
- `cross_val_score()`: returns an array with one score per fold
- `cross_validate()`: returns a dict with test scores, fit times, and score times for each fold (both are demonstrated in the sketch after the list below)
Cross-validation fold generators:
- K-Fold: k equal folds, each used once for validation
- Leave-One-Out (LOOCV): k = n (very computationally expensive)
- Repeated K-Fold: Multiple K-Fold runs with different splits
- Time Series Split: Preserves temporal order
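A brief sketch of both helpers with an explicit K-Fold generator (synthetic data assumed):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, cross_validate

X, y = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=1)
model = LinearRegression()

# Explicit fold generator: 5 shuffled folds, reproducible via random_state
cv = KFold(n_splits=5, shuffle=True, random_state=1)

# cross_val_score: one score per fold (R² by default for regressors)
scores = cross_val_score(model, X, y, cv=cv)
print("Per-fold R²:", scores.round(3))

# cross_validate: dict with fit times, score times, and test scores
results = cross_validate(model, X, y, cv=cv, scoring="neg_mean_squared_error")
print("Mean MSE:", -results["test_score"].mean())
```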
5.2. Regularization methods
Add penalty terms to prevent overfitting and handle multicollinearity.
| Method | Penalty | Feature Selection | Best For | Implementation |
|---|---|---|---|---|
| Lasso (L1) | $\alpha \sum \lvert\beta_j\rvert$ | Yes (sets coefficients to 0) | Sparse models, irrelevant features | `Lasso(alpha=1.0)` |
| Ridge (L2) | $\alpha \sum \beta_j^2$ | No (shrinks but keeps all) | Multicollinearity, all features relevant | `Ridge(alpha=1.0)` |
| ElasticNet | $\lambda_1 \sum \lvert\beta_j\rvert + \lambda_2 \sum \beta_j^2$ | Partial (some set to 0) | Both multicollinearity and sparse features | `ElasticNet(alpha=1.0, l1_ratio=0.5)` |
Key parameter:
- `alpha` (α): controls regularization strength
  - Higher α → stronger penalty → simpler model
  - Lower α → weaker penalty → more complex model
  - Use cross-validation to find the optimal value (a sketch follows this list)
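A small sketch comparing the three penalties on synthetic data; scaling is applied inside the pipeline so the penalty acts on coefficients of a common scale:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only 5 of 20 features are informative, a good case for L1
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=2)

for model in (Lasso(alpha=1.0), Ridge(alpha=1.0),
              ElasticNet(alpha=1.0, l1_ratio=0.5)):
    pipe = make_pipeline(StandardScaler(), model)
    score = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{type(model).__name__}: mean CV R² = {score:.3f}")
```

For tuning α directly, scikit-learn also provides `LassoCV`, `RidgeCV`, and `ElasticNetCV`, which select it by cross-validation internally.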
5.3. Hyperparameter tuning
Systematically searches for optimal model parameters.
| Method | Strategy | Pros | Cons | Implementation |
|---|---|---|---|---|
| Grid Search | Exhaustive search of parameter grid | Guaranteed to find best in grid | Computationally expensive | `GridSearchCV` |
| Random Search | Random sampling from distributions | More efficient, explores wider space | May miss optimal combination | `RandomizedSearchCV` |
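A sketch of both search strategies tuning Ridge's `alpha` (the grid and distribution below are illustrative choices):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=10, noise=8.0, random_state=3)

# Grid search: exhaustive over a fixed set of alpha values
grid = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=5)
grid.fit(X, y)
print("Grid best:", grid.best_params_, round(grid.best_score_, 3))

# Random search: samples alpha from a log-uniform distribution
rand = RandomizedSearchCV(Ridge(), {"alpha": loguniform(1e-3, 1e3)},
                          n_iter=20, cv=5, random_state=3)
rand.fit(X, y)
print("Random best:", rand.best_params_, round(rand.best_score_, 3))
```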
Additional resources
Python libraries
- scikit-learn: Comprehensive ML library
  - `linear_model`: LinearRegression, Lasso, Ridge, ElasticNet
  - `model_selection`: train_test_split, cross_val_score, GridSearchCV
  - `metrics`: mean_squared_error, mean_absolute_error, r2_score
  - `pipeline`: Pipeline, ColumnTransformer
- scipy: Scientific computing library
Key sklearn modules
- sklearn.linear_model: All regression algorithms
- sklearn.model_selection: Train-test split, CV, tuning
- sklearn.metrics: Performance metrics
- sklearn.pipeline: Workflow management
Recommended reading
- Scikit-learn Linear Models Guide: Comprehensive regression documentation
- “Introduction to Statistical Learning” by James, Witten, Hastie, Tibshirani
- “Hands-On Machine Learning” by Aurélien Géron