Regression models predict continuous target variables by learning relationships between input features and outputs. This guide covers essential regression techniques, evaluation metrics, and best practices for building effective predictive models.

Table of contents

  1. Best practices
  2. Regression workflow
  3. Model selection guide
  4. Performance metrics
  5. Core regression techniques

1. Best practices

1.1. General guidelines

  1. Start Simple
    • Begin with linear regression
    • Add complexity only when justified
    • Compare models systematically
  2. Always Split Your Data
    • Use train-test split (70-30 or 80-20)
    • Set random_state for reproducibility
    • Never fit scalers/transformers on test data
  3. Scale Your Features
    • Essential for regularized regression
    • Use StandardScaler after train-test split
    • Apply same transformation to test data
  4. Use Cross-Validation
    • Provides robust performance estimates
    • K-Fold (k=5 or k=10) is standard
    • Use cross_val_score() for quick evaluation
  5. Monitor Overfitting
    • Compare training vs test performance
    • Large gap indicates overfitting
    • Use regularization if needed
  6. Leverage Pipelines
    • Combine preprocessing and modeling
    • Prevents data leakage
    • Simplifies deployment
    • Use sklearn.pipeline.Pipeline (see the sketch after this list)
  7. Document Everything
    • Track preprocessing steps
    • Record hyperparameters
    • Note performance metrics
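A minimal sketch of guidelines 2, 3, 5, and 6, assuming scikit-learn and using a synthetic dataset from make_regression in place of real data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (assumption for this sketch)
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)

# Guideline 2: hold out a test set; fix random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Guidelines 3 and 6: scaling lives inside the pipeline, so it is fitted
# on training data only -- no leakage into the test set
model = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge(alpha=1.0)),
])
model.fit(X_train, y_train)

# Guideline 5: compare train vs. test performance to spot overfitting
print("train R^2:", model.score(X_train, y_train))
print("test  R^2:", model.score(X_test, y_test))
```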

1.2. Common pitfalls

| Pitfall | Problem | Solution |
| --- | --- | --- |
| Not splitting data | Can’t evaluate generalization | Always use train-test split |
| Data leakage | Overly optimistic results | Fit transformers only on training data |
| Ignoring overfitting | Poor test performance | Monitor train vs test metrics |
| Wrong metric | Misleading conclusions | Use multiple metrics (MSE, MAE, R²) |
| Skipping cross-validation | Unreliable estimates | Use K-Fold cross-validation |
| Not scaling | Regularization ineffective | Standardize features |
| Categorical encoding errors | Model can’t learn | Use one-hot or ordinal encoding |
| Missing values | Training fails or biased | Impute before modeling |
| High multicollinearity | Unstable coefficients | Use Ridge or ElasticNet |

2. Regression workflow

1. Data Loading & Inspection
   ↓
2. Train-Test Split
   ↓
3. Exploratory Data Analysis
   ↓
4. Data Preprocessing
   - Handle missing values
   - Encode categorical variables
   - Scale numerical features
   ↓
5. Model Selection & Training
   - Linear regression
   - Regularized regression
   ↓
6. Cross-Validation
   ↓
7. Model Evaluation (RSS, MSE, RMSE, MAE, R²)
   ↓
8. Hyperparameter Tuning (if using regularization)
   ↓
9. Final Model Selection
   ↓
10. Predictions on Test Set
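Step 4 is where leakage mistakes usually creep in. Below is a hedged sketch of mixed-type preprocessing wrapped in a single estimator, assuming a DataFrame with hypothetical numerical columns (age, income) and a categorical column (region); adjust the column lists to your data:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column lists -- replace with the columns in your dataset
num_cols = ["age", "income"]
cat_cols = ["region"]

preprocess = ColumnTransformer([
    # Numerical columns: impute missing values, then scale
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), num_cols),
    # Categorical columns: impute, then one-hot encode
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), cat_cols),
])

# Steps 4-5 of the workflow as a single estimator
model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", LinearRegression()),
])
# model.fit(X_train, y_train)  # fit on the training portion from step 2 only
```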

3. Model selection guide

| Algorithm | Data Considerations | Regularization | Strengths | Weaknesses |
| --- | --- | --- | --- | --- |
| Linear Regression | Remove or handle outliers; check for multicollinearity (VIF); may need feature scaling for regularized versions | Lasso (L1), Ridge (L2), ElasticNet | Simple, fast, interpretable; works well with linear relationships; provides feature importance via coefficients | Assumes linearity; sensitive to outliers; poor with non-linear patterns; affected by multicollinearity |
| K-Nearest Neighbors | Feature scaling required (StandardScaler, MinMaxScaler); remove irrelevant features; handle missing values | None (non-parametric) | No training phase; simple concept; naturally handles multi-class; non-parametric (no assumptions) | Slow predictions; memory intensive; sensitive to feature scaling; struggles with high dimensions (curse of dimensionality) |
| Decision Trees | Minimal preprocessing needed; handles missing values; no scaling required; can handle mixed data types | Pruning (max_depth, min_samples_split, min_samples_leaf) | Interpretable; handles non-linear relationships; no feature scaling needed; captures interactions | Prone to overfitting; unstable (small data changes cause big tree changes); biased toward features with more levels |
| Support Vector Machines | Feature scaling critical (StandardScaler); remove outliers; ensure balanced classes for classification | C parameter (controls margin), kernel parameters | Effective in high dimensions; memory efficient; works well with clear margins; robust to outliers (with proper kernel) | Slow with large datasets; sensitive to kernel choice; requires feature scaling; difficult to interpret |
| Neural Networks | Feature scaling required; handle missing values; may need normalization; consider data augmentation for small datasets | L1, L2, Dropout, Early stopping, Batch normalization | Highly flexible; captures complex non-linear patterns; scales well with large data; automatic feature learning | Computationally expensive; requires large datasets; black box (hard to interpret); sensitive to hyperparameters; prone to overfitting |
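One way to apply the table is to score a few candidate regressors under identical cross-validation splits before committing to one. A sketch under the assumption that X and y are already numeric (synthetic data stands in here):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=8, noise=15.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    # Scaling is essential for KNN, harmless for linear regression,
    # and unnecessary for trees (left out below)
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "knn": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    "tree": DecisionTreeRegressor(max_depth=5, random_state=0),
}

for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=cv, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```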

4. Performance metrics

Performance metrics quantify the difference between model predictions and the true values of the label. When a metric is used as the training objective, it is referred to as the ‘loss’.

| Metric | Formula | Units | Outlier Sensitivity | Best For |
| --- | --- | --- | --- | --- |
| RSS | $\sum(y_i - \hat{y}_i)^2$ | Squared | Highest | OLS optimization |
| MSE | $\frac{1}{n}\sum(y_i - \hat{y}_i)^2$ | Squared | Higher | Higher penalty for larger errors |
| RMSE | $\sqrt{\text{MSE}}$ | Same as y | High | Interpretable magnitude; penalizes large errors less than MSE but more than MAE |
| MAE | $\frac{1}{n}\sum\lvert y_i - \hat{y}_i \rvert$ | Same as y | Low | Robust to outliers |
| R² | $1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}}$ | Unitless (0 to 1) | Moderate | Variance explained |

Best practice: Report multiple metrics for comprehensive evaluation
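A short sketch of computing these metrics with scikit-learn and NumPy, using hypothetical y_true and y_pred arrays:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])   # hypothetical labels
y_pred = np.array([2.5, 5.5, 7.0, 11.0])   # hypothetical predictions

rss = np.sum((y_true - y_pred) ** 2)        # residual sum of squares
mse = mean_squared_error(y_true, y_pred)    # mean squared error
rmse = np.sqrt(mse)                         # root mean squared error
mae = mean_absolute_error(y_true, y_pred)   # mean absolute error
r2 = r2_score(y_true, y_pred)               # coefficient of determination

print(f"RSS={rss:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```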


5. Core regression techniques

These techniques are crucial for successful regression modeling with any algorithm. Proper application of cross-validation, regularization, and hyperparameter tuning significantly improves model performance and generalization across all regression methods.

5.1. Cross-validation

Trains and evaluates the model on multiple train-validation splits, called ‘folds’, to estimate generalization performance without overusing the test set.

Cross-validation functions: cross_val_score(), cross_validate(), and cross_val_predict(), all in sklearn.model_selection.

Cross-validation fold generators: KFold, RepeatedKFold, ShuffleSplit, and LeaveOneOut, also in sklearn.model_selection.
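A brief sketch combining cross_val_score with an explicit KFold generator; the Ridge pipeline and synthetic data are placeholders for your own model and training split:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder training data; use the training split of your own dataset
X_train, y_train = make_regression(n_samples=300, n_features=6, noise=8.0, random_state=1)

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# Explicit fold generator: shuffle once, fix random_state for reproducibility
cv = KFold(n_splits=5, shuffle=True, random_state=1)

# scikit-learn maximizes scores, so MSE is returned negated
scores = cross_val_score(model, X_train, y_train, cv=cv,
                         scoring="neg_mean_squared_error")
print("mean CV MSE:", -scores.mean())
```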

5.2. Regularization methods

Add penalty terms to prevent overfitting and handle multicollinearity.

| Method | Penalty | Feature Selection | Best For | Implementation |
| --- | --- | --- | --- | --- |
| Lasso (L1) | \(\alpha \sum\lvert\beta_j\rvert\) | Yes (sets coefficients to 0) | Sparse models, irrelevant features | Lasso(alpha=1.0) |
| Ridge (L2) | \(\alpha \sum \beta_j^2\) | No (shrinks but keeps all) | Multicollinearity, all features relevant | Ridge(alpha=1.0) |
| ElasticNet | \(\lambda_1 \sum\lvert\beta_j\rvert + \lambda_2 \sum \beta_j^2\) | Partial (some set to 0) | Both multicollinearity and sparse features | ElasticNet(alpha=1.0, l1_ratio=0.5) |

Key parameter:

  • alpha (α): Controls regularization strength
    • Higher α → stronger penalty → simpler model
    • Lower α → weaker penalty → more complex model
    • Use cross-validation to find optimal value
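scikit-learn also provides CV variants (LassoCV, RidgeCV, ElasticNetCV) that search over alpha internally; a sketch on synthetic data, with scaling kept inside the pipeline:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Each *CV estimator cross-validates over a grid of alpha values internally
models = {
    "lasso": make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0)),
    "ridge": make_pipeline(StandardScaler(), RidgeCV(alphas=[0.1, 1.0, 10.0])),
    "elasticnet": make_pipeline(StandardScaler(),
                                ElasticNetCV(cv=5, l1_ratio=0.5, random_state=0)),
}

for name, model in models.items():
    model.fit(X, y)
    # The fitted regressor is the last step of the pipeline
    print(name, "selected alpha:", model[-1].alpha_)
```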

5.3. Hyperparameter tuning

Systematically searches for optimal model parameters.

| Method | Strategy | Pros | Cons | Implementation |
| --- | --- | --- | --- | --- |
| Grid Search | Exhaustive search of parameter grid | Guaranteed to find best in grid | Computationally expensive | GridSearchCV |
| Random Search | Random sampling from distributions | More efficient, explores wider space | May miss optimal combination | RandomizedSearchCV |
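A sketch of both strategies tuning alpha for a scaled Ridge pipeline; parameter names use the pipeline's step__parameter convention, and the data here is synthetic:

```python
from scipy.stats import loguniform   # available in SciPy >= 1.4
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])

# Grid search: exhaustive over a small, explicit set of alphas
grid = GridSearchCV(pipe, {"ridge__alpha": [0.01, 0.1, 1.0, 10.0]},
                    cv=5, scoring="neg_mean_squared_error")
grid.fit(X, y)
print("grid search best params:", grid.best_params_)

# Random search: samples alphas from a continuous log-uniform distribution
rand = RandomizedSearchCV(pipe, {"ridge__alpha": loguniform(1e-3, 1e2)},
                          n_iter=20, cv=5, scoring="neg_mean_squared_error",
                          random_state=0)
rand.fit(X, y)
print("random search best params:", rand.best_params_)
```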

Additional resources

Python libraries

  • scikit-learn: Comprehensive ML library
    • linear_model: LinearRegression, Lasso, Ridge, ElasticNet
    • model_selection: train_test_split, cross_val_score, GridSearchCV
    • metrics: mean_squared_error, mean_absolute_error, r2_score
    • pipeline: Pipeline, ColumnTransformer
  • scipy: Scientific computing library
    • stats: Statistical functions, distributions, hypothesis tests
    • optimize: Optimization algorithms for parameter estimation
    • linalg: Linear algebra operations for matrix computations

Key sklearn modules