Classification models predict categorical target variables by learning decision boundaries that separate different classes. This guide covers essential classification techniques, evaluation metrics, and best practices for building effective classifiers.

Table of contents

  1. Best practices
  2. Classification workflow
  3. Model selection guide
  4. Performance metrics
  5. Core classification techniques

1. Best practices

1.1. General guidelines

  1. Start Simple
    • Begin with logistic regression or Naive Bayes
    • Add complexity only when justified
    • Compare models systematically
  2. Always Split Your Data
    • Use train-test split (70-30 or 80-20)
    • Set random_state for reproducibility
    • Never fit scalers/transformers on test data
  3. Check Class Balance
    • Examine class distribution before training
    • Apply balancing techniques if needed
    • Use stratified splits for imbalanced data
  4. Scale Your Features
    • Essential for KNN and SVM
    • Beneficial for logistic regression
    • Not needed for decision trees
    • Use StandardScaler after the train-test split (see the sketch after this list)
  5. Use Cross-Validation
    • Provides robust performance estimates
    • Use Stratified K-Fold for imbalanced data
    • Use cross_val_score() for quick evaluation
  6. Choose Appropriate Metrics
    • Don’t rely solely on accuracy
    • Use precision, recall, F1 for imbalanced data
    • Consider business context (FP vs FN costs)
  7. Visualize Results
    • Plot confusion matrix
    • Examine ROC curves
    • Use Precision-Recall curves for imbalanced data
  8. Document Everything
    • Track preprocessing steps
    • Record hyperparameters
    • Note performance metrics
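
A minimal sketch of guidelines 2-4, using a synthetic dataset from make_classification as a stand-in for real data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced toy data (stand-in for your own dataset).
X, y = make_classification(n_samples=1_000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)

# Check the class distribution before training (guideline 3).
print("Class counts:", np.bincount(y))

# Stratified 80-20 split with a fixed random_state for reproducibility (guideline 2).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Fit the scaler on the training data only, then transform both sets (guideline 4).
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```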

1.2. Common pitfalls

| Pitfall | Problem | Solution |
|---|---|---|
| Using accuracy for imbalanced data | Misleading performance | Use precision, recall, F1-score |
| Not scaling features | Poor KNN/SVM performance | Standardize features |
| Ignoring class imbalance | Biased toward majority class | Use SMOTE, undersampling, or class weights |
| Overfitting decision trees | Poor generalization | Use pruning (max_depth, min_samples_split) |
| Wrong metric choice | Misaligned with business goals | Consider cost of FP vs FN |
| Not using cross-validation | Unreliable estimates | Use Stratified K-Fold |
| Applying balancing before split | Data leakage | Balance after train-test split |
| Using ROC for imbalanced data | Overly optimistic | Use Precision-Recall curve |
| Not tuning hyperparameters | Suboptimal performance | Use GridSearchCV or RandomizedSearchCV |
| Ignoring probability calibration | Poor probability estimates | Consider threshold tuning |
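
One way to avoid several of these pitfalls at once — scaling inside cross-validation without leakage, and accounting for class imbalance — is to bundle preprocessing and the classifier in a scikit-learn Pipeline. The Pipeline is not required by anything above; this is just a sketch of the idea on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1_000, weights=[0.85, 0.15], random_state=0)

# Bundling the scaler and classifier means the scaler is re-fit on the training
# portion of every fold, so nothing leaks from the validation folds.
# class_weight='balanced' addresses the class-imbalance pitfall.
model = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf", class_weight="balanced", random_state=0)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
print(f"F1 across folds: {scores.mean():.3f} +/- {scores.std():.3f}")
```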

2. Classification workflow

1. Data Loading & Exploration
   ↓
2. Check Class Balance
   ↓
3. Train-Test Split (stratified if imbalanced)
   ↓
4. Data Preprocessing
   - Handle missing values
   - Encode categorical variables
   - Scale numerical features (if needed)
   ↓
5. Apply Balancing Techniques (if needed)
   - SMOTE
   - Undersampling
   ↓
6. Model Selection & Training
   - Logistic Regression
   - Naive Bayes
   - KNN
   - Decision Tree
   - SVM
   ↓
7. Cross-Validation (Stratified K-Fold)
   ↓
8. Model Evaluation
   - Confusion matrix
   - Accuracy, precision, recall, F1
   - ROC curve and AUC
   - Precision-Recall curve (for imbalanced data)
   ↓
9. Hyperparameter Tuning
   ↓
10. Final Model Selection
   ↓
11. Predictions on Test Set
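
A compact skeleton of this workflow, again on synthetic make_classification data and with class_weight='balanced' standing in for a separate balancing step (hyperparameter tuning is deferred to section 5.3):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# 1-2. Load data and check the class balance.
X, y = make_classification(n_samples=2_000, n_features=20,
                           weights=[0.9, 0.1], random_state=7)
print("Class counts:", np.bincount(y))

# 3. Stratified train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7)

# 4. Preprocessing: scale features, fitting the scaler on training data only.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 5-8. Train a baseline model (class_weight stands in for a balancing step)
#      and estimate its performance with stratified cross-validation.
model = LogisticRegression(max_iter=1000, class_weight="balanced")
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
print("CV F1:", cross_val_score(model, X_train, y_train, cv=cv, scoring="f1").mean())

# 9-11. (Tuning omitted here; see section 5.3.) Fit the chosen model on all
#       training data and evaluate it on the held-out test set.
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```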

3. Model selection guide

| Algorithm | Data Considerations | Regularization | Strengths | Weaknesses |
|---|---|---|---|---|
| Logistic Regression | Feature scaling beneficial; handles binary and multiclass; check for multicollinearity | L1 (Lasso), L2 (Ridge), ElasticNet | Fast, interpretable, provides probabilities; linear decision boundary; good baseline | Assumes linearity; poor with non-linear relationships; may need feature engineering |
| Naive Bayes | No scaling needed; works well with small datasets; handles high dimensions | None (probabilistic) | Very fast training/prediction; good baseline; handles missing data; works with small datasets | Assumes feature independence (often violated); can be outperformed by other models |
| K-Nearest Neighbors | Feature scaling critical; remove irrelevant features; handle missing values | None (instance-based) | Simple concept; no training phase; non-parametric; naturally handles multiclass | Slow predictions; memory intensive; sensitive to feature scaling; curse of dimensionality |
| Decision Trees | Minimal preprocessing; no scaling required; handles mixed data types; can handle missing values | Pruning (max_depth, min_samples_split, min_samples_leaf) | Highly interpretable; handles non-linear relationships; no scaling needed; visualizable | Prone to overfitting; unstable; biased toward features with more levels |
| Support Vector Machines | Feature scaling critical; works well in high dimensions; effective with clear margins | C parameter (regularization), kernel parameters | Effective in high dimensions; memory efficient; versatile kernels; good with clear margins | Slow with large datasets; difficult to interpret; sensitive to kernel choice; requires scaling |
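
As a rough illustration of how these five algorithms can be compared, the sketch below scores each one with stratified cross-validation, wrapping the scaling-sensitive models in a pipeline as the table suggests; the parameter values are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1_000, n_features=20, random_state=1)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Scaling is wrapped in a pipeline for the models where the table says it
# matters (logistic regression, KNN, SVM); trees and Naive Bayes are unscaled.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Naive Bayes": GaussianNB(),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=1),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    print(f"{name:20s} F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```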

4. Performance metrics

Performance metrics evaluate how well a classifier distinguishes between classes and the types of errors it makes.

4.1. Metric comparison

| Metric | Formula | Range | Best For | Limitations |
|---|---|---|---|---|
| Accuracy | $(TP + TN) / \text{Total}$ | 0-1 | Balanced datasets | Misleading for imbalanced data |
| Precision | $TP / (TP + FP)$ | 0-1 | Minimizing false positives | Ignores false negatives |
| Recall | $TP / (TP + FN)$ | 0-1 | Minimizing false negatives | Ignores false positives |
| F1-Score | $2 \times \frac{P \times R}{P + R}$ | 0-1 | Balancing precision and recall | Equal weight to both metrics |
| AUC-ROC | Area under ROC curve | 0-1 | Balanced datasets | Optimistic for imbalanced data |
| AUCPR | Area under PR curve | 0-1 | Imbalanced datasets | More conservative than AUC |
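
A small worked example of these formulas, using hand-made labels so the counts are easy to verify (average_precision_score is used here as the usual summary of the PR curve):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)

# Ten true labels, hard predictions, and predicted scores.
y_true  = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred  = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.6, 0.1, 0.2, 0.3])

# Here TP=3, FP=1, FN=1, TN=5, so by the formulas above:
#   accuracy  = (3 + 5) / 10 = 0.80
#   precision = 3 / (3 + 1)  = 0.75
#   recall    = 3 / (3 + 1)  = 0.75
#   F1        = 2 * (0.75 * 0.75) / (0.75 + 0.75) = 0.75
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))

# Threshold-independent metrics use scores/probabilities, not hard labels.
print("AUC-ROC  :", roc_auc_score(y_true, y_score))
print("AUCPR    :", average_precision_score(y_true, y_score))
```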

4.2. Confusion matrix

The confusion matrix shows all combinations of predicted vs. actual class labels:

|                 | Predicted Positive  | Predicted Negative  |
|-----------------|---------------------|---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |

Key terms:

  • TP: Correctly predicted positive cases
  • TN: Correctly predicted negative cases
  • FP: Type I error (false alarm)
  • FN: Type II error (missed detection)

Best practice: compute the matrix with confusion_matrix() and visualize it as a heatmap, as in the sketch below.
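
A minimal sketch using seaborn for the heatmap (the toy labels and class names are illustrative; in practice pass y_test and your model's predictions):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

# Rows are actual classes, columns are predicted classes; sklearn orders the
# labels as [0, 1], i.e. the negative class comes first.
cm = confusion_matrix(y_true, y_pred)

sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Negative", "Positive"],
            yticklabels=["Negative", "Positive"])
plt.xlabel("Predicted label")
plt.ylabel("Actual label")
plt.tight_layout()
plt.show()
```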

4.3. When to use each metric

| Scenario | Recommended Metrics | Rationale |
|---|---|---|
| Balanced classes | Accuracy, F1, AUC-ROC | All metrics reliable |
| Imbalanced classes | Precision, Recall, F1, AUCPR | Accuracy misleading |
| False positives costly | Precision | Minimize FP (e.g., spam detection) |
| False negatives costly | Recall | Minimize FN (e.g., disease detection) |
| Both FP and FN important | F1-Score | Balances precision and recall |
| Threshold-independent evaluation | AUC-ROC, AUCPR | Evaluates across all thresholds |

5. Core classification techniques

These techniques are crucial for successful classification modeling with any algorithm. Proper application of cross-validation, imbalanced data handling, and hyperparameter tuning significantly improves model performance.

5.1. Cross-validation

Cross-validation trains and evaluates the model on multiple train-validation splits to estimate generalization performance.

Cross-validation functions:

  • cross_val_score(): scores a model with a single metric across all folds
  • cross_validate(): returns multiple metrics per fold, plus fit and score times
  • cross_val_predict(): returns out-of-fold predictions for each sample

Cross-validation fold generators:

  • KFold: random folds, without regard to class balance
  • StratifiedKFold: preserves the class ratio in every fold (preferred for classification)
  • RepeatedStratifiedKFold: repeats stratified K-fold with different shuffles for more stable estimates
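
A short sketch of the two most common entry points, cross_val_score() and cross_validate(), with StratifiedKFold supplying the folds (synthetic data again):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, cross_validate

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=3)
model = LogisticRegression(max_iter=1000)

# Stratified folds keep the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=3)

# cross_val_score: one score per fold for a single metric.
print(cross_val_score(model, X, y, cv=cv, scoring="f1"))

# cross_validate: several metrics at once, plus fit/score times.
results = cross_validate(model, X, y, cv=cv, scoring=["precision", "recall", "f1"])
print("Mean F1:", results["test_f1"].mean())
```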

5.2. Handling imbalanced data

Address class imbalance to prevent bias toward majority class.

| Technique | Type | How It Works | Best For | Implementation |
|---|---|---|---|---|
| SMOTE | Oversampling | Generates synthetic minority samples | Moderate imbalance | SMOTE() from imblearn |
| Random Undersampling | Undersampling | Randomly removes majority class samples | Large datasets | RandomUnderSampler() from imblearn |
| Class Weights | Algorithm parameter | Penalizes misclassification of minority class | Slight imbalance | class_weight='balanced' in most sklearn classifiers |

Key considerations:

  • Apply balancing after the train-test split to avoid data leakage (see the sketch below)
  • Use Stratified K-Fold for cross-validation
  • Evaluate with precision, recall, F1, and AUCPR
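
A sketch of the three techniques side by side; SMOTE and RandomUnderSampler require the imbalanced-learn package, and the data are synthetic:

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2_000, weights=[0.95, 0.05], random_state=5)

# Split first, then balance only the training data (avoids leakage).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=5)

# Option 1: oversample the minority class with SMOTE.
X_smote, y_smote = SMOTE(random_state=5).fit_resample(X_train, y_train)
print("After SMOTE:", np.bincount(y_smote))

# Option 2: randomly undersample the majority class.
X_under, y_under = RandomUnderSampler(random_state=5).fit_resample(X_train, y_train)
print("After undersampling:", np.bincount(y_under))

# Option 3: keep the data as-is and let the algorithm weight the classes.
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)
```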

5.3. Hyperparameter tuning

Systematically searches for optimal model parameters.

| Method | Strategy | Pros | Cons | Implementation |
|---|---|---|---|---|
| Grid Search | Exhaustive search of parameter grid | Guaranteed to find best in grid | Computationally expensive | GridSearchCV |
| Random Search | Random sampling from distributions | More efficient, explores wider space | May miss optimal combination | RandomizedSearchCV |

Common hyperparameters to tune:

| Algorithm | Key Hyperparameters |
|---|---|
| Logistic Regression | C (regularization strength), penalty (L1, L2, ElasticNet) |
| KNN | n_neighbors (k), weights (uniform, distance), metric (euclidean, manhattan) |
| Decision Tree | max_depth, min_samples_split, min_samples_leaf, criterion (gini, entropy) |
| SVM | C (regularization), gamma (kernel coefficient), kernel (linear, rbf, poly) |
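
A sketch of both search strategies on illustrative parameter ranges (loguniform comes from scipy; in practice, wrap scaling-sensitive models such as SVM in a pipeline as shown earlier):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=800, random_state=2)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

# Grid search: exhaustive over a small, explicit grid of KNN hyperparameters.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7, 11], "weights": ["uniform", "distance"]},
    cv=cv, scoring="f1")
grid.fit(X, y)
print("Best KNN params:", grid.best_params_, "F1:", round(grid.best_score_, 3))

# Randomized search: samples from distributions, useful for wide or
# continuous ranges such as C and gamma for an RBF SVM.
rand = RandomizedSearchCV(
    SVC(kernel="rbf"),
    param_distributions={"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-4, 1e0)},
    n_iter=20, cv=cv, scoring="f1", random_state=2)
rand.fit(X, y)
print("Best SVM params:", rand.best_params_, "F1:", round(rand.best_score_, 3))
```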

Additional resources

Python libraries

  • scikit-learn (sklearn): classifiers, preprocessing, model selection, and metrics used throughout this guide
  • imbalanced-learn (imblearn): SMOTE, RandomUnderSampler, and other resampling techniques for imbalanced data

Key sklearn modules

  • sklearn.model_selection: train_test_split, StratifiedKFold, cross_val_score, GridSearchCV, RandomizedSearchCV
  • sklearn.preprocessing: StandardScaler and other feature transformers
  • sklearn.metrics: confusion_matrix, precision/recall/F1 scores, ROC and Precision-Recall curve utilities
  • sklearn.linear_model, sklearn.naive_bayes, sklearn.neighbors, sklearn.tree, sklearn.svm: the classifiers covered in section 3