Classification overview
Classification models predict categorical target variables by learning decision boundaries that separate different classes. This guide covers essential classification techniques, evaluation metrics, and best practices for building effective classifiers.
Table of contents
- Best practices
- Classification workflow
- Model selection guide
- Performance metrics
- Core classification techniques
1. Best practices
1.1. General guidelines
- Start Simple
  - Begin with logistic regression or Naive Bayes
  - Add complexity only when justified
  - Compare models systematically
- Always Split Your Data
  - Use train-test split (70-30 or 80-20)
  - Set random_state for reproducibility
  - Never fit scalers/transformers on test data
- Check Class Balance
  - Examine class distribution before training
  - Apply balancing techniques if needed
  - Use stratified splits for imbalanced data
- Scale Your Features
  - Essential for KNN and SVM
  - Beneficial for logistic regression
  - Not needed for decision trees
  - Use StandardScaler after the train-test split (see the sketch after this list)
- Use Cross-Validation
  - Provides robust performance estimates
  - Use Stratified K-Fold for imbalanced data
  - Use cross_val_score() for quick evaluation
- Choose Appropriate Metrics
  - Don’t rely solely on accuracy
  - Use precision, recall, F1 for imbalanced data
  - Consider business context (FP vs FN costs)
- Visualize Results
  - Plot confusion matrix
  - Examine ROC curves
  - Use Precision-Recall curves for imbalanced data
- Document Everything
  - Track preprocessing steps
  - Record hyperparameters
  - Note performance metrics
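A minimal sketch of the splitting and scaling guidelines above. The synthetic dataset from make_classification is used purely for illustration; any tabular dataset with a categorical target works the same way.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Illustrative imbalanced dataset (90% / 10%)
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42)

# Stratified split keeps the class ratio in both sets; random_state makes it reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on training data only, then apply the same transform to test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Simple baseline model
clf = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)
print(f"Test accuracy: {clf.score(X_test_scaled, y_test):.3f}")
```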
1.2. Common pitfalls
| Pitfall | Problem | Solution |
|---|---|---|
| Using accuracy for imbalanced data | Misleading performance | Use precision, recall, F1-score |
| Not scaling features | Poor KNN/SVM performance | Standardize features |
| Ignoring class imbalance | Biased toward majority class | Use SMOTE, undersampling, or class weights |
| Overfitting decision trees | Poor generalization | Use pruning (max_depth, min_samples_split) |
| Wrong metric choice | Misaligned with business goals | Consider cost of FP vs FN |
| Not using cross-validation | Unreliable estimates | Use Stratified K-Fold |
| Applying balancing before split | Data leakage | Balance after train-test split |
| Using ROC for imbalanced data | Overly optimistic | Use Precision-Recall curve |
| Not tuning hyperparameters | Suboptimal performance | Use GridSearchCV or RandomizedSearchCV |
| Ignoring probability calibration | Poor probability estimates | Consider threshold tuning |
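The last pitfall (threshold tuning) can be addressed with a simple threshold sweep over predicted probabilities. The sketch below shows one possible criterion (maximizing F1); it assumes a binary problem and a classifier that exposes predict_proba, and the data is synthetic for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# Sweep candidate thresholds and pick the one that maximizes F1
precision, recall, thresholds = precision_recall_curve(y_te, probs)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = thresholds[np.argmax(f1[:-1])]  # last precision/recall point has no threshold

y_pred = (probs >= best).astype(int)
print(f"threshold={best:.3f}  F1={f1_score(y_te, y_pred):.3f}")
```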
2. Classification workflow
1. Data Loading & Exploration
↓
2. Check Class Balance
↓
3. Train-Test Split (stratified if imbalanced)
↓
4. Data Preprocessing
- Handle missing values
- Encode categorical variables
- Scale numerical features (if needed)
↓
5. Apply Balancing Techniques (if needed)
- SMOTE
- Undersampling
↓
6. Model Selection & Training
- Logistic Regression
- Naive Bayes
- KNN
- Decision Tree
- SVM
↓
7. Cross-Validation (Stratified K-Fold)
↓
8. Model Evaluation
- Confusion matrix
- Accuracy, precision, recall, F1
- ROC curve and AUC
- Precision-Recall curve (for imbalanced data)
↓
9. Hyperparameter Tuning
↓
10. Final Model Selection
↓
11. Predictions on Test Set
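A condensed sketch of the workflow above (steps 3 through 8 and 11) on a synthetic dataset. Balancing and hyperparameter tuning are shown in later sections; the dataset and scoring choice here are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=1500, weights=[0.8, 0.2], random_state=0)

# Step 3: stratified train-test split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# Steps 4 & 6: preprocessing and model wrapped in one pipeline (avoids leakage)
model = make_pipeline(StandardScaler(), SVC())

# Step 7: stratified cross-validation on the training set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_tr, y_tr, cv=cv, scoring="f1")
print("CV F1 scores:", scores.round(3))

# Steps 8 & 11: fit on the full training set, evaluate on the held-out test set
model.fit(X_tr, y_tr)
print(classification_report(y_te, model.predict(X_te)))
```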
3. Model selection guide
| Algorithm | Data Considerations | Regularization | Strengths | Weaknesses |
|---|---|---|---|---|
| Logistic Regression | Feature scaling beneficial; handles binary and multiclass; check for multicollinearity | L1 (Lasso), L2 (Ridge), ElasticNet | Fast, interpretable, provides probabilities; linear decision boundary; good baseline | Assumes linearity; poor with non-linear relationships; may need feature engineering |
| Naive Bayes | No scaling needed; works well with small datasets; handles high dimensions | None (probabilistic) | Very fast training/prediction; good baseline; handles missing data; works with small datasets | Assumes feature independence (often violated); can be outperformed by other models |
| K-Nearest Neighbors | Feature scaling critical; remove irrelevant features; handle missing values | None (instance-based) | Simple concept; no training phase; non-parametric; naturally handles multiclass | Slow predictions; memory intensive; sensitive to feature scaling; curse of dimensionality |
| Decision Trees | Minimal preprocessing; no scaling required; handles mixed data types; can handle missing values | Pruning (max_depth, min_samples_split, min_samples_leaf) | Highly interpretable; handles non-linear relationships; no scaling needed; visualizable | Prone to overfitting; unstable; biased toward features with more levels |
| Support Vector Machines | Feature scaling critical; works well in high dimensions; effective with clear margins | C parameter (regularization), kernel parameters | Effective in high dimensions; memory efficient; versatile kernels; good with clear margins | Slow with large datasets; difficult to interpret; sensitive to kernel choice; requires scaling |
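For a quick, illustrative comparison of the five algorithms in the table, each can be evaluated with the same stratified cross-validation loop. The synthetic dataset and F1 scoring below are assumptions made for the example only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    # Scaling matters for logistic regression, KNN, and SVM; trees and Naive Bayes don't need it
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Naive Bayes": GaussianNB(),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    print(f"{name:20s} mean F1 = {scores.mean():.3f}")
```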
4. Performance metrics
Performance metrics evaluate how well a classifier distinguishes between classes and the types of errors it makes.
4.1. Metric comparison
| Metric | Formula | Range | Best For | Limitations |
|---|---|---|---|---|
| Accuracy | $(TP + TN) / \text{Total}$ | 0-1 | Balanced datasets | Misleading for imbalanced data |
| Precision | $TP / (TP + FP)$ | 0-1 | Minimizing false positives | Ignores false negatives |
| Recall | $TP / (TP + FN)$ | 0-1 | Minimizing false negatives | Ignores false positives |
| F1-Score | $2 \times \frac{P \times R}{P + R}$ | 0-1 | Balancing precision and recall | Equal weight to both metrics |
| AUC-ROC | Area under ROC curve | 0-1 | Balanced datasets | Optimistic for imbalanced data |
| AUCPR | Area under PR curve | 0-1 | Imbalanced datasets | More conservative than AUC |
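A minimal sketch computing each metric in the table with sklearn.metrics. Here average_precision_score is used as scikit-learn's summary of the area under the Precision-Recall curve; the dataset is synthetic and illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)[:, 1]   # probabilities are needed for the AUC metrics

print("accuracy :", accuracy_score(y_te, y_pred))
print("precision:", precision_score(y_te, y_pred))
print("recall   :", recall_score(y_te, y_pred))
print("F1       :", f1_score(y_te, y_pred))
print("AUC-ROC  :", roc_auc_score(y_te, y_prob))
print("AUCPR    :", average_precision_score(y_te, y_prob))
```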
4.2. Confusion matrix
The confusion matrix shows all combinations of predicted vs. actual class labels:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
Key terms:
- TP: Correctly predicted positive cases
- TN: Correctly predicted negative cases
- FP: Type I error (false alarm)
- FN: Type II error (missed detection)
Best practice: Use confusion_matrix() and visualize the result as a heatmap
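A short sketch of this best practice, assuming matplotlib is available for plotting. Note that with labels ordered [0, 1], scikit-learn's matrix is laid out as [[TN, FP], [FN, TP]] (rows are actual classes, columns are predicted).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
cm = confusion_matrix(y_te, clf.predict(X_te))
print(cm)  # rows = actual, columns = predicted

# Heatmap-style visualization built into scikit-learn
ConfusionMatrixDisplay(confusion_matrix=cm).plot(cmap="Blues")
plt.show()
```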
4.3. When to use each metric
| Scenario | Recommended Metrics | Rationale |
|---|---|---|
| Balanced classes | Accuracy, F1, AUC-ROC | All metrics reliable |
| Imbalanced classes | Precision, Recall, F1, AUCPR | Accuracy misleading |
| False positives costly | Precision | Minimize FP (e.g., spam detection) |
| False negatives costly | Recall | Minimize FN (e.g., disease detection) |
| Both FP and FN important | F1-Score | Balances precision and recall |
| Threshold-independent evaluation | AUC-ROC, AUCPR | Evaluates across all thresholds |
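For threshold-independent evaluation, both curves can be drawn directly from a fitted estimator. The sketch below assumes scikit-learn 1.0+ (for the from_estimator helpers) and matplotlib; the imbalanced synthetic dataset is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay
import matplotlib.pyplot as plt

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
RocCurveDisplay.from_estimator(clf, X_te, y_te, ax=axes[0])         # ROC curve and AUC
PrecisionRecallDisplay.from_estimator(clf, X_te, y_te, ax=axes[1])  # PR curve and AP
plt.tight_layout()
plt.show()
```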
5. Core classification techniques
These techniques are crucial for successful classification modeling with any algorithm. Proper application of cross-validation, imbalanced data handling, and hyperparameter tuning significantly improves model performance.
5.1. Cross-validation
Trains and evaluates a model on multiple train-validation splits to estimate generalization performance.
Cross-validation functions:
- cross_val_score(): Returns an array of scores, one per fold
- cross_validate(): Returns a dict with test scores, fit times, score times, etc.
Cross-validation fold generators:
- Stratified K-Fold: Maintains class proportions in each fold (recommended for classification)
- K-Fold: Standard k equal folds
- Repeated Stratified K-Fold: Multiple runs with different random splits
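A minimal sketch of both helper functions with a Stratified K-Fold generator, on an illustrative synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, cross_validate

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
clf = LogisticRegression(max_iter=1000)

# Stratified K-Fold keeps the class proportions in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# cross_val_score: one score per fold
print(cross_val_score(clf, X, y, cv=cv, scoring="f1").round(3))

# cross_validate: dict with test scores, fit times, score times, etc.
results = cross_validate(clf, X, y, cv=cv, scoring=["precision", "recall"])
print(results["test_precision"].round(3), results["test_recall"].round(3))
```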
5.2. Handling imbalanced data
Address class imbalance to prevent bias toward majority class.
| Technique | Type | How It Works | Best For | Implementation |
|---|---|---|---|---|
| SMOTE | Oversampling | Generates synthetic minority samples | Moderate imbalance | SMOTE() from imblearn |
| Random Undersampling | Undersampling | Randomly removes majority class samples | Large datasets | RandomUnderSampler() from imblearn |
| Class Weights | Algorithm parameter | Penalizes misclassification of minority class | Slight imbalance | class_weight='balanced' in most sklearn classifiers |
Key considerations:
- Apply balancing after train-test split to avoid data leakage
- Use Stratified K-Fold for cross-validation
- Evaluate with precision, recall, F1, and AUCPR
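A sketch of SMOTE and class weights applied the leakage-safe way, resampling only the training split. It assumes imbalanced-learn is installed; the synthetic dataset is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Option 1: SMOTE on the training data only (test set stays untouched)
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
clf_smote = LogisticRegression(max_iter=1000).fit(X_res, y_res)

# Option 2: class weights, no resampling needed
clf_weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

for name, clf in [("SMOTE", clf_smote), ("class_weight", clf_weighted)]:
    print(name, "F1:", round(f1_score(y_te, clf.predict(X_te)), 3))
```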
5.3. Hyperparameter tuning
Systematically searches for optimal model parameters.
| Method | Strategy | Pros | Cons | Implementation |
|---|---|---|---|---|
| Grid Search | Exhaustive search of parameter grid | Guaranteed to find best in grid | Computationally expensive | GridSearchCV |
| Random Search | Random sampling from distributions | More efficient, explores wider space | May miss optimal combination | RandomizedSearchCV |
Common hyperparameters to tune:
| Algorithm | Key Hyperparameters |
|---|---|
| Logistic Regression | C (regularization strength), penalty (L1, L2, ElasticNet) |
| KNN | n_neighbors (k), weights (uniform, distance), metric (euclidean, manhattan) |
| Decision Tree | max_depth, min_samples_split, min_samples_leaf, criterion (gini, entropy) |
| SVM | C (regularization), gamma (kernel coefficient), kernel (linear, rbf, poly) |
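A sketch of grid search over the SVM hyperparameters listed above, with scaling kept inside the pipeline so cross-validation folds stay leakage-free. The parameter values and F1 scoring are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01, 0.1],
    "svc__kernel": ["linear", "rbf"],
}

search = GridSearchCV(pipe, param_grid,
                      cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
                      scoring="f1", n_jobs=-1)
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("test score :", round(search.score(X_te, y_te), 3))
```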
Additional resources
Python libraries
- scikit-learn: Comprehensive ML library
  - linear_model: LogisticRegression
  - naive_bayes: GaussianNB, MultinomialNB, BernoulliNB
  - neighbors: KNeighborsClassifier
  - tree: DecisionTreeClassifier
  - svm: SVC
  - model_selection: train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
  - metrics: confusion_matrix, classification_report, roc_curve, precision_recall_curve
- imbalanced-learn: Handling imbalanced datasets
  - over_sampling: SMOTE
  - under_sampling: RandomUnderSampler
  - pipeline: Pipeline (imbalanced-aware)
Key sklearn modules
- sklearn.linear_model: Logistic regression and variants
- sklearn.model_selection: Train-test split, CV, tuning
- sklearn.metrics: Classification metrics
- sklearn.preprocessing: Feature scaling and encoding
Recommended reading
- Scikit-learn Classification Guide: Comprehensive classification documentation
- “Introduction to Statistical Learning” by James, Witten, Hastie, Tibshirani
- “Hands-On Machine Learning” by Aurélien Géron
- Imbalanced-learn Documentation: Guide to handling imbalanced datasets