Classification overview
Classification models predict categorical target variables by learning decision boundaries that separate different classes. This guide covers essential classification techniques, evaluation metrics, and best practices for building effective classifiers.
Table of contents
- Best practices
- Classification workflow
- Model selection guide
- Performance metrics
- Core classification techniques
1. Best practices
1.1. General guidelines
- Start Simple
  - Begin with logistic regression or Naive Bayes
  - Add complexity only when justified
  - Compare models systematically
- Always Split Your Data
  - Use train-test split (70-30 or 80-20)
  - Set random_state for reproducibility
  - Never fit scalers/transformers on test data
- Check Class Balance
  - Examine class distribution before training
  - Apply balancing techniques if needed
  - Use stratified splits for imbalanced data
- Scale Your Features
  - Essential for KNN and SVM
  - Beneficial for logistic regression
  - Not needed for decision trees
  - Use StandardScaler after the train-test split (see the sketch after this list)
- Use Cross-Validation
  - Provides robust performance estimates
  - Use Stratified K-Fold for imbalanced data
  - Use cross_val_score() for quick evaluation
- Choose Appropriate Metrics
  - Don’t rely solely on accuracy
  - Use precision, recall, F1 for imbalanced data
  - Consider business context (FP vs FN costs)
- Visualize Results
  - Plot confusion matrix
  - Examine ROC curves
  - Use Precision-Recall curves for imbalanced data
- Document Everything
  - Track preprocessing steps
  - Record hyperparameters
  - Note performance metrics
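A minimal sketch of the splitting and scaling guidelines above. The synthetic dataset from make_classification is used purely for illustration; any tabular dataset with a categorical target works the same way.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Illustrative imbalanced dataset (90% / 10%)
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42)

# Stratified split keeps the class ratio in both sets; random_state makes it reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on training data only, then apply the same transform to test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Simple baseline model
clf = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)
print(f"Test accuracy: {clf.score(X_test_scaled, y_test):.3f}")
```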
1.2. Common pitfalls
| Pitfall | Problem | Solution |
|---|---|---|
| Using accuracy for imbalanced data | Misleading performance | Use precision, recall, F1-score |
| Not scaling features | Poor KNN/SVM performance | Standardize features |
| Ignoring class imbalance | Biased toward majority class | Use SMOTE, undersampling, or class weights |
| Overfitting decision trees | Poor generalization | Use pruning (max_depth, min_samples_split) |
| Wrong metric choice | Misaligned with business goals | Consider cost of FP vs FN |
| Not using cross-validation | Unreliable estimates | Use Stratified K-Fold |
| Applying balancing before split | Data leakage | Balance after train-test split |
| Using ROC for imbalanced data | Overly optimistic | Use Precision-Recall curve |
| Not tuning hyperparameters | Suboptimal performance | Use GridSearchCV or RandomizedSearchCV |
| Ignoring probability calibration | Poor probability estimates | Consider threshold tuning |
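The last pitfall (threshold tuning) can be addressed with a simple threshold sweep over predicted probabilities. The sketch below shows one possible criterion (maximizing F1); it assumes a binary problem and a classifier that exposes predict_proba, and the data is synthetic for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# Sweep candidate thresholds and pick the one that maximizes F1
precision, recall, thresholds = precision_recall_curve(y_te, probs)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = thresholds[np.argmax(f1[:-1])]  # last precision/recall point has no threshold

y_pred = (probs >= best).astype(int)
print(f"threshold={best:.3f}  F1={f1_score(y_te, y_pred):.3f}")
```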
2. Classification workflow
1. Data Loading & Exploration
↓
2. Check Class Balance
↓
3. Train-Test Split (stratified if imbalanced)
↓
4. Data Preprocessing
- Handle missing values
- Encode categorical variables
- Scale numerical features (if needed)
↓
5. Apply Balancing Techniques (if needed)
- SMOTE
- Undersampling
↓
6. Model Selection & Training
- Logistic Regression
- Naive Bayes
- KNN
- Decision Tree
- SVM
↓
7. Cross-Validation (Stratified K-Fold)
↓
8. Model Evaluation
- Confusion matrix
- Accuracy, precision, recall, F1
- ROC curve and AUC
- Precision-Recall curve (for imbalanced data)
↓
9. Hyperparameter Tuning
↓
10. Final Model Selection
↓
11. Predictions on Test Set
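A condensed sketch of the workflow above (steps 3 through 8 and 11) on a synthetic dataset. Balancing and hyperparameter tuning are shown in later sections; the dataset and scoring choice here are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=1500, weights=[0.8, 0.2], random_state=0)

# Step 3: stratified train-test split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# Steps 4 & 6: preprocessing and model wrapped in one pipeline (avoids leakage)
model = make_pipeline(StandardScaler(), SVC())

# Step 7: stratified cross-validation on the training set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_tr, y_tr, cv=cv, scoring="f1")
print("CV F1 scores:", scores.round(3))

# Steps 8 & 11: fit on the full training set, evaluate on the held-out test set
model.fit(X_tr, y_tr)
print(classification_report(y_te, model.predict(X_te)))
```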
3. Model selection guide
| Algorithm | Data Considerations | Regularization | Strengths | Weaknesses |
|---|---|---|---|---|
| Logistic Regression | Feature scaling beneficial; handles binary and multiclass; check for multicollinearity | L1 (Lasso), L2 (Ridge), ElasticNet | Fast, interpretable, provides probabilities; linear decision boundary; good baseline | Assumes linearity; poor with non-linear relationships; may need feature engineering |
| Naive Bayes | No scaling needed; works well with small datasets; handles high dimensions | None (probabilistic) | Very fast training/prediction; good baseline; handles missing data; works with small datasets | Assumes feature independence (often violated); can be outperformed by other models |
| K-Nearest Neighbors | Feature scaling critical; remove irrelevant features; handle missing values | None (instance-based) | Simple concept; no training phase; non-parametric; naturally handles multiclass | Slow predictions; memory intensive; sensitive to feature scaling; curse of dimensionality |
| Decision Trees | Minimal preprocessing; no scaling required; handles mixed data types; can handle missing values | Pruning (max_depth, min_samples_split, min_samples_leaf) | Highly interpretable; handles non-linear relationships; no scaling needed; visualizable | Prone to overfitting; unstable; biased toward features with more levels |
| Support Vector Machines | Feature scaling critical; works well in high dimensions; effective with clear margins | C parameter (regularization), kernel parameters | Effective in high dimensions; memory efficient; versatile kernels; good with clear margins | Slow with large datasets; difficult to interpret; sensitive to kernel choice; requires scaling |
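For a quick, illustrative comparison of the five algorithms in the table, each can be evaluated with the same stratified cross-validation loop. The synthetic dataset and F1 scoring below are assumptions made for the example only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    # Scaling matters for logistic regression, KNN, and SVM; trees and Naive Bayes don't need it
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Naive Bayes": GaussianNB(),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    print(f"{name:20s} mean F1 = {scores.mean():.3f}")
```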
4. Performance metrics
Performance metrics evaluate how well a classifier distinguishes between classes and the types of errors it makes.
4.1. Metric comparison
| Metric | Formula | Range | Best For | Limitations |
|---|---|---|---|---|
| Accuracy | $(TP + TN) / \text{Total}$ | 0-1 | Balanced datasets | Misleading for imbalanced data |
| Precision | $TP / (TP + FP)$ | 0-1 | Minimizing false positives | Ignores false negatives |
| Recall | $TP / (TP + FN)$ | 0-1 | Minimizing false negatives | Ignores false positives |
| F1-Score | $2 \times \frac{P \times R}{P + R}$ | 0-1 | Balancing precision and recall | Equal weight to both metrics |
| AUC-ROC | Area under ROC curve | 0-1 | Balanced datasets | Optimistic for imbalanced data |
| AUCPR | Area under PR curve | 0-1 | Imbalanced datasets | More conservative than AUC |
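A minimal sketch computing each metric in the table with sklearn.metrics. Here average_precision_score is used as scikit-learn's summary of the area under the Precision-Recall curve; the dataset is synthetic and illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)[:, 1]   # probabilities are needed for the AUC metrics

print("accuracy :", accuracy_score(y_te, y_pred))
print("precision:", precision_score(y_te, y_pred))
print("recall   :", recall_score(y_te, y_pred))
print("F1       :", f1_score(y_te, y_pred))
print("AUC-ROC  :", roc_auc_score(y_te, y_prob))
print("AUCPR    :", average_precision_score(y_te, y_prob))
```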
4.2. Confusion matrix
The confusion matrix shows all combinations of predicted vs. actual class labels:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
Key terms:
- TP: Correctly predicted positive cases
- TN: Correctly predicted negative cases
- FP: Type I error (false alarm)
- FN: Type II error (missed detection)
Best practice: Use confusion_matrix() and visualize the result as a heatmap
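A short sketch of this best practice, assuming matplotlib is available for plotting. Note that with labels ordered [0, 1], scikit-learn's matrix is laid out as [[TN, FP], [FN, TP]] (rows are actual classes, columns are predicted).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
cm = confusion_matrix(y_te, clf.predict(X_te))
print(cm)  # rows = actual, columns = predicted

# Heatmap-style visualization built into scikit-learn
ConfusionMatrixDisplay(confusion_matrix=cm).plot(cmap="Blues")
plt.show()
```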
4.3. When to use each metric
| Scenario | Recommended Metrics | Rationale |
|---|---|---|
| Balanced classes | Accuracy, F1, AUC-ROC | All metrics reliable |
| Imbalanced classes | Precision, Recall, F1, AUCPR | Accuracy misleading |
| False positives costly | Precision | Minimize FP (e.g., spam detection) |
| False negatives costly | Recall | Minimize FN (e.g., disease detection) |
| Both FP and FN important | F1-Score | Balances precision and recall |
| Threshold-independent evaluation | AUC-ROC, AUCPR | Evaluates across all thresholds |
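For threshold-independent evaluation, both curves can be drawn directly from a fitted estimator. The sketch below assumes scikit-learn 1.0+ (for the from_estimator helpers) and matplotlib; the imbalanced synthetic dataset is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay
import matplotlib.pyplot as plt

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
RocCurveDisplay.from_estimator(clf, X_te, y_te, ax=axes[0])         # ROC curve and AUC
PrecisionRecallDisplay.from_estimator(clf, X_te, y_te, ax=axes[1])  # PR curve and AP
plt.tight_layout()
plt.show()
```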
5. Core classification techniques
These techniques are crucial for successful classification modeling with any algorithm. Proper application of cross-validation, imbalanced data handling, and hyperparameter tuning significantly improves model performance.
5.1. Cross-validation
Trains and evaluates a model on multiple train-validation splits to estimate generalization performance.
Cross-validation functions:
- cross_val_score(): Returns an array of scores, one per fold
- cross_validate(): Returns a dict with test scores, fit times, score times, etc.
Cross-validation fold generators:
- Stratified K-Fold: Maintains class proportions in each fold (recommended for classification)
- K-Fold: Standard k equal folds
- Repeated Stratified K-Fold: Multiple runs with different random splits
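A minimal sketch of both helper functions with a Stratified K-Fold generator, on an illustrative synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, cross_validate

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
clf = LogisticRegression(max_iter=1000)

# Stratified K-Fold keeps the class proportions in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# cross_val_score: one score per fold
print(cross_val_score(clf, X, y, cv=cv, scoring="f1").round(3))

# cross_validate: dict with test scores, fit times, score times, etc.
results = cross_validate(clf, X, y, cv=cv, scoring=["precision", "recall"])
print(results["test_precision"].round(3), results["test_recall"].round(3))
```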
5.2. Handling imbalanced data
Address class imbalance to prevent bias toward majority class.
| Technique | Type | How It Works | Best For | Implementation |
|---|---|---|---|---|
| SMOTE | Oversampling | Generates synthetic minority samples | Moderate imbalance | SMOTE() from imblearn |
| Random Undersampling | Undersampling | Randomly removes majority class samples | Large datasets | RandomUnderSampler() from imblearn |
| Class Weights | Algorithm parameter | Penalizes misclassification of minority class | Slight imbalance | class_weight='balanced' in most sklearn classifiers |
Key considerations:
- Apply balancing after train-test split to avoid data leakage
- Use Stratified K-Fold for cross-validation
- Evaluate with precision, recall, F1, and AUCPR
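A sketch of SMOTE and class weights applied the leakage-safe way, resampling only the training split. It assumes imbalanced-learn is installed; the synthetic dataset is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Option 1: SMOTE on the training data only (test set stays untouched)
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
clf_smote = LogisticRegression(max_iter=1000).fit(X_res, y_res)

# Option 2: class weights, no resampling needed
clf_weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

for name, clf in [("SMOTE", clf_smote), ("class_weight", clf_weighted)]:
    print(name, "F1:", round(f1_score(y_te, clf.predict(X_te)), 3))
```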
5.3. Hyperparameter tuning
Systematically searches for optimal model parameters.
| Method | Strategy | Pros | Cons | Implementation |
|---|---|---|---|---|
| Grid Search | Exhaustive search of parameter grid | Guaranteed to find best in grid | Computationally expensive | GridSearchCV |
| Random Search | Random sampling from distributions | More efficient, explores wider space | May miss optimal combination | RandomizedSearchCV |
Common hyperparameters to tune:
| Algorithm | Key Hyperparameters |
|---|---|
| Logistic Regression | C (regularization strength), penalty (L1, L2, ElasticNet) |
| KNN | n_neighbors (k), weights (uniform, distance), metric (euclidean, manhattan) |
| Decision Tree | max_depth, min_samples_split, min_samples_leaf, criterion (gini, entropy) |
| SVM | C (regularization), gamma (kernel coefficient), kernel (linear, rbf, poly) |
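A sketch of grid search over the SVM hyperparameters listed above, with scaling kept inside the pipeline so cross-validation folds stay leakage-free. The parameter values and F1 scoring are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01, 0.1],
    "svc__kernel": ["linear", "rbf"],
}

search = GridSearchCV(pipe, param_grid,
                      cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
                      scoring="f1", n_jobs=-1)
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("test score :", round(search.score(X_te, y_te), 3))
```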
Additional resources
Python libraries
- scikit-learn: Comprehensive ML library
  - linear_model: LogisticRegression
  - naive_bayes: GaussianNB, MultinomialNB, BernoulliNB
  - neighbors: KNeighborsClassifier
  - tree: DecisionTreeClassifier
  - svm: SVC
  - model_selection: train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
  - metrics: confusion_matrix, classification_report, roc_curve, precision_recall_curve
- imbalanced-learn: Handling imbalanced datasets
  - over_sampling: SMOTE
  - under_sampling: RandomUnderSampler
  - pipeline: Pipeline (imbalanced-aware)
Key sklearn modules
- sklearn.linear_model: Logistic regression and variants
- sklearn.model_selection: Train-test split, CV, tuning
- sklearn.metrics: Classification metrics
- sklearn.preprocessing: Feature scaling and encoding
Recommended reading
- Scikit-learn Classification Guide: Comprehensive classification documentation
- “Introduction to Statistical Learning” by James, Witten, Hastie, Tibshirani
- “Hands-On Machine Learning” by Aurélien Géron
- Imbalanced-learn Documentation: Guide to handling imbalanced datasets