Unsupervised learning overview
Unsupervised learning discovers hidden patterns and structures in unlabeled data without predefined target variables. This guide covers clustering, dimensionality reduction, association rules, and anomaly detection techniques essential for exploratory analysis and feature engineering.
Table of contents
- Best practices
- Technique selection guide
- Clustering algorithms
- Dimensionality reduction
- Association rules
- Anomaly detection
- Evaluation metrics
- Unsupervised learning workflow
- Additional resources
1. Best practices
1.1. General guidelines
- Understand Your Goal
  - Clustering: group similar data points
  - Dimensionality reduction: reduce features, visualize
  - Association rules: find item relationships
  - Anomaly detection: identify outliers
- Always Preprocess Data
  - Standardize features for distance-based methods
  - Handle missing values appropriately
  - Remove or handle outliers (unless detecting them)
  - Scale features to similar ranges
- Determine Optimal Parameters
  - Use elbow method for k-means
  - Check silhouette scores for cluster quality
  - Examine scree plots for PCA components
  - Validate with multiple metrics
- Visualize Results
  - Plot clusters in 2D/3D space
  - Use dendrograms for hierarchical clustering
  - Show explained variance for PCA
  - Create t-SNE plots for high-dimensional data
- Validate Findings
  - Check if clusters make domain sense
  - Verify dimensionality reduction preserves information
  - Confirm association rules are actionable
  - Test anomaly detection on known outliers
- Consider Scalability
  - K-means: excellent for large datasets
  - DBSCAN: struggles with very large data
  - PCA: scales well
  - t-SNE: limited to moderate-sized datasets
- Interpret Results Carefully
  - Clustering assignments are not absolute truth
  - PCA components lose interpretability
  - Association rules need minimum support/confidence
  - Anomalies require domain validation
- Combine Techniques
  - Use PCA before clustering for high dimensions
  - Apply t-SNE for visualizing cluster results
  - Use clustering to identify segments, then association rules within segments
1.2. Common pitfalls
| Pitfall | Problem | Solution |
|---|---|---|
| Not scaling features | Distance-based methods fail | Always standardize for k-means, hierarchical, DBSCAN |
| Wrong k in k-means | Poor cluster quality | Use elbow method and silhouette scores |
| Using k-means for non-spherical clusters | Incorrect groupings | Try DBSCAN or hierarchical clustering |
| Ignoring explained variance in PCA | Information loss | Check cumulative variance, aim for 95%+ |
| Using t-SNE distances as meaningful | Misinterpretation | Only cluster shape matters, not inter-cluster distance |
| Fitting PCA on entire dataset | Data leakage | Fit on training data only, transform test data |
| Setting arbitrary DBSCAN parameters | Poor results | Use k-distance plots to determine eps |
| Treating cluster assignments as labels | Over-confidence | Remember clusters are exploratory |
| Too many PCA components | Defeats purpose | Select components explaining 90-95% variance |
| Using accuracy to evaluate clustering | Wrong metric | Use silhouette score, within-cluster SS |
2. Technique selection guide
| Technique | Purpose | Data Type | Computational Cost | Key Advantage |
|---|---|---|---|---|
| K-Means | Clustering | Numerical | Low | Fast, scalable, simple |
| Hierarchical | Clustering | Numerical | Medium | No need to specify k, dendrogram visualization |
| DBSCAN | Clustering | Numerical | Medium | Arbitrary shapes, handles noise |
| PCA | Dimensionality reduction | Numerical | Low | Unsupervised, preserves global variance |
| LDA | Dimensionality reduction | Numerical (labeled) | Low | Supervised, maximizes class separation |
| t-SNE | Visualization | Numerical | High | Excellent for visualization, preserves local structure |
| Apriori/Eclat | Association rules | Transactional | Medium | Finds item relationships |
| Isolation Forest | Anomaly detection | Numerical | Low | Fast, unsupervised outlier detection |
3. Clustering algorithms
Clustering groups similar data points together without predefined labels.
3.1. Algorithm comparison
| Algorithm | Cluster Shape | Number of Clusters | Handles Noise | Key Parameters |
|---|---|---|---|---|
| K-Means | Spherical | Must specify k | No | n_clusters, init, random_state |
| Hierarchical (Agglomerative) | Any | Cut dendrogram at desired level | Limited | n_clusters, linkage (ward, complete, average, single) |
| Hierarchical (Divisive) | Any | Split from top down | Limited | Split criterion, stopping condition |
| DBSCAN | Arbitrary shapes | Automatic | Yes (marks as noise) | eps (neighborhood radius), min_samples |
Algorithm details:
K-Means:
- Iteratively assigns points to nearest centroid
- Updates centroids as cluster means
- Fast and scalable
- Use elbow method or silhouette score to find optimal k
- Best for: well-separated, spherical clusters
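A minimal sketch of the elbow method and silhouette check described above, assuming a synthetic `make_blobs` dataset and an illustrative range of k values; swap in your own standardized feature matrix:

```python
# Sketch: compare candidate k values for K-Means using WCSS (inertia) and silhouette score.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)  # illustrative data
X = StandardScaler().fit_transform(X)  # standardize for distance-based clustering

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f"k={k}  WCSS={km.inertia_:.1f}  silhouette={silhouette_score(X, km.labels_):.3f}")
```

Plotting WCSS against k and picking the point where the curve flattens gives the elbow; the silhouette column provides a second opinion.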
Hierarchical Agglomerative:
- Starts with each point as cluster
- Progressively merges closest clusters
- Creates dendrogram showing hierarchy
- Linkage methods: ward (minimize variance), complete (max distance), average, single (min distance)
- Best for: understanding data hierarchy, unknown cluster count
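A minimal sketch pairing scipy's dendrogram with scikit-learn's AgglomerativeClustering, again on an illustrative `make_blobs` dataset:

```python
# Sketch: Ward-linkage hierarchy (dendrogram) plus flat cluster labels.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)  # illustrative data
X = StandardScaler().fit_transform(X)

Z = linkage(X, method="ward")               # hierarchical merge tree
dendrogram(Z, truncate_mode="level", p=4)   # inspect where to cut
plt.title("Ward dendrogram")
plt.show()

labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
```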
Hierarchical Divisive:
- Starts with all points in one cluster
- Recursively splits largest/most diverse cluster
- More computationally expensive
- Less common than agglomerative
- Best for: top-down decomposition needs
DBSCAN:
- Groups points based on density
- Core points: ≥ min_samples within eps radius
- Border points: within eps of core point
- Noise points: neither core nor border
- Best for: arbitrary shapes, noisy data, outlier detection
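A minimal sketch that uses a k-distance plot to suggest eps before running DBSCAN; the `make_moons` data and the eps and min_samples values are illustrative assumptions:

```python
# Sketch: pick eps from a k-distance plot, then cluster with DBSCAN.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)  # non-spherical shapes
X = StandardScaler().fit_transform(X)

k = 5  # illustrative min_samples
dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
plt.plot(np.sort(dists[:, -1]))  # sorted distance to the farthest of k nearest neighbors
plt.ylabel(f"distance to {k}th neighbor")  # the "knee" suggests eps
plt.show()

labels = DBSCAN(eps=0.3, min_samples=k).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "| noise points:", int((labels == -1).sum()))
```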
3.2. When to use each
| Scenario | Recommended Algorithm | Why |
|---|---|---|
| Spherical clusters | K-Means | Fast, efficient, optimal for spherical shapes |
| Unknown cluster count | Hierarchical, DBSCAN | Don’t require pre-specified k |
| Non-spherical clusters | DBSCAN | Handles arbitrary shapes |
| Need hierarchy | Hierarchical | Provides dendrogram for analysis |
| Large dataset | K-Means | Most scalable |
| Noisy data with outliers | DBSCAN | Explicitly identifies noise |
| Well-separated clusters | K-Means | Simple and effective |
| Varying densities | Hierarchical with appropriate linkage | More flexible than k-means |
| Need reproducibility | K-Means (with random_state) | Deterministic results |
4. Dimensionality reduction
Reduces feature count while preserving essential information.
4.1. Method comparison
| Method | Type | Supervised | Linear | Components | Best For |
|---|---|---|---|---|---|
| PCA | Feature extraction | No | Yes | Up to n features | General dimensionality reduction, preprocessing |
| LDA | Feature extraction | Yes | Yes | Up to k-1 (for k classes) | Classification preprocessing, class separation |
| t-SNE | Manifold learning (embedding only; no transform for new samples) | No | No | Typically 2-3 | Visualization of high-dimensional data |
Method details:
PCA (Principal Component Analysis):
- Finds directions of maximum variance
- Creates orthogonal (uncorrelated) components
- First components capture most information
- Use scree plot or cumulative explained variance
- Workflow: standardize → compute covariance → find eigenvectors → transform data
- Best practices: standardize data, keep 90-95% variance, check component loadings
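A minimal sketch, assuming the `load_wine` dataset, that standardizes, fits PCA on the training split only (avoiding leakage), and keeps enough components for roughly 95% variance:

```python
# Sketch: PCA with standardization and a 95%-variance component selection.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)            # illustrative dataset
X_train, X_test = train_test_split(X, random_state=42)

scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=0.95).fit(scaler.transform(X_train))  # keep ~95% of variance

print("components kept:", pca.n_components_)
print("cumulative variance:", np.cumsum(pca.explained_variance_ratio_))

X_test_reduced = pca.transform(scaler.transform(X_test))     # reuse the fitted objects
```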
LDA (Linear Discriminant Analysis):
- Maximizes between-class variance, minimizes within-class variance
- Requires class labels
- Limited to k-1 components for k classes
- More effective than PCA when classes are well-defined
- Workflow: calculate class means → compute scatter matrices → solve eigenvalue problem → project data
- Best practices: verify Gaussian assumption, check equal covariances, use with sufficient samples
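A minimal sketch on the same wine data (3 classes, so at most 2 discriminant components), showing LDA as a supervised projection and, optionally, as a classifier:

```python
# Sketch: LDA as supervised dimensionality reduction before classification.
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)  # illustrative labeled dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

scaler = StandardScaler().fit(X_train)
lda = LinearDiscriminantAnalysis(n_components=2).fit(scaler.transform(X_train), y_train)

X_train_2d = lda.transform(scaler.transform(X_train))  # class-separating 2D projection
print("test accuracy of LDA used directly as a classifier:",
      lda.score(scaler.transform(X_test), y_test))
```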
t-SNE (t-Distributed Stochastic Neighbor Embedding):
- Non-linear method preserving local structure
- Converts distances to probabilities
- Uses t-distribution in low-dimensional space
- Stochastic (different runs give different results)
- Key parameters: perplexity (5-50, balances local/global), learning_rate, n_iter
- Best practices: use subset for large data, run multiple times, don’t interpret inter-cluster distances
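A minimal sketch, assuming the `load_digits` dataset and an illustrative perplexity of 30; rerun with several perplexities and seeds before drawing conclusions:

```python
# Sketch: t-SNE for 2D visualization of high-dimensional data.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)  # ~1,800 samples, 64 features

emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=42).fit_transform(X)

plt.scatter(emb[:, 0], emb[:, 1], c=y, s=5, cmap="tab10")
plt.title("t-SNE embedding (distances between clusters are not meaningful)")
plt.show()
```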
4.2. Selection guidelines
Use PCA when:
- Unlabeled data
- General preprocessing needed
- Want interpretable linear combinations
- Need fast computation
- Working with any dataset size
Use LDA when:
- Have labeled data for classification
- Want to maximize class separation
- Preprocessing before classifier
- Classes are distinct
- Number of classes < number of features
Use t-SNE when:
- Need 2D/3D visualization
- Exploring cluster structure
- Presenting insights to stakeholders
- Local structure more important than global
- Dataset is moderate-sized (< 10,000 samples)
5. Association rules
Discovers relationships between items in transactional data (market basket analysis).
Key metrics:
| Metric | Formula | Interpretation |
|---|---|---|
| Support | Frequency(X) / Total transactions | How often itemset appears |
| Confidence | Frequency(X,Y) / Frequency(X) | How often rule is true |
| Lift | Support(X,Y) / (Support(X) × Support(Y)) | Association strength vs. independence |
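As a worked example with hypothetical numbers: out of 100 transactions, {bread} appears in 40, {butter} in 30, and {bread, butter} in 20. Then support(bread, butter) = 20/100 = 0.20, confidence(bread → butter) = 20/40 = 0.50, and lift = 0.20 / (0.40 × 0.30) ≈ 1.67, i.e. the two items co-occur more often than independence would predict.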
Algorithms:
| Algorithm | Approach | Best For |
|---|---|---|
| Apriori | Breadth-first search, uses Apriori property (subsets of frequent itemsets are frequent) | Sparse datasets, finding all frequent itemsets |
| Eclat | Depth-first search, vertical data format (TID lists) | Dense datasets, faster than Apriori |
Best practices:
- Set minimum support (e.g., 1%) to filter rare itemsets
- Set minimum confidence (e.g., 20%) for reliable rules
- Lift > 1 indicates positive association
- Validate rules with domain knowledge
- Consider computational cost for large datasets
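A minimal sketch with a hypothetical five-transaction basket, using mlxtend's apriori and association_rules; the thresholds are illustrative and minor keyword differences exist between mlxtend versions:

```python
# Sketch: frequent itemsets and rules with mlxtend (hypothetical transactions).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "milk", "eggs"],
    ["butter", "milk"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

frequent = apriori(onehot, min_support=0.2, use_colnames=True)              # support filter
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)  # confidence filter
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]]
      .sort_values("lift", ascending=False))
```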
Common applications:
- Retail: product recommendations, store layout
- Web: page navigation patterns, content suggestions
- Healthcare: symptom-treatment associations
- Finance: fraud pattern detection
6. Anomaly detection
Identifies rare observations that differ significantly from normal patterns.
Isolation Forest:
- Isolates anomalies using random partitioning
- Anomalies require fewer splits to isolate
- Fast and scalable
- No distance calculations needed
How it works:
- Build isolation trees with random feature splits
- Measure path length to isolate each point
- Shorter paths indicate anomalies
- Average across multiple trees
Key parameters:
- contamination: expected proportion of outliers ('auto' by default in scikit-learn)
- n_estimators: number of trees (default: 100)
- max_samples: samples drawn per tree (default: 'auto', i.e. min(256, n_samples))
Best practices:
- Set contamination based on domain knowledge
- Use sufficient estimators (100-300)
- Validate detected anomalies with domain experts
- Standardize features first
- Works well with high-dimensional data
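A minimal sketch on synthetic data with a few injected outliers; the contamination value of 0.02 is an illustrative assumption, not a recommendation:

```python
# Sketch: Isolation Forest on standardized features with injected outliers.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(500, 4)),   # normal observations
               rng.normal(6, 1, size=(10, 4))])   # a few injected outliers
X = StandardScaler().fit_transform(X)

iso = IsolationForest(n_estimators=200, contamination=0.02, random_state=42).fit(X)
labels = iso.predict(X)        # +1 = inlier, -1 = anomaly
scores = iso.score_samples(X)  # lower score = more anomalous
print("flagged anomalies:", int((labels == -1).sum()))
```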
Applications:
- Fraud detection in transactions
- Network intrusion detection
- Manufacturing defect identification
- Health monitoring (abnormal vitals)
7. Evaluation metrics
7.1. Clustering metrics
| Metric | Range | Best Value | Purpose |
|---|---|---|---|
| Silhouette Score | [-1, 1] | +1 | Measures cluster cohesion and separation |
| Within-Cluster Sum of Squares (WCSS) | [0, ∞) | Lower | Used in elbow method to find optimal k |
| Davies-Bouldin Index | [0, ∞) | Lower | Ratio of within-cluster to between-cluster distances |
| Calinski-Harabasz Index | [0, ∞) | Higher | Ratio of between-cluster to within-cluster dispersion |
Silhouette Score:
- $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$
- $a(i)$: average distance to points in same cluster
- $b(i)$: average distance to points in the nearest other cluster
- Score near +1: well-clustered
- Score near 0: on cluster boundary
- Score near -1: possibly wrong cluster
Elbow Method (WCSS):
- Plot WCSS vs. number of clusters
- Look for “elbow” where improvement diminishes
- Choose k at elbow point
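A minimal sketch comparing candidate k values with the internal metrics above, on an illustrative `make_blobs` dataset:

```python
# Sketch: silhouette, Davies-Bouldin, and Calinski-Harabasz across candidate k values.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=600, centers=5, random_state=0)  # illustrative data
X = StandardScaler().fit_transform(X)

for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}  silhouette={silhouette_score(X, labels):.3f}  "
          f"DB={davies_bouldin_score(X, labels):.3f}  "
          f"CH={calinski_harabasz_score(X, labels):.0f}")
```

Higher silhouette and Calinski-Harabasz values are better; lower Davies-Bouldin values are better.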
7.2. Dimensionality reduction metrics
| Metric | Application | Interpretation |
|---|---|---|
| Explained Variance Ratio | PCA | Proportion of variance captured per component |
| Cumulative Explained Variance | PCA | Total variance captured by selected components |
| Reconstruction Error | PCA | How well reduced data reconstructs original |
| Downstream Task Performance | All | Performance on classification/regression after reduction |
Best practices:
- Retain components explaining 90-95% variance
- Use scree plots to visualize explained variance
- Validate with downstream task performance
- Check if reduction improves or hurts model accuracy
8. Unsupervised learning workflow
1. Define Objective
- Clustering: segment data
- Dimensionality reduction: simplify/visualize
- Association rules: find patterns
- Anomaly detection: identify outliers
↓
2. Data Preprocessing
- Handle missing values
- Standardize features (for distance-based methods)
- Remove or flag known outliers (if not detecting)
↓
3. Exploratory Analysis
- Visualize data distribution
- Check correlations
- Identify potential number of clusters
↓
4. Apply Technique
Clustering:
- Try multiple algorithms
- Use elbow method for k
- Calculate silhouette scores
Dimensionality Reduction:
- Check explained variance
- Validate with downstream tasks
- Visualize results
Association Rules:
- Set support/confidence thresholds
- Generate rules
- Filter by lift
Anomaly Detection:
- Set contamination parameter
- Identify anomalies
- Validate with domain knowledge
↓
5. Evaluate Results
- Use appropriate metrics
- Visualize clusters/components
- Check domain validity
↓
6. Iterate and Refine
- Adjust parameters
- Try different algorithms
- Combine techniques
↓
7. Interpret and Apply
- Validate findings with stakeholders
- Document insights
- Use results for downstream tasks
Additional resources
Python libraries
- scikit-learn: Core unsupervised learning modules
  - cluster: KMeans, AgglomerativeClustering, DBSCAN
  - decomposition: PCA
  - discriminant_analysis: LinearDiscriminantAnalysis
  - manifold: TSNE
  - ensemble: IsolationForest
- mlxtend: Association rule mining
  - frequent_patterns.apriori: Apriori algorithm
  - frequent_patterns.association_rules: Generate rules
- pyECLAT: Eclat algorithm implementation
- scipy.cluster.hierarchy: Hierarchical clustering with dendrograms
Recommended reading
- Scikit-learn Clustering Guide: Comprehensive clustering documentation
- Scikit-learn Decomposition Guide: PCA and other decomposition methods
- “How to Use t-SNE Effectively”: Interactive guide to t-SNE
- “Introduction to Statistical Learning”: Chapter 12 on Unsupervised Learning