Unsupervised learning overview
Unsupervised learning discovers hidden patterns and structures in unlabeled data without predefined target variables. This guide covers clustering, dimensionality reduction, association rules, and anomaly detection techniques essential for exploratory analysis and feature engineering.
Table of contents
- Best practices
- Technique selection guide
- Clustering algorithms
- Dimensionality reduction
- Association rules
- Anomaly detection
- Evaluation metrics
- Unsupervised learning workflow
- Additional resources
1. Best practices
1.1. General guidelines
- Understand Your Goal
  - Clustering: group similar data points
  - Dimensionality reduction: reduce features, visualize
  - Association rules: find item relationships
  - Anomaly detection: identify outliers
- Always Preprocess Data
  - Standardize features for distance-based methods
  - Handle missing values appropriately
  - Remove or handle outliers (unless detecting them)
  - Scale features to similar ranges
- Determine Optimal Parameters
  - Use elbow method for k-means
  - Check silhouette scores for cluster quality
  - Examine scree plots for PCA components
  - Validate with multiple metrics
- Visualize Results
  - Plot clusters in 2D/3D space
  - Use dendrograms for hierarchical clustering
  - Show explained variance for PCA
  - Create t-SNE plots for high-dimensional data
- Validate Findings
  - Check if clusters make domain sense
  - Verify dimensionality reduction preserves information
  - Confirm association rules are actionable
  - Test anomaly detection on known outliers
- Consider Scalability
  - K-means: excellent for large datasets
  - DBSCAN: struggles with very large data
  - PCA: scales well
  - t-SNE: limited to moderate-sized datasets
- Interpret Results Carefully
  - Clustering assignments are not absolute truth
  - PCA components lose interpretability
  - Association rules need minimum support/confidence
  - Anomalies require domain validation
- Combine Techniques
  - Use PCA before clustering for high dimensions
  - Apply t-SNE for visualizing cluster results
  - Use clustering to identify segments, then association rules within segments
1.2. Common pitfalls
| Pitfall | Problem | Solution |
|---|---|---|
| Not scaling features | Distance-based methods fail | Always standardize for k-means, hierarchical, DBSCAN |
| Wrong k in k-means | Poor cluster quality | Use elbow method and silhouette scores |
| Using k-means for non-spherical clusters | Incorrect groupings | Try DBSCAN or hierarchical clustering |
| Ignoring explained variance in PCA | Information loss | Check cumulative variance, aim for 95%+ |
| Using t-SNE distances as meaningful | Misinterpretation | Only cluster shape matters, not inter-cluster distance |
| Fitting PCA on entire dataset | Data leakage | Fit on training data only, transform test data |
| Setting arbitrary DBSCAN parameters | Poor results | Use k-distance plots to determine eps |
| Treating cluster assignments as labels | Over-confidence | Remember clusters are exploratory |
| Too many PCA components | Defeats purpose | Select components explaining 90-95% variance |
| Using accuracy to evaluate clustering | Wrong metric | Use silhouette score, within-cluster SS |
2. Technique selection guide
| Technique | Purpose | Data Type | Computational Cost | Key Advantage |
|---|---|---|---|---|
| K-Means | Clustering | Numerical | Low | Fast, scalable, simple |
| Hierarchical | Clustering | Numerical | Medium | No need to specify k, dendrogram visualization |
| DBSCAN | Clustering | Numerical | Medium | Arbitrary shapes, handles noise |
| PCA | Dimensionality reduction | Numerical | Low | Unsupervised, preserves global variance |
| LDA | Dimensionality reduction | Numerical (labeled) | Low | Supervised, maximizes class separation |
| t-SNE | Visualization | Numerical | High | Excellent for visualization, preserves local structure |
| Apriori/Eclat | Association rules | Transactional | Medium | Finds item relationships |
| Isolation Forest | Anomaly detection | Numerical | Low | Fast, unsupervised outlier detection |
3. Clustering algorithms
Clustering groups similar data points together without predefined labels.
3.1. Algorithm comparison
| Algorithm | Cluster Shape | Number of Clusters | Handles Noise | Key Parameters |
|---|---|---|---|---|
| K-Means | Spherical | Must specify k | No | n_clusters, init, random_state |
| Hierarchical (Agglomerative) | Any | Cut dendrogram at desired level | Limited | n_clusters, linkage (ward, complete, average, single) |
| Hierarchical (Divisive) | Any | Split from top down | Limited | Split criterion, stopping condition |
| DBSCAN | Arbitrary shapes | Automatic | Yes (marks as noise) | eps (neighborhood radius), min_samples |
Algorithm details:
K-Means:
- Iteratively assigns points to nearest centroid
- Updates centroids as cluster means
- Fast and scalable
- Use elbow method or silhouette score to find optimal k
- Best for: well-separated, spherical clusters
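A minimal sketch of the elbow method and silhouette check described above, assuming a synthetic `make_blobs` dataset and an illustrative range of k values; swap in your own standardized feature matrix:

```python
# Sketch: compare candidate k values for K-Means using WCSS (inertia) and silhouette score.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)  # illustrative data
X = StandardScaler().fit_transform(X)  # standardize for distance-based clustering

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f"k={k}  WCSS={km.inertia_:.1f}  silhouette={silhouette_score(X, km.labels_):.3f}")
```

Plotting WCSS against k and picking the point where the curve flattens gives the elbow; the silhouette column provides a second opinion.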
Hierarchical Agglomerative:
- Starts with each point as cluster
- Progressively merges closest clusters
- Creates dendrogram showing hierarchy
- Linkage methods: ward (minimize variance), complete (max distance), average, single (min distance)
- Best for: understanding data hierarchy, unknown cluster count
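A minimal sketch pairing scipy's dendrogram with scikit-learn's AgglomerativeClustering, again on an illustrative `make_blobs` dataset:

```python
# Sketch: Ward-linkage hierarchy (dendrogram) plus flat cluster labels.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)  # illustrative data
X = StandardScaler().fit_transform(X)

Z = linkage(X, method="ward")               # hierarchical merge tree
dendrogram(Z, truncate_mode="level", p=4)   # inspect where to cut
plt.title("Ward dendrogram")
plt.show()

labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
```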
Hierarchical Divisive:
- Starts with all points in one cluster
- Recursively splits largest/most diverse cluster
- More computationally expensive
- Less common than agglomerative
- Best for: top-down decomposition needs
DBSCAN:
- Groups points based on density
- Core points: ≥ min_samples within eps radius
- Border points: within eps of core point
- Noise points: neither core nor border
- Best for: arbitrary shapes, noisy data, outlier detection
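A minimal sketch that uses a k-distance plot to suggest eps before running DBSCAN; the `make_moons` data and the eps and min_samples values are illustrative assumptions:

```python
# Sketch: pick eps from a k-distance plot, then cluster with DBSCAN.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)  # non-spherical shapes
X = StandardScaler().fit_transform(X)

k = 5  # illustrative min_samples
dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
plt.plot(np.sort(dists[:, -1]))  # sorted distance to the farthest of k nearest neighbors
plt.ylabel(f"distance to {k}th neighbor")  # the "knee" suggests eps
plt.show()

labels = DBSCAN(eps=0.3, min_samples=k).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "| noise points:", int((labels == -1).sum()))
```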
3.2. When to use each
| Scenario | Recommended Algorithm | Why |
|---|---|---|
| Spherical clusters | K-Means | Fast, efficient, optimal for spherical shapes |
| Unknown cluster count | Hierarchical, DBSCAN | Don’t require pre-specified k |
| Non-spherical clusters | DBSCAN | Handles arbitrary shapes |
| Need hierarchy | Hierarchical | Provides dendrogram for analysis |
| Large dataset | K-Means | Most scalable |
| Noisy data with outliers | DBSCAN | Explicitly identifies noise |
| Well-separated clusters | K-Means | Simple and effective |
| Varying densities | Hierarchical with appropriate linkage | More flexible than k-means |
| Need reproducibility | K-Means (with random_state) | Deterministic results |
4. Dimensionality reduction
Reduces feature count while preserving essential information.
4.1. Method comparison
| Method | Type | Supervised | Linear | Components | Best For |
|---|---|---|---|---|---|
| PCA | Feature extraction | No | Yes | Up to n features | General dimensionality reduction, preprocessing |
| LDA | Feature extraction | Yes | Yes | Up to k-1 (for k classes) | Classification preprocessing, class separation |
| t-SNE | Manifold learning (embedding only; no transform for new samples) | No | No | Typically 2-3 | Visualization of high-dimensional data |
Method details:
PCA (Principal Component Analysis):
- Finds directions of maximum variance
- Creates orthogonal (uncorrelated) components
- First components capture most information
- Use scree plot or cumulative explained variance
- Workflow: standardize → compute covariance → find eigenvectors → transform data
- Best practices: standardize data, keep 90-95% variance, check component loadings
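A minimal sketch, assuming the `load_wine` dataset, that standardizes, fits PCA on the training split only (avoiding leakage), and keeps enough components for roughly 95% variance:

```python
# Sketch: PCA with standardization and a 95%-variance component selection.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)            # illustrative dataset
X_train, X_test = train_test_split(X, random_state=42)

scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=0.95).fit(scaler.transform(X_train))  # keep ~95% of variance

print("components kept:", pca.n_components_)
print("cumulative variance:", np.cumsum(pca.explained_variance_ratio_))

X_test_reduced = pca.transform(scaler.transform(X_test))     # reuse the fitted objects
```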
LDA (Linear Discriminant Analysis):
- Maximizes between-class variance, minimizes within-class variance
- Requires class labels
- Limited to k-1 components for k classes
- More effective than PCA when classes are well-defined
- Workflow: calculate class means → compute scatter matrices → solve eigenvalue problem → project data
- Best practices: verify Gaussian assumption, check equal covariances, use with sufficient samples
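A minimal sketch on the same wine data (3 classes, so at most 2 discriminant components), showing LDA as a supervised projection and, optionally, as a classifier:

```python
# Sketch: LDA as supervised dimensionality reduction before classification.
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)  # illustrative labeled dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

scaler = StandardScaler().fit(X_train)
lda = LinearDiscriminantAnalysis(n_components=2).fit(scaler.transform(X_train), y_train)

X_train_2d = lda.transform(scaler.transform(X_train))  # class-separating 2D projection
print("test accuracy of LDA used directly as a classifier:",
      lda.score(scaler.transform(X_test), y_test))
```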
t-SNE (t-Distributed Stochastic Neighbor Embedding):
- Non-linear method preserving local structure
- Converts distances to probabilities
- Uses t-distribution in low-dimensional space
- Stochastic (different runs give different results)
- Key parameters: perplexity (5-50, balances local/global), learning_rate, n_iter
- Best practices: use subset for large data, run multiple times, don’t interpret inter-cluster distances
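A minimal sketch, assuming the `load_digits` dataset and an illustrative perplexity of 30; rerun with several perplexities and seeds before drawing conclusions:

```python
# Sketch: t-SNE for 2D visualization of high-dimensional data.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)  # ~1,800 samples, 64 features

emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=42).fit_transform(X)

plt.scatter(emb[:, 0], emb[:, 1], c=y, s=5, cmap="tab10")
plt.title("t-SNE embedding (distances between clusters are not meaningful)")
plt.show()
```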
4.2. Selection guidelines
Use PCA when:
- Unlabeled data
- General preprocessing needed
- Want interpretable linear combinations
- Need fast computation
- Working with any dataset size
Use LDA when:
- Have labeled data for classification
- Want to maximize class separation
- Preprocessing before classifier
- Classes are distinct
- Number of classes < number of features
Use t-SNE when:
- Need 2D/3D visualization
- Exploring cluster structure
- Presenting insights to stakeholders
- Local structure more important than global
- Dataset is moderate-sized (< 10,000 samples)
5. Association rules
Discovers relationships between items in transactional data (market basket analysis).
Key metrics:
| Metric | Formula | Interpretation |
|---|---|---|
| Support | Frequency(X) / Total transactions | How often itemset appears |
| Confidence | Frequency(X,Y) / Frequency(X) | How often rule is true |
| Lift | Support(X,Y) / (Support(X) × Support(Y)) | Association strength vs. independence |
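As a worked example with hypothetical numbers: out of 100 transactions, {bread} appears in 40, {butter} in 30, and {bread, butter} in 20. Then support(bread, butter) = 20/100 = 0.20, confidence(bread → butter) = 20/40 = 0.50, and lift = 0.20 / (0.40 × 0.30) ≈ 1.67, i.e. the two items co-occur more often than independence would predict.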
Algorithms:
| Algorithm | Approach | Best For |
|---|---|---|
| Apriori | Breadth-first search, uses Apriori property (subsets of frequent itemsets are frequent) | Sparse datasets, finding all frequent itemsets |
| Eclat | Depth-first search, vertical data format (TID lists) | Dense datasets, faster than Apriori |
Best practices:
- Set minimum support (e.g., 1%) to filter rare itemsets
- Set minimum confidence (e.g., 20%) for reliable rules
- Lift > 1 indicates positive association
- Validate rules with domain knowledge
- Consider computational cost for large datasets
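A minimal sketch with a hypothetical five-transaction basket, using mlxtend's apriori and association_rules; the thresholds are illustrative and minor keyword differences exist between mlxtend versions:

```python
# Sketch: frequent itemsets and rules with mlxtend (hypothetical transactions).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "milk", "eggs"],
    ["butter", "milk"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

frequent = apriori(onehot, min_support=0.2, use_colnames=True)              # support filter
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)  # confidence filter
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]]
      .sort_values("lift", ascending=False))
```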
Common applications:
- Retail: product recommendations, store layout
- Web: page navigation patterns, content suggestions
- Healthcare: symptom-treatment associations
- Finance: fraud pattern detection
6. Anomaly detection
Identifies rare observations that differ significantly from normal patterns.
Isolation Forest:
- Isolates anomalies using random partitioning
- Anomalies require fewer splits to isolate
- Fast and scalable
- No distance calculations needed
How it works:
- Build isolation trees with random feature splits
- Measure path length to isolate each point
- Shorter paths indicate anomalies
- Average across multiple trees
Key parameters:
- contamination: expected proportion of outliers ('auto' by default in scikit-learn)
- n_estimators: number of trees (default: 100)
- max_samples: samples drawn per tree (default: 'auto', i.e. min(256, n_samples))
Best practices:
- Set contamination based on domain knowledge
- Use sufficient estimators (100-300)
- Validate detected anomalies with domain experts
- Standardize features first
- Works well with high-dimensional data
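A minimal sketch on synthetic data with a few injected outliers; the contamination value of 0.02 is an illustrative assumption, not a recommendation:

```python
# Sketch: Isolation Forest on standardized features with injected outliers.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(500, 4)),   # normal observations
               rng.normal(6, 1, size=(10, 4))])   # a few injected outliers
X = StandardScaler().fit_transform(X)

iso = IsolationForest(n_estimators=200, contamination=0.02, random_state=42).fit(X)
labels = iso.predict(X)        # +1 = inlier, -1 = anomaly
scores = iso.score_samples(X)  # lower score = more anomalous
print("flagged anomalies:", int((labels == -1).sum()))
```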
Applications:
- Fraud detection in transactions
- Network intrusion detection
- Manufacturing defect identification
- Health monitoring (abnormal vitals)
7. Evaluation metrics
7.1. Clustering metrics
| Metric | Range | Best Value | Purpose |
|---|---|---|---|
| Silhouette Score | [-1, 1] | +1 | Measures cluster cohesion and separation |
| Within-Cluster Sum of Squares (WCSS) | [0, ∞) | Lower | Used in elbow method to find optimal k |
| Davies-Bouldin Index | [0, ∞) | Lower | Ratio of within-cluster to between-cluster distances |
| Calinski-Harabasz Index | [0, ∞) | Higher | Ratio of between-cluster to within-cluster dispersion |
Silhouette Score:
- $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$
- $a(i)$: average distance to points in same cluster
- $b(i)$: average distance to points in the nearest other cluster
- Score near +1: well-clustered
- Score near 0: on cluster boundary
- Score near -1: possibly wrong cluster
Elbow Method (WCSS):
- Plot WCSS vs. number of clusters
- Look for “elbow” where improvement diminishes
- Choose k at elbow point
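A minimal sketch comparing candidate k values with the internal metrics above, on an illustrative `make_blobs` dataset:

```python
# Sketch: silhouette, Davies-Bouldin, and Calinski-Harabasz across candidate k values.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=600, centers=5, random_state=0)  # illustrative data
X = StandardScaler().fit_transform(X)

for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}  silhouette={silhouette_score(X, labels):.3f}  "
          f"DB={davies_bouldin_score(X, labels):.3f}  "
          f"CH={calinski_harabasz_score(X, labels):.0f}")
```

Higher silhouette and Calinski-Harabasz values are better; lower Davies-Bouldin values are better.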
7.2. Dimensionality reduction metrics
| Metric | Application | Interpretation |
|---|---|---|
| Explained Variance Ratio | PCA | Proportion of variance captured per component |
| Cumulative Explained Variance | PCA | Total variance captured by selected components |
| Reconstruction Error | PCA | How well reduced data reconstructs original |
| Downstream Task Performance | All | Performance on classification/regression after reduction |
Best practices:
- Retain components explaining 90-95% variance
- Use scree plots to visualize explained variance
- Validate with downstream task performance
- Check if reduction improves or hurts model accuracy
8. Unsupervised learning workflow
1. Define Objective
- Clustering: segment data
- Dimensionality reduction: simplify/visualize
- Association rules: find patterns
- Anomaly detection: identify outliers
↓
2. Data Preprocessing
- Handle missing values
- Standardize features (for distance-based methods)
- Remove or flag known outliers (if not detecting)
↓
3. Exploratory Analysis
- Visualize data distribution
- Check correlations
- Identify potential number of clusters
↓
4. Apply Technique
Clustering:
- Try multiple algorithms
- Use elbow method for k
- Calculate silhouette scores
Dimensionality Reduction:
- Check explained variance
- Validate with downstream tasks
- Visualize results
Association Rules:
- Set support/confidence thresholds
- Generate rules
- Filter by lift
Anomaly Detection:
- Set contamination parameter
- Identify anomalies
- Validate with domain knowledge
↓
5. Evaluate Results
- Use appropriate metrics
- Visualize clusters/components
- Check domain validity
↓
6. Iterate and Refine
- Adjust parameters
- Try different algorithms
- Combine techniques
↓
7. Interpret and Apply
- Validate findings with stakeholders
- Document insights
- Use results for downstream tasks
Additional resources
Python libraries
- scikit-learn: Core unsupervised learning modules
  - cluster: KMeans, AgglomerativeClustering, DBSCAN
  - decomposition: PCA
  - discriminant_analysis: LinearDiscriminantAnalysis
  - manifold: TSNE
  - ensemble: IsolationForest
- mlxtend: Association rule mining
  - frequent_patterns.apriori: Apriori algorithm
  - frequent_patterns.association_rules: Generate rules
- pyECLAT: Eclat algorithm implementation
- scipy.cluster.hierarchy: Hierarchical clustering with dendrograms
Recommended reading
- Scikit-learn Clustering Guide: Comprehensive clustering documentation
- Scikit-learn Decomposition Guide: PCA and other decomposition methods
- “How to Use t-SNE Effectively”: Interactive guide to t-SNE
- “Introduction to Statistical Learning”: Chapter 12 on Unsupervised Learning