# Unsupervised Learning Reference

## Overview

Unsupervised learning discovers patterns in unlabeled data through clustering, dimensionality reduction, and density estimation.

## Clustering

### K-Means

**KMeans (`sklearn.cluster.KMeans`)**
- Partition-based clustering into K clusters
- Key parameters:
  - `n_clusters`: Number of clusters to form
  - `init`: Initialization method ('k-means++', 'random')
  - `n_init`: Number of initializations (default=10)
  - `max_iter`: Maximum iterations
- Use when: Know number of clusters, spherical cluster shapes
- Fast and scalable
- Example:

```python
from sklearn.cluster import KMeans

model = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
labels = model.fit_predict(X)
centers = model.cluster_centers_

# Inertia (sum of squared distances to nearest center)
print(f"Inertia: {model.inertia_}")
```

**MiniBatchKMeans**
- Faster K-Means using mini-batches
- Use when: Large datasets, need faster training
- Slightly less accurate than K-Means
- Example:

```python
from sklearn.cluster import MiniBatchKMeans

model = MiniBatchKMeans(n_clusters=3, batch_size=100, random_state=42)
labels = model.fit_predict(X)
```

### Density-Based Clustering

**DBSCAN (`sklearn.cluster.DBSCAN`)**
- Density-Based Spatial Clustering of Applications with Noise
- Key parameters:
  - `eps`: Maximum distance between two samples to be neighbors
  - `min_samples`: Minimum samples in a neighborhood to form a core point
  - `metric`: Distance metric
- Use when: Arbitrary cluster shapes, presence of noise/outliers
- Automatically determines number of clusters
- Labels noise points as -1
- Example:

```python
from sklearn.cluster import DBSCAN

model = DBSCAN(eps=0.5, min_samples=5, metric='euclidean')
labels = model.fit_predict(X)

# Number of clusters (excluding noise)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Clusters: {n_clusters}, Noise points: {n_noise}")
```

**HDBSCAN (`sklearn.cluster.HDBSCAN`)**
- Hierarchical DBSCAN with an adaptive epsilon (available in scikit-learn >= 1.3)
- More robust to parameter choice than DBSCAN, since no global `eps` is required
- Key parameter: `min_cluster_size`
- Use when: Clusters of varying density
- Example:

```python
from sklearn.cluster import HDBSCAN

model = HDBSCAN(min_cluster_size=10, min_samples=5)
labels = model.fit_predict(X)
```

**OPTICS (`sklearn.cluster.OPTICS`)**
- Ordering Points To Identify the Clustering Structure
- Similar to DBSCAN but doesn't require an `eps` parameter
- Key parameters: `min_samples`, `max_eps`
- Use when: Varying density, exploratory analysis
- Example:

```python
from sklearn.cluster import OPTICS

model = OPTICS(min_samples=5, max_eps=0.5)
labels = model.fit_predict(X)
```

### Hierarchical Clustering

**AgglomerativeClustering**
- Bottom-up hierarchical clustering
- Key parameters:
  - `n_clusters`: Number of clusters (or use `distance_threshold`; see the sketch after this entry)
  - `linkage`: 'ward', 'complete', 'average', 'single'
  - `metric`: Distance metric
- Use when: Need a dendrogram, hierarchical structure important
- Example:

```python
from sklearn.cluster import AgglomerativeClustering

model = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = model.fit_predict(X)

# Create a dendrogram using scipy
from scipy.cluster.hierarchy import dendrogram, linkage

Z = linkage(X, method='ward')
dendrogram(Z)
```
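If the number of clusters is unknown, the hierarchy can instead be cut at a distance. A minimal sketch, assuming `X` is already defined; the threshold value is illustrative:

```python
from sklearn.cluster import AgglomerativeClustering

# Cut the tree at a distance instead of requesting a fixed number of clusters
# (n_clusters must be None when distance_threshold is set)
model = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0, linkage='ward')
labels = model.fit_predict(X)
print(f"Clusters found: {model.n_clusters_}")
```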
### Other Clustering Methods

**MeanShift**
- Finds clusters by shifting points toward the mode of the density
- Automatically determines number of clusters
- Key parameter: `bandwidth`
- Use when: Don't know number of clusters, arbitrary shapes
- Example:

```python
from sklearn.cluster import MeanShift, estimate_bandwidth

# Estimate bandwidth from the data
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500)

model = MeanShift(bandwidth=bandwidth)
labels = model.fit_predict(X)
```

**SpectralClustering**
- Graph-based clustering using the eigenvalues of the affinity matrix
- Key parameters: `n_clusters`, `affinity` ('rbf', 'nearest_neighbors')
- Use when: Non-convex clusters, graph structure
- Example:

```python
from sklearn.cluster import SpectralClustering

model = SpectralClustering(n_clusters=3, affinity='rbf', random_state=42)
labels = model.fit_predict(X)
```

**AffinityPropagation**
- Finds exemplars by message passing
- Automatically determines number of clusters
- Key parameters: `damping`, `preference`
- Use when: Don't know number of clusters
- Example:

```python
from sklearn.cluster import AffinityPropagation

model = AffinityPropagation(damping=0.9, random_state=42)
labels = model.fit_predict(X)
n_clusters = len(model.cluster_centers_indices_)
```

**BIRCH**
- Balanced Iterative Reducing and Clustering using Hierarchies
- Memory-efficient for large datasets
- Key parameters: `n_clusters`, `threshold`, `branching_factor`
- Use when: Very large datasets
- Example:

```python
from sklearn.cluster import Birch

model = Birch(n_clusters=3, threshold=0.5)
labels = model.fit_predict(X)
```

### Clustering Evaluation

**Metrics when ground truth is known:**

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.metrics import adjusted_mutual_info_score, fowlkes_mallows_score

# Compare predicted labels with true labels
ari = adjusted_rand_score(y_true, y_pred)
nmi = normalized_mutual_info_score(y_true, y_pred)
ami = adjusted_mutual_info_score(y_true, y_pred)
fmi = fowlkes_mallows_score(y_true, y_pred)
```

**Metrics without ground truth:**

```python
from sklearn.metrics import silhouette_score, calinski_harabasz_score
from sklearn.metrics import davies_bouldin_score

# Silhouette: [-1, 1], higher is better
silhouette = silhouette_score(X, labels)

# Calinski-Harabasz: higher is better
ch_score = calinski_harabasz_score(X, labels)

# Davies-Bouldin: lower is better
db_score = davies_bouldin_score(X, labels)
```

**Elbow method for K-Means:**

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertias = []
K_range = range(2, 11)
for k in K_range:
    model = KMeans(n_clusters=k, random_state=42)
    model.fit(X)
    inertias.append(model.inertia_)

plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
```
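The elbow point is often ambiguous, so the silhouette score can serve as a complementary criterion for choosing K. A minimal sketch, assuming `X` is already defined:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Evaluate each candidate K by its average silhouette score
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(f"Best K by silhouette: {best_k} (score={scores[best_k]:.3f})")
```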
## Dimensionality Reduction

### Principal Component Analysis (PCA)

**PCA (`sklearn.decomposition.PCA`)**
- Linear dimensionality reduction using SVD
- Key parameters:
  - `n_components`: Number of components (int, or a float for target explained variance)
  - `whiten`: Whiten components to unit variance
- Use when: Linear relationships, want to explain variance
- Example:

```python
from sklearn.decomposition import PCA

# Keep components explaining 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(f"Original dimensions: {X.shape[1]}")
print(f"Reduced dimensions: {X_reduced.shape[1]}")
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {pca.explained_variance_ratio_.sum()}")

# Or specify an exact number of components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
```

**IncrementalPCA**
- PCA for large datasets that don't fit in memory
- Processes data in batches
- Key parameters: `n_components`, `batch_size`
- Example:

```python
from sklearn.decomposition import IncrementalPCA

pca = IncrementalPCA(n_components=50, batch_size=100)
X_reduced = pca.fit_transform(X)
```

**KernelPCA**
- Non-linear dimensionality reduction using kernels
- Key parameters: `n_components`, `kernel` ('linear', 'poly', 'rbf', 'sigmoid')
- Use when: Non-linear relationships
- Example:

```python
from sklearn.decomposition import KernelPCA

pca = KernelPCA(n_components=2, kernel='rbf', gamma=0.1)
X_reduced = pca.fit_transform(X)
```

### Manifold Learning

**t-SNE (`sklearn.manifold.TSNE`)**
- t-distributed Stochastic Neighbor Embedding
- Excellent for 2D/3D visualization
- Key parameters:
  - `n_components`: Usually 2 or 3
  - `perplexity`: Balance between local and global structure (5-50)
  - `learning_rate`: Usually 10-1000
  - `n_iter`: Number of iterations (minimum 250; renamed `max_iter` in recent scikit-learn releases)
- Use when: Visualizing high-dimensional data
- Note: Slow on large datasets, no transform() method
- Example:

```python
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, learning_rate=200, random_state=42)
X_embedded = tsne.fit_transform(X)

# Visualize, coloring by cluster labels from a previous step
import matplotlib.pyplot as plt
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=labels, cmap='viridis')
plt.title('t-SNE visualization')
```

**UMAP (not in scikit-learn, but compatible)**
- Uniform Manifold Approximation and Projection
- Faster than t-SNE, preserves global structure better
- Install: `uv pip install umap-learn`
- Example:

```python
from umap import UMAP

reducer = UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_embedded = reducer.fit_transform(X)
```

**Isomap**
- Isometric Mapping
- Preserves geodesic distances
- Key parameters: `n_components`, `n_neighbors`
- Use when: Non-linear manifolds
- Example:

```python
from sklearn.manifold import Isomap

isomap = Isomap(n_components=2, n_neighbors=5)
X_embedded = isomap.fit_transform(X)
```

**Locally Linear Embedding (LLE)**
- Preserves local neighborhood structure
- Key parameters: `n_components`, `n_neighbors`
- Example:

```python
from sklearn.manifold import LocallyLinearEmbedding

lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
X_embedded = lle.fit_transform(X)
```

**MDS (Multidimensional Scaling)**
- Preserves pairwise distances
- Key parameters: `n_components`, `metric` (True/False)
- Example:

```python
from sklearn.manifold import MDS

mds = MDS(n_components=2, metric=True, random_state=42)
X_embedded = mds.fit_transform(X)
```

### Matrix Factorization

**NMF (Non-negative Matrix Factorization)**
- Factorizes into non-negative matrices
- Key parameters: `n_components`, `init` ('nndsvd', 'random')
- Use when: Data is non-negative (images, text)
- Interpretable components
- Example:

```python
from sklearn.decomposition import NMF

nmf = NMF(n_components=10, init='nndsvd', random_state=42)
W = nmf.fit_transform(X)  # e.g. document-topic matrix for text data
H = nmf.components_       # e.g. topic-word matrix
```

**TruncatedSVD**
- SVD for sparse matrices
- Similar to PCA but works with sparse data
- Use when: Text data, sparse matrices
- Example:

```python
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=100, random_state=42)
X_reduced = svd.fit_transform(X_sparse)
print(f"Explained variance: {svd.explained_variance_ratio_.sum()}")
```
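A common use of TruncatedSVD is latent semantic analysis (LSA) on a TF-IDF matrix. A minimal sketch; the sample documents and pipeline steps are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors worry about the markets",
]

# TF-IDF -> truncated SVD -> length normalization (classic LSA setup)
lsa = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=42),
    Normalizer(copy=False),
)
X_lsa = lsa.fit_transform(docs)
print(X_lsa.shape)  # (n_documents, 2)
```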
**FastICA**
- Independent Component Analysis
- Separates a multivariate signal into independent components
- Key parameter: `n_components`
- Use when: Signal separation (e.g., audio, EEG)
- Example:

```python
from sklearn.decomposition import FastICA

ica = FastICA(n_components=10, random_state=42)
S = ica.fit_transform(X)  # Independent sources
A = ica.mixing_           # Mixing matrix
```

**LatentDirichletAllocation (LDA)**
- Topic modeling for text data
- Key parameters: `n_components` (number of topics), `learning_method` ('batch', 'online')
- Use when: Topic modeling, document clustering
- Example:

```python
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=10, random_state=42)
doc_topics = lda.fit_transform(X_counts)  # Document-topic distribution

# Get top words for each topic
# (`vectorizer` is the fitted CountVectorizer that produced X_counts)
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_words = [feature_names[i] for i in topic.argsort()[-10:]]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")
```

## Outlier and Novelty Detection

### Outlier Detection

**IsolationForest**
- Isolates anomalies using random trees
- Key parameters:
  - `contamination`: Expected proportion of outliers
  - `n_estimators`: Number of trees
- Use when: High-dimensional data, efficiency important
- Example:

```python
from sklearn.ensemble import IsolationForest

model = IsolationForest(contamination=0.1, random_state=42)
predictions = model.fit_predict(X)  # -1 for outliers, 1 for inliers
```

**LocalOutlierFactor**
- Measures local density deviation
- Key parameters: `n_neighbors`, `contamination`
- Use when: Varying density regions
- Example:

```python
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
predictions = lof.fit_predict(X)  # -1 for outliers, 1 for inliers
outlier_scores = lof.negative_outlier_factor_
```

**One-Class SVM**
- Learns a decision boundary around normal data
- Key parameters: `nu` (upper bound on the fraction of training errors/outliers), `kernel`, `gamma`
- Use when: Small training set of normal data (novelty detection)
- Example:

```python
from sklearn.svm import OneClassSVM

model = OneClassSVM(nu=0.1, kernel='rbf', gamma='auto')
model.fit(X_train)
predictions = model.predict(X_test)  # -1 for outliers, 1 for inliers
```

**EllipticEnvelope**
- Assumes a Gaussian distribution
- Key parameter: `contamination`
- Use when: Data is Gaussian-distributed
- Example:

```python
from sklearn.covariance import EllipticEnvelope

model = EllipticEnvelope(contamination=0.1, random_state=42)
predictions = model.fit_predict(X)
```

## Gaussian Mixture Models

**GaussianMixture**
- Probabilistic clustering with a mixture of Gaussians
- Key parameters:
  - `n_components`: Number of mixture components
  - `covariance_type`: 'full', 'tied', 'diag', 'spherical'
- Use when: Soft clustering, need probability estimates
- Example:

```python
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm.fit(X)

# Predict cluster labels
labels = gmm.predict(X)

# Get the probability of each cluster
probabilities = gmm.predict_proba(X)

# Information criteria for model selection
print(f"BIC: {gmm.bic(X)}")  # Lower is better
print(f"AIC: {gmm.aic(X)}")  # Lower is better
```
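BIC can also drive the choice of the number of mixture components. A minimal sketch, assuming `X` is already defined; the candidate range is illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fit one GMM per candidate component count and keep the BIC of each
candidates = range(1, 8)
bics = []
for k in candidates:
    gmm = GaussianMixture(n_components=k, covariance_type='full', random_state=42).fit(X)
    bics.append(gmm.bic(X))

best_k = candidates[int(np.argmin(bics))]  # lowest BIC wins
print(f"Best number of components by BIC: {best_k}")
```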
## Choosing the Right Method

### Clustering:
- **Know K, spherical clusters**: K-Means
- **Arbitrary shapes, noise**: DBSCAN, HDBSCAN
- **Hierarchical structure**: AgglomerativeClustering
- **Very large data**: MiniBatchKMeans, BIRCH
- **Probabilistic**: GaussianMixture

### Dimensionality Reduction:
- **Linear, variance explanation**: PCA
- **Non-linear, visualization**: t-SNE, UMAP
- **Non-negative data**: NMF
- **Sparse data**: TruncatedSVD
- **Topic modeling**: LatentDirichletAllocation

### Outlier Detection:
- **High-dimensional**: IsolationForest
- **Varying density**: LocalOutlierFactor
- **Gaussian data**: EllipticEnvelope
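When the right clustering method is not obvious from these guidelines, it can help to run a few candidates on the same standardized data and compare an internal metric. A minimal sketch, assuming `X` is already defined; the parameter values are illustrative:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score

X_scaled = StandardScaler().fit_transform(X)

candidates = {
    "kmeans": KMeans(n_clusters=3, random_state=42),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
    "dbscan": DBSCAN(eps=0.5, min_samples=5),
}

for name, model in candidates.items():
    labels = model.fit_predict(X_scaled)
    if len(set(labels)) > 1:
        # Note: DBSCAN noise points (-1) count as one extra "cluster" here
        print(f"{name}: silhouette = {silhouette_score(X_scaled, labels):.3f}")
    else:
        print(f"{name}: only one cluster found, silhouette undefined")
```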