Initial commit
This commit is contained in:
592
skills/scikit-learn/references/model_evaluation.md
Normal file
592
skills/scikit-learn/references/model_evaluation.md
Normal file
@@ -0,0 +1,592 @@
|
||||
# Model Selection and Evaluation Reference
|
||||
|
||||
## Overview
|
||||
|
||||
Comprehensive guide for evaluating models, tuning hyperparameters, and selecting the best model using scikit-learn's model selection tools.
|
||||
|
||||
## Train-Test Split
|
||||
|
||||
### Basic Splitting
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import train_test_split
|
||||
|
||||
# Basic split (default 75/25)
|
||||
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
|
||||
|
||||
# With stratification (preserves class distribution)
|
||||
X_train, X_test, y_train, y_test = train_test_split(
|
||||
X, y, test_size=0.25, stratify=y, random_state=42
|
||||
)
|
||||
|
||||
# Three-way split (train/val/test)
|
||||
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
|
||||
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
|
||||
```
|
||||
|
||||
## Cross-Validation
|
||||
|
||||
### Cross-Validation Strategies
|
||||
|
||||
**KFold**
|
||||
- Standard k-fold cross-validation
|
||||
- Splits data into k consecutive folds
|
||||
```python
|
||||
from sklearn.model_selection import KFold
|
||||
|
||||
kf = KFold(n_splits=5, shuffle=True, random_state=42)
|
||||
for train_idx, val_idx in kf.split(X):
|
||||
X_train, X_val = X[train_idx], X[val_idx]
|
||||
y_train, y_val = y[train_idx], y[val_idx]
|
||||
```
|
||||
|
||||
**StratifiedKFold**
|
||||
- Preserves class distribution in each fold
|
||||
- Use for imbalanced classification
|
||||
```python
|
||||
from sklearn.model_selection import StratifiedKFold
|
||||
|
||||
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
|
||||
for train_idx, val_idx in skf.split(X, y):
|
||||
X_train, X_val = X[train_idx], X[val_idx]
|
||||
y_train, y_val = y[train_idx], y[val_idx]
|
||||
```
|
||||
|
||||
**TimeSeriesSplit**
|
||||
- For time series data
|
||||
- Respects temporal order
|
||||
```python
|
||||
from sklearn.model_selection import TimeSeriesSplit
|
||||
|
||||
tscv = TimeSeriesSplit(n_splits=5)
|
||||
for train_idx, val_idx in tscv.split(X):
|
||||
X_train, X_val = X[train_idx], X[val_idx]
|
||||
y_train, y_val = y[train_idx], y[val_idx]
|
||||
```
|
||||
|
||||
**GroupKFold**
|
||||
- Ensures samples from same group don't appear in both train and validation
|
||||
- Use when samples are not independent
|
||||
```python
|
||||
from sklearn.model_selection import GroupKFold
|
||||
|
||||
gkf = GroupKFold(n_splits=5)
|
||||
for train_idx, val_idx in gkf.split(X, y, groups=group_ids):
|
||||
X_train, X_val = X[train_idx], X[val_idx]
|
||||
y_train, y_val = y[train_idx], y[val_idx]
|
||||
```
|
||||
|
||||
**LeaveOneOut (LOO)**
|
||||
- Each sample used as validation set once
|
||||
- Use for very small datasets
|
||||
- Computationally expensive
|
||||
```python
|
||||
from sklearn.model_selection import LeaveOneOut
|
||||
|
||||
loo = LeaveOneOut()
|
||||
for train_idx, val_idx in loo.split(X):
|
||||
X_train, X_val = X[train_idx], X[val_idx]
|
||||
y_train, y_val = y[train_idx], y[val_idx]
|
||||
```
|
||||
|
||||
### Cross-Validation Functions
|
||||
|
||||
**cross_val_score**
|
||||
- Evaluate model using cross-validation
|
||||
- Returns array of scores
|
||||
```python
|
||||
from sklearn.model_selection import cross_val_score
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
|
||||
model = RandomForestClassifier(n_estimators=100, random_state=42)
|
||||
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
|
||||
|
||||
print(f"Scores: {scores}")
|
||||
print(f"Mean: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
|
||||
```
|
||||
|
||||
**cross_validate**
|
||||
- More comprehensive than cross_val_score
|
||||
- Can return multiple metrics and fit times
|
||||
```python
|
||||
from sklearn.model_selection import cross_validate
|
||||
|
||||
model = RandomForestClassifier(n_estimators=100, random_state=42)
|
||||
cv_results = cross_validate(
|
||||
model, X, y, cv=5,
|
||||
scoring=['accuracy', 'precision', 'recall', 'f1'],
|
||||
return_train_score=True,
|
||||
return_estimator=True # Returns fitted estimators
|
||||
)
|
||||
|
||||
print(f"Test accuracy: {cv_results['test_accuracy'].mean():.3f}")
|
||||
print(f"Test precision: {cv_results['test_precision'].mean():.3f}")
|
||||
print(f"Fit time: {cv_results['fit_time'].mean():.3f}s")
|
||||
```
|
||||
|
||||
**cross_val_predict**
|
||||
- Get predictions for each sample when it was in validation set
|
||||
- Useful for analyzing errors
|
||||
```python
|
||||
from sklearn.model_selection import cross_val_predict
|
||||
|
||||
model = RandomForestClassifier(n_estimators=100, random_state=42)
|
||||
y_pred = cross_val_predict(model, X, y, cv=5)
|
||||
|
||||
# Now can analyze predictions vs actual
|
||||
from sklearn.metrics import confusion_matrix
|
||||
cm = confusion_matrix(y, y_pred)
|
||||
```
|
||||
|
||||
## Hyperparameter Tuning
|
||||
|
||||
### Grid Search
|
||||
|
||||
**GridSearchCV**
|
||||
- Exhaustive search over parameter grid
|
||||
- Tests all combinations
|
||||
```python
|
||||
from sklearn.model_selection import GridSearchCV
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
|
||||
param_grid = {
|
||||
'n_estimators': [50, 100, 200],
|
||||
'max_depth': [5, 10, 15, None],
|
||||
'min_samples_split': [2, 5, 10],
|
||||
'min_samples_leaf': [1, 2, 4]
|
||||
}
|
||||
|
||||
model = RandomForestClassifier(random_state=42)
|
||||
grid_search = GridSearchCV(
|
||||
model, param_grid,
|
||||
cv=5,
|
||||
scoring='accuracy',
|
||||
n_jobs=-1, # Use all CPU cores
|
||||
verbose=1
|
||||
)
|
||||
|
||||
grid_search.fit(X_train, y_train)
|
||||
|
||||
print(f"Best parameters: {grid_search.best_params_}")
|
||||
print(f"Best cross-validation score: {grid_search.best_score_:.3f}")
|
||||
print(f"Test score: {grid_search.score(X_test, y_test):.3f}")
|
||||
|
||||
# Access best model
|
||||
best_model = grid_search.best_estimator_
|
||||
|
||||
# View all results
|
||||
import pandas as pd
|
||||
results_df = pd.DataFrame(grid_search.cv_results_)
|
||||
```
|
||||
|
||||
### Randomized Search
|
||||
|
||||
**RandomizedSearchCV**
|
||||
- Samples random combinations from parameter distributions
|
||||
- More efficient for large search spaces
|
||||
```python
|
||||
from sklearn.model_selection import RandomizedSearchCV
|
||||
from scipy.stats import randint, uniform
|
||||
|
||||
param_distributions = {
|
||||
'n_estimators': randint(50, 300),
|
||||
'max_depth': [5, 10, 15, 20, None],
|
||||
'min_samples_split': randint(2, 20),
|
||||
'min_samples_leaf': randint(1, 10),
|
||||
'max_features': uniform(0.1, 0.9) # Continuous distribution
|
||||
}
|
||||
|
||||
model = RandomForestClassifier(random_state=42)
|
||||
random_search = RandomizedSearchCV(
|
||||
model, param_distributions,
|
||||
n_iter=100, # Number of parameter settings sampled
|
||||
cv=5,
|
||||
scoring='accuracy',
|
||||
n_jobs=-1,
|
||||
verbose=1,
|
||||
random_state=42
|
||||
)
|
||||
|
||||
random_search.fit(X_train, y_train)
|
||||
|
||||
print(f"Best parameters: {random_search.best_params_}")
|
||||
print(f"Best score: {random_search.best_score_:.3f}")
|
||||
```
|
||||
|
||||
### Successive Halving
|
||||
|
||||
**HalvingGridSearchCV / HalvingRandomSearchCV**
|
||||
- Iteratively selects best candidates using successive halving
|
||||
- More efficient than exhaustive search
|
||||
```python
|
||||
from sklearn.experimental import enable_halving_search_cv
|
||||
from sklearn.model_selection import HalvingGridSearchCV
|
||||
|
||||
param_grid = {
|
||||
'n_estimators': [50, 100, 200, 300],
|
||||
'max_depth': [5, 10, 15, 20, None],
|
||||
'min_samples_split': [2, 5, 10, 20]
|
||||
}
|
||||
|
||||
model = RandomForestClassifier(random_state=42)
|
||||
halving_search = HalvingGridSearchCV(
|
||||
model, param_grid,
|
||||
cv=5,
|
||||
factor=3, # Proportion of candidates eliminated in each iteration
|
||||
resource='n_samples', # Can also use 'n_estimators' for ensembles
|
||||
max_resources='auto',
|
||||
random_state=42
|
||||
)
|
||||
|
||||
halving_search.fit(X_train, y_train)
|
||||
print(f"Best parameters: {halving_search.best_params_}")
|
||||
```
|
||||
|
||||
## Classification Metrics
|
||||
|
||||
### Basic Metrics
|
||||
|
||||
```python
|
||||
from sklearn.metrics import (
|
||||
accuracy_score, precision_score, recall_score, f1_score,
|
||||
balanced_accuracy_score, matthews_corrcoef
|
||||
)
|
||||
|
||||
y_pred = model.predict(X_test)
|
||||
|
||||
accuracy = accuracy_score(y_test, y_pred)
|
||||
precision = precision_score(y_test, y_pred, average='weighted') # For multiclass
|
||||
recall = recall_score(y_test, y_pred, average='weighted')
|
||||
f1 = f1_score(y_test, y_pred, average='weighted')
|
||||
balanced_acc = balanced_accuracy_score(y_test, y_pred) # Good for imbalanced data
|
||||
mcc = matthews_corrcoef(y_test, y_pred) # Matthews correlation coefficient
|
||||
|
||||
print(f"Accuracy: {accuracy:.3f}")
|
||||
print(f"Precision: {precision:.3f}")
|
||||
print(f"Recall: {recall:.3f}")
|
||||
print(f"F1-score: {f1:.3f}")
|
||||
print(f"Balanced Accuracy: {balanced_acc:.3f}")
|
||||
print(f"MCC: {mcc:.3f}")
|
||||
```
|
||||
|
||||
### Classification Report
|
||||
|
||||
```python
|
||||
from sklearn.metrics import classification_report
|
||||
|
||||
print(classification_report(y_test, y_pred, target_names=class_names))
|
||||
```
|
||||
|
||||
### Confusion Matrix
|
||||
|
||||
```python
|
||||
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
cm = confusion_matrix(y_test, y_pred)
|
||||
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names)
|
||||
disp.plot(cmap='Blues')
|
||||
plt.show()
|
||||
```
|
||||
|
||||
### ROC and AUC
|
||||
|
||||
```python
|
||||
from sklearn.metrics import roc_auc_score, roc_curve, RocCurveDisplay
|
||||
|
||||
# Binary classification
|
||||
y_proba = model.predict_proba(X_test)[:, 1]
|
||||
auc = roc_auc_score(y_test, y_proba)
|
||||
print(f"ROC AUC: {auc:.3f}")
|
||||
|
||||
# Plot ROC curve
|
||||
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
|
||||
RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=auc).plot()
|
||||
|
||||
# Multiclass (one-vs-rest)
|
||||
auc_ovr = roc_auc_score(y_test, y_proba_multi, multi_class='ovr')
|
||||
```
|
||||
|
||||
### Precision-Recall Curve
|
||||
|
||||
```python
|
||||
from sklearn.metrics import precision_recall_curve, PrecisionRecallDisplay
|
||||
from sklearn.metrics import average_precision_score
|
||||
|
||||
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
|
||||
ap = average_precision_score(y_test, y_proba)
|
||||
|
||||
disp = PrecisionRecallDisplay(precision=precision, recall=recall, average_precision=ap)
|
||||
disp.plot()
|
||||
```
|
||||
|
||||
### Log Loss
|
||||
|
||||
```python
|
||||
from sklearn.metrics import log_loss
|
||||
|
||||
y_proba = model.predict_proba(X_test)
|
||||
logloss = log_loss(y_test, y_proba)
|
||||
print(f"Log Loss: {logloss:.3f}")
|
||||
```
|
||||
|
||||
## Regression Metrics
|
||||
|
||||
```python
|
||||
from sklearn.metrics import (
|
||||
mean_squared_error, mean_absolute_error, r2_score,
|
||||
mean_absolute_percentage_error, median_absolute_error
|
||||
)
|
||||
|
||||
y_pred = model.predict(X_test)
|
||||
|
||||
mse = mean_squared_error(y_test, y_pred)
|
||||
rmse = mean_squared_error(y_test, y_pred, squared=False)
|
||||
mae = mean_absolute_error(y_test, y_pred)
|
||||
r2 = r2_score(y_test, y_pred)
|
||||
mape = mean_absolute_percentage_error(y_test, y_pred)
|
||||
median_ae = median_absolute_error(y_test, y_pred)
|
||||
|
||||
print(f"MSE: {mse:.3f}")
|
||||
print(f"RMSE: {rmse:.3f}")
|
||||
print(f"MAE: {mae:.3f}")
|
||||
print(f"R² Score: {r2:.3f}")
|
||||
print(f"MAPE: {mape:.3f}")
|
||||
print(f"Median AE: {median_ae:.3f}")
|
||||
```
|
||||
|
||||
## Clustering Metrics
|
||||
|
||||
### With Ground Truth Labels
|
||||
|
||||
```python
|
||||
from sklearn.metrics import (
|
||||
adjusted_rand_score, normalized_mutual_info_score,
|
||||
adjusted_mutual_info_score, fowlkes_mallows_score,
|
||||
homogeneity_score, completeness_score, v_measure_score
|
||||
)
|
||||
|
||||
ari = adjusted_rand_score(y_true, y_pred)
|
||||
nmi = normalized_mutual_info_score(y_true, y_pred)
|
||||
ami = adjusted_mutual_info_score(y_true, y_pred)
|
||||
fmi = fowlkes_mallows_score(y_true, y_pred)
|
||||
homogeneity = homogeneity_score(y_true, y_pred)
|
||||
completeness = completeness_score(y_true, y_pred)
|
||||
v_measure = v_measure_score(y_true, y_pred)
|
||||
```
|
||||
|
||||
### Without Ground Truth
|
||||
|
||||
```python
|
||||
from sklearn.metrics import (
|
||||
silhouette_score, calinski_harabasz_score, davies_bouldin_score
|
||||
)
|
||||
|
||||
silhouette = silhouette_score(X, labels) # [-1, 1], higher better
|
||||
ch_score = calinski_harabasz_score(X, labels) # Higher better
|
||||
db_score = davies_bouldin_score(X, labels) # Lower better
|
||||
```
|
||||
|
||||
## Custom Scoring
|
||||
|
||||
### Using make_scorer
|
||||
|
||||
```python
|
||||
from sklearn.metrics import make_scorer
|
||||
|
||||
def custom_metric(y_true, y_pred):
|
||||
# Your custom logic
|
||||
return score
|
||||
|
||||
custom_scorer = make_scorer(custom_metric, greater_is_better=True)
|
||||
|
||||
# Use in cross-validation or grid search
|
||||
scores = cross_val_score(model, X, y, cv=5, scoring=custom_scorer)
|
||||
```
|
||||
|
||||
### Multiple Metrics in Grid Search
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import GridSearchCV
|
||||
|
||||
scoring = {
|
||||
'accuracy': 'accuracy',
|
||||
'precision': 'precision_weighted',
|
||||
'recall': 'recall_weighted',
|
||||
'f1': 'f1_weighted'
|
||||
}
|
||||
|
||||
grid_search = GridSearchCV(
|
||||
model, param_grid,
|
||||
cv=5,
|
||||
scoring=scoring,
|
||||
refit='f1', # Refit on best f1 score
|
||||
return_train_score=True
|
||||
)
|
||||
|
||||
grid_search.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
## Validation Curves
|
||||
|
||||
### Learning Curve
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import learning_curve
|
||||
import matplotlib.pyplot as plt
|
||||
import numpy as np
|
||||
|
||||
train_sizes, train_scores, val_scores = learning_curve(
|
||||
model, X, y,
|
||||
cv=5,
|
||||
train_sizes=np.linspace(0.1, 1.0, 10),
|
||||
scoring='accuracy',
|
||||
n_jobs=-1
|
||||
)
|
||||
|
||||
train_mean = train_scores.mean(axis=1)
|
||||
train_std = train_scores.std(axis=1)
|
||||
val_mean = val_scores.mean(axis=1)
|
||||
val_std = val_scores.std(axis=1)
|
||||
|
||||
plt.figure(figsize=(10, 6))
|
||||
plt.plot(train_sizes, train_mean, label='Training score')
|
||||
plt.plot(train_sizes, val_mean, label='Validation score')
|
||||
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1)
|
||||
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.1)
|
||||
plt.xlabel('Training Set Size')
|
||||
plt.ylabel('Score')
|
||||
plt.title('Learning Curve')
|
||||
plt.legend()
|
||||
plt.grid(True)
|
||||
```
|
||||
|
||||
### Validation Curve
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import validation_curve
|
||||
|
||||
param_range = [1, 10, 50, 100, 200, 500]
|
||||
train_scores, val_scores = validation_curve(
|
||||
model, X, y,
|
||||
param_name='n_estimators',
|
||||
param_range=param_range,
|
||||
cv=5,
|
||||
scoring='accuracy',
|
||||
n_jobs=-1
|
||||
)
|
||||
|
||||
train_mean = train_scores.mean(axis=1)
|
||||
val_mean = val_scores.mean(axis=1)
|
||||
|
||||
plt.figure(figsize=(10, 6))
|
||||
plt.plot(param_range, train_mean, label='Training score')
|
||||
plt.plot(param_range, val_mean, label='Validation score')
|
||||
plt.xlabel('n_estimators')
|
||||
plt.ylabel('Score')
|
||||
plt.title('Validation Curve')
|
||||
plt.legend()
|
||||
plt.grid(True)
|
||||
```
|
||||
|
||||
## Model Persistence
|
||||
|
||||
### Save and Load Models
|
||||
|
||||
```python
|
||||
import joblib
|
||||
|
||||
# Save model
|
||||
joblib.dump(model, 'model.pkl')
|
||||
|
||||
# Load model
|
||||
loaded_model = joblib.load('model.pkl')
|
||||
|
||||
# Also works with pipelines
|
||||
joblib.dump(pipeline, 'pipeline.pkl')
|
||||
```
|
||||
|
||||
### Using pickle
|
||||
|
||||
```python
|
||||
import pickle
|
||||
|
||||
# Save
|
||||
with open('model.pkl', 'wb') as f:
|
||||
pickle.dump(model, f)
|
||||
|
||||
# Load
|
||||
with open('model.pkl', 'rb') as f:
|
||||
loaded_model = pickle.load(f)
|
||||
```
|
||||
|
||||
## Imbalanced Data Strategies
|
||||
|
||||
### Class Weighting
|
||||
|
||||
```python
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
|
||||
# Automatically balance classes
|
||||
model = RandomForestClassifier(class_weight='balanced', random_state=42)
|
||||
model.fit(X_train, y_train)
|
||||
|
||||
# Custom weights
|
||||
class_weights = {0: 1, 1: 10} # Give class 1 more weight
|
||||
model = RandomForestClassifier(class_weight=class_weights, random_state=42)
|
||||
```
|
||||
|
||||
### Resampling (using imbalanced-learn)
|
||||
|
||||
```python
|
||||
# Install: uv pip install imbalanced-learn
|
||||
from imblearn.over_sampling import SMOTE
|
||||
from imblearn.under_sampling import RandomUnderSampler
|
||||
from imblearn.pipeline import Pipeline as ImbPipeline
|
||||
|
||||
# SMOTE oversampling
|
||||
smote = SMOTE(random_state=42)
|
||||
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
|
||||
|
||||
# Combined approach
|
||||
pipeline = ImbPipeline([
|
||||
('over', SMOTE(sampling_strategy=0.5)),
|
||||
('under', RandomUnderSampler(sampling_strategy=0.8)),
|
||||
('model', RandomForestClassifier())
|
||||
])
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Stratified Splitting
|
||||
Always use stratified splitting for classification:
|
||||
```python
|
||||
X_train, X_test, y_train, y_test = train_test_split(
|
||||
X, y, test_size=0.2, stratify=y, random_state=42
|
||||
)
|
||||
```
|
||||
|
||||
### Appropriate Metrics
|
||||
- **Balanced data**: Accuracy, F1-score
|
||||
- **Imbalanced data**: Precision, Recall, F1-score, ROC AUC, Balanced Accuracy
|
||||
- **Cost-sensitive**: Define custom scorer with costs
|
||||
- **Ranking**: ROC AUC, Average Precision
|
||||
|
||||
### Cross-Validation
|
||||
- Use 5 or 10-fold CV for most cases
|
||||
- Use StratifiedKFold for classification
|
||||
- Use TimeSeriesSplit for time series
|
||||
- Use GroupKFold when samples are grouped
|
||||
|
||||
### Nested Cross-Validation
|
||||
For unbiased performance estimates when tuning:
|
||||
```python
|
||||
from sklearn.model_selection import cross_val_score, GridSearchCV
|
||||
|
||||
# Inner loop: hyperparameter tuning
|
||||
grid_search = GridSearchCV(model, param_grid, cv=5)
|
||||
|
||||
# Outer loop: performance estimation
|
||||
scores = cross_val_score(grid_search, X, y, cv=5)
|
||||
print(f"Nested CV score: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
|
||||
```
|
||||
612
skills/scikit-learn/references/pipelines_and_composition.md
Normal file
612
skills/scikit-learn/references/pipelines_and_composition.md
Normal file
@@ -0,0 +1,612 @@
|
||||
# Pipelines and Composite Estimators Reference
|
||||
|
||||
## Overview
|
||||
|
||||
Pipelines chain multiple processing steps into a single estimator, preventing data leakage and simplifying code. They enable reproducible workflows and seamless integration with cross-validation and hyperparameter tuning.
|
||||
|
||||
## Pipeline Basics
|
||||
|
||||
### Creating a Pipeline
|
||||
|
||||
**Pipeline (`sklearn.pipeline.Pipeline`)**
|
||||
- Chains transformers with a final estimator
|
||||
- All intermediate steps must have fit_transform()
|
||||
- Final step can be any estimator (transformer, classifier, regressor, clusterer)
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.pipeline import Pipeline
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.decomposition import PCA
|
||||
from sklearn.linear_model import LogisticRegression
|
||||
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('pca', PCA(n_components=10)),
|
||||
('classifier', LogisticRegression())
|
||||
])
|
||||
|
||||
# Fit the entire pipeline
|
||||
pipeline.fit(X_train, y_train)
|
||||
|
||||
# Predict using the pipeline
|
||||
y_pred = pipeline.predict(X_test)
|
||||
y_proba = pipeline.predict_proba(X_test)
|
||||
```
|
||||
|
||||
### Using make_pipeline
|
||||
|
||||
**make_pipeline**
|
||||
- Convenient constructor that auto-generates step names
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.pipeline import make_pipeline
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.svm import SVC
|
||||
|
||||
pipeline = make_pipeline(
|
||||
StandardScaler(),
|
||||
PCA(n_components=10),
|
||||
SVC(kernel='rbf')
|
||||
)
|
||||
|
||||
pipeline.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
## Accessing Pipeline Components
|
||||
|
||||
### Accessing Steps
|
||||
|
||||
```python
|
||||
# By index
|
||||
scaler = pipeline.steps[0][1]
|
||||
|
||||
# By name
|
||||
scaler = pipeline.named_steps['scaler']
|
||||
pca = pipeline.named_steps['pca']
|
||||
|
||||
# Using indexing syntax
|
||||
scaler = pipeline['scaler']
|
||||
pca = pipeline['pca']
|
||||
|
||||
# Get all step names
|
||||
print(pipeline.named_steps.keys())
|
||||
```
|
||||
|
||||
### Setting Parameters
|
||||
|
||||
```python
|
||||
# Set parameters using double underscore notation
|
||||
pipeline.set_params(
|
||||
pca__n_components=15,
|
||||
classifier__C=0.1
|
||||
)
|
||||
|
||||
# Or during creation
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('pca', PCA(n_components=10)),
|
||||
('classifier', LogisticRegression(C=1.0))
|
||||
])
|
||||
```
|
||||
|
||||
### Accessing Attributes
|
||||
|
||||
```python
|
||||
# Access fitted attributes
|
||||
pca_components = pipeline.named_steps['pca'].components_
|
||||
explained_variance = pipeline.named_steps['pca'].explained_variance_ratio_
|
||||
|
||||
# Access intermediate transformations
|
||||
X_scaled = pipeline.named_steps['scaler'].transform(X_test)
|
||||
X_pca = pipeline.named_steps['pca'].transform(X_scaled)
|
||||
```
|
||||
|
||||
## Hyperparameter Tuning with Pipelines
|
||||
|
||||
### Grid Search with Pipeline
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import GridSearchCV
|
||||
from sklearn.pipeline import Pipeline
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.svm import SVC
|
||||
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('classifier', SVC())
|
||||
])
|
||||
|
||||
param_grid = {
|
||||
'classifier__C': [0.1, 1, 10, 100],
|
||||
'classifier__gamma': ['scale', 'auto', 0.001, 0.01],
|
||||
'classifier__kernel': ['rbf', 'linear']
|
||||
}
|
||||
|
||||
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
|
||||
grid_search.fit(X_train, y_train)
|
||||
|
||||
print(f"Best parameters: {grid_search.best_params_}")
|
||||
print(f"Best score: {grid_search.best_score_:.3f}")
|
||||
```
|
||||
|
||||
### Tuning Multiple Pipeline Steps
|
||||
|
||||
```python
|
||||
param_grid = {
|
||||
# PCA parameters
|
||||
'pca__n_components': [5, 10, 20, 50],
|
||||
|
||||
# Classifier parameters
|
||||
'classifier__C': [0.1, 1, 10],
|
||||
'classifier__kernel': ['rbf', 'linear']
|
||||
}
|
||||
|
||||
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
|
||||
grid_search.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
## ColumnTransformer
|
||||
|
||||
### Basic Usage
|
||||
|
||||
**ColumnTransformer (`sklearn.compose.ColumnTransformer`)**
|
||||
- Apply different preprocessing to different columns
|
||||
- Prevents data leakage in cross-validation
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.compose import ColumnTransformer
|
||||
from sklearn.preprocessing import StandardScaler, OneHotEncoder
|
||||
from sklearn.impute import SimpleImputer
|
||||
|
||||
# Define column groups
|
||||
numeric_features = ['age', 'income', 'hours_per_week']
|
||||
categorical_features = ['gender', 'occupation', 'native_country']
|
||||
|
||||
# Create preprocessor
|
||||
preprocessor = ColumnTransformer(
|
||||
transformers=[
|
||||
('num', StandardScaler(), numeric_features),
|
||||
('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
|
||||
],
|
||||
remainder='passthrough' # Keep other columns unchanged
|
||||
)
|
||||
|
||||
X_transformed = preprocessor.fit_transform(X)
|
||||
```
|
||||
|
||||
### With Pipeline Steps
|
||||
|
||||
```python
|
||||
from sklearn.pipeline import Pipeline
|
||||
|
||||
numeric_transformer = Pipeline(steps=[
|
||||
('imputer', SimpleImputer(strategy='median')),
|
||||
('scaler', StandardScaler())
|
||||
])
|
||||
|
||||
categorical_transformer = Pipeline(steps=[
|
||||
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
|
||||
('onehot', OneHotEncoder(handle_unknown='ignore'))
|
||||
])
|
||||
|
||||
preprocessor = ColumnTransformer(
|
||||
transformers=[
|
||||
('num', numeric_transformer, numeric_features),
|
||||
('cat', categorical_transformer, categorical_features)
|
||||
]
|
||||
)
|
||||
|
||||
# Full pipeline with model
|
||||
full_pipeline = Pipeline([
|
||||
('preprocessor', preprocessor),
|
||||
('classifier', LogisticRegression())
|
||||
])
|
||||
|
||||
full_pipeline.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
### Using make_column_transformer
|
||||
|
||||
```python
|
||||
from sklearn.compose import make_column_transformer
|
||||
|
||||
preprocessor = make_column_transformer(
|
||||
(StandardScaler(), numeric_features),
|
||||
(OneHotEncoder(), categorical_features),
|
||||
remainder='passthrough'
|
||||
)
|
||||
```
|
||||
|
||||
### Column Selection
|
||||
|
||||
```python
|
||||
# By column names (if X is DataFrame)
|
||||
preprocessor = ColumnTransformer([
|
||||
('num', StandardScaler(), ['age', 'income']),
|
||||
('cat', OneHotEncoder(), ['gender', 'occupation'])
|
||||
])
|
||||
|
||||
# By column indices
|
||||
preprocessor = ColumnTransformer([
|
||||
('num', StandardScaler(), [0, 1, 2]),
|
||||
('cat', OneHotEncoder(), [3, 4])
|
||||
])
|
||||
|
||||
# By boolean mask
|
||||
numeric_mask = [True, True, True, False, False]
|
||||
categorical_mask = [False, False, False, True, True]
|
||||
|
||||
preprocessor = ColumnTransformer([
|
||||
('num', StandardScaler(), numeric_mask),
|
||||
('cat', OneHotEncoder(), categorical_mask)
|
||||
])
|
||||
|
||||
# By callable
|
||||
def is_numeric(X):
|
||||
return X.select_dtypes(include=['number']).columns.tolist()
|
||||
|
||||
preprocessor = ColumnTransformer([
|
||||
('num', StandardScaler(), is_numeric)
|
||||
])
|
||||
```
|
||||
|
||||
### Getting Feature Names
|
||||
|
||||
```python
|
||||
# Get output feature names
|
||||
feature_names = preprocessor.get_feature_names_out()
|
||||
|
||||
# After fitting
|
||||
preprocessor.fit(X_train)
|
||||
output_features = preprocessor.get_feature_names_out()
|
||||
print(f"Input features: {X_train.columns.tolist()}")
|
||||
print(f"Output features: {output_features}")
|
||||
```
|
||||
|
||||
### Remainder Handling
|
||||
|
||||
```python
|
||||
# Drop unspecified columns (default)
|
||||
preprocessor = ColumnTransformer([...], remainder='drop')
|
||||
|
||||
# Pass through unchanged
|
||||
preprocessor = ColumnTransformer([...], remainder='passthrough')
|
||||
|
||||
# Apply transformer to remaining columns
|
||||
preprocessor = ColumnTransformer([...], remainder=StandardScaler())
|
||||
```
|
||||
|
||||
## FeatureUnion
|
||||
|
||||
### Basic Usage
|
||||
|
||||
**FeatureUnion (`sklearn.pipeline.FeatureUnion`)**
|
||||
- Concatenates results of multiple transformers
|
||||
- Transformers are applied in parallel
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.pipeline import FeatureUnion
|
||||
from sklearn.decomposition import PCA
|
||||
from sklearn.feature_selection import SelectKBest
|
||||
|
||||
# Combine PCA and feature selection
|
||||
feature_union = FeatureUnion([
|
||||
('pca', PCA(n_components=10)),
|
||||
('select_best', SelectKBest(k=20))
|
||||
])
|
||||
|
||||
X_combined = feature_union.fit_transform(X_train, y_train)
|
||||
print(f"Combined features: {X_combined.shape[1]}") # 10 + 20 = 30
|
||||
```
|
||||
|
||||
### With Pipeline
|
||||
|
||||
```python
|
||||
from sklearn.pipeline import Pipeline, FeatureUnion
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.decomposition import PCA, TruncatedSVD
|
||||
|
||||
# Create feature union
|
||||
feature_union = FeatureUnion([
|
||||
('pca', PCA(n_components=10)),
|
||||
('svd', TruncatedSVD(n_components=10))
|
||||
])
|
||||
|
||||
# Full pipeline
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('features', feature_union),
|
||||
('classifier', LogisticRegression())
|
||||
])
|
||||
|
||||
pipeline.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
### Weighted Feature Union
|
||||
|
||||
```python
|
||||
# Apply weights to transformers
|
||||
feature_union = FeatureUnion(
|
||||
transformer_list=[
|
||||
('pca', PCA(n_components=10)),
|
||||
('select_best', SelectKBest(k=20))
|
||||
],
|
||||
transformer_weights={
|
||||
'pca': 2.0, # Give PCA features double weight
|
||||
'select_best': 1.0
|
||||
}
|
||||
)
|
||||
```
|
||||
|
||||
## Advanced Pipeline Patterns
|
||||
|
||||
### Caching Pipeline Steps
|
||||
|
||||
```python
|
||||
from sklearn.pipeline import Pipeline
|
||||
from tempfile import mkdtemp
|
||||
from shutil import rmtree
|
||||
|
||||
# Cache intermediate results
|
||||
cachedir = mkdtemp()
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('pca', PCA(n_components=50)),
|
||||
('classifier', LogisticRegression())
|
||||
], memory=cachedir)
|
||||
|
||||
pipeline.fit(X_train, y_train)
|
||||
|
||||
# Clean up cache
|
||||
rmtree(cachedir)
|
||||
```
|
||||
|
||||
### Nested Pipelines
|
||||
|
||||
```python
|
||||
from sklearn.pipeline import Pipeline
|
||||
|
||||
# Inner pipeline for text processing
|
||||
text_pipeline = Pipeline([
|
||||
('vect', CountVectorizer()),
|
||||
('tfidf', TfidfTransformer())
|
||||
])
|
||||
|
||||
# Outer pipeline combining text and numeric features
|
||||
full_pipeline = Pipeline([
|
||||
('features', FeatureUnion([
|
||||
('text', text_pipeline),
|
||||
('numeric', StandardScaler())
|
||||
])),
|
||||
('classifier', LogisticRegression())
|
||||
])
|
||||
```
|
||||
|
||||
### Custom Transformers in Pipelines
|
||||
|
||||
```python
|
||||
from sklearn.base import BaseEstimator, TransformerMixin
|
||||
|
||||
class TextLengthExtractor(BaseEstimator, TransformerMixin):
|
||||
def fit(self, X, y=None):
|
||||
return self
|
||||
|
||||
def transform(self, X):
|
||||
return [[len(text)] for text in X]
|
||||
|
||||
pipeline = Pipeline([
|
||||
('length', TextLengthExtractor()),
|
||||
('scaler', StandardScaler()),
|
||||
('classifier', LogisticRegression())
|
||||
])
|
||||
```
|
||||
|
||||
### Slicing Pipelines
|
||||
|
||||
```python
|
||||
# Get sub-pipeline
|
||||
sub_pipeline = pipeline[:2] # First two steps
|
||||
|
||||
# Get specific range
|
||||
middle_steps = pipeline[1:3]
|
||||
```
|
||||
|
||||
## TransformedTargetRegressor
|
||||
|
||||
### Basic Usage
|
||||
|
||||
**TransformedTargetRegressor**
|
||||
- Transforms target variable before fitting
|
||||
- Automatically inverse-transforms predictions
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.compose import TransformedTargetRegressor
|
||||
from sklearn.preprocessing import QuantileTransformer
|
||||
from sklearn.linear_model import LinearRegression
|
||||
|
||||
model = TransformedTargetRegressor(
|
||||
regressor=LinearRegression(),
|
||||
transformer=QuantileTransformer(output_distribution='normal')
|
||||
)
|
||||
|
||||
model.fit(X_train, y_train)
|
||||
y_pred = model.predict(X_test) # Automatically inverse-transformed
|
||||
```
|
||||
|
||||
### With Functions
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
|
||||
model = TransformedTargetRegressor(
|
||||
regressor=LinearRegression(),
|
||||
func=np.log1p,
|
||||
inverse_func=np.expm1
|
||||
)
|
||||
|
||||
model.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
## Complete Example: End-to-End Pipeline
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
from sklearn.compose import ColumnTransformer
|
||||
from sklearn.pipeline import Pipeline
|
||||
from sklearn.preprocessing import StandardScaler, OneHotEncoder
|
||||
from sklearn.impute import SimpleImputer
|
||||
from sklearn.decomposition import PCA
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
from sklearn.model_selection import GridSearchCV
|
||||
|
||||
# Define feature types
|
||||
numeric_features = ['age', 'income', 'hours_per_week']
|
||||
categorical_features = ['gender', 'occupation', 'education']
|
||||
|
||||
# Numeric preprocessing pipeline
|
||||
numeric_transformer = Pipeline(steps=[
|
||||
('imputer', SimpleImputer(strategy='median')),
|
||||
('scaler', StandardScaler())
|
||||
])
|
||||
|
||||
# Categorical preprocessing pipeline
|
||||
categorical_transformer = Pipeline(steps=[
|
||||
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
|
||||
('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
|
||||
])
|
||||
|
||||
# Combine preprocessing
|
||||
preprocessor = ColumnTransformer(
|
||||
transformers=[
|
||||
('num', numeric_transformer, numeric_features),
|
||||
('cat', categorical_transformer, categorical_features)
|
||||
]
|
||||
)
|
||||
|
||||
# Full pipeline
|
||||
pipeline = Pipeline([
|
||||
('preprocessor', preprocessor),
|
||||
('pca', PCA(n_components=0.95)), # Keep 95% variance
|
||||
('classifier', RandomForestClassifier(random_state=42))
|
||||
])
|
||||
|
||||
# Hyperparameter tuning
|
||||
param_grid = {
|
||||
'preprocessor__num__imputer__strategy': ['mean', 'median'],
|
||||
'pca__n_components': [0.90, 0.95, 0.99],
|
||||
'classifier__n_estimators': [100, 200],
|
||||
'classifier__max_depth': [10, 20, None]
|
||||
}
|
||||
|
||||
grid_search = GridSearchCV(
|
||||
pipeline, param_grid,
|
||||
cv=5, scoring='accuracy',
|
||||
n_jobs=-1, verbose=1
|
||||
)
|
||||
|
||||
grid_search.fit(X_train, y_train)
|
||||
|
||||
print(f"Best parameters: {grid_search.best_params_}")
|
||||
print(f"Best CV score: {grid_search.best_score_:.3f}")
|
||||
print(f"Test score: {grid_search.score(X_test, y_test):.3f}")
|
||||
|
||||
# Make predictions
|
||||
best_pipeline = grid_search.best_estimator_
|
||||
y_pred = best_pipeline.predict(X_test)
|
||||
y_proba = best_pipeline.predict_proba(X_test)
|
||||
```
|
||||
|
||||
## Visualization
|
||||
|
||||
### Displaying Pipelines
|
||||
|
||||
```python
|
||||
# In Jupyter notebooks, pipelines display as diagrams
|
||||
from sklearn import set_config
|
||||
set_config(display='diagram')
|
||||
|
||||
pipeline # Displays visual diagram
|
||||
```
|
||||
|
||||
### Text Representation
|
||||
|
||||
```python
|
||||
# Print pipeline structure
|
||||
print(pipeline)
|
||||
|
||||
# Get detailed parameters
|
||||
print(pipeline.get_params())
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Always Use Pipelines
|
||||
- Prevents data leakage
|
||||
- Ensures consistency between training and prediction
|
||||
- Makes code more maintainable
|
||||
- Enables easy hyperparameter tuning
|
||||
|
||||
### Proper Pipeline Construction
|
||||
```python
|
||||
# Good: Preprocessing inside pipeline
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('model', LogisticRegression())
|
||||
])
|
||||
pipeline.fit(X_train, y_train)
|
||||
|
||||
# Bad: Preprocessing outside pipeline (can cause leakage)
|
||||
X_train_scaled = StandardScaler().fit_transform(X_train)
|
||||
model = LogisticRegression()
|
||||
model.fit(X_train_scaled, y_train)
|
||||
```
|
||||
|
||||
### Use ColumnTransformer for Mixed Data
|
||||
Always use ColumnTransformer when you have both numerical and categorical features:
|
||||
```python
|
||||
preprocessor = ColumnTransformer([
|
||||
('num', StandardScaler(), numeric_features),
|
||||
('cat', OneHotEncoder(), categorical_features)
|
||||
])
|
||||
```
|
||||
|
||||
### Name Your Steps Meaningfully
|
||||
```python
|
||||
# Good
|
||||
pipeline = Pipeline([
|
||||
('imputer', SimpleImputer()),
|
||||
('scaler', StandardScaler()),
|
||||
('pca', PCA(n_components=10)),
|
||||
('rf_classifier', RandomForestClassifier())
|
||||
])
|
||||
|
||||
# Bad
|
||||
pipeline = Pipeline([
|
||||
('step1', SimpleImputer()),
|
||||
('step2', StandardScaler()),
|
||||
('step3', PCA(n_components=10)),
|
||||
('step4', RandomForestClassifier())
|
||||
])
|
||||
```
|
||||
|
||||
### Cache Expensive Transformations
|
||||
For repeated fitting (e.g., during grid search), cache expensive steps:
|
||||
```python
|
||||
from tempfile import mkdtemp
|
||||
|
||||
cachedir = mkdtemp()
|
||||
pipeline = Pipeline([
|
||||
('expensive_preprocessing', ExpensiveTransformer()),
|
||||
('classifier', LogisticRegression())
|
||||
], memory=cachedir)
|
||||
```
|
||||
|
||||
### Test Pipeline Compatibility
|
||||
Ensure all steps are compatible:
|
||||
- All intermediate steps must have fit() and transform()
|
||||
- Final step needs fit() and predict() (or transform())
|
||||
- Use set_output(transform='pandas') for DataFrame output
|
||||
```python
|
||||
pipeline.set_output(transform='pandas')
|
||||
X_transformed = pipeline.transform(X) # Returns DataFrame
|
||||
```
|
||||
606
skills/scikit-learn/references/preprocessing.md
Normal file
606
skills/scikit-learn/references/preprocessing.md
Normal file
@@ -0,0 +1,606 @@
|
||||
# Data Preprocessing and Feature Engineering Reference
|
||||
|
||||
## Overview
|
||||
|
||||
Data preprocessing transforms raw data into a format suitable for machine learning models. This includes scaling, encoding, handling missing values, and feature engineering.
|
||||
|
||||
## Feature Scaling and Normalization
|
||||
|
||||
### StandardScaler
|
||||
|
||||
**StandardScaler (`sklearn.preprocessing.StandardScaler`)**
|
||||
- Standardizes features to zero mean and unit variance
|
||||
- Formula: z = (x - mean) / std
|
||||
- Use when: Features have different scales, algorithm assumes normally distributed data
|
||||
- Required for: SVM, KNN, Neural Networks, PCA, Linear Regression with regularization
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
|
||||
scaler = StandardScaler()
|
||||
X_train_scaled = scaler.fit_transform(X_train)
|
||||
X_test_scaled = scaler.transform(X_test) # Use same parameters as training
|
||||
|
||||
# Access learned parameters
|
||||
print(f"Mean: {scaler.mean_}")
|
||||
print(f"Std: {scaler.scale_}")
|
||||
```
|
||||
|
||||
### MinMaxScaler
|
||||
|
||||
**MinMaxScaler (`sklearn.preprocessing.MinMaxScaler`)**
|
||||
- Scales features to a given range (default [0, 1])
|
||||
- Formula: X_scaled = (X - X.min) / (X.max - X.min)
|
||||
- Use when: Need bounded values, data not normally distributed
|
||||
- Sensitive to outliers
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.preprocessing import MinMaxScaler
|
||||
|
||||
scaler = MinMaxScaler(feature_range=(0, 1))
|
||||
X_scaled = scaler.fit_transform(X_train)
|
||||
|
||||
# Custom range
|
||||
scaler = MinMaxScaler(feature_range=(-1, 1))
|
||||
X_scaled = scaler.fit_transform(X_train)
|
||||
```
|
||||
|
||||
### RobustScaler
|
||||
|
||||
**RobustScaler (`sklearn.preprocessing.RobustScaler`)**
|
||||
- Scales using median and interquartile range (IQR)
|
||||
- Formula: X_scaled = (X - median) / IQR
|
||||
- Use when: Data contains outliers
|
||||
- Robust to outliers
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.preprocessing import RobustScaler
|
||||
|
||||
scaler = RobustScaler()
|
||||
X_scaled = scaler.fit_transform(X_train)
|
||||
```
|
||||
|
||||
### Normalizer
|
||||
|
||||
**Normalizer (`sklearn.preprocessing.Normalizer`)**
|
||||
- Normalizes samples individually to unit norm
|
||||
- Common norms: 'l1', 'l2', 'max'
|
||||
- Use when: Need to normalize each sample independently (e.g., text features)
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.preprocessing import Normalizer
|
||||
|
||||
normalizer = Normalizer(norm='l2') # Euclidean norm
|
||||
X_normalized = normalizer.fit_transform(X)
|
||||
```
|
||||
|
||||
### MaxAbsScaler
|
||||
|
||||
**MaxAbsScaler (`sklearn.preprocessing.MaxAbsScaler`)**
|
||||
- Scales by maximum absolute value
|
||||
- Range: [-1, 1]
|
||||
- Doesn't shift/center data (preserves sparsity)
|
||||
- Use when: Data is already centered or sparse
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.preprocessing import MaxAbsScaler
|
||||
|
||||
scaler = MaxAbsScaler()
|
||||
X_scaled = scaler.fit_transform(X_sparse)
|
||||
```
|
||||
|
||||
## Encoding Categorical Variables
|
||||
|
||||
### OneHotEncoder
|
||||
|
||||
**OneHotEncoder (`sklearn.preprocessing.OneHotEncoder`)**
|
||||
- Creates binary columns for each category
|
||||
- Use when: Nominal categories (no order), tree-based models or linear models
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.preprocessing import OneHotEncoder
|
||||
|
||||
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
|
||||
X_encoded = encoder.fit_transform(X_categorical)
|
||||
|
||||
# Get feature names
|
||||
feature_names = encoder.get_feature_names_out(['color', 'size'])
|
||||
|
||||
# Handle unknown categories during transform
|
||||
X_test_encoded = encoder.transform(X_test_categorical)
|
||||
```
|
||||
|
||||
### OrdinalEncoder
|
||||
|
||||
**OrdinalEncoder (`sklearn.preprocessing.OrdinalEncoder`)**
|
||||
- Encodes categories as integers
|
||||
- Use when: Ordinal categories (ordered), or tree-based models
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.preprocessing import OrdinalEncoder
|
||||
|
||||
# Natural ordering
|
||||
encoder = OrdinalEncoder()
|
||||
X_encoded = encoder.fit_transform(X_categorical)
|
||||
|
||||
# Custom ordering
|
||||
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
|
||||
X_encoded = encoder.fit_transform(X_categorical)
|
||||
```
|
||||
|
||||
### LabelEncoder
|
||||
|
||||
**LabelEncoder (`sklearn.preprocessing.LabelEncoder`)**
|
||||
- Encodes target labels (y) as integers
|
||||
- Use for: Target variable encoding
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.preprocessing import LabelEncoder
|
||||
|
||||
le = LabelEncoder()
|
||||
y_encoded = le.fit_transform(y)
|
||||
|
||||
# Decode back
|
||||
y_decoded = le.inverse_transform(y_encoded)
|
||||
print(f"Classes: {le.classes_}")
|
||||
```
|
||||
|
||||
### Target Encoding (using category_encoders)
|
||||
|
||||
```python
|
||||
# Install: uv pip install category-encoders
|
||||
from category_encoders import TargetEncoder
|
||||
|
||||
encoder = TargetEncoder()
|
||||
X_train_encoded = encoder.fit_transform(X_train_categorical, y_train)
|
||||
X_test_encoded = encoder.transform(X_test_categorical)
|
||||
```
|
||||
|
||||
## Non-linear Transformations
|
||||
|
||||
### Power Transforms
|
||||
|
||||
**PowerTransformer**
|
||||
- Makes data more Gaussian-like
|
||||
- Methods: 'yeo-johnson' (works with negative values), 'box-cox' (positive only)
|
||||
- Use when: Data is skewed, algorithm assumes normality
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.preprocessing import PowerTransformer
|
||||
|
||||
# Yeo-Johnson (handles negative values)
|
||||
pt = PowerTransformer(method='yeo-johnson', standardize=True)
|
||||
X_transformed = pt.fit_transform(X)
|
||||
|
||||
# Box-Cox (positive values only)
|
||||
pt = PowerTransformer(method='box-cox', standardize=True)
|
||||
X_transformed = pt.fit_transform(X)
|
||||
```
|
||||
|
||||
### Quantile Transformation
|
||||
|
||||
**QuantileTransformer**
|
||||
- Transforms features to follow uniform or normal distribution
|
||||
- Robust to outliers
|
||||
- Use when: Want to reduce outlier impact
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.preprocessing import QuantileTransformer
|
||||
|
||||
# Transform to uniform distribution
|
||||
qt = QuantileTransformer(output_distribution='uniform', random_state=42)
|
||||
X_transformed = qt.fit_transform(X)
|
||||
|
||||
# Transform to normal distribution
|
||||
qt = QuantileTransformer(output_distribution='normal', random_state=42)
|
||||
X_transformed = qt.fit_transform(X)
|
||||
```
|
||||
|
||||
### Log Transform
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
|
||||
# Log1p (log(1 + x)) - handles zeros
|
||||
X_log = np.log1p(X)
|
||||
|
||||
# Or use FunctionTransformer
|
||||
from sklearn.preprocessing import FunctionTransformer
|
||||
|
||||
log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)
|
||||
X_log = log_transformer.fit_transform(X)
|
||||
```
|
||||
|
||||
## Missing Value Imputation
|
||||
|
||||
### SimpleImputer
|
||||
|
||||
**SimpleImputer (`sklearn.impute.SimpleImputer`)**
|
||||
- Basic imputation strategies
|
||||
- Strategies: 'mean', 'median', 'most_frequent', 'constant'
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.impute import SimpleImputer
|
||||
|
||||
# For numerical features
|
||||
imputer = SimpleImputer(strategy='mean')
|
||||
X_imputed = imputer.fit_transform(X)
|
||||
|
||||
# For categorical features
|
||||
imputer = SimpleImputer(strategy='most_frequent')
|
||||
X_imputed = imputer.fit_transform(X_categorical)
|
||||
|
||||
# Fill with constant
|
||||
imputer = SimpleImputer(strategy='constant', fill_value=0)
|
||||
X_imputed = imputer.fit_transform(X)
|
||||
```
|
||||
|
||||
### Iterative Imputer
|
||||
|
||||
**IterativeImputer**
|
||||
- Models each feature with missing values as function of other features
|
||||
- More sophisticated than SimpleImputer
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.experimental import enable_iterative_imputer
|
||||
from sklearn.impute import IterativeImputer
|
||||
|
||||
imputer = IterativeImputer(max_iter=10, random_state=42)
|
||||
X_imputed = imputer.fit_transform(X)
|
||||
```
|
||||
|
||||
### KNN Imputer
|
||||
|
||||
**KNNImputer**
|
||||
- Imputes using k-nearest neighbors
|
||||
- Use when: Features are correlated
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.impute import KNNImputer
|
||||
|
||||
imputer = KNNImputer(n_neighbors=5)
|
||||
X_imputed = imputer.fit_transform(X)
|
||||
```
|
||||
|
||||
## Feature Engineering
|
||||
|
||||
### Polynomial Features
|
||||
|
||||
**PolynomialFeatures**
|
||||
- Creates polynomial and interaction features
|
||||
- Use when: Need non-linear features for linear models
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.preprocessing import PolynomialFeatures
|
||||
|
||||
# Degree 2: includes x1, x2, x1^2, x2^2, x1*x2
|
||||
poly = PolynomialFeatures(degree=2, include_bias=False)
|
||||
X_poly = poly.fit_transform(X)
|
||||
|
||||
# Get feature names
|
||||
feature_names = poly.get_feature_names_out(['x1', 'x2'])
|
||||
|
||||
# Only interactions (no powers)
|
||||
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
|
||||
X_interactions = poly.fit_transform(X)
|
||||
```
|
||||
|
||||
### Binning/Discretization
|
||||
|
||||
**KBinsDiscretizer**
|
||||
- Bins continuous features into discrete intervals
|
||||
- Strategies: 'uniform', 'quantile', 'kmeans'
|
||||
- Encoding: 'onehot', 'ordinal', 'onehot-dense'
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.preprocessing import KBinsDiscretizer
|
||||
|
||||
# Equal-width bins
|
||||
binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
|
||||
X_binned = binner.fit_transform(X)
|
||||
|
||||
# Equal-frequency bins (quantile-based)
|
||||
binner = KBinsDiscretizer(n_bins=5, encode='onehot', strategy='quantile')
|
||||
X_binned = binner.fit_transform(X)
|
||||
```
|
||||
|
||||
### Binarization
|
||||
|
||||
**Binarizer**
|
||||
- Converts features to binary (0 or 1) based on threshold
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.preprocessing import Binarizer
|
||||
|
||||
binarizer = Binarizer(threshold=0.5)
|
||||
X_binary = binarizer.fit_transform(X)
|
||||
```
|
||||
|
||||
### Spline Features
|
||||
|
||||
**SplineTransformer**
|
||||
- Creates spline basis functions
|
||||
- Useful for capturing non-linear relationships
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.preprocessing import SplineTransformer
|
||||
|
||||
spline = SplineTransformer(n_knots=5, degree=3)
|
||||
X_splines = spline.fit_transform(X)
|
||||
```
|
||||
|
||||
## Text Feature Extraction
|
||||
|
||||
### CountVectorizer
|
||||
|
||||
**CountVectorizer (`sklearn.feature_extraction.text.CountVectorizer`)**
|
||||
- Converts text to token count matrix
|
||||
- Use for: Bag-of-words representation
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.feature_extraction.text import CountVectorizer
|
||||
|
||||
vectorizer = CountVectorizer(
|
||||
max_features=5000, # Keep top 5000 features
|
||||
min_df=2, # Ignore terms appearing in < 2 documents
|
||||
max_df=0.8, # Ignore terms appearing in > 80% documents
|
||||
ngram_range=(1, 2) # Unigrams and bigrams
|
||||
)
|
||||
|
||||
X_counts = vectorizer.fit_transform(documents)
|
||||
feature_names = vectorizer.get_feature_names_out()
|
||||
```
|
||||
|
||||
### TfidfVectorizer
|
||||
|
||||
**TfidfVectorizer**
|
||||
- TF-IDF (Term Frequency-Inverse Document Frequency) transformation
|
||||
- Better than CountVectorizer for most tasks
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.feature_extraction.text import TfidfVectorizer
|
||||
|
||||
vectorizer = TfidfVectorizer(
|
||||
max_features=5000,
|
||||
min_df=2,
|
||||
max_df=0.8,
|
||||
ngram_range=(1, 2),
|
||||
stop_words='english' # Remove English stop words
|
||||
)
|
||||
|
||||
X_tfidf = vectorizer.fit_transform(documents)
|
||||
```
|
||||
|
||||
### HashingVectorizer
|
||||
|
||||
**HashingVectorizer**
|
||||
- Uses hashing trick for memory efficiency
|
||||
- No fit needed, can't reverse transform
|
||||
- Use when: Very large vocabulary, streaming data
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.feature_extraction.text import HashingVectorizer
|
||||
|
||||
vectorizer = HashingVectorizer(n_features=2**18)
|
||||
X_hashed = vectorizer.transform(documents) # No fit needed
|
||||
```
|
||||
|
||||
## Feature Selection
|
||||
|
||||
### Filter Methods
|
||||
|
||||
**Variance Threshold**
|
||||
- Removes low-variance features
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.feature_selection import VarianceThreshold
|
||||
|
||||
selector = VarianceThreshold(threshold=0.01)
|
||||
X_selected = selector.fit_transform(X)
|
||||
```
|
||||
|
||||
**SelectKBest / SelectPercentile**
|
||||
- Select features based on statistical tests
|
||||
- Tests: f_classif, chi2, mutual_info_classif
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.feature_selection import SelectKBest, f_classif
|
||||
|
||||
# Select top 10 features
|
||||
selector = SelectKBest(score_func=f_classif, k=10)
|
||||
X_selected = selector.fit_transform(X_train, y_train)
|
||||
|
||||
# Get selected feature indices
|
||||
selected_indices = selector.get_support(indices=True)
|
||||
```
|
||||
|
||||
### Wrapper Methods
|
||||
|
||||
**Recursive Feature Elimination (RFE)**
|
||||
- Recursively removes features
|
||||
- Uses model feature importances
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.feature_selection import RFE
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
|
||||
model = RandomForestClassifier(n_estimators=100, random_state=42)
|
||||
rfe = RFE(estimator=model, n_features_to_select=10, step=1)
|
||||
X_selected = rfe.fit_transform(X_train, y_train)
|
||||
|
||||
# Get selected features
|
||||
selected_features = rfe.support_
|
||||
feature_ranking = rfe.ranking_
|
||||
```
|
||||
|
||||
**RFECV (with Cross-Validation)**
|
||||
- RFE with cross-validation to find optimal number of features
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.feature_selection import RFECV
|
||||
|
||||
model = RandomForestClassifier(n_estimators=100, random_state=42)
|
||||
rfecv = RFECV(estimator=model, cv=5, scoring='accuracy')
|
||||
X_selected = rfecv.fit_transform(X_train, y_train)
|
||||
|
||||
print(f"Optimal number of features: {rfecv.n_features_}")
|
||||
```
|
||||
|
||||
### Embedded Methods
|
||||
|
||||
**SelectFromModel**
|
||||
- Select features based on model coefficients/importances
|
||||
- Works with: Linear models (L1), Tree-based models
|
||||
- Example:
|
||||
```python
|
||||
from sklearn.feature_selection import SelectFromModel
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
|
||||
model = RandomForestClassifier(n_estimators=100, random_state=42)
|
||||
selector = SelectFromModel(model, threshold='median')
|
||||
selector.fit(X_train, y_train)
|
||||
X_selected = selector.transform(X_train)

# Get selected features
selected_features = selector.get_support()
```

**L1-based Feature Selection**
```python
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
selector = SelectFromModel(model)
selector.fit(X_train, y_train)
X_selected = selector.transform(X_train)
```

## Handling Outliers

### IQR Method

```python
import numpy as np

Q1 = np.percentile(X, 25, axis=0)
Q3 = np.percentile(X, 75, axis=0)
IQR = Q3 - Q1

# Define outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers
mask = np.all((X >= lower_bound) & (X <= upper_bound), axis=1)
X_no_outliers = X[mask]
```

### Winsorization

```python
from scipy.stats import mstats

# Clip outliers at 5th and 95th percentiles
X_winsorized = mstats.winsorize(X, limits=[0.05, 0.05], axis=0)
```

## Custom Transformers

### Using FunctionTransformer

```python
from sklearn.preprocessing import FunctionTransformer
import numpy as np

def log_transform(X):
    return np.log1p(X)

transformer = FunctionTransformer(log_transform, inverse_func=np.expm1)
X_transformed = transformer.fit_transform(X)
```

### Creating Custom Transformer

```python
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, parameter=1):
        self.parameter = parameter

    def fit(self, X, y=None):
        # Learn parameters from X if needed
        return self

    def transform(self, X):
        # Transform X
        return X * self.parameter

transformer = CustomTransformer(parameter=2)
X_transformed = transformer.fit_transform(X)
```

## Best Practices

### Fit on Training Data Only
Always fit transformers on training data only:
```python
# Correct
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Wrong - causes data leakage
scaler = StandardScaler()
X_all_scaled = scaler.fit_transform(np.vstack([X_train, X_test]))
```

### Use Pipelines
Combine preprocessing with models:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

pipeline.fit(X_train, y_train)
```

### Handle Categorical and Numerical Separately
Use ColumnTransformer:
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_features = ['age', 'income']
categorical_features = ['gender', 'occupation']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ]
)

X_transformed = preprocessor.fit_transform(X)
```

### Algorithm-Specific Requirements

**Require Scaling:**
- SVM, KNN, Neural Networks
- PCA, Linear/Logistic Regression with regularization
- K-Means clustering

**Don't Require Scaling:**
- Tree-based models (Decision Trees, Random Forest, Gradient Boosting)
- Naive Bayes

**Encoding Requirements:**
- Linear models, SVM, KNN: One-hot encoding for nominal features
- Tree-based models: Can handle ordinal encoding directly
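
A minimal sketch of the scaling requirements above, assuming synthetic data from `make_classification`; the rescaled feature and the particular models are illustrative choices, not part of the original text:

```python
# Illustrative sketch: scaling-sensitive (SVM) vs. scale-invariant (Random Forest) models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X = X * np.array([1, 1000, 1, 1, 1, 1, 1, 1, 1, 1])  # put one feature on a much larger scale

# SVM is distance-based, so scaling typically changes its results
svm_raw = cross_val_score(SVC(), X, y, cv=5).mean()
svm_scaled = cross_val_score(
    Pipeline([('scaler', StandardScaler()), ('svc', SVC())]), X, y, cv=5
).mean()

# Random Forest splits on thresholds, so scaling should leave it essentially unchanged
rf_raw = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5).mean()

print(f"SVM without scaling:  {svm_raw:.3f}")
print(f"SVM with scaling:     {svm_scaled:.3f}")
print(f"Random Forest, raw X: {rf_raw:.3f}")
```
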
433
skills/scikit-learn/references/quick_reference.md
Normal file
@@ -0,0 +1,433 @@
# Scikit-learn Quick Reference

## Common Import Patterns

```python
# Core scikit-learn
import sklearn

# Data splitting and cross-validation
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

# Preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Feature selection
from sklearn.feature_selection import SelectKBest, RFE

# Supervised learning
from sklearn.linear_model import LogisticRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor
from sklearn.svm import SVC, SVR
from sklearn.tree import DecisionTreeClassifier

# Unsupervised learning
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.decomposition import PCA, NMF

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    mean_squared_error, r2_score, confusion_matrix, classification_report
)

# Pipeline
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer

# Utilities
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```

## Installation

```bash
# Using uv (recommended)
uv pip install scikit-learn

# Optional dependencies
uv pip install scikit-learn[plots]  # For plotting utilities
uv pip install pandas numpy matplotlib seaborn  # Common companions
```

## Quick Workflow Templates

### Classification Pipeline

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Preprocess
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

# Evaluate
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```

### Regression Pipeline

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Preprocess and train
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = GradientBoostingRegressor(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

# Evaluate
y_pred = model.predict(X_test_scaled)
print(f"RMSE: {mean_squared_error(y_test, y_pred, squared=False):.3f}")
|
||||
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
|
||||
```
|
||||
|
||||
### Cross-Validation
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import cross_val_score
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
|
||||
model = RandomForestClassifier(n_estimators=100, random_state=42)
|
||||
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
|
||||
print(f"CV Accuracy: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
|
||||
```
|
||||
|
||||
### Complete Pipeline with Mixed Data Types
|
||||
|
||||
```python
|
||||
from sklearn.pipeline import Pipeline
|
||||
from sklearn.compose import ColumnTransformer
|
||||
from sklearn.preprocessing import StandardScaler, OneHotEncoder
|
||||
from sklearn.impute import SimpleImputer
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
|
||||
# Define feature types
|
||||
numeric_features = ['age', 'income']
|
||||
categorical_features = ['gender', 'occupation']
|
||||
|
||||
# Create preprocessing pipelines
|
||||
numeric_transformer = Pipeline([
|
||||
('imputer', SimpleImputer(strategy='median')),
|
||||
('scaler', StandardScaler())
|
||||
])
|
||||
|
||||
categorical_transformer = Pipeline([
|
||||
('imputer', SimpleImputer(strategy='most_frequent')),
|
||||
('onehot', OneHotEncoder(handle_unknown='ignore'))
|
||||
])
|
||||
|
||||
# Combine transformers
|
||||
preprocessor = ColumnTransformer([
|
||||
('num', numeric_transformer, numeric_features),
|
||||
('cat', categorical_transformer, categorical_features)
|
||||
])
|
||||
|
||||
# Full pipeline
|
||||
model = Pipeline([
|
||||
('preprocessor', preprocessor),
|
||||
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
|
||||
])
|
||||
|
||||
# Fit and predict
|
||||
model.fit(X_train, y_train)
|
||||
y_pred = model.predict(X_test)
|
||||
```
|
||||
|
||||
### Hyperparameter Tuning
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import GridSearchCV
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
|
||||
param_grid = {
|
||||
'n_estimators': [100, 200, 300],
|
||||
'max_depth': [10, 20, None],
|
||||
'min_samples_split': [2, 5, 10]
|
||||
}
|
||||
|
||||
model = RandomForestClassifier(random_state=42)
|
||||
grid_search = GridSearchCV(
|
||||
model, param_grid, cv=5, scoring='accuracy', n_jobs=-1
|
||||
)
|
||||
|
||||
grid_search.fit(X_train, y_train)
|
||||
print(f"Best params: {grid_search.best_params_}")
|
||||
print(f"Best score: {grid_search.best_score_:.3f}")
|
||||
|
||||
# Use best model
|
||||
best_model = grid_search.best_estimator_
|
||||
```
|
||||
|
||||
## Common Patterns
|
||||
|
||||
### Loading Data
|
||||
|
||||
```python
|
||||
# From scikit-learn datasets
|
||||
from sklearn.datasets import load_iris, load_digits, make_classification
|
||||
|
||||
# Built-in datasets
|
||||
iris = load_iris()
|
||||
X, y = iris.data, iris.target
|
||||
|
||||
# Synthetic data
|
||||
X, y = make_classification(
|
||||
n_samples=1000, n_features=20, n_classes=2, random_state=42
|
||||
)
|
||||
|
||||
# From pandas
|
||||
import pandas as pd
|
||||
df = pd.read_csv('data.csv')
|
||||
X = df.drop('target', axis=1)
|
||||
y = df['target']
|
||||
```
|
||||
|
||||
### Handling Imbalanced Data
|
||||
|
||||
```python
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
|
||||
# Use class_weight parameter
|
||||
model = RandomForestClassifier(class_weight='balanced', random_state=42)
|
||||
model.fit(X_train, y_train)
|
||||
|
||||
# Or use appropriate metrics
|
||||
from sklearn.metrics import balanced_accuracy_score, f1_score
|
||||
print(f"Balanced Accuracy: {balanced_accuracy_score(y_test, y_pred):.3f}")
|
||||
print(f"F1 Score: {f1_score(y_test, y_pred):.3f}")
|
||||
```
|
||||
|
||||
### Feature Importance
|
||||
|
||||
```python
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
import pandas as pd
|
||||
|
||||
model = RandomForestClassifier(n_estimators=100, random_state=42)
|
||||
model.fit(X_train, y_train)
|
||||
|
||||
# Get feature importances
|
||||
importances = pd.DataFrame({
|
||||
'feature': feature_names,
|
||||
'importance': model.feature_importances_
|
||||
}).sort_values('importance', ascending=False)
|
||||
|
||||
print(importances.head(10))
|
||||
```
|
||||
|
||||
### Clustering
|
||||
|
||||
```python
|
||||
from sklearn.cluster import KMeans
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
|
||||
# Scale data first
|
||||
scaler = StandardScaler()
|
||||
X_scaled = scaler.fit_transform(X)
|
||||
|
||||
# Fit K-Means
|
||||
kmeans = KMeans(n_clusters=3, random_state=42)
|
||||
labels = kmeans.fit_predict(X_scaled)
|
||||
|
||||
# Evaluate
|
||||
from sklearn.metrics import silhouette_score
|
||||
score = silhouette_score(X_scaled, labels)
|
||||
print(f"Silhouette Score: {score:.3f}")
|
||||
```
|
||||
|
||||
### Dimensionality Reduction
|
||||
|
||||
```python
|
||||
from sklearn.decomposition import PCA
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
# Fit PCA
|
||||
pca = PCA(n_components=2)
|
||||
X_reduced = pca.fit_transform(X)
|
||||
|
||||
# Plot
|
||||
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='viridis')
|
||||
plt.xlabel('PC1')
|
||||
plt.ylabel('PC2')
|
||||
plt.title(f'PCA (explained variance: {pca.explained_variance_ratio_.sum():.2%})')
|
||||
```
|
||||
|
||||
### Model Persistence
|
||||
|
||||
```python
|
||||
import joblib
|
||||
|
||||
# Save model
|
||||
joblib.dump(model, 'model.pkl')
|
||||
|
||||
# Load model
|
||||
loaded_model = joblib.load('model.pkl')
|
||||
predictions = loaded_model.predict(X_new)
|
||||
```
|
||||
|
||||
## Common Gotchas and Solutions
|
||||
|
||||
### Data Leakage
|
||||
```python
|
||||
# WRONG: Fitting scaler on all data
|
||||
scaler = StandardScaler()
|
||||
X_scaled = scaler.fit_transform(X)
|
||||
X_train, X_test = train_test_split(X_scaled)
|
||||
|
||||
# RIGHT: Fit on training data only
|
||||
X_train, X_test = train_test_split(X)
|
||||
scaler = StandardScaler()
|
||||
X_train_scaled = scaler.fit_transform(X_train)
|
||||
X_test_scaled = scaler.transform(X_test)
|
||||
|
||||
# BEST: Use Pipeline
|
||||
from sklearn.pipeline import Pipeline
|
||||
pipeline = Pipeline([
|
||||
('scaler', StandardScaler()),
|
||||
('model', LogisticRegression())
|
||||
])
|
||||
pipeline.fit(X_train, y_train) # No leakage!
|
||||
```
|
||||
|
||||
### Stratified Splitting for Classification
|
||||
```python
|
||||
# Always use stratify for classification
|
||||
X_train, X_test, y_train, y_test = train_test_split(
|
||||
X, y, test_size=0.2, stratify=y, random_state=42
|
||||
)
|
||||
```
|
||||
|
||||
### Random State for Reproducibility
|
||||
```python
|
||||
# Set random_state for reproducibility
|
||||
model = RandomForestClassifier(n_estimators=100, random_state=42)
|
||||
```
|
||||
|
||||
### Handling Unknown Categories
|
||||
```python
|
||||
# Use handle_unknown='ignore' for OneHotEncoder
|
||||
encoder = OneHotEncoder(handle_unknown='ignore')
|
||||
```
|
||||
|
||||
### Feature Names with Pipelines
|
||||
```python
|
||||
# Get feature names after transformation
|
||||
preprocessor.fit(X_train)
|
||||
feature_names = preprocessor.get_feature_names_out()
|
||||
```
|
||||
|
||||
## Cheat Sheet: Algorithm Selection
|
||||
|
||||
### Classification
|
||||
|
||||
| Problem | Algorithm | When to Use |
|
||||
|---------|-----------|-------------|
|
||||
| Binary/Multiclass | Logistic Regression | Fast baseline, interpretability |
|
||||
| Binary/Multiclass | Random Forest | Good default, robust |
|
||||
| Binary/Multiclass | Gradient Boosting | Best accuracy, willing to tune |
|
||||
| Binary/Multiclass | SVM | Small data, complex boundaries |
|
||||
| Binary/Multiclass | Naive Bayes | Text classification, fast |
|
||||
| High dimensions | Linear SVM or Logistic | Text, many features |
|
||||
|
||||
### Regression
|
||||
|
||||
| Problem | Algorithm | When to Use |
|
||||
|---------|-----------|-------------|
|
||||
| Continuous target | Linear Regression | Fast baseline, interpretability |
|
||||
| Continuous target | Ridge/Lasso | Regularization needed |
|
||||
| Continuous target | Random Forest | Good default, non-linear |
|
||||
| Continuous target | Gradient Boosting | Best accuracy |
|
||||
| Continuous target | SVR | Small data, non-linear |
|
||||
|
||||
### Clustering
|
||||
|
||||
| Problem | Algorithm | When to Use |
|
||||
|---------|-----------|-------------|
|
||||
| Known K, spherical | K-Means | Fast, simple |
|
||||
| Unknown K, arbitrary shapes | DBSCAN | Noise/outliers present |
|
||||
| Hierarchical structure | Agglomerative | Need dendrogram |
|
||||
| Soft clustering | Gaussian Mixture | Probability estimates |
|
||||
|
||||
### Dimensionality Reduction
|
||||
|
||||
| Problem | Algorithm | When to Use |
|
||||
|---------|-----------|-------------|
|
||||
| Linear reduction | PCA | Variance explanation |
|
||||
| Visualization | t-SNE | 2D/3D plots |
|
||||
| Non-negative data | NMF | Images, text |
|
||||
| Sparse data | TruncatedSVD | Text, recommender systems |
|
||||
|
||||
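A quick way to use the cheat sheet is to benchmark a few candidates from the relevant table with cross-validation before committing to one family. A minimal sketch, assuming synthetic data; the candidate list and settings are illustrative:

```python
# Benchmark a few cheat-sheet candidates with cross-validation on the same data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

candidates = {
    'Logistic Regression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'SVM (RBF)': make_pipeline(StandardScaler(), SVC()),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name:20s} {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```
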
## Performance Tips

### Speed Up Training
```python
# Use n_jobs=-1 for parallel processing
model = RandomForestClassifier(n_estimators=100, n_jobs=-1)

# Use warm_start for incremental learning
model = RandomForestClassifier(n_estimators=100, warm_start=True)
model.fit(X, y)
model.n_estimators += 50
model.fit(X, y)  # Adds 50 more trees

# Use partial_fit for online learning
from sklearn.linear_model import SGDClassifier
model = SGDClassifier()
for X_batch, y_batch in batches:
    model.partial_fit(X_batch, y_batch, classes=np.unique(y))
```

### Memory Efficiency
```python
# Use sparse matrices
from scipy.sparse import csr_matrix
X_sparse = csr_matrix(X)

# Use MiniBatchKMeans for large data
from sklearn.cluster import MiniBatchKMeans
model = MiniBatchKMeans(n_clusters=8, batch_size=100)
```

## Version Check

```python
import sklearn
print(f"scikit-learn version: {sklearn.__version__}")
```

## Useful Resources

- Official Documentation: https://scikit-learn.org/stable/
- User Guide: https://scikit-learn.org/stable/user_guide.html
- API Reference: https://scikit-learn.org/stable/api/index.html
- Examples: https://scikit-learn.org/stable/auto_examples/index.html
- Tutorials: https://scikit-learn.org/stable/tutorial/index.html
378
skills/scikit-learn/references/supervised_learning.md
Normal file
@@ -0,0 +1,378 @@
# Supervised Learning Reference

## Overview

Supervised learning algorithms learn from labeled training data to make predictions on new data. Scikit-learn provides comprehensive implementations for both classification and regression tasks.

## Linear Models

### Regression

**Linear Regression (`sklearn.linear_model.LinearRegression`)**
- Ordinary least squares regression
- Fast, interpretable, no hyperparameters
- Use when: Linear relationships, interpretability matters
- Example:
```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

**Ridge Regression (`sklearn.linear_model.Ridge`)**
- L2 regularization to prevent overfitting
- Key parameter: `alpha` (regularization strength, default=1.0)
- Use when: Multicollinearity present, need regularization
- Example:
```python
from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
```

**Lasso (`sklearn.linear_model.Lasso`)**
- L1 regularization with feature selection
- Key parameter: `alpha` (regularization strength)
- Use when: Want sparse models, feature selection
- Can reduce some coefficients to exactly zero
- Example:
```python
from sklearn.linear_model import Lasso

model = Lasso(alpha=0.1)
model.fit(X_train, y_train)
# Check which features were selected
print(f"Non-zero coefficients: {sum(model.coef_ != 0)}")
```

**ElasticNet (`sklearn.linear_model.ElasticNet`)**
- Combines L1 and L2 regularization
- Key parameters: `alpha`, `l1_ratio` (0=Ridge, 1=Lasso)
- Use when: Need both feature selection and regularization
- Example:
```python
from sklearn.linear_model import ElasticNet

model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X_train, y_train)
```

### Classification

**Logistic Regression (`sklearn.linear_model.LogisticRegression`)**
- Binary and multiclass classification
- Key parameters: `C` (inverse regularization), `penalty` ('l1', 'l2', 'elasticnet')
- Returns probability estimates
- Use when: Need probabilistic predictions, interpretability
- Example:
```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X_train, y_train)
probas = model.predict_proba(X_test)
```

**Stochastic Gradient Descent (SGD)**
- `SGDClassifier`, `SGDRegressor`
- Efficient for large-scale learning
- Key parameters: `loss`, `penalty`, `alpha`, `learning_rate`
- Use when: Very large datasets (>10^4 samples)
- Example:
```python
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3)
model.fit(X_train, y_train)
```

## Support Vector Machines

**SVC (`sklearn.svm.SVC`)**
- Classification with kernel methods
- Key parameters: `C`, `kernel` ('linear', 'rbf', 'poly'), `gamma`
- Use when: Small to medium datasets, complex decision boundaries
- Note: Does not scale well to large datasets
- Example:
```python
from sklearn.svm import SVC

# Linear kernel for linearly separable data
model_linear = SVC(kernel='linear', C=1.0)

# RBF kernel for non-linear data
model_rbf = SVC(kernel='rbf', C=1.0, gamma='scale')
model_rbf.fit(X_train, y_train)
```

**SVR (`sklearn.svm.SVR`)**
- Regression with kernel methods
- Similar parameters to SVC
- Additional parameter: `epsilon` (tube width)
- Example:
```python
from sklearn.svm import SVR

model = SVR(kernel='rbf', C=1.0, epsilon=0.1)
model.fit(X_train, y_train)
```

## Decision Trees

**DecisionTreeClassifier / DecisionTreeRegressor**
- Non-parametric model learning decision rules
- Key parameters:
  - `max_depth`: Maximum tree depth (prevents overfitting)
  - `min_samples_split`: Minimum samples to split a node
  - `min_samples_leaf`: Minimum samples in leaf
  - `criterion`: 'gini', 'entropy' for classification; 'squared_error', 'absolute_error' for regression
- Use when: Need interpretable model, non-linear relationships, mixed feature types
- Prone to overfitting - use ensembles or pruning
- Example:
```python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=20,
    min_samples_leaf=10,
    criterion='gini'
)
model.fit(X_train, y_train)

# Visualize the tree
from sklearn.tree import plot_tree
plot_tree(model, feature_names=feature_names, class_names=class_names)
```

## Ensemble Methods

### Random Forests

**RandomForestClassifier / RandomForestRegressor**
- Ensemble of decision trees with bagging
- Key parameters:
  - `n_estimators`: Number of trees (default=100)
  - `max_depth`: Maximum tree depth
  - `max_features`: Features to consider for splits ('sqrt', 'log2', or int)
  - `min_samples_split`, `min_samples_leaf`: Control tree growth
- Use when: High accuracy needed, can afford computation
- Provides feature importance
- Example:
```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    max_features='sqrt',
    n_jobs=-1  # Use all CPU cores
)
model.fit(X_train, y_train)

# Feature importance
importances = model.feature_importances_
```

### Gradient Boosting

**GradientBoostingClassifier / GradientBoostingRegressor**
- Sequential ensemble building trees on residuals
- Key parameters:
  - `n_estimators`: Number of boosting stages
  - `learning_rate`: Shrinks contribution of each tree
  - `max_depth`: Depth of individual trees (typically 3-5)
  - `subsample`: Fraction of samples for training each tree
- Use when: Need high accuracy, can afford training time
- Often achieves best performance
- Example:
```python
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8
)
model.fit(X_train, y_train)
```

**HistGradientBoostingClassifier / HistGradientBoostingRegressor**
- Faster gradient boosting with histogram-based algorithm
- Native support for missing values and categorical features
- Key parameters: Similar to GradientBoosting
- Use when: Large datasets, need faster training
- Example:
```python
from sklearn.ensemble import HistGradientBoostingClassifier

model = HistGradientBoostingClassifier(
    max_iter=100,
    learning_rate=0.1,
    max_depth=None,  # No limit by default
    categorical_features='from_dtype'  # Auto-detect categorical
)
model.fit(X_train, y_train)
```

### Other Ensemble Methods

**AdaBoost**
- Adaptive boosting focusing on misclassified samples
- Key parameters: `n_estimators`, `learning_rate`, `estimator` (base estimator)
- Use when: Simple boosting approach needed
- Example:
```python
from sklearn.ensemble import AdaBoostClassifier

model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0)
model.fit(X_train, y_train)
```

**Voting Classifier / Regressor**
- Combines predictions from multiple models
- Types: 'hard' (majority vote) or 'soft' (average probabilities)
- Use when: Want to ensemble different model types
- Example:
```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

model = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression()),
        ('dt', DecisionTreeClassifier()),
        ('svc', SVC(probability=True))
    ],
    voting='soft'
)
model.fit(X_train, y_train)
```

**Stacking Classifier / Regressor**
- Trains a meta-model on predictions from base models
- More sophisticated than voting
- Key parameter: `final_estimator` (meta-learner)
- Example:
```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

model = StackingClassifier(
    estimators=[
        ('dt', DecisionTreeClassifier()),
        ('svc', SVC())
    ],
    final_estimator=LogisticRegression()
)
model.fit(X_train, y_train)
```

## K-Nearest Neighbors

**KNeighborsClassifier / KNeighborsRegressor**
- Non-parametric method based on distance
- Key parameters:
  - `n_neighbors`: Number of neighbors (default=5)
  - `weights`: 'uniform' or 'distance'
  - `metric`: Distance metric ('euclidean', 'manhattan', etc.)
- Use when: Small dataset, simple baseline needed
- Slow prediction on large datasets
- Example:
```python
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5, weights='distance')
model.fit(X_train, y_train)
```

## Naive Bayes

**GaussianNB, MultinomialNB, BernoulliNB**
- Probabilistic classifiers based on Bayes' theorem
- Fast training and prediction
- GaussianNB: Continuous features (assumes Gaussian distribution)
- MultinomialNB: Count features (text classification)
- BernoulliNB: Binary features
- Use when: Text classification, fast baseline, probabilistic predictions
- Example:
```python
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# For continuous features
model_gaussian = GaussianNB()

# For text/count data
model_multinomial = MultinomialNB(alpha=1.0)  # alpha is smoothing parameter
model_multinomial.fit(X_train, y_train)
```

## Neural Networks

**MLPClassifier / MLPRegressor**
- Multi-layer perceptron (feedforward neural network)
- Key parameters:
  - `hidden_layer_sizes`: Tuple of hidden layer sizes, e.g., (100, 50)
  - `activation`: 'relu', 'tanh', 'logistic'
  - `solver`: 'adam', 'sgd', 'lbfgs'
  - `alpha`: L2 regularization parameter
  - `learning_rate`: 'constant', 'adaptive'
- Use when: Complex non-linear patterns, large datasets
- Requires feature scaling
- Example:
```python
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Scale features first
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

model = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    activation='relu',
    solver='adam',
    alpha=0.0001,
    max_iter=1000
)
model.fit(X_train_scaled, y_train)
```

## Algorithm Selection Guide

### Choose based on:

**Dataset size:**
- Small (<1k samples): KNN, SVM, Decision Trees
- Medium (1k-100k): Random Forest, Gradient Boosting, Linear Models
- Large (>100k): SGD, Linear Models, HistGradientBoosting

**Interpretability:**
- High: Linear Models, Decision Trees
- Medium: Random Forest (feature importance)
- Low: SVM with RBF kernel, Neural Networks

**Accuracy vs Speed:**
- Fast training: Naive Bayes, Linear Models, KNN
- High accuracy: Gradient Boosting, Random Forest, Stacking
- Fast prediction: Linear Models, Naive Bayes
- Slow prediction: KNN (on large datasets), SVM

**Feature types:**
- Continuous: Most algorithms work well
- Categorical: Trees, HistGradientBoosting (native support)
- Mixed: Trees, Gradient Boosting
- Text: Naive Bayes, Linear Models with TF-IDF

**Common starting points:**
1. Logistic Regression (classification) / Linear Regression (regression) - fast baseline
2. Random Forest - good default choice
3. Gradient Boosting - optimize for best accuracy
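
A minimal sketch of the starting-point progression above, assuming synthetic data; the parameter grid is an illustrative placeholder, not a recommendation from the original text:

```python
# 1) fast linear baseline, 2) default Random Forest, 3) tuned Gradient Boosting.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# 1. Fast linear baseline
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(f"Baseline (LogReg):       {cross_val_score(baseline, X, y, cv=5).mean():.3f}")

# 2. Good default
forest = RandomForestClassifier(n_estimators=100, random_state=42)
print(f"Default Random Forest:   {cross_val_score(forest, X, y, cv=5).mean():.3f}")

# 3. Tune gradient boosting for the last bit of accuracy
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid={'learning_rate': [0.05, 0.1], 'max_depth': [2, 3]},
    cv=5, n_jobs=-1
)
grid.fit(X, y)
print(f"Tuned Gradient Boosting: {grid.best_score_:.3f} with {grid.best_params_}")
```
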
505
skills/scikit-learn/references/unsupervised_learning.md
Normal file
@@ -0,0 +1,505 @@
# Unsupervised Learning Reference

## Overview

Unsupervised learning discovers patterns in unlabeled data through clustering, dimensionality reduction, and density estimation.

## Clustering

### K-Means

**KMeans (`sklearn.cluster.KMeans`)**
- Partition-based clustering into K clusters
- Key parameters:
  - `n_clusters`: Number of clusters to form
  - `init`: Initialization method ('k-means++', 'random')
  - `n_init`: Number of initializations (default=10)
  - `max_iter`: Maximum iterations
- Use when: Know number of clusters, spherical cluster shapes
- Fast and scalable
- Example:
```python
from sklearn.cluster import KMeans

model = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
labels = model.fit_predict(X)
centers = model.cluster_centers_

# Inertia (sum of squared distances to nearest center)
print(f"Inertia: {model.inertia_}")
```

**MiniBatchKMeans**
- Faster K-Means using mini-batches
- Use when: Large datasets, need faster training
- Slightly less accurate than K-Means
- Example:
```python
from sklearn.cluster import MiniBatchKMeans

model = MiniBatchKMeans(n_clusters=3, batch_size=100, random_state=42)
labels = model.fit_predict(X)
```

### Density-Based Clustering

**DBSCAN (`sklearn.cluster.DBSCAN`)**
- Density-Based Spatial Clustering
- Key parameters:
  - `eps`: Maximum distance between two samples to be neighbors
  - `min_samples`: Minimum samples in neighborhood to form core point
  - `metric`: Distance metric
- Use when: Arbitrary cluster shapes, presence of noise/outliers
- Automatically determines number of clusters
- Labels noise points as -1
- Example:
```python
from sklearn.cluster import DBSCAN

model = DBSCAN(eps=0.5, min_samples=5, metric='euclidean')
labels = model.fit_predict(X)

# Number of clusters (excluding noise)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Clusters: {n_clusters}, Noise points: {n_noise}")
```

**HDBSCAN (`sklearn.cluster.HDBSCAN`)**
- Hierarchical DBSCAN with adaptive epsilon
- More robust than DBSCAN
- Key parameter: `min_cluster_size`
- Use when: Varying density clusters
- Example:
```python
from sklearn.cluster import HDBSCAN

model = HDBSCAN(min_cluster_size=10, min_samples=5)
labels = model.fit_predict(X)
```

**OPTICS (`sklearn.cluster.OPTICS`)**
- Ordering points to identify clustering structure
- Similar to DBSCAN but doesn't require eps parameter
- Key parameters: `min_samples`, `max_eps`
- Use when: Varying density, exploratory analysis
- Example:
```python
from sklearn.cluster import OPTICS

model = OPTICS(min_samples=5, max_eps=0.5)
labels = model.fit_predict(X)
```

### Hierarchical Clustering

**AgglomerativeClustering**
- Bottom-up hierarchical clustering
- Key parameters:
  - `n_clusters`: Number of clusters (or use `distance_threshold`)
  - `linkage`: 'ward', 'complete', 'average', 'single'
  - `metric`: Distance metric
- Use when: Need dendrogram, hierarchical structure important
- Example:
```python
from sklearn.cluster import AgglomerativeClustering

model = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = model.fit_predict(X)

# Create dendrogram using scipy
from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(X, method='ward')
dendrogram(Z)
```

### Other Clustering Methods

**MeanShift**
- Finds clusters by shifting points toward mode of density
- Automatically determines number of clusters
- Key parameter: `bandwidth`
- Use when: Don't know number of clusters, arbitrary shapes
- Example:
```python
from sklearn.cluster import MeanShift, estimate_bandwidth

# Estimate bandwidth
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500)
model = MeanShift(bandwidth=bandwidth)
labels = model.fit_predict(X)
```

**SpectralClustering**
- Uses graph-based approach with eigenvalues
- Key parameters: `n_clusters`, `affinity` ('rbf', 'nearest_neighbors')
- Use when: Non-convex clusters, graph structure
- Example:
```python
from sklearn.cluster import SpectralClustering

model = SpectralClustering(n_clusters=3, affinity='rbf', random_state=42)
labels = model.fit_predict(X)
```

**AffinityPropagation**
- Finds exemplars by message passing
- Automatically determines number of clusters
- Key parameters: `damping`, `preference`
- Use when: Don't know number of clusters
- Example:
```python
from sklearn.cluster import AffinityPropagation

model = AffinityPropagation(damping=0.9, random_state=42)
labels = model.fit_predict(X)
n_clusters = len(model.cluster_centers_indices_)
```

**BIRCH**
- Balanced Iterative Reducing and Clustering using Hierarchies
- Memory efficient for large datasets
- Key parameters: `n_clusters`, `threshold`, `branching_factor`
- Use when: Very large datasets
- Example:
```python
from sklearn.cluster import Birch

model = Birch(n_clusters=3, threshold=0.5)
labels = model.fit_predict(X)
```

### Clustering Evaluation

**Metrics when ground truth is known:**
```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.metrics import adjusted_mutual_info_score, fowlkes_mallows_score

# Compare predicted labels with true labels
ari = adjusted_rand_score(y_true, y_pred)
nmi = normalized_mutual_info_score(y_true, y_pred)
ami = adjusted_mutual_info_score(y_true, y_pred)
fmi = fowlkes_mallows_score(y_true, y_pred)
```

**Metrics without ground truth:**
```python
from sklearn.metrics import silhouette_score, calinski_harabasz_score
from sklearn.metrics import davies_bouldin_score

# Silhouette: [-1, 1], higher is better
silhouette = silhouette_score(X, labels)

# Calinski-Harabasz: higher is better
ch_score = calinski_harabasz_score(X, labels)

# Davies-Bouldin: lower is better
db_score = davies_bouldin_score(X, labels)
```

**Elbow method for K-Means:**
```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertias = []
K_range = range(2, 11)
for k in K_range:
    model = KMeans(n_clusters=k, random_state=42)
    model.fit(X)
    inertias.append(model.inertia_)

plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method')
```

## Dimensionality Reduction

### Principal Component Analysis (PCA)

**PCA (`sklearn.decomposition.PCA`)**
- Linear dimensionality reduction using SVD
- Key parameters:
  - `n_components`: Number of components (int or float for explained variance)
  - `whiten`: Whiten components to unit variance
- Use when: Linear relationships, want to explain variance
- Example:
```python
from sklearn.decomposition import PCA

# Keep components explaining 95% variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(f"Original dimensions: {X.shape[1]}")
print(f"Reduced dimensions: {X_reduced.shape[1]}")
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {pca.explained_variance_ratio_.sum()}")

# Or specify exact number of components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
```

**IncrementalPCA**
- PCA for large datasets that don't fit in memory
- Processes data in batches
- Key parameter: `n_components`, `batch_size`
- Example:
```python
from sklearn.decomposition import IncrementalPCA

pca = IncrementalPCA(n_components=50, batch_size=100)
X_reduced = pca.fit_transform(X)
```

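When the data genuinely does not fit in memory, `partial_fit` can consume it batch by batch instead of `fit_transform`. A minimal sketch; the random batches below stand in for chunks read from disk (e.g. a memory-mapped array or CSV chunks), which is an illustrative assumption:

```python
# Out-of-core fitting with IncrementalPCA.partial_fit on successive batches.
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(42)
pca = IncrementalPCA(n_components=10)

for _ in range(20):                       # pretend each iteration is one chunk loaded from disk
    X_batch = rng.normal(size=(200, 50))  # 200 rows, 50 features per chunk
    pca.partial_fit(X_batch)

# Transform can then also be applied chunk by chunk
X_new = rng.normal(size=(100, 50))
X_new_reduced = pca.transform(X_new)
print(X_new_reduced.shape)  # (100, 10)
```
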
**KernelPCA**
- Non-linear dimensionality reduction using kernels
- Key parameters: `n_components`, `kernel` ('linear', 'poly', 'rbf', 'sigmoid')
- Use when: Non-linear relationships
- Example:
```python
from sklearn.decomposition import KernelPCA

pca = KernelPCA(n_components=2, kernel='rbf', gamma=0.1)
X_reduced = pca.fit_transform(X)
```

### Manifold Learning

**t-SNE (`sklearn.manifold.TSNE`)**
- t-distributed Stochastic Neighbor Embedding
- Excellent for 2D/3D visualization
- Key parameters:
  - `n_components`: Usually 2 or 3
  - `perplexity`: Balance between local and global structure (5-50)
  - `learning_rate`: Usually 10-1000
  - `n_iter`: Number of iterations (min 250)
- Use when: Visualizing high-dimensional data
- Note: Slow on large datasets, no transform() method
- Example:
```python
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, learning_rate=200, n_iter=1000, random_state=42)
X_embedded = tsne.fit_transform(X)

# Visualize
import matplotlib.pyplot as plt
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=labels, cmap='viridis')
plt.title('t-SNE visualization')
```

**UMAP (not in scikit-learn, but compatible)**
- Uniform Manifold Approximation and Projection
- Faster than t-SNE, preserves global structure better
- Install: `uv pip install umap-learn`
- Example:
```python
from umap import UMAP

reducer = UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_embedded = reducer.fit_transform(X)
```

**Isomap**
- Isometric Mapping
- Preserves geodesic distances
- Key parameters: `n_components`, `n_neighbors`
- Use when: Non-linear manifolds
- Example:
```python
from sklearn.manifold import Isomap

isomap = Isomap(n_components=2, n_neighbors=5)
X_embedded = isomap.fit_transform(X)
```

**Locally Linear Embedding (LLE)**
- Preserves local neighborhood structure
- Key parameters: `n_components`, `n_neighbors`
- Example:
```python
from sklearn.manifold import LocallyLinearEmbedding

lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
X_embedded = lle.fit_transform(X)
```

**MDS (Multidimensional Scaling)**
- Preserves pairwise distances
- Key parameter: `n_components`, `metric` (True/False)
- Example:
```python
from sklearn.manifold import MDS

mds = MDS(n_components=2, metric=True, random_state=42)
X_embedded = mds.fit_transform(X)
```

### Matrix Factorization

**NMF (Non-negative Matrix Factorization)**
- Factorizes into non-negative matrices
- Key parameters: `n_components`, `init` ('nndsvd', 'random')
- Use when: Data is non-negative (images, text)
- Interpretable components
- Example:
```python
from sklearn.decomposition import NMF

nmf = NMF(n_components=10, init='nndsvd', random_state=42)
W = nmf.fit_transform(X)  # Document-topic matrix
H = nmf.components_  # Topic-word matrix
```

**TruncatedSVD**
- SVD for sparse matrices
- Similar to PCA but works with sparse data
- Use when: Text data, sparse matrices
- Example:
```python
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=100, random_state=42)
X_reduced = svd.fit_transform(X_sparse)
print(f"Explained variance: {svd.explained_variance_ratio_.sum()}")
```

**FastICA**
- Independent Component Analysis
- Separates multivariate signal into independent components
- Key parameter: `n_components`
- Use when: Signal separation (e.g., audio, EEG)
- Example:
```python
from sklearn.decomposition import FastICA

ica = FastICA(n_components=10, random_state=42)
S = ica.fit_transform(X)  # Independent sources
A = ica.mixing_  # Mixing matrix
```

**LatentDirichletAllocation (LDA)**
- Topic modeling for text data
- Key parameters: `n_components` (number of topics), `learning_method` ('batch', 'online')
- Use when: Topic modeling, document clustering
- Example:
```python
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=10, random_state=42)
doc_topics = lda.fit_transform(X_counts)  # Document-topic distribution

# Get top words for each topic
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_words = [feature_names[i] for i in topic.argsort()[-10:]]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")
```

## Outlier and Novelty Detection

### Outlier Detection

**IsolationForest**
- Isolates anomalies using random trees
- Key parameters:
  - `contamination`: Expected proportion of outliers
  - `n_estimators`: Number of trees
- Use when: High-dimensional data, efficiency important
- Example:
```python
from sklearn.ensemble import IsolationForest

model = IsolationForest(contamination=0.1, random_state=42)
predictions = model.fit_predict(X)  # -1 for outliers, 1 for inliers
```

**LocalOutlierFactor**
- Measures local density deviation
- Key parameters: `n_neighbors`, `contamination`
- Use when: Varying density regions
- Example:
```python
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
predictions = lof.fit_predict(X)  # -1 for outliers, 1 for inliers
outlier_scores = lof.negative_outlier_factor_
```

**One-Class SVM**
- Learns decision boundary around normal data
- Key parameters: `nu` (upper bound on outliers), `kernel`, `gamma`
- Use when: Small training set of normal data
- Example:
```python
from sklearn.svm import OneClassSVM

model = OneClassSVM(nu=0.1, kernel='rbf', gamma='auto')
model.fit(X_train)
predictions = model.predict(X_test)  # -1 for outliers, 1 for inliers
```

**EllipticEnvelope**
- Assumes Gaussian distribution
- Key parameter: `contamination`
- Use when: Data is Gaussian-distributed
- Example:
```python
from sklearn.covariance import EllipticEnvelope

model = EllipticEnvelope(contamination=0.1, random_state=42)
predictions = model.fit_predict(X)
```

## Gaussian Mixture Models

**GaussianMixture**
- Probabilistic clustering with mixture of Gaussians
- Key parameters:
  - `n_components`: Number of mixture components
  - `covariance_type`: 'full', 'tied', 'diag', 'spherical'
- Use when: Soft clustering, need probability estimates
- Example:
```python
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm.fit(X)

# Predict cluster labels
labels = gmm.predict(X)

# Get probability of each cluster
probabilities = gmm.predict_proba(X)

# Information criteria for model selection
print(f"BIC: {gmm.bic(X)}")  # Lower is better
print(f"AIC: {gmm.aic(X)}")  # Lower is better
```

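The BIC/AIC values above can also drive the choice of `n_components`. A minimal sketch, assuming synthetic blob data; in practice substitute your own `X`:

```python
# Choose the number of mixture components by minimizing BIC.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

bic_scores = {}
for k in range(1, 8):
    gmm = GaussianMixture(n_components=k, covariance_type='full', random_state=42)
    gmm.fit(X)
    bic_scores[k] = gmm.bic(X)

best_k = min(bic_scores, key=bic_scores.get)
for k, bic in bic_scores.items():
    print(f"k={k}: BIC={bic:.1f}")
print(f"Best number of components by BIC: {best_k}")
```
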
## Choosing the Right Method

### Clustering:
- **Know K, spherical clusters**: K-Means
- **Arbitrary shapes, noise**: DBSCAN, HDBSCAN
- **Hierarchical structure**: AgglomerativeClustering
- **Very large data**: MiniBatchKMeans, BIRCH
- **Probabilistic**: GaussianMixture

### Dimensionality Reduction:
- **Linear, variance explanation**: PCA
- **Non-linear, visualization**: t-SNE, UMAP
- **Non-negative data**: NMF
- **Sparse data**: TruncatedSVD
- **Topic modeling**: LatentDirichletAllocation

### Outlier Detection:
- **High-dimensional**: IsolationForest
- **Varying density**: LocalOutlierFactor
- **Gaussian data**: EllipticEnvelope
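
To act on the clustering guidance above, it often helps to run a few candidates on the same scaled data and compare silhouette scores. A minimal sketch, assuming synthetic blob data; the candidate list and parameters are illustrative:

```python
# Compare a few clusterers from the guidance above using the silhouette score.
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=600, centers=3, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

candidates = {
    'KMeans (k=3)': KMeans(n_clusters=3, random_state=42),
    'DBSCAN': DBSCAN(eps=0.5, min_samples=5),
    'Agglomerative (ward)': AgglomerativeClustering(n_clusters=3),
}

for name, model in candidates.items():
    labels = model.fit_predict(X_scaled)
    n_found = len(set(labels)) - (1 if -1 in labels else 0)  # ignore DBSCAN noise label
    if n_found > 1:
        score = silhouette_score(X_scaled, labels)
        print(f"{name:22s} clusters={n_found}  silhouette={score:.3f}")
    else:
        print(f"{name:22s} found fewer than 2 clusters; silhouette undefined")
```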