# Supervised Learning Reference

## Overview

Supervised learning algorithms learn from labeled training data to make predictions on new data. Scikit-learn provides comprehensive implementations for both classification and regression tasks.
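The examples throughout this reference assume that `X_train`, `X_test`, `y_train`, and `y_test` already exist. A minimal sketch of the surrounding workflow (the synthetic dataset and the choice of estimator here are illustrative placeholders, not part of this reference):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data stands in for a real labeled dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out a test set to evaluate generalization
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)  # learn from labeled training data
print(accuracy_score(y_test, model.predict(X_test)))  # evaluate on unseen data
```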
## Linear Models

### Regression

**Linear Regression (`sklearn.linear_model.LinearRegression`)**
- Ordinary least squares regression
- Fast, interpretable, no hyperparameters
- Use when: Linear relationships, interpretability matters
- Example:

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

**Ridge Regression (`sklearn.linear_model.Ridge`)**
- L2 regularization to prevent overfitting
- Key parameter: `alpha` (regularization strength, default=1.0)
- Use when: Multicollinearity present, need regularization
- Example:

```python
from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
```

**Lasso (`sklearn.linear_model.Lasso`)**
- L1 regularization with feature selection
- Key parameter: `alpha` (regularization strength)
- Use when: Want sparse models, feature selection
- Can reduce some coefficients to exactly zero
- Example:

```python
from sklearn.linear_model import Lasso

model = Lasso(alpha=0.1)
model.fit(X_train, y_train)

# Check which features were selected
print(f"Non-zero coefficients: {sum(model.coef_ != 0)}")
```

**ElasticNet (`sklearn.linear_model.ElasticNet`)**
- Combines L1 and L2 regularization
- Key parameters: `alpha`, `l1_ratio` (0=Ridge, 1=Lasso)
- Use when: Need both feature selection and regularization
- Example:

```python
from sklearn.linear_model import ElasticNet

model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X_train, y_train)
```

### Classification

**Logistic Regression (`sklearn.linear_model.LogisticRegression`)**
- Binary and multiclass classification
- Key parameters: `C` (inverse of regularization strength), `penalty` ('l1', 'l2', 'elasticnet')
- Returns probability estimates
- Use when: Need probabilistic predictions, interpretability
- Example:

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X_train, y_train)
probas = model.predict_proba(X_test)
```

**Stochastic Gradient Descent (SGD)**
- `SGDClassifier`, `SGDRegressor`
- Efficient for large-scale learning
- Key parameters: `loss`, `penalty`, `alpha`, `learning_rate`
- Use when: Very large datasets (>10^4 samples)
- Example:

```python
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3)
model.fit(X_train, y_train)
```

## Support Vector Machines

**SVC (`sklearn.svm.SVC`)**
- Classification with kernel methods
- Key parameters: `C`, `kernel` ('linear', 'rbf', 'poly'), `gamma`
- Use when: Small to medium datasets, complex decision boundaries
- Note: Does not scale well to large datasets (see the `LinearSVC` sketch below)
- Example:

```python
from sklearn.svm import SVC

# Linear kernel for linearly separable data
model_linear = SVC(kernel='linear', C=1.0)

# RBF kernel for non-linear data
model_rbf = SVC(kernel='rbf', C=1.0, gamma='scale')
model_rbf.fit(X_train, y_train)
```
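Because `SVC` training scales poorly with the number of samples, one common workaround (a sketch beyond the original example, limited to linear decision boundaries) is `sklearn.svm.LinearSVC`:

```python
from sklearn.svm import LinearSVC

# LinearSVC uses a specialized linear solver that handles larger datasets
# than SVC's kernelized solver, at the cost of supporting only linear
# decision boundaries.
model = LinearSVC(C=1.0)
model.fit(X_train, y_train)
```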
**SVR (`sklearn.svm.SVR`)**
- Regression with kernel methods
- Similar parameters to SVC
- Additional parameter: `epsilon` (tube width)
- Example:

```python
from sklearn.svm import SVR

model = SVR(kernel='rbf', C=1.0, epsilon=0.1)
model.fit(X_train, y_train)
```

## Decision Trees

**DecisionTreeClassifier / DecisionTreeRegressor**
- Non-parametric model learning decision rules
- Key parameters:
  - `max_depth`: Maximum tree depth (prevents overfitting)
  - `min_samples_split`: Minimum samples to split a node
  - `min_samples_leaf`: Minimum samples in a leaf
  - `criterion`: 'gini', 'entropy' for classification; 'squared_error', 'absolute_error' for regression
- Use when: Need interpretable model, non-linear relationships, mixed feature types
- Prone to overfitting; use ensembles or pruning
- Example:

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

model = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=20,
    min_samples_leaf=10,
    criterion='gini'
)
model.fit(X_train, y_train)

# Visualize the tree (feature_names and class_names are lists of strings
# describing your dataset)
plot_tree(model, feature_names=feature_names, class_names=class_names)
plt.show()
```

## Ensemble Methods

### Random Forests

**RandomForestClassifier / RandomForestRegressor**
- Ensemble of decision trees with bagging
- Key parameters:
  - `n_estimators`: Number of trees (default=100)
  - `max_depth`: Maximum tree depth
  - `max_features`: Features to consider for splits ('sqrt', 'log2', or int)
  - `min_samples_split`, `min_samples_leaf`: Control tree growth
- Use when: High accuracy needed, can afford computation
- Provides feature importance
- Example:

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    max_features='sqrt',
    n_jobs=-1  # Use all CPU cores
)
model.fit(X_train, y_train)

# Feature importance
importances = model.feature_importances_
```

### Gradient Boosting

**GradientBoostingClassifier / GradientBoostingRegressor**
- Sequential ensemble building trees on residuals
- Key parameters:
  - `n_estimators`: Number of boosting stages
  - `learning_rate`: Shrinks the contribution of each tree
  - `max_depth`: Depth of individual trees (typically 3-5)
  - `subsample`: Fraction of samples used to train each tree
- Use when: Need high accuracy, can afford training time
- Often achieves best performance
- Example:

```python
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8
)
model.fit(X_train, y_train)
```

**HistGradientBoostingClassifier / HistGradientBoostingRegressor**
- Faster gradient boosting with a histogram-based algorithm
- Native support for missing values and categorical features
- Key parameters: Similar to GradientBoosting
- Use when: Large datasets, need faster training
- Example:

```python
from sklearn.ensemble import HistGradientBoostingClassifier

model = HistGradientBoostingClassifier(
    max_iter=100,
    learning_rate=0.1,
    max_depth=None,  # No limit by default
    categorical_features='from_dtype'  # Auto-detect categorical (scikit-learn >= 1.4)
)
model.fit(X_train, y_train)
```

### Other Ensemble Methods

**AdaBoost**
- Adaptive boosting focusing on misclassified samples
- Key parameters: `n_estimators`, `learning_rate`, `estimator` (base estimator)
- Use when: Simple boosting approach needed
- Example:

```python
from sklearn.ensemble import AdaBoostClassifier

model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0)
model.fit(X_train, y_train)
```

**Voting Classifier / Regressor**
- Combines predictions from multiple models
- Types: 'hard' (majority vote) or 'soft' (average probabilities)
- Use when: Want to ensemble different model types
- Example:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

model = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression()),
        ('dt', DecisionTreeClassifier()),
        ('svc', SVC(probability=True))  # probability=True is required for soft voting
    ],
    voting='soft'
)
model.fit(X_train, y_train)
```

**Stacking Classifier / Regressor**
- Trains a meta-model on predictions from base models
- More sophisticated than voting
- Key parameter: `final_estimator` (meta-learner)
- Example:

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

model = StackingClassifier(
    estimators=[
        ('dt', DecisionTreeClassifier()),
        ('svc', SVC())
    ],
    final_estimator=LogisticRegression()
)
model.fit(X_train, y_train)
```

## K-Nearest Neighbors

**KNeighborsClassifier / KNeighborsRegressor**
- Non-parametric method based on distance
- Key parameters:
  - `n_neighbors`: Number of neighbors (default=5)
  - `weights`: 'uniform' or 'distance'
  - `metric`: Distance metric ('euclidean', 'manhattan', etc.)
- Use when: Small dataset, simple baseline needed
- Slow prediction on large datasets
- Example:

```python
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5, weights='distance')
model.fit(X_train, y_train)
```

## Naive Bayes

**GaussianNB, MultinomialNB, BernoulliNB**
- Probabilistic classifiers based on Bayes' theorem
- Fast training and prediction
- GaussianNB: Continuous features (assumes Gaussian distribution)
- MultinomialNB: Count features (text classification)
- BernoulliNB: Binary features
- Use when: Text classification, fast baseline, probabilistic predictions
- Example:

```python
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# For continuous features
model_gaussian = GaussianNB()

# For text/count data
model_multinomial = MultinomialNB(alpha=1.0)  # alpha is the smoothing parameter
model_multinomial.fit(X_train, y_train)
```

## Neural Networks

**MLPClassifier / MLPRegressor**
- Multi-layer perceptron (feedforward neural network)
- Key parameters:
  - `hidden_layer_sizes`: Tuple of hidden layer sizes, e.g., (100, 50)
  - `activation`: 'relu', 'tanh', 'logistic'
  - `solver`: 'adam', 'sgd', 'lbfgs'
  - `alpha`: L2 regularization parameter
  - `learning_rate`: 'constant', 'adaptive'
- Use when: Complex non-linear patterns, large datasets
- Requires feature scaling (see the pipeline sketch after the example)
- Example:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Scale features first; the same scaler must be applied to test data
# before calling predict
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

model = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    activation='relu',
    solver='adam',
    alpha=0.0001,
    max_iter=1000
)
model.fit(X_train_scaled, y_train)
```
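Since scaling must be applied identically to training and test data, a common pattern (a sketch going beyond the original example) is to bundle the scaler and the network in a pipeline so `fit` and `predict` handle scaling automatically:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

# The pipeline fits the scaler on the training data only, then applies
# the same transformation at predict time, avoiding train/test leakage.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=1000)
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)  # X_test is scaled automatically
```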
## Algorithm Selection Guide

### Choose based on:

**Dataset size:**
- Small (<1k samples): KNN, SVM, Decision Trees
- Medium (1k-100k): Random Forest, Gradient Boosting, Linear Models
- Large (>100k): SGD, Linear Models, HistGradientBoosting

**Interpretability:**
- High: Linear Models, Decision Trees
- Medium: Random Forest (feature importance)
- Low: SVM with RBF kernel, Neural Networks

**Accuracy vs Speed:**
- Fast training: Naive Bayes, Linear Models, KNN
- High accuracy: Gradient Boosting, Random Forest, Stacking
- Fast prediction: Linear Models, Naive Bayes
- Slow prediction: KNN (on large datasets), SVM

**Feature types:**
- Continuous: Most algorithms work well
- Categorical: Trees, HistGradientBoosting (native support)
- Mixed: Trees, Gradient Boosting
- Text: Naive Bayes, Linear Models with TF-IDF

**Common starting points** (compared in the sketch below):
1. Logistic Regression (classification) / Linear Regression (regression) - fast baseline
2. Random Forest - good default choice
3. Gradient Boosting - optimize for best accuracy
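A minimal sketch of this progression, comparing the three starting points with cross-validation (the synthetic dataset is a placeholder for your own data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Evaluate each starting point with 5-fold cross-validation
for name, model in [
    ('Logistic Regression', LogisticRegression(max_iter=1000)),
    ('Random Forest', RandomForestClassifier(n_estimators=100)),
    ('Gradient Boosting', GradientBoostingClassifier()),
]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```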