# Supervised Learning Reference

## Overview

Supervised learning algorithms learn from labeled training data to make predictions on new data. Scikit-learn provides comprehensive implementations for both classification and regression tasks.

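The examples in this reference use `X_train`, `X_test`, `y_train`, and `y_test` without defining them. The following is a minimal sketch of one way to create them, assuming a synthetic dataset from `make_classification`; the sample count, split ratio, and `random_state` are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real data: 500 samples, 10 features, 2 classes
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out 25% of the rows for evaluation; stratify keeps the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(X_train.shape, X_test.shape)
```

For regression examples, `make_regression` can be swapped in the same way (without `stratify`).
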
## Linear Models

### Regression

**Linear Regression (`sklearn.linear_model.LinearRegression`)**
- Ordinary least squares regression
- Fast, interpretable, no regularization hyperparameters to tune
- Use when: Linear relationships, interpretability matters
- Example:
```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

**Ridge Regression (`sklearn.linear_model.Ridge`)**
- L2 regularization to prevent overfitting
- Key parameter: `alpha` (regularization strength, default=1.0)
- Use when: Multicollinearity present, need regularization
- Example:
```python
from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
```

**Lasso (`sklearn.linear_model.Lasso`)**
- L1 regularization with feature selection
- Key parameter: `alpha` (regularization strength; larger values zero out more coefficients)
- Use when: Want sparse models, feature selection
- Can reduce some coefficients to exactly zero
- Example:
```python
from sklearn.linear_model import Lasso

model = Lasso(alpha=0.1)
model.fit(X_train, y_train)
# Count how many features were kept (non-zero coefficients)
print(f"Non-zero coefficients: {sum(model.coef_ != 0)}")
```

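To illustrate how `alpha` controls sparsity, the sketch below fits Lasso at three strengths on synthetic data and counts the surviving coefficients. The dataset (`make_regression` with 5 informative features out of 20) and the `alpha` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: only 5 of the 20 features carry signal
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0
)

nonzero_counts = {}
for alpha in [0.01, 1.0, 100.0]:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    # Coefficients driven exactly to zero correspond to dropped features
    nonzero_counts[alpha] = int(np.count_nonzero(model.coef_))
    print(f"alpha={alpha}: {nonzero_counts[alpha]} non-zero coefficients")
```

In practice `alpha` is usually chosen by cross-validation, for example with `LassoCV`, rather than by hand.
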
**ElasticNet (`sklearn.linear_model.ElasticNet`)**
- Combines L1 and L2 regularization
- Key parameters: `alpha` (overall regularization strength), `l1_ratio` (mix between penalties: 0 = pure L2 as in Ridge, 1 = pure L1 as in Lasso)
- Use when: Need both feature selection and regularization
- Example:
```python
from sklearn.linear_model import ElasticNet

model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X_train, y_train)
```

### Classification

**Logistic Regression (`sklearn.linear_model.LogisticRegression`)**
- Binary and multiclass classification
- Key parameters: `C` (inverse regularization strength; smaller values regularize more), `penalty` ('l1', 'l2', 'elasticnet'; 'l1' and 'elasticnet' require a compatible `solver` such as 'saga')
- Returns probability estimates
- Use when: Need probabilistic predictions, interpretability
- Example:
```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X_train, y_train)
probas = model.predict_proba(X_test)
```

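As a small sketch of what the probability estimates look like, the example below inspects the output of `predict_proba` on synthetic data (the dataset is an assumption for illustration). Each row holds one probability per class, in the order given by `model.classes_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
model = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)

probas = model.predict_proba(X[:5])  # shape (n_samples, n_classes)
print(model.classes_)                # column order of probas
print(probas.sum(axis=1))            # each row sums to 1

# Taking the most probable class reproduces predict()
labels_from_probas = model.classes_[np.argmax(probas, axis=1)]
print(labels_from_probas, model.predict(X[:5]))
```

Working with the probabilities directly makes it possible to apply a decision threshold other than the default when the costs of the two error types differ.
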
**Stochastic Gradient Descent (SGD)**
- `SGDClassifier`, `SGDRegressor`
- Efficient for large-scale learning
- Key parameters: `loss`, `penalty`, `alpha`, `learning_rate`
- Use when: Very large datasets (roughly >10^5 samples)
- Sensitive to feature scaling; standardize inputs first
- Example:
```python
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3)
model.fit(X_train, y_train)
```

## Support Vector Machines

**SVC (`sklearn.svm.SVC`)**
- Classification with kernel methods
- Key parameters: `C`, `kernel` ('linear', 'rbf', 'poly'), `gamma`
- Use when: Small to medium datasets, complex decision boundaries
- Note: Does not scale well to large datasets (fit time grows at least quadratically with the number of samples)
- Example:
```python
from sklearn.svm import SVC

# Linear kernel for linearly separable data
model_linear = SVC(kernel='linear', C=1.0)

# RBF kernel for non-linear data
model_rbf = SVC(kernel='rbf', C=1.0, gamma='scale')
model_rbf.fit(X_train, y_train)
```

**SVR (`sklearn.svm.SVR`)**
- Regression with kernel methods
- Similar parameters to SVC
- Additional parameter: `epsilon` (width of the epsilon-insensitive tube; errors smaller than `epsilon` are not penalized)
- Example:
```python
from sklearn.svm import SVR

model = SVR(kernel='rbf', C=1.0, epsilon=0.1)
model.fit(X_train, y_train)
```

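One way to see the effect of `epsilon` is to count support vectors: points that fall inside the tube do not become support vectors, so a wider tube tends to give a sparser model. The sketch below assumes a noisy sine curve as example data; the `epsilon` values are illustrative.

```python
import numpy as np
from sklearn.svm import SVR

# Noisy sine curve as a hypothetical 1-D regression problem
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

support_counts = {}
for epsilon in [0.01, 0.5]:
    model = SVR(kernel="rbf", C=1.0, epsilon=epsilon).fit(X, y)
    # support_ holds the indices of the training points used as support vectors
    support_counts[epsilon] = len(model.support_)
    print(f"epsilon={epsilon}: {support_counts[epsilon]} support vectors")
```

A larger `epsilon` trades some fit precision for a smaller, faster model.
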
## Decision Trees

**DecisionTreeClassifier / DecisionTreeRegressor**
- Non-parametric model learning decision rules
- Key parameters:
  - `max_depth`: Maximum tree depth (prevents overfitting)
  - `min_samples_split`: Minimum samples to split a node
  - `min_samples_leaf`: Minimum samples in leaf
  - `criterion`: 'gini', 'entropy' for classification; 'squared_error', 'absolute_error' for regression
- Use when: Need interpretable model, non-linear relationships, mixed feature types (categorical features must be encoded as numbers first)
- Prone to overfitting - use ensembles or pruning
- Example:
```python
from sklearn.tree import DecisionTreeClassifier, plot_tree

model = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=20,
    min_samples_leaf=10,
    criterion='gini'
)
model.fit(X_train, y_train)

# Visualize the tree (requires matplotlib; feature_names and
# class_names are lists of strings supplied by the caller)
plot_tree(model, feature_names=feature_names, class_names=class_names)
```

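The pruning mentioned above can be sketched with cost-complexity pruning, which scikit-learn exposes through the `ccp_alpha` parameter. The example assumes the bundled breast cancer dataset, and `ccp_alpha=0.01` is an illustrative value rather than a tuned one.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fully grown tree versus a tree pruned back after growing
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(
    X_train, y_train
)

for name, tree in [("full", full_tree), ("pruned", pruned_tree)]:
    print(f"{name}: {tree.get_n_leaves()} leaves, "
          f"test accuracy {tree.score(X_test, y_test):.3f}")
```

Candidate values for `ccp_alpha` can be listed with `cost_complexity_pruning_path` and compared by cross-validation.
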
## Ensemble Methods

### Random Forests

**RandomForestClassifier / RandomForestRegressor**
- Ensemble of decision trees with bagging
- Key parameters:
  - `n_estimators`: Number of trees (default=100)
  - `max_depth`: Maximum tree depth
  - `max_features`: Features to consider for splits ('sqrt', 'log2', int, or float)
  - `min_samples_split`, `min_samples_leaf`: Control tree growth
- Use when: High accuracy needed, can afford computation
- Provides feature importance
- Example:
```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    max_features='sqrt',
    n_jobs=-1  # Use all CPU cores
)
model.fit(X_train, y_train)

# Feature importance
importances = model.feature_importances_
```

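The raw `feature_importances_` array is easier to read once it is ranked and paired with feature names. The following sketch assumes the bundled breast cancer dataset, whose `feature_names` attribute supplies the labels.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
model.fit(data.data, data.target)

# Impurity-based importances are non-negative and sum to 1
importances = model.feature_importances_

# Indices of the five most important features, highest first
top5 = np.argsort(importances)[::-1][:5]
for idx in top5:
    print(f"{data.feature_names[idx]}: {importances[idx]:.3f}")
```

Impurity-based importances can favor high-cardinality features; `sklearn.inspection.permutation_importance` on held-out data is a common cross-check.
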
### Gradient Boosting

**GradientBoostingClassifier / GradientBoostingRegressor**
- Sequential ensemble building trees on residuals
- Key parameters:
  - `n_estimators`: Number of boosting stages
  - `learning_rate`: Shrinks contribution of each tree
  - `max_depth`: Depth of individual trees (typically 3-5)
  - `subsample`: Fraction of samples for training each tree
- Use when: Need high accuracy, can afford training time
- Often among the most accurate models on tabular data
- Example:
```python
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8
)
model.fit(X_train, y_train)
```

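Since `n_estimators` and `learning_rate` interact, one option is to set a generous stage budget and let early stopping decide how many stages to keep. This is a minimal sketch assuming the bundled breast cancer dataset; the budget of 500 stages and the patience of 5 are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)

# Allow up to 500 stages, but stop once the score on an internal
# 10% validation split has not improved for 5 consecutive stages
model = GradientBoostingClassifier(
    n_estimators=500,
    learning_rate=0.1,
    max_depth=3,
    validation_fraction=0.1,
    n_iter_no_change=5,
    random_state=0,
)
model.fit(X, y)
print(f"Stages actually fitted: {model.n_estimators_}")
```

A smaller `learning_rate` generally needs more stages to reach the same fit, so the two are best tuned together.
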
**HistGradientBoostingClassifier / HistGradientBoostingRegressor**
- Faster gradient boosting with histogram-based algorithm
- Native support for missing values and categorical features
- Key parameters: Similar to GradientBoosting (`max_iter` takes the place of `n_estimators`)
- Use when: Large datasets, need faster training
- Example:
```python
from sklearn.ensemble import HistGradientBoostingClassifier

model = HistGradientBoostingClassifier(
    max_iter=100,
    learning_rate=0.1,
    max_depth=None,  # No limit by default
    # Treat pandas categorical-dtype columns as categorical
    # (option available in scikit-learn >= 1.4)
    categorical_features='from_dtype'
)
model.fit(X_train, y_train)
```

### Other Ensemble Methods

**AdaBoost**
- Adaptive boosting focusing on misclassified samples
- Key parameters: `n_estimators`, `learning_rate`, `estimator` (base estimator)
- Use when: Simple boosting approach needed
- Example:
```python
from sklearn.ensemble import AdaBoostClassifier

model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0)
model.fit(X_train, y_train)
```

**Voting Classifier / Regressor**
- Combines predictions from multiple models
- Types (classifier): 'hard' (majority vote) or 'soft' (average predicted probabilities); `VotingRegressor` averages the predicted values
- Use when: Want to ensemble different model types
- Example:
```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

model = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression()),
        ('dt', DecisionTreeClassifier()),
        ('svc', SVC(probability=True))  # soft voting needs predict_proba
    ],
    voting='soft'
)
model.fit(X_train, y_train)
```

**Stacking Classifier / Regressor**
- Trains a meta-model on predictions from base models
- More flexible than voting: the meta-model learns how to weight the base models
- Key parameter: `final_estimator` (meta-learner)
- Example:
```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

model = StackingClassifier(
    estimators=[
        ('dt', DecisionTreeClassifier()),
        ('svc', SVC())
    ],
    final_estimator=LogisticRegression()
)
model.fit(X_train, y_train)
```

## K-Nearest Neighbors

**KNeighborsClassifier / KNeighborsRegressor**
- Non-parametric method based on distance
- Key parameters:
  - `n_neighbors`: Number of neighbors (default=5)
  - `weights`: 'uniform' or 'distance'
  - `metric`: Distance metric ('euclidean', 'manhattan', etc.)
- Use when: Small dataset, simple baseline needed
- Slow prediction on large datasets, because each query is compared against the stored training samples
- Example:
```python
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5, weights='distance')
model.fit(X_train, y_train)
```

## Naive Bayes

**GaussianNB, MultinomialNB, BernoulliNB**
- Probabilistic classifiers based on Bayes' theorem
- Fast training and prediction
- GaussianNB: Continuous features (assumes Gaussian distribution)
- MultinomialNB: Count features (text classification)
- BernoulliNB: Binary features
- Use when: Text classification, fast baseline, probabilistic predictions
- Example:
```python
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# For continuous features
model_gaussian = GaussianNB()

# For text/count data
model_multinomial = MultinomialNB(alpha=1.0)  # alpha is smoothing parameter
model_multinomial.fit(X_train, y_train)
```

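For text classification, `MultinomialNB` is typically fed word counts. The following is a minimal sketch using `CountVectorizer` in a pipeline; the six-sentence corpus and its `spam`/`ham` labels are hypothetical and far too small for real use.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus, for illustration only
texts = [
    "free prize click now",
    "win money free offer",
    "claim your free prize",
    "meeting at noon tomorrow",
    "project report attached",
    "lunch meeting on friday",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# CountVectorizer turns each text into word counts for MultinomialNB
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(texts, labels)

prediction = model.predict(["free money prize"])[0]
print(prediction)
```

The smoothing parameter `alpha` keeps words unseen for a class from forcing that class's probability to zero.
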
## Neural Networks

**MLPClassifier / MLPRegressor**
- Multi-layer perceptron (feedforward neural network)
- Key parameters:
  - `hidden_layer_sizes`: Tuple of hidden layer sizes, e.g., (100, 50)
  - `activation`: 'relu', 'tanh', 'logistic'
  - `solver`: 'adam', 'sgd', 'lbfgs'
  - `alpha`: L2 regularization parameter
  - `learning_rate`: 'constant', 'invscaling', 'adaptive' (only used with `solver='sgd'`)
- Use when: Complex non-linear patterns, large datasets
- Requires feature scaling
- Example:
```python
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Scale features first; fit the scaler on the training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the same scaling at test time

model = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    activation='relu',
    solver='adam',
    alpha=0.0001,
    max_iter=1000
)
model.fit(X_train_scaled, y_train)
```

## Algorithm Selection Guide

### Choose based on:

**Dataset size:**
- Small (<1k samples): KNN, SVM, Decision Trees
- Medium (1k-100k): Random Forest, Gradient Boosting, Linear Models
- Large (>100k): SGD, Linear Models, HistGradientBoosting

**Interpretability:**
- High: Linear Models, Decision Trees
- Medium: Random Forest (feature importance)
- Low: SVM with RBF kernel, Neural Networks

**Accuracy vs Speed:**
- Fast training: Naive Bayes, Linear Models, KNN
- High accuracy: Gradient Boosting, Random Forest, Stacking
- Fast prediction: Linear Models, Naive Bayes
- Slow prediction: KNN (on large datasets), SVM

**Feature types:**
- Continuous: Most algorithms work well
- Categorical: HistGradientBoosting (native support); other tree models after encoding
- Mixed: Trees, Gradient Boosting
- Text: Naive Bayes, Linear Models with TF-IDF

**Common starting points:**
1. Logistic Regression (classification) / Linear Regression (regression) - fast baseline
2. Random Forest - good default choice
3. Gradient Boosting - optimize for best accuracy

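The three starting points above can be compared side by side with cross-validation before investing in tuning. This is a sketch assuming the bundled breast cancer dataset and 5-fold cross-validation; the dictionary keys are just labels chosen for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    # Scaling lives inside the pipeline so each fold is scaled on its own
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, estimator in candidates.items():
    # Mean accuracy over 5 folds
    scores[name] = cross_val_score(estimator, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

If the simple baseline is already close to the ensembles, its speed and interpretability may make it the better choice.
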