# Pipelines and Composite Estimators Reference
## Overview

Pipelines chain multiple processing steps into a single estimator, preventing data leakage and simplifying code. They enable reproducible workflows and seamless integration with cross-validation and hyperparameter tuning.
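Because a pipeline behaves like a single estimator, it can be passed straight to utilities such as `cross_val_score`, and each fold refits the preprocessing on its own training split. A minimal sketch of this, using the built-in breast cancer dataset (the dataset choice is just for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# The scaler is refit on each training fold, so no test-fold
# statistics leak into the preprocessing
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")
```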
## Pipeline Basics

### Creating a Pipeline

**Pipeline (`sklearn.pipeline.Pipeline`)**
- Chains transformers with a final estimator
- All intermediate steps must implement fit() and transform() (i.e., be transformers)
- Final step can be any estimator (transformer, classifier, regressor, clusterer)
- Example:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('classifier', LogisticRegression())
])

# Fit the entire pipeline
pipeline.fit(X_train, y_train)

# Predict using the pipeline
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)
```
### Using make_pipeline

**make_pipeline**
- Convenient constructor that auto-generates step names
- Example:
```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    SVC(kernel='rbf')
)

pipeline.fit(X_train, y_train)
```
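The auto-generated step names are the lowercased class names, which is what you reference in `set_params` or grid-search keys. A quick check, continuing the pipeline above:

```python
# Step names are the lowercased class names
print(list(pipeline.named_steps))  # ['standardscaler', 'pca', 'svc']

# Use them in double-underscore parameter keys
pipeline.set_params(svc__C=10)
```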
## Accessing Pipeline Components

### Accessing Steps

```python
# By index
scaler = pipeline.steps[0][1]

# By name
scaler = pipeline.named_steps['scaler']
pca = pipeline.named_steps['pca']

# Using indexing syntax
scaler = pipeline['scaler']
pca = pipeline['pca']

# Get all step names
print(pipeline.named_steps.keys())
```
### Setting Parameters

```python
# Set parameters using double underscore notation
pipeline.set_params(
    pca__n_components=15,
    classifier__C=0.1
)

# Or during creation
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('classifier', LogisticRegression(C=1.0))
])
```
### Accessing Attributes

```python
# Access fitted attributes
pca_components = pipeline.named_steps['pca'].components_
explained_variance = pipeline.named_steps['pca'].explained_variance_ratio_

# Access intermediate transformations
X_scaled = pipeline.named_steps['scaler'].transform(X_test)
X_pca = pipeline.named_steps['pca'].transform(X_scaled)
```
## Hyperparameter Tuning with Pipelines

### Grid Search with Pipeline

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', SVC())
])

param_grid = {
    'classifier__C': [0.1, 1, 10, 100],
    'classifier__gamma': ['scale', 'auto', 0.001, 0.01],
    'classifier__kernel': ['rbf', 'linear']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
```
### Tuning Multiple Pipeline Steps

```python
# Assumes a pipeline with steps named 'pca' and 'classifier'
param_grid = {
    # PCA parameters
    'pca__n_components': [5, 10, 20, 50],

    # Classifier parameters
    'classifier__C': [0.1, 1, 10],
    'classifier__kernel': ['rbf', 'linear']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)
```
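Grid search can also swap out or disable whole steps: a step name used directly as a parameter key accepts alternative estimators, and the string `'passthrough'` skips the step entirely. A sketch, assuming the scaler/pca/classifier pipeline defined earlier:

```python
from sklearn.decomposition import PCA

param_grid = {
    # Try the pipeline with PCA disabled or at two sizes
    'pca': ['passthrough', PCA(n_components=10), PCA(n_components=20)],
    'classifier__C': [0.1, 1, 10]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)
```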
## ColumnTransformer

### Basic Usage

**ColumnTransformer (`sklearn.compose.ColumnTransformer`)**
- Apply different preprocessing to different columns
- Fits like any estimator, so used inside a pipeline it prevents data leakage in cross-validation
- Example:
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Define column groups
numeric_features = ['age', 'income', 'hours_per_week']
categorical_features = ['gender', 'occupation', 'native_country']

# Create preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='passthrough'  # Keep other columns unchanged
)

X_transformed = preprocessor.fit_transform(X)
```
### With Pipeline Steps

```python
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Full pipeline with model
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

full_pipeline.fit(X_train, y_train)
```
### Using make_column_transformer

```python
from sklearn.compose import make_column_transformer

preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    (OneHotEncoder(), categorical_features),
    remainder='passthrough'
)
```
### Column Selection

```python
# By column names (if X is a DataFrame)
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income']),
    ('cat', OneHotEncoder(), ['gender', 'occupation'])
])

# By column indices
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), [0, 1, 2]),
    ('cat', OneHotEncoder(), [3, 4])
])

# By boolean mask
numeric_mask = [True, True, True, False, False]
categorical_mask = [False, False, False, True, True]

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_mask),
    ('cat', OneHotEncoder(), categorical_mask)
])

# By callable (called with X, returns the selected columns)
def is_numeric(X):
    return X.select_dtypes(include=['number']).columns.tolist()

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), is_numeric)
])
```
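For dtype-based selection, scikit-learn also provides `sklearn.compose.make_column_selector`, which covers the common cases of the callable pattern above:

```python
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Select columns by dtype instead of listing them by name
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), make_column_selector(dtype_include='number')),
    ('cat', OneHotEncoder(), make_column_selector(dtype_include=object))
])
```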
### Getting Feature Names

```python
# Feature names are available once the transformer is fitted
preprocessor.fit(X_train)
output_features = preprocessor.get_feature_names_out()

print(f"Input features: {X_train.columns.tolist()}")
print(f"Output features: {output_features}")
```
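The names line up with the columns of the transformed array, so a labeled DataFrame is one line away. A sketch, assuming the fitted preprocessor above and dense output (e.g. `OneHotEncoder(sparse_output=False)`):

```python
import pandas as pd

# Pair each output column with its generated name
X_out = pd.DataFrame(
    preprocessor.transform(X_test),
    columns=preprocessor.get_feature_names_out()
)
```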
### Remainder Handling

```python
# Drop unspecified columns (the default)
preprocessor = ColumnTransformer([...], remainder='drop')

# Pass through unchanged
preprocessor = ColumnTransformer([...], remainder='passthrough')

# Apply a transformer to the remaining columns
preprocessor = ColumnTransformer([...], remainder=StandardScaler())
```
## FeatureUnion

### Basic Usage

**FeatureUnion (`sklearn.pipeline.FeatureUnion`)**
- Concatenates results of multiple transformers
- Transformers are applied in parallel on the same input
- Example:
```python
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

# Combine PCA and feature selection
feature_union = FeatureUnion([
    ('pca', PCA(n_components=10)),
    ('select_best', SelectKBest(k=20))
])

X_combined = feature_union.fit_transform(X_train, y_train)
print(f"Combined features: {X_combined.shape[1]}")  # 10 + 20 = 30
```
### With Pipeline

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.linear_model import LogisticRegression

# Create feature union
feature_union = FeatureUnion([
    ('pca', PCA(n_components=10)),
    ('svd', TruncatedSVD(n_components=10))
])

# Full pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('features', feature_union),
    ('classifier', LogisticRegression())
])

pipeline.fit(X_train, y_train)
```
### Weighted Feature Union

```python
# Apply weights to transformer outputs
feature_union = FeatureUnion(
    transformer_list=[
        ('pca', PCA(n_components=10)),
        ('select_best', SelectKBest(k=20))
    ],
    transformer_weights={
        'pca': 2.0,  # Give PCA features double weight
        'select_best': 1.0
    }
)
```
## Advanced Pipeline Patterns

### Caching Pipeline Steps

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from tempfile import mkdtemp
from shutil import rmtree

# Cache fitted transformers so repeated fits can reuse them
cachedir = mkdtemp()
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=50)),
    ('classifier', LogisticRegression())
], memory=cachedir)

pipeline.fit(X_train, y_train)

# Clean up cache
rmtree(cachedir)
```
### Nested Pipelines

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Inner pipeline for text processing
text_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer())
])

# Outer pipeline combining both feature blocks. Note that every
# FeatureUnion branch receives the same input, so X must be suitable
# for both branches; for text and numeric data in separate columns,
# prefer ColumnTransformer (sketched below)
full_pipeline = Pipeline([
    ('features', FeatureUnion([
        ('text', text_pipeline),
        ('numeric', StandardScaler())
    ])),
    ('classifier', LogisticRegression())
])
```
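When the text and numeric values live in different DataFrame columns, ColumnTransformer is usually the better fit, since each branch then sees only its own columns. A sketch with hypothetical columns `'description'` (text) and `'price'` (numeric); the scalar string `'description'` passes a 1-D array to CountVectorizer, as it expects:

```python
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    # A scalar column name yields 1-D input for the text pipeline
    ('text', text_pipeline, 'description'),
    ('numeric', StandardScaler(), ['price'])
])

full_pipeline = Pipeline([
    ('features', preprocessor),
    ('classifier', LogisticRegression())
])
```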
### Custom Transformers in Pipelines

```python
from sklearn.base import BaseEstimator, TransformerMixin

class TextLengthExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn
        return self

    def transform(self, X):
        # One feature per document: its character length
        return [[len(text)] for text in X]

pipeline = Pipeline([
    ('length', TextLengthExtractor()),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])
```
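For stateless transformations like this one, `sklearn.preprocessing.FunctionTransformer` avoids the boilerplate of a custom class:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Equivalent stateless transformer built from a plain function
length_extractor = FunctionTransformer(
    lambda X: np.array([[len(text)] for text in X])
)

pipeline = Pipeline([
    ('length', length_extractor),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])
```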
### Slicing Pipelines

```python
# Get a sub-pipeline
sub_pipeline = pipeline[:2]  # First two steps

# Get a specific range
middle_steps = pipeline[1:3]
```
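Slicing is handy for inspecting what the final estimator actually receives: everything up to the last step is itself a pipeline of transformers. A one-liner, assuming the pipeline has already been fitted:

```python
# All steps except the final estimator, applied to new data
X_preprocessed = pipeline[:-1].transform(X_test)
```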
## TransformedTargetRegressor

### Basic Usage

**TransformedTargetRegressor**
- Transforms the target variable before fitting
- Automatically inverse-transforms predictions
- Example:
```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import QuantileTransformer
from sklearn.linear_model import LinearRegression

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=QuantileTransformer(output_distribution='normal')
)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)  # Automatically inverse-transformed
```
### With Functions

```python
import numpy as np

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1
)

model.fit(X_train, y_train)
```
## Complete Example: End-to-End Pipeline

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Define feature types
numeric_features = ['age', 'income', 'hours_per_week']
categorical_features = ['gender', 'occupation', 'education']

# Numeric preprocessing pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical preprocessing pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Combine preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Full pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('pca', PCA(n_components=0.95)),  # Keep 95% of the variance
    ('classifier', RandomForestClassifier(random_state=42))
])

# Hyperparameter tuning
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'pca__n_components': [0.90, 0.95, 0.99],
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [10, 20, None]
}

grid_search = GridSearchCV(
    pipeline, param_grid,
    cv=5, scoring='accuracy',
    n_jobs=-1, verbose=1
)

grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")
print(f"Test score: {grid_search.score(X_test, y_test):.3f}")

# Make predictions
best_pipeline = grid_search.best_estimator_
y_pred = best_pipeline.predict(X_test)
y_proba = best_pipeline.predict_proba(X_test)
```
## Visualization

### Displaying Pipelines

```python
# In Jupyter notebooks, pipelines display as diagrams
from sklearn import set_config
set_config(display='diagram')

pipeline  # Displays visual diagram
```
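Outside a notebook, the same diagram can be written to a standalone HTML file with `sklearn.utils.estimator_html_repr` (the output filename is arbitrary):

```python
from sklearn.utils import estimator_html_repr

# Save the interactive diagram as a self-contained HTML page
with open('pipeline.html', 'w', encoding='utf-8') as f:
    f.write(estimator_html_repr(pipeline))
```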
### Text Representation

```python
# Print pipeline structure
print(pipeline)

# Get detailed parameters
print(pipeline.get_params())
```
## Best Practices

### Always Use Pipelines
- Prevents data leakage
- Ensures consistency between training and prediction
- Makes code more maintainable
- Enables easy hyperparameter tuning
### Proper Pipeline Construction
```python
# Good: preprocessing inside the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
pipeline.fit(X_train, y_train)

# Bad: preprocessing outside the pipeline (can cause leakage,
# e.g. when the scaler is fit before cross-validation splits)
X_train_scaled = StandardScaler().fit_transform(X_train)
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
```
### Use ColumnTransformer for Mixed Data
Always use ColumnTransformer when you have both numerical and categorical features:
```python
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])
```
### Name Your Steps Meaningfully
```python
# Good
pipeline = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('rf_classifier', RandomForestClassifier())
])

# Bad
pipeline = Pipeline([
    ('step1', SimpleImputer()),
    ('step2', StandardScaler()),
    ('step3', PCA(n_components=10)),
    ('step4', RandomForestClassifier())
])
```
### Cache Expensive Transformations
For repeated fitting (e.g., during grid search), cache expensive steps:
```python
from tempfile import mkdtemp

cachedir = mkdtemp()
pipeline = Pipeline([
    # ExpensiveTransformer is a placeholder for any costly transformer
    ('expensive_preprocessing', ExpensiveTransformer()),
    ('classifier', LogisticRegression())
], memory=cachedir)
```
### Test Pipeline Compatibility
Ensure all steps are compatible:
- All intermediate steps must have fit() and transform()
- Final step needs fit() and predict() (or transform())
- Use set_output(transform='pandas') for DataFrame output
```python
# set_output works on any transformer, including a preprocessing-only
# pipeline or ColumnTransformer (a pipeline ending in a classifier has
# no transform method)
preprocessor.set_output(transform='pandas')
X_transformed = preprocessor.fit_transform(X_train)  # Returns DataFrame
```