# Pipelines and Composite Estimators Reference

## Overview

Pipelines chain multiple processing steps into a single estimator, preventing data leakage and simplifying code. They enable reproducible workflows and integrate seamlessly with cross-validation and hyperparameter tuning.

## Pipeline Basics

### Creating a Pipeline

**Pipeline (`sklearn.pipeline.Pipeline`)**
- Chains transformers with a final estimator
- All intermediate steps must be transformers, i.e. implement fit() and transform() (or fit_transform())
- Final step can be any estimator (transformer, classifier, regressor, clusterer)
- Example:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('classifier', LogisticRegression())
])

# Fit the entire pipeline
pipeline.fit(X_train, y_train)

# Predict using the pipeline
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)
```

### Using make_pipeline

**make_pipeline**
- Convenient constructor that auto-generates step names
- Example:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    SVC(kernel='rbf')
)

pipeline.fit(X_train, y_train)
```
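
The auto-generated step names are the lowercased class names; these are the names you reference in double-underscore parameter keys. A quick check, assuming the pipeline above:

```python
# Step names are the lowercased class names
print(pipeline.named_steps.keys())
# dict_keys(['standardscaler', 'pca', 'svc'])

# Use these names in parameter references, e.g. svc__C
pipeline.set_params(svc__C=10)
```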

## Accessing Pipeline Components

### Accessing Steps

```python
# By index
scaler = pipeline.steps[0][1]

# By name
scaler = pipeline.named_steps['scaler']
pca = pipeline.named_steps['pca']

# Using indexing syntax
scaler = pipeline['scaler']
pca = pipeline['pca']

# Get all step names
print(pipeline.named_steps.keys())
```

### Setting Parameters

```python
# Set parameters using double underscore notation
pipeline.set_params(
    pca__n_components=15,
    classifier__C=0.1
)

# Or during creation
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('classifier', LogisticRegression(C=1.0))
])
```
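
To discover which parameter names are valid for set_params or a grid search, list the pipeline's parameters:

```python
# All tunable parameter names, including nested step__param keys
for name in pipeline.get_params().keys():
    print(name)
```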

### Accessing Attributes

```python
# Access fitted attributes
pca_components = pipeline.named_steps['pca'].components_
explained_variance = pipeline.named_steps['pca'].explained_variance_ratio_

# Access intermediate transformations
X_scaled = pipeline.named_steps['scaler'].transform(X_test)
X_pca = pipeline.named_steps['pca'].transform(X_scaled)
```

## Hyperparameter Tuning with Pipelines

### Grid Search with Pipeline

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', SVC())
])

param_grid = {
    'classifier__C': [0.1, 1, 10, 100],
    'classifier__gamma': ['scale', 'auto', 0.001, 0.01],
    'classifier__kernel': ['rbf', 'linear']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
```

### Tuning Multiple Pipeline Steps

```python
# Assumes a pipeline with 'pca' and 'classifier' steps, like the
# scaler/pca/classifier pipeline from "Creating a Pipeline"
param_grid = {
    # PCA parameters
    'pca__n_components': [5, 10, 20, 50],

    # Classifier parameters
    'classifier__C': [0.1, 1, 10],
    'classifier__kernel': ['rbf', 'linear']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)
```
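
A pipeline step is itself a parameter, so a grid search can also swap whole estimators in and out. A minimal sketch, assuming the scaler/classifier pipeline from the grid-search example above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# The step name alone (no double underscore) refers to the estimator itself,
# so each candidate below replaces the 'classifier' step entirely
param_grid = [
    {'classifier': [SVC()], 'classifier__C': [0.1, 1, 10]},
    {'classifier': [LogisticRegression(max_iter=1000)], 'classifier__C': [0.1, 1, 10]},
]

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)
```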

## ColumnTransformer

### Basic Usage

**ColumnTransformer (`sklearn.compose.ColumnTransformer`)**
- Applies different preprocessing to different columns
- Keeps all per-column preprocessing inside one estimator, so it can run inside a pipeline and cross-validate without data leakage
- Example:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Define column groups
numeric_features = ['age', 'income', 'hours_per_week']
categorical_features = ['gender', 'occupation', 'native_country']

# Create preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='passthrough'  # Keep other columns unchanged
)

X_transformed = preprocessor.fit_transform(X)
```

### With Pipeline Steps

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Full pipeline with model
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

full_pipeline.fit(X_train, y_train)
```

### Using make_column_transformer

```python
from sklearn.compose import make_column_transformer

preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    (OneHotEncoder(), categorical_features),
    remainder='passthrough'
)
```

### Column Selection

```python
# By column names (if X is a DataFrame)
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income']),
    ('cat', OneHotEncoder(), ['gender', 'occupation'])
])

# By column indices
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), [0, 1, 2]),
    ('cat', OneHotEncoder(), [3, 4])
])

# By boolean mask
numeric_mask = [True, True, True, False, False]
categorical_mask = [False, False, False, True, True]

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_mask),
    ('cat', OneHotEncoder(), categorical_mask)
])

# By callable (receives X and returns a column selection)
def is_numeric(X):
    return X.select_dtypes(include=['number']).columns.tolist()

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), is_numeric)
])
```
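
For dtype-based selection, scikit-learn also ships a ready-made callable, `make_column_selector`, which avoids writing the selector function by hand:

```python
from sklearn.compose import make_column_selector

preprocessor = ColumnTransformer([
    # Select columns by dtype instead of listing names
    ('num', StandardScaler(), make_column_selector(dtype_include='number')),
    ('cat', OneHotEncoder(), make_column_selector(dtype_include=object))
])
```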

### Getting Feature Names

```python
# get_feature_names_out() requires a fitted transformer, so fit first
preprocessor.fit(X_train)
output_features = preprocessor.get_feature_names_out()

print(f"Input features: {X_train.columns.tolist()}")
print(f"Output features: {output_features}")
```

### Remainder Handling

```python
# Drop unspecified columns (default)
preprocessor = ColumnTransformer([...], remainder='drop')

# Pass through unchanged
preprocessor = ColumnTransformer([...], remainder='passthrough')

# Apply a transformer to remaining columns
preprocessor = ColumnTransformer([...], remainder=StandardScaler())
```

## FeatureUnion

### Basic Usage

**FeatureUnion (`sklearn.pipeline.FeatureUnion`)**
- Concatenates the results of multiple transformers
- Transformers are applied in parallel to the same input
- Example:

```python
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

# Combine PCA and feature selection
feature_union = FeatureUnion([
    ('pca', PCA(n_components=10)),
    ('select_best', SelectKBest(k=20))
])

X_combined = feature_union.fit_transform(X_train, y_train)
print(f"Combined features: {X_combined.shape[1]}")  # 10 + 20 = 30
```

### With Pipeline

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.linear_model import LogisticRegression

# Create feature union
feature_union = FeatureUnion([
    ('pca', PCA(n_components=10)),
    ('svd', TruncatedSVD(n_components=10))
])

# Full pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('features', feature_union),
    ('classifier', LogisticRegression())
])

pipeline.fit(X_train, y_train)
```

### Weighted Feature Union

```python
# Apply weights to transformer outputs
feature_union = FeatureUnion(
    transformer_list=[
        ('pca', PCA(n_components=10)),
        ('select_best', SelectKBest(k=20))
    ],
    transformer_weights={
        'pca': 2.0,  # Give PCA features double weight
        'select_best': 1.0
    }
)
```

## Advanced Pipeline Patterns

### Caching Pipeline Steps

```python
from sklearn.pipeline import Pipeline
from tempfile import mkdtemp
from shutil import rmtree

# Cache intermediate results
cachedir = mkdtemp()
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=50)),
    ('classifier', LogisticRegression())
], memory=cachedir)

pipeline.fit(X_train, y_train)

# Clean up cache
rmtree(cachedir)
```
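
The `memory` argument also accepts a `joblib.Memory` object, which gives more control over cache location and verbosity. A brief sketch, reusing `cachedir` from above:

```python
from joblib import Memory

# Wrap the cache directory in a Memory object for finer control
memory = Memory(location=cachedir, verbose=1)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=50)),
    ('classifier', LogisticRegression())
], memory=memory)
```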

### Nested Pipelines

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Inner pipeline for text processing
text_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer())
])

# Outer pipeline combining the text features with scaled features.
# Note: FeatureUnion feeds the same X to every branch, so this assumes
# each branch can consume that input; for distinct text and numeric
# columns, use ColumnTransformer instead.
full_pipeline = Pipeline([
    ('features', FeatureUnion([
        ('text', text_pipeline),
        ('numeric', StandardScaler())
    ])),
    ('classifier', LogisticRegression())
])
```

### Custom Transformers in Pipelines

```python
from sklearn.base import BaseEstimator, TransformerMixin

class TextLengthExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn
        return self

    def transform(self, X):
        # One feature per sample: the text length
        return [[len(text)] for text in X]

pipeline = Pipeline([
    ('length', TextLengthExtractor()),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])
```
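
For simple stateless transforms like this one, `FunctionTransformer` avoids the boilerplate of a custom class; a sketch of the equivalent step:

```python
from sklearn.preprocessing import FunctionTransformer

# Equivalent stateless transformer built from a plain function
length_extractor = FunctionTransformer(
    lambda X: [[len(text)] for text in X]
)

pipeline = Pipeline([
    ('length', length_extractor),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])
```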

### Slicing Pipelines

```python
# Get a sub-pipeline
sub_pipeline = pipeline[:2]  # First two steps

# Get a specific range
middle_steps = pipeline[1:3]
```
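
Slicing is a convenient way to inspect the data the final estimator actually sees: drop the last step and transform. A sketch, assuming a fitted pipeline whose final step is the model:

```python
# Apply every step except the final estimator
X_intermediate = pipeline[:-1].transform(X_test)
print(X_intermediate.shape)
```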

## TransformedTargetRegressor

### Basic Usage

**TransformedTargetRegressor (`sklearn.compose.TransformedTargetRegressor`)**
- Transforms the target variable before fitting
- Automatically inverse-transforms predictions
- Example:

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import QuantileTransformer
from sklearn.linear_model import LinearRegression

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=QuantileTransformer(output_distribution='normal')
)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)  # Automatically inverse-transformed
```

### With Functions

```python
import numpy as np

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1
)

model.fit(X_train, y_train)
```

## Complete Example: End-to-End Pipeline

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Define feature types
numeric_features = ['age', 'income', 'hours_per_week']
categorical_features = ['gender', 'occupation', 'education']

# Numeric preprocessing pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical preprocessing pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Combine preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Full pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('pca', PCA(n_components=0.95)),  # Keep 95% of variance
    ('classifier', RandomForestClassifier(random_state=42))
])

# Hyperparameter tuning
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'pca__n_components': [0.90, 0.95, 0.99],
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [10, 20, None]
}

grid_search = GridSearchCV(
    pipeline, param_grid,
    cv=5, scoring='accuracy',
    n_jobs=-1, verbose=1
)

grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")
print(f"Test score: {grid_search.score(X_test, y_test):.3f}")

# Make predictions
best_pipeline = grid_search.best_estimator_
y_pred = best_pipeline.predict(X_test)
y_proba = best_pipeline.predict_proba(X_test)
```

## Visualization

### Displaying Pipelines

```python
# In Jupyter notebooks, pipelines display as interactive diagrams
from sklearn import set_config
set_config(display='diagram')

pipeline  # Displays a visual diagram
```
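
Outside a notebook, the same diagram can be exported with `estimator_html_repr`:

```python
from sklearn.utils import estimator_html_repr

# Write the diagram to a standalone HTML file
with open('pipeline.html', 'w', encoding='utf-8') as f:
    f.write(estimator_html_repr(pipeline))
```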

### Text Representation

```python
# Print pipeline structure
print(pipeline)

# Get detailed parameters
print(pipeline.get_params())
```

## Best Practices

### Always Use Pipelines

- Prevents data leakage
- Ensures consistency between training and prediction
- Makes code more maintainable
- Enables easy hyperparameter tuning and leakage-safe cross-validation (see the sketch below)
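
A minimal sketch of why this matters for cross-validation: with the scaler inside the pipeline, each split fits the scaler on its training fold only, so no test-fold statistics leak into preprocessing.

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# The scaler is refit on the training portion of every fold
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```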

### Proper Pipeline Construction

```python
# Good: Preprocessing inside the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
pipeline.fit(X_train, y_train)

# Bad: Preprocessing outside the pipeline (can cause leakage)
X_train_scaled = StandardScaler().fit_transform(X_train)
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
```

### Use ColumnTransformer for Mixed Data

Always use ColumnTransformer when you have both numerical and categorical features:

```python
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(), categorical_features)
])
```

### Name Your Steps Meaningfully

```python
# Good
pipeline = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('rf_classifier', RandomForestClassifier())
])

# Bad
pipeline = Pipeline([
    ('step1', SimpleImputer()),
    ('step2', StandardScaler()),
    ('step3', PCA(n_components=10)),
    ('step4', RandomForestClassifier())
])
```

### Cache Expensive Transformations

For repeated fitting (e.g., during grid search), cache expensive steps:

```python
from tempfile import mkdtemp

cachedir = mkdtemp()
pipeline = Pipeline([
    # ExpensiveTransformer stands in for any costly transformer
    ('expensive_preprocessing', ExpensiveTransformer()),
    ('classifier', LogisticRegression())
], memory=cachedir)
```

### Test Pipeline Compatibility

Ensure all steps are compatible:

- All intermediate steps must have fit() and transform()
- The final step needs fit() and predict() (or transform())
- Use set_output(transform='pandas') for DataFrame output

```python
pipeline.set_output(transform='pandas')
X_transformed = pipeline.transform(X)  # Returns a DataFrame
```