Pipelines and Composite Estimators Reference
Overview
Pipelines chain multiple processing steps into a single estimator, preventing data leakage and simplifying code. They enable reproducible workflows and seamless integration with cross-validation and hyperparameter tuning.
Pipeline Basics
Creating a Pipeline
Pipeline (sklearn.pipeline.Pipeline)
- Chains transformers with a final estimator
- All intermediate steps must be transformers, i.e. implement fit() and transform()
- Final step can be any estimator (transformer, classifier, regressor, clusterer)
- Example:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
pipeline = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=10)),
('classifier', LogisticRegression())
])
# Fit the entire pipeline
pipeline.fit(X_train, y_train)
# Predict using the pipeline
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)
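Because every step is refit inside each fold, the pipeline can be passed straight to cross-validation utilities without leaking test-fold statistics; a minimal sketch:
from sklearn.model_selection import cross_val_score
# Each fold fits the scaler and PCA on that fold's training split only
scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")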
Using make_pipeline
make_pipeline
- Convenience constructor that auto-generates step names from lowercased class names
- Example:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
pipeline = make_pipeline(
StandardScaler(),
PCA(n_components=10),
SVC(kernel='rbf')
)
pipeline.fit(X_train, y_train)
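The generated names are what set_params and parameter grids must reference; for the pipeline above:
print(pipeline.named_steps.keys())
# dict_keys(['standardscaler', 'pca', 'svc'])
# e.g. a grid-search key for the SVC step would be 'svc__C'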
Accessing Pipeline Components
Accessing Steps
# By index
scaler = pipeline.steps[0][1]
# By name
scaler = pipeline.named_steps['scaler']
pca = pipeline.named_steps['pca']
# Using indexing syntax
scaler = pipeline['scaler']
pca = pipeline['pca']
# Get all step names
print(pipeline.named_steps.keys())
Setting Parameters
# Set parameters using double underscore notation
pipeline.set_params(
pca__n_components=15,
classifier__C=0.1
)
# Or during creation
pipeline = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=10)),
('classifier', LogisticRegression(C=1.0))
])
Accessing Attributes
# Access fitted attributes
pca_components = pipeline.named_steps['pca'].components_
explained_variance = pipeline.named_steps['pca'].explained_variance_ratio_
# Access intermediate transformations
X_scaled = pipeline.named_steps['scaler'].transform(X_test)
X_pca = pipeline.named_steps['pca'].transform(X_scaled)
Hyperparameter Tuning with Pipelines
Grid Search with Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', SVC())
])
param_grid = {
'classifier__C': [0.1, 1, 10, 100],
'classifier__gamma': ['scale', 'auto', 0.001, 0.01],
'classifier__kernel': ['rbf', 'linear']
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
Tuning Multiple Pipeline Steps
# Assumes the scaler/pca/classifier pipeline defined under Pipeline Basics
param_grid = {
# PCA parameters
'pca__n_components': [5, 10, 20, 50],
# Classifier parameters
'classifier__C': [0.1, 1, 10],
'classifier__kernel': ['rbf', 'linear']
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)
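A whole step can also be searched over by using the step name itself as a key, letting GridSearchCV compare alternative estimators in one run; a sketch using the 'classifier' step above:
from sklearn.linear_model import LogisticRegression
param_grid = [
    {
        'classifier': [SVC()],
        'classifier__C': [0.1, 1, 10]
    },
    {
        'classifier': [LogisticRegression(max_iter=1000)],
        'classifier__C': [0.01, 0.1, 1]
    }
]
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)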
ColumnTransformer
Basic Usage
ColumnTransformer (sklearn.compose.ColumnTransformer)
- Apply different preprocessing to different columns
- Keeps column-wise preprocessing inside the estimator, so cross-validation stays leakage-free when used in a Pipeline
- Example:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
# Define column groups
numeric_features = ['age', 'income', 'hours_per_week']
categorical_features = ['gender', 'occupation', 'native_country']
# Create preprocessor
preprocessor = ColumnTransformer(
transformers=[
('num', StandardScaler(), numeric_features),
('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
],
remainder='passthrough' # Keep other columns unchanged
)
X_transformed = preprocessor.fit_transform(X)
With Pipeline Steps
from sklearn.pipeline import Pipeline
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
]
)
# Full pipeline with model
full_pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', LogisticRegression())
])
full_pipeline.fit(X_train, y_train)
Using make_column_transformer
from sklearn.compose import make_column_transformer
preprocessor = make_column_transformer(
(StandardScaler(), numeric_features),
(OneHotEncoder(), categorical_features),
remainder='passthrough'
)
Column Selection
# By column names (if X is DataFrame)
preprocessor = ColumnTransformer([
('num', StandardScaler(), ['age', 'income']),
('cat', OneHotEncoder(), ['gender', 'occupation'])
])
# By column indices
preprocessor = ColumnTransformer([
('num', StandardScaler(), [0, 1, 2]),
('cat', OneHotEncoder(), [3, 4])
])
# By boolean mask
numeric_mask = [True, True, True, False, False]
categorical_mask = [False, False, False, True, True]
preprocessor = ColumnTransformer([
('num', StandardScaler(), numeric_mask),
('cat', OneHotEncoder(), categorical_mask)
])
# By callable
def is_numeric(X):
return X.select_dtypes(include=['number']).columns.tolist()
preprocessor = ColumnTransformer([
('num', StandardScaler(), is_numeric)
])
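For dtype-based selection, scikit-learn also ships make_column_selector, which is usually more concise than a hand-written callable:
from sklearn.compose import make_column_selector
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), make_column_selector(dtype_include='number')),
    ('cat', OneHotEncoder(), make_column_selector(dtype_include=object))
])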
Getting Feature Names
# get_feature_names_out() requires a fitted transformer
preprocessor.fit(X_train)
output_features = preprocessor.get_feature_names_out()
print(f"Input features: {X_train.columns.tolist()}")
print(f"Output features: {output_features}")
Remainder Handling
# Drop unspecified columns (default)
preprocessor = ColumnTransformer([...], remainder='drop')
# Pass through unchanged
preprocessor = ColumnTransformer([...], remainder='passthrough')
# Apply transformer to remaining columns
preprocessor = ColumnTransformer([...], remainder=StandardScaler())
FeatureUnion
Basic Usage
FeatureUnion (sklearn.pipeline.FeatureUnion)
- Concatenates results of multiple transformers
- Transformers are applied in parallel
- Example:
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
# Combine PCA and feature selection
feature_union = FeatureUnion([
('pca', PCA(n_components=10)),
('select_best', SelectKBest(k=20))
])
X_combined = feature_union.fit_transform(X_train, y_train)
print(f"Combined features: {X_combined.shape[1]}") # 10 + 20 = 30
With Pipeline
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, TruncatedSVD
# Create feature union
feature_union = FeatureUnion([
('pca', PCA(n_components=10)),
('svd', TruncatedSVD(n_components=10))
])
# Full pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('features', feature_union),
('classifier', LogisticRegression())
])
pipeline.fit(X_train, y_train)
Weighted Feature Union
# Apply weights to transformers
feature_union = FeatureUnion(
transformer_list=[
('pca', PCA(n_components=10)),
('select_best', SelectKBest(k=20))
],
transformer_weights={
'pca': 2.0, # Give PCA features double weight
'select_best': 1.0
}
)
Advanced Pipeline Patterns
Caching Pipeline Steps
from sklearn.pipeline import Pipeline
from tempfile import mkdtemp
from shutil import rmtree
# Cache intermediate results
cachedir = mkdtemp()
pipeline = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=50)),
('classifier', LogisticRegression())
], memory=cachedir)
pipeline.fit(X_train, y_train)
# Clean up cache
rmtree(cachedir)
Nested Pipelines
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# Inner pipeline for text processing
text_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer())
])
# FeatureUnion passes the full input to every transformer, so to combine a
# text column with numeric columns, route them with ColumnTransformer
# (column names here are placeholders):
full_pipeline = Pipeline([
    ('features', ColumnTransformer([
        ('text', text_pipeline, 'text_column'),  # single column name -> 1-D input for CountVectorizer
        ('numeric', StandardScaler(), ['num_col1', 'num_col2'])
    ])),
    ('classifier', LogisticRegression())
])
Custom Transformers in Pipelines
from sklearn.base import BaseEstimator, TransformerMixin
class TextLengthExtractor(BaseEstimator, TransformerMixin):
    """Emit one feature per document: its character length."""
    def fit(self, X, y=None):
        # Stateless: nothing to learn from the data
        return self
    def transform(self, X):
        # One row per document, a single column holding len(text)
        return [[len(text)] for text in X]
pipeline = Pipeline([
('length', TextLengthExtractor()),
('scaler', StandardScaler()),
('classifier', LogisticRegression())
])
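TransformerMixin derives fit_transform() from fit() and transform(), so the custom step can be sanity-checked on its own:
lengths = TextLengthExtractor().fit_transform(['hi', 'hello world'])
# [[2], [11]]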
Slicing Pipelines
# Get sub-pipeline
sub_pipeline = pipeline[:2] # First two steps
# Get specific range
middle_steps = pipeline[1:3]
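A common use of slicing is inspecting what the preprocessing portion of a fitted pipeline produces, without invoking the final estimator:
# Every step except the final estimator acts as one transformer
X_preprocessed = pipeline[:-1].transform(X_test)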
TransformedTargetRegressor
Basic Usage
TransformedTargetRegressor
- Transforms target variable before fitting
- Automatically inverse-transforms predictions
- Example:
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import QuantileTransformer
from sklearn.linear_model import LinearRegression
model = TransformedTargetRegressor(
regressor=LinearRegression(),
transformer=QuantileTransformer(output_distribution='normal')
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test) # Automatically inverse-transformed
With Functions
import numpy as np
model = TransformedTargetRegressor(
regressor=LinearRegression(),
func=np.log1p,
inverse_func=np.expm1
)
model.fit(X_train, y_train)
Complete Example: End-to-End Pipeline
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# Define feature types
numeric_features = ['age', 'income', 'hours_per_week']
categorical_features = ['gender', 'occupation', 'education']
# Numeric preprocessing pipeline
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
# Categorical preprocessing pipeline
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
# Combine preprocessing
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
]
)
# Full pipeline
pipeline = Pipeline([
('preprocessor', preprocessor),
('pca', PCA(n_components=0.95)), # Keep 95% variance
('classifier', RandomForestClassifier(random_state=42))
])
# Hyperparameter tuning
param_grid = {
'preprocessor__num__imputer__strategy': ['mean', 'median'],
'pca__n_components': [0.90, 0.95, 0.99],
'classifier__n_estimators': [100, 200],
'classifier__max_depth': [10, 20, None]
}
grid_search = GridSearchCV(
pipeline, param_grid,
cv=5, scoring='accuracy',
n_jobs=-1, verbose=1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")
print(f"Test score: {grid_search.score(X_test, y_test):.3f}")
# Make predictions
best_pipeline = grid_search.best_estimator_
y_pred = best_pipeline.predict(X_test)
y_proba = best_pipeline.predict_proba(X_test)
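Since the tuned pipeline is a single object, preprocessing and model can be persisted together; a sketch using joblib (already a scikit-learn dependency; the filename is illustrative):
import joblib
joblib.dump(best_pipeline, 'model_pipeline.joblib')
# Later, e.g. in a serving process:
loaded = joblib.load('model_pipeline.joblib')
y_pred = loaded.predict(X_test)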
Visualization
Displaying Pipelines
# In Jupyter notebooks, pipelines display as diagrams
from sklearn import set_config
set_config(display='diagram')
pipeline # Displays visual diagram
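The same diagram can be exported outside a notebook; estimator_html_repr returns it as an HTML string (the output path is illustrative):
from sklearn.utils import estimator_html_repr
with open('pipeline.html', 'w') as f:
    f.write(estimator_html_repr(pipeline))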
Text Representation
# Print pipeline structure
print(pipeline)
# Get detailed parameters
print(pipeline.get_params())
Best Practices
Always Use Pipelines
- Prevents data leakage
- Ensures consistency between training and prediction
- Makes code more maintainable
- Enables easy hyperparameter tuning
Proper Pipeline Construction
# Good: Preprocessing inside pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', LogisticRegression())
])
pipeline.fit(X_train, y_train)
# Bad: preprocessing outside the pipeline; under cross-validation the
# scaler would be fit once on all training data, leaking fold statistics
X_train_scaled = StandardScaler().fit_transform(X_train)
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
Use ColumnTransformer for Mixed Data
Always use ColumnTransformer when you have both numerical and categorical features:
preprocessor = ColumnTransformer([
('num', StandardScaler(), numeric_features),
('cat', OneHotEncoder(), categorical_features)
])
Name Your Steps Meaningfully
# Good
pipeline = Pipeline([
('imputer', SimpleImputer()),
('scaler', StandardScaler()),
('pca', PCA(n_components=10)),
('rf_classifier', RandomForestClassifier())
])
# Bad
pipeline = Pipeline([
('step1', SimpleImputer()),
('step2', StandardScaler()),
('step3', PCA(n_components=10)),
('step4', RandomForestClassifier())
])
Cache Expensive Transformations
For repeated fitting (e.g., during grid search), cache expensive steps:
from tempfile import mkdtemp
cachedir = mkdtemp()
pipeline = Pipeline([
('expensive_preprocessing', ExpensiveTransformer()),
('classifier', LogisticRegression())
], memory=cachedir)
Test Pipeline Compatibility
Ensure all steps are compatible:
- All intermediate steps must have fit() and transform()
- Final step needs fit() and predict() (or transform())
- Use set_output(transform='pandas') for DataFrame output
pipeline.set_output(transform='pandas')
# transform() requires the (sub-)pipeline to end in a transformer,
# so slice off a final predictor before transforming
X_transformed = pipeline[:-1].transform(X)  # Returns a DataFrame