# Data Preprocessing and Feature Engineering Reference

## Overview

Data preprocessing transforms raw data into a format suitable for machine learning models. This includes scaling, encoding, handling missing values, and feature engineering.

## Feature Scaling and Normalization

### StandardScaler

**StandardScaler (`sklearn.preprocessing.StandardScaler`)**
- Standardizes features to zero mean and unit variance
- Formula: z = (x - mean) / std
- Use when: Features have different scales, or the algorithm assumes normally distributed data
- Required for: SVM, KNN, Neural Networks, PCA, Linear Regression with regularization
- Example:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use same parameters as training

# Access learned parameters
print(f"Mean: {scaler.mean_}")
print(f"Std: {scaler.scale_}")
```

### MinMaxScaler

**MinMaxScaler (`sklearn.preprocessing.MinMaxScaler`)**
- Scales features to a given range (default [0, 1])
- Formula: X_scaled = (X - X.min) / (X.max - X.min)
- Use when: Need bounded values, data not normally distributed
- Sensitive to outliers
- Example:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X_train)

# Custom range
scaler = MinMaxScaler(feature_range=(-1, 1))
X_scaled = scaler.fit_transform(X_train)
```

### RobustScaler

**RobustScaler (`sklearn.preprocessing.RobustScaler`)**
- Scales using median and interquartile range (IQR)
- Formula: X_scaled = (X - median) / IQR
- Use when: Data contains outliers
- Robust to outliers
- Example:

```python
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_scaled = scaler.fit_transform(X_train)
```

### Normalizer

**Normalizer (`sklearn.preprocessing.Normalizer`)**
- Normalizes samples individually to unit norm
- Common norms: 'l1', 'l2', 'max'
- Use when: Need to normalize each sample independently (e.g., text features)
- Example:

```python
from sklearn.preprocessing import Normalizer

normalizer = Normalizer(norm='l2')  # Euclidean norm
X_normalized = normalizer.fit_transform(X)
```

### MaxAbsScaler

**MaxAbsScaler (`sklearn.preprocessing.MaxAbsScaler`)**
- Scales by maximum absolute value
- Range: [-1, 1]
- Doesn't shift/center data (preserves sparsity)
- Use when: Data is already centered or sparse
- Example:

```python
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X_sparse)
```

## Encoding Categorical Variables

### OneHotEncoder

**OneHotEncoder (`sklearn.preprocessing.OneHotEncoder`)**
- Creates binary columns for each category
- Use when: Nominal categories (no order); required for linear models, SVM, and KNN (tree-based models can use ordinal encoding instead)
- Example:

```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_encoded = encoder.fit_transform(X_categorical)

# Get feature names
feature_names = encoder.get_feature_names_out(['color', 'size'])

# Handle unknown categories during transform
X_test_encoded = encoder.transform(X_test_categorical)
```

### OrdinalEncoder

**OrdinalEncoder (`sklearn.preprocessing.OrdinalEncoder`)**
- Encodes categories as integers
- Use when: Ordinal categories (ordered), or tree-based models; see the comparison sketch below
- Example:

```python
from sklearn.preprocessing import OrdinalEncoder

# Natural ordering
encoder = OrdinalEncoder()
X_encoded = encoder.fit_transform(X_categorical)

# Custom ordering
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
X_encoded = encoder.fit_transform(X_categorical)
```
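To make the nominal vs. ordinal distinction concrete, here is a minimal comparison sketch (the toy color column is invented for illustration): one-hot encoding produces an order-free binary column per category, while ordinal encoding imposes an arbitrary alphabetical order by default.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy nominal column (values are illustrative)
X_cat = np.array([['red'], ['green'], ['blue'], ['green']])

onehot = OneHotEncoder(sparse_output=False)
print(onehot.fit_transform(X_cat))
# One binary column per category: shape (4, 3), no order implied

ordinal = OrdinalEncoder()
print(ordinal.fit_transform(X_cat))
# Single integer column: blue=0, green=1, red=2 (alphabetical),
# which a linear model would treat as a meaningful distance
```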
### LabelEncoder

**LabelEncoder (`sklearn.preprocessing.LabelEncoder`)**
- Encodes target labels (y) as integers
- Use for: Target variable encoding
- Example:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_encoded = le.fit_transform(y)

# Decode back
y_decoded = le.inverse_transform(y_encoded)
print(f"Classes: {le.classes_}")
```

### Target Encoding (using category_encoders)

Target encoding replaces each category with a statistic of the target (typically its mean), so it must be fit on the training split only to avoid leaking target information.

```python
# Install: uv pip install category-encoders
from category_encoders import TargetEncoder

encoder = TargetEncoder()
X_train_encoded = encoder.fit_transform(X_train_categorical, y_train)
X_test_encoded = encoder.transform(X_test_categorical)
```

## Non-linear Transformations

### Power Transforms

**PowerTransformer**
- Makes data more Gaussian-like
- Methods: 'yeo-johnson' (works with negative values), 'box-cox' (positive only)
- Use when: Data is skewed, algorithm assumes normality
- Example:

```python
from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson (handles negative values)
pt = PowerTransformer(method='yeo-johnson', standardize=True)
X_transformed = pt.fit_transform(X)

# Box-Cox (positive values only)
pt = PowerTransformer(method='box-cox', standardize=True)
X_transformed = pt.fit_transform(X)
```

### Quantile Transformation

**QuantileTransformer**
- Transforms features to follow a uniform or normal distribution
- Robust to outliers
- Use when: Want to reduce outlier impact
- Example:

```python
from sklearn.preprocessing import QuantileTransformer

# Transform to uniform distribution
qt = QuantileTransformer(output_distribution='uniform', random_state=42)
X_transformed = qt.fit_transform(X)

# Transform to normal distribution
qt = QuantileTransformer(output_distribution='normal', random_state=42)
X_transformed = qt.fit_transform(X)
```

### Log Transform

```python
import numpy as np

# Log1p (log(1 + x)) - handles zeros
X_log = np.log1p(X)

# Or use FunctionTransformer
from sklearn.preprocessing import FunctionTransformer
log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)
X_log = log_transformer.fit_transform(X)
```

## Missing Value Imputation

### SimpleImputer

**SimpleImputer (`sklearn.impute.SimpleImputer`)**
- Basic imputation strategies
- Strategies: 'mean', 'median', 'most_frequent', 'constant'
- Example:

```python
from sklearn.impute import SimpleImputer

# For numerical features
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# For categorical features
imputer = SimpleImputer(strategy='most_frequent')
X_imputed = imputer.fit_transform(X_categorical)

# Fill with constant
imputer = SimpleImputer(strategy='constant', fill_value=0)
X_imputed = imputer.fit_transform(X)
```

### Iterative Imputer

**IterativeImputer**
- Models each feature with missing values as a function of the other features
- More sophisticated than SimpleImputer
- Example:

```python
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=10, random_state=42)
X_imputed = imputer.fit_transform(X)
```

### KNN Imputer

**KNNImputer**
- Imputes using k-nearest neighbors
- Use when: Features are correlated; see the comparison sketch below
- Example:

```python
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X)
```
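To see how the imputers differ in practice, here is a minimal sketch (toy array invented for illustration) that fills the same missing entry with a column mean and with KNN; because the second feature tracks the first, KNN recovers a more plausible value.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy data: second feature roughly tracks the first (values are illustrative)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, np.nan],
              [4.0, 40.0]])

# Column mean ignores the correlated first feature
print(SimpleImputer(strategy='mean').fit_transform(X))  # fills ~23.3

# KNN averages the rows most similar on the observed feature
print(KNNImputer(n_neighbors=2).fit_transform(X))       # fills 30.0
```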
## Feature Engineering

### Polynomial Features

**PolynomialFeatures**
- Creates polynomial and interaction features
- Use when: Need non-linear features for linear models
- Example:

```python
from sklearn.preprocessing import PolynomialFeatures

# Degree 2: includes x1, x2, x1^2, x2^2, x1*x2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Get feature names
feature_names = poly.get_feature_names_out(['x1', 'x2'])

# Only interactions (no powers)
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interactions = poly.fit_transform(X)
```

### Binning/Discretization

**KBinsDiscretizer**
- Bins continuous features into discrete intervals
- Strategies: 'uniform', 'quantile', 'kmeans'
- Encoding: 'onehot', 'ordinal', 'onehot-dense'
- Example:

```python
from sklearn.preprocessing import KBinsDiscretizer

# Equal-width bins
binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
X_binned = binner.fit_transform(X)

# Equal-frequency bins (quantile-based)
binner = KBinsDiscretizer(n_bins=5, encode='onehot', strategy='quantile')
X_binned = binner.fit_transform(X)
```

### Binarization

**Binarizer**
- Converts features to binary (0 or 1) based on a threshold
- Example:

```python
from sklearn.preprocessing import Binarizer

binarizer = Binarizer(threshold=0.5)
X_binary = binarizer.fit_transform(X)
```

### Spline Features

**SplineTransformer**
- Creates spline basis functions
- Useful for capturing non-linear relationships
- Example:

```python
from sklearn.preprocessing import SplineTransformer

spline = SplineTransformer(n_knots=5, degree=3)
X_splines = spline.fit_transform(X)
```

## Text Feature Extraction

### CountVectorizer

**CountVectorizer (`sklearn.feature_extraction.text.CountVectorizer`)**
- Converts text to a token count matrix
- Use for: Bag-of-words representation
- Example:

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    max_features=5000,   # Keep top 5000 features
    min_df=2,            # Ignore terms appearing in < 2 documents
    max_df=0.8,          # Ignore terms appearing in > 80% of documents
    ngram_range=(1, 2)   # Unigrams and bigrams
)
X_counts = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names_out()
```

### TfidfVectorizer

**TfidfVectorizer**
- TF-IDF (Term Frequency-Inverse Document Frequency) transformation
- Often outperforms raw counts because frequent but uninformative terms are down-weighted
- Example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=5000,
    min_df=2,
    max_df=0.8,
    ngram_range=(1, 2),
    stop_words='english'  # Remove English stop words
)
X_tfidf = vectorizer.fit_transform(documents)
```

### HashingVectorizer

**HashingVectorizer**
- Uses the hashing trick for memory efficiency
- Stateless: no fit needed, can't reverse the transform
- Use when: Very large vocabulary, streaming data
- Example:

```python
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=2**18)
X_hashed = vectorizer.transform(documents)  # No fit needed
```

## Feature Selection

### Filter Methods

**Variance Threshold**
- Removes low-variance features
- Example:

```python
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)
```

**SelectKBest / SelectPercentile**
- Select features based on statistical tests
- Tests: f_classif, chi2, mutual_info_classif
- Example:

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Select top 10 features
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X_train, y_train)

# Get selected feature indices
selected_indices = selector.get_support(indices=True)
```
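Because feature selection learns from the data, fitting it before cross-validation can leak information. A minimal sketch (the estimator and k=10 are illustrative choices) wraps SelectKBest in a Pipeline so the selection is re-fit on each training fold:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ('select', SelectKBest(score_func=f_classif, k=10)),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Selection is re-fit inside each training fold, so these
# cross-validation scores are free of selection leakage
scores = cross_val_score(pipe, X_train, y_train, cv=5)
```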
### Wrapper Methods

**Recursive Feature Elimination (RFE)**
- Recursively removes the weakest features
- Ranks features using the estimator's `coef_` or `feature_importances_`
- Example:

```python
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
rfe = RFE(estimator=model, n_features_to_select=10, step=1)
X_selected = rfe.fit_transform(X_train, y_train)

# Get selected features
selected_features = rfe.support_
feature_ranking = rfe.ranking_
```

**RFECV (with Cross-Validation)**
- RFE with cross-validation to find the optimal number of features
- Example:

```python
from sklearn.feature_selection import RFECV

model = RandomForestClassifier(n_estimators=100, random_state=42)
rfecv = RFECV(estimator=model, cv=5, scoring='accuracy')
X_selected = rfecv.fit_transform(X_train, y_train)

print(f"Optimal number of features: {rfecv.n_features_}")
```

### Embedded Methods

**SelectFromModel**
- Select features based on model coefficients/importances
- Works with: Linear models (L1), tree-based models
- Example:

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
selector = SelectFromModel(model, threshold='median')
selector.fit(X_train, y_train)
X_selected = selector.transform(X_train)

# Get selected features
selected_features = selector.get_support()
```

**L1-based Feature Selection**

```python
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
selector = SelectFromModel(model)
selector.fit(X_train, y_train)
X_selected = selector.transform(X_train)
```

## Handling Outliers

### IQR Method

```python
import numpy as np

Q1 = np.percentile(X, 25, axis=0)
Q3 = np.percentile(X, 75, axis=0)
IQR = Q3 - Q1

# Define outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers
mask = np.all((X >= lower_bound) & (X <= upper_bound), axis=1)
X_no_outliers = X[mask]
```

### Winsorization

```python
from scipy.stats import mstats

# Clip outliers at 5th and 95th percentiles
X_winsorized = mstats.winsorize(X, limits=[0.05, 0.05], axis=0)
```

## Custom Transformers

### Using FunctionTransformer

```python
from sklearn.preprocessing import FunctionTransformer
import numpy as np

def log_transform(X):
    return np.log1p(X)

transformer = FunctionTransformer(log_transform, inverse_func=np.expm1)
X_transformed = transformer.fit_transform(X)
```

### Creating Custom Transformer

```python
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, parameter=1):
        self.parameter = parameter

    def fit(self, X, y=None):
        # Learn parameters from X if needed
        return self

    def transform(self, X):
        # Transform X
        return X * self.parameter

transformer = CustomTransformer(parameter=2)
X_transformed = transformer.fit_transform(X)
```

## Best Practices

### Fit on Training Data Only

Always fit transformers on training data only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Correct
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Wrong - causes data leakage
scaler = StandardScaler()
X_all_scaled = scaler.fit_transform(np.vstack([X_train, X_test]))
```
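Bridging the IQR method above with the custom-transformer pattern, here is a minimal sketch of an outlier clipper (the class name `IQRClipper` and the choice to clip rather than drop rows are illustrative assumptions) that learns its bounds at fit time, so it can sit inside a pipeline without leaking test-set statistics:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class IQRClipper(BaseEstimator, TransformerMixin):
    """Clip each feature to [Q1 - k*IQR, Q3 + k*IQR] learned at fit time."""

    def __init__(self, k=1.5):
        self.k = k

    def fit(self, X, y=None):
        q1 = np.percentile(X, 25, axis=0)
        q3 = np.percentile(X, 75, axis=0)
        iqr = q3 - q1
        self.lower_ = q1 - self.k * iqr
        self.upper_ = q3 + self.k * iqr
        return self

    def transform(self, X):
        # Clipping (rather than dropping rows) keeps X and y aligned
        return np.clip(X, self.lower_, self.upper_)

# Bounds come from the training data only
X_train_clipped = IQRClipper(k=1.5).fit(X_train).transform(X_train)
```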
### Use Pipelines

Combine preprocessing with models:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

pipeline.fit(X_train, y_train)
```

### Handle Categorical and Numerical Separately

Use ColumnTransformer:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_features = ['age', 'income']
categorical_features = ['gender', 'occupation']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ]
)
X_transformed = preprocessor.fit_transform(X)
```

### Algorithm-Specific Requirements

**Require Scaling:**
- SVM, KNN, Neural Networks
- PCA, Linear/Logistic Regression with regularization
- K-Means clustering

**Don't Require Scaling:**
- Tree-based models (Decision Trees, Random Forest, Gradient Boosting)
- Naive Bayes

**Encoding Requirements:**
- Linear models, SVM, KNN: One-hot encoding for nominal features
- Tree-based models: Can handle ordinal encoding directly
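Putting these pieces together, here is a sketch of an end-to-end pipeline (the column names are hypothetical and it assumes `X_train` is a pandas DataFrame) that chains imputation, scaling, and encoding per column type into a single estimator fit on training data only:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ['age', 'income']             # hypothetical columns
categorical_features = ['gender', 'occupation']  # hypothetical columns

numeric_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])
categorical_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

model = Pipeline([
    ('prep', ColumnTransformer([
        ('num', numeric_pipe, numeric_features),
        ('cat', categorical_pipe, categorical_features),
    ])),
    ('clf', LogisticRegression(max_iter=1000)),
])

model.fit(X_train, y_train)  # all transformers fit on training data only
```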