# Generalized Linear Models (GLM) Reference

This document provides comprehensive guidance on generalized linear models in statsmodels, including families, link functions, and applications.

## Overview

GLMs extend linear regression to non-normal response distributions through:

1. **Distribution family**: Specifies the conditional distribution of the response
2. **Link function**: Transforms the linear predictor to the scale of the mean
3. **Variance function**: Relates the variance to the mean

**General form**: g(μ) = Xβ, where g is the link function and μ = E(Y|X)

## When to Use GLM

- **Binary outcomes**: Logistic regression (Binomial family with logit link)
- **Count data**: Poisson or Negative Binomial regression
- **Positive continuous data**: Gamma or Inverse Gaussian
- **Non-normal distributions**: When OLS assumptions are violated
- **Flexible link functions**: When the relationship between the linear predictor and the mean is non-linear

## Distribution Families

### Binomial Family

For binary outcomes (0/1) or proportions (k/n).

**When to use:**
- Binary classification
- Success/failure outcomes
- Proportions or rates

**Common links:**
- Logit (default): log(μ/(1-μ))
- Probit: Φ⁻¹(μ)
- Log: log(μ)

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Binary logistic regression
model = sm.GLM(y, X, family=sm.families.Binomial())
results = model.fit()

# Formula API
results = smf.glm('success ~ x1 + x2', data=df, family=sm.families.Binomial()).fit()

# Access predictions (probabilities)
probs = results.predict(X_new)

# Classification (0.5 threshold)
predictions = (probs > 0.5).astype(int)
```

**Interpretation:**

```python
import numpy as np

# Odds ratios (for logit link)
odds_ratios = np.exp(results.params)
print("Odds ratios:", odds_ratios)

# For a 1-unit increase in x, the odds multiply by exp(beta)
```

### Poisson Family

For count data (non-negative integers).

**When to use:**
- Count outcomes (number of events)
- Rare events
- Rate modeling (with offset)

**Common links:**
- Log (default): log(μ)
- Identity: μ
- Sqrt: √μ

```python
import numpy as np

# Poisson regression
model = sm.GLM(y, X, family=sm.families.Poisson())
results = model.fit()

# With exposure/offset for rates:
# if modeling rate = counts/exposure, pass log(exposure) as the offset
model = sm.GLM(y, X, family=sm.families.Poisson(), offset=np.log(exposure))
results = model.fit()

# Interpretation: exp(beta) = multiplicative effect on the expected count
rate_ratios = np.exp(results.params)
print("Rate ratios:", rate_ratios)
```

**Overdispersion check:**

```python
# Deviance / df should be ~1 for Poisson
overdispersion = results.deviance / results.df_resid
print(f"Overdispersion: {overdispersion}")

# If >> 1, consider Negative Binomial
if overdispersion > 1.5:
    print("Consider Negative Binomial model for overdispersion")
```

### Negative Binomial Family

For overdispersed count data.

**When to use:**
- Count data with variance > mean
- Excess zeros or large variance
- Poisson model shows overdispersion

```python
# Negative Binomial GLM
model = sm.GLM(y, X, family=sm.families.NegativeBinomial())
results = model.fit()

# Alternative: use the discrete choice model, which estimates alpha
from statsmodels.discrete.discrete_model import NegativeBinomial
nb_model = NegativeBinomial(y, X)
nb_results = nb_model.fit()
print(f"Dispersion parameter alpha: {nb_results.params[-1]}")
```
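The GLM `NegativeBinomial` family keeps the dispersion parameter fixed (`alpha` defaults to 1.0), so a common workflow is to estimate `alpha` with the discrete-model fit above and pass it back into the GLM family. A minimal sketch of that hand-off, assuming `y` and `X` are defined as in the blocks above:

```python
import statsmodels.api as sm
from statsmodels.discrete.discrete_model import NegativeBinomial

# Estimate the NB2 dispersion parameter (alpha is the last fitted parameter)
alpha_hat = NegativeBinomial(y, X).fit(disp=False).params[-1]

# Refit as a GLM with the estimated alpha held fixed
model = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=alpha_hat))
results = model.fit()
```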
### Gaussian Family

Equivalent to OLS, but fit via IRLS (Iteratively Reweighted Least Squares).

**When to use:**
- Want the GLM framework for consistency
- Need robust standard errors
- Comparing with other GLMs

**Common links:**
- Identity (default): μ
- Log: log(μ)
- Inverse: 1/μ

```python
import numpy as np

# Gaussian GLM (equivalent to OLS)
model = sm.GLM(y, X, family=sm.families.Gaussian())
results = model.fit()

# Verify equivalence with OLS
ols_results = sm.OLS(y, X).fit()
print("Parameters close:", np.allclose(results.params, ols_results.params))
```

### Gamma Family

For positive continuous data, often right-skewed.

**When to use:**
- Positive outcomes (insurance claims, survival times)
- Right-skewed distributions
- Variance proportional to mean²

**Common links:**
- Inverse (default): 1/μ
- Log: log(μ)
- Identity: μ

```python
import numpy as np

# Gamma regression (common for cost data)
model = sm.GLM(y, X, family=sm.families.Gamma())
results = model.fit()

# Log link is often preferred for interpretation
model = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log()))
results = model.fit()

# With the log link, exp(beta) = multiplicative effect
effects = np.exp(results.params)
```

### Inverse Gaussian Family

For positive continuous data with a specific variance structure.

**When to use:**
- Positive, skewed outcomes
- Variance proportional to mean³
- Alternative to Gamma

**Common links:**
- Inverse squared (default): 1/μ²
- Log: log(μ)

```python
model = sm.GLM(y, X, family=sm.families.InverseGaussian())
results = model.fit()
```

### Tweedie Family

Flexible family covering multiple distributions.

**When to use:**
- Insurance claims (mixture of zeros and continuous amounts)
- Semi-continuous data
- Need a flexible variance function

**Special cases (power parameter p):**
- p=0: Normal
- p=1: Poisson
- 1<p<2: Compound Poisson-Gamma (point mass at zero plus positive continuous)
- p=2: Gamma
- p=3: Inverse Gaussian

## Model Diagnostics

### Goodness of Fit

```python
# Deviance and Pearson chi-square relative to residual degrees of freedom
print(f"Deviance/df: {results.deviance / results.df_resid:.2f}")
print(f"Pearson chi2/df: {results.pearson_chi2 / results.df_resid:.2f}")

# A ratio near 1 is expected; >1 suggests overdispersion, <1 underdispersion
```

### Residual Analysis

```python
import matplotlib.pyplot as plt

# Deviance residuals vs fitted
plt.figure(figsize=(10, 6))
plt.scatter(results.fittedvalues, results.resid_deviance, alpha=0.5)
plt.xlabel('Fitted values')
plt.ylabel('Deviance residuals')
plt.axhline(y=0, color='r', linestyle='--')
plt.title('Deviance Residuals vs Fitted')
plt.show()

# Q-Q plot of deviance residuals
from statsmodels.graphics.gofplots import qqplot
qqplot(results.resid_deviance, line='s')
plt.title('Q-Q Plot of Deviance Residuals')
plt.show()

# For binary outcomes, a binned residual plot is more informative:
# group observations by predicted probability and plot average residuals
# per bin (custom implementation needed)
if isinstance(results.model.family, sm.families.Binomial):
    pass
```

### Influence and Outliers

```python
from statsmodels.stats.outliers_influence import GLMInfluence

influence = GLMInfluence(results)

# Leverage
leverage = influence.hat_matrix_diag

# Cook's distance
cooks_d = influence.cooks_distance[0]

# DFFITS
dffits = influence.dffits[0]

# Find influential observations
influential = np.where(cooks_d > 4 / len(y))[0]
print(f"Influential observations: {influential}")
```

## Hypothesis Testing

```python
# Wald tests for single parameters are reported automatically in summary()

# Likelihood ratio test for nested models
model_reduced = sm.GLM(y, X_reduced, family=family).fit()
model_full = sm.GLM(y, X_full, family=family).fit()

# LR statistic and p-value
lr_stat = 2 * (model_full.llf - model_reduced.llf)
df = model_full.df_model - model_reduced.df_model

from scipy import stats
lr_pval = stats.chi2.sf(lr_stat, df)
print(f"LR test p-value: {lr_pval}")

# Wald test for multiple parameters
# Test beta_1 = beta_2 = 0
R = [[0, 1, 0, 0],
     [0, 0, 1, 0]]
wald_test = results.wald_test(R)
print(wald_test)
```
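When the model is fit through the formula interface, the parameters are named, and the same joint Wald test can be written as a constraint string instead of a raw restriction matrix. A minimal sketch, assuming a hypothetical formula fit with predictors `x1`, `x2`, and `x3` in a DataFrame `df`:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Formula fit: parameters carry the names used in the formula
formula_results = smf.glm('y ~ x1 + x2 + x3', data=df,
                          family=sm.families.Binomial()).fit()

# Joint Wald test of H0: beta_x1 = 0 and beta_x2 = 0, written as named constraints
print(formula_results.wald_test('x1 = 0, x2 = 0'))
```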
## Robust Standard Errors

```python
# Heteroscedasticity-robust (sandwich) covariance, requested at fit time
results_robust = sm.GLM(y, X, family=family).fit(cov_type='HC0')

# Cluster-robust covariance
results_cluster = sm.GLM(y, X, family=family).fit(
    cov_type='cluster', cov_kwds={'groups': cluster_ids}
)

# Compare standard errors
print("Regular SE:", results.bse)
print("Robust SE:", results_robust.bse)
```

## Model Comparison

```python
# AIC/BIC for non-nested models
models = [model1_results, model2_results, model3_results]
for i, res in enumerate(models, 1):
    print(f"Model {i}: AIC={res.aic:.2f}, BIC={res.bic:.2f}")

# Likelihood ratio test for nested models (as shown above)

# Cross-validation for predictive performance (binary response assumed)
from sklearn.model_selection import KFold
from sklearn.metrics import log_loss

kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = []

for train_idx, val_idx in kf.split(X):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]

    model_cv = sm.GLM(y_train, X_train, family=family).fit()
    pred_probs = model_cv.predict(X_val)
    cv_scores.append(log_loss(y_val, pred_probs))

print(f"CV Log Loss: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}")
```

## Prediction

```python
# Point predictions (on the scale of the mean)
predictions = results.predict(X_new)

# For classification: predictions are probabilities; threshold to get classes
if isinstance(family, sm.families.Binomial):
    probs = predictions
    class_predictions = (probs > 0.5).astype(int)

# For counts: predictions are expected counts
if isinstance(family, sm.families.Poisson):
    expected_counts = predictions

# Bootstrap intervals for the predicted mean
n_boot = 1000
boot_preds = np.zeros((n_boot, len(X_new)))

for i in range(n_boot):
    # Bootstrap resample
    boot_idx = np.random.choice(len(y), size=len(y), replace=True)
    X_boot, y_boot = X[boot_idx], y[boot_idx]

    # Fit and predict
    boot_model = sm.GLM(y_boot, X_boot, family=family).fit()
    boot_preds[i] = boot_model.predict(X_new)

# 95% bootstrap intervals (for the mean prediction, not for new observations)
pred_lower = np.percentile(boot_preds, 2.5, axis=0)
pred_upper = np.percentile(boot_preds, 97.5, axis=0)
```

## Common Applications

### Logistic Regression (Binary Classification)

```python
import statsmodels.api as sm

# Fit logistic regression
X = sm.add_constant(X_data)
model = sm.GLM(y, X, family=sm.families.Binomial())
results = model.fit()

# Odds ratios
odds_ratios = np.exp(results.params)
odds_ci = np.exp(results.conf_int())

# Classification metrics
from sklearn.metrics import classification_report, roc_auc_score
probs = results.predict(X)
predictions = (probs > 0.5).astype(int)
print(classification_report(y, predictions))
print(f"AUC: {roc_auc_score(y, probs):.4f}")

# ROC curve
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
fpr, tpr, thresholds = roc_curve(y, probs)
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
```

### Poisson Regression (Count Data)

```python
# Fit Poisson model
X = sm.add_constant(X_data)
model = sm.GLM(y_counts, X, family=sm.families.Poisson())
results = model.fit()

# Rate ratios
rate_ratios = np.exp(results.params)
print("Rate ratios:", rate_ratios)

# Check overdispersion
dispersion = results.pearson_chi2 / results.df_resid
if dispersion > 1.5:
    print(f"Overdispersion detected ({dispersion:.2f}). Consider Negative Binomial.")
```
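Before switching families, a lighter-touch option is a quasi-Poisson-style adjustment: keep the Poisson mean model but let statsmodels estimate the scale from the Pearson chi-square, which inflates the standard errors to reflect the overdispersion. A minimal sketch, reusing `y_counts` and `X` from the block above:

```python
import statsmodels.api as sm

# Same Poisson mean structure, but estimate the dispersion (scale) from
# Pearson chi-square instead of fixing it at 1
quasi_results = sm.GLM(y_counts, X, family=sm.families.Poisson()).fit(scale='X2')

# Coefficients are unchanged; standard errors are inflated by sqrt(scale)
print(f"Estimated scale: {quasi_results.scale:.2f}")
print("Adjusted SE:", quasi_results.bse)
```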
Consider Negative Binomial.") ``` ### Gamma Regression (Cost/Duration Data) ```python # Fit Gamma model with log link X = sm.add_constant(X_data) model = sm.GLM(y_cost, X, family=sm.families.Gamma(link=sm.families.links.Log())) results = model.fit() # Multiplicative effects effects = np.exp(results.params) print("Multiplicative effects on mean:", effects) ``` ## Best Practices 1. **Check distribution assumptions**: Plot histograms and Q-Q plots of response 2. **Verify link function**: Use canonical links unless there's a reason not to 3. **Examine residuals**: Deviance residuals should be approximately normal 4. **Test for overdispersion**: Especially for Poisson models 5. **Use offsets appropriately**: For rate modeling with varying exposure 6. **Consider robust SEs**: When variance assumptions questionable 7. **Compare models**: Use AIC/BIC for non-nested, LR test for nested 8. **Interpret on original scale**: Transform coefficients (e.g., exp for log link) 9. **Check influential observations**: Use Cook's distance 10. **Validate predictions**: Use cross-validation or holdout set ## Common Pitfalls 1. **Forgetting to add constant**: No intercept term 2. **Using wrong family**: Check distribution of response 3. **Ignoring overdispersion**: Use Negative Binomial instead of Poisson 4. **Misinterpreting coefficients**: Remember link function transformation 5. **Not checking convergence**: IRLS may not converge; check warnings 6. **Complete separation in logistic**: Some categories perfectly predict outcome 7. **Using identity link with bounded outcomes**: May predict outside valid range 8. **Comparing models with different samples**: Use same observations 9. **Forgetting offset in rate models**: Must use log(exposure) as offset 10. **Not considering alternatives**: Mixed models, zero-inflation for complex data