Initial commit
Zhongwei Li · 2025-11-30 08:30:10 +08:00 · commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions

skills/statsmodels/SKILL.md
---
name: statsmodels
description: "Statistical modeling toolkit. OLS, GLM, logistic, ARIMA, time series, hypothesis tests, diagnostics, AIC/BIC, for rigorous statistical inference and econometric analysis."
---
# Statsmodels: Statistical Modeling and Econometrics
## Overview
Statsmodels is Python's premier library for statistical modeling, providing tools for estimation, inference, and diagnostics across a wide range of statistical methods. Apply this skill for rigorous statistical analysis, from simple linear regression to complex time series models and econometric analyses.
## When to Use This Skill
This skill should be used when:
- Fitting regression models (OLS, WLS, GLS, quantile regression)
- Performing generalized linear modeling (logistic, Poisson, Gamma, etc.)
- Analyzing discrete outcomes (binary, multinomial, count, ordinal)
- Conducting time series analysis (ARIMA, SARIMAX, VAR, forecasting)
- Running statistical tests and diagnostics
- Testing model assumptions (heteroskedasticity, autocorrelation, normality)
- Detecting outliers and influential observations
- Comparing models (AIC/BIC, likelihood ratio tests)
- Estimating causal effects
- Producing publication-ready statistical tables and inference
## Quick Start Guide
### Linear Regression (OLS)
```python
import statsmodels.api as sm
import numpy as np
import pandas as pd
# Prepare data - ALWAYS add constant for intercept
X = sm.add_constant(X_data)
# Fit OLS model
model = sm.OLS(y, X)
results = model.fit()
# View comprehensive results
print(results.summary())
# Key results
print(f"R-squared: {results.rsquared:.4f}")
print(f"Coefficients:\\n{results.params}")
print(f"P-values:\\n{results.pvalues}")
# Predictions with confidence intervals
predictions = results.get_prediction(X_new)
pred_summary = predictions.summary_frame()
print(pred_summary) # includes mean, CI, prediction intervals
# Diagnostics
from statsmodels.stats.diagnostic import het_breuschpagan
bp_test = het_breuschpagan(results.resid, X)
print(f"Breusch-Pagan p-value: {bp_test[1]:.4f}")
# Visualize residuals
import matplotlib.pyplot as plt
plt.scatter(results.fittedvalues, results.resid)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()
```
### Logistic Regression (Binary Outcomes)
```python
from statsmodels.discrete.discrete_model import Logit
# Add constant
X = sm.add_constant(X_data)
# Fit logit model
model = Logit(y_binary, X)
results = model.fit()
print(results.summary())
# Odds ratios
odds_ratios = np.exp(results.params)
print("Odds ratios:\\n", odds_ratios)
# Predicted probabilities
probs = results.predict(X)
# Binary predictions (0.5 threshold)
predictions = (probs > 0.5).astype(int)
# Model evaluation
from sklearn.metrics import classification_report, roc_auc_score
print(classification_report(y_binary, predictions))
print(f"AUC: {roc_auc_score(y_binary, probs):.4f}")
# Marginal effects
marginal = results.get_margeff()
print(marginal.summary())
```
### Time Series (ARIMA)
```python
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# Check stationarity
from statsmodels.tsa.stattools import adfuller
adf_result = adfuller(y_series)
print(f"ADF p-value: {adf_result[1]:.4f}")
if adf_result[1] > 0.05:
# Series is non-stationary, difference it
y_diff = y_series.diff().dropna()
# Plot ACF/PACF to identify p, q
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
plot_acf(y_diff, lags=40, ax=ax1)
plot_pacf(y_diff, lags=40, ax=ax2)
plt.show()
# Fit ARIMA(p,d,q)
model = ARIMA(y_series, order=(1, 1, 1))
results = model.fit()
print(results.summary())
# Forecast
forecast = results.forecast(steps=10)
forecast_obj = results.get_forecast(steps=10)
forecast_df = forecast_obj.summary_frame()
print(forecast_df) # includes mean and confidence intervals
# Residual diagnostics
results.plot_diagnostics(figsize=(12, 8))
plt.show()
```
### Generalized Linear Models (GLM)
```python
import statsmodels.api as sm
# Poisson regression for count data
X = sm.add_constant(X_data)
model = sm.GLM(y_counts, X, family=sm.families.Poisson())
results = model.fit()
print(results.summary())
# Rate ratios (for Poisson with log link)
rate_ratios = np.exp(results.params)
print("Rate ratios:\\n", rate_ratios)
# Check overdispersion
overdispersion = results.pearson_chi2 / results.df_resid
print(f"Overdispersion: {overdispersion:.2f}")
if overdispersion > 1.5:
# Use Negative Binomial instead
from statsmodels.discrete.count_model import NegativeBinomial
nb_model = NegativeBinomial(y_counts, X)
nb_results = nb_model.fit()
print(nb_results.summary())
```
## Core Statistical Modeling Capabilities
### 1. Linear Regression Models
Comprehensive suite of linear models for continuous outcomes with various error structures.
**Available models:**
- **OLS**: Standard linear regression with i.i.d. errors
- **WLS**: Weighted least squares for heteroskedastic errors
- **GLS**: Generalized least squares for arbitrary covariance structure
- **GLSAR**: GLS with autoregressive errors for time series
- **Quantile Regression**: Conditional quantiles (robust to outliers)
- **Mixed Effects**: Hierarchical/multilevel models with random effects
- **Recursive/Rolling**: Time-varying parameter estimation
**Key features:**
- Comprehensive diagnostic tests
- Robust standard errors (HC, HAC, cluster-robust; see the sketch after this subsection)
- Influence statistics (Cook's distance, leverage, DFFITS)
- Hypothesis testing (F-tests, Wald tests)
- Model comparison (AIC, BIC, likelihood ratio tests)
- Prediction with confidence and prediction intervals
**When to use:** Continuous outcome variable, want inference on coefficients, need diagnostics
**Reference:** See `references/linear_models.md` for detailed guidance on model selection, diagnostics, and best practices.
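Robust covariance options are passed through `fit`; a minimal sketch, assuming `X` already includes a constant and `groups` holds cluster ids:
```python
import statsmodels.api as sm

ols = sm.OLS(y, X)
res_hc3 = ols.fit(cov_type='HC3')                           # heteroskedasticity-robust
res_hac = ols.fit(cov_type='HAC', cov_kwds={'maxlags': 4})  # Newey-West
res_clu = ols.fit(cov_type='cluster', cov_kwds={'groups': groups})
print(res_hc3.bse)  # robust standard errors
```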
### 2. Generalized Linear Models (GLM)
Flexible framework extending linear models to non-normal distributions.
**Distribution families:**
- **Binomial**: Binary outcomes or proportions (logistic regression)
- **Poisson**: Count data
- **Negative Binomial**: Overdispersed counts
- **Gamma**: Positive continuous, right-skewed data
- **Inverse Gaussian**: Positive continuous with specific variance structure
- **Gaussian**: Equivalent to OLS
- **Tweedie**: Flexible family for semi-continuous data
**Link functions:**
- Logit, Probit, Log, Identity, Inverse, Sqrt, CLogLog, Power
- Choose based on interpretation needs and model fit
**Key features:**
- Maximum likelihood estimation via IRLS
- Deviance and Pearson residuals
- Goodness-of-fit statistics
- Pseudo R-squared measures
- Robust standard errors
**When to use:** Non-normal outcomes, need flexible variance and link specifications
**Reference:** See `references/glm.md` for family selection, link functions, interpretation, and diagnostics.
### 3. Discrete Choice Models
Models for categorical and count outcomes.
**Binary models:**
- **Logit**: Logistic regression (odds ratios)
- **Probit**: Probit regression (normal distribution)
**Multinomial models:**
- **MNLogit**: Unordered categories (3+ levels)
- **Conditional Logit**: Choice models with alternative-specific variables
- **Ordered Model**: Ordinal outcomes (ordered categories)
**Count models:**
- **Poisson**: Standard count model
- **Negative Binomial**: Overdispersed counts
- **Zero-Inflated**: Excess zeros (ZIP, ZINB)
- **Hurdle Models**: Two-stage models for zero-heavy data
**Key features:**
- Maximum likelihood estimation
- Marginal effects at means or average marginal effects
- Model comparison via AIC/BIC
- Predicted probabilities and classification
- Goodness-of-fit tests
**When to use:** Binary, categorical, or count outcomes
**Reference:** See `references/discrete_choice.md` for model selection, interpretation, and evaluation.
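The ordinal case is not covered in the Quick Start; a minimal sketch, assuming `y_ordered` is coded 0, 1, 2, … — note that `OrderedModel` estimates its own cutpoints, so the design matrix must not contain a constant:
```python
from statsmodels.miscmodels.ordinal_model import OrderedModel

# X_data WITHOUT a constant: OrderedModel rejects an explicit intercept
ord_model = OrderedModel(y_ordered, X_data, distr='logit')
ord_results = ord_model.fit(method='bfgs', disp=False)
print(ord_results.summary())
```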
### 4. Time Series Analysis
Comprehensive time series modeling and forecasting capabilities.
**Univariate models:**
- **AutoReg (AR)**: Autoregressive models
- **ARIMA**: Autoregressive integrated moving average
- **SARIMAX**: Seasonal ARIMA with exogenous variables
- **Exponential Smoothing**: Simple, Holt, Holt-Winters
- **ETS**: Innovations state space models
**Multivariate models:**
- **VAR**: Vector autoregression
- **VARMAX**: VAR with MA and exogenous variables
- **Dynamic Factor Models**: Extract common factors
- **VECM**: Vector error correction models (cointegration)
**Advanced models:**
- **State Space**: Kalman filtering, custom specifications
- **Regime Switching**: Markov switching models
- **ARDL**: Autoregressive distributed lag
**Key features:**
- ACF/PACF analysis for model identification
- Stationarity tests (ADF, KPSS)
- Forecasting with prediction intervals
- Residual diagnostics (Ljung-Box, heteroskedasticity)
- Granger causality testing
- Impulse response functions (IRF)
- Forecast error variance decomposition (FEVD)
**When to use:** Time-ordered data, forecasting, understanding temporal dynamics
**Reference:** See `references/time_series.md` for model selection, diagnostics, and forecasting methods.
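As one concrete case, a seasonal model with an exogenous regressor; a minimal sketch, assuming a monthly series `y` with aligned regressors `x` and `x_future` (hypothetical names):
```python
import statsmodels.api as sm

model = sm.tsa.SARIMAX(y, exog=x, order=(1, 1, 1),
                       seasonal_order=(1, 1, 1, 12))
results = model.fit(disp=False)
print(results.summary())
# Out-of-sample forecasts need future exog values
forecast = results.get_forecast(steps=12, exog=x_future)
print(forecast.summary_frame())
```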
### 5. Statistical Tests and Diagnostics
Extensive testing and diagnostic capabilities for model validation.
**Residual diagnostics:**
- Autocorrelation tests (Ljung-Box, Durbin-Watson, Breusch-Godfrey)
- Heteroskedasticity tests (Breusch-Pagan, White, ARCH)
- Normality tests (Jarque-Bera, Omnibus, Anderson-Darling, Lilliefors)
- Specification tests (RESET, Harvey-Collier)
**Influence and outliers:**
- Leverage (hat values)
- Cook's distance
- DFFITS and DFBETAs
- Studentized residuals
- Influence plots
**Hypothesis testing:**
- t-tests (one-sample, two-sample, paired)
- Proportion tests
- Chi-square tests
- Non-parametric tests (Mann-Whitney, Wilcoxon, Kruskal-Wallis)
- ANOVA (one-way, two-way, repeated measures)
**Multiple comparisons:**
- Tukey's HSD
- Bonferroni correction
- False Discovery Rate (FDR)
**Effect sizes and power:**
- Cohen's d, eta-squared
- Power analysis for t-tests, proportions
- Sample size calculations
**Robust inference:**
- Heteroskedasticity-consistent SEs (HC0-HC3)
- HAC standard errors (Newey-West)
- Cluster-robust standard errors
**When to use:** Validating assumptions, detecting problems, ensuring robust inference
**Reference:** See `references/stats_diagnostics.md` for comprehensive testing and diagnostic procedures.
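Two of these utilities in action; a minimal sketch, assuming `results` is a fitted OLS results object and the p-values are hypothetical:
```python
import numpy as np
from statsmodels.stats.multitest import multipletests
from statsmodels.stats.diagnostic import acorr_ljungbox

# FDR (Benjamini-Hochberg) correction across several tests
pvals = np.array([0.001, 0.020, 0.040, 0.300])
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method='fdr_bh')
print(reject, p_adj)

# Ljung-Box test for residual autocorrelation up to lag 10
print(acorr_ljungbox(results.resid, lags=[10]))
```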
## Formula API (R-style)
Statsmodels supports R-style formulas for intuitive model specification:
```python
import statsmodels.formula.api as smf
# OLS with formula
results = smf.ols('y ~ x1 + x2 + x1:x2', data=df).fit()
# Categorical variables (automatic dummy coding)
results = smf.ols('y ~ x1 + C(category)', data=df).fit()
# Interactions
results = smf.ols('y ~ x1 * x2', data=df).fit() # x1 + x2 + x1:x2
# Polynomial terms
results = smf.ols('y ~ x + I(x**2)', data=df).fit()
# Logit
results = smf.logit('y ~ x1 + x2 + C(group)', data=df).fit()
# Poisson
results = smf.poisson('count ~ x1 + x2', data=df).fit()
# ARIMA (not available via formula, use regular API)
```
## Model Selection and Comparison
### Information Criteria
```python
# Compare models using AIC/BIC
models = {
'Model 1': model1_results,
'Model 2': model2_results,
'Model 3': model3_results
}
comparison = pd.DataFrame({
'AIC': {name: res.aic for name, res in models.items()},
'BIC': {name: res.bic for name, res in models.items()},
'Log-Likelihood': {name: res.llf for name, res in models.items()}
})
print(comparison.sort_values('AIC'))
# Lower AIC/BIC indicates better model
```
### Likelihood Ratio Test (Nested Models)
```python
# For nested models (one is subset of the other)
from scipy import stats
lr_stat = 2 * (full_model.llf - reduced_model.llf)
df = full_model.df_model - reduced_model.df_model
p_value = 1 - stats.chi2.cdf(lr_stat, df)
print(f"LR statistic: {lr_stat:.4f}")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
print("Full model significantly better")
else:
print("Reduced model preferred (parsimony)")
```
### Cross-Validation
```python
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = []
for train_idx, val_idx in kf.split(X):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
    # Fit model
    model = sm.OLS(y_train, X_train).fit()
    # Predict
    y_pred = model.predict(X_val)
    # Score
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    cv_scores.append(rmse)
print(f"CV RMSE: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}")
```
## Best Practices
### Data Preparation
1. **Always add constant**: Use `sm.add_constant()` unless excluding intercept
2. **Check for missing values**: Handle or impute before fitting
3. **Scale if needed**: Standardizing predictors can improve optimizer convergence and make coefficients easier to compare (statsmodels does not scale automatically)
4. **Encode categoricals**: Use formula API or manual dummy coding
### Model Building
1. **Start simple**: Begin with basic model, add complexity as needed
2. **Check assumptions**: Test residuals, heteroskedasticity, autocorrelation
3. **Use appropriate model**: Match model to outcome type (binary→Logit, count→Poisson)
4. **Consider alternatives**: If assumptions violated, use robust methods or different model
### Inference
1. **Report effect sizes**: Not just p-values
2. **Use robust SEs**: When heteroskedasticity or clustering present
3. **Multiple comparisons**: Correct when testing many hypotheses
4. **Confidence intervals**: Always report alongside point estimates
### Model Evaluation
1. **Check residuals**: Plot residuals vs fitted, Q-Q plot
2. **Influence diagnostics**: Identify and investigate influential observations
3. **Out-of-sample validation**: Test on holdout set or cross-validate
4. **Compare models**: Use AIC/BIC for non-nested, LR test for nested
### Reporting
1. **Comprehensive summary**: Use `.summary()` for detailed output
2. **Document decisions**: Note transformations, excluded observations
3. **Interpret carefully**: Account for link functions (e.g., exp(β) for log link)
4. **Visualize**: Plot predictions, confidence intervals, diagnostics
## Common Workflows
### Workflow 1: Linear Regression Analysis
1. Explore data (plots, descriptives)
2. Fit initial OLS model
3. Check residual diagnostics
4. Test for heteroskedasticity, autocorrelation
5. Check for multicollinearity (VIF; see the sketch after this list)
6. Identify influential observations
7. Refit with robust SEs if needed
8. Interpret coefficients and inference
9. Validate on holdout or via CV
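A minimal VIF sketch for step 5, assuming `X` is a pandas DataFrame that already includes the constant column:
```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns)
print(vif)  # VIF > 10 flags problematic multicollinearity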
### Workflow 2: Binary Classification
1. Fit logistic regression (Logit)
2. Check for convergence issues
3. Interpret odds ratios
4. Calculate marginal effects
5. Evaluate classification performance (AUC, confusion matrix)
6. Check for influential observations
7. Compare with alternative models (Probit)
8. Validate predictions on test set
### Workflow 3: Count Data Analysis
1. Fit Poisson regression
2. Check for overdispersion
3. If overdispersed, fit Negative Binomial
4. Check for excess zeros (consider ZIP/ZINB)
5. Interpret rate ratios
6. Assess goodness of fit
7. Compare models via AIC
8. Validate predictions
### Workflow 4: Time Series Forecasting
1. Plot series, check for trend/seasonality
2. Test for stationarity (ADF, KPSS)
3. Difference if non-stationary
4. Identify p, q from ACF/PACF
5. Fit ARIMA or SARIMAX
6. Check residual diagnostics (Ljung-Box)
7. Generate forecasts with confidence intervals
8. Evaluate forecast accuracy on test set
## Reference Documentation
This skill includes comprehensive reference files for detailed guidance:
### references/linear_models.md
Detailed coverage of linear regression models including:
- OLS, WLS, GLS, GLSAR, Quantile Regression
- Mixed effects models
- Recursive and rolling regression
- Comprehensive diagnostics (heteroskedasticity, autocorrelation, multicollinearity)
- Influence statistics and outlier detection
- Robust standard errors (HC, HAC, cluster)
- Hypothesis testing and model comparison
### references/glm.md
Complete guide to generalized linear models:
- All distribution families (Binomial, Poisson, Gamma, etc.)
- Link functions and when to use each
- Model fitting and interpretation
- Pseudo R-squared and goodness of fit
- Diagnostics and residual analysis
- Applications (logistic, Poisson, Gamma regression)
### references/discrete_choice.md
Comprehensive guide to discrete outcome models:
- Binary models (Logit, Probit)
- Multinomial models (MNLogit, Conditional Logit)
- Count models (Poisson, Negative Binomial, Zero-Inflated, Hurdle)
- Ordinal models
- Marginal effects and interpretation
- Model diagnostics and comparison
### references/time_series.md
In-depth time series analysis guidance:
- Univariate models (AR, ARIMA, SARIMAX, Exponential Smoothing)
- Multivariate models (VAR, VARMAX, Dynamic Factor)
- State space models
- Stationarity testing and diagnostics
- Forecasting methods and evaluation
- Granger causality, IRF, FEVD
### references/stats_diagnostics.md
Comprehensive statistical testing and diagnostics:
- Residual diagnostics (autocorrelation, heteroskedasticity, normality)
- Influence and outlier detection
- Hypothesis tests (parametric and non-parametric)
- ANOVA and post-hoc tests
- Multiple comparisons correction
- Robust covariance matrices
- Power analysis and effect sizes
**When to reference:**
- Need detailed parameter explanations
- Choosing between similar models
- Troubleshooting convergence or diagnostic issues
- Understanding specific test statistics
- Looking for code examples for advanced features
**Search patterns:**
```bash
# Find information about specific models
grep -r "Quantile Regression" references/
# Find diagnostic tests
grep -r "Breusch-Pagan" references/stats_diagnostics.md
# Find time series guidance
grep -r "SARIMAX" references/time_series.md
```
## Common Pitfalls to Avoid
1. **Forgetting constant term**: Always use `sm.add_constant()` unless no intercept desired
2. **Ignoring assumptions**: Check residuals, heteroskedasticity, autocorrelation
3. **Wrong model for outcome type**: Binary→Logit/Probit, Count→Poisson/NB, not OLS
4. **Not checking convergence**: Look for optimization warnings
5. **Misinterpreting coefficients**: Remember link functions (log, logit, etc.)
6. **Using Poisson with overdispersion**: Check dispersion, use Negative Binomial if needed
7. **Not using robust SEs**: When heteroskedasticity or clustering present
8. **Overfitting**: Too many parameters relative to sample size
9. **Data leakage**: Fitting on test data or using future information
10. **Not validating predictions**: Always check out-of-sample performance
11. **Comparing non-nested models**: Use AIC/BIC, not LR test
12. **Ignoring influential observations**: Check Cook's distance and leverage
13. **Multiple testing**: Correct p-values when testing many hypotheses
14. **Not differencing time series**: Fit ARIMA on non-stationary data
15. **Confusing prediction vs confidence intervals**: Prediction intervals are wider
## Getting Help
For detailed documentation and examples:
- Official docs: https://www.statsmodels.org/stable/
- User guide: https://www.statsmodels.org/stable/user-guide.html
- Examples: https://www.statsmodels.org/stable/examples/index.html
- API reference: https://www.statsmodels.org/stable/api.html

skills/statsmodels/references/discrete_choice.md
# Discrete Choice Models Reference
This document provides comprehensive guidance on discrete choice models in statsmodels, including binary, multinomial, count, and ordinal models.
## Overview
Discrete choice models handle outcomes that are:
- **Binary**: 0/1, success/failure
- **Multinomial**: Multiple unordered categories
- **Ordinal**: Ordered categories
- **Count**: Non-negative integers
All models use maximum likelihood estimation and assume independent observations.
## Binary Models
### Logit (Logistic Regression)
Uses logistic distribution for binary outcomes.
**When to use:**
- Binary classification (yes/no, success/failure)
- Probability estimation for binary outcomes
- Interpretable odds ratios
**Model**: P(Y=1|X) = 1 / (1 + exp(-Xβ))
```python
import statsmodels.api as sm
from statsmodels.discrete.discrete_model import Logit
# Prepare data
X = sm.add_constant(X_data)
# Fit model
model = Logit(y, X)
results = model.fit()
print(results.summary())
```
**Interpretation:**
```python
import numpy as np
# Odds ratios
odds_ratios = np.exp(results.params)
print("Odds ratios:", odds_ratios)
# For 1-unit increase in X, odds multiply by exp(β)
# OR > 1: increases odds of success
# OR < 1: decreases odds of success
# OR = 1: no effect
# Confidence intervals for odds ratios
odds_ci = np.exp(results.conf_int())
print("Odds ratio 95% CI:")
print(odds_ci)
```
**Marginal effects:**
```python
# Average marginal effects (AME) - the default, at='overall'
marginal_effects = results.get_margeff()
print(marginal_effects.summary())
# Marginal effects at the means (MEM)
marginal_effects_mem = results.get_margeff(at='mean')
print(marginal_effects_mem.summary())
# atexog can pin selected regressors to chosen values; see the
# get_margeff docstring for the expected format
```
**Predictions:**
```python
# Predicted probabilities
probs = results.predict(X)
# Binary predictions (0.5 threshold)
predictions = (probs > 0.5).astype(int)
# Custom threshold
threshold = 0.3
predictions_custom = (probs > threshold).astype(int)
# For new data
X_new = sm.add_constant(X_new_data)
new_probs = results.predict(X_new)
```
**Model evaluation:**
```python
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, roc_curve)
# Classification report
print(classification_report(y, predictions))
# Confusion matrix
print(confusion_matrix(y, predictions))
# AUC-ROC
auc = roc_auc_score(y, probs)
print(f"AUC: {auc:.4f}")
# Pseudo R-squared
print(f"McFadden's Pseudo R²: {results.prsquared:.4f}")
```
### Probit
Uses normal distribution for binary outcomes.
**When to use:**
- Binary outcomes
- Prefer normal distribution assumption
- Field convention (econometrics often uses probit)
**Model**: P(Y=1|X) = Φ(Xβ), where Φ is standard normal CDF
```python
from statsmodels.discrete.discrete_model import Probit
model = Probit(y, X)
results = model.fit()
print(results.summary())
```
**Comparison with Logit:**
- Probit and Logit usually give similar results
- Probit: symmetric, based on normal distribution
- Logit: slightly heavier tails, easier interpretation (odds ratios)
- Coefficients not directly comparable (scale difference)
```python
# Marginal effects are comparable
logit_me = logit_results.get_margeff().margeff
probit_me = probit_results.get_margeff().margeff
print("Logit marginal effects:", logit_me)
print("Probit marginal effects:", probit_me)
```
## Multinomial Models
### MNLogit (Multinomial Logit)
For unordered categorical outcomes with 3+ categories.
**When to use:**
- Multiple unordered categories (e.g., transportation mode, brand choice)
- No natural ordering among categories
- Need probabilities for each category
**Model**: P(Y=j|X) = exp(Xβⱼ) / Σₖ exp(Xβₖ)
```python
from statsmodels.discrete.discrete_model import MNLogit
# y should be integers 0, 1, 2, ... for categories
model = MNLogit(y, X)
results = model.fit()
print(results.summary())
```
**Interpretation:**
```python
# One category is reference (usually category 0)
# Coefficients represent log-odds relative to reference
# For category j vs reference:
# exp(β_j) = odds ratio of category j vs reference
# Predicted probabilities for each category
probs = results.predict(X) # Shape: (n_samples, n_categories)
# Most likely category
predicted_categories = probs.argmax(axis=1)
```
**Relative risk ratios:**
```python
# Exponentiate coefficients for relative risk ratios (RRR)
import numpy as np
# For MNLogit, results.params typically has one column per
# non-reference category
rrr = np.exp(results.params)
print("Relative risk ratios:\n", rrr)
```
### Conditional Logit
For choice models where alternatives have characteristics.
**When to use:**
- Alternative-specific regressors (vary across choices)
- Panel data with choices
- Discrete choice experiments
```python
from statsmodels.discrete.conditional_models import ConditionalLogit
# Data structure: long format with choice indicator
model = ConditionalLogit(y_choice, X_alternatives, groups=individual_id)
results = model.fit()
```
## Count Models
### Poisson
Standard model for count data.
**When to use:**
- Count outcomes (events, occurrences)
- Rare events
- Mean ≈ variance
**Model**: P(Y=k|X) = exp(-λ) λᵏ / k!, where log(λ) = Xβ
```python
from statsmodels.discrete.discrete_model import Poisson
model = Poisson(y_counts, X)
results = model.fit()
print(results.summary())
```
**Interpretation:**
```python
# Rate ratios (incident rate ratios)
rate_ratios = np.exp(results.params)
print("Rate ratios:", rate_ratios)
# For 1-unit increase in X, expected count multiplies by exp(β)
```
**Check overdispersion:**
```python
# Mean and variance should be similar for Poisson
print(f"Mean: {y_counts.mean():.2f}")
print(f"Variance: {y_counts.var():.2f}")
# Rule of thumb: variance/mean > 1.5 suggests overdispersion
overdispersion_ratio = y_counts.var() / y_counts.mean()
print(f"Variance/Mean: {overdispersion_ratio:.2f}")
if overdispersion_ratio > 1.5:
    print("Consider Negative Binomial model")
```
**With offset (for rates):**
```python
# When modeling rates with varying exposure
# log(λ) = log(exposure) + Xβ
model = Poisson(y_counts, X, offset=np.log(exposure))
results = model.fit()
```
### Negative Binomial
For overdispersed count data (variance > mean).
**When to use:**
- Count data with overdispersion
- Excess variance not explained by Poisson
- Heterogeneity in counts
**Model**: Adds dispersion parameter α to account for overdispersion
```python
from statsmodels.discrete.discrete_model import NegativeBinomial
model = NegativeBinomial(y_counts, X)
results = model.fit()
print(results.summary())
print(f"Dispersion parameter alpha: {results.params['alpha']:.4f}")
```
**Compare with Poisson:**
```python
# Fit both models
poisson_results = Poisson(y_counts, X).fit()
nb_results = NegativeBinomial(y_counts, X).fit()
# AIC comparison (lower is better)
print(f"Poisson AIC: {poisson_results.aic:.2f}")
print(f"Negative Binomial AIC: {nb_results.aic:.2f}")
# Likelihood ratio test (if NB is better)
from scipy import stats
lr_stat = 2 * (nb_results.llf - poisson_results.llf)
lr_pval = 1 - stats.chi2.cdf(lr_stat, df=1) # 1 extra parameter (alpha)
print(f"LR test p-value: {lr_pval:.4f}")
if lr_pval < 0.05:
print("Negative Binomial significantly better")
```
### Zero-Inflated Models
For count data with excess zeros.
**When to use:**
- More zeros than expected from Poisson/NB
- Two processes: one for zeros, one for counts
- Examples: number of doctor visits, insurance claims
**Models:**
- ZeroInflatedPoisson (ZIP)
- ZeroInflatedNegativeBinomialP (ZINB)
```python
from statsmodels.discrete.count_model import (ZeroInflatedPoisson,
                                              ZeroInflatedNegativeBinomialP)
# ZIP model
zip_model = ZeroInflatedPoisson(y_counts, X, exog_infl=X_inflation)
zip_results = zip_model.fit()
# ZINB model (for overdispersion + excess zeros)
zinb_model = ZeroInflatedNegativeBinomialP(y_counts, X, exog_infl=X_inflation)
zinb_results = zinb_model.fit()
print(zip_results.summary())
```
**Two parts of the model:**
```python
# 1. Inflation model: probability of a structural zero
# 2. Count model: distribution of counts
# Predicted probability of a zero outcome
zero_probs = zip_results.predict(X, which='prob-zero')
# Predicted mean counts
predicted_counts = zip_results.predict(X, which='mean')
```
### Hurdle Models
Two-stage model: whether any counts, then how many.
**When to use:**
- Excess zeros
- Different processes for zero vs positive counts
- Zeros structurally different from positive values
```python
from statsmodels.discrete.truncated_model import HurdleCountModel

# Available in statsmodels 0.14+; both stages use the same exog,
# and dist/zerodist choose the distributions for each stage
model = HurdleCountModel(y_counts, X, dist='poisson', zerodist='poisson')
results = model.fit()
print(results.summary())
```
## Ordinal Models
### Ordered Logit/Probit
For ordered categorical outcomes.
**When to use:**
- Ordered categories (e.g., low/medium/high, ratings 1-5)
- Natural ordering matters
- Want to respect ordinal structure
**Model**: Cumulative probability model with cutpoints
```python
from statsmodels.miscmodels.ordinal_model import OrderedModel
# y should be ordered integers: 0, 1, 2, ...
# X must NOT contain a constant: OrderedModel estimates cutpoints instead
model = OrderedModel(y_ordered, X, distr='logit')  # or 'probit'
results = model.fit(method='bfgs')
print(results.summary())
```
**Interpretation:**
```python
import numpy as np
# Threshold parameters come after the slopes; there are n_categories - 1
# (see model.transform_threshold_params for the cutpoints on the latent scale)
n_categories = len(np.unique(y_ordered))
cutpoints = results.params[-(n_categories - 1):]
print("Cutpoints:", cutpoints)
# Slope coefficients
coefficients = results.params[:-(n_categories - 1)]
print("Coefficients:", coefficients)
# Predicted probabilities for each category
probs = results.predict(X) # Shape: (n_samples, n_categories)
# Most likely category
predicted_categories = probs.argmax(axis=1)
```
**Proportional odds assumption:**
```python
# Test if coefficients are same across cutpoints
# (Brant test - implement manually or check residuals)
# Check: model each cutpoint separately and compare coefficients
```
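One rough check, sketched below under the assumption that `y_ordered` is coded 0..K: fit a separate binary logit at each cutpoint and compare the slopes by eye.
```python
import numpy as np
import statsmodels.api as sm

# Binary logit for each cutpoint "y > j"; similar slopes across j
# are consistent with proportional odds
for j in sorted(np.unique(y_ordered))[:-1]:
    y_bin = (y_ordered > j).astype(int)
    res_j = sm.Logit(y_bin, sm.add_constant(X)).fit(disp=False)
    print(f"y > {j}:", np.round(np.asarray(res_j.params), 3))
```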
## Model Diagnostics
### Goodness of Fit
```python
# Pseudo R-squared (McFadden)
print(f"Pseudo R²: {results.prsquared:.4f}")
# AIC/BIC for model comparison
print(f"AIC: {results.aic:.2f}")
print(f"BIC: {results.bic:.2f}")
# Log-likelihood
print(f"Log-likelihood: {results.llf:.2f}")
# Likelihood ratio test vs null model
lr_stat = 2 * (results.llf - results.llnull)
from scipy import stats
lr_pval = 1 - stats.chi2.cdf(lr_stat, results.df_model)
print(f"LR test p-value: {lr_pval}")
```
### Classification Metrics (Binary)
```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
# Predictions
probs = results.predict(X)
predictions = (probs > 0.5).astype(int)
# Metrics
print(f"Accuracy: {accuracy_score(y, predictions):.4f}")
print(f"Precision: {precision_score(y, predictions):.4f}")
print(f"Recall: {recall_score(y, predictions):.4f}")
print(f"F1: {f1_score(y, predictions):.4f}")
print(f"AUC: {roc_auc_score(y, probs):.4f}")
```
### Classification Metrics (Multinomial)
```python
from sklearn.metrics import accuracy_score, classification_report, log_loss
# Predicted categories
probs = results.predict(X)
predictions = probs.argmax(axis=1)
# Accuracy
accuracy = accuracy_score(y, predictions)
print(f"Accuracy: {accuracy:.4f}")
# Classification report
print(classification_report(y, predictions))
# Log loss
logloss = log_loss(y, probs)
print(f"Log Loss: {logloss:.4f}")
```
### Count Model Diagnostics
```python
# Observed vs predicted frequencies
observed = pd.Series(y_counts).value_counts().sort_index()
predicted = results.predict(X)
predicted_counts = pd.Series(np.round(predicted)).value_counts().sort_index()
# Compare distributions
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
observed.plot(kind='bar', alpha=0.5, label='Observed', ax=ax)
predicted_counts.plot(kind='bar', alpha=0.5, label='Predicted', ax=ax)
ax.legend()
ax.set_xlabel('Count')
ax.set_ylabel('Frequency')
plt.show()
# Rootogram (better visualization): no built-in in statsmodels;
# a hanging-rootogram sketch follows this block
```
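A minimal hanging-rootogram sketch (not a statsmodels API), assuming a fitted Poisson-type model `results` and integer counts `y_counts`:
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

max_k = int(np.max(y_counts))
ks = np.arange(max_k + 1)
observed = np.bincount(np.asarray(y_counts, dtype=int), minlength=max_k + 1)
mu = results.predict(X)
# Expected frequency of each count k under the fitted Poisson means
expected = np.array([stats.poisson.pmf(k, mu).sum() for k in ks])

plt.bar(ks, np.sqrt(observed), alpha=0.5, label='sqrt(observed)')
plt.plot(ks, np.sqrt(expected), 'r-', lw=2, label='sqrt(expected)')
plt.xlabel('Count')
plt.ylabel('sqrt(frequency)')
plt.legend()
plt.show()
```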
### Influence and Outliers
```python
# Standardized residuals
std_resid = (y - results.predict(X)) / np.sqrt(results.predict(X))
# Check for outliers (|std_resid| > 2)
outliers = np.where(np.abs(std_resid) > 2)[0]
print(f"Number of outliers: {len(outliers)}")
# Leverage (hat values) for logit/probit: see MLEInfluence in
# statsmodels.stats.outliers_influence (or results.get_influence() on GLM fits)
```
## Hypothesis Testing
```python
# Single parameter test (automatic in summary)
# Multiple parameters: Wald test
# Test H0: β₁ = β₂ = 0
R = [[0, 1, 0, 0], [0, 0, 1, 0]]
wald_test = results.wald_test(R)
print(wald_test)
# Likelihood ratio test for nested models
model_reduced = Logit(y, X_reduced).fit()
model_full = Logit(y, X_full).fit()
lr_stat = 2 * (model_full.llf - model_reduced.llf)
df = model_full.df_model - model_reduced.df_model
from scipy import stats
lr_pval = 1 - stats.chi2.cdf(lr_stat, df)
print(f"LR test p-value: {lr_pval:.4f}")
```
## Model Selection and Comparison
```python
# Fit multiple models
models = {
'Logit': Logit(y, X).fit(),
'Probit': Probit(y, X).fit(),
# Add more models
}
# Compare AIC/BIC
comparison = pd.DataFrame({
'AIC': {name: model.aic for name, model in models.items()},
'BIC': {name: model.bic for name, model in models.items()},
'Pseudo R²': {name: model.prsquared for name, model in models.items()}
})
print(comparison.sort_values('AIC'))
# Cross-validation for predictive performance: statsmodels has no
# built-in CV; a manual K-fold sketch follows this block
```
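Since statsmodels has no CV utilities of its own, a minimal manual K-fold sketch (assuming numpy arrays `X`, `y` with the constant already added):
```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from statsmodels.discrete.discrete_model import Logit

aucs = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True,
                                random_state=0).split(X):
    res = Logit(y[train_idx], X[train_idx]).fit(disp=False)
    aucs.append(roc_auc_score(y[val_idx], res.predict(X[val_idx])))
print(f"CV AUC: {np.mean(aucs):.3f} ± {np.std(aucs):.3f}")
```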
## Formula API
Use R-style formulas for easier specification.
```python
import statsmodels.formula.api as smf
# Logit with formula
formula = 'y ~ x1 + x2 + C(category) + x1:x2'
results = smf.logit(formula, data=df).fit()
# MNLogit with formula
results = smf.mnlogit(formula, data=df).fit()
# Poisson with formula
results = smf.poisson(formula, data=df).fit()
# Negative Binomial with formula
results = smf.negativebinomial(formula, data=df).fit()
```
## Common Applications
### Binary Classification (Marketing Response)
```python
# Predict customer purchase probability
X = sm.add_constant(customer_features)
model = Logit(purchased, X)
results = model.fit()
# Targeting: select top 20% likely to purchase
probs = results.predict(X)
top_20_pct_idx = np.argsort(probs)[-int(0.2*len(probs)):]
```
### Multinomial Choice (Transportation Mode)
```python
# Predict transportation mode choice
model = MNLogit(mode_choice, X)
results = model.fit()
# Predicted mode for new commuter
new_commuter = sm.add_constant(new_features)
mode_probs = results.predict(new_commuter)
predicted_mode = mode_probs.argmax(axis=1)
```
### Count Data (Number of Doctor Visits)
```python
# Model healthcare utilization
model = NegativeBinomial(num_visits, X)
results = model.fit()
# Expected visits for new patient
expected_visits = results.predict(new_patient_X)
```
### Zero-Inflated (Insurance Claims)
```python
# Many people have zero claims
# Zero-inflation: some never claim
# Count process: those who might claim
zip_model = ZeroInflatedPoisson(claims, X_count, exog_infl=X_inflation)
results = zip_model.fit()
# P(never file claim)
never_claim_prob = results.predict(X, which='prob-zero')
# Expected claims
expected_claims = results.predict(X, which='mean')
```
## Best Practices
1. **Check data type**: Ensure response matches model (binary, counts, categories)
2. **Add constant**: Always use `sm.add_constant()` unless no intercept desired
3. **Scale continuous predictors**: For better convergence and interpretation
4. **Check convergence**: Look for convergence warnings
5. **Use formula API**: For categorical variables and interactions
6. **Marginal effects**: Report marginal effects, not just coefficients
7. **Model comparison**: Use AIC/BIC and cross-validation
8. **Validate**: Holdout set or cross-validation for predictive models
9. **Check overdispersion**: For count models, test Poisson assumption
10. **Consider alternatives**: Zero-inflation, hurdle models for excess zeros
## Common Pitfalls
1. **Forgetting constant**: No intercept term
2. **Perfect separation**: Logit/probit may not converge
3. **Using Poisson with overdispersion**: Check and use Negative Binomial
4. **Misinterpreting coefficients**: Remember they're on log-odds/log scale
5. **Not checking convergence**: Optimization may fail silently
6. **Wrong distribution**: Match model to data type (binary/count/categorical)
7. **Ignoring excess zeros**: Use ZIP/ZINB when appropriate
8. **Not validating predictions**: Always check out-of-sample performance
9. **Comparing non-nested models**: Use AIC/BIC, not likelihood ratio test
10. **Ordinal as nominal**: Use OrderedModel for ordered categories

skills/statsmodels/references/glm.md
# Generalized Linear Models (GLM) Reference
This document provides comprehensive guidance on generalized linear models in statsmodels, including families, link functions, and applications.
## Overview
GLMs extend linear regression to non-normal response distributions through:
1. **Distribution family**: Specifies the conditional distribution of the response
2. **Link function**: Transforms the linear predictor to the scale of the mean
3. **Variance function**: Relates variance to the mean
**General form**: g(μ) = Xβ, where g is the link function and μ = E(Y|X)
## When to Use GLM
- **Binary outcomes**: Logistic regression (Binomial family with logit link)
- **Count data**: Poisson or Negative Binomial regression
- **Positive continuous data**: Gamma or Inverse Gaussian
- **Non-normal distributions**: When OLS assumptions violated
- **Link functions**: Need non-linear relationship between predictors and response scale
## Distribution Families
### Binomial Family
For binary outcomes (0/1) or proportions (k/n).
**When to use:**
- Binary classification
- Success/failure outcomes
- Proportions or rates
**Common links:**
- Logit (default): log(μ/(1-μ))
- Probit: Φ⁻¹(μ)
- Log: log(μ)
```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Binary logistic regression
model = sm.GLM(y, X, family=sm.families.Binomial())
results = model.fit()
# Formula API
results = smf.glm('success ~ x1 + x2', data=df,
                  family=sm.families.Binomial()).fit()
# Access predictions (probabilities)
probs = results.predict(X_new)
# Classification (0.5 threshold)
predictions = (probs > 0.5).astype(int)
```
**Interpretation:**
```python
import numpy as np
# Odds ratios (for logit link)
odds_ratios = np.exp(results.params)
print("Odds ratios:", odds_ratios)
# For 1-unit increase in x, odds multiply by exp(beta)
```
### Poisson Family
For count data (non-negative integers).
**When to use:**
- Count outcomes (number of events)
- Rare events
- Rate modeling (with offset)
**Common links:**
- Log (default): log(μ)
- Identity: μ
- Sqrt: √μ
```python
# Poisson regression
model = sm.GLM(y, X, family=sm.families.Poisson())
results = model.fit()
# With exposure/offset for rates
# If modeling rate = counts/exposure
model = sm.GLM(y, X, family=sm.families.Poisson(),
               offset=np.log(exposure))
results = model.fit()
# Interpretation: exp(beta) = multiplicative effect on expected count
import numpy as np
rate_ratios = np.exp(results.params)
print("Rate ratios:", rate_ratios)
```
**Overdispersion check:**
```python
# Deviance / df should be ~1 for Poisson
overdispersion = results.deviance / results.df_resid
print(f"Overdispersion: {overdispersion}")
# If >> 1, consider Negative Binomial
if overdispersion > 1.5:
print("Consider Negative Binomial model for overdispersion")
```
### Negative Binomial Family
For overdispersed count data.
**When to use:**
- Count data with variance > mean
- Excess zeros or large variance
- Poisson model shows overdispersion
```python
# Negative Binomial GLM
model = sm.GLM(y, X, family=sm.families.NegativeBinomial())
results = model.fit()
# Alternative: use discrete choice model with alpha estimation
from statsmodels.discrete.discrete_model import NegativeBinomial
nb_model = NegativeBinomial(y, X)
nb_results = nb_model.fit()
print(f"Dispersion parameter alpha: {nb_results.params[-1]}")
```
### Gaussian Family
Equivalent to OLS but fit via IRLS (Iteratively Reweighted Least Squares).
**When to use:**
- Want GLM framework for consistency
- Need robust standard errors
- Comparing with other GLMs
**Common links:**
- Identity (default): μ
- Log: log(μ)
- Inverse: 1/μ
```python
# Gaussian GLM (equivalent to OLS)
model = sm.GLM(y, X, family=sm.families.Gaussian())
results = model.fit()
# Verify equivalence with OLS
ols_results = sm.OLS(y, X).fit()
print("Parameters close:", np.allclose(results.params, ols_results.params))
```
### Gamma Family
For positive continuous data, often right-skewed.
**When to use:**
- Positive outcomes (insurance claims, survival times)
- Right-skewed distributions
- Variance proportional to mean²
**Common links:**
- Inverse (default): 1/μ
- Log: log(μ)
- Identity: μ
```python
# Gamma regression (common for cost data)
model = sm.GLM(y, X, family=sm.families.Gamma())
results = model.fit()
# Log link often preferred for interpretation
model = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log()))
results = model.fit()
# With log link, exp(beta) = multiplicative effect
import numpy as np
effects = np.exp(results.params)
```
### Inverse Gaussian Family
For positive continuous data with specific variance structure.
**When to use:**
- Positive skewed outcomes
- Variance proportional to mean³
- Alternative to Gamma
**Common links:**
- Inverse squared (default): 1/μ²
- Log: log(μ)
```python
model = sm.GLM(y, X, family=sm.families.InverseGaussian())
results = model.fit()
```
### Tweedie Family
Flexible family covering multiple distributions.
**When to use:**
- Insurance claims (mixture of zeros and continuous)
- Semi-continuous data
- Need flexible variance function
**Special cases (power parameter p):**
- p=0: Normal
- p=1: Poisson
- p=2: Gamma
- p=3: Inverse Gaussian
- 1<p<2: Compound Poisson-Gamma (common for insurance)
```python
# Tweedie with power=1.5
model = sm.GLM(y, X, family=sm.families.Tweedie(link=sm.families.links.Log(),
                                                var_power=1.5))
results = model.fit()
```
## Link Functions
Link functions connect the linear predictor to the mean of the response.
### Available Links
```python
from statsmodels.genmod import families
# Identity: g(μ) = μ
link = families.links.Identity()
# Log: g(μ) = log(μ)
link = families.links.Log()
# Logit: g(μ) = log(μ/(1-μ))
link = families.links.Logit()
# Probit: g(μ) = Φ⁻¹(μ)
link = families.links.Probit()
# Complementary log-log: g(μ) = log(-log(1-μ))
link = families.links.CLogLog()
# Inverse: g(μ) = 1/μ
link = families.links.InversePower()
# Inverse squared: g(μ) = 1/μ²
link = families.links.InverseSquared()
# Square root: g(μ) = √μ
link = families.links.Sqrt()
# Power: g(μ) = μ^p
link = families.links.Power(power=2)
```
### Choosing Link Functions
**Canonical links** (default for each family):
- Binomial → Logit
- Poisson → Log
- Gamma → Inverse
- Gaussian → Identity
- Inverse Gaussian → Inverse squared
**When to use non-canonical:**
- **Log link with Binomial**: Risk ratios instead of odds ratios
- **Identity link**: Direct additive effects (when sensible)
- **Probit vs Logit**: Similar results, preference based on field
- **CLogLog**: Asymmetric relationship, common in survival analysis
```python
# Example: Risk ratios with log-binomial model
model = sm.GLM(y, X, family=sm.families.Binomial(link=sm.families.links.Log()))
results = model.fit()
# exp(beta) now gives risk ratios, not odds ratios
risk_ratios = np.exp(results.params)
```
## Model Fitting and Results
### Basic Workflow
```python
import statsmodels.api as sm
# Add constant
X = sm.add_constant(X_data)
# Specify family and link
family = sm.families.Poisson(link=sm.families.links.Log())
# Fit model using IRLS
model = sm.GLM(y, X, family=family)
results = model.fit()
# Summary
print(results.summary())
```
### Results Attributes
```python
# Parameters and inference
results.params # Coefficients
results.bse # Standard errors
results.tvalues # Z-statistics
results.pvalues # P-values
results.conf_int() # Confidence intervals
# Predictions
results.fittedvalues # Fitted values (μ)
results.predict(X_new) # Predictions for new data
# Model fit statistics
results.aic # Akaike Information Criterion
results.bic # Bayesian Information Criterion
results.deviance # Deviance
results.null_deviance # Null model deviance
results.pearson_chi2 # Pearson chi-squared statistic
results.df_resid # Residual degrees of freedom
results.llf # Log-likelihood
# Residuals
results.resid_response # Response residuals (y - μ)
results.resid_pearson # Pearson residuals
results.resid_deviance # Deviance residuals
results.resid_anscombe # Anscombe residuals
results.resid_working # Working residuals
```
### Pseudo R-squared
```python
# Deviance-based pseudo R-squared (explained deviance); McFadden's
# version uses log-likelihoods: 1 - results.llf / results.llnull
pseudo_r2 = 1 - (results.deviance / results.null_deviance)
print(f"Pseudo R²: {pseudo_r2:.4f}")
# Adjusted pseudo R-squared
n = len(y)
k = len(results.params)
adj_pseudo_r2 = 1 - ((n-1)/(n-k)) * (results.deviance / results.null_deviance)
print(f"Adjusted Pseudo R²: {adj_pseudo_r2:.4f}")
```
## Diagnostics
### Goodness of Fit
```python
# Deviance should be approximately χ² with df_resid degrees of freedom
from scipy import stats
deviance_pval = 1 - stats.chi2.cdf(results.deviance, results.df_resid)
print(f"Deviance test p-value: {deviance_pval}")
# Pearson chi-squared test
pearson_pval = 1 - stats.chi2.cdf(results.pearson_chi2, results.df_resid)
print(f"Pearson chi² test p-value: {pearson_pval}")
# Check for overdispersion/underdispersion
dispersion = results.pearson_chi2 / results.df_resid
print(f"Dispersion: {dispersion}")
# Should be ~1; >1 suggests overdispersion, <1 underdispersion
```
### Residual Analysis
```python
import matplotlib.pyplot as plt
# Deviance residuals vs fitted
plt.figure(figsize=(10, 6))
plt.scatter(results.fittedvalues, results.resid_deviance, alpha=0.5)
plt.xlabel('Fitted values')
plt.ylabel('Deviance residuals')
plt.axhline(y=0, color='r', linestyle='--')
plt.title('Deviance Residuals vs Fitted')
plt.show()
# Q-Q plot of deviance residuals
from statsmodels.graphics.gofplots import qqplot
qqplot(results.resid_deviance, line='s')
plt.title('Q-Q Plot of Deviance Residuals')
plt.show()
# For binary outcomes, a binned residual plot is more informative:
# group observations by predicted probability and average the residuals
# within each bin (no built-in; custom implementation needed)
```
### Influence and Outliers
```python
from statsmodels.stats.outliers_influence import GLMInfluence
influence = GLMInfluence(results)
# Leverage
leverage = influence.hat_matrix_diag
# Cook's distance
cooks_d = influence.cooks_distance[0]
# DFFITS
dffits = influence.dffits[0]
# Find influential observations
influential = np.where(cooks_d > 4/len(y))[0]
print(f"Influential observations: {influential}")
```
## Hypothesis Testing
```python
# Wald test for single parameter (automatically in summary)
# Likelihood ratio test for nested models
# Fit reduced model
model_reduced = sm.GLM(y, X_reduced, family=family).fit()
model_full = sm.GLM(y, X_full, family=family).fit()
# LR statistic
lr_stat = 2 * (model_full.llf - model_reduced.llf)
df = model_full.df_model - model_reduced.df_model
from scipy import stats
lr_pval = 1 - stats.chi2.cdf(lr_stat, df)
print(f"LR test p-value: {lr_pval}")
# Wald test for multiple parameters
# Test beta_1 = beta_2 = 0
R = [[0, 1, 0, 0], [0, 0, 1, 0]]
wald_test = results.wald_test(R)
print(wald_test)
```
## Robust Standard Errors
```python
# Heteroscedasticity-robust (sandwich estimator)
results_robust = results.get_robustcov_results(cov_type='HC0')
# Cluster-robust
results_cluster = results.get_robustcov_results(cov_type='cluster',
                                                groups=cluster_ids)
# Compare standard errors
print("Regular SE:", results.bse)
print("Robust SE:", results_robust.bse)
```
## Model Comparison
```python
# AIC/BIC for non-nested models
models = [model1_results, model2_results, model3_results]
for i, res in enumerate(models, 1):
    print(f"Model {i}: AIC={res.aic:.2f}, BIC={res.bic:.2f}")
# Likelihood ratio test for nested models (as shown above)
# Cross-validation for predictive performance
from sklearn.model_selection import KFold
from sklearn.metrics import log_loss
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = []
for train_idx, val_idx in kf.split(X):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    model_cv = sm.GLM(y_train, X_train, family=family).fit()
    pred_probs = model_cv.predict(X_val)
    score = log_loss(y_val, pred_probs)
    cv_scores.append(score)
print(f"CV Log Loss: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}")
```
## Prediction
```python
# Point predictions
predictions = results.predict(X_new)
# For classification: get probabilities and convert
if isinstance(family, sm.families.Binomial):
    probs = predictions
    class_predictions = (probs > 0.5).astype(int)
# For counts: predictions are expected counts
if isinstance(family, sm.families.Poisson):
    expected_counts = predictions
# Prediction intervals via bootstrap
n_boot = 1000
boot_preds = np.zeros((n_boot, len(X_new)))
for i in range(n_boot):
    # Bootstrap resample
    boot_idx = np.random.choice(len(y), size=len(y), replace=True)
    X_boot, y_boot = X[boot_idx], y[boot_idx]
    # Fit and predict
    boot_model = sm.GLM(y_boot, X_boot, family=family).fit()
    boot_preds[i] = boot_model.predict(X_new)
# 95% prediction intervals
pred_lower = np.percentile(boot_preds, 2.5, axis=0)
pred_upper = np.percentile(boot_preds, 97.5, axis=0)
```
## Common Applications
### Logistic Regression (Binary Classification)
```python
import statsmodels.api as sm
# Fit logistic regression
X = sm.add_constant(X_data)
model = sm.GLM(y, X, family=sm.families.Binomial())
results = model.fit()
# Odds ratios
odds_ratios = np.exp(results.params)
odds_ci = np.exp(results.conf_int())
# Classification metrics
from sklearn.metrics import classification_report, roc_auc_score
probs = results.predict(X)
predictions = (probs > 0.5).astype(int)
print(classification_report(y, predictions))
print(f"AUC: {roc_auc_score(y, probs):.4f}")
# ROC curve
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
fpr, tpr, thresholds = roc_curve(y, probs)
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
```
### Poisson Regression (Count Data)
```python
# Fit Poisson model
X = sm.add_constant(X_data)
model = sm.GLM(y_counts, X, family=sm.families.Poisson())
results = model.fit()
# Rate ratios
rate_ratios = np.exp(results.params)
print("Rate ratios:", rate_ratios)
# Check overdispersion
dispersion = results.pearson_chi2 / results.df_resid
if dispersion > 1.5:
print(f"Overdispersion detected ({dispersion:.2f}). Consider Negative Binomial.")
```
### Gamma Regression (Cost/Duration Data)
```python
# Fit Gamma model with log link
X = sm.add_constant(X_data)
model = sm.GLM(y_cost, X,
               family=sm.families.Gamma(link=sm.families.links.Log()))
results = model.fit()
# Multiplicative effects
effects = np.exp(results.params)
print("Multiplicative effects on mean:", effects)
```
## Best Practices
1. **Check distribution assumptions**: Plot histograms and Q-Q plots of response
2. **Verify link function**: Use canonical links unless there's a reason not to
3. **Examine residuals**: Deviance residuals should be approximately normal
4. **Test for overdispersion**: Especially for Poisson models
5. **Use offsets appropriately**: For rate modeling with varying exposure
6. **Consider robust SEs**: When variance assumptions questionable
7. **Compare models**: Use AIC/BIC for non-nested, LR test for nested
8. **Interpret on original scale**: Transform coefficients (e.g., exp for log link)
9. **Check influential observations**: Use Cook's distance
10. **Validate predictions**: Use cross-validation or holdout set
## Common Pitfalls
1. **Forgetting to add constant**: No intercept term
2. **Using wrong family**: Check distribution of response
3. **Ignoring overdispersion**: Use Negative Binomial instead of Poisson
4. **Misinterpreting coefficients**: Remember link function transformation
5. **Not checking convergence**: IRLS may not converge; check warnings
6. **Complete separation in logistic**: Some categories perfectly predict outcome
7. **Using identity link with bounded outcomes**: May predict outside valid range
8. **Comparing models with different samples**: Use same observations
9. **Forgetting offset in rate models**: Must use log(exposure) as offset
10. **Not considering alternatives**: Mixed models, zero-inflation for complex data

skills/statsmodels/references/linear_models.md
# Linear Regression Models Reference
This document provides detailed guidance on linear regression models in statsmodels, including OLS, GLS, WLS, quantile regression, and specialized variants.
## Core Model Classes
### OLS (Ordinary Least Squares)
Assumes independent, identically distributed errors (Σ=I). Best for standard regression with homoscedastic errors.
**When to use:**
- Standard regression analysis
- Errors are independent and have constant variance
- No autocorrelation or heteroscedasticity
- Most common starting point
**Basic usage:**
```python
import statsmodels.api as sm
import numpy as np
# Prepare data - ALWAYS add constant for intercept
X = sm.add_constant(X_data) # Adds column of 1s for intercept
# Fit model
model = sm.OLS(y, X)
results = model.fit()
# View results
print(results.summary())
```
**Key results attributes:**
```python
results.params # Coefficients
results.bse # Standard errors
results.tvalues # T-statistics
results.pvalues # P-values
results.rsquared # R-squared
results.rsquared_adj # Adjusted R-squared
results.fittedvalues # Fitted values (predictions on training data)
results.resid # Residuals
results.conf_int() # Confidence intervals for parameters
```
**Prediction with confidence/prediction intervals:**
```python
# For in-sample predictions
pred = results.get_prediction(X)
pred_summary = pred.summary_frame()
print(pred_summary) # Contains mean, std, confidence intervals
# For out-of-sample predictions
X_new = sm.add_constant(X_new_data)
pred_new = results.get_prediction(X_new)
pred_summary = pred_new.summary_frame()
# Access intervals
mean_ci_lower = pred_summary["mean_ci_lower"]
mean_ci_upper = pred_summary["mean_ci_upper"]
obs_ci_lower = pred_summary["obs_ci_lower"] # Prediction intervals
obs_ci_upper = pred_summary["obs_ci_upper"]
```
**Formula API (R-style):**
```python
import statsmodels.formula.api as smf
# Automatic handling of categorical variables and interactions
formula = 'y ~ x1 + x2 + C(category) + x1:x2'
results = smf.ols(formula, data=df).fit()
```
### WLS (Weighted Least Squares)
Handles heteroscedastic errors (diagonal Σ) where variance differs across observations.
**When to use:**
- Known heteroscedasticity (non-constant error variance)
- Different observations have different reliability
- Weights are known or can be estimated
**Usage:**
```python
# If you know the weights (inverse variance)
weights = 1 / error_variance
model = sm.WLS(y, X, weights=weights)
results = model.fit()
# Common weight patterns:
# - 1/variance: when variance is known
# - n_i: sample size for grouped data
# - 1/x: when variance proportional to x
```
**Feasible WLS (estimating weights):**
```python
# Step 1: Fit OLS
ols_results = sm.OLS(y, X).fit()
# Step 2: Model squared residuals to estimate variance
abs_resid = np.abs(ols_results.resid)
variance_model = sm.OLS(np.log(abs_resid**2), X).fit()
# Step 3: Use estimated variance as weights
weights = 1 / np.exp(variance_model.fittedvalues)
wls_results = sm.WLS(y, X, weights=weights).fit()
```
### GLS (Generalized Least Squares)
Handles arbitrary covariance structure (Σ). Superclass for other regression methods.
**When to use:**
- Known covariance structure
- Correlated errors
- More general than WLS
**Usage:**
```python
# Specify covariance structure
# Sigma should be (n x n) covariance matrix
model = sm.GLS(y, X, sigma=Sigma)
results = model.fit()
```
### GLSAR (GLS with Autoregressive Errors)
Feasible generalized least squares with AR(p) errors for time series data.
**When to use:**
- Time series regression with autocorrelated errors
- Need to account for serial correlation
- Violations of error independence
**Usage:**
```python
# AR(1) errors
model = sm.GLSAR(y, X, rho=1) # rho=1 for AR(1), rho=2 for AR(2), etc.
results = model.iterative_fit() # Iteratively estimates AR parameters
print(results.summary())
print(f"Estimated rho: {results.model.rho}")
```
### RLS (Recursive Least Squares)
Sequential parameter estimation, useful for adaptive or online learning.
**When to use:**
- Parameters change over time
- Online/streaming data
- Want to see parameter evolution
**Usage:**
```python
from statsmodels.regression.recursive_ls import RecursiveLS
model = RecursiveLS(y, X)
results = model.fit()
# Access time-varying parameters
params_over_time = results.recursive_coefficients
cusum = results.cusum # CUSUM statistic for structural breaks
```
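The results object also provides built-in stability plots; a brief sketch:
```python
import matplotlib.pyplot as plt

fig = results.plot_cusum(alpha=0.05)  # CUSUM with significance bands
fig2 = results.plot_recursive_coefficient(range(X.shape[1]))
plt.show()
```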
### Rolling Regressions
Compute estimates across moving windows for time-varying parameter detection.
**When to use:**
- Parameters vary over time
- Want to detect structural changes
- Time series with evolving relationships
**Usage:**
```python
from statsmodels.regression.rolling import RollingOLS, RollingWLS
# Rolling OLS with 60-period window
rolling_model = RollingOLS(y, X, window=60)
rolling_results = rolling_model.fit()
# Extract time-varying parameters
rolling_params = rolling_results.params # DataFrame with parameters over time
rolling_rsquared = rolling_results.rsquared
# Plot parameter evolution
import matplotlib.pyplot as plt
rolling_params.plot()
plt.title('Time-Varying Coefficients')
plt.show()
```
### Quantile Regression
Analyzes conditional quantiles rather than conditional mean.
**When to use:**
- Interest in quantiles (median, 90th percentile, etc.)
- Robust to outliers (median regression)
- Distributional effects across quantiles
- Heterogeneous effects
**Usage:**
```python
from statsmodels.regression.quantile_regression import QuantReg
# Median regression (50th percentile)
model = QuantReg(y, X)
results_median = model.fit(q=0.5)
# Multiple quantiles
quantiles = [0.1, 0.25, 0.5, 0.75, 0.9]
results_dict = {}
for q in quantiles:
results_dict[q] = model.fit(q=q)
# Plot quantile-varying effects
import matplotlib.pyplot as plt
coef_dict = {q: res.params for q, res in results_dict.items()}
coef_df = pd.DataFrame(coef_dict).T
coef_df.plot()
plt.xlabel('Quantile')
plt.ylabel('Coefficient')
plt.show()
```
## Mixed Effects Models
For hierarchical/nested data with random effects.
**When to use:**
- Clustered/grouped data (students in schools, patients in hospitals)
- Repeated measures
- Need random effects to account for grouping
**Usage:**
```python
from statsmodels.regression.mixed_linear_model import MixedLM
# Random intercept model
model = MixedLM(y, X, groups=group_ids)
results = model.fit()
# Random intercept and slope
model = MixedLM(y, X, groups=group_ids, exog_re=X_random)
results = model.fit()
print(results.summary())
```
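With a pandas DataFrame, the formula interface is often more convenient. A sketch assuming columns `y`, `x`, and `group` in `df`:
```python
import statsmodels.formula.api as smf

# Random intercept per group plus a random slope on x
model = smf.mixedlm("y ~ x", df, groups=df["group"], re_formula="~x")
results = model.fit()
print(results.summary())
```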
## Diagnostics and Model Assessment
### Residual Analysis
```python
# Basic residual plots
import matplotlib.pyplot as plt
# Residuals vs fitted
plt.scatter(results.fittedvalues, results.resid)
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.axhline(y=0, color='r', linestyle='--')
plt.title('Residuals vs Fitted')
plt.show()
# Q-Q plot for normality
from statsmodels.graphics.gofplots import qqplot
qqplot(results.resid, line='s')
plt.show()
# Histogram of residuals
plt.hist(results.resid, bins=30, edgecolor='black')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals')
plt.show()
```
### Specification Tests
```python
from statsmodels.stats.diagnostic import het_breuschpagan, het_white
from statsmodels.stats.stattools import durbin_watson, jarque_bera
# Heteroscedasticity tests
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(results.resid, X)
print(f"Breusch-Pagan test p-value: {lm_pval}")
# White test
white_test = het_white(results.resid, X)
print(f"White test p-value: {white_test[1]}")
# Autocorrelation
dw_stat = durbin_watson(results.resid)
print(f"Durbin-Watson statistic: {dw_stat}")
# DW ~ 2 indicates no autocorrelation
# DW < 2 suggests positive autocorrelation
# DW > 2 suggests negative autocorrelation
# Normality test
jb_stat, jb_pval, skew, kurtosis = jarque_bera(results.resid)
print(f"Jarque-Bera test p-value: {jb_pval}")
```
### Multicollinearity
```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Calculate VIF for each variable
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)
# VIF > 10 indicates problematic multicollinearity
# VIF > 5 suggests moderate multicollinearity
# Condition number (from summary)
print(f"Condition number: {results.condition_number}")
# Condition number > 20 suggests multicollinearity
# Condition number > 30 indicates serious problems
```
### Influence Statistics
```python
from statsmodels.stats.outliers_influence import OLSInfluence
influence = results.get_influence()
# Leverage (hat values)
leverage = influence.hat_matrix_diag
# High leverage: > 2*p/n (p=predictors, n=observations)
# Cook's distance
cooks_d = influence.cooks_distance[0]
# Influential if Cook's D > 4/n
# DFFITS
dffits = influence.dffits[0]
# Influential if |DFFITS| > 2*sqrt(p/n)
# Create influence plot
from statsmodels.graphics.regressionplots import influence_plot
fig, ax = plt.subplots(figsize=(12, 8))
influence_plot(results, ax=ax)
plt.show()
```
### Hypothesis Testing
```python
# Test single coefficient
# H0: beta_i = 0 (automatically in summary)
# Test multiple restrictions using F-test
# Example: Test beta_1 = beta_2 = 0
R = [[0, 1, 0, 0], [0, 0, 1, 0]] # Restriction matrix
f_test = results.f_test(R)
print(f_test)
# Formula-based hypothesis testing
f_test = results.f_test("x1 = x2 = 0")
print(f_test)
# Test linear combination: beta_1 + beta_2 = 1
r_matrix = [[0, 1, 1, 0]]
q_matrix = [1] # RHS value
f_test = results.f_test((r_matrix, q_matrix))
print(f_test)
# Wald test (equivalent to F-test for linear restrictions)
wald_test = results.wald_test(R)
print(wald_test)
```
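`f_test` handles joint restrictions; for a single linear hypothesis with a non-zero right-hand side, `t_test` is the natural complement (this assumes a regressor named `x1`, i.e. X is a DataFrame with named columns):
```python
# H0: coefficient on x1 equals 0.5 (two-sided t-test)
single_test = results.t_test("x1 = 0.5")
print(single_test)
```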
## Model Comparison
```python
# Compare nested models using likelihood ratio test (if using MLE)
from statsmodels.stats.anova import anova_lm
# Fit restricted and unrestricted models
model_restricted = sm.OLS(y, X_restricted).fit()
model_full = sm.OLS(y, X_full).fit()
# ANOVA table for model comparison
anova_results = anova_lm(model_restricted, model_full)
print(anova_results)
# AIC/BIC for non-nested model comparison
print(f"Model 1 AIC: {model1.aic}, BIC: {model1.bic}")
print(f"Model 2 AIC: {model2.aic}, BIC: {model2.bic}")
# Lower AIC/BIC indicates better model
```
## Robust Standard Errors
Handle heteroscedasticity or clustering without reweighting.
```python
# Heteroscedasticity-robust (HC) standard errors
results_hc = results.get_robustcov_results(cov_type='HC0') # White's
results_hc1 = results.get_robustcov_results(cov_type='HC1')
results_hc2 = results.get_robustcov_results(cov_type='HC2')
results_hc3 = results.get_robustcov_results(cov_type='HC3') # Most conservative
# Newey-West HAC (Heteroscedasticity and Autocorrelation Consistent)
results_hac = results.get_robustcov_results(cov_type='HAC', maxlags=4)
# Cluster-robust standard errors
results_cluster = results.get_robustcov_results(cov_type='cluster',
groups=cluster_ids)
# View robust results
print(results_hc3.summary())
```
## Best Practices
1. **Always add constant**: Use `sm.add_constant()` unless you specifically want to exclude the intercept
2. **Check assumptions**: Run diagnostic tests (heteroscedasticity, autocorrelation, normality)
3. **Use formula API for categorical variables**: `smf.ols()` handles categorical variables automatically
4. **Robust standard errors**: Use when heteroscedasticity detected but model specification is correct
5. **Model selection**: Use AIC/BIC for non-nested models, F-test/likelihood ratio for nested models
6. **Outliers and influence**: Always check Cook's distance and leverage
7. **Multicollinearity**: Check VIF and condition number before interpretation
8. **Time series**: Use `GLSAR` or robust HAC standard errors for autocorrelated errors
9. **Grouped data**: Consider mixed effects models or cluster-robust standard errors
10. **Quantile regression**: Use for robust estimation or when interested in distributional effects
## Common Pitfalls
1. **Forgetting to add constant**: Results in no-intercept model
2. **Ignoring heteroscedasticity**: Use WLS or robust standard errors
3. **Using OLS with autocorrelated errors**: Use GLSAR or HAC standard errors
4. **Over-interpreting with multicollinearity**: Check VIF first
5. **Not checking residuals**: Always plot residuals vs fitted values
6. **Using t-SNE/PCA residuals**: Residuals should be from original space
7. **Confusing prediction vs confidence intervals**: Prediction intervals are wider
8. **Not handling categorical variables properly**: Use formula API or manual dummy coding
9. **Comparing models with different sample sizes**: Ensure same observations used
10. **Ignoring influential observations**: Check Cook's distance and DFFITS
# Statistical Tests and Diagnostics Reference
This document provides comprehensive guidance on statistical tests, diagnostics, and tools available in statsmodels.
## Overview
Statsmodels provides extensive statistical testing capabilities:
- Residual diagnostics and specification tests
- Hypothesis testing (parametric and non-parametric)
- Goodness-of-fit tests
- Multiple comparisons and post-hoc tests
- Power and sample size calculations
- Robust covariance matrices
- Influence and outlier detection
## Residual Diagnostics
### Autocorrelation Tests
**Ljung-Box Test**: Tests for autocorrelation in residuals
```python
from statsmodels.stats.diagnostic import acorr_ljungbox
# Test residuals for autocorrelation
lb_test = acorr_ljungbox(residuals, lags=10, return_df=True)
print(lb_test)
# H0: No autocorrelation up to lag k
# If p-value < 0.05, reject H0 (autocorrelation present)
```
**Durbin-Watson Test**: Tests for first-order autocorrelation
```python
from statsmodels.stats.stattools import durbin_watson
dw_stat = durbin_watson(residuals)
print(f"Durbin-Watson: {dw_stat:.4f}")
# DW ≈ 2: no autocorrelation
# DW < 2: positive autocorrelation
# DW > 2: negative autocorrelation
# Exact critical values depend on n and k
```
**Breusch-Godfrey Test**: More general test for autocorrelation
```python
from statsmodels.stats.diagnostic import acorr_breusch_godfrey
bg_test = acorr_breusch_godfrey(results, nlags=5)
lm_stat, lm_pval, f_stat, f_pval = bg_test
print(f"LM statistic: {lm_stat:.4f}, p-value: {lm_pval:.4f}")
# H0: No autocorrelation up to lag k
```
### Heteroskedasticity Tests
**Breusch-Pagan Test**: Tests for heteroskedasticity
```python
from statsmodels.stats.diagnostic import het_breuschpagan
bp_test = het_breuschpagan(residuals, exog)
lm_stat, lm_pval, f_stat, f_pval = bp_test
print(f"Breusch-Pagan test p-value: {lm_pval:.4f}")
# H0: Homoskedasticity (constant variance)
# If p-value < 0.05, reject H0 (heteroskedasticity present)
```
**White Test**: More general test for heteroskedasticity
```python
from statsmodels.stats.diagnostic import het_white
white_test = het_white(residuals, exog)
lm_stat, lm_pval, f_stat, f_pval = white_test
print(f"White test p-value: {lm_pval:.4f}")
# H0: Homoskedasticity
```
**ARCH Test**: Tests for autoregressive conditional heteroskedasticity
```python
from statsmodels.stats.diagnostic import het_arch
arch_test = het_arch(residuals, nlags=5)
lm_stat, lm_pval, f_stat, f_pval = arch_test
print(f"ARCH test p-value: {lm_pval:.4f}")
# H0: No ARCH effects
# If significant, consider GARCH model
```
### Normality Tests
**Jarque-Bera Test**: Tests for normality using skewness and kurtosis
```python
from statsmodels.stats.stattools import jarque_bera
jb_stat, jb_pval, skew, kurtosis = jarque_bera(residuals)
print(f"Jarque-Bera statistic: {jb_stat:.4f}")
print(f"p-value: {jb_pval:.4f}")
print(f"Skewness: {skew:.4f}")
print(f"Kurtosis: {kurtosis:.4f}")
# H0: Residuals are normally distributed
# Normal: skewness ≈ 0, kurtosis ≈ 3
```
**Omnibus Test**: Another normality test (also based on skewness/kurtosis)
```python
from statsmodels.stats.stattools import omni_normtest
omni_stat, omni_pval = omni_normtest(residuals)
print(f"Omnibus test p-value: {omni_pval:.4f}")
# H0: Normality
```
**Anderson-Darling Test**: Distribution fit test
```python
from statsmodels.stats.diagnostic import normal_ad
ad_stat, ad_pval = normal_ad(residuals)
print(f"Anderson-Darling test p-value: {ad_pval:.4f}")
```
**Lilliefors Test**: Modified Kolmogorov-Smirnov test
```python
from statsmodels.stats.diagnostic import lilliefors
lf_stat, lf_pval = lilliefors(residuals, dist='norm')
print(f"Lilliefors test p-value: {lf_pval:.4f}")
```
### Linearity and Specification Tests
**Ramsey RESET Test**: Tests for functional form misspecification
```python
from statsmodels.stats.diagnostic import linear_reset
# linear_reset returns a test-result object, not a tuple
reset_test = linear_reset(results, power=2, use_f=True)
print(f"RESET test p-value: {float(reset_test.pvalue):.4f}")
# H0: Model is correctly specified (linear)
# If rejected, may need polynomial terms or transformations
```
**Harvey-Collier Test**: Tests for linearity
```python
from statsmodels.stats.diagnostic import linear_harvey_collier
hc_stat, hc_pval = linear_harvey_collier(results)
print(f"Harvey-Collier test p-value: {hc_pval:.4f}")
# H0: Linear specification is correct
```
## Multicollinearity Detection
**Variance Inflation Factor (VIF)**:
```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd
# Calculate VIF for each variable
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i)
for i in range(X.shape[1])]
print(vif_data.sort_values('VIF', ascending=False))
# Interpretation:
# VIF = 1: No correlation with other predictors
# VIF > 5: Moderate multicollinearity
# VIF > 10: Serious multicollinearity problem
# VIF > 20: Severe multicollinearity (consider removing variable)
```
**Condition Number**: From regression results
```python
print(f"Condition number: {results.condition_number:.2f}")
# Interpretation:
# < 10: No multicollinearity concern
# 10-30: Moderate multicollinearity
# > 30: Strong multicollinearity
# > 100: Severe multicollinearity
```
## Influence and Outlier Detection
### Leverage
High leverage points have extreme predictor values.
```python
from statsmodels.stats.outliers_influence import OLSInfluence
influence = results.get_influence()
# Hat values (leverage)
leverage = influence.hat_matrix_diag
# Rule of thumb: leverage > 2*p/n or 3*p/n is high
# p = number of parameters, n = sample size
threshold = 2 * len(results.params) / len(y)
high_leverage = np.where(leverage > threshold)[0]
print(f"High leverage observations: {high_leverage}")
```
### Cook's Distance
Measures overall influence of each observation.
```python
# Cook's distance
cooks_d = influence.cooks_distance[0]
# Rule of thumb: Cook's D > 4/n is influential
threshold = 4 / len(y)
influential = np.where(cooks_d > threshold)[0]
print(f"Influential observations (Cook's D): {influential}")
# Plot
import matplotlib.pyplot as plt
plt.stem(range(len(cooks_d)), cooks_d)
plt.axhline(y=threshold, color='r', linestyle='--', label=f'Threshold (4/n)')
plt.xlabel('Observation')
plt.ylabel("Cook's Distance")
plt.legend()
plt.show()
```
### DFFITS
Measures influence on fitted value.
```python
# DFFITS
dffits = influence.dffits[0]
# Rule of thumb: |DFFITS| > 2*sqrt(p/n) is influential
p = len(results.params)
n = len(y)
threshold = 2 * np.sqrt(p / n)
influential_dffits = np.where(np.abs(dffits) > threshold)[0]
print(f"Influential observations (DFFITS): {influential_dffits}")
```
### DFBETAs
Measures influence on each coefficient.
```python
# DFBETAs (one for each parameter)
dfbetas = influence.dfbetas
# Rule of thumb: |DFBETA| > 2/sqrt(n)
threshold = 2 / np.sqrt(n)
for i, param_name in enumerate(results.params.index):
influential = np.where(np.abs(dfbetas[:, i]) > threshold)[0]
if len(influential) > 0:
print(f"Influential for {param_name}: {influential}")
```
### Influence Plot
```python
from statsmodels.graphics.regressionplots import influence_plot
fig, ax = plt.subplots(figsize=(12, 8))
influence_plot(results, ax=ax, criterion='cooks')
plt.show()
# Combines leverage, residuals, and Cook's distance
# Large bubbles = high Cook's distance
# Far from x=0 = high leverage
# Far from y=0 = large residual
```
### Studentized Residuals
```python
# Studentized residuals (outliers)
student_resid = influence.resid_studentized_internal
# External studentized residuals (more conservative)
student_resid_external = influence.resid_studentized_external
# Outliers: |studentized residual| > 3 (or > 2.5)
outliers = np.where(np.abs(student_resid_external) > 3)[0]
print(f"Outliers: {outliers}")
```
## Hypothesis Testing
### t-tests
**One-sample t-test**: Test if mean equals specific value
```python
from scipy import stats
# H0: population mean = mu_0
t_stat, p_value = stats.ttest_1samp(data, popmean=mu_0)
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
```
**Two-sample t-test**: Compare means of two groups
```python
# H0: mean1 = mean2 (equal variances)
t_stat, p_value = stats.ttest_ind(group1, group2)
# Welch's t-test (unequal variances)
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
```
**Paired t-test**: Compare paired observations
```python
# H0: mean difference = 0
t_stat, p_value = stats.ttest_rel(before, after)
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
```
### Proportion Tests
**One-proportion test**:
```python
from statsmodels.stats.proportion import proportions_ztest
# H0: proportion = p0
count = 45 # successes
nobs = 100 # total observations
p0 = 0.5 # hypothesized proportion
z_stat, p_value = proportions_ztest(count, nobs, value=p0)
print(f"z-statistic: {z_stat:.4f}")
print(f"p-value: {p_value:.4f}")
```
**Two-proportion test**:
```python
# H0: proportion1 = proportion2
counts = [45, 60]
nobs = [100, 120]
z_stat, p_value = proportions_ztest(counts, nobs)
print(f"z-statistic: {z_stat:.4f}")
print(f"p-value: {p_value:.4f}")
```
### Chi-square Tests
**Chi-square test of independence**:
```python
from scipy.stats import chi2_contingency
# Contingency table
contingency_table = pd.crosstab(variable1, variable2)
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
print(f"Chi-square statistic: {chi2:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
# H0: Variables are independent
```
**Chi-square goodness-of-fit**:
```python
from scipy.stats import chisquare
# Observed frequencies
observed = [20, 30, 25, 25]
# Expected frequencies (equal by default)
expected = [25, 25, 25, 25]
chi2, p_value = chisquare(observed, expected)
print(f"Chi-square statistic: {chi2:.4f}")
print(f"p-value: {p_value:.4f}")
# H0: Data follow the expected distribution
```
### Non-parametric Tests
**Mann-Whitney U test** (independent samples):
```python
from scipy.stats import mannwhitneyu
# H0: Distributions are equal
u_stat, p_value = mannwhitneyu(group1, group2, alternative='two-sided')
print(f"U statistic: {u_stat:.4f}")
print(f"p-value: {p_value:.4f}")
```
**Wilcoxon signed-rank test** (paired samples):
```python
from scipy.stats import wilcoxon
# H0: Median difference = 0
w_stat, p_value = wilcoxon(before, after)
print(f"W statistic: {w_stat:.4f}")
print(f"p-value: {p_value:.4f}")
```
**Kruskal-Wallis H test** (>2 groups):
```python
from scipy.stats import kruskal
# H0: All groups have same distribution
h_stat, p_value = kruskal(group1, group2, group3)
print(f"H statistic: {h_stat:.4f}")
print(f"p-value: {p_value:.4f}")
```
**Sign test**:
```python
from statsmodels.stats.descriptivestats import sign_test
# H0: Median = m0
result = sign_test(data, m0=0)
print(result)
```
### ANOVA
**One-way ANOVA**:
```python
from scipy.stats import f_oneway
# H0: All group means are equal
f_stat, p_value = f_oneway(group1, group2, group3)
print(f"F-statistic: {f_stat:.4f}")
print(f"p-value: {p_value:.4f}")
```
**Two-way ANOVA** (with statsmodels):
```python
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
# Fit model
model = ols('response ~ C(factor1) + C(factor2) + C(factor1):C(factor2)',
data=df).fit()
# ANOVA table
anova_table = anova_lm(model, typ=2)
print(anova_table)
```
**Repeated measures ANOVA**:
```python
from statsmodels.stats.anova import AnovaRM
# Requires long-format data
aovrm = AnovaRM(df, depvar='score', subject='subject_id', within=['time'])
results = aovrm.fit()
print(results.summary())
```
## Multiple Comparisons
### Post-hoc Tests
**Tukey's HSD** (Honest Significant Difference):
```python
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Perform Tukey HSD test
tukey = pairwise_tukeyhsd(data, groups, alpha=0.05)
print(tukey.summary())
# Plot confidence intervals
tukey.plot_simultaneous()
plt.show()
```
**Bonferroni correction**:
```python
from statsmodels.stats.multitest import multipletests
# P-values from multiple tests
p_values = [0.01, 0.03, 0.04, 0.15, 0.001]
# Apply correction
reject, pvals_corrected, alphac_sidak, alphac_bonf = multipletests(
p_values,
alpha=0.05,
method='bonferroni'
)
print("Rejected:", reject)
print("Corrected p-values:", pvals_corrected)
```
**False Discovery Rate (FDR)**:
```python
# FDR correction (less conservative than Bonferroni)
reject, pvals_corrected, alphac_sidak, alphac_bonf = multipletests(
p_values,
alpha=0.05,
method='fdr_bh' # Benjamini-Hochberg
)
print("Rejected:", reject)
print("Corrected p-values:", pvals_corrected)
```
## Robust Covariance Matrices
### Heteroskedasticity-Consistent (HC) Standard Errors
```python
# After fitting OLS
results = sm.OLS(y, X).fit()
# HC0 (White's heteroskedasticity-consistent SEs)
results_hc0 = results.get_robustcov_results(cov_type='HC0')
# HC1 (degrees of freedom adjustment)
results_hc1 = results.get_robustcov_results(cov_type='HC1')
# HC2 (leverage adjustment)
results_hc2 = results.get_robustcov_results(cov_type='HC2')
# HC3 (most conservative, recommended for small samples)
results_hc3 = results.get_robustcov_results(cov_type='HC3')
print("Standard OLS SEs:", results.bse)
print("Robust HC3 SEs:", results_hc3.bse)
```
### HAC (Heteroskedasticity and Autocorrelation Consistent)
**Newey-West standard errors**:
```python
# For time series with autocorrelation and heteroskedasticity
results_hac = results.get_robustcov_results(cov_type='HAC', maxlags=4)
print("HAC (Newey-West) SEs:", results_hac.bse)
print(results_hac.summary())
```
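The `maxlags` choice matters for HAC inference. A common plug-in rule of thumb (a heuristic, not a statsmodels default) is ⌊4·(n/100)^(2/9)⌋:
```python
import numpy as np

# Rule-of-thumb HAC bandwidth; treat as a starting point, not a prescription
n = int(results.nobs)
maxlags = int(np.floor(4 * (n / 100) ** (2 / 9)))
results_hac = results.get_robustcov_results(cov_type='HAC', maxlags=maxlags)
print(f"Using maxlags = {maxlags}")
```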
### Cluster-Robust Standard Errors
```python
# For clustered/grouped data
results_cluster = results.get_robustcov_results(
cov_type='cluster',
groups=cluster_ids
)
print("Cluster-robust SEs:", results_cluster.bse)
```
## Descriptive Statistics
**Basic descriptive statistics**:
```python
from statsmodels.stats.api import DescrStatsW
# Comprehensive descriptive stats
desc = DescrStatsW(data)
print("Mean:", desc.mean)
print("Std Dev:", desc.std)
print("Variance:", desc.var)
print("Confidence interval:", desc.tconfint_mean())
# Quantiles
print("Median:", desc.quantile(0.5))
print("IQR:", desc.quantile([0.25, 0.75]))
```
**Weighted statistics**:
```python
# With weights
desc_weighted = DescrStatsW(data, weights=weights)
print("Weighted mean:", desc_weighted.mean)
print("Weighted std:", desc_weighted.std)
```
**Compare two groups**:
```python
from statsmodels.stats.weightstats import CompareMeans
# Create comparison object
cm = CompareMeans(DescrStatsW(group1), DescrStatsW(group2))
# t-test
print("t-test:", cm.ttest_ind())
# Confidence interval for difference
print("CI for difference:", cm.tconfint_diff())
# Test for equal variances (CompareMeans has no built-in test; use scipy's Levene test)
from scipy import stats
print("Levene test:", stats.levene(group1, group2))
```
## Power Analysis and Sample Size
**Power for t-test**:
```python
from statsmodels.stats.power import tt_ind_solve_power
# Solve for sample size
effect_size = 0.5 # Cohen's d
alpha = 0.05
power = 0.8
n = tt_ind_solve_power(effect_size=effect_size,
alpha=alpha,
power=power,
alternative='two-sided')
print(f"Required sample size per group: {n:.0f}")
# Solve for power given n
power = tt_ind_solve_power(effect_size=0.5,
nobs1=50,
alpha=0.05,
alternative='two-sided')
print(f"Power: {power:.4f}")
```
**Power for proportion test**:
```python
from statsmodels.stats.power import zt_ind_solve_power
# For proportion tests (z-test)
effect_size = 0.3  # Standardized effect size (e.g., Cohen's h), not the raw difference in proportions
alpha = 0.05
power = 0.8
n = zt_ind_solve_power(effect_size=effect_size,
alpha=alpha,
power=power,
alternative='two-sided')
print(f"Required sample size per group: {n:.0f}")
```
**Power curves**:
```python
from statsmodels.stats.power import TTestIndPower
import matplotlib.pyplot as plt
# Create power analysis object
analysis = TTestIndPower()
# Plot power curves for different sample sizes
sample_sizes = range(10, 200, 10)
effect_sizes = [0.2, 0.5, 0.8] # Small, medium, large
fig, ax = plt.subplots(figsize=(10, 6))
for es in effect_sizes:
power = [analysis.solve_power(effect_size=es, nobs1=n, alpha=0.05)
for n in sample_sizes]
ax.plot(sample_sizes, power, label=f'Effect size = {es}')
ax.axhline(y=0.8, color='r', linestyle='--', label='Power = 0.8')
ax.set_xlabel('Sample size per group')
ax.set_ylabel('Power')
ax.set_title('Power Curves for Two-Sample t-test')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()
```
## Effect Sizes
**Cohen's d** (standardized mean difference):
```python
def cohens_d(group1, group2):
    """Calculate Cohen's d for independent samples."""
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    # Pooled standard deviation
    pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    # Cohen's d
    return (np.mean(group1) - np.mean(group2)) / pooled_std
d = cohens_d(group1, group2)
print(f"Cohen's d: {d:.4f}")
# Interpretation:
# |d| < 0.2: negligible
# |d| ~ 0.2: small
# |d| ~ 0.5: medium
# |d| ~ 0.8: large
```
**Eta-squared** (for ANOVA):
```python
# From ANOVA table
# η² = SS_between / SS_total
def eta_squared(anova_table):
    return anova_table['sum_sq'].iloc[0] / anova_table['sum_sq'].sum()
# After running ANOVA
eta_sq = eta_squared(anova_table)
print(f"Eta-squared: {eta_sq:.4f}")
# Interpretation:
# 0.01: small effect
# 0.06: medium effect
# 0.14: large effect
```
## Contingency Tables and Association
**McNemar's test** (paired binary data):
```python
from statsmodels.stats.contingency_tables import mcnemar
# 2x2 contingency table
# 2x2 table of paired outcomes; b and c count the discordant pairs
table = [[a, b],   # a, b, c, d are the observed cell counts
         [c, d]]
result = mcnemar(table, exact=True)  # exact=False uses the chi-square approximation for large samples
print(f"p-value: {result.pvalue:.4f}")
# H0: Marginal probabilities are equal
```
**Cochran-Mantel-Haenszel test**:
```python
from statsmodels.stats.contingency_tables import StratifiedTable
# For stratified 2x2 tables
strat_table = StratifiedTable(tables_list)
result = strat_table.test_null_odds()
print(f"p-value: {result.pvalue:.4f}")
```
## Treatment Effects and Causal Inference
**Propensity score matching**:
```python
import statsmodels.api as sm

# Estimate propensity scores with a logistic regression
ps_model = sm.Logit(treatment, X).fit()
propensity_scores = ps_model.predict(X)
# Use the scores for matching or inverse-probability weighting;
# statsmodels does not ship a matcher, so matching is implemented manually
```
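Since the matching step is manual, here is a minimal 1:1 nearest-neighbor sketch. It assumes `treatment` is a binary array and `y` the outcome, and borrows scikit-learn for the neighbor search:
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

ps = np.asarray(propensity_scores).reshape(-1, 1)
treated = np.flatnonzero(np.asarray(treatment) == 1)
control = np.flatnonzero(np.asarray(treatment) == 0)

# For each treated unit, find the closest control on the propensity score
nn = NearestNeighbors(n_neighbors=1).fit(ps[control])
_, match_idx = nn.kneighbors(ps[treated])
matched_controls = control[match_idx.ravel()]

# Naive ATT estimate from the matched sample (no variance adjustment)
y_arr = np.asarray(y)
att = y_arr[treated].mean() - y_arr[matched_controls].mean()
print(f"ATT estimate: {att:.4f}")
```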
**Difference-in-differences**:
```python
# DiD specification: outcome ~ treatment * post (expands to main effects + interaction)
model = ols('outcome ~ treatment + post + treatment:post', data=df).fit()
# DiD estimate is the interaction coefficient
did_estimate = model.params['treatment:post']
print(f"DiD estimate: {did_estimate:.4f}")
```
## Best Practices
1. **Always check assumptions**: Test before interpreting results
2. **Report effect sizes**: Not just p-values
3. **Use appropriate tests**: Match test to data type and distribution
4. **Correct for multiple comparisons**: When conducting many tests
5. **Check sample size**: Ensure adequate power
6. **Visual inspection**: Plot data before testing
7. **Report confidence intervals**: Along with point estimates
8. **Consider alternatives**: Non-parametric when assumptions violated
9. **Robust standard errors**: Use when heteroskedasticity/autocorrelation present
10. **Document decisions**: Note which tests used and why
## Common Pitfalls
1. **Not checking test assumptions**: May invalidate results
2. **Multiple testing without correction**: Inflated Type I error
3. **Using parametric tests on non-normal data**: Consider non-parametric
4. **Ignoring heteroskedasticity**: Use robust SEs
5. **Confusing statistical and practical significance**: Check effect sizes
6. **Not reporting confidence intervals**: Only p-values insufficient
7. **Using wrong test**: Match test to research question
8. **Insufficient power**: Risk of Type II error (false negatives)
9. **p-hacking**: Testing many specifications until significant
10. **Overinterpreting p-values**: Remember limitations of NHST
# Time Series Analysis Reference
This document provides comprehensive guidance on time series models in statsmodels, including ARIMA, state space models, VAR, exponential smoothing, and forecasting methods.
## Overview
Statsmodels offers extensive time series capabilities:
- **Univariate models**: AR, ARIMA, SARIMAX, Exponential Smoothing
- **Multivariate models**: VAR, VARMAX, Dynamic Factor Models
- **State space framework**: Custom models, Kalman filtering
- **Diagnostic tools**: ACF, PACF, stationarity tests, residual analysis
- **Forecasting**: Point forecasts and prediction intervals
## Univariate Time Series Models
### AutoReg (AR Model)
Autoregressive model: current value depends on past values.
**When to use:**
- Univariate time series
- Past values predict future
- Stationary series
**Model**: yₜ = c + φ₁yₜ₋₁ + φ₂yₜ₋₂ + ... + φₚyₜ₋ₚ + εₜ
```python
from statsmodels.tsa.ar_model import AutoReg
import pandas as pd
# Fit AR(p) model
model = AutoReg(y, lags=5) # AR(5)
results = model.fit()
print(results.summary())
```
**With exogenous regressors:**
```python
# AR with exogenous variables (ARX)
model = AutoReg(y, lags=5, exog=X_exog)
results = model.fit()
```
**Seasonal AR:**
```python
# Seasonal dummy variables (e.g., monthly data with yearly seasonality)
model = AutoReg(y, lags=12, seasonal=True, period=12)
results = model.fit()
```
### ARIMA (Autoregressive Integrated Moving Average)
Combines AR, differencing (I), and MA components.
**When to use:**
- Non-stationary time series (needs differencing)
- Past values and errors predict future
- Flexible model for many time series
**Model**: ARIMA(p,d,q)
- p: AR order (lags)
- d: differencing order (to achieve stationarity)
- q: MA order (lagged forecast errors)
```python
from statsmodels.tsa.arima.model import ARIMA
# Fit ARIMA(p,d,q)
model = ARIMA(y, order=(1, 1, 1)) # ARIMA(1,1,1)
results = model.fit()
print(results.summary())
```
**Choosing p, d, q:**
1. **Determine d (differencing order)**:
```python
from statsmodels.tsa.stattools import adfuller
# ADF test for stationarity
def check_stationarity(series):
result = adfuller(series)
print(f"ADF Statistic: {result[0]:.4f}")
print(f"p-value: {result[1]:.4f}")
if result[1] <= 0.05:
print("Series is stationary")
return True
else:
print("Series is non-stationary, needs differencing")
return False
# Test original series
if not check_stationarity(y):
# Difference once
y_diff = y.diff().dropna()
if not check_stationarity(y_diff):
# Difference again
y_diff2 = y_diff.diff().dropna()
check_stationarity(y_diff2)
```
2. **Determine p and q (ACF/PACF)**:
```python
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
# After differencing to stationarity
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
# ACF: helps determine q (MA order)
plot_acf(y_stationary, lags=40, ax=ax1)
ax1.set_title('Autocorrelation Function (ACF)')
# PACF: helps determine p (AR order)
plot_pacf(y_stationary, lags=40, ax=ax2)
ax2.set_title('Partial Autocorrelation Function (PACF)')
plt.tight_layout()
plt.show()
# Rules of thumb:
# - PACF cuts off at lag p → AR(p)
# - ACF cuts off at lag q → MA(q)
# - Both decay → ARMA(p,q)
```
3. **Model selection (AIC/BIC)**:
```python
# Grid search for best (p,q) given d
import numpy as np
best_aic = np.inf
best_order = None
for p in range(5):
    for q in range(5):
        try:
            model = ARIMA(y, order=(p, d, q))
            results = model.fit()
            if results.aic < best_aic:
                best_aic = results.aic
                best_order = (p, d, q)
        except Exception:  # skip orders that fail to estimate
            continue
print(f"Best order: {best_order} with AIC: {best_aic:.2f}")
```
### SARIMAX (Seasonal ARIMA with Exogenous Variables)
Extends ARIMA with seasonality and exogenous regressors.
**When to use:**
- Seasonal patterns (monthly, quarterly data)
- External variables influence series
- Most flexible univariate model
**Model**: SARIMAX(p,d,q)(P,D,Q,s)
- (p,d,q): Non-seasonal ARIMA
- (P,D,Q,s): Seasonal ARIMA with period s
```python
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Seasonal ARIMA for monthly data (s=12)
model = SARIMAX(y,
order=(1, 1, 1), # (p,d,q)
seasonal_order=(1, 1, 1, 12)) # (P,D,Q,s)
results = model.fit()
print(results.summary())
```
**With exogenous variables:**
```python
# SARIMAX with external predictors
model = SARIMAX(y,
exog=X_exog,
order=(1, 1, 1),
seasonal_order=(1, 1, 1, 12))
results = model.fit()
```
**Example: Monthly sales with trend and seasonality**
```python
# Typical for monthly data: (p,d,q)(P,D,Q,12)
# Start with (1,1,1)(1,1,1,12) or (0,1,1)(0,1,1,12)
model = SARIMAX(monthly_sales,
order=(0, 1, 1),
seasonal_order=(0, 1, 1, 12),
enforce_stationarity=False,
enforce_invertibility=False)
results = model.fit()
```
### Exponential Smoothing
Weighted averages of past observations with exponentially decreasing weights.
**When to use:**
- Simple, interpretable forecasts
- Trend and/or seasonality present
- No need for explicit model specification
**Types:**
- Simple Exponential Smoothing: no trend, no seasonality
- Holt's method: with trend
- Holt-Winters: with trend and seasonality
```python
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Simple exponential smoothing
model = ExponentialSmoothing(y, trend=None, seasonal=None)
results = model.fit()
# Holt's method (with trend)
model = ExponentialSmoothing(y, trend='add', seasonal=None)
results = model.fit()
# Holt-Winters (trend + seasonality)
model = ExponentialSmoothing(y,
trend='add', # 'add' or 'mul'
seasonal='add', # 'add' or 'mul'
seasonal_periods=12) # e.g., 12 for monthly
results = model.fit()
print(results.summary())
```
**Additive vs Multiplicative:**
```python
# Additive: constant seasonal variation
# yₜ = Level + Trend + Seasonal + Error
# Multiplicative: proportional seasonal variation
# yₜ = Level × Trend × Seasonal × Error
# Choose based on data:
# - Additive: seasonal variation constant over time
# - Multiplicative: seasonal variation increases with level
```
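When a plot doesn't settle the choice, one pragmatic heuristic is to fit both variants and compare information criteria. A sketch assuming monthly data in `y` (the multiplicative form requires strictly positive values):
```python
from statsmodels.tsa.holtwinters import ExponentialSmoothing

fits = {}
for mode in ['add', 'mul']:
    model = ExponentialSmoothing(y, trend=mode, seasonal=mode,
                                 seasonal_periods=12)
    fits[mode] = model.fit()
    print(f"{mode}: AIC = {fits[mode].aic:.2f}")
# The lower-AIC form is the better candidate, subject to residual checks
```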
**Innovations state space (ETS):**
```python
from statsmodels.tsa.exponential_smoothing.ets import ETSModel
# More robust, state space formulation
model = ETSModel(y,
error='add', # 'add' or 'mul'
trend='add', # 'add', 'mul', or None
seasonal='add', # 'add', 'mul', or None
seasonal_periods=12)
results = model.fit()
```
## Multivariate Time Series
### VAR (Vector Autoregression)
System of equations where each variable depends on past values of all variables.
**When to use:**
- Multiple interrelated time series
- Bidirectional relationships
- Granger causality testing
**Model**: Each variable is AR on all variables:
- y₁ₜ = c₁ + φ₁₁y₁ₜ₋₁ + φ₁₂y₂ₜ₋₁ + ... + ε₁ₜ
- y₂ₜ = c₂ + φ₂₁y₁ₜ₋₁ + φ₂₂y₂ₜ₋₁ + ... + ε₂ₜ
```python
from statsmodels.tsa.api import VAR
import pandas as pd
# Data should be DataFrame with multiple columns
# Each column is a time series
df_multivariate = pd.DataFrame({'series1': y1, 'series2': y2, 'series3': y3})
# Fit VAR
model = VAR(df_multivariate)
# Select lag order using AIC/BIC
lag_order_results = model.select_order(maxlags=15)
print(lag_order_results.summary())
# Fit with optimal lags
results = model.fit(maxlags=5, ic='aic')
print(results.summary())
```
**Granger causality testing:**
```python
# Test if series1 Granger-causes series2
from statsmodels.tsa.stattools import grangercausalitytests
# Requires 2D array [series2, series1]
test_data = df_multivariate[['series2', 'series1']]
# Test up to max_lag
max_lag = 5
results = grangercausalitytests(test_data, max_lag, verbose=True)
# P-values for each lag
for lag in range(1, max_lag + 1):
p_value = results[lag][0]['ssr_ftest'][1]
print(f"Lag {lag}: p-value = {p_value:.4f}")
```
**Impulse Response Functions (IRF):**
```python
# Trace effect of shock through system
irf = results.irf(10) # 10 periods ahead
# Plot IRFs
irf.plot(orth=True) # Orthogonalized (Cholesky decomposition)
plt.show()
# Cumulative effects
irf.plot_cum_effects(orth=True)
plt.show()
```
**Forecast Error Variance Decomposition:**
```python
# Contribution of each variable to forecast error variance
fevd = results.fevd(10) # 10 periods ahead
fevd.plot()
plt.show()
```
### VARMAX (VAR with Moving Average and Exogenous Variables)
Extends VAR with MA component and external regressors.
**When to use:**
- VAR inadequate (MA component needed)
- External variables affect system
- More flexible multivariate model
```python
from statsmodels.tsa.statespace.varmax import VARMAX
# VARMAX(p, q) with exogenous variables
model = VARMAX(df_multivariate,
order=(1, 1), # (p, q)
exog=X_exog)
results = model.fit()
print(results.summary())
```
## State Space Models
Flexible framework for custom time series models.
**When to use:**
- Custom model specification
- Unobserved components
- Kalman filtering/smoothing
- Missing data
```python
from statsmodels.tsa.statespace.mlemodel import MLEModel
# Extend MLEModel for custom state space models
# Example: Local level model (random walk + noise)
```
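As a concrete illustration, a minimal local level model (random walk level plus observation noise) can be written by subclassing `MLEModel`. This follows the pattern in statsmodels' custom state space documentation; the parameter names are chosen here for illustration:
```python
import numpy as np
from statsmodels.tsa.statespace.mlemodel import MLEModel

class LocalLevel(MLEModel):
    """Local level model: y_t = mu_t + e_t,  mu_t = mu_{t-1} + w_t."""

    def __init__(self, endog):
        # One unobserved state (the level) with diffuse initialization
        super().__init__(endog, k_states=1, initialization='diffuse')
        self['design', 0, 0] = 1.0      # y_t loads directly on the level
        self['transition', 0, 0] = 1.0  # level follows a random walk
        self['selection', 0, 0] = 1.0

    @property
    def param_names(self):
        return ['sigma2.measurement', 'sigma2.level']

    @property
    def start_params(self):
        return [np.nanvar(self.endog), 0.1 * np.nanvar(self.endog)]

    def transform_params(self, unconstrained):
        return unconstrained ** 2   # keep both variances positive

    def untransform_params(self, constrained):
        return constrained ** 0.5

    def update(self, params, **kwargs):
        params = super().update(params, **kwargs)
        self['obs_cov', 0, 0] = params[0]    # measurement noise variance
        self['state_cov', 0, 0] = params[1]  # level innovation variance

results = LocalLevel(y).fit()
print(results.summary())
```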
**Dynamic Factor Models:**
```python
from statsmodels.tsa.statespace.dynamic_factor import DynamicFactor
# Extract common factors from multiple time series
model = DynamicFactor(df_multivariate,
k_factors=2, # Number of factors
factor_order=2) # AR order of factors
results = model.fit()
# Estimated factors
factors = results.factors.filtered
```
## Forecasting
### Point Forecasts
```python
# ARIMA forecasting
model = ARIMA(y, order=(1, 1, 1))
results = model.fit()
# Forecast h steps ahead
h = 10
forecast = results.forecast(steps=h)
# With exogenous variables (SARIMAX)
model = SARIMAX(y, exog=X, order=(1, 1, 1))
results = model.fit()
# Need future exogenous values
forecast = results.forecast(steps=h, exog=X_future)
```
### Prediction Intervals
```python
# Get forecast with confidence intervals
forecast_obj = results.get_forecast(steps=h)
forecast_df = forecast_obj.summary_frame()
print(forecast_df)
# Contains: mean, mean_se, mean_ci_lower, mean_ci_upper
# Extract components
forecast_mean = forecast_df['mean']
forecast_ci_lower = forecast_df['mean_ci_lower']
forecast_ci_upper = forecast_df['mean_ci_upper']
# Plot
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 6))
plt.plot(y.index, y, label='Historical')
plt.plot(forecast_df.index, forecast_mean, label='Forecast', color='red')
plt.fill_between(forecast_df.index,
forecast_ci_lower,
forecast_ci_upper,
alpha=0.3, color='red', label='95% CI')
plt.legend()
plt.title('Forecast with Prediction Intervals')
plt.show()
```
### Dynamic vs Static Forecasts
```python
# Static (one-step-ahead, using actual values)
static_forecast = results.get_prediction(start=split_point, end=len(y)-1)
# Dynamic (multi-step, using predicted values)
dynamic_forecast = results.get_prediction(start=split_point,
end=len(y)-1,
dynamic=True)
# Plot comparison
fig, ax = plt.subplots(figsize=(12, 6))
y.plot(ax=ax, label='Actual')
static_forecast.predicted_mean.plot(ax=ax, label='Static forecast')
dynamic_forecast.predicted_mean.plot(ax=ax, label='Dynamic forecast')
ax.legend()
plt.show()
```
## Diagnostic Tests
### Stationarity Tests
```python
from statsmodels.tsa.stattools import adfuller, kpss
# Augmented Dickey-Fuller (ADF) test
# H0: unit root (non-stationary)
adf_result = adfuller(y, autolag='AIC')
print(f"ADF Statistic: {adf_result[0]:.4f}")
print(f"p-value: {adf_result[1]:.4f}")
if adf_result[1] <= 0.05:
print("Reject H0: Series is stationary")
else:
print("Fail to reject H0: Series is non-stationary")
# KPSS test
# H0: stationary (opposite of ADF)
kpss_result = kpss(y, regression='c', nlags='auto')
print(f"KPSS Statistic: {kpss_result[0]:.4f}")
print(f"p-value: {kpss_result[1]:.4f}")
if kpss_result[1] <= 0.05:
print("Reject H0: Series is non-stationary")
else:
print("Fail to reject H0: Series is stationary")
```
### Residual Diagnostics
```python
# Ljung-Box test for autocorrelation in residuals
from statsmodels.stats.diagnostic import acorr_ljungbox
lb_test = acorr_ljungbox(results.resid, lags=10, return_df=True)
print(lb_test)
# P-values > 0.05 indicate no significant autocorrelation (good)
# Plot residual diagnostics
results.plot_diagnostics(figsize=(12, 8))
plt.show()
# Components:
# 1. Standardized residuals over time
# 2. Histogram + KDE of residuals
# 3. Q-Q plot for normality
# 4. Correlogram (ACF of residuals)
```
### Heteroskedasticity Tests
```python
from statsmodels.stats.diagnostic import het_arch
# ARCH test for heteroskedasticity
arch_test = het_arch(results.resid, nlags=10)
print(f"ARCH test statistic: {arch_test[0]:.4f}")
print(f"p-value: {arch_test[1]:.4f}")
# If significant, consider GARCH model
```
## Seasonal Decomposition
```python
from statsmodels.tsa.seasonal import seasonal_decompose
# Decompose into trend, seasonal, residual
decomposition = seasonal_decompose(y,
model='additive', # or 'multiplicative'
period=12) # seasonal period
# Plot components
fig = decomposition.plot()
fig.set_size_inches(12, 8)
plt.show()
# Access components
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
# STL decomposition (more robust)
from statsmodels.tsa.seasonal import STL
stl = STL(y, seasonal=13) # seasonal must be odd
stl_result = stl.fit()
fig = stl_result.plot()
plt.show()
```
## Model Evaluation
### In-Sample Metrics
```python
# From results object
print(f"AIC: {results.aic:.2f}")
print(f"BIC: {results.bic:.2f}")
print(f"Log-likelihood: {results.llf:.2f}")
# MSE on training data
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y, results.fittedvalues)
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:.4f}")
# MAE
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y, results.fittedvalues)
print(f"MAE: {mae:.4f}")
```
### Out-of-Sample Evaluation
```python
# Train-test split for time series (no shuffle!)
train_size = int(0.8 * len(y))
y_train = y[:train_size]
y_test = y[train_size:]
# Fit on training data
model = ARIMA(y_train, order=(1, 1, 1))
results = model.fit()
# Forecast test period
forecast = results.forecast(steps=len(y_test))
# Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error
rmse = np.sqrt(mean_squared_error(y_test, forecast))
mae = mean_absolute_error(y_test, forecast)
mape = np.mean(np.abs((y_test - forecast) / y_test)) * 100
print(f"Test RMSE: {rmse:.4f}")
print(f"Test MAE: {mae:.4f}")
print(f"Test MAPE: {mape:.2f}%")
```
### Rolling Forecast
```python
# More realistic evaluation: rolling one-step-ahead forecasts
forecasts = []
for t in range(len(y_test)):
    # Refit with all data observed so far (expanding window)
    y_current = y[:train_size + t]
    fit = ARIMA(y_current, order=(1, 1, 1)).fit()
    # One-step-ahead forecast (use .iloc: the forecast Series is labeled
    # with future index values, so positional access is needed)
    fc = fit.forecast(steps=1).iloc[0]
    forecasts.append(fc)
forecasts = np.array(forecasts)
rmse = np.sqrt(mean_squared_error(y_test, forecasts))
print(f"Rolling forecast RMSE: {rmse:.4f}")
```
### Cross-Validation
```python
# Time series cross-validation (expanding window)
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
rmse_scores = []
for train_idx, test_idx in tscv.split(y):
y_train_cv = y.iloc[train_idx]
y_test_cv = y.iloc[test_idx]
model = ARIMA(y_train_cv, order=(1, 1, 1))
results = model.fit()
forecast = results.forecast(steps=len(test_idx))
rmse = np.sqrt(mean_squared_error(y_test_cv, forecast))
rmse_scores.append(rmse)
print(f"CV RMSE: {np.mean(rmse_scores):.4f} ± {np.std(rmse_scores):.4f}")
```
## Advanced Topics
### ARDL (Autoregressive Distributed Lag)
Models y as a function of its own lags and distributed lags of exogenous regressors, bridging univariate and multivariate approaches.
```python
from statsmodels.tsa.ardl import ARDL
# ARDL(p, q) model
# y depends on its own lags and lags of X
model = ARDL(y, lags=2, exog=X, order=2)  # 'order' sets the exog lag length
results = model.fit()
```
### Error Correction Models
For cointegrated series.
```python
from statsmodels.tsa.vector_ar.vecm import coint_johansen
# Test for cointegration
johansen_test = coint_johansen(df_multivariate, det_order=0, k_ar_diff=1)
# Fit VECM if cointegrated
from statsmodels.tsa.vector_ar.vecm import VECM
model = VECM(df_multivariate, k_ar_diff=1, coint_rank=1)
results = model.fit()
```
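To choose the cointegration rank, compare each trace statistic (`lr1`) against its critical values (`cvt`, with columns for the 90/95/99% levels):
```python
# Column 1 of cvt holds the 5% critical values
for r in range(len(johansen_test.lr1)):
    trace_stat = johansen_test.lr1[r]
    crit_5 = johansen_test.cvt[r, 1]
    decision = "reject" if trace_stat > crit_5 else "fail to reject"
    print(f"H0: rank <= {r}: trace = {trace_stat:.2f}, "
          f"5% critical value = {crit_5:.2f} -> {decision}")
```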
### Regime Switching Models
For structural breaks and regime changes.
```python
from statsmodels.tsa.regime_switching.markov_regression import MarkovRegression
# Markov switching model
model = MarkovRegression(y, k_regimes=2, order=1)
results = model.fit()
# Smoothed probabilities of regimes
regime_probs = results.smoothed_marginal_probabilities
```
## Best Practices
1. **Check stationarity**: Difference if needed, verify with ADF/KPSS tests
2. **Plot data**: Always visualize before modeling
3. **Identify seasonality**: Use appropriate seasonal models (SARIMAX, Holt-Winters)
4. **Model selection**: Use AIC/BIC and out-of-sample validation
5. **Residual diagnostics**: Check for autocorrelation, normality, heteroskedasticity
6. **Forecast evaluation**: Use rolling forecasts and proper time series CV
7. **Avoid overfitting**: Prefer simpler models, use information criteria
8. **Document assumptions**: Note any data transformations (log, differencing)
9. **Prediction intervals**: Always provide uncertainty estimates
10. **Refit regularly**: Update models as new data arrives
## Common Pitfalls
1. **Not checking stationarity**: Fit ARIMA on non-stationary data
2. **Data leakage**: Using future data in transformations
3. **Wrong seasonal period**: S=4 for quarterly, S=12 for monthly
4. **Overfitting**: Too many parameters relative to data
5. **Ignoring residual autocorrelation**: Model inadequate
6. **Using inappropriate metrics**: MAPE fails with zeros or negatives (see the sMAPE sketch after this list)
7. **Not handling missing data**: Affects model estimation
8. **Extrapolating exogenous variables**: Need future X values for SARIMAX
9. **Confusing static vs dynamic forecasts**: Dynamic more realistic for multi-step
10. **Not validating forecasts**: Always check out-of-sample performance
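On pitfall 6: when the test series can contain zeros, a symmetric variant such as sMAPE degrades more gracefully (though it is still undefined when actual and forecast are both zero). A minimal sketch:
```python
import numpy as np

def smape(actual, forecast):
    """Symmetric MAPE in percent; tolerates zeros in one series at a time."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denom = (np.abs(actual) + np.abs(forecast)) / 2
    return np.mean(np.abs(actual - forecast) / denom) * 100

print(f"Test sMAPE: {smape(y_test, forecasts):.2f}%")
```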