Initial commit
608
skills/statsmodels/SKILL.md
Normal file
@@ -0,0 +1,608 @@
---
name: statsmodels
description: "Statistical modeling toolkit. OLS, GLM, logistic, ARIMA, time series, hypothesis tests, diagnostics, AIC/BIC, for rigorous statistical inference and econometric analysis."
---

# Statsmodels: Statistical Modeling and Econometrics

## Overview

Statsmodels is Python's premier library for statistical modeling, providing tools for estimation, inference, and diagnostics across a wide range of statistical methods. Apply this skill for rigorous statistical analysis, from simple linear regression to complex time series models and econometric analyses.

## When to Use This Skill

This skill should be used when:
- Fitting regression models (OLS, WLS, GLS, quantile regression)
- Performing generalized linear modeling (logistic, Poisson, Gamma, etc.)
- Analyzing discrete outcomes (binary, multinomial, count, ordinal)
- Conducting time series analysis (ARIMA, SARIMAX, VAR, forecasting)
- Running statistical tests and diagnostics
- Testing model assumptions (heteroskedasticity, autocorrelation, normality)
- Detecting outliers and influential observations
- Comparing models (AIC/BIC, likelihood ratio tests)
- Estimating causal effects
- Producing publication-ready statistical tables and inference

## Quick Start Guide

### Linear Regression (OLS)

```python
import statsmodels.api as sm
import numpy as np
import pandas as pd

# Prepare data - ALWAYS add constant for intercept
X = sm.add_constant(X_data)

# Fit OLS model
model = sm.OLS(y, X)
results = model.fit()

# View comprehensive results
print(results.summary())

# Key results
print(f"R-squared: {results.rsquared:.4f}")
print(f"Coefficients:\n{results.params}")
print(f"P-values:\n{results.pvalues}")

# Predictions with confidence intervals
# X_new must have the same columns as X, including the constant
predictions = results.get_prediction(X_new)
pred_summary = predictions.summary_frame()
print(pred_summary)  # includes mean, CI, prediction intervals

# Diagnostics
from statsmodels.stats.diagnostic import het_breuschpagan
bp_test = het_breuschpagan(results.resid, X)
print(f"Breusch-Pagan p-value: {bp_test[1]:.4f}")

# Visualize residuals
import matplotlib.pyplot as plt
plt.scatter(results.fittedvalues, results.resid)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()
```

### Logistic Regression (Binary Outcomes)

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.discrete_model import Logit

# Add constant
X = sm.add_constant(X_data)

# Fit logit model
model = Logit(y_binary, X)
results = model.fit()

print(results.summary())

# Odds ratios
odds_ratios = np.exp(results.params)
print("Odds ratios:\n", odds_ratios)

# Predicted probabilities
probs = results.predict(X)

# Binary predictions (0.5 threshold)
predictions = (probs > 0.5).astype(int)

# Model evaluation (in-sample here; use a holdout set for an honest estimate)
from sklearn.metrics import classification_report, roc_auc_score

print(classification_report(y_binary, predictions))
print(f"AUC: {roc_auc_score(y_binary, probs):.4f}")

# Marginal effects
marginal = results.get_margeff()
print(marginal.summary())
```

### Time Series (ARIMA)

```python
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Check stationarity
from statsmodels.tsa.stattools import adfuller

adf_result = adfuller(y_series)
print(f"ADF p-value: {adf_result[1]:.4f}")

if adf_result[1] > 0.05:
    # Unit root not rejected: treat the series as non-stationary and difference it
    y_diff = y_series.diff().dropna()
else:
    # Series already looks stationary; no differencing needed
    y_diff = y_series

# Plot ACF/PACF to identify p, q
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
plot_acf(y_diff, lags=40, ax=ax1)
plot_pacf(y_diff, lags=40, ax=ax2)
plt.show()

# Fit ARIMA(p,d,q) on the original series; d=1 differences internally
# (use d=0 if the ADF test above indicated stationarity)
model = ARIMA(y_series, order=(1, 1, 1))
results = model.fit()

print(results.summary())

# Forecast
forecast = results.forecast(steps=10)
forecast_obj = results.get_forecast(steps=10)
forecast_df = forecast_obj.summary_frame()

print(forecast_df)  # includes mean and confidence intervals

# Residual diagnostics
results.plot_diagnostics(figsize=(12, 8))
plt.show()
```

### Generalized Linear Models (GLM)

```python
import numpy as np
import statsmodels.api as sm

# Poisson regression for count data
X = sm.add_constant(X_data)
model = sm.GLM(y_counts, X, family=sm.families.Poisson())
results = model.fit()

print(results.summary())

# Rate ratios (for Poisson with log link)
rate_ratios = np.exp(results.params)
print("Rate ratios:\n", rate_ratios)

# Check overdispersion (Pearson chi-square / residual degrees of freedom)
overdispersion = results.pearson_chi2 / results.df_resid
print(f"Overdispersion: {overdispersion:.2f}")

if overdispersion > 1.5:
    # Use Negative Binomial instead
    from statsmodels.discrete.discrete_model import NegativeBinomial
    nb_model = NegativeBinomial(y_counts, X)
    nb_results = nb_model.fit()
    print(nb_results.summary())
```

## Core Statistical Modeling Capabilities

### 1. Linear Regression Models

Comprehensive suite of linear models for continuous outcomes with various error structures.

**Available models:**
- **OLS**: Standard linear regression with i.i.d. errors
- **WLS**: Weighted least squares for heteroskedastic errors
- **GLS**: Generalized least squares for arbitrary covariance structure
- **GLSAR**: GLS with autoregressive errors for time series
- **Quantile Regression**: Conditional quantiles (robust to outliers)
- **Mixed Effects**: Hierarchical/multilevel models with random effects
- **Recursive/Rolling**: Time-varying parameter estimation

**Key features:**
- Comprehensive diagnostic tests
- Robust standard errors (HC, HAC, cluster-robust)
- Influence statistics (Cook's distance, leverage, DFFITS)
- Hypothesis testing (F-tests, Wald tests)
- Model comparison (AIC, BIC, likelihood ratio tests)
- Prediction with confidence and prediction intervals

**When to use:** Continuous outcome variable, want inference on coefficients, need diagnostics

**Reference:** See `references/linear_models.md` for detailed guidance on model selection, diagnostics, and best practices.

### 2. Generalized Linear Models (GLM)

Flexible framework extending linear models to non-normal distributions.

**Distribution families:**
- **Binomial**: Binary outcomes or proportions (logistic regression)
- **Poisson**: Count data
- **Negative Binomial**: Overdispersed counts
- **Gamma**: Positive continuous, right-skewed data
- **Inverse Gaussian**: Positive continuous with specific variance structure
- **Gaussian**: Equivalent to OLS
- **Tweedie**: Flexible family for semi-continuous data

**Link functions:**
- Logit, Probit, Log, Identity, Inverse, Sqrt, CLogLog, Power
- Choose based on interpretation needs and model fit

**Key features:**
- Maximum likelihood estimation via IRLS
- Deviance and Pearson residuals
- Goodness-of-fit statistics
- Pseudo R-squared measures
- Robust standard errors

**When to use:** Non-normal outcomes, need flexible variance and link specifications

**Reference:** See `references/glm.md` for family selection, link functions, interpretation, and diagnostics.

### 3. Discrete Choice Models

Models for categorical and count outcomes.

**Binary models:**
- **Logit**: Logistic regression (odds ratios)
- **Probit**: Probit regression (normal distribution)

**Multinomial models:**
- **MNLogit**: Unordered categories (3+ levels)
- **Conditional Logit**: Choice models with alternative-specific variables
- **Ordered Model**: Ordinal outcomes (ordered categories)

**Count models:**
- **Poisson**: Standard count model
- **Negative Binomial**: Overdispersed counts
- **Zero-Inflated**: Excess zeros (ZIP, ZINB)
- **Hurdle Models**: Two-stage models for zero-heavy data

**Key features:**
- Maximum likelihood estimation
- Marginal effects at means or average marginal effects
- Model comparison via AIC/BIC
- Predicted probabilities and classification
- Goodness-of-fit tests

**When to use:** Binary, categorical, or count outcomes

**Reference:** See `references/discrete_choice.md` for model selection, interpretation, and evaluation.

### 4. Time Series Analysis

Comprehensive time series modeling and forecasting capabilities.

**Univariate models:**
- **AutoReg (AR)**: Autoregressive models
- **ARIMA**: Autoregressive integrated moving average
- **SARIMAX**: Seasonal ARIMA with exogenous variables
- **Exponential Smoothing**: Simple, Holt, Holt-Winters
- **ETS**: Innovations state space models

**Multivariate models:**
- **VAR**: Vector autoregression
- **VARMAX**: VAR with MA and exogenous variables
- **Dynamic Factor Models**: Extract common factors
- **VECM**: Vector error correction models (cointegration)

**Advanced models:**
- **State Space**: Kalman filtering, custom specifications
- **Regime Switching**: Markov switching models
- **ARDL**: Autoregressive distributed lag

**Key features:**
- ACF/PACF analysis for model identification
- Stationarity tests (ADF, KPSS)
- Forecasting with prediction intervals
- Residual diagnostics (Ljung-Box, heteroskedasticity)
- Granger causality testing
- Impulse response functions (IRF)
- Forecast error variance decomposition (FEVD)

**When to use:** Time-ordered data, forecasting, understanding temporal dynamics

**Reference:** See `references/time_series.md` for model selection, diagnostics, and forecasting methods.

### 5. Statistical Tests and Diagnostics

Extensive testing and diagnostic capabilities for model validation.

**Residual diagnostics:**
- Autocorrelation tests (Ljung-Box, Durbin-Watson, Breusch-Godfrey)
- Heteroskedasticity tests (Breusch-Pagan, White, ARCH)
- Normality tests (Jarque-Bera, Omnibus, Anderson-Darling, Lilliefors)
- Specification tests (RESET, Harvey-Collier)

**Influence and outliers:**
- Leverage (hat values)
- Cook's distance
- DFFITS and DFBETAs
- Studentized residuals
- Influence plots

**Hypothesis testing:**
- t-tests (one-sample, two-sample, paired)
- Proportion tests
- Chi-square tests
- Non-parametric tests (Mann-Whitney, Wilcoxon, Kruskal-Wallis)
- ANOVA (one-way, two-way, repeated measures)

**Multiple comparisons:**
- Tukey's HSD
- Bonferroni correction
- False Discovery Rate (FDR)

**Effect sizes and power:**
- Cohen's d, eta-squared
- Power analysis for t-tests, proportions
- Sample size calculations

**Robust inference:**
- Heteroskedasticity-consistent SEs (HC0-HC3)
- HAC standard errors (Newey-West)
- Cluster-robust standard errors

**When to use:** Validating assumptions, detecting problems, ensuring robust inference

**Reference:** See `references/stats_diagnostics.md` for comprehensive testing and diagnostic procedures.
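
As a small illustration of the multiple-comparison corrections listed above, the sketch below adjusts a handful of p-values with `multipletests`. The p-values are made up for the example; only the Bonferroni and Benjamini-Hochberg (`fdr_bh`) methods are shown.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from five separate tests
pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.20])

# Bonferroni: controls the family-wise error rate (conservative)
reject_bonf, p_bonf, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg: controls the false discovery rate
reject_bh, p_bh, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

print("Bonferroni adjusted:", p_bonf, reject_bonf)
print("BH adjusted:        ", p_bh, reject_bh)
```

Without correction, four of the five tests would be declared significant at the 0.05 level; after either adjustment only the two smallest p-values remain significant in this example.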

## Formula API (R-style)

Statsmodels supports R-style formulas for intuitive model specification:

```python
import statsmodels.formula.api as smf

# OLS with formula
results = smf.ols('y ~ x1 + x2 + x1:x2', data=df).fit()

# Categorical variables (automatic dummy coding)
results = smf.ols('y ~ x1 + C(category)', data=df).fit()

# Interactions
results = smf.ols('y ~ x1 * x2', data=df).fit()  # x1 + x2 + x1:x2

# Polynomial terms
results = smf.ols('y ~ x + I(x**2)', data=df).fit()

# Logit
results = smf.logit('y ~ x1 + x2 + C(group)', data=df).fit()

# Poisson
results = smf.poisson('count ~ x1 + x2', data=df).fit()

# ARIMA (not available via formula, use regular API)
```

## Model Selection and Comparison

### Information Criteria

```python
import pandas as pd

# Compare fitted models using AIC/BIC
# (model1_results etc. are results objects fitted to the same data)
models = {
    'Model 1': model1_results,
    'Model 2': model2_results,
    'Model 3': model3_results
}

comparison = pd.DataFrame({
    'AIC': {name: res.aic for name, res in models.items()},
    'BIC': {name: res.bic for name, res in models.items()},
    'Log-Likelihood': {name: res.llf for name, res in models.items()}
})

print(comparison.sort_values('AIC'))
# Lower AIC/BIC indicates better model
```

### Likelihood Ratio Test (Nested Models)

```python
# For nested models (one is subset of the other)
# full_model and reduced_model are fitted results objects
from scipy import stats

lr_stat = 2 * (full_model.llf - reduced_model.llf)
df_diff = full_model.df_model - reduced_model.df_model  # number of extra parameters
p_value = stats.chi2.sf(lr_stat, df_diff)  # upper-tail probability

print(f"LR statistic: {lr_stat:.4f}")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("Full model significantly better")
else:
    print("Reduced model preferred (parsimony)")
```

### Cross-Validation

```python
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

# X (with constant) and y are assumed to be pandas objects, hence .iloc
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = []

for train_idx, val_idx in kf.split(X):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

    # Fit model
    model = sm.OLS(y_train, X_train).fit()

    # Predict
    y_pred = model.predict(X_val)

    # Score
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    cv_scores.append(rmse)

print(f"CV RMSE: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}")
```

## Best Practices

### Data Preparation

1. **Always add constant**: Use `sm.add_constant()` unless excluding intercept
2. **Check for missing values**: Handle or impute before fitting
3. **Scale if needed**: Rescaling predictors can improve optimizer convergence and make coefficients easier to interpret
4. **Encode categoricals**: Use formula API or manual dummy coding

### Model Building

1. **Start simple**: Begin with basic model, add complexity as needed
2. **Check assumptions**: Test residuals, heteroskedasticity, autocorrelation
3. **Use appropriate model**: Match model to outcome type (binary→Logit, count→Poisson)
4. **Consider alternatives**: If assumptions violated, use robust methods or different model

### Inference

1. **Report effect sizes**: Not just p-values
2. **Use robust SEs**: When heteroskedasticity or clustering present
3. **Multiple comparisons**: Correct when testing many hypotheses
4. **Confidence intervals**: Always report alongside point estimates

### Model Evaluation

1. **Check residuals**: Plot residuals vs fitted, Q-Q plot
2. **Influence diagnostics**: Identify and investigate influential observations
3. **Out-of-sample validation**: Test on holdout set or cross-validate
4. **Compare models**: Use AIC/BIC for non-nested, LR test for nested

### Reporting

1. **Comprehensive summary**: Use `.summary()` for detailed output
2. **Document decisions**: Note transformations, excluded observations
3. **Interpret carefully**: Account for link functions (e.g., exp(β) for log link)
4. **Visualize**: Plot predictions, confidence intervals, diagnostics

## Common Workflows

### Workflow 1: Linear Regression Analysis

1. Explore data (plots, descriptives)
2. Fit initial OLS model
3. Check residual diagnostics
4. Test for heteroskedasticity, autocorrelation
5. Check for multicollinearity (VIF)
6. Identify influential observations
7. Refit with robust SEs if needed
8. Interpret coefficients and inference
9. Validate on holdout or via CV

### Workflow 2: Binary Classification

1. Fit logistic regression (Logit)
2. Check for convergence issues
3. Interpret odds ratios
4. Calculate marginal effects
5. Evaluate classification performance (AUC, confusion matrix)
6. Check for influential observations
7. Compare with alternative models (Probit)
8. Validate predictions on test set

### Workflow 3: Count Data Analysis

1. Fit Poisson regression
2. Check for overdispersion
3. If overdispersed, fit Negative Binomial
4. Check for excess zeros (consider ZIP/ZINB)
5. Interpret rate ratios
6. Assess goodness of fit
7. Compare models via AIC
8. Validate predictions

### Workflow 4: Time Series Forecasting

1. Plot series, check for trend/seasonality
2. Test for stationarity (ADF, KPSS)
3. Difference if non-stationary
4. Identify p, q from ACF/PACF
5. Fit ARIMA or SARIMAX
6. Check residual diagnostics (Ljung-Box)
7. Generate forecasts with confidence intervals
8. Evaluate forecast accuracy on test set

## Reference Documentation

This skill includes comprehensive reference files for detailed guidance:

### references/linear_models.md
Detailed coverage of linear regression models including:
- OLS, WLS, GLS, GLSAR, Quantile Regression
- Mixed effects models
- Recursive and rolling regression
- Comprehensive diagnostics (heteroskedasticity, autocorrelation, multicollinearity)
- Influence statistics and outlier detection
- Robust standard errors (HC, HAC, cluster)
- Hypothesis testing and model comparison

### references/glm.md
Complete guide to generalized linear models:
- All distribution families (Binomial, Poisson, Gamma, etc.)
- Link functions and when to use each
- Model fitting and interpretation
- Pseudo R-squared and goodness of fit
- Diagnostics and residual analysis
- Applications (logistic, Poisson, Gamma regression)

### references/discrete_choice.md
Comprehensive guide to discrete outcome models:
- Binary models (Logit, Probit)
- Multinomial models (MNLogit, Conditional Logit)
- Count models (Poisson, Negative Binomial, Zero-Inflated, Hurdle)
- Ordinal models
- Marginal effects and interpretation
- Model diagnostics and comparison

### references/time_series.md
In-depth time series analysis guidance:
- Univariate models (AR, ARIMA, SARIMAX, Exponential Smoothing)
- Multivariate models (VAR, VARMAX, Dynamic Factor)
- State space models
- Stationarity testing and diagnostics
- Forecasting methods and evaluation
- Granger causality, IRF, FEVD

### references/stats_diagnostics.md
Comprehensive statistical testing and diagnostics:
- Residual diagnostics (autocorrelation, heteroskedasticity, normality)
- Influence and outlier detection
- Hypothesis tests (parametric and non-parametric)
- ANOVA and post-hoc tests
- Multiple comparisons correction
- Robust covariance matrices
- Power analysis and effect sizes

**When to reference:**
- Need detailed parameter explanations
- Choosing between similar models
- Troubleshooting convergence or diagnostic issues
- Understanding specific test statistics
- Looking for code examples for advanced features

**Search patterns:**
```bash
# Find information about specific models
grep -r "Quantile Regression" references/

# Find diagnostic tests
grep -r "Breusch-Pagan" references/stats_diagnostics.md

# Find time series guidance
grep -r "SARIMAX" references/time_series.md
```

## Common Pitfalls to Avoid

1. **Forgetting constant term**: Always use `sm.add_constant()` unless no intercept desired
2. **Ignoring assumptions**: Check residuals, heteroskedasticity, autocorrelation
3. **Wrong model for outcome type**: Binary→Logit/Probit, Count→Poisson/NB, not OLS
4. **Not checking convergence**: Look for optimization warnings
5. **Misinterpreting coefficients**: Remember link functions (log, logit, etc.)
6. **Using Poisson with overdispersion**: Check dispersion, use Negative Binomial if needed
7. **Not using robust SEs**: When heteroskedasticity or clustering present
8. **Overfitting**: Too many parameters relative to sample size
9. **Data leakage**: Fitting on test data or using future information
10. **Not validating predictions**: Always check out-of-sample performance
11. **Using an LR test on non-nested models**: The LR test requires nested models; compare non-nested models with AIC/BIC
12. **Ignoring influential observations**: Check Cook's distance and leverage
13. **Multiple testing**: Correct p-values when testing many hypotheses
14. **Not differencing time series**: Fitting ARMA terms to non-stationary data; test for stationarity first and difference (or set `d` in ARIMA) as needed
15. **Confusing prediction vs confidence intervals**: Prediction intervals are wider

## Getting Help

For detailed documentation and examples:
- Official docs: https://www.statsmodels.org/stable/
- User guide: https://www.statsmodels.org/stable/user-guide.html
- Examples: https://www.statsmodels.org/stable/examples/index.html
- API reference: https://www.statsmodels.org/stable/api.html
669
skills/statsmodels/references/discrete_choice.md
Normal file
@@ -0,0 +1,669 @@
# Discrete Choice Models Reference

This document provides comprehensive guidance on discrete choice models in statsmodels, including binary, multinomial, count, and ordinal models.

## Overview

Discrete choice models handle outcomes that are:
- **Binary**: 0/1, success/failure
- **Multinomial**: Multiple unordered categories
- **Ordinal**: Ordered categories
- **Count**: Non-negative integers

All models here are estimated by maximum likelihood and assume that observations are independent.

## Binary Models

### Logit (Logistic Regression)

Uses logistic distribution for binary outcomes.

**When to use:**
- Binary classification (yes/no, success/failure)
- Probability estimation for binary outcomes
- Interpretable odds ratios

**Model**: P(Y=1|X) = 1 / (1 + exp(-Xβ))

```python
import statsmodels.api as sm
from statsmodels.discrete.discrete_model import Logit

# Prepare data
X = sm.add_constant(X_data)

# Fit model
model = Logit(y, X)
results = model.fit()

print(results.summary())
```

**Interpretation:**
```python
import numpy as np

# Odds ratios
odds_ratios = np.exp(results.params)
print("Odds ratios:", odds_ratios)

# For 1-unit increase in X, odds multiply by exp(β)
# OR > 1: increases odds of success
# OR < 1: decreases odds of success
# OR = 1: no effect

# Confidence intervals for odds ratios
odds_ci = np.exp(results.conf_int())
print("Odds ratio 95% CI:")
print(odds_ci)
```
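
A short worked example may help separate the two scales involved: the odds ratio exp(β) is constant, while the change in probability depends on where you start. The intercept and slope below are hypothetical values chosen for the arithmetic, not estimates from any fitted model.

```python
import numpy as np

# Hypothetical logit coefficients (illustrative only)
intercept, beta = -1.0, 0.7


def prob(x):
    """P(Y=1 | x) under the logit model."""
    return 1.0 / (1.0 + np.exp(-(intercept + beta * x)))


def odds(x):
    """Odds of success, p / (1 - p)."""
    p = prob(x)
    return p / (1.0 - p)


# The odds ratio for a 1-unit increase equals exp(beta) at any starting point
odds_ratio_low = odds(1.0) / odds(0.0)
odds_ratio_high = odds(3.0) / odds(2.0)
print(odds_ratio_low, odds_ratio_high, np.exp(beta))

# The change in probability for the same 1-unit increase is not constant
print(prob(1.0) - prob(0.0), prob(3.0) - prob(2.0))
```

This is why odds ratios are convenient to report but marginal effects are needed when the question is about changes in probability.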

**Marginal effects:**
```python
# Average marginal effects (AME): averaged over all observations (the default)
marginal_effects = results.get_margeff(at='overall')
print(marginal_effects.summary())

# Marginal effects at means (MEM)
marginal_effects_mem = results.get_margeff(at='mean', method='dydx')

# Marginal effects at representative values
# atexog keys are zero-indexed column positions in X (column 0 is the constant here)
marginal_effects_custom = results.get_margeff(at='mean',
                                              atexog={1: 1, 2: 5})
```

**Predictions:**
```python
# Predicted probabilities
probs = results.predict(X)

# Binary predictions (0.5 threshold)
predictions = (probs > 0.5).astype(int)

# Custom threshold
threshold = 0.3
predictions_custom = (probs > threshold).astype(int)

# For new data
X_new = sm.add_constant(X_new_data)
new_probs = results.predict(X_new)
```

**Model evaluation:**
```python
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, roc_curve)

# Classification report
print(classification_report(y, predictions))

# Confusion matrix
print(confusion_matrix(y, predictions))

# AUC-ROC
auc = roc_auc_score(y, probs)
print(f"AUC: {auc:.4f}")

# Pseudo R-squared
print(f"McFadden's Pseudo R²: {results.prsquared:.4f}")
```

### Probit

Uses normal distribution for binary outcomes.

**When to use:**
- Binary outcomes
- Prefer normal distribution assumption
- Field convention (econometrics often uses probit)

**Model**: P(Y=1|X) = Φ(Xβ), where Φ is standard normal CDF

```python
from statsmodels.discrete.discrete_model import Probit

model = Probit(y, X)
results = model.fit()

print(results.summary())
```

**Comparison with Logit:**
- Probit and Logit usually give similar results
- Probit: symmetric, based on normal distribution
- Logit: slightly heavier tails, easier interpretation (odds ratios)
- Coefficients not directly comparable (scale difference)

```python
# Marginal effects are comparable
logit_me = logit_results.get_margeff().margeff
probit_me = probit_results.get_margeff().margeff

print("Logit marginal effects:", logit_me)
print("Probit marginal effects:", probit_me)
```

## Multinomial Models

### MNLogit (Multinomial Logit)

For unordered categorical outcomes with 3+ categories.

**When to use:**
- Multiple unordered categories (e.g., transportation mode, brand choice)
- No natural ordering among categories
- Need probabilities for each category

**Model**: P(Y=j|X) = exp(Xβⱼ) / Σₖ exp(Xβₖ)

```python
from statsmodels.discrete.discrete_model import MNLogit

# y should be integers 0, 1, 2, ... for categories
model = MNLogit(y, X)
results = model.fit()

print(results.summary())
```

**Interpretation:**
```python
import numpy as np

# One category is reference (usually category 0)
# Coefficients represent log-odds relative to reference

# For category j vs reference:
# exp(β_j) = odds ratio of category j vs reference

# Predicted probabilities for each category
# np.asarray handles both array and DataFrame return types
probs = np.asarray(results.predict(X))  # Shape: (n_samples, n_categories)

# Most likely category
predicted_categories = probs.argmax(axis=1)
```

**Relative risk ratios:**
```python
# Exponentiate coefficients for relative risk ratios
import numpy as np

# MNLogit params is 2-D: rows are regressors,
# columns are the non-reference categories
rrr = np.exp(results.params)
print(rrr)
```

### Conditional Logit

For choice models where alternatives have characteristics.

**When to use:**
- Alternative-specific regressors (vary across choices)
- Panel data with choices
- Discrete choice experiments

```python
from statsmodels.discrete.conditional_models import ConditionalLogit

# Data structure: long format with choice indicator
model = ConditionalLogit(y_choice, X_alternatives, groups=individual_id)
results = model.fit()
```

## Count Models

### Poisson

Standard model for count data.

**When to use:**
- Count outcomes (events, occurrences)
- Rare events
- Mean ≈ variance

**Model**: P(Y=k|X) = exp(-λ) λᵏ / k!, where log(λ) = Xβ

```python
import numpy as np
from statsmodels.discrete.discrete_model import Poisson

model = Poisson(y_counts, X)
results = model.fit()

print(results.summary())
```

**Interpretation:**
```python
# Rate ratios (incident rate ratios)
rate_ratios = np.exp(results.params)
print("Rate ratios:", rate_ratios)

# For 1-unit increase in X, expected count multiplies by exp(β)
```

**Check overdispersion:**
```python
# Mean and variance should be similar for Poisson
print(f"Mean: {y_counts.mean():.2f}")
print(f"Variance: {y_counts.var():.2f}")

# Overdispersion if variance >> mean
# Rule of thumb: variance/mean > 1.5 suggests overdispersion
overdispersion_ratio = y_counts.var() / y_counts.mean()
print(f"Variance/Mean: {overdispersion_ratio:.2f}")

# Model-based check: Pearson chi-square / residual degrees of freedom
# (values well above 1 suggest overdispersion conditional on X)
mu = results.predict()
pearson_dispersion = ((y_counts - mu) ** 2 / mu).sum() / results.df_resid
print(f"Pearson dispersion: {pearson_dispersion:.2f}")

if overdispersion_ratio > 1.5:
    print("Consider Negative Binomial model")
```

**With offset (for rates):**
```python
# When modeling rates with varying exposure
# log(λ) = log(exposure) + Xβ

model = Poisson(y_counts, X, offset=np.log(exposure))
results = model.fit()
```
|
||||
|
||||
### Negative Binomial
|
||||
|
||||
For overdispersed count data (variance > mean).
|
||||
|
||||
**When to use:**
|
||||
- Count data with overdispersion
|
||||
- Excess variance not explained by Poisson
|
||||
- Heterogeneity in counts
|
||||
|
||||
**Model**: Adds dispersion parameter α to account for overdispersion
|
||||
|
||||
```python
|
||||
from statsmodels.discrete.discrete_model import NegativeBinomial
|
||||
|
||||
model = NegativeBinomial(y_counts, X)
|
||||
results = model.fit()
|
||||
|
||||
print(results.summary())
|
||||
# alpha is the last parameter; np.asarray handles both arrays and pandas Series
print(f"Dispersion parameter alpha: {np.asarray(results.params)[-1]:.4f}")
|
||||
```
|
||||
|
||||
**Compare with Poisson:**
|
||||
```python
|
||||
# Fit both models
|
||||
poisson_results = Poisson(y_counts, X).fit()
|
||||
nb_results = NegativeBinomial(y_counts, X).fit()
|
||||
|
||||
# AIC comparison (lower is better)
|
||||
print(f"Poisson AIC: {poisson_results.aic:.2f}")
|
||||
print(f"Negative Binomial AIC: {nb_results.aic:.2f}")
|
||||
|
||||
# Likelihood ratio test (Poisson is the alpha = 0 special case of NB)
from scipy import stats
lr_stat = 2 * (nb_results.llf - poisson_results.llf)
lr_pval = stats.chi2.sf(lr_stat, df=1)  # 1 extra parameter (alpha)
print(f"LR test p-value: {lr_pval:.4f}")

if lr_pval < 0.05:
    print("Negative Binomial significantly better")
|
||||
```
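
One caveat on the likelihood ratio test: alpha = 0 lies on the boundary of the parameter space, so the plain chi-squared(1) p-value is conservative. A commonly used correction halves it. The sketch below uses only the standard library. The helper name `boundary_lr_pvalue` and the log-likelihood values are hypothetical.

```python
import math

def boundary_lr_pvalue(llf_restricted, llf_full):
    """P-value for an LR test of one parameter on the boundary (e.g. alpha = 0).

    Under H0 the statistic follows a 50:50 mixture of a point mass at zero
    and chi2(1), so the chi2(1) tail probability is halved.
    """
    lr_stat = 2.0 * (llf_full - llf_restricted)
    if lr_stat <= 0:
        return 1.0
    chi2_1_sf = math.erfc(math.sqrt(lr_stat / 2.0))  # P(chi2(1) > lr_stat)
    return 0.5 * chi2_1_sf

# Hypothetical log-likelihoods, for illustration only
p = boundary_lr_pvalue(llf_restricted=-1250.00, llf_full=-1248.08)
print(f"Boundary-corrected p-value: {p:.4f}")
```

With fitted models, `poisson_results.llf` and `nb_results.llf` would take the place of the two hypothetical numbers.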
|
||||
|
||||
### Zero-Inflated Models
|
||||
|
||||
For count data with excess zeros.
|
||||
|
||||
**When to use:**
|
||||
- More zeros than expected from Poisson/NB
|
||||
- Two processes: one for zeros, one for counts
|
||||
- Examples: number of doctor visits, insurance claims
|
||||
|
||||
**Models:**
|
||||
- ZeroInflatedPoisson (ZIP)
|
||||
- ZeroInflatedNegativeBinomialP (ZINB)
|
||||
|
||||
```python
|
||||
from statsmodels.discrete.count_model import (ZeroInflatedPoisson,
                                              ZeroInflatedNegativeBinomialP)
|
||||
|
||||
# ZIP model
|
||||
zip_model = ZeroInflatedPoisson(y_counts, X, exog_infl=X_inflation)
|
||||
zip_results = zip_model.fit()
|
||||
|
||||
# ZINB model (for overdispersion + excess zeros)
|
||||
zinb_model = ZeroInflatedNegativeBinomialP(y_counts, X, exog_infl=X_inflation)
|
||||
zinb_results = zinb_model.fit()
|
||||
|
||||
print(zip_results.summary())
|
||||
```
|
||||
|
||||
**Two parts of the model:**
|
||||
```python
|
||||
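# 1. Inflation model: P(observation comes from the always-zero process)
# 2. Count model: distribution of counts for the remaining observations

# Probability of the always-zero (inflation) component.
# 'prob-main' is the probability of the count component (statsmodels >= 0.14);
# which='prob' instead returns the pmf over the count values 0, 1, 2, ...
inflation_probs = 1 - zip_results.predict(which='prob-main')

# Expected counts (mean of the zero-inflated mixture)
predicted_counts = zip_results.predict(which='mean')

# For new data, pass both design matrices: predict(exog=..., exog_infl=...)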
|
||||
```
|
||||
|
||||
### Hurdle Models
|
||||
|
||||
Two-stage model: whether any counts, then how many.
|
||||
|
||||
**When to use:**
|
||||
- Excess zeros
|
||||
- Different processes for zero vs positive counts
|
||||
- Zeros structurally different from positive values
|
||||
|
||||
```python
|
||||
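from statsmodels.discrete.truncated_model import HurdleCountModel

# dist: zero-truncated model for the positive counts ('poisson' or 'negbin')
# zerodist: count model whose P(Y=0) defines the hurdle ('poisson' or 'negbin')
# Requires statsmodels >= 0.14; the class takes a single exog (no exog_infl)
model = HurdleCountModel(y_counts, X,
                         dist='poisson',
                         zerodist='poisson')
results = model.fit()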
|
||||
|
||||
print(results.summary())
|
||||
```
|
||||
|
||||
## Ordinal Models
|
||||
|
||||
### Ordered Logit/Probit
|
||||
|
||||
For ordered categorical outcomes.
|
||||
|
||||
**When to use:**
|
||||
- Ordered categories (e.g., low/medium/high, ratings 1-5)
|
||||
- Natural ordering matters
|
||||
- Want to respect ordinal structure
|
||||
|
||||
**Model**: Cumulative probability model with cutpoints
|
||||
|
||||
```python
|
||||
from statsmodels.miscmodels.ordinal_model import OrderedModel
|
||||
|
||||
# y: ordered integer codes (0, 1, 2, ...) or an ordered pandas Categorical
# X must NOT contain a constant: the thresholds play the role of intercepts
model = OrderedModel(y_ordered, X, distr='logit')  # or 'probit'
results = model.fit(method='bfgs')
|
||||
|
||||
print(results.summary())
|
||||
```
|
||||
|
||||
**Interpretation:**
|
||||
```python
|
||||
# Parameters are ordered as [slope coefficients, threshold parameters]
n_thresholds = n_categories - 1
coefficients = results.params[:-n_thresholds]
print("Coefficients:", coefficients)

# Thresholds are stored as the first cutpoint followed by log-increments;
# transform_threshold_params returns the cutpoints padded with -inf and +inf
cutpoints = results.model.transform_threshold_params(results.params)[1:-1]
print("Cutpoints:", cutpoints)

# Predicted probabilities for each category
probs = results.predict(X)  # Shape: (n_samples, n_categories)

# Most likely category
predicted_categories = np.asarray(probs).argmax(axis=1)
|
||||
```
|
||||
|
||||
**Proportional odds assumption:**
|
||||
```python
|
||||
# Test if coefficients are same across cutpoints
|
||||
# (Brant test - implement manually or check residuals)
|
||||
|
||||
# Check: model each cutpoint separately and compare coefficients
|
||||
```
|
||||
|
||||
## Model Diagnostics
|
||||
|
||||
### Goodness of Fit
|
||||
|
||||
```python
|
||||
# Pseudo R-squared (McFadden)
|
||||
print(f"Pseudo R²: {results.prsquared:.4f}")
|
||||
|
||||
# AIC/BIC for model comparison
|
||||
print(f"AIC: {results.aic:.2f}")
|
||||
print(f"BIC: {results.bic:.2f}")
|
||||
|
||||
# Log-likelihood
|
||||
print(f"Log-likelihood: {results.llf:.2f}")
|
||||
|
||||
# Likelihood ratio test vs null model
|
||||
lr_stat = 2 * (results.llf - results.llnull)
|
||||
from scipy import stats
|
||||
lr_pval = 1 - stats.chi2.cdf(lr_stat, results.df_model)
|
||||
print(f"LR test p-value: {lr_pval}")
|
||||
```
|
||||
|
||||
### Classification Metrics (Binary)
|
||||
|
||||
```python
|
||||
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
|
||||
f1_score, roc_auc_score)
|
||||
|
||||
# Predictions
|
||||
probs = results.predict(X)
|
||||
predictions = (probs > 0.5).astype(int)
|
||||
|
||||
# Metrics
|
||||
print(f"Accuracy: {accuracy_score(y, predictions):.4f}")
|
||||
print(f"Precision: {precision_score(y, predictions):.4f}")
|
||||
print(f"Recall: {recall_score(y, predictions):.4f}")
|
||||
print(f"F1: {f1_score(y, predictions):.4f}")
|
||||
print(f"AUC: {roc_auc_score(y, probs):.4f}")
|
||||
```
|
||||
|
||||
### Classification Metrics (Multinomial)
|
||||
|
||||
```python
|
||||
from sklearn.metrics import accuracy_score, classification_report, log_loss
|
||||
|
||||
# Predicted categories
|
||||
probs = results.predict(X)
|
||||
predictions = probs.argmax(axis=1)
|
||||
|
||||
# Accuracy
|
||||
accuracy = accuracy_score(y, predictions)
|
||||
print(f"Accuracy: {accuracy:.4f}")
|
||||
|
||||
# Classification report
|
||||
print(classification_report(y, predictions))
|
||||
|
||||
# Log loss
|
||||
logloss = log_loss(y, probs)
|
||||
print(f"Log Loss: {logloss:.4f}")
|
||||
```
|
||||
|
||||
### Count Model Diagnostics
|
||||
|
||||
```python
|
||||
# Observed vs expected frequencies (shown here for a Poisson fit)
from scipy import stats
import matplotlib.pyplot as plt

y_arr = np.asarray(y_counts).astype(int)
mu = np.asarray(results.predict(X))  # fitted means
count_values = np.arange(y_arr.max() + 1)

# Rounding fitted means hides the spread of the distribution, so sum the
# fitted probability of each count value over all observations instead
observed = np.bincount(y_arr, minlength=len(count_values))
expected = stats.poisson.pmf(count_values[:, None], mu[None, :]).sum(axis=1)

# Compare distributions
fig, ax = plt.subplots()
ax.bar(count_values - 0.2, observed, width=0.4, label='Observed')
ax.bar(count_values + 0.2, expected, width=0.4, label='Expected')
ax.legend()
ax.set_xlabel('Count')
ax.set_ylabel('Frequency')
plt.show()

# Rootogram (square-root frequencies): statsmodels has no built-in,
# so a custom implementation is needed
|
||||
```
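
Since a rootogram has to be written by hand, here is a minimal sketch of the numbers behind a hanging rootogram for a Poisson fit. The function name `rootogram_data` is hypothetical and the data are simulated. Plotting is left out so that the block depends only on NumPy and SciPy.

```python
import numpy as np
from scipy import stats

def rootogram_data(y, mu, max_count=None):
    """Square-root observed and expected frequencies for a Poisson fit.

    y: observed counts; mu: fitted means (one per observation).
    """
    y = np.asarray(y, dtype=int)
    mu = np.asarray(mu, dtype=float)
    if max_count is None:
        max_count = int(y.max())
    counts = np.arange(max_count + 1)
    observed = np.bincount(y, minlength=max_count + 1)[: max_count + 1]
    # Expected frequency of count k = sum over observations of P(Y_i = k)
    expected = stats.poisson.pmf(counts[:, None], mu[None, :]).sum(axis=1)
    # Hanging rootogram: bars hang from sqrt(expected); bottoms near 0 mean good fit
    bar_bottom = np.sqrt(expected) - np.sqrt(observed)
    return counts, np.sqrt(observed), np.sqrt(expected), bar_bottom

# Illustration on simulated Poisson data with a constant fitted mean
rng = np.random.default_rng(2)
y_sim = rng.poisson(3.0, size=1000)
mu_sim = np.full(y_sim.shape, y_sim.mean())
counts, sqrt_obs, sqrt_exp, bottom = rootogram_data(y_sim, mu_sim)
print(np.round(bottom, 2))
```

To draw it, plot `sqrt_exp` as a line and hang bars of height `sqrt_obs` from it. Bars that systematically stop above or below zero at particular counts (often at zero) indicate where the model misfits.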
|
||||
|
||||
### Influence and Outliers
|
||||
|
||||
```python
|
||||
# Pearson residuals for a Poisson model: (y - mu) / sqrt(Var(y)), with Var(y) = mu
mu = results.predict(X)
pearson_resid = (y - mu) / np.sqrt(mu)

# Flag large residuals (|residual| > 2)
outliers = np.where(np.abs(pearson_resid) > 2)[0]
print(f"Number of outliers: {len(outliers)}")

# Leverage (hat values), e.g. for logit/probit
# (get_influence() is available on discrete results in recent statsmodels)
influence = results.get_influence()
leverage = influence.hat_matrix_diag
|
||||
```
|
||||
|
||||
## Hypothesis Testing
|
||||
|
||||
```python
|
||||
# Single parameter test (automatic in summary)
|
||||
|
||||
# Multiple parameters: Wald test
|
||||
# Test H0: β₁ = β₂ = 0
# Each row of R is one restriction; R needs one column per parameter (4 here)
R = [[0, 1, 0, 0], [0, 0, 1, 0]]
|
||||
wald_test = results.wald_test(R)
|
||||
print(wald_test)
|
||||
|
||||
# Likelihood ratio test for nested models
|
||||
model_reduced = Logit(y, X_reduced).fit()
|
||||
model_full = Logit(y, X_full).fit()
|
||||
|
||||
lr_stat = 2 * (model_full.llf - model_reduced.llf)
|
||||
df = model_full.df_model - model_reduced.df_model
|
||||
from scipy import stats
|
||||
lr_pval = 1 - stats.chi2.cdf(lr_stat, df)
|
||||
print(f"LR test p-value: {lr_pval:.4f}")
|
||||
```
|
||||
|
||||
## Model Selection and Comparison
|
||||
|
||||
```python
|
||||
# Fit multiple models
|
||||
models = {
|
||||
'Logit': Logit(y, X).fit(),
|
||||
'Probit': Probit(y, X).fit(),
|
||||
# Add more models
|
||||
}
|
||||
|
||||
# Compare AIC/BIC
|
||||
comparison = pd.DataFrame({
|
||||
'AIC': {name: model.aic for name, model in models.items()},
|
||||
'BIC': {name: model.bic for name, model in models.items()},
|
||||
'Pseudo R²': {name: model.prsquared for name, model in models.items()}
|
||||
})
|
||||
print(comparison.sort_values('AIC'))
|
||||
|
||||
# Cross-validation for predictive performance
|
||||
from sklearn.model_selection import cross_val_score
|
||||
from sklearn.linear_model import LogisticRegression
|
||||
|
||||
# Use sklearn wrapper or manual CV
|
||||
```
|
||||
|
||||
## Formula API
|
||||
|
||||
Use R-style formulas for easier specification.
|
||||
|
||||
```python
|
||||
import statsmodels.formula.api as smf
|
||||
|
||||
# Logit with formula
|
||||
formula = 'y ~ x1 + x2 + C(category) + x1:x2'
|
||||
results = smf.logit(formula, data=df).fit()
|
||||
|
||||
# MNLogit with formula
|
||||
results = smf.mnlogit(formula, data=df).fit()
|
||||
|
||||
# Poisson with formula
|
||||
results = smf.poisson(formula, data=df).fit()
|
||||
|
||||
# Negative Binomial with formula
|
||||
results = smf.negativebinomial(formula, data=df).fit()
|
||||
```
|
||||
|
||||
## Common Applications
|
||||
|
||||
### Binary Classification (Marketing Response)
|
||||
|
||||
```python
|
||||
# Predict customer purchase probability
|
||||
X = sm.add_constant(customer_features)
|
||||
model = Logit(purchased, X)
|
||||
results = model.fit()
|
||||
|
||||
# Targeting: select top 20% likely to purchase
|
||||
probs = results.predict(X)
|
||||
top_20_pct_idx = np.argsort(probs)[-int(0.2*len(probs)):]
|
||||
```
|
||||
|
||||
### Multinomial Choice (Transportation Mode)
|
||||
|
||||
```python
|
||||
# Predict transportation mode choice
|
||||
model = MNLogit(mode_choice, X)
|
||||
results = model.fit()
|
||||
|
||||
# Predicted mode for new commuter
|
||||
new_commuter = sm.add_constant(new_features)
|
||||
mode_probs = results.predict(new_commuter)
|
||||
predicted_mode = mode_probs.argmax(axis=1)
|
||||
```
|
||||
|
||||
### Count Data (Number of Doctor Visits)
|
||||
|
||||
```python
|
||||
# Model healthcare utilization
|
||||
model = NegativeBinomial(num_visits, X)
|
||||
results = model.fit()
|
||||
|
||||
# Expected visits for new patient
|
||||
expected_visits = results.predict(new_patient_X)
|
||||
```
|
||||
|
||||
### Zero-Inflated (Insurance Claims)
|
||||
|
||||
```python
|
||||
# Many people have zero claims
|
||||
# Zero-inflation: some never claim
|
||||
# Count process: those who might claim
|
||||
|
||||
zip_model = ZeroInflatedPoisson(claims, X_count, exog_infl=X_inflation)
|
||||
results = zip_model.fit()
|
||||
|
||||
# P(never file claim): probability of the always-zero component
# ('prob-zero' would also include zeros produced by the count process)
never_claim_prob = 1 - results.predict(which='prob-main')

# Expected claims
expected_claims = results.predict(which='mean')
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Check data type**: Ensure response matches model (binary, counts, categories)
2. **Add constant**: Use `sm.add_constant()` unless no intercept is desired (`OrderedModel` is the exception: its thresholds act as intercepts)
3. **Scale continuous predictors**: For better convergence and interpretation
4. **Check convergence**: Look for convergence warnings
5. **Use formula API**: For categorical variables and interactions
6. **Marginal effects**: Report marginal effects, not just coefficients
7. **Model comparison**: Use AIC/BIC and cross-validation
8. **Validate**: Holdout set or cross-validation for predictive models
9. **Check overdispersion**: For count models, test Poisson assumption
10. **Consider alternatives**: Zero-inflation, hurdle models for excess zeros
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
1. **Forgetting constant**: No intercept term
2. **Perfect separation**: Logit/probit may not converge
3. **Using Poisson with overdispersion**: Check and use Negative Binomial
4. **Misinterpreting coefficients**: Remember they're on log-odds/log scale
5. **Not checking convergence**: Optimization may fail silently
6. **Wrong distribution**: Match model to data type (binary/count/categorical)
7. **Ignoring excess zeros**: Use ZIP/ZINB when appropriate
8. **Not validating predictions**: Always check out-of-sample performance
9. **Comparing non-nested models**: Use AIC/BIC, not likelihood ratio test
10. **Ordinal as nominal**: Use OrderedModel for ordered categories
|
||||
619
skills/statsmodels/references/glm.md
Normal file
@@ -0,0 +1,619 @@
|
||||
# Generalized Linear Models (GLM) Reference
|
||||
|
||||
This document provides comprehensive guidance on generalized linear models in statsmodels, including families, link functions, and applications.
|
||||
|
||||
## Overview
|
||||
|
||||
GLMs extend linear regression to non-normal response distributions through:
|
||||
1. **Distribution family**: Specifies the conditional distribution of the response
|
||||
2. **Link function**: Transforms the linear predictor to the scale of the mean
|
||||
3. **Variance function**: Relates variance to the mean
|
||||
|
||||
**General form**: g(μ) = Xβ, where g is the link function and μ = E(Y|X)
|
||||
|
||||
## When to Use GLM
|
||||
|
||||
- **Binary outcomes**: Logistic regression (Binomial family with logit link)
|
||||
- **Count data**: Poisson or Negative Binomial regression
|
||||
- **Positive continuous data**: Gamma or Inverse Gaussian
|
||||
- **Non-normal distributions**: When OLS assumptions violated
|
||||
- **Link functions**: Need non-linear relationship between predictors and response scale
|
||||
|
||||
## Distribution Families
|
||||
|
||||
### Binomial Family
|
||||
|
||||
For binary outcomes (0/1) or proportions (k/n).
|
||||
|
||||
**When to use:**
|
||||
- Binary classification
|
||||
- Success/failure outcomes
|
||||
- Proportions or rates
|
||||
|
||||
**Common links:**
|
||||
- Logit (default): log(μ/(1-μ))
|
||||
- Probit: Φ⁻¹(μ)
|
||||
- Log: log(μ)
|
||||
|
||||
```python
|
||||
import statsmodels.api as sm
|
||||
import statsmodels.formula.api as smf
|
||||
|
||||
# Binary logistic regression
|
||||
model = sm.GLM(y, X, family=sm.families.Binomial())
|
||||
results = model.fit()
|
||||
|
||||
# Formula API
|
||||
results = smf.glm('success ~ x1 + x2', data=df,
|
||||
family=sm.families.Binomial()).fit()
|
||||
|
||||
# Access predictions (probabilities)
|
||||
probs = results.predict(X_new)
|
||||
|
||||
# Classification (0.5 threshold)
|
||||
predictions = (probs > 0.5).astype(int)
|
||||
```
|
||||
|
||||
**Interpretation:**
|
||||
```python
|
||||
import numpy as np
|
||||
|
||||
# Odds ratios (for logit link)
|
||||
odds_ratios = np.exp(results.params)
|
||||
print("Odds ratios:", odds_ratios)
|
||||
|
||||
# For 1-unit increase in x, odds multiply by exp(beta)
|
||||
```
|
||||
|
||||
### Poisson Family
|
||||
|
||||
For count data (non-negative integers).
|
||||
|
||||
**When to use:**
|
||||
- Count outcomes (number of events)
|
||||
- Rare events
|
||||
- Rate modeling (with offset)
|
||||
|
||||
**Common links:**
|
||||
- Log (default): log(μ)
|
||||
- Identity: μ
|
||||
- Sqrt: √μ
|
||||
|
||||
```python
|
||||
import numpy as np

# Poisson regression
model = sm.GLM(y, X, family=sm.families.Poisson())
results = model.fit()

# With exposure/offset for rates
# If modeling rate = counts/exposure
model = sm.GLM(y, X, family=sm.families.Poisson(),
               offset=np.log(exposure))
results = model.fit()

# Interpretation: exp(beta) = multiplicative effect on expected count
rate_ratios = np.exp(results.params)
print("Rate ratios:", rate_ratios)
|
||||
```
|
||||
|
||||
**Overdispersion check:**
|
||||
```python
|
||||
# Deviance / df should be ~1 for Poisson
overdispersion = results.deviance / results.df_resid
print(f"Overdispersion: {overdispersion:.2f}")

# If >> 1, consider Negative Binomial
if overdispersion > 1.5:
    print("Consider Negative Binomial model for overdispersion")
|
||||
```
|
||||
|
||||
### Negative Binomial Family
|
||||
|
||||
For overdispersed count data.
|
||||
|
||||
**When to use:**
|
||||
- Count data with variance > mean
|
||||
- Excess zeros or large variance
|
||||
- Poisson model shows overdispersion
|
||||
|
||||
```python
|
||||
# Negative Binomial GLM: alpha is held fixed (default alpha=1.0), not estimated
model = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=1.0))
results = model.fit()

# Alternative: discrete count model, which estimates alpha by maximum likelihood
from statsmodels.discrete.discrete_model import NegativeBinomial
nb_model = NegativeBinomial(y, X)
nb_results = nb_model.fit()

# alpha is the last element of params
print(f"Dispersion parameter alpha: {np.asarray(nb_results.params)[-1]:.4f}")
|
||||
```
|
||||
|
||||
### Gaussian Family
|
||||
|
||||
Equivalent to OLS but fit via IRLS (Iteratively Reweighted Least Squares).
|
||||
|
||||
**When to use:**
|
||||
- Want GLM framework for consistency
|
||||
- Need robust standard errors
|
||||
- Comparing with other GLMs
|
||||
|
||||
**Common links:**
|
||||
- Identity (default): μ
|
||||
- Log: log(μ)
|
||||
- Inverse: 1/μ
|
||||
|
||||
```python
|
||||
# Gaussian GLM (equivalent to OLS)
|
||||
model = sm.GLM(y, X, family=sm.families.Gaussian())
|
||||
results = model.fit()
|
||||
|
||||
# Verify equivalence with OLS
|
||||
ols_results = sm.OLS(y, X).fit()
|
||||
print("Parameters close:", np.allclose(results.params, ols_results.params))
|
||||
```
|
||||
|
||||
### Gamma Family
|
||||
|
||||
For positive continuous data, often right-skewed.
|
||||
|
||||
**When to use:**
|
||||
- Positive outcomes (insurance claims, survival times)
|
||||
- Right-skewed distributions
|
||||
- Variance proportional to mean²
|
||||
|
||||
**Common links:**
|
||||
- Inverse (default): 1/μ
|
||||
- Log: log(μ)
|
||||
- Identity: μ
|
||||
|
||||
```python
|
||||
# Gamma regression (common for cost data)
|
||||
model = sm.GLM(y, X, family=sm.families.Gamma())
|
||||
results = model.fit()
|
||||
|
||||
# Log link often preferred for interpretation
|
||||
model = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log()))
|
||||
results = model.fit()
|
||||
|
||||
# With log link, exp(beta) = multiplicative effect
|
||||
import numpy as np
|
||||
effects = np.exp(results.params)
|
||||
```
|
||||
|
||||
### Inverse Gaussian Family
|
||||
|
||||
For positive continuous data with specific variance structure.
|
||||
|
||||
**When to use:**
|
||||
- Positive skewed outcomes
|
||||
- Variance proportional to mean³
|
||||
- Alternative to Gamma
|
||||
|
||||
**Common links:**
|
||||
- Inverse squared (default): 1/μ²
|
||||
- Log: log(μ)
|
||||
|
||||
```python
|
||||
model = sm.GLM(y, X, family=sm.families.InverseGaussian())
|
||||
results = model.fit()
|
||||
```
|
||||
|
||||
### Tweedie Family
|
||||
|
||||
Flexible family covering multiple distributions.
|
||||
|
||||
**When to use:**
|
||||
- Insurance claims (mixture of zeros and continuous)
|
||||
- Semi-continuous data
|
||||
- Need flexible variance function
|
||||
|
||||
**Special cases (power parameter p):**
|
||||
- p=0: Normal
|
||||
- p=1: Poisson
|
||||
- p=2: Gamma
|
||||
- p=3: Inverse Gaussian
|
||||
- 1<p<2: Compound Poisson-Gamma (common for insurance)
|
||||
|
||||
```python
|
||||
# Tweedie with power=1.5
|
||||
model = sm.GLM(y, X, family=sm.families.Tweedie(link=sm.families.links.Log(),
|
||||
var_power=1.5))
|
||||
results = model.fit()
|
||||
```
|
||||
|
||||
## Link Functions
|
||||
|
||||
Link functions connect the linear predictor to the mean of the response.
|
||||
|
||||
### Available Links
|
||||
|
||||
```python
|
||||
from statsmodels.genmod import families
|
||||
|
||||
# Identity: g(μ) = μ
|
||||
link = families.links.Identity()
|
||||
|
||||
# Log: g(μ) = log(μ)
|
||||
link = families.links.Log()
|
||||
|
||||
# Logit: g(μ) = log(μ/(1-μ))
|
||||
link = families.links.Logit()
|
||||
|
||||
# Probit: g(μ) = Φ⁻¹(μ)
|
||||
link = families.links.Probit()
|
||||
|
||||
# Complementary log-log: g(μ) = log(-log(1-μ))
|
||||
link = families.links.CLogLog()
|
||||
|
||||
# Inverse: g(μ) = 1/μ
|
||||
link = families.links.InversePower()
|
||||
|
||||
# Inverse squared: g(μ) = 1/μ²
|
||||
link = families.links.InverseSquared()
|
||||
|
||||
# Square root: g(μ) = √μ
|
||||
link = families.links.Sqrt()
|
||||
|
||||
# Power: g(μ) = μ^p
|
||||
link = families.links.Power(power=2)
|
||||
```
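
Each link object is callable and exposes its inverse, which is handy for moving between the mean scale and the linear-predictor scale by hand. A small sketch with the logit link, using arbitrary input values:

```python
import numpy as np
from statsmodels.genmod import families

link = families.links.Logit()
mu = np.array([0.1, 0.5, 0.9])

eta = link(mu)               # g(mu): probability scale -> linear predictor scale
mu_back = link.inverse(eta)  # g^{-1}(eta): back to the probability scale

print("eta:", np.round(eta, 4))
print("round trip ok:", np.allclose(mu, mu_back))
```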
|
||||
|
||||
### Choosing Link Functions
|
||||
|
||||
**Canonical links** (default for each family):
|
||||
- Binomial → Logit
|
||||
- Poisson → Log
|
||||
- Gamma → Inverse
|
||||
- Gaussian → Identity
|
||||
- Inverse Gaussian → Inverse squared
|
||||
|
||||
**When to use non-canonical:**
|
||||
- **Log link with Binomial**: Risk ratios instead of odds ratios
|
||||
- **Identity link**: Direct additive effects (when sensible)
|
||||
- **Probit vs Logit**: Similar results, preference based on field
|
||||
- **CLogLog**: Asymmetric relationship, common in survival analysis
|
||||
|
||||
```python
|
||||
# Example: Risk ratios with log-binomial model
|
||||
model = sm.GLM(y, X, family=sm.families.Binomial(link=sm.families.links.Log()))
|
||||
results = model.fit()
|
||||
|
||||
# exp(beta) now gives risk ratios, not odds ratios
|
||||
risk_ratios = np.exp(results.params)
|
||||
```
|
||||
|
||||
## Model Fitting and Results
|
||||
|
||||
### Basic Workflow
|
||||
|
||||
```python
|
||||
import statsmodels.api as sm
|
||||
|
||||
# Add constant
|
||||
X = sm.add_constant(X_data)
|
||||
|
||||
# Specify family and link
|
||||
family = sm.families.Poisson(link=sm.families.links.Log())
|
||||
|
||||
# Fit model using IRLS
|
||||
model = sm.GLM(y, X, family=family)
|
||||
results = model.fit()
|
||||
|
||||
# Summary
|
||||
print(results.summary())
|
||||
```
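
To make the IRLS step less of a black box, here is a minimal NumPy sketch of the algorithm for a Poisson GLM with log link. It is an illustration of the idea under simplifying assumptions (no offset, no weights, constant in the first column), not the statsmodels implementation. The function name `irls_poisson` is hypothetical.

```python
import numpy as np

def irls_poisson(X, y, n_iter=25, tol=1e-8):
    """Minimal IRLS for a Poisson GLM with log link (illustrative only)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())  # assumes column 0 is the constant and y.mean() > 0
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu  # working response
        w = mu                   # working weights for the log link
        XtW = X.T * w
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)  # weighted least squares step
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

rng = np.random.default_rng(4)
x = rng.normal(size=500)
X = np.column_stack([np.ones_like(x), x])
y = rng.poisson(np.exp(0.5 + 0.3 * x))

beta_hat = irls_poisson(X, y)
print("IRLS estimate:", np.round(beta_hat, 3))
```

Each iteration is a weighted least squares regression of the working response on `X`. At convergence the score `X.T @ (y - mu)` is zero, which is the maximum likelihood condition for this model.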
|
||||
|
||||
### Results Attributes
|
||||
|
||||
```python
|
||||
# Parameters and inference
|
||||
results.params # Coefficients
|
||||
results.bse # Standard errors
|
||||
results.tvalues # Z-statistics
|
||||
results.pvalues # P-values
|
||||
results.conf_int() # Confidence intervals
|
||||
|
||||
# Predictions
|
||||
results.fittedvalues # Fitted values (μ)
|
||||
results.predict(X_new) # Predictions for new data
|
||||
|
||||
# Model fit statistics
|
||||
results.aic # Akaike Information Criterion
|
||||
results.bic # Bayesian Information Criterion
|
||||
results.deviance # Deviance
|
||||
results.null_deviance # Null model deviance
|
||||
results.pearson_chi2 # Pearson chi-squared statistic
|
||||
results.df_resid # Residual degrees of freedom
|
||||
results.llf # Log-likelihood
|
||||
|
||||
# Residuals
|
||||
results.resid_response # Response residuals (y - μ)
|
||||
results.resid_pearson # Pearson residuals
|
||||
results.resid_deviance # Deviance residuals
|
||||
results.resid_anscombe # Anscombe residuals
|
||||
results.resid_working # Working residuals
|
||||
```
|
||||
|
||||
### Pseudo R-squared
|
||||
|
||||
```python
|
||||
# Deviance-based pseudo R-squared (fraction of null deviance explained)
pseudo_r2 = 1 - (results.deviance / results.null_deviance)
print(f"Pseudo R²: {pseudo_r2:.4f}")

# McFadden's pseudo R-squared uses log-likelihoods instead
mcfadden_r2 = 1 - (results.llf / results.llnull)
print(f"McFadden R²: {mcfadden_r2:.4f}")

# Adjusted (deviance-based) pseudo R-squared
n = len(y)
k = len(results.params)
adj_pseudo_r2 = 1 - ((n-1)/(n-k)) * (results.deviance / results.null_deviance)
print(f"Adjusted Pseudo R²: {adj_pseudo_r2:.4f}")
|
||||
```
|
||||
|
||||
## Diagnostics
|
||||
|
||||
### Goodness of Fit
|
||||
|
||||
```python
|
||||
# For grouped or count data, deviance is approximately χ² with df_resid degrees
# of freedom; the approximation is not reliable for ungrouped binary (0/1) outcomes
|
||||
from scipy import stats
|
||||
|
||||
deviance_pval = 1 - stats.chi2.cdf(results.deviance, results.df_resid)
|
||||
print(f"Deviance test p-value: {deviance_pval}")
|
||||
|
||||
# Pearson chi-squared test
|
||||
pearson_pval = 1 - stats.chi2.cdf(results.pearson_chi2, results.df_resid)
|
||||
print(f"Pearson chi² test p-value: {pearson_pval}")
|
||||
|
||||
# Check for overdispersion/underdispersion
|
||||
dispersion = results.pearson_chi2 / results.df_resid
|
||||
print(f"Dispersion: {dispersion}")
|
||||
# Should be ~1; >1 suggests overdispersion, <1 underdispersion
|
||||
```
|
||||
|
||||
### Residual Analysis
|
||||
|
||||
```python
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
# Deviance residuals vs fitted
|
||||
plt.figure(figsize=(10, 6))
|
||||
plt.scatter(results.fittedvalues, results.resid_deviance, alpha=0.5)
|
||||
plt.xlabel('Fitted values')
|
||||
plt.ylabel('Deviance residuals')
|
||||
plt.axhline(y=0, color='r', linestyle='--')
|
||||
plt.title('Deviance Residuals vs Fitted')
|
||||
plt.show()
|
||||
|
||||
# Q-Q plot of deviance residuals
|
||||
from statsmodels.graphics.gofplots import qqplot
|
||||
qqplot(results.resid_deviance, line='s')
|
||||
plt.title('Q-Q Plot of Deviance Residuals')
|
||||
plt.show()
|
||||
|
||||
# For binary outcomes: binned residual plot
if isinstance(results.model.family, sm.families.Binomial):
    # Group observations by fitted probability and plot the average
    # response residual per bin (custom implementation needed)
    pass
|
||||
```
|
||||
|
||||
### Influence and Outliers
|
||||
|
||||
```python
|
||||
from statsmodels.stats.outliers_influence import GLMInfluence
|
||||
|
||||
influence = GLMInfluence(results)
|
||||
|
||||
# Leverage
|
||||
leverage = influence.hat_matrix_diag
|
||||
|
||||
# Cook's distance
|
||||
cooks_d = influence.cooks_distance[0]
|
||||
|
||||
# DFFITS analogue for GLMs: scaled change in fitted values (one-step approximation)
dffits = influence.d_fittedvalues_scaled
|
||||
|
||||
# Find influential observations
|
||||
influential = np.where(cooks_d > 4/len(y))[0]
|
||||
print(f"Influential observations: {influential}")
|
||||
```
|
||||
|
||||
## Hypothesis Testing
|
||||
|
||||
```python
|
||||
# Wald test for single parameter (automatically in summary)
|
||||
|
||||
# Likelihood ratio test for nested models
|
||||
# Fit reduced model
|
||||
model_reduced = sm.GLM(y, X_reduced, family=family).fit()
|
||||
model_full = sm.GLM(y, X_full, family=family).fit()
|
||||
|
||||
# LR statistic
|
||||
lr_stat = 2 * (model_full.llf - model_reduced.llf)
|
||||
df = model_full.df_model - model_reduced.df_model
|
||||
|
||||
from scipy import stats
|
||||
lr_pval = 1 - stats.chi2.cdf(lr_stat, df)
|
||||
print(f"LR test p-value: {lr_pval}")
|
||||
|
||||
# Wald test for multiple parameters
|
||||
# Test beta_1 = beta_2 = 0
|
||||
R = [[0, 1, 0, 0], [0, 0, 1, 0]]
|
||||
wald_test = results.wald_test(R)
|
||||
print(wald_test)
|
||||
```
|
||||
|
||||
## Robust Standard Errors
|
||||
|
||||
```python
|
||||
# Heteroscedasticity-robust (sandwich estimator): request it when fitting
results_robust = model.fit(cov_type='HC0')

# Cluster-robust
results_cluster = model.fit(cov_type='cluster',
                            cov_kwds={'groups': cluster_ids})

# Compare standard errors
print("Regular SE:", results.bse)
print("Robust SE:", results_robust.bse)
|
||||
```
|
||||
|
||||
## Model Comparison
|
||||
|
||||
```python
|
||||
# AIC/BIC for non-nested models
|
||||
models = [model1_results, model2_results, model3_results]
|
||||
for i, res in enumerate(models, 1):
|
||||
print(f"Model {i}: AIC={res.aic:.2f}, BIC={res.bic:.2f}")
|
||||
|
||||
# Likelihood ratio test for nested models (as shown above)
|
||||
|
||||
# Cross-validation for predictive performance
|
||||
from sklearn.model_selection import KFold
|
||||
from sklearn.metrics import log_loss
|
||||
|
||||
kf = KFold(n_splits=5, shuffle=True, random_state=42)
|
||||
cv_scores = []
|
||||
|
||||
# X and y are NumPy arrays here (use .iloc[...] for pandas objects);
# log loss applies to a Binomial family with 0/1 outcomes
for train_idx, val_idx in kf.split(X):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]

    model_cv = sm.GLM(y_train, X_train, family=family).fit()
    pred_probs = model_cv.predict(X_val)

    score = log_loss(y_val, pred_probs)
    cv_scores.append(score)
|
||||
|
||||
print(f"CV Log Loss: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}")
|
||||
```
|
||||
|
||||
## Prediction
|
||||
|
||||
```python
|
||||
# Point predictions
|
||||
predictions = results.predict(X_new)
|
||||
|
||||
# For classification: get probabilities and convert
if isinstance(family, sm.families.Binomial):
    probs = predictions
    class_predictions = (probs > 0.5).astype(int)

# For counts: predictions are expected counts
if isinstance(family, sm.families.Poisson):
    expected_counts = predictions

# Bootstrap confidence intervals for the predicted mean
# (X, y are NumPy arrays; the intervals cover E[y|x], not a new observation)
n_boot = 1000
boot_preds = np.zeros((n_boot, len(X_new)))

for i in range(n_boot):
    # Bootstrap resample
    boot_idx = np.random.choice(len(y), size=len(y), replace=True)
    X_boot, y_boot = X[boot_idx], y[boot_idx]

    # Fit and predict
    boot_model = sm.GLM(y_boot, X_boot, family=family).fit()
    boot_preds[i] = boot_model.predict(X_new)

# 95% percentile intervals
pred_lower = np.percentile(boot_preds, 2.5, axis=0)
pred_upper = np.percentile(boot_preds, 97.5, axis=0)
|
||||
```
|
||||
|
||||
## Common Applications
|
||||
|
||||
### Logistic Regression (Binary Classification)
|
||||
|
||||
```python
|
||||
import numpy as np
import statsmodels.api as sm
|
||||
|
||||
# Fit logistic regression
|
||||
X = sm.add_constant(X_data)
|
||||
model = sm.GLM(y, X, family=sm.families.Binomial())
|
||||
results = model.fit()
|
||||
|
||||
# Odds ratios
|
||||
odds_ratios = np.exp(results.params)
|
||||
odds_ci = np.exp(results.conf_int())
|
||||
|
||||
# Classification metrics
|
||||
from sklearn.metrics import classification_report, roc_auc_score
|
||||
|
||||
probs = results.predict(X)
|
||||
predictions = (probs > 0.5).astype(int)
|
||||
|
||||
print(classification_report(y, predictions))
|
||||
print(f"AUC: {roc_auc_score(y, probs):.4f}")
|
||||
|
||||
# ROC curve
|
||||
from sklearn.metrics import roc_curve
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
fpr, tpr, thresholds = roc_curve(y, probs)
|
||||
plt.plot(fpr, tpr)
|
||||
plt.plot([0, 1], [0, 1], 'k--')
|
||||
plt.xlabel('False Positive Rate')
|
||||
plt.ylabel('True Positive Rate')
|
||||
plt.title('ROC Curve')
|
||||
plt.show()
|
||||
```
|
||||
|
||||
### Poisson Regression (Count Data)
|
||||
|
||||
```python
|
||||
# Fit Poisson model
|
||||
X = sm.add_constant(X_data)
|
||||
model = sm.GLM(y_counts, X, family=sm.families.Poisson())
|
||||
results = model.fit()
|
||||
|
||||
# Rate ratios
|
||||
rate_ratios = np.exp(results.params)
|
||||
print("Rate ratios:", rate_ratios)
|
||||
|
||||
# Check overdispersion
dispersion = results.pearson_chi2 / results.df_resid
if dispersion > 1.5:
    print(f"Overdispersion detected ({dispersion:.2f}). Consider Negative Binomial.")
|
||||
```
|
||||
|
||||
### Gamma Regression (Cost/Duration Data)
|
||||
|
||||
```python
|
||||
# Fit Gamma model with log link
|
||||
X = sm.add_constant(X_data)
|
||||
model = sm.GLM(y_cost, X,
|
||||
family=sm.families.Gamma(link=sm.families.links.Log()))
|
||||
results = model.fit()
|
||||
|
||||
# Multiplicative effects
|
||||
effects = np.exp(results.params)
|
||||
print("Multiplicative effects on mean:", effects)
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Check distribution assumptions**: Plot histograms and Q-Q plots of response
|
||||
2. **Verify link function**: Use canonical links unless there's a reason not to
|
||||
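3. **Examine residuals**: Plot deviance or Pearson residuals against fitted values; they are only approximately normal when the response is not too discrete (e.g., large counts)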
|
||||
4. **Test for overdispersion**: Especially for Poisson models
|
||||
5. **Use offsets appropriately**: For rate modeling with varying exposure
|
||||
6. **Consider robust SEs**: When variance assumptions questionable
|
||||
7. **Compare models**: Use AIC/BIC for non-nested, LR test for nested
|
||||
8. **Interpret on original scale**: Transform coefficients (e.g., exp for log link)
|
||||
9. **Check influential observations**: Use Cook's distance
|
||||
10. **Validate predictions**: Use cross-validation or holdout set
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
1. **Forgetting to add constant**: No intercept term
|
||||
2. **Using wrong family**: Check distribution of response
|
||||
3. **Ignoring overdispersion**: Use Negative Binomial instead of Poisson
|
||||
4. **Misinterpreting coefficients**: Remember link function transformation
|
||||
5. **Not checking convergence**: IRLS may not converge; check warnings
|
||||
6. **Complete separation in logistic**: Some categories perfectly predict outcome
|
||||
7. **Using identity link with bounded outcomes**: May predict outside valid range
|
||||
8. **Comparing models with different samples**: Use same observations
|
||||
9. **Forgetting offset in rate models**: Must use log(exposure) as offset
|
||||
10. **Not considering alternatives**: Mixed models, zero-inflation for complex data
|
||||
447
skills/statsmodels/references/linear_models.md
Normal file
@@ -0,0 +1,447 @@
|
||||
# Linear Regression Models Reference
|
||||
|
||||
This document provides detailed guidance on linear regression models in statsmodels, including OLS, GLS, WLS, quantile regression, and specialized variants.
|
||||
|
||||
## Core Model Classes
|
||||
|
||||
### OLS (Ordinary Least Squares)
|
||||
|
||||
Assumes independent, identically distributed errors (Σ=I). Best for standard regression with homoscedastic errors.
|
||||
|
||||
**When to use:**
|
||||
- Standard regression analysis
|
||||
- Errors are independent and have constant variance
|
||||
- No autocorrelation or heteroscedasticity
|
||||
- Most common starting point
|
||||
|
||||
**Basic usage:**
|
||||
```python
|
||||
import statsmodels.api as sm
|
||||
import numpy as np
|
||||
|
||||
# Prepare data - ALWAYS add constant for intercept
|
||||
X = sm.add_constant(X_data) # Adds column of 1s for intercept
|
||||
|
||||
# Fit model
|
||||
model = sm.OLS(y, X)
|
||||
results = model.fit()
|
||||
|
||||
# View results
|
||||
print(results.summary())
|
||||
```
|
||||
|
||||
**Key results attributes:**
|
||||
```python
|
||||
results.params # Coefficients
|
||||
results.bse # Standard errors
|
||||
results.tvalues # T-statistics
|
||||
results.pvalues # P-values
|
||||
results.rsquared # R-squared
|
||||
results.rsquared_adj # Adjusted R-squared
|
||||
results.fittedvalues # Fitted values (predictions on training data)
|
||||
results.resid # Residuals
|
||||
results.conf_int() # Confidence intervals for parameters
|
||||
```
|
||||
|
||||
**Prediction with confidence/prediction intervals:**
|
||||
```python
|
||||
# For in-sample predictions
|
||||
pred = results.get_prediction(X)
|
||||
pred_summary = pred.summary_frame()
|
||||
print(pred_summary)  # Columns: mean, mean_se, mean_ci_lower/upper, obs_ci_lower/upper
|
||||
|
||||
# For out-of-sample predictions
|
||||
X_new = sm.add_constant(X_new_data)
|
||||
pred_new = results.get_prediction(X_new)
|
||||
pred_summary = pred_new.summary_frame()
|
||||
|
||||
# Access intervals
|
||||
mean_ci_lower = pred_summary["mean_ci_lower"]
|
||||
mean_ci_upper = pred_summary["mean_ci_upper"]
|
||||
obs_ci_lower = pred_summary["obs_ci_lower"] # Prediction intervals
|
||||
obs_ci_upper = pred_summary["obs_ci_upper"]
|
||||
```
|
||||
|
||||
**Formula API (R-style):**
|
||||
```python
|
||||
import statsmodels.formula.api as smf
|
||||
|
||||
# Automatic handling of categorical variables and interactions
|
||||
formula = 'y ~ x1 + x2 + C(category) + x1:x2'
|
||||
results = smf.ols(formula, data=df).fit()
|
||||
```
|
||||
|
||||
### WLS (Weighted Least Squares)
|
||||
|
||||
Handles heteroscedastic errors (diagonal Σ) where variance differs across observations.
|
||||
|
||||
**When to use:**
|
||||
- Known heteroscedasticity (non-constant error variance)
|
||||
- Different observations have different reliability
|
||||
- Weights are known or can be estimated
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
# If you know the weights (inverse variance)
|
||||
weights = 1 / error_variance
|
||||
model = sm.WLS(y, X, weights=weights)
|
||||
results = model.fit()
|
||||
|
||||
# Common weight patterns:
|
||||
# - 1/variance: when variance is known
|
||||
# - n_i: sample size for grouped data
|
||||
# - 1/x: when variance proportional to x
|
||||
```
|
||||
|
||||
**Feasible WLS (estimating weights):**
|
||||
```python
|
||||
# Step 1: Fit OLS
|
||||
ols_results = sm.OLS(y, X).fit()
|
||||
|
||||
# Step 2: Model squared residuals to estimate variance
|
||||
abs_resid = np.abs(ols_results.resid)
|
||||
variance_model = sm.OLS(np.log(abs_resid**2), X).fit()
|
||||
|
||||
# Step 3: Use estimated variance as weights
|
||||
weights = 1 / np.exp(variance_model.fittedvalues)
|
||||
wls_results = sm.WLS(y, X, weights=weights).fit()
|
||||
```
|
||||
|
||||
### GLS (Generalized Least Squares)
|
||||
|
||||
Handles arbitrary covariance structure (Σ). Superclass for other regression methods.
|
||||
|
||||
**When to use:**
|
||||
- Known covariance structure
|
||||
- Correlated errors
|
||||
- More general than WLS
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
# Specify covariance structure
|
||||
# Sigma should be (n x n) covariance matrix
|
||||
model = sm.GLS(y, X, sigma=Sigma)
|
||||
results = model.fit()
|
||||
```
|
||||
|
||||
### GLSAR (GLS with Autoregressive Errors)
|
||||
|
||||
Feasible generalized least squares with AR(p) errors for time series data.
|
||||
|
||||
**When to use:**
|
||||
- Time series regression with autocorrelated errors
|
||||
- Need to account for serial correlation
|
||||
- Violations of error independence
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
# AR(1) errors
|
||||
model = sm.GLSAR(y, X, rho=1) # rho=1 for AR(1), rho=2 for AR(2), etc.
|
||||
results = model.iterative_fit() # Iteratively estimates AR parameters
|
||||
|
||||
print(results.summary())
|
||||
print(f"Estimated rho: {results.model.rho}")
|
||||
```
|
||||
|
||||
### RLS (Recursive Least Squares)
|
||||
|
||||
Sequential parameter estimation, useful for adaptive or online learning.
|
||||
|
||||
**When to use:**
|
||||
- Parameters change over time
|
||||
- Online/streaming data
|
||||
- Want to see parameter evolution
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
from statsmodels.regression.recursive_ls import RecursiveLS
|
||||
|
||||
model = RecursiveLS(y, X)
|
||||
results = model.fit()
|
||||
|
||||
# Access time-varying parameters
|
||||
params_over_time = results.recursive_coefficients
|
||||
cusum = results.cusum # CUSUM statistic for structural breaks
|
||||
```
|
||||
|
||||
### Rolling Regressions
|
||||
|
||||
Compute estimates across moving windows for time-varying parameter detection.
|
||||
|
||||
**When to use:**
|
||||
- Parameters vary over time
|
||||
- Want to detect structural changes
|
||||
- Time series with evolving relationships
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
from statsmodels.regression.rolling import RollingOLS, RollingWLS
|
||||
|
||||
# Rolling OLS with 60-period window
|
||||
rolling_model = RollingOLS(y, X, window=60)
|
||||
rolling_results = rolling_model.fit()
|
||||
|
||||
# Extract time-varying parameters
|
||||
rolling_params = rolling_results.params # DataFrame with parameters over time
|
||||
rolling_rsquared = rolling_results.rsquared
|
||||
|
||||
# Plot parameter evolution
|
||||
import matplotlib.pyplot as plt
|
||||
rolling_params.plot()
|
||||
plt.title('Time-Varying Coefficients')
|
||||
plt.show()
|
||||
```
|
||||
|
||||
### Quantile Regression
|
||||
|
||||
Analyzes conditional quantiles rather than conditional mean.
|
||||
|
||||
**When to use:**
|
||||
- Interest in quantiles (median, 90th percentile, etc.)
|
||||
- Robust to outliers (median regression)
|
||||
- Distributional effects across quantiles
|
||||
- Heterogeneous effects
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
from statsmodels.regression.quantile_regression import QuantReg
|
||||
|
||||
# Median regression (50th percentile)
|
||||
model = QuantReg(y, X)
|
||||
results_median = model.fit(q=0.5)
|
||||
|
||||
# Multiple quantiles
|
||||
quantiles = [0.1, 0.25, 0.5, 0.75, 0.9]
|
||||
results_dict = {}
|
||||
for q in quantiles:
    results_dict[q] = model.fit(q=q)
|
||||
|
||||
# Plot quantile-varying effects
|
||||
import matplotlib.pyplot as plt
import pandas as pd
|
||||
coef_dict = {q: res.params for q, res in results_dict.items()}
|
||||
coef_df = pd.DataFrame(coef_dict).T
|
||||
coef_df.plot()
|
||||
plt.xlabel('Quantile')
|
||||
plt.ylabel('Coefficient')
|
||||
plt.show()
|
||||
```
|
||||
|
||||
## Mixed Effects Models
|
||||
|
||||
For hierarchical/nested data with random effects.
|
||||
|
||||
**When to use:**
|
||||
- Clustered/grouped data (students in schools, patients in hospitals)
|
||||
- Repeated measures
|
||||
- Need random effects to account for grouping
|
||||
|
||||
**Usage:**
|
||||
```python
|
||||
from statsmodels.regression.mixed_linear_model import MixedLM
|
||||
|
||||
# Random intercept model
|
||||
model = MixedLM(y, X, groups=group_ids)
|
||||
results = model.fit()
|
||||
|
||||
# Random intercept and slope
|
||||
model = MixedLM(y, X, groups=group_ids, exog_re=X_random)
|
||||
results = model.fit()
|
||||
|
||||
print(results.summary())
|
||||
```
|
||||
|
||||
## Diagnostics and Model Assessment
|
||||
|
||||
### Residual Analysis
|
||||
|
||||
```python
|
||||
# Basic residual plots
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
# Residuals vs fitted
|
||||
plt.scatter(results.fittedvalues, results.resid)
|
||||
plt.xlabel('Fitted values')
|
||||
plt.ylabel('Residuals')
|
||||
plt.axhline(y=0, color='r', linestyle='--')
|
||||
plt.title('Residuals vs Fitted')
|
||||
plt.show()
|
||||
|
||||
# Q-Q plot for normality
|
||||
from statsmodels.graphics.gofplots import qqplot
|
||||
qqplot(results.resid, line='s')
|
||||
plt.show()
|
||||
|
||||
# Histogram of residuals
|
||||
plt.hist(results.resid, bins=30, edgecolor='black')
|
||||
plt.xlabel('Residuals')
|
||||
plt.ylabel('Frequency')
|
||||
plt.title('Distribution of Residuals')
|
||||
plt.show()
|
||||
```
|
||||
|
||||
### Specification Tests
|
||||
|
||||
```python
|
||||
from statsmodels.stats.diagnostic import het_breuschpagan, het_white
|
||||
from statsmodels.stats.stattools import durbin_watson, jarque_bera
|
||||
|
||||
# Heteroscedasticity tests
|
||||
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(results.resid, X)
|
||||
print(f"Breusch-Pagan test p-value: {lm_pval}")
|
||||
|
||||
# White test
|
||||
white_test = het_white(results.resid, X)
|
||||
print(f"White test p-value: {white_test[1]}")
|
||||
|
||||
# Autocorrelation
|
||||
dw_stat = durbin_watson(results.resid)
|
||||
print(f"Durbin-Watson statistic: {dw_stat}")
|
||||
# DW ~ 2 indicates no autocorrelation
|
||||
# DW < 2 suggests positive autocorrelation
|
||||
# DW > 2 suggests negative autocorrelation
|
||||
|
||||
# Normality test
|
||||
jb_stat, jb_pval, skew, kurtosis = jarque_bera(results.resid)
|
||||
print(f"Jarque-Bera test p-value: {jb_pval}")
|
||||
```
|
||||
|
||||
### Multicollinearity
|
||||
|
||||
```python
|
||||
from statsmodels.stats.outliers_influence import variance_inflation_factor
|
||||
|
||||
# Calculate VIF for each variable
|
||||
vif_data = pd.DataFrame()
|
||||
vif_data["Variable"] = X.columns
|
||||
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
|
||||
|
||||
print(vif_data)
|
||||
# VIF > 10 indicates problematic multicollinearity
|
||||
# VIF > 5 suggests moderate multicollinearity
|
||||
|
||||
# Condition number (from summary)
|
||||
print(f"Condition number: {results.condition_number}")
|
||||
# Condition number > 20 suggests multicollinearity
|
||||
# Condition number > 30 indicates serious problems
|
||||
```
|
||||
|
||||
### Influence Statistics
|
||||
|
||||
```python
|
||||
from statsmodels.stats.outliers_influence import OLSInfluence
|
||||
|
||||
influence = results.get_influence()
|
||||
|
||||
# Leverage (hat values)
|
||||
leverage = influence.hat_matrix_diag
|
||||
# High leverage: > 2*p/n (p = number of parameters including the intercept, n = observations)
|
||||
|
||||
# Cook's distance
|
||||
cooks_d = influence.cooks_distance[0]
|
||||
# Influential if Cook's D > 4/n
|
||||
|
||||
# DFFITS
|
||||
dffits = influence.dffits[0]
|
||||
# Influential if |DFFITS| > 2*sqrt(p/n)
|
||||
|
||||
# Create influence plot
|
||||
from statsmodels.graphics.regressionplots import influence_plot
|
||||
fig, ax = plt.subplots(figsize=(12, 8))
|
||||
influence_plot(results, ax=ax)
|
||||
plt.show()
|
||||
```
|
||||
|
||||
### Hypothesis Testing
|
||||
|
||||
```python
|
||||
# Test single coefficient
|
||||
# H0: beta_i = 0 (automatically in summary)
|
||||
|
||||
# Test multiple restrictions using F-test
|
||||
# Example: Test beta_1 = beta_2 = 0
|
||||
R = [[0, 1, 0, 0], [0, 0, 1, 0]] # Restriction matrix
|
||||
f_test = results.f_test(R)
|
||||
print(f_test)
|
||||
|
||||
# Formula-based hypothesis testing
|
||||
f_test = results.f_test("x1 = x2 = 0")
|
||||
print(f_test)
|
||||
|
||||
# Test linear combination: beta_1 + beta_2 = 1
|
||||
r_matrix = [[0, 1, 1, 0]]
|
||||
q_matrix = [1] # RHS value
|
||||
f_test = results.f_test((r_matrix, q_matrix))
|
||||
print(f_test)
|
||||
|
||||
# Wald test (equivalent to F-test for linear restrictions)
|
||||
wald_test = results.wald_test(R)
|
||||
print(wald_test)
|
||||
```
|
||||
|
||||
## Model Comparison
|
||||
|
||||
```python
|
||||
# Compare nested OLS models with an F-test (anova_lm); for MLE-based models use a likelihood ratio test
|
||||
from statsmodels.stats.anova import anova_lm
|
||||
|
||||
# Fit restricted and unrestricted models
|
||||
model_restricted = sm.OLS(y, X_restricted).fit()
|
||||
model_full = sm.OLS(y, X_full).fit()
|
||||
|
||||
# ANOVA table for model comparison
|
||||
anova_results = anova_lm(model_restricted, model_full)
|
||||
print(anova_results)
|
||||
|
||||
# AIC/BIC comparison (also valid for non-nested models fit to the same observations)
print(f"Restricted AIC: {model_restricted.aic}, BIC: {model_restricted.bic}")
print(f"Full AIC: {model_full.aic}, BIC: {model_full.bic}")
|
||||
# Lower AIC/BIC indicates better model
|
||||
```
|
||||
|
||||
## Robust Standard Errors
|
||||
|
||||
Handle heteroscedasticity or clustering without reweighting.
|
||||
|
||||
```python
|
||||
# Heteroscedasticity-robust (HC) standard errors
|
||||
results_hc = results.get_robustcov_results(cov_type='HC0') # White's
|
||||
results_hc1 = results.get_robustcov_results(cov_type='HC1')
|
||||
results_hc2 = results.get_robustcov_results(cov_type='HC2')
|
||||
results_hc3 = results.get_robustcov_results(cov_type='HC3') # Most conservative
|
||||
|
||||
# Newey-West HAC (Heteroscedasticity and Autocorrelation Consistent)
|
||||
results_hac = results.get_robustcov_results(cov_type='HAC', maxlags=4)
|
||||
|
||||
# Cluster-robust standard errors
|
||||
results_cluster = results.get_robustcov_results(cov_type='cluster',
|
||||
groups=cluster_ids)
|
||||
|
||||
# View robust results
|
||||
print(results_hc3.summary())
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Always add constant**: Use `sm.add_constant()` unless you specifically want to exclude the intercept
|
||||
2. **Check assumptions**: Run diagnostic tests (heteroscedasticity, autocorrelation, normality)
|
||||
3. **Use formula API for categorical variables**: `smf.ols()` handles categorical variables automatically
|
||||
4. **Robust standard errors**: Use when heteroscedasticity detected but model specification is correct
|
||||
5. **Model selection**: Use AIC/BIC for non-nested models, F-test/likelihood ratio for nested models
|
||||
6. **Outliers and influence**: Always check Cook's distance and leverage
|
||||
7. **Multicollinearity**: Check VIF and condition number before interpretation
|
||||
8. **Time series**: Use `GLSAR` or robust HAC standard errors for autocorrelated errors
|
||||
9. **Grouped data**: Consider mixed effects models or cluster-robust standard errors
|
||||
10. **Quantile regression**: Use for robust estimation or when interested in distributional effects
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
1. **Forgetting to add constant**: Results in no-intercept model
|
||||
2. **Ignoring heteroscedasticity**: Use WLS or robust standard errors
|
||||
3. **Using OLS with autocorrelated errors**: Use GLSAR or HAC standard errors
|
||||
4. **Over-interpreting with multicollinearity**: Check VIF first
|
||||
5. **Not checking residuals**: Always plot residuals vs fitted values
|
||||
6. **Using t-SNE/PCA residuals**: Residuals should be from original space
|
||||
7. **Confusing prediction vs confidence intervals**: Prediction intervals are wider
|
||||
8. **Not handling categorical variables properly**: Use formula API or manual dummy coding
|
||||
9. **Comparing models with different sample sizes**: Ensure same observations used
|
||||
10. **Ignoring influential observations**: Check Cook's distance and DFFITS
|
||||
859
skills/statsmodels/references/stats_diagnostics.md
Normal file
@@ -0,0 +1,859 @@
|
||||
# Statistical Tests and Diagnostics Reference
|
||||
|
||||
This document provides comprehensive guidance on statistical tests, diagnostics, and tools available in statsmodels.
|
||||
|
||||
## Overview
|
||||
|
||||
Statsmodels provides extensive statistical testing capabilities:
|
||||
- Residual diagnostics and specification tests
|
||||
- Hypothesis testing (parametric and non-parametric)
|
||||
- Goodness-of-fit tests
|
||||
- Multiple comparisons and post-hoc tests
|
||||
- Power and sample size calculations
|
||||
- Robust covariance matrices
|
||||
- Influence and outlier detection
|
||||
|
||||
## Residual Diagnostics
|
||||
|
||||
### Autocorrelation Tests
|
||||
|
||||
**Ljung-Box Test**: Tests for autocorrelation in residuals
|
||||
|
||||
```python
|
||||
from statsmodels.stats.diagnostic import acorr_ljungbox
|
||||
|
||||
# Test residuals for autocorrelation
|
||||
lb_test = acorr_ljungbox(residuals, lags=10, return_df=True)
|
||||
print(lb_test)
|
||||
|
||||
# H0: No autocorrelation up to lag k
|
||||
# If p-value < 0.05, reject H0 (autocorrelation present)
|
||||
```
|
||||
|
||||
**Durbin-Watson Test**: Tests for first-order autocorrelation
|
||||
|
||||
```python
|
||||
from statsmodels.stats.stattools import durbin_watson
|
||||
|
||||
dw_stat = durbin_watson(residuals)
|
||||
print(f"Durbin-Watson: {dw_stat:.4f}")
|
||||
|
||||
# DW ≈ 2: no autocorrelation
|
||||
# DW < 2: positive autocorrelation
|
||||
# DW > 2: negative autocorrelation
|
||||
# Exact critical values depend on n and k
|
||||
```
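
The interpretation rules above follow from the definition of the statistic, DW = Σ(e_t − e_{t−1})² / Σ e_t², which is approximately 2(1 − ρ̂) for first-order autocorrelation ρ̂. A NumPy-only sketch (the helper name `durbin_watson_stat` and the simulated series are assumptions for illustration):

```python
import numpy as np

def durbin_watson_stat(resid):
    """Sum of squared first differences divided by sum of squares."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(0)
n = 2000
white = rng.normal(size=n)          # no autocorrelation

# AR(1) series with rho = 0.7 (positive autocorrelation)
ar = np.zeros(n)
for t in range(1, n):
    ar[t] = 0.7 * ar[t - 1] + white[t]

dw_white = durbin_watson_stat(white)
dw_ar = durbin_watson_stat(ar)
print(f"White noise DW: {dw_white:.3f}")   # close to 2
print(f"AR(1) DW:       {dw_ar:.3f}")      # well below 2, near 2 * (1 - 0.7)
```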
|
||||
|
||||
**Breusch-Godfrey Test**: More general test for autocorrelation
|
||||
|
||||
```python
|
||||
from statsmodels.stats.diagnostic import acorr_breusch_godfrey
|
||||
|
||||
bg_test = acorr_breusch_godfrey(results, nlags=5)
|
||||
lm_stat, lm_pval, f_stat, f_pval = bg_test
|
||||
|
||||
print(f"LM statistic: {lm_stat:.4f}, p-value: {lm_pval:.4f}")
|
||||
# H0: No autocorrelation up to lag k
|
||||
```
|
||||
|
||||
### Heteroskedasticity Tests
|
||||
|
||||
**Breusch-Pagan Test**: Tests for heteroskedasticity
|
||||
|
||||
```python
|
||||
from statsmodels.stats.diagnostic import het_breuschpagan
|
||||
|
||||
bp_test = het_breuschpagan(residuals, exog)
|
||||
lm_stat, lm_pval, f_stat, f_pval = bp_test
|
||||
|
||||
print(f"Breusch-Pagan test p-value: {lm_pval:.4f}")
|
||||
# H0: Homoskedasticity (constant variance)
|
||||
# If p-value < 0.05, reject H0 (heteroskedasticity present)
|
||||
```
|
||||
|
||||
**White Test**: More general test for heteroskedasticity
|
||||
|
||||
```python
|
||||
from statsmodels.stats.diagnostic import het_white
|
||||
|
||||
white_test = het_white(residuals, exog)
|
||||
lm_stat, lm_pval, f_stat, f_pval = white_test
|
||||
|
||||
print(f"White test p-value: {lm_pval:.4f}")
|
||||
# H0: Homoskedasticity
|
||||
```
|
||||
|
||||
**ARCH Test**: Tests for autoregressive conditional heteroskedasticity
|
||||
|
||||
```python
|
||||
from statsmodels.stats.diagnostic import het_arch
|
||||
|
||||
arch_test = het_arch(residuals, nlags=5)
|
||||
lm_stat, lm_pval, f_stat, f_pval = arch_test
|
||||
|
||||
print(f"ARCH test p-value: {lm_pval:.4f}")
|
||||
# H0: No ARCH effects
|
||||
# If significant, consider GARCH model
|
||||
```
|
||||
|
||||
### Normality Tests
|
||||
|
||||
**Jarque-Bera Test**: Tests for normality using skewness and kurtosis
|
||||
|
||||
```python
|
||||
from statsmodels.stats.stattools import jarque_bera
|
||||
|
||||
jb_stat, jb_pval, skew, kurtosis = jarque_bera(residuals)
|
||||
|
||||
print(f"Jarque-Bera statistic: {jb_stat:.4f}")
|
||||
print(f"p-value: {jb_pval:.4f}")
|
||||
print(f"Skewness: {skew:.4f}")
|
||||
print(f"Kurtosis: {kurtosis:.4f}")
|
||||
|
||||
# H0: Residuals are normally distributed
|
||||
# Normal: skewness ≈ 0, kurtosis ≈ 3
|
||||
```
|
||||
|
||||
**Omnibus Test**: Another normality test (also based on skewness/kurtosis)
|
||||
|
||||
```python
|
||||
from statsmodels.stats.stattools import omni_normtest
|
||||
|
||||
omni_stat, omni_pval = omni_normtest(residuals)
|
||||
print(f"Omnibus test p-value: {omni_pval:.4f}")
|
||||
# H0: Normality
|
||||
```
|
||||
|
||||
**Anderson-Darling Test**: Distribution fit test
|
||||
|
||||
```python
|
||||
from statsmodels.stats.diagnostic import normal_ad
|
||||
|
||||
ad_stat, ad_pval = normal_ad(residuals)
|
||||
print(f"Anderson-Darling test p-value: {ad_pval:.4f}")
|
||||
```
|
||||
|
||||
**Lilliefors Test**: Modified Kolmogorov-Smirnov test
|
||||
|
||||
```python
|
||||
from statsmodels.stats.diagnostic import lilliefors
|
||||
|
||||
lf_stat, lf_pval = lilliefors(residuals, dist='norm')
|
||||
print(f"Lilliefors test p-value: {lf_pval:.4f}")
|
||||
```
|
||||
|
||||
### Linearity and Specification Tests
|
||||
|
||||
**Ramsey RESET Test**: Tests for functional form misspecification
|
||||
|
||||
```python
|
||||
from statsmodels.stats.diagnostic import linear_reset
|
||||
|
||||
reset_test = linear_reset(results, power=2, use_f=True)

# linear_reset returns a results object, not a (statistic, p-value) tuple
print(f"RESET test p-value: {float(reset_test.pvalue):.4f}")
|
||||
# H0: Model is correctly specified (linear)
|
||||
# If rejected, may need polynomial terms or transformations
|
||||
```
|
||||
|
||||
**Harvey-Collier Test**: Tests for linearity
|
||||
|
||||
```python
|
||||
from statsmodels.stats.diagnostic import linear_harvey_collier
|
||||
|
||||
hc_stat, hc_pval = linear_harvey_collier(results)
|
||||
print(f"Harvey-Collier test p-value: {hc_pval:.4f}")
|
||||
# H0: Linear specification is correct
|
||||
```
|
||||
|
||||
## Multicollinearity Detection
|
||||
|
||||
**Variance Inflation Factor (VIF)**:
|
||||
|
||||
```python
|
||||
from statsmodels.stats.outliers_influence import variance_inflation_factor
|
||||
import pandas as pd
|
||||
|
||||
# Calculate VIF for each variable
|
||||
vif_data = pd.DataFrame()
|
||||
vif_data["Variable"] = X.columns
|
||||
vif_data["VIF"] = [variance_inflation_factor(X.values, i)
|
||||
for i in range(X.shape[1])]
|
||||
|
||||
print(vif_data.sort_values('VIF', ascending=False))
|
||||
|
||||
# Interpretation:
|
||||
# VIF = 1: No correlation with other predictors
|
||||
# VIF > 5: Moderate multicollinearity
|
||||
# VIF > 10: Serious multicollinearity problem
|
||||
# VIF > 20: Severe multicollinearity (consider removing variable)
|
||||
```
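
The VIF for predictor j is 1 / (1 − R²_j), where R²_j comes from regressing column j on the remaining columns. The NumPy-only sketch below computes it by hand; the helper name `vif_by_hand` and the simulated predictors are assumptions for illustration, and the design matrix is assumed to include a constant column.

```python
import numpy as np

def vif_by_hand(X, j):
    """VIF of column j: regress it on all other columns (constant included)."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    beta, *_ = np.linalg.lstsq(others, target, rcond=None)
    resid = target - others @ beta
    r_squared = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
    return 1.0 / (1.0 - r_squared)

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)              # unrelated to the others
X = np.column_stack([np.ones(n), x1, x2, x3])

# Skip column 0: the VIF of the constant itself is not meaningful
vifs = {j: vif_by_hand(X, j) for j in range(1, X.shape[1])}
for j, value in vifs.items():
    print(f"column {j}: VIF = {value:.2f}")
```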
|
||||
|
||||
**Condition Number**: From regression results
|
||||
|
||||
```python
|
||||
print(f"Condition number: {results.condition_number:.2f}")
|
||||
|
||||
# Interpretation:
|
||||
# < 10: No multicollinearity concern
|
||||
# 10-30: Moderate multicollinearity
|
||||
# > 30: Strong multicollinearity
|
||||
# > 100: Severe multicollinearity
|
||||
```
|
||||
|
||||
## Influence and Outlier Detection
|
||||
|
||||
### Leverage
|
||||
|
||||
High leverage points have extreme predictor values.
|
||||
|
||||
```python
|
||||
from statsmodels.stats.outliers_influence import OLSInfluence
|
||||
|
||||
influence = results.get_influence()
|
||||
|
||||
# Hat values (leverage)
|
||||
leverage = influence.hat_matrix_diag
|
||||
|
||||
# Rule of thumb: leverage > 2*p/n or 3*p/n is high
|
||||
# p = number of parameters, n = sample size
|
||||
threshold = 2 * len(results.params) / len(y)
|
||||
high_leverage = np.where(leverage > threshold)[0]
|
||||
|
||||
print(f"High leverage observations: {high_leverage}")
|
||||
```
|
||||
|
||||
### Cook's Distance
|
||||
|
||||
Measures overall influence of each observation.
|
||||
|
||||
```python
|
||||
# Cook's distance
|
||||
cooks_d = influence.cooks_distance[0]
|
||||
|
||||
# Rule of thumb: Cook's D > 4/n is influential
|
||||
threshold = 4 / len(y)
|
||||
influential = np.where(cooks_d > threshold)[0]
|
||||
|
||||
print(f"Influential observations (Cook's D): {influential}")
|
||||
|
||||
# Plot
|
||||
import matplotlib.pyplot as plt
|
||||
plt.stem(range(len(cooks_d)), cooks_d)
|
||||
plt.axhline(y=threshold, color='r', linestyle='--', label='Threshold (4/n)')
|
||||
plt.xlabel('Observation')
|
||||
plt.ylabel("Cook's Distance")
|
||||
plt.legend()
|
||||
plt.show()
|
||||
```
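
For reference, Cook's distance can be written as D_i = e_i² · h_ii / (p · s² · (1 − h_ii)²), combining the residual e_i with the leverage h_ii. The NumPy-only sketch below computes it directly and checks that a planted high-leverage outlier is flagged by the 4/n rule; the simulated data and the planted point are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

# Plant one point that is both far out in x and far from the regression line
x[0], y[0] = 6.0, -5.0

X = np.column_stack([np.ones(n), x])
p = X.shape[1]

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
h = np.diag(H)                          # leverage values
resid = y - H @ y
s2 = resid @ resid / (n - p)            # residual variance estimate

cooks = resid ** 2 * h / (p * s2 * (1 - h) ** 2)
flagged = np.where(cooks > 4 / n)[0]

print("Most influential observation:", int(np.argmax(cooks)))
print("Flagged by 4/n rule:", flagged)
```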
|
||||
|
||||
### DFFITS
|
||||
|
||||
Measures influence on fitted value.
|
||||
|
||||
```python
|
||||
# DFFITS
|
||||
dffits = influence.dffits[0]
|
||||
|
||||
# Rule of thumb: |DFFITS| > 2*sqrt(p/n) is influential
|
||||
p = len(results.params)
|
||||
n = len(y)
|
||||
threshold = 2 * np.sqrt(p / n)
|
||||
|
||||
influential_dffits = np.where(np.abs(dffits) > threshold)[0]
|
||||
print(f"Influential observations (DFFITS): {influential_dffits}")
|
||||
```
|
||||
|
||||
### DFBETAs
|
||||
|
||||
Measures influence on each coefficient.
|
||||
|
||||
```python
|
||||
# DFBETAs (one for each parameter)
|
||||
dfbetas = influence.dfbetas
|
||||
|
||||
# Rule of thumb: |DFBETA| > 2/sqrt(n)
|
||||
threshold = 2 / np.sqrt(n)
|
||||
|
||||
for i, param_name in enumerate(results.params.index):
    influential = np.where(np.abs(dfbetas[:, i]) > threshold)[0]
    if len(influential) > 0:
        print(f"Influential for {param_name}: {influential}")
|
||||
```
|
||||
|
||||
### Influence Plot
|
||||
|
||||
```python
|
||||
from statsmodels.graphics.regressionplots import influence_plot
|
||||
|
||||
fig, ax = plt.subplots(figsize=(12, 8))
|
||||
influence_plot(results, ax=ax, criterion='cooks')
|
||||
plt.show()
|
||||
|
||||
# Combines leverage, residuals, and Cook's distance
|
||||
# Large bubbles = high Cook's distance
|
||||
# Far from x=0 = high leverage
|
||||
# Far from y=0 = large residual
|
||||
```
|
||||
|
||||
### Studentized Residuals
|
||||
|
||||
```python
|
||||
# Studentized residuals (outliers)
|
||||
student_resid = influence.resid_studentized_internal
|
||||
|
||||
# External studentized residuals (more conservative)
|
||||
student_resid_external = influence.resid_studentized_external
|
||||
|
||||
# Outliers: |studentized residual| > 3 (or > 2.5)
|
||||
outliers = np.where(np.abs(student_resid_external) > 3)[0]
|
||||
print(f"Outliers: {outliers}")
|
||||
```
|
||||
|
||||
## Hypothesis Testing
|
||||
|
||||
### t-tests
|
||||
|
||||
**One-sample t-test**: Test if mean equals specific value
|
||||
|
||||
```python
|
||||
from scipy import stats
|
||||
|
||||
# H0: population mean = mu_0
|
||||
t_stat, p_value = stats.ttest_1samp(data, popmean=mu_0)
|
||||
|
||||
print(f"t-statistic: {t_stat:.4f}")
|
||||
print(f"p-value: {p_value:.4f}")
|
||||
```
|
||||
|
||||
**Two-sample t-test**: Compare means of two groups
|
||||
|
||||
```python
|
||||
# H0: mean1 = mean2 (equal variances)
|
||||
t_stat, p_value = stats.ttest_ind(group1, group2)
|
||||
|
||||
# Welch's t-test (unequal variances)
|
||||
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)
|
||||
|
||||
print(f"t-statistic: {t_stat:.4f}")
|
||||
print(f"p-value: {p_value:.4f}")
|
||||
```
|
||||
|
||||
**Paired t-test**: Compare paired observations
|
||||
|
||||
```python
|
||||
# H0: mean difference = 0
|
||||
t_stat, p_value = stats.ttest_rel(before, after)
|
||||
|
||||
print(f"t-statistic: {t_stat:.4f}")
|
||||
print(f"p-value: {p_value:.4f}")
|
||||
```
|
||||
|
||||
### Proportion Tests
|
||||
|
||||
**One-proportion test**:
|
||||
|
||||
```python
|
||||
from statsmodels.stats.proportion import proportions_ztest
|
||||
|
||||
# H0: proportion = p0
|
||||
count = 45 # successes
|
||||
nobs = 100 # total observations
|
||||
p0 = 0.5 # hypothesized proportion
|
||||
|
||||
z_stat, p_value = proportions_ztest(count, nobs, value=p0)
|
||||
|
||||
print(f"z-statistic: {z_stat:.4f}")
|
||||
print(f"p-value: {p_value:.4f}")
|
||||
```
|
||||
|
||||
**Two-proportion test**:
|
||||
|
||||
```python
|
||||
# H0: proportion1 = proportion2
|
||||
counts = [45, 60]
|
||||
nobs = [100, 120]
|
||||
|
||||
z_stat, p_value = proportions_ztest(counts, nobs)
|
||||
print(f"z-statistic: {z_stat:.4f}")
|
||||
print(f"p-value: {p_value:.4f}")
|
||||
```
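
The two-proportion z statistic can be reproduced by hand with the pooled-variance formula, which, to my understanding, is what `proportions_ztest` uses by default. The sketch below uses only the standard library and the same counts as above.

```python
import math

counts = [45, 60]
nobs = [100, 120]

p1 = counts[0] / nobs[0]
p2 = counts[1] / nobs[1]
pooled = sum(counts) / sum(nobs)   # common proportion under H0

# Standard error of p1 - p2 under H0
se = math.sqrt(pooled * (1 - pooled) * (1 / nobs[0] + 1 / nobs[1]))
z = (p1 - p2) / se

# Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|))
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"z-statistic: {z:.4f}")
print(f"p-value: {p_value:.4f}")
```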
|
||||
|
||||
### Chi-square Tests
|
||||
|
||||
**Chi-square test of independence**:
|
||||
|
||||
```python
|
||||
from scipy.stats import chi2_contingency
|
||||
|
||||
# Contingency table
|
||||
contingency_table = pd.crosstab(variable1, variable2)
|
||||
|
||||
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
|
||||
|
||||
print(f"Chi-square statistic: {chi2:.4f}")
|
||||
print(f"p-value: {p_value:.4f}")
|
||||
print(f"Degrees of freedom: {dof}")
|
||||
|
||||
# H0: Variables are independent
|
||||
```
|
||||
|
||||
**Chi-square goodness-of-fit**:
|
||||
|
||||
```python
|
||||
from scipy.stats import chisquare
|
||||
|
||||
# Observed frequencies
|
||||
observed = [20, 30, 25, 25]
|
||||
|
||||
# Expected frequencies (equal by default)
|
||||
expected = [25, 25, 25, 25]
|
||||
|
||||
chi2, p_value = chisquare(observed, expected)
|
||||
|
||||
print(f"Chi-square statistic: {chi2:.4f}")
|
||||
print(f"p-value: {p_value:.4f}")
|
||||
|
||||
# H0: Data follow the expected distribution
|
||||
```
|
||||
|
||||
### Non-parametric Tests
|
||||
|
||||
**Mann-Whitney U test** (independent samples):
|
||||
|
||||
```python
|
||||
from scipy.stats import mannwhitneyu
|
||||
|
||||
# H0: Distributions are equal
|
||||
u_stat, p_value = mannwhitneyu(group1, group2, alternative='two-sided')
|
||||
|
||||
print(f"U statistic: {u_stat:.4f}")
|
||||
print(f"p-value: {p_value:.4f}")
|
||||
```
|
||||
|
||||
**Wilcoxon signed-rank test** (paired samples):
|
||||
|
||||
```python
|
||||
from scipy.stats import wilcoxon
|
||||
|
||||
# H0: Median difference = 0
|
||||
w_stat, p_value = wilcoxon(before, after)
|
||||
|
||||
print(f"W statistic: {w_stat:.4f}")
|
||||
print(f"p-value: {p_value:.4f}")
|
||||
```
|
||||
|
||||
**Kruskal-Wallis H test** (>2 groups):
|
||||
|
||||
```python
|
||||
from scipy.stats import kruskal
|
||||
|
||||
# H0: All groups have same distribution
|
||||
h_stat, p_value = kruskal(group1, group2, group3)
|
||||
|
||||
print(f"H statistic: {h_stat:.4f}")
|
||||
print(f"p-value: {p_value:.4f}")
|
||||
```
|
||||
|
||||
**Sign test**:
|
||||
|
||||
```python
|
||||
from statsmodels.stats.descriptivestats import sign_test
|
||||
|
||||
# H0: Median = mu0 (returns the M statistic and the p-value)
result = sign_test(data, mu0=0)
|
||||
print(result)
|
||||
```
|
||||
|
||||
### ANOVA
|
||||
|
||||
**One-way ANOVA**:
|
||||
|
||||
```python
|
||||
from scipy.stats import f_oneway
|
||||
|
||||
# H0: All group means are equal
|
||||
f_stat, p_value = f_oneway(group1, group2, group3)
|
||||
|
||||
print(f"F-statistic: {f_stat:.4f}")
|
||||
print(f"p-value: {p_value:.4f}")
|
||||
```
|
||||
|
||||
**Two-way ANOVA** (with statsmodels):
|
||||
|
||||
```python
|
||||
from statsmodels.formula.api import ols
|
||||
from statsmodels.stats.anova import anova_lm
|
||||
|
||||
# Fit model
|
||||
model = ols('response ~ C(factor1) + C(factor2) + C(factor1):C(factor2)',
|
||||
data=df).fit()
|
||||
|
||||
# ANOVA table
|
||||
anova_table = anova_lm(model, typ=2)
|
||||
print(anova_table)
|
||||
```
|
||||
|
||||
**Repeated measures ANOVA**:
|
||||
|
||||
```python
|
||||
from statsmodels.stats.anova import AnovaRM
|
||||
|
||||
# Requires long-format data
|
||||
aovrm = AnovaRM(df, depvar='score', subject='subject_id', within=['time'])
|
||||
results = aovrm.fit()
|
||||
|
||||
print(results.summary())
|
||||
```
|
||||
|
||||
## Multiple Comparisons
|
||||
|
||||
### Post-hoc Tests
|
||||
|
||||
**Tukey's HSD** (Honest Significant Difference):
|
||||
|
||||
```python
|
||||
import matplotlib.pyplot as plt
from statsmodels.stats.multicomp import pairwise_tukeyhsd
|
||||
|
||||
# Perform Tukey HSD test
|
||||
tukey = pairwise_tukeyhsd(data, groups, alpha=0.05)
|
||||
|
||||
print(tukey.summary())
|
||||
|
||||
# Plot confidence intervals
|
||||
tukey.plot_simultaneous()
|
||||
plt.show()
|
||||
```
|
||||
|
||||
**Bonferroni correction**:
|
||||
|
||||
```python
|
||||
from statsmodels.stats.multitest import multipletests
|
||||
|
||||
# P-values from multiple tests
|
||||
p_values = [0.01, 0.03, 0.04, 0.15, 0.001]
|
||||
|
||||
# Apply correction
|
||||
reject, pvals_corrected, alphac_sidak, alphac_bonf = multipletests(
|
||||
p_values,
|
||||
alpha=0.05,
|
||||
method='bonferroni'
|
||||
)
|
||||
|
||||
print("Rejected:", reject)
|
||||
print("Corrected p-values:", pvals_corrected)
|
||||
```
|
||||
|
||||
**False Discovery Rate (FDR)**:
|
||||
|
||||
```python
|
||||
# FDR correction (less conservative than Bonferroni)
|
||||
reject, pvals_corrected, alphac_sidak, alphac_bonf = multipletests(
|
||||
p_values,
|
||||
alpha=0.05,
|
||||
method='fdr_bh' # Benjamini-Hochberg
|
||||
)
|
||||
|
||||
print("Rejected:", reject)
|
||||
print("Corrected p-values:", pvals_corrected)
|
||||
```
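To make the two corrections concrete, here is a minimal NumPy sketch of what the Bonferroni and Benjamini-Hochberg adjustments compute. The helper names `bonferroni_adjust` and `bh_adjust` are hypothetical (they are not statsmodels functions); their output should agree with `pvals_corrected` from `multipletests` up to floating-point error.

```python
import numpy as np

def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: p * m, capped at 1."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * p.size, 1.0)

def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                         # indices of ascending p-values
    scaled = p[order] * m / np.arange(1, m + 1)   # p_(i) * m / i
    # Enforce monotonicity, working down from the largest p-value
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

p_values = [0.01, 0.03, 0.04, 0.15, 0.001]
print("Bonferroni:", bonferroni_adjust(p_values))
print("Benjamini-Hochberg:", bh_adjust(p_values))
```

Every Benjamini-Hochberg adjusted value is at most its Bonferroni counterpart, which is why FDR control rejects at least as many hypotheses at the same `alpha`.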
|
||||
|
||||
## Robust Covariance Matrices
|
||||
|
||||
### Heteroskedasticity-Consistent (HC) Standard Errors
|
||||
|
||||
```python
|
||||
# After fitting OLS
|
||||
results = sm.OLS(y, X).fit()
|
||||
|
||||
# HC0 (White's heteroskedasticity-consistent SEs)
|
||||
results_hc0 = results.get_robustcov_results(cov_type='HC0')
|
||||
|
||||
# HC1 (degrees of freedom adjustment)
|
||||
results_hc1 = results.get_robustcov_results(cov_type='HC1')
|
||||
|
||||
# HC2 (leverage adjustment)
|
||||
results_hc2 = results.get_robustcov_results(cov_type='HC2')
|
||||
|
||||
# HC3 (most conservative, recommended for small samples)
|
||||
results_hc3 = results.get_robustcov_results(cov_type='HC3')
|
||||
|
||||
print("Standard OLS SEs:", results.bse)
|
||||
print("Robust HC3 SEs:", results_hc3.bse)
|
||||
```
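For intuition about what `cov_type='HC0'` computes, the following NumPy sketch builds the White sandwich estimator directly. It is an illustration under stated assumptions, not the statsmodels implementation: `ols_hc0` is a hypothetical helper and the data are simulated with error variance that grows with `x`.

```python
import numpy as np

def ols_hc0(X, y):
    """OLS coefficients with classical and HC0 (White) standard errors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta

    # Classical covariance: sigma^2 (X'X)^-1
    sigma2 = resid @ resid / (n - k)
    cov_classical = sigma2 * xtx_inv

    # HC0 sandwich: (X'X)^-1 X' diag(e_i^2) X (X'X)^-1
    meat = (X * resid[:, None] ** 2).T @ X
    cov_hc0 = xtx_inv @ meat @ xtx_inv

    return beta, np.sqrt(np.diag(cov_classical)), np.sqrt(np.diag(cov_hc0))

# Simulated heteroskedastic data (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=500)
y_sim = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x)
X_sim = np.column_stack([np.ones_like(x), x])

beta, se_classical, se_hc0 = ols_hc0(X_sim, y_sim)
print("Coefficients:", beta)
print("Classical SEs:", se_classical)
print("HC0 SEs:", se_hc0)
```

HC1 to HC3 use the same sandwich form and differ only in how the squared residuals are rescaled.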
|
||||
|
||||
### HAC (Heteroskedasticity and Autocorrelation Consistent)
|
||||
|
||||
**Newey-West standard errors**:
|
||||
|
||||
```python
|
||||
# For time series with autocorrelation and heteroskedasticity
|
||||
results_hac = results.get_robustcov_results(cov_type='HAC', maxlags=4)
|
||||
|
||||
print("HAC (Newey-West) SEs:", results_hac.bse)
|
||||
print(results_hac.summary())
|
||||
```
|
||||
|
||||
### Cluster-Robust Standard Errors
|
||||
|
||||
```python
|
||||
# For clustered/grouped data
|
||||
results_cluster = results.get_robustcov_results(
|
||||
cov_type='cluster',
|
||||
groups=cluster_ids
|
||||
)
|
||||
|
||||
print("Cluster-robust SEs:", results_cluster.bse)
|
||||
```
|
||||
|
||||
## Descriptive Statistics
|
||||
|
||||
**Basic descriptive statistics**:
|
||||
|
||||
```python
|
||||
from statsmodels.stats.api import DescrStatsW
|
||||
|
||||
# Comprehensive descriptive stats
|
||||
desc = DescrStatsW(data)
|
||||
|
||||
print("Mean:", desc.mean)
|
||||
print("Std Dev:", desc.std)
|
||||
print("Variance:", desc.var)
|
||||
print("Confidence interval:", desc.tconfint_mean())
|
||||
|
||||
# Quantiles
|
||||
print("Median:", desc.quantile(0.5))
|
||||
print("Quartiles (Q1, Q3):", desc.quantile([0.25, 0.75]))
|
||||
```
|
||||
|
||||
**Weighted statistics**:
|
||||
|
||||
```python
|
||||
# With weights
|
||||
desc_weighted = DescrStatsW(data, weights=weights)
|
||||
|
||||
print("Weighted mean:", desc_weighted.mean)
|
||||
print("Weighted std:", desc_weighted.std)
|
||||
```
|
||||
|
||||
**Compare two groups**:
|
||||
|
||||
```python
|
||||
from statsmodels.stats.weightstats import CompareMeans
|
||||
|
||||
# Create comparison object
|
||||
cm = CompareMeans(DescrStatsW(group1), DescrStatsW(group2))
|
||||
|
||||
# t-test
|
||||
print("t-test:", cm.ttest_ind())
|
||||
|
||||
# Confidence interval for difference
|
||||
print("CI for difference:", cm.tconfint_diff())
|
||||
|
||||
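# Test for equal variances (CompareMeans does not provide one;
# Levene's test from SciPy is one option)
from scipy.stats import levene
print("Equal variance test:", levene(group1, group2))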
|
||||
```
|
||||
|
||||
## Power Analysis and Sample Size
|
||||
|
||||
**Power for t-test**:
|
||||
|
||||
```python
|
||||
from statsmodels.stats.power import tt_ind_solve_power
|
||||
|
||||
# Solve for sample size
|
||||
effect_size = 0.5 # Cohen's d
|
||||
alpha = 0.05
|
||||
power = 0.8
|
||||
|
||||
n = tt_ind_solve_power(effect_size=effect_size,
|
||||
alpha=alpha,
|
||||
power=power,
|
||||
alternative='two-sided')
|
||||
|
||||
print(f"Required sample size per group: {n:.0f}")
|
||||
|
||||
# Solve for power given n
|
||||
power = tt_ind_solve_power(effect_size=0.5,
|
||||
nobs1=50,
|
||||
alpha=0.05,
|
||||
alternative='two-sided')
|
||||
|
||||
print(f"Power: {power:.4f}")
|
||||
```
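As a sanity check on the solver output, the required sample size can be approximated with the normal-approximation formula n ≈ 2(z₁₋α/₂ + z_power)² / d². The sketch below uses only the standard library; `approx_n_per_group` is a hypothetical helper, and the t-based solver above typically returns a slightly larger value because it accounts for estimating the variance.

```python
from math import ceil
from statistics import NormalDist

def approx_n_per_group(effect_size, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group (two-sided, two-sample test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value
    z_power = NormalDist().inv_cdf(power)           # quantile for target power
    return 2 * (z_alpha + z_power) ** 2 / effect_size ** 2

n_approx = approx_n_per_group(0.5)
print(f"Approximate n per group: {n_approx:.1f} (round up to {ceil(n_approx)})")
```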
|
||||
|
||||
**Power for proportion test**:
|
||||
|
||||
```python
|
||||
from statsmodels.stats.power import zt_ind_solve_power
|
||||
|
||||
# For proportion tests (z-test)
|
||||
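effect_size = 0.3  # Cohen's h (standardized), not the raw difference in proportions;
                   # see statsmodels.stats.proportion.proportion_effectsize(p1, p2)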
|
||||
alpha = 0.05
|
||||
power = 0.8
|
||||
|
||||
n = zt_ind_solve_power(effect_size=effect_size,
|
||||
alpha=alpha,
|
||||
power=power,
|
||||
alternative='two-sided')
|
||||
|
||||
print(f"Required sample size per group: {n:.0f}")
|
||||
```
|
||||
|
||||
**Power curves**:
|
||||
|
||||
```python
|
||||
from statsmodels.stats.power import TTestIndPower
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
# Create power analysis object
|
||||
analysis = TTestIndPower()
|
||||
|
||||
# Plot power curves for different sample sizes
|
||||
sample_sizes = range(10, 200, 10)
|
||||
effect_sizes = [0.2, 0.5, 0.8] # Small, medium, large
|
||||
|
||||
fig, ax = plt.subplots(figsize=(10, 6))
|
||||
|
||||
for es in effect_sizes:
    power = [analysis.solve_power(effect_size=es, nobs1=n, alpha=0.05)
             for n in sample_sizes]
    ax.plot(sample_sizes, power, label=f'Effect size = {es}')
|
||||
|
||||
ax.axhline(y=0.8, color='r', linestyle='--', label='Power = 0.8')
|
||||
ax.set_xlabel('Sample size per group')
|
||||
ax.set_ylabel('Power')
|
||||
ax.set_title('Power Curves for Two-Sample t-test')
|
||||
ax.legend()
|
||||
ax.grid(True, alpha=0.3)
|
||||
plt.show()
|
||||
```
|
||||
|
||||
## Effect Sizes
|
||||
|
||||
**Cohen's d** (standardized mean difference):
|
||||
|
||||
```python
|
||||
import numpy as np

def cohens_d(group1, group2):
    """Calculate Cohen's d for independent samples."""
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)

    # Pooled standard deviation
    pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))

    # Cohen's d
    d = (np.mean(group1) - np.mean(group2)) / pooled_std

    return d
|
||||
|
||||
d = cohens_d(group1, group2)
|
||||
print(f"Cohen's d: {d:.4f}")
|
||||
|
||||
# Interpretation:
|
||||
# |d| < 0.2: negligible
|
||||
# |d| ~ 0.2: small
|
||||
# |d| ~ 0.5: medium
|
||||
# |d| ~ 0.8: large
|
||||
```
|
||||
|
||||
**Eta-squared** (for ANOVA):
|
||||
|
||||
```python
|
||||
# From ANOVA table
|
||||
# η² = SS_between / SS_total
|
||||
|
||||
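def eta_squared(anova_table):
    # First row = effect of interest; summing all rows (including Residual)
    # gives the total. Use .iloc for positional access on the labelled index.
    return anova_table['sum_sq'].iloc[0] / anova_table['sum_sq'].sum()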
|
||||
|
||||
# After running ANOVA
|
||||
eta_sq = eta_squared(anova_table)
|
||||
print(f"Eta-squared: {eta_sq:.4f}")
|
||||
|
||||
# Interpretation:
|
||||
# 0.01: small effect
|
||||
# 0.06: medium effect
|
||||
# 0.14: large effect
|
||||
```
|
||||
|
||||
## Contingency Tables and Association
|
||||
|
||||
**McNemar's test** (paired binary data):
|
||||
|
||||
```python
|
||||
from statsmodels.stats.contingency_tables import mcnemar
|
||||
|
||||
# 2x2 contingency table
|
||||
table = [[a, b],
|
||||
[c, d]]
|
||||
|
||||
result = mcnemar(table, exact=True) # or exact=False for large samples
|
||||
print(f"p-value: {result.pvalue:.4f}")
|
||||
|
||||
# H0: Marginal probabilities are equal
|
||||
```
|
||||
|
||||
**Cochran-Mantel-Haenszel test**:
|
||||
|
||||
```python
|
||||
from statsmodels.stats.contingency_tables import StratifiedTable
|
||||
|
||||
# For stratified 2x2 tables
|
||||
strat_table = StratifiedTable(tables_list)
|
||||
result = strat_table.test_null_odds()
|
||||
|
||||
print(f"p-value: {result.pvalue:.4f}")
|
||||
```
|
||||
|
||||
## Treatment Effects and Causal Inference
|
||||
|
||||
**Propensity score matching**:
|
||||
|
||||
```python
|
||||
import statsmodels.api as sm
|
||||
|
||||
# Estimate propensity scores
|
||||
ps_model = sm.Logit(treatment, X).fit()
|
||||
propensity_scores = ps_model.predict(X)
|
||||
|
||||
# Use for matching or weighting
|
||||
# (manual implementation of matching needed)
|
||||
```
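Since the matching step has to be written by hand, here is one minimal way to do it: 1-nearest-neighbor matching with replacement on the propensity score, estimating the average treatment effect on the treated (ATT). This is a sketch under assumptions: `att_nearest_neighbor` is a hypothetical helper, the numbers are made up, and a real analysis would add calipers, balance checks, and appropriate standard errors.

```python
import numpy as np

def att_nearest_neighbor(ps, treated, outcome):
    """ATT via 1-nearest-neighbor matching (with replacement) on propensity scores."""
    ps = np.asarray(ps, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    outcome = np.asarray(outcome, dtype=float)

    ps_t, ps_c = ps[treated], ps[~treated]
    y_t, y_c = outcome[treated], outcome[~treated]

    # Distance matrix: rows = treated units, columns = control units
    nearest = np.abs(ps_t[:, None] - ps_c[None, :]).argmin(axis=1)

    # Mean difference between each treated unit and its matched control
    return float(np.mean(y_t - y_c[nearest]))

# Toy data (illustrative only)
ps_demo = [0.80, 0.75, 0.60, 0.55, 0.20]
treated_demo = [1, 0, 1, 0, 0]
outcome_demo = [10.0, 7.0, 8.0, 6.0, 3.0]

att = att_nearest_neighbor(ps_demo, treated_demo, outcome_demo)
print(f"ATT estimate: {att:.2f}")
```

In this toy data the treated unit with score 0.80 is matched to the control at 0.75 and the one at 0.60 to the control at 0.55, so the estimate is the mean of 10 − 7 and 8 − 6.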
|
||||
|
||||
**Difference-in-differences**:
|
||||
|
||||
```python
|
||||
from statsmodels.formula.api import ols

# DiD formula: outcome ~ treatment * post
|
||||
model = ols('outcome ~ treatment + post + treatment:post', data=df).fit()
|
||||
|
||||
# DiD estimate is the interaction coefficient
|
||||
did_estimate = model.params['treatment:post']
|
||||
print(f"DiD estimate: {did_estimate:.4f}")
|
||||
```
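The reason the interaction coefficient is the DiD estimate can be checked with four group means. The sketch below uses hypothetical means and plain NumPy (no statsmodels) to show that the regression interaction term equals the difference of the two before/after differences.

```python
import numpy as np

# Hypothetical group means: (treatment, post) -> mean outcome
cell_means = {(0, 0): 10.0, (0, 1): 12.0, (1, 0): 11.0, (1, 1): 16.0}

# Difference-in-differences computed directly from the four means
did_by_hand = ((cell_means[1, 1] - cell_means[1, 0])
               - (cell_means[0, 1] - cell_means[0, 0]))

# Same quantity as the interaction coefficient of the saturated regression
# Columns: const, treatment, post, treatment*post
X_did = np.array([(1.0, t, p, t * p) for (t, p) in cell_means])
y_did = np.array(list(cell_means.values()))
coef, *_ = np.linalg.lstsq(X_did, y_did, rcond=None)

print(f"DiD by hand: {did_by_hand:.1f}")
print(f"Interaction coefficient: {coef[3]:.1f}")
```

The treated group changed by 5 and the control group by 2, so both routes give 3. Interpreting this as a causal effect still rests on the parallel-trends assumption.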
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Always check assumptions**: Test before interpreting results
|
||||
2. **Report effect sizes**: Not just p-values
|
||||
3. **Use appropriate tests**: Match test to data type and distribution
|
||||
4. **Correct for multiple comparisons**: When conducting many tests
|
||||
5. **Check sample size**: Ensure adequate power
|
||||
6. **Visual inspection**: Plot data before testing
|
||||
7. **Report confidence intervals**: Along with point estimates
|
||||
8. **Consider alternatives**: Non-parametric when assumptions violated
|
||||
9. **Robust standard errors**: Use when heteroskedasticity/autocorrelation present
|
||||
10. **Document decisions**: Note which tests used and why
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
1. **Not checking test assumptions**: May invalidate results
|
||||
2. **Multiple testing without correction**: Inflated Type I error
|
||||
3. **Using parametric tests on non-normal data**: Consider non-parametric
|
||||
4. **Ignoring heteroskedasticity**: Use robust SEs
|
||||
5. **Confusing statistical and practical significance**: Check effect sizes
|
||||
6. **Not reporting confidence intervals**: Only p-values insufficient
|
||||
7. **Using wrong test**: Match test to research question
|
||||
8. **Insufficient power**: Risk of Type II error (false negatives)
|
||||
9. **p-hacking**: Testing many specifications until significant
|
||||
10. **Overinterpreting p-values**: Remember limitations of NHST
|
||||
716
skills/statsmodels/references/time_series.md
Normal file
@@ -0,0 +1,716 @@
|
||||
# Time Series Analysis Reference
|
||||
|
||||
This document provides comprehensive guidance on time series models in statsmodels, including ARIMA, state space models, VAR, exponential smoothing, and forecasting methods.
|
||||
|
||||
## Overview
|
||||
|
||||
Statsmodels offers extensive time series capabilities:
|
||||
- **Univariate models**: AR, ARIMA, SARIMAX, Exponential Smoothing
|
||||
- **Multivariate models**: VAR, VARMAX, Dynamic Factor Models
|
||||
- **State space framework**: Custom models, Kalman filtering
|
||||
- **Diagnostic tools**: ACF, PACF, stationarity tests, residual analysis
|
||||
- **Forecasting**: Point forecasts and prediction intervals
|
||||
|
||||
## Univariate Time Series Models
|
||||
|
||||
### AutoReg (AR Model)
|
||||
|
||||
Autoregressive model: current value depends on past values.
|
||||
|
||||
**When to use:**
|
||||
- Univariate time series
|
||||
- Past values predict future
|
||||
- Stationary series
|
||||
|
||||
**Model**: yₜ = c + φ₁yₜ₋₁ + φ₂yₜ₋₂ + ... + φₚyₜ₋ₚ + εₜ
|
||||
|
||||
```python
|
||||
from statsmodels.tsa.ar_model import AutoReg
|
||||
import pandas as pd
|
||||
|
||||
# Fit AR(p) model
|
||||
model = AutoReg(y, lags=5) # AR(5)
|
||||
results = model.fit()
|
||||
|
||||
print(results.summary())
|
||||
```
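To see what the AR coefficients mean, the sketch below simulates an AR(1) process with known parameters and recovers them by regressing yₜ on yₜ₋₁ with NumPy. This is an illustration of the model equation rather than the `AutoReg` estimation code; the parameter values and names (`phi_true`, `y_ar`) are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
phi_true, c_true, n_obs = 0.7, 1.0, 2000

# Simulate y_t = c + phi * y_{t-1} + eps_t
y_ar = np.empty(n_obs)
y_ar[0] = c_true / (1 - phi_true)   # start at the unconditional mean
for t in range(1, n_obs):
    y_ar[t] = c_true + phi_true * y_ar[t - 1] + rng.normal()

# Regress y_t on [1, y_{t-1}] (conditional least squares)
X_lag = np.column_stack([np.ones(n_obs - 1), y_ar[:-1]])
(c_hat, phi_hat), *_ = np.linalg.lstsq(X_lag, y_ar[1:], rcond=None)

print(f"Estimated c: {c_hat:.3f}, estimated phi: {phi_hat:.3f}")
```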
|
||||
|
||||
**With exogenous regressors:**
|
||||
```python
|
||||
# AR with exogenous variables (ARX)
|
||||
model = AutoReg(y, lags=5, exog=X_exog)
|
||||
results = model.fit()
|
||||
```
|
||||
|
||||
**Seasonal AR:**
|
||||
```python
|
||||
# Seasonal dummies plus 12 lags (e.g., monthly data with yearly seasonality)
# period is required unless y has a pandas index with a frequency
model = AutoReg(y, lags=12, seasonal=True, period=12)
|
||||
results = model.fit()
|
||||
```
|
||||
|
||||
### ARIMA (Autoregressive Integrated Moving Average)
|
||||
|
||||
Combines AR, differencing (I), and MA components.
|
||||
|
||||
**When to use:**
|
||||
- Non-stationary time series (needs differencing)
|
||||
- Past values and errors predict future
|
||||
- Flexible model for many time series
|
||||
|
||||
**Model**: ARIMA(p,d,q)
|
||||
- p: AR order (lags)
|
||||
- d: differencing order (to achieve stationarity)
|
||||
- q: MA order (lagged forecast errors)
|
||||
|
||||
```python
|
||||
from statsmodels.tsa.arima.model import ARIMA
|
||||
|
||||
# Fit ARIMA(p,d,q)
|
||||
model = ARIMA(y, order=(1, 1, 1)) # ARIMA(1,1,1)
|
||||
results = model.fit()
|
||||
|
||||
print(results.summary())
|
||||
```
|
||||
|
||||
**Choosing p, d, q:**
|
||||
|
||||
1. **Determine d (differencing order)**:
|
||||
```python
|
||||
from statsmodels.tsa.stattools import adfuller
|
||||
|
||||
# ADF test for stationarity
|
||||
def check_stationarity(series):
    result = adfuller(series)
    print(f"ADF Statistic: {result[0]:.4f}")
    print(f"p-value: {result[1]:.4f}")
    if result[1] <= 0.05:
        print("Series is stationary")
        return True
    else:
        print("Series is non-stationary, needs differencing")
        return False

# Test original series
if not check_stationarity(y):
    # Difference once
    y_diff = y.diff().dropna()
    if not check_stationarity(y_diff):
        # Difference again
        y_diff2 = y_diff.diff().dropna()
        check_stationarity(y_diff2)
|
||||
```
|
||||
|
||||
2. **Determine p and q (ACF/PACF)**:
|
||||
```python
|
||||
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
# After differencing to stationarity
|
||||
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
|
||||
|
||||
# ACF: helps determine q (MA order)
|
||||
plot_acf(y_stationary, lags=40, ax=ax1)
|
||||
ax1.set_title('Autocorrelation Function (ACF)')
|
||||
|
||||
# PACF: helps determine p (AR order)
|
||||
plot_pacf(y_stationary, lags=40, ax=ax2)
|
||||
ax2.set_title('Partial Autocorrelation Function (PACF)')
|
||||
|
||||
plt.tight_layout()
|
||||
plt.show()
|
||||
|
||||
# Rules of thumb:
|
||||
# - PACF cuts off at lag p → AR(p)
|
||||
# - ACF cuts off at lag q → MA(q)
|
||||
# - Both decay → ARMA(p,q)
|
||||
```
|
||||
|
||||
3. **Model selection (AIC/BIC)**:
|
||||
```python
|
||||
# Grid search for best (p,q) given d
import numpy as np

d = 1  # differencing order chosen in step 1
best_aic = np.inf
best_order = None

for p in range(5):
    for q in range(5):
        try:
            model = ARIMA(y, order=(p, d, q))
            results = model.fit()
            if results.aic < best_aic:
                best_aic = results.aic
                best_order = (p, d, q)
        except Exception:
            # Skip orders that fail to estimate
            continue
|
||||
|
||||
print(f"Best order: {best_order} with AIC: {best_aic:.2f}")
|
||||
```
|
||||
|
||||
### SARIMAX (Seasonal ARIMA with Exogenous Variables)
|
||||
|
||||
Extends ARIMA with seasonality and exogenous regressors.
|
||||
|
||||
**When to use:**
|
||||
- Seasonal patterns (monthly, quarterly data)
|
||||
- External variables influence series
|
||||
- Most flexible univariate model
|
||||
|
||||
**Model**: SARIMAX(p,d,q)(P,D,Q,s)
|
||||
- (p,d,q): Non-seasonal ARIMA
|
||||
- (P,D,Q,s): Seasonal ARIMA with period s
|
||||
|
||||
```python
|
||||
from statsmodels.tsa.statespace.sarimax import SARIMAX
|
||||
|
||||
# Seasonal ARIMA for monthly data (s=12)
|
||||
model = SARIMAX(y,
|
||||
order=(1, 1, 1), # (p,d,q)
|
||||
seasonal_order=(1, 1, 1, 12)) # (P,D,Q,s)
|
||||
results = model.fit()
|
||||
|
||||
print(results.summary())
|
||||
```
|
||||
|
||||
**With exogenous variables:**
|
||||
```python
|
||||
# SARIMAX with external predictors
|
||||
model = SARIMAX(y,
|
||||
exog=X_exog,
|
||||
order=(1, 1, 1),
|
||||
seasonal_order=(1, 1, 1, 12))
|
||||
results = model.fit()
|
||||
```
|
||||
|
||||
**Example: Monthly sales with trend and seasonality**
|
||||
```python
|
||||
# Typical for monthly data: (p,d,q)(P,D,Q,12)
|
||||
# Start with (1,1,1)(1,1,1,12) or (0,1,1)(0,1,1,12)
|
||||
|
||||
model = SARIMAX(monthly_sales,
|
||||
order=(0, 1, 1),
|
||||
seasonal_order=(0, 1, 1, 12),
|
||||
enforce_stationarity=False,
|
||||
enforce_invertibility=False)
|
||||
results = model.fit()
|
||||
```
|
||||
|
||||
### Exponential Smoothing
|
||||
|
||||
Weighted averages of past observations with exponentially decreasing weights.
|
||||
|
||||
**When to use:**
|
||||
- Simple, interpretable forecasts
|
||||
- Trend and/or seasonality present
|
||||
- No need for explicit model specification
|
||||
|
||||
**Types:**
|
||||
- Simple Exponential Smoothing: no trend, no seasonality
|
||||
- Holt's method: with trend
|
||||
- Holt-Winters: with trend and seasonality
|
||||
|
||||
```python
|
||||
from statsmodels.tsa.holtwinters import ExponentialSmoothing
|
||||
|
||||
# Simple exponential smoothing
|
||||
model = ExponentialSmoothing(y, trend=None, seasonal=None)
|
||||
results = model.fit()
|
||||
|
||||
# Holt's method (with trend)
|
||||
model = ExponentialSmoothing(y, trend='add', seasonal=None)
|
||||
results = model.fit()
|
||||
|
||||
# Holt-Winters (trend + seasonality)
|
||||
model = ExponentialSmoothing(y,
|
||||
trend='add', # 'add' or 'mul'
|
||||
seasonal='add', # 'add' or 'mul'
|
||||
seasonal_periods=12) # e.g., 12 for monthly
|
||||
results = model.fit()
|
||||
|
||||
print(results.summary())
|
||||
```
|
||||
|
||||
**Additive vs Multiplicative:**
|
||||
```python
|
||||
# Additive: constant seasonal variation
|
||||
# yₜ = Level + Trend + Seasonal + Error
|
||||
|
||||
# Multiplicative: proportional seasonal variation
|
||||
# yₜ = Level × Trend × Seasonal × Error
|
||||
|
||||
# Choose based on data:
|
||||
# - Additive: seasonal variation constant over time
|
||||
# - Multiplicative: seasonal variation increases with level
|
||||
```
|
||||
|
||||
**Innovations state space (ETS):**
|
||||
```python
|
||||
from statsmodels.tsa.exponential_smoothing.ets import ETSModel
|
||||
|
||||
# More robust, state space formulation
|
||||
model = ETSModel(y,
|
||||
error='add', # 'add' or 'mul'
|
||||
trend='add', # 'add', 'mul', or None
|
||||
seasonal='add', # 'add', 'mul', or None
|
||||
seasonal_periods=12)
|
||||
results = model.fit()
|
||||
```
|
||||
|
||||
## Multivariate Time Series
|
||||
|
||||
### VAR (Vector Autoregression)
|
||||
|
||||
System of equations where each variable depends on past values of all variables.
|
||||
|
||||
**When to use:**
|
||||
- Multiple interrelated time series
|
||||
- Bidirectional relationships
|
||||
- Granger causality testing
|
||||
|
||||
**Model**: Each variable is AR on all variables:
|
||||
- y₁ₜ = c₁ + φ₁₁y₁ₜ₋₁ + φ₁₂y₂ₜ₋₁ + ... + ε₁ₜ
|
||||
- y₂ₜ = c₂ + φ₂₁y₁ₜ₋₁ + φ₂₂y₂ₜ₋₁ + ... + ε₂ₜ
|
||||
|
||||
```python
|
||||
from statsmodels.tsa.api import VAR
|
||||
import pandas as pd
|
||||
|
||||
# Data should be DataFrame with multiple columns
|
||||
# Each column is a time series
|
||||
df_multivariate = pd.DataFrame({'series1': y1, 'series2': y2, 'series3': y3})
|
||||
|
||||
# Fit VAR
|
||||
model = VAR(df_multivariate)
|
||||
|
||||
# Select lag order using AIC/BIC
|
||||
lag_order_results = model.select_order(maxlags=15)
|
||||
print(lag_order_results.summary())
|
||||
|
||||
# Fit with optimal lags
|
||||
results = model.fit(maxlags=5, ic='aic')
|
||||
print(results.summary())
|
||||
```
|
||||
|
||||
**Granger causality testing:**
|
||||
```python
|
||||
# Test if series1 Granger-causes series2
|
||||
from statsmodels.tsa.stattools import grangercausalitytests
|
||||
|
||||
# Requires a 2-column array: tests whether the SECOND column (series1)
# Granger-causes the FIRST column (series2)
|
||||
test_data = df_multivariate[['series2', 'series1']]
|
||||
|
||||
# Test up to max_lag
|
||||
max_lag = 5
|
||||
results = grangercausalitytests(test_data, max_lag, verbose=True)
|
||||
|
||||
# P-values for each lag
|
||||
for lag in range(1, max_lag + 1):
    p_value = results[lag][0]['ssr_ftest'][1]
    print(f"Lag {lag}: p-value = {p_value:.4f}")
|
||||
```
|
||||
|
||||
**Impulse Response Functions (IRF):**
|
||||
```python
|
||||
# Trace effect of shock through system
|
||||
irf = results.irf(10) # 10 periods ahead
|
||||
|
||||
# Plot IRFs
|
||||
irf.plot(orth=True) # Orthogonalized (Cholesky decomposition)
|
||||
plt.show()
|
||||
|
||||
# Cumulative effects
|
||||
irf.plot_cum_effects(orth=True)
|
||||
plt.show()
|
||||
```
|
||||
|
||||
**Forecast Error Variance Decomposition:**
|
||||
```python
|
||||
# Contribution of each variable to forecast error variance
|
||||
fevd = results.fevd(10) # 10 periods ahead
|
||||
fevd.plot()
|
||||
plt.show()
|
||||
```
|
||||
|
||||
### VARMAX (VAR with Moving Average and Exogenous Variables)
|
||||
|
||||
Extends VAR with MA component and external regressors.
|
||||
|
||||
**When to use:**
|
||||
- VAR inadequate (MA component needed)
|
||||
- External variables affect system
|
||||
- More flexible multivariate model
|
||||
|
||||
```python
|
||||
from statsmodels.tsa.statespace.varmax import VARMAX
|
||||
|
||||
# VARMAX(p, q) with exogenous variables
|
||||
model = VARMAX(df_multivariate,
|
||||
order=(1, 1), # (p, q)
|
||||
exog=X_exog)
|
||||
results = model.fit()
|
||||
|
||||
print(results.summary())
|
||||
```
|
||||
|
||||
## State Space Models
|
||||
|
||||
Flexible framework for custom time series models.
|
||||
|
||||
**When to use:**
|
||||
- Custom model specification
|
||||
- Unobserved components
|
||||
- Kalman filtering/smoothing
|
||||
- Missing data
|
||||
|
||||
```python
|
||||
from statsmodels.tsa.statespace.mlemodel import MLEModel
|
||||
|
||||
# Extend MLEModel for custom state space models
|
||||
# Example: Local level model (random walk + noise)
|
||||
```
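The local level model mentioned above is yₜ = μₜ + εₜ with μₜ₊₁ = μₜ + ηₜ. As a hedged illustration of what the Kalman filter does for this model, here is a NumPy sketch of the filtering recursions. It does not use the `MLEModel` API: `local_level_filter` is a hypothetical helper, the two variances are assumed known rather than estimated, and the large `p0` stands in for a diffuse prior.

```python
import numpy as np

def local_level_filter(y, sigma2_eps, sigma2_eta, a0=0.0, p0=1e7):
    """Kalman filter for y_t = mu_t + eps_t, mu_{t+1} = mu_t + eta_t."""
    a, p = a0, p0                      # predicted state mean and variance
    filtered = np.empty(len(y))
    for t, obs in enumerate(y):
        if not np.isnan(obs):
            k = p / (p + sigma2_eps)   # Kalman gain
            a = a + k * (obs - a)      # update with the new observation
            p = p * (1 - k)
        # Missing observations skip the update and carry the prediction forward
        filtered[t] = a
        p = p + sigma2_eta             # prediction step for the next period
    return filtered

# Noisy observations of a slowly drifting level, with one missing value
rng = np.random.default_rng(1)
level = 5.0 + np.cumsum(rng.normal(scale=0.1, size=100))
y_obs = level + rng.normal(scale=1.0, size=100)
y_obs[50] = np.nan

mu_filtered = local_level_filter(y_obs, sigma2_eps=1.0, sigma2_eta=0.01)
print(mu_filtered[-5:])
```

The missing value at index 50 needs no special handling beyond skipping the update, which is the sense in which state space models accommodate missing data.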
|
||||
|
||||
**Dynamic Factor Models:**
|
||||
```python
|
||||
from statsmodels.tsa.statespace.dynamic_factor import DynamicFactor
|
||||
|
||||
# Extract common factors from multiple time series
|
||||
model = DynamicFactor(df_multivariate,
|
||||
k_factors=2, # Number of factors
|
||||
factor_order=2) # AR order of factors
|
||||
results = model.fit()
|
||||
|
||||
# Estimated factors
|
||||
factors = results.factors.filtered
|
||||
```
|
||||
|
||||
## Forecasting
|
||||
|
||||
### Point Forecasts
|
||||
|
||||
```python
|
||||
# ARIMA forecasting
|
||||
model = ARIMA(y, order=(1, 1, 1))
|
||||
results = model.fit()
|
||||
|
||||
# Forecast h steps ahead
|
||||
h = 10
|
||||
forecast = results.forecast(steps=h)
|
||||
|
||||
# With exogenous variables (SARIMAX)
|
||||
model = SARIMAX(y, exog=X, order=(1, 1, 1))
|
||||
results = model.fit()
|
||||
|
||||
# Need future exogenous values
|
||||
forecast = results.forecast(steps=h, exog=X_future)
|
||||
```
|
||||
|
||||
### Prediction Intervals
|
||||
|
||||
```python
|
||||
# Get forecast with confidence intervals
|
||||
forecast_obj = results.get_forecast(steps=h)
|
||||
forecast_df = forecast_obj.summary_frame()
|
||||
|
||||
print(forecast_df)
|
||||
# Contains: mean, mean_se, mean_ci_lower, mean_ci_upper
|
||||
|
||||
# Extract components
|
||||
forecast_mean = forecast_df['mean']
|
||||
forecast_ci_lower = forecast_df['mean_ci_lower']
|
||||
forecast_ci_upper = forecast_df['mean_ci_upper']
|
||||
|
||||
# Plot
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
plt.figure(figsize=(12, 6))
|
||||
plt.plot(y.index, y, label='Historical')
|
||||
plt.plot(forecast_df.index, forecast_mean, label='Forecast', color='red')
|
||||
plt.fill_between(forecast_df.index,
|
||||
forecast_ci_lower,
|
||||
forecast_ci_upper,
|
||||
alpha=0.3, color='red', label='95% CI')
|
||||
plt.legend()
|
||||
plt.title('Forecast with Prediction Intervals')
|
||||
plt.show()
|
||||
```
|
||||
|
||||
### Dynamic vs Static Forecasts
|
||||
|
||||
```python
|
||||
# Static (one-step-ahead, using actual values)
|
||||
static_forecast = results.get_prediction(start=split_point, end=len(y)-1)
|
||||
|
||||
# Dynamic (multi-step, using predicted values)
|
||||
dynamic_forecast = results.get_prediction(start=split_point,
|
||||
end=len(y)-1,
|
||||
dynamic=True)
|
||||
|
||||
# Plot comparison
|
||||
fig, ax = plt.subplots(figsize=(12, 6))
|
||||
y.plot(ax=ax, label='Actual')
|
||||
static_forecast.predicted_mean.plot(ax=ax, label='Static forecast')
|
||||
dynamic_forecast.predicted_mean.plot(ax=ax, label='Dynamic forecast')
|
||||
ax.legend()
|
||||
plt.show()
|
||||
```
|
||||
|
||||
## Diagnostic Tests
|
||||
|
||||
### Stationarity Tests
|
||||
|
||||
```python
|
||||
from statsmodels.tsa.stattools import adfuller, kpss
|
||||
|
||||
# Augmented Dickey-Fuller (ADF) test
|
||||
# H0: unit root (non-stationary)
|
||||
adf_result = adfuller(y, autolag='AIC')
|
||||
print(f"ADF Statistic: {adf_result[0]:.4f}")
|
||||
print(f"p-value: {adf_result[1]:.4f}")
|
||||
if adf_result[1] <= 0.05:
    print("Reject H0: Series is stationary")
else:
    print("Fail to reject H0: Series is non-stationary")
|
||||
|
||||
# KPSS test
|
||||
# H0: stationary (opposite of ADF)
|
||||
kpss_result = kpss(y, regression='c', nlags='auto')
|
||||
print(f"KPSS Statistic: {kpss_result[0]:.4f}")
|
||||
print(f"p-value: {kpss_result[1]:.4f}")
|
||||
if kpss_result[1] <= 0.05:
    print("Reject H0: Series is non-stationary")
else:
    print("Fail to reject H0: Series is stationary")
|
||||
```
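Because the two tests have opposite null hypotheses, their results are usually read together. The helper below encodes one common reading of the four possible outcomes. It is a rule-of-thumb sketch, not a statsmodels function: `classify_stationarity` and its return strings are hypothetical, and borderline p-values still deserve a look at the plots.

```python
def classify_stationarity(adf_pvalue, kpss_pvalue, alpha=0.05):
    """Combine ADF (H0: unit root) and KPSS (H0: stationary) p-values."""
    adf_rejects = adf_pvalue <= alpha     # evidence against a unit root
    kpss_rejects = kpss_pvalue <= alpha   # evidence against stationarity

    if adf_rejects and not kpss_rejects:
        return "stationary"
    if not adf_rejects and kpss_rejects:
        return "non-stationary"
    if adf_rejects and kpss_rejects:
        return "difference-stationary: consider differencing"
    return "trend-stationary: consider detrending"

print(classify_stationarity(adf_pvalue=0.01, kpss_pvalue=0.10))
print(classify_stationarity(adf_pvalue=0.60, kpss_pvalue=0.01))
```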
|
||||
|
||||
### Residual Diagnostics
|
||||
|
||||
```python
|
||||
# Ljung-Box test for autocorrelation in residuals
|
||||
from statsmodels.stats.diagnostic import acorr_ljungbox
|
||||
|
||||
lb_test = acorr_ljungbox(results.resid, lags=10, return_df=True)
|
||||
print(lb_test)
|
||||
# P-values > 0.05 indicate no significant autocorrelation (good)
|
||||
|
||||
# Plot residual diagnostics
|
||||
results.plot_diagnostics(figsize=(12, 8))
|
||||
plt.show()
|
||||
|
||||
# Components:
|
||||
# 1. Standardized residuals over time
|
||||
# 2. Histogram + KDE of residuals
|
||||
# 3. Q-Q plot for normality
|
||||
# 4. Correlogram (ACF of residuals)
|
||||
```
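The Ljung-Box statistic is a weighted sum of squared residual autocorrelations, Q = n(n + 2) Σₖ ρ̂ₖ² / (n − k). The NumPy sketch below computes it by hand to show why autocorrelated residuals produce a large Q. The helpers `sample_acf` and `ljung_box_q` are hypothetical, and the sketch omits the p-value, which `acorr_ljungbox` obtains from a chi-square distribution.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelations rho_0..rho_nlags."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = x @ x
    return np.array([1.0] + [x[k:] @ x[:-k] / denom for k in range(1, nlags + 1)])

def ljung_box_q(x, nlags):
    """Ljung-Box Q = n(n+2) * sum_k rho_k^2 / (n - k)."""
    n = len(x)
    rho = sample_acf(x, nlags)[1:]
    lags = np.arange(1, nlags + 1)
    return n * (n + 2) * np.sum(rho ** 2 / (n - lags))

rng = np.random.default_rng(0)
white_noise = rng.normal(size=500)
random_walk = np.cumsum(white_noise)   # strongly autocorrelated

q_noise = ljung_box_q(white_noise, nlags=10)
q_walk = ljung_box_q(random_walk, nlags=10)
print(f"Q (white noise): {q_noise:.2f}")
print(f"Q (random walk): {q_walk:.2f}")
```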
|
||||
|
||||
### Heteroskedasticity Tests
|
||||
|
||||
```python
|
||||
from statsmodels.stats.diagnostic import het_arch
|
||||
|
||||
# ARCH test for heteroskedasticity
|
||||
arch_test = het_arch(results.resid, nlags=10)
|
||||
print(f"ARCH test statistic: {arch_test[0]:.4f}")
|
||||
print(f"p-value: {arch_test[1]:.4f}")
|
||||
|
||||
# If significant, consider GARCH model
|
||||
```
|
||||
|
||||
## Seasonal Decomposition
|
||||
|
||||
```python
|
||||
from statsmodels.tsa.seasonal import seasonal_decompose
|
||||
|
||||
# Decompose into trend, seasonal, residual
|
||||
decomposition = seasonal_decompose(y,
|
||||
model='additive', # or 'multiplicative'
|
||||
period=12) # seasonal period
|
||||
|
||||
# Plot components
|
||||
fig = decomposition.plot()
|
||||
fig.set_size_inches(12, 8)
|
||||
plt.show()
|
||||
|
||||
# Access components
|
||||
trend = decomposition.trend
|
||||
seasonal = decomposition.seasonal
|
||||
residual = decomposition.resid
|
||||
|
||||
# STL decomposition (more robust)
|
||||
from statsmodels.tsa.seasonal import STL
|
||||
|
||||
stl = STL(y, seasonal=13) # seasonal must be odd
|
||||
stl_result = stl.fit()
|
||||
|
||||
fig = stl_result.plot()
|
||||
plt.show()
|
||||
```
|
||||
|
||||
## Model Evaluation
|
||||
|
||||
### In-Sample Metrics
|
||||
|
||||
```python
|
||||
# From results object
|
||||
print(f"AIC: {results.aic:.2f}")
|
||||
print(f"BIC: {results.bic:.2f}")
|
||||
print(f"Log-likelihood: {results.llf:.2f}")
|
||||
|
||||
# MSE on training data
|
||||
from sklearn.metrics import mean_squared_error
|
||||
|
||||
mse = mean_squared_error(y, results.fittedvalues)
|
||||
rmse = np.sqrt(mse)
|
||||
print(f"RMSE: {rmse:.4f}")
|
||||
|
||||
# MAE
|
||||
from sklearn.metrics import mean_absolute_error
|
||||
mae = mean_absolute_error(y, results.fittedvalues)
|
||||
print(f"MAE: {mae:.4f}")
|
||||
```
|
||||
|
||||
### Out-of-Sample Evaluation
|
||||
|
||||
```python
|
||||
# Train-test split for time series (no shuffle!)
|
||||
train_size = int(0.8 * len(y))
|
||||
y_train = y[:train_size]
|
||||
y_test = y[train_size:]
|
||||
|
||||
# Fit on training data
|
||||
model = ARIMA(y_train, order=(1, 1, 1))
|
||||
results = model.fit()
|
||||
|
||||
# Forecast test period
|
||||
forecast = results.forecast(steps=len(y_test))
|
||||
|
||||
# Metrics
|
||||
from sklearn.metrics import mean_squared_error, mean_absolute_error
|
||||
|
||||
rmse = np.sqrt(mean_squared_error(y_test, forecast))
|
||||
mae = mean_absolute_error(y_test, forecast)
|
||||
mape = np.mean(np.abs((y_test - forecast) / y_test)) * 100
|
||||
|
||||
print(f"Test RMSE: {rmse:.4f}")
|
||||
print(f"Test MAE: {mae:.4f}")
|
||||
print(f"Test MAPE: {mape:.2f}%")
|
||||
```
|
||||
|
||||
### Rolling Forecast
|
||||
|
||||
```python
|
||||
# More realistic evaluation: rolling one-step-ahead forecasts
|
||||
forecasts = []
|
||||
|
||||
for t in range(len(y_test)):
    # Refit on all observations available before the forecast date
    y_current = y[:train_size + t]
    model = ARIMA(y_current, order=(1, 1, 1))
    fit = model.fit()

    # One-step forecast (np.asarray handles both Series and ndarray output)
    fc = np.asarray(fit.forecast(steps=1))[0]
    forecasts.append(fc)
|
||||
|
||||
forecasts = np.array(forecasts)
|
||||
|
||||
rmse = np.sqrt(mean_squared_error(y_test, forecasts))
|
||||
print(f"Rolling forecast RMSE: {rmse:.4f}")
|
||||
```
|
||||
|
||||
### Cross-Validation
|
||||
|
||||
```python
|
||||
# Time series cross-validation (expanding window)
|
||||
from sklearn.model_selection import TimeSeriesSplit
|
||||
|
||||
tscv = TimeSeriesSplit(n_splits=5)
|
||||
rmse_scores = []
|
||||
|
||||
for train_idx, test_idx in tscv.split(y):
    y_train_cv = y.iloc[train_idx]
    y_test_cv = y.iloc[test_idx]

    model = ARIMA(y_train_cv, order=(1, 1, 1))
    results = model.fit()

    forecast = results.forecast(steps=len(test_idx))
    rmse = np.sqrt(mean_squared_error(y_test_cv, forecast))
    rmse_scores.append(rmse)
|
||||
|
||||
print(f"CV RMSE: {np.mean(rmse_scores):.4f} ± {np.std(rmse_scores):.4f}")
|
||||
```
|
||||
|
||||
## Advanced Topics
|
||||
|
||||
### ARDL (Autoregressive Distributed Lag)
|
||||
|
||||
Bridges univariate and multivariate time series.
|
||||
|
||||
```python
|
||||
from statsmodels.tsa.ardl import ARDL
|
||||
|
||||
# ARDL(p, q) model
# y depends on its own lags and lags of X
model = ARDL(y, lags=2, exog=X, order=2)  # order = number of lags of X
|
||||
results = model.fit()
|
||||
```
|
||||
|
||||
### Error Correction Models
|
||||
|
||||
For cointegrated series.
|
||||
|
||||
```python
from statsmodels.tsa.vector_ar.vecm import VECM, coint_johansen

# Test for cointegration (compare johansen_test.lr1 with johansen_test.cvt)
johansen_test = coint_johansen(df_multivariate, det_order=0, k_ar_diff=1)

# Fit VECM if cointegrated; set coint_rank to the rank the test supports
model = VECM(df_multivariate, k_ar_diff=1, coint_rank=1)
results = model.fit()
```
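
### Regime Switching Models

Markov switching models let parameters such as the mean or variance shift between a small number of unobserved regimes, which suits series with structural breaks or recurring regime changes.
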
```python
from statsmodels.tsa.regime_switching.markov_regression import MarkovRegression

# Markov switching model with a regime-dependent intercept
# (for switching AR terms, see MarkovAutoregression, which takes `order`)
model = MarkovRegression(y, k_regimes=2)
results = model.fit()

# Smoothed probabilities of regimes
regime_probs = results.smoothed_marginal_probabilities
```

## Best Practices

1. **Check stationarity**: Difference if needed, verify with ADF/KPSS tests
2. **Plot data**: Always visualize before modeling
3. **Identify seasonality**: Use appropriate seasonal models (SARIMAX, Holt-Winters)
4. **Model selection**: Use AIC/BIC and out-of-sample validation
5. **Residual diagnostics**: Check for autocorrelation, normality, heteroskedasticity
6. **Forecast evaluation**: Use rolling forecasts and proper time series CV
7. **Avoid overfitting**: Prefer simpler models, use information criteria
8. **Document assumptions**: Note any data transformations (log, differencing)
9. **Prediction intervals**: Always provide uncertainty estimates
10. **Refit regularly**: Update models as new data arrives

## Common Pitfalls

1. **Not checking stationarity**: Fitting ARMA terms to non-stationary data without differencing
2. **Data leakage**: Using future data in transformations (for example, scaling with full-sample statistics)
3. **Wrong seasonal period**: Use S=4 for quarterly data and S=12 for monthly data
4. **Overfitting**: Too many parameters relative to the amount of data
5. **Ignoring residual autocorrelation**: Autocorrelated residuals indicate an inadequate model
6. **Using inappropriate metrics**: MAPE is undefined when actual values are zero and hard to interpret when they are negative or near zero
7. **Not handling missing data**: Gaps affect model estimation
8. **Extrapolating exogenous variables**: SARIMAX forecasts need future X values, so errors in those values carry into the forecast
9. **Confusing static vs dynamic forecasts**: Static forecasts use observed lags; dynamic forecasts feed earlier predictions forward, which is more realistic for multi-step horizons
10. **Not validating forecasts**: Always check out-of-sample performance
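
The MAPE problem in item 6 is easy to reproduce. The sketch below uses toy numbers (invented for illustration) to show MAPE becoming infinite when an actual value is zero, alongside MASE, a scale-free alternative that divides the forecast MAE by the in-sample MAE of a naive previous-value forecast. The helper names `mae` and `mase` are hypothetical.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def mase(y_true, y_pred, y_train):
    """MAE scaled by the in-sample MAE of a naive previous-value forecast."""
    scale = np.mean(np.abs(np.diff(np.asarray(y_train))))
    return mae(y_true, y_pred) / scale

# Toy data; the test set contains a zero on purpose
y_train = np.array([1.0, 2.0, 3.0, 4.0])
y_test = np.array([0.0, 5.0])
forecast = np.array([1.0, 4.0])

# Division by the zero actual value makes MAPE infinite
with np.errstate(divide="ignore"):
    mape = np.mean(np.abs((y_test - forecast) / y_test)) * 100

mase_value = mase(y_test, forecast, y_train)
print(f"MAPE: {mape}")
print(f"MASE: {mase_value:.2f}")
```

A MASE below 1 means the forecast beats the naive benchmark on average; RMSE and MAE also remain well defined here.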