Initial commit

Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions

@@ -0,0 +1,669 @@
# Discrete Choice Models Reference
This document provides comprehensive guidance on discrete choice models in statsmodels, including binary, multinomial, count, and ordinal models.
## Overview
Discrete choice models handle outcomes that are:
- **Binary**: 0/1, success/failure
- **Multinomial**: Multiple unordered categories
- **Ordinal**: Ordered categories
- **Count**: Non-negative integers
All models are estimated by maximum likelihood and assume independent observations.
## Binary Models
### Logit (Logistic Regression)
Uses logistic distribution for binary outcomes.
**When to use:**
- Binary classification (yes/no, success/failure)
- Probability estimation for binary outcomes
- Interpretable odds ratios
**Model**: P(Y=1|X) = 1 / (1 + exp(-Xβ))
```python
import statsmodels.api as sm
from statsmodels.discrete.discrete_model import Logit
# Prepare data
X = sm.add_constant(X_data)
# Fit model
model = Logit(y, X)
results = model.fit()
print(results.summary())
```
**Interpretation:**
```python
import numpy as np
# Odds ratios
odds_ratios = np.exp(results.params)
print("Odds ratios:", odds_ratios)
# For 1-unit increase in X, odds multiply by exp(β)
# OR > 1: increases odds of success
# OR < 1: decreases odds of success
# OR = 1: no effect
# Confidence intervals for odds ratios
odds_ci = np.exp(results.conf_int())
print("Odds ratio 95% CI:")
print(odds_ci)
```
**Marginal effects:**
```python
# Average marginal effects (AME) - the default, averaged over observations
marginal_effects = results.get_margeff()  # at='overall'
print(marginal_effects.summary())
# Marginal effects at the means (MEM)
marginal_effects_mem = results.get_margeff(at='mean')
# Marginal effects at representative values
# (atexog maps zero-indexed exog columns to values)
marginal_effects_custom = results.get_margeff(at='mean',
                                              atexog={1: 1.0, 2: 5.0})
```
**Predictions:**
```python
# Predicted probabilities
probs = results.predict(X)
# Binary predictions (0.5 threshold)
predictions = (probs > 0.5).astype(int)
# Custom threshold
threshold = 0.3
predictions_custom = (probs > threshold).astype(int)
# For new data
X_new = sm.add_constant(X_new_data)
new_probs = results.predict(X_new)
```
**Model evaluation:**
```python
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score)
# Classification report
print(classification_report(y, predictions))
# Confusion matrix
print(confusion_matrix(y, predictions))
# AUC-ROC
auc = roc_auc_score(y, probs)
print(f"AUC: {auc:.4f}")
# Pseudo R-squared
print(f"McFadden's Pseudo R²: {results.prsquared:.4f}")
```
### Probit
Uses normal distribution for binary outcomes.
**When to use:**
- Binary outcomes
- Prefer normal distribution assumption
- Field convention (econometrics often uses probit)
**Model**: P(Y=1|X) = Φ(Xβ), where Φ is standard normal CDF
```python
from statsmodels.discrete.discrete_model import Probit
model = Probit(y, X)
results = model.fit()
print(results.summary())
```
**Comparison with Logit:**
- Probit and Logit usually give similar results
- Probit: symmetric, based on normal distribution
- Logit: slightly heavier tails, easier interpretation (odds ratios)
- Coefficients not directly comparable (scale difference)
```python
# Marginal effects are comparable
logit_me = logit_results.get_margeff().margeff
probit_me = probit_results.get_margeff().margeff
print("Logit marginal effects:", logit_me)
print("Probit marginal effects:", probit_me)
```
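Because logistic errors have standard deviation π/√3 ≈ 1.81 versus 1 for the standard normal, logit coefficients usually come out roughly 1.6-1.8 times their probit counterparts. A quick check on the fitted results above:
```python
import numpy as np
# Ratios typically land around 1.6-1.8 across coefficients
ratios = np.asarray(logit_results.params) / np.asarray(probit_results.params)
print("Logit/Probit coefficient ratios:", ratios)
```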
## Multinomial Models
### MNLogit (Multinomial Logit)
For unordered categorical outcomes with 3+ categories.
**When to use:**
- Multiple unordered categories (e.g., transportation mode, brand choice)
- No natural ordering among categories
- Need probabilities for each category
**Model**: P(Y=j|X) = exp(Xβⱼ) / Σₖ exp(Xβₖ)
```python
from statsmodels.discrete.discrete_model import MNLogit
# y should be integers 0, 1, 2, ... for categories
model = MNLogit(y, X)
results = model.fit()
print(results.summary())
```
**Interpretation:**
```python
# One category is reference (usually category 0)
# Coefficients represent log-odds relative to reference
# For category j vs reference:
# exp(β_j) = odds ratio of category j vs reference
# Predicted probabilities for each category
probs = results.predict(X) # Shape: (n_samples, n_categories)
# Most likely category
predicted_categories = probs.argmax(axis=1)
```
**Relative risk ratios:**
```python
# Exponentiate coefficients for relative risk ratios
import numpy as np
# For MNLogit, results.params has one column per non-reference
# outcome category
rrr = np.exp(results.params)
print("Relative risk ratios (vs reference category):")
print(rrr)
```
### Conditional Logit
For choice models where alternatives have characteristics.
**When to use:**
- Alternative-specific regressors (vary across choices)
- Panel data with choices
- Discrete choice experiments
```python
from statsmodels.discrete.conditional_models import ConditionalLogit
# Data structure: long format with choice indicator
model = ConditionalLogit(y_choice, X_alternatives, groups=individual_id)
results = model.fit()
```
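A hedged sketch of the expected long format (the column names here are hypothetical): one row per individual-alternative pair, a 0/1 indicator marking the chosen alternative, and regressors that vary across alternatives:
```python
import pandas as pd
# One row per (individual, alternative); 'chosen' plays the role of y_choice
df_long = pd.DataFrame({
    'individual':  [1, 1, 1, 2, 2, 2],
    'alternative': ['car', 'bus', 'train'] * 2,
    'chosen':      [1, 0, 0, 0, 1, 0],
    'cost':        [4.0, 1.5, 2.0, 3.5, 1.2, 2.2],
    'time':        [20, 45, 30, 25, 50, 35],
})
# y_choice = df_long['chosen']
# X_alternatives = df_long[['cost', 'time']]
# individual_id = df_long['individual']
```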
## Count Models
### Poisson
Standard model for count data.
**When to use:**
- Count outcomes (events, occurrences)
- Rare events
- Mean ≈ variance
**Model**: P(Y=k|X) = exp(-λ) λᵏ / k!, where log(λ) = Xβ
```python
from statsmodels.discrete.discrete_model import Poisson
model = Poisson(y_counts, X)
results = model.fit()
print(results.summary())
```
**Interpretation:**
```python
# Rate ratios (incident rate ratios)
rate_ratios = np.exp(results.params)
print("Rate ratios:", rate_ratios)
# For 1-unit increase in X, expected count multiplies by exp(β)
```
**Check overdispersion:**
```python
# Mean and variance should be similar for Poisson
print(f"Mean: {y_counts.mean():.2f}")
print(f"Variance: {y_counts.var():.2f}")
# Overdispersion if variance >> mean
# Rule of thumb: variance/mean > 1.5 suggests overdispersion
overdispersion_ratio = y_counts.var() / y_counts.mean()
print(f"Variance/Mean: {overdispersion_ratio:.2f}")
if overdispersion_ratio > 1.5:
print("Consider Negative Binomial model")
```
**With offset (for rates):**
```python
# When modeling rates with varying exposure
# log(λ) = log(exposure) + Xβ
model = Poisson(y_counts, X, offset=np.log(exposure))
results = model.fit()
```
### Negative Binomial
For overdispersed count data (variance > mean).
**When to use:**
- Count data with overdispersion
- Excess variance not explained by Poisson
- Heterogeneity in counts
**Model**: Adds dispersion parameter α to account for overdispersion
```python
from statsmodels.discrete.discrete_model import NegativeBinomial
model = NegativeBinomial(y_counts, X)
results = model.fit()
print(results.summary())
print(f"Dispersion parameter alpha: {results.params['alpha']:.4f}")
```
**Compare with Poisson:**
```python
# Fit both models
poisson_results = Poisson(y_counts, X).fit()
nb_results = NegativeBinomial(y_counts, X).fit()
# AIC comparison (lower is better)
print(f"Poisson AIC: {poisson_results.aic:.2f}")
print(f"Negative Binomial AIC: {nb_results.aic:.2f}")
# Likelihood ratio test (if NB is better)
from scipy import stats
lr_stat = 2 * (nb_results.llf - poisson_results.llf)
lr_pval = 1 - stats.chi2.cdf(lr_stat, df=1) # 1 extra parameter (alpha)
print(f"LR test p-value: {lr_pval:.4f}")
if lr_pval < 0.05:
print("Negative Binomial significantly better")
```
### Zero-Inflated Models
For count data with excess zeros.
**When to use:**
- More zeros than expected from Poisson/NB
- Two processes: one for zeros, one for counts
- Examples: number of doctor visits, insurance claims
**Models:**
- ZeroInflatedPoisson (ZIP)
- ZeroInflatedNegativeBinomialP (ZINB)
```python
from statsmodels.discrete.count_model import (ZeroInflatedPoisson,
ZeroInflatedNegativeBinomialP)
# ZIP model
zip_model = ZeroInflatedPoisson(y_counts, X, exog_infl=X_inflation)
zip_results = zip_model.fit()
# ZINB model (for overdispersion + excess zeros)
zinb_model = ZeroInflatedNegativeBinomialP(y_counts, X, exog_infl=X_inflation)
zinb_results = zinb_model.fit()
print(zip_results.summary())
```
**Two parts of the model:**
```python
# 1. Inflation model: P(Y=0 due to inflation)
# 2. Count model: distribution of counts
# Probability of the always-zero (inflation) regime, in-sample
inflation_probs = 1 - zip_results.predict(which='prob-main')
# Predicted overall mean counts, in-sample
predicted_counts = zip_results.predict(which='mean')
```
### Hurdle Models
Two-stage model: whether any counts, then how many.
**When to use:**
- Excess zeros
- Different processes for zero vs positive counts
- Zeros structurally different from positive values
```python
from statsmodels.discrete.truncated_model import HurdleCountModel
# Specify distributions for the zero stage and the count stage
model = HurdleCountModel(y_counts, X,
                         dist='poisson',      # count stage: 'poisson' or 'negbin'
                         zerodist='poisson')  # zero stage
results = model.fit()
print(results.summary())
```
## Ordinal Models
### Ordered Logit/Probit
For ordered categorical outcomes.
**When to use:**
- Ordered categories (e.g., low/medium/high, ratings 1-5)
- Natural ordering matters
- Want to respect ordinal structure
**Model**: Cumulative probability model with cutpoints
```python
from statsmodels.miscmodels.ordinal_model import OrderedModel
# y should be ordered integers: 0, 1, 2, ...
model = OrderedModel(y_ordered, X, distr='logit') # or 'probit'
results = model.fit(method='bfgs')
print(results.summary())
```
**Interpretation:**
```python
# Params are [slopes..., threshold params...]; with K categories there
# are K-1 threshold params stored after the slopes
coefficients = results.params[:-(n_categories - 1)]
print("Coefficients:", coefficients)
# Threshold params after the first are log-increments; recover the
# actual cutpoints (padded with -inf/inf) via the model's transform
cutpoints = results.model.transform_threshold_params(results.params)
print("Cutpoints:", cutpoints)
# Predicted probabilities for each category
probs = results.predict(X) # Shape: (n_samples, n_categories)
# Most likely category
predicted_categories = probs.argmax(axis=1)
```
**Proportional odds assumption:**
```python
# Test if coefficients are same across cutpoints
# (Brant test is not built in - see the sketch below)
# Check: model each cutpoint separately and compare coefficients
```
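One informal version of this check, sketched under the assumption that `y_ordered` holds integer categories and `X` contains no constant column: fit a separate binary logit for each cumulative split P(Y ≥ c) and compare the slopes, which should be roughly stable if proportional odds holds.
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.discrete.discrete_model import Logit
# One binary logit per cumulative split P(Y >= c)
categories = np.sort(np.unique(y_ordered))
coef_by_split = {}
for c in categories[1:]:
    y_bin = (y_ordered >= c).astype(int)
    fit = Logit(y_bin, sm.add_constant(X)).fit(disp=0)
    coef_by_split[f">= {c}"] = fit.params
# Columns should show roughly similar slopes under proportional odds
print(pd.DataFrame(coef_by_split))
```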
## Model Diagnostics
### Goodness of Fit
```python
# Pseudo R-squared (McFadden)
print(f"Pseudo R²: {results.prsquared:.4f}")
# AIC/BIC for model comparison
print(f"AIC: {results.aic:.2f}")
print(f"BIC: {results.bic:.2f}")
# Log-likelihood
print(f"Log-likelihood: {results.llf:.2f}")
# Likelihood ratio test vs null model
lr_stat = 2 * (results.llf - results.llnull)
from scipy import stats
lr_pval = 1 - stats.chi2.cdf(lr_stat, results.df_model)
print(f"LR test p-value: {lr_pval}")
```
### Classification Metrics (Binary)
```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score)
# Predictions
probs = results.predict(X)
predictions = (probs > 0.5).astype(int)
# Metrics
print(f"Accuracy: {accuracy_score(y, predictions):.4f}")
print(f"Precision: {precision_score(y, predictions):.4f}")
print(f"Recall: {recall_score(y, predictions):.4f}")
print(f"F1: {f1_score(y, predictions):.4f}")
print(f"AUC: {roc_auc_score(y, probs):.4f}")
```
### Classification Metrics (Multinomial)
```python
from sklearn.metrics import accuracy_score, classification_report, log_loss
# Predicted categories
probs = results.predict(X)
predictions = probs.argmax(axis=1)
# Accuracy
accuracy = accuracy_score(y, predictions)
print(f"Accuracy: {accuracy:.4f}")
# Classification report
print(classification_report(y, predictions))
# Log loss
logloss = log_loss(y, probs)
print(f"Log Loss: {logloss:.4f}")
```
### Count Model Diagnostics
```python
import numpy as np
import pandas as pd
# Observed vs predicted frequencies
observed = pd.Series(y_counts).value_counts().sort_index()
predicted = results.predict(X)
predicted_counts = pd.Series(np.round(predicted)).value_counts().sort_index()
# Compare distributions
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
observed.plot(kind='bar', alpha=0.5, label='Observed', ax=ax)
predicted_counts.plot(kind='bar', alpha=0.5, label='Predicted', ax=ax)
ax.legend()
ax.set_xlabel('Count')
ax.set_ylabel('Frequency')
plt.show()
# Rootogram (better visualization): not built into statsmodels;
# see the sketch below
```
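Statsmodels has no rootogram function, so here is a minimal hanging-rootogram sketch, assuming a fitted Poisson model `results` with design matrix `X` and integer counts `y_counts`. Bars of height √observed hang from the √expected curve, so a bar that fails to reach zero flags a count the model under-predicts:
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
mu = results.predict(X)
max_k = int(np.max(y_counts))
ks = np.arange(max_k + 1)
# Expected frequency of count k: sum of Poisson pmfs over all observations
expected = np.array([stats.poisson.pmf(k, mu).sum() for k in ks])
observed = np.bincount(np.asarray(y_counts, dtype=int), minlength=max_k + 1)
fig, ax = plt.subplots()
ax.plot(ks, np.sqrt(expected), 'o-', color='red', label='sqrt(expected)')
# Bars hang from the expected curve down toward zero
ax.bar(ks, np.sqrt(observed), bottom=np.sqrt(expected) - np.sqrt(observed),
       alpha=0.5, label='sqrt(observed)')
ax.axhline(0, color='black', linewidth=0.8)
ax.set_xlabel('Count')
ax.set_ylabel('sqrt(frequency)')
ax.legend()
plt.show()
```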
### Influence and Outliers
```python
# Pearson-style standardized residuals (for a count model, Var(Y) ≈ mean)
std_resid = (y - results.predict(X)) / np.sqrt(results.predict(X))
# Check for outliers (|std_resid| > 2)
outliers = np.where(np.abs(std_resid) > 2)[0]
print(f"Number of outliers: {len(outliers)}")
# Leverage and influence measures for logit/probit are available via
# results.get_influence() (MLEInfluence) in recent statsmodels
```
## Hypothesis Testing
```python
# Single parameter test (automatic in summary)
# Multiple parameters: Wald test
# Test H0: β₁ = β₂ = 0
R = [[0, 1, 0, 0], [0, 0, 1, 0]]
wald_test = results.wald_test(R)
print(wald_test)
# Likelihood ratio test for nested models
model_reduced = Logit(y, X_reduced).fit()
model_full = Logit(y, X_full).fit()
lr_stat = 2 * (model_full.llf - model_reduced.llf)
df = model_full.df_model - model_reduced.df_model
from scipy import stats
lr_pval = 1 - stats.chi2.cdf(lr_stat, df)
print(f"LR test p-value: {lr_pval:.4f}")
```
## Model Selection and Comparison
```python
# Fit multiple models
models = {
'Logit': Logit(y, X).fit(),
'Probit': Probit(y, X).fit(),
# Add more models
}
# Compare AIC/BIC
comparison = pd.DataFrame({
'AIC': {name: model.aic for name, model in models.items()},
'BIC': {name: model.bic for name, model in models.items()},
'Pseudo R²': {name: model.prsquared for name, model in models.items()}
})
print(comparison.sort_values('AIC'))
# Cross-validation for predictive performance
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# Use the sklearn wrapper, or do manual CV with statsmodels (see sketch below)
```
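A minimal manual cross-validation sketch with statsmodels itself, assuming `y` and `X` are NumPy arrays:
```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from statsmodels.discrete.discrete_model import Logit
kf = KFold(n_splits=5, shuffle=True, random_state=42)
aucs = []
for train_idx, test_idx in kf.split(X):
    # Refit on each training fold, score on the held-out fold
    fold_fit = Logit(y[train_idx], X[train_idx]).fit(disp=0)
    fold_probs = fold_fit.predict(X[test_idx])
    aucs.append(roc_auc_score(y[test_idx], fold_probs))
print(f"CV AUC: {np.mean(aucs):.4f} ± {np.std(aucs):.4f}")
```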
## Formula API
Use R-style formulas for easier specification.
```python
import statsmodels.formula.api as smf
# Logit with formula
formula = 'y ~ x1 + x2 + C(category) + x1:x2'
results = smf.logit(formula, data=df).fit()
# MNLogit with formula
results = smf.mnlogit(formula, data=df).fit()
# Poisson with formula
results = smf.poisson(formula, data=df).fit()
# Negative Binomial with formula
results = smf.negativebinomial(formula, data=df).fit()
```
## Common Applications
### Binary Classification (Marketing Response)
```python
# Predict customer purchase probability
X = sm.add_constant(customer_features)
model = Logit(purchased, X)
results = model.fit()
# Targeting: select top 20% likely to purchase
probs = results.predict(X)
top_20_pct_idx = np.argsort(probs)[-int(0.2*len(probs)):]
```
### Multinomial Choice (Transportation Mode)
```python
# Predict transportation mode choice
model = MNLogit(mode_choice, X)
results = model.fit()
# Predicted mode for new commuter
new_commuter = sm.add_constant(new_features)
mode_probs = results.predict(new_commuter)
predicted_mode = mode_probs.argmax(axis=1)
```
### Count Data (Number of Doctor Visits)
```python
# Model healthcare utilization
model = NegativeBinomial(num_visits, X)
results = model.fit()
# Expected visits for new patient
expected_visits = results.predict(new_patient_X)
```
### Zero-Inflated (Insurance Claims)
```python
# Many people have zero claims
# Zero-inflation: some never claim
# Count process: those who might claim
zip_model = ZeroInflatedPoisson(claims, X_count, exog_infl=X_inflation)
results = zip_model.fit()
# P(zero claims) in-sample (combines the always-zero regime and count zeros)
zero_claim_prob = results.predict(which='prob-zero')
# Expected claims
expected_claims = results.predict(which='mean')
```
## Best Practices
1. **Check data type**: Ensure response matches model (binary, counts, categories)
2. **Add constant**: Always use `sm.add_constant()` unless no intercept desired
3. **Scale continuous predictors**: For better convergence and interpretation
4. **Check convergence**: Look for convergence warnings
5. **Use formula API**: For categorical variables and interactions
6. **Marginal effects**: Report marginal effects, not just coefficients
7. **Model comparison**: Use AIC/BIC and cross-validation
8. **Validate**: Holdout set or cross-validation for predictive models
9. **Check overdispersion**: For count models, test Poisson assumption
10. **Consider alternatives**: Zero-inflation, hurdle models for excess zeros
## Common Pitfalls
1. **Forgetting constant**: No intercept term
2. **Perfect separation**: Logit/probit may not converge (see the sketch after this list)
3. **Using Poisson with overdispersion**: Check and use Negative Binomial
4. **Misinterpreting coefficients**: Remember they're on log-odds/log scale
5. **Not checking convergence**: Optimization may fail silently
6. **Wrong distribution**: Match model to data type (binary/count/categorical)
7. **Ignoring excess zeros**: Use ZIP/ZINB when appropriate
8. **Not validating predictions**: Always check out-of-sample performance
9. **Comparing non-nested models**: Use AIC/BIC, not likelihood ratio test
10. **Ordinal as nominal**: Use OrderedModel for ordered categories
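For pitfall 2, a hedged sketch: depending on the statsmodels version, perfect separation either raises `PerfectSeparationError` or only issues a warning; either way, an L1-penalized fit keeps the estimates finite (the `alpha` here is an assumed tuning value, not a recommendation).
```python
from statsmodels.discrete.discrete_model import Logit
from statsmodels.tools.sm_exceptions import PerfectSeparationError
try:
    results = Logit(y, X).fit(disp=0)
except PerfectSeparationError:
    # Penalized likelihood keeps coefficients finite under separation
    results = Logit(y, X).fit_regularized(method='l1', alpha=1.0, disp=0)
print(results.params)
```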

@@ -0,0 +1,619 @@
# Generalized Linear Models (GLM) Reference
This document provides comprehensive guidance on generalized linear models in statsmodels, including families, link functions, and applications.
## Overview
GLMs extend linear regression to non-normal response distributions through:
1. **Distribution family**: Specifies the conditional distribution of the response
2. **Link function**: Transforms the linear predictor to the scale of the mean
3. **Variance function**: Relates variance to the mean
**General form**: g(μ) = Xβ, where g is the link function and μ = E(Y|X)
## When to Use GLM
- **Binary outcomes**: Logistic regression (Binomial family with logit link)
- **Count data**: Poisson or Negative Binomial regression
- **Positive continuous data**: Gamma or Inverse Gaussian
- **Non-normal distributions**: When OLS assumptions violated
- **Link functions**: Need non-linear relationship between predictors and response scale
## Distribution Families
### Binomial Family
For binary outcomes (0/1) or proportions (k/n).
**When to use:**
- Binary classification
- Success/failure outcomes
- Proportions or rates
**Common links:**
- Logit (default): log(μ/(1-μ))
- Probit: Φ⁻¹(μ)
- Log: log(μ)
```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Binary logistic regression
model = sm.GLM(y, X, family=sm.families.Binomial())
results = model.fit()
# Formula API
results = smf.glm('success ~ x1 + x2', data=df,
family=sm.families.Binomial()).fit()
# Access predictions (probabilities)
probs = results.predict(X_new)
# Classification (0.5 threshold)
predictions = (probs > 0.5).astype(int)
```
**Interpretation:**
```python
import numpy as np
# Odds ratios (for logit link)
odds_ratios = np.exp(results.params)
print("Odds ratios:", odds_ratios)
# For 1-unit increase in x, odds multiply by exp(beta)
```
### Poisson Family
For count data (non-negative integers).
**When to use:**
- Count outcomes (number of events)
- Rare events
- Rate modeling (with offset)
**Common links:**
- Log (default): log(μ)
- Identity: μ
- Sqrt: √μ
```python
# Poisson regression
model = sm.GLM(y, X, family=sm.families.Poisson())
results = model.fit()
# With exposure/offset for rates
# If modeling rate = counts/exposure
model = sm.GLM(y, X, family=sm.families.Poisson(),
offset=np.log(exposure))
results = model.fit()
# Interpretation: exp(beta) = multiplicative effect on expected count
import numpy as np
rate_ratios = np.exp(results.params)
print("Rate ratios:", rate_ratios)
```
**Overdispersion check:**
```python
# Deviance / df should be ~1 for Poisson
overdispersion = results.deviance / results.df_resid
print(f"Overdispersion: {overdispersion}")
# If >> 1, consider Negative Binomial
if overdispersion > 1.5:
print("Consider Negative Binomial model for overdispersion")
```
### Negative Binomial Family
For overdispersed count data.
**When to use:**
- Count data with variance > mean
- Excess zeros or large variance
- Poisson model shows overdispersion
```python
# Negative Binomial GLM; note GLM treats the dispersion alpha as FIXED
# (default alpha=1) rather than estimating it
model = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=1.0))
results = model.fit()
# Alternative: use discrete choice model with alpha estimation
from statsmodels.discrete.discrete_model import NegativeBinomial
nb_model = NegativeBinomial(y, X)
nb_results = nb_model.fit()
print(f"Dispersion parameter alpha: {nb_results.params[-1]}")
```
### Gaussian Family
Equivalent to OLS but fit via IRLS (Iteratively Reweighted Least Squares).
**When to use:**
- Want GLM framework for consistency
- Need robust standard errors
- Comparing with other GLMs
**Common links:**
- Identity (default): μ
- Log: log(μ)
- Inverse: 1/μ
```python
import numpy as np
# Gaussian GLM (equivalent to OLS)
model = sm.GLM(y, X, family=sm.families.Gaussian())
results = model.fit()
# Verify equivalence with OLS
ols_results = sm.OLS(y, X).fit()
print("Parameters close:", np.allclose(results.params, ols_results.params))
```
### Gamma Family
For positive continuous data, often right-skewed.
**When to use:**
- Positive outcomes (insurance claims, survival times)
- Right-skewed distributions
- Variance proportional to mean²
**Common links:**
- Inverse (default): 1/μ
- Log: log(μ)
- Identity: μ
```python
# Gamma regression (common for cost data)
model = sm.GLM(y, X, family=sm.families.Gamma())
results = model.fit()
# Log link often preferred for interpretation
model = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log()))
results = model.fit()
# With log link, exp(beta) = multiplicative effect
import numpy as np
effects = np.exp(results.params)
```
### Inverse Gaussian Family
For positive continuous data with specific variance structure.
**When to use:**
- Positive skewed outcomes
- Variance proportional to mean³
- Alternative to Gamma
**Common links:**
- Inverse squared (default): 1/μ²
- Log: log(μ)
```python
model = sm.GLM(y, X, family=sm.families.InverseGaussian())
results = model.fit()
```
### Tweedie Family
Flexible family covering multiple distributions.
**When to use:**
- Insurance claims (mixture of zeros and continuous)
- Semi-continuous data
- Need flexible variance function
**Special cases (power parameter p):**
- p=0: Normal
- p=1: Poisson
- p=2: Gamma
- p=3: Inverse Gaussian
- 1<p<2: Compound Poisson-Gamma (common for insurance)
```python
# Tweedie with power=1.5
model = sm.GLM(y, X, family=sm.families.Tweedie(link=sm.families.links.Log(),
var_power=1.5))
results = model.fit()
```
## Link Functions
Link functions connect the linear predictor to the mean of the response.
### Available Links
```python
from statsmodels.genmod import families
# Identity: g(μ) = μ
link = families.links.Identity()
# Log: g(μ) = log(μ)
link = families.links.Log()
# Logit: g(μ) = log(μ/(1-μ))
link = families.links.Logit()
# Probit: g(μ) = Φ⁻¹(μ)
link = families.links.Probit()
# Complementary log-log: g(μ) = log(-log(1-μ))
link = families.links.CLogLog()
# Inverse: g(μ) = 1/μ
link = families.links.InversePower()
# Inverse squared: g(μ) = 1/μ²
link = families.links.InverseSquared()
# Square root: g(μ) = √μ
link = families.links.Sqrt()
# Power: g(μ) = μ^p
link = families.links.Power(power=2)
```
### Choosing Link Functions
**Canonical links** (default for each family):
- Binomial → Logit
- Poisson → Log
- Gamma → Inverse
- Gaussian → Identity
- Inverse Gaussian → Inverse squared
**When to use non-canonical:**
- **Log link with Binomial**: Risk ratios instead of odds ratios
- **Identity link**: Direct additive effects (when sensible)
- **Probit vs Logit**: Similar results, preference based on field
- **CLogLog**: Asymmetric relationship, common in survival analysis
```python
import numpy as np
# Example: risk ratios with a log-binomial model
# (log-binomial fits can fail to converge, since the log link
# does not keep fitted probabilities below 1)
model = sm.GLM(y, X, family=sm.families.Binomial(link=sm.families.links.Log()))
results = model.fit()
# exp(beta) now gives risk ratios, not odds ratios
risk_ratios = np.exp(results.params)
```
## Model Fitting and Results
### Basic Workflow
```python
import statsmodels.api as sm
# Add constant
X = sm.add_constant(X_data)
# Specify family and link
family = sm.families.Poisson(link=sm.families.links.Log())
# Fit model using IRLS
model = sm.GLM(y, X, family=family)
results = model.fit()
# Summary
print(results.summary())
```
### Results Attributes
```python
# Parameters and inference
results.params # Coefficients
results.bse # Standard errors
results.tvalues # Z-statistics
results.pvalues # P-values
results.conf_int() # Confidence intervals
# Predictions
results.fittedvalues # Fitted values (μ)
results.predict(X_new) # Predictions for new data
# Model fit statistics
results.aic # Akaike Information Criterion
results.bic # Bayesian Information Criterion
results.deviance # Deviance
results.null_deviance # Null model deviance
results.pearson_chi2 # Pearson chi-squared statistic
results.df_resid # Residual degrees of freedom
results.llf # Log-likelihood
# Residuals
results.resid_response # Response residuals (y - μ)
results.resid_pearson # Pearson residuals
results.resid_deviance # Deviance residuals
results.resid_anscombe # Anscombe residuals
results.resid_working # Working residuals
```
### Pseudo R-squared
```python
# Deviance-based pseudo R² (share of null deviance explained)
pseudo_r2 = 1 - (results.deviance / results.null_deviance)
print(f"Pseudo R²: {pseudo_r2:.4f}")
# McFadden's pseudo R² uses log-likelihoods instead
mcfadden_r2 = 1 - results.llf / results.llnull
print(f"McFadden's Pseudo R²: {mcfadden_r2:.4f}")
# Adjusted (deviance-based) pseudo R²
n = len(y)
k = len(results.params)
adj_pseudo_r2 = 1 - ((n-1)/(n-k)) * (results.deviance / results.null_deviance)
print(f"Adjusted Pseudo R²: {adj_pseudo_r2:.4f}")
```
## Diagnostics
### Goodness of Fit
```python
# Deviance should be approximately χ² with df_resid degrees of freedom
from scipy import stats
deviance_pval = 1 - stats.chi2.cdf(results.deviance, results.df_resid)
print(f"Deviance test p-value: {deviance_pval}")
# Pearson chi-squared test
pearson_pval = 1 - stats.chi2.cdf(results.pearson_chi2, results.df_resid)
print(f"Pearson chi² test p-value: {pearson_pval}")
# Check for overdispersion/underdispersion
dispersion = results.pearson_chi2 / results.df_resid
print(f"Dispersion: {dispersion}")
# Should be ~1; >1 suggests overdispersion, <1 underdispersion
```
### Residual Analysis
```python
import matplotlib.pyplot as plt
# Deviance residuals vs fitted
plt.figure(figsize=(10, 6))
plt.scatter(results.fittedvalues, results.resid_deviance, alpha=0.5)
plt.xlabel('Fitted values')
plt.ylabel('Deviance residuals')
plt.axhline(y=0, color='r', linestyle='--')
plt.title('Deviance Residuals vs Fitted')
plt.show()
# Q-Q plot of deviance residuals
from statsmodels.graphics.gofplots import qqplot
qqplot(results.resid_deviance, line='s')
plt.title('Q-Q Plot of Deviance Residuals')
plt.show()
# For binary outcomes, a binned residual plot is more informative
# than raw residual plots (see the sketch below)
```
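A minimal binned residual plot sketch, assuming a fitted binomial GLM `results` with a 0/1 response: sort observations by fitted probability, split them into bins, and plot each bin's mean response residual against its mean fitted value. The points should scatter around zero with no visible trend:
```python
import numpy as np
import matplotlib.pyplot as plt
fitted = np.asarray(results.fittedvalues)
resid = np.asarray(results.resid_response)
# Sort by fitted probability and split into roughly equal bins
order = np.argsort(fitted)
bins = np.array_split(order, 20)
bin_fitted = [fitted[idx].mean() for idx in bins]
bin_resid = [resid[idx].mean() for idx in bins]
plt.scatter(bin_fitted, bin_resid)
plt.axhline(0, color='r', linestyle='--')
plt.xlabel('Mean fitted probability (bin)')
plt.ylabel('Mean response residual (bin)')
plt.title('Binned Residual Plot')
plt.show()
```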
### Influence and Outliers
```python
from statsmodels.stats.outliers_influence import GLMInfluence
influence = GLMInfluence(results)
# Leverage
leverage = influence.hat_matrix_diag
# Cook's distance
cooks_d = influence.cooks_distance[0]
# DFFITS
dffits = influence.dffits[0]
# Find influential observations
influential = np.where(cooks_d > 4/len(y))[0]
print(f"Influential observations: {influential}")
```
## Hypothesis Testing
```python
# Wald test for single parameter (automatically in summary)
# Likelihood ratio test for nested models
# Fit reduced model
model_reduced = sm.GLM(y, X_reduced, family=family).fit()
model_full = sm.GLM(y, X_full, family=family).fit()
# LR statistic
lr_stat = 2 * (model_full.llf - model_reduced.llf)
df = model_full.df_model - model_reduced.df_model
from scipy import stats
lr_pval = 1 - stats.chi2.cdf(lr_stat, df)
print(f"LR test p-value: {lr_pval}")
# Wald test for multiple parameters
# Test beta_1 = beta_2 = 0
R = [[0, 1, 0, 0], [0, 0, 1, 0]]
wald_test = results.wald_test(R)
print(wald_test)
```
## Robust Standard Errors
```python
# For GLM, request the covariance type at fit time
# Heteroscedasticity-robust (sandwich estimator)
results_robust = sm.GLM(y, X, family=family).fit(cov_type='HC0')
# Cluster-robust
results_cluster = sm.GLM(y, X, family=family).fit(
    cov_type='cluster', cov_kwds={'groups': cluster_ids})
# Compare standard errors
print("Regular SE:", results.bse)
print("Robust SE:", results_robust.bse)
```
## Model Comparison
```python
# AIC/BIC for non-nested models
models = [model1_results, model2_results, model3_results]
for i, res in enumerate(models, 1):
print(f"Model {i}: AIC={res.aic:.2f}, BIC={res.bic:.2f}")
# Likelihood ratio test for nested models (as shown above)
# Cross-validation for predictive performance
from sklearn.model_selection import KFold
from sklearn.metrics import log_loss
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = []
for train_idx, val_idx in kf.split(X):
X_train, X_val = X[train_idx], X[val_idx]
y_train, y_val = y[train_idx], y[val_idx]
model_cv = sm.GLM(y_train, X_train, family=family).fit()
pred_probs = model_cv.predict(X_val)
score = log_loss(y_val, pred_probs)
cv_scores.append(score)
print(f"CV Log Loss: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}")
```
## Prediction
```python
# Point predictions
predictions = results.predict(X_new)
# For classification: get probabilities and convert
if isinstance(family, sm.families.Binomial):
probs = predictions
class_predictions = (probs > 0.5).astype(int)
# For counts: predictions are expected counts
if isinstance(family, sm.families.Poisson):
expected_counts = predictions
# Bootstrap intervals for the predicted mean (these are confidence
# intervals for E(Y|X), not prediction intervals for new observations)
n_boot = 1000
boot_preds = np.zeros((n_boot, len(X_new)))
for i in range(n_boot):
# Bootstrap resample
boot_idx = np.random.choice(len(y), size=len(y), replace=True)
X_boot, y_boot = X[boot_idx], y[boot_idx]
# Fit and predict
boot_model = sm.GLM(y_boot, X_boot, family=family).fit()
boot_preds[i] = boot_model.predict(X_new)
# 95% bootstrap confidence intervals for the predicted mean
pred_lower = np.percentile(boot_preds, 2.5, axis=0)
pred_upper = np.percentile(boot_preds, 97.5, axis=0)
```
## Common Applications
### Logistic Regression (Binary Classification)
```python
import numpy as np
import statsmodels.api as sm
# Fit logistic regression
X = sm.add_constant(X_data)
model = sm.GLM(y, X, family=sm.families.Binomial())
results = model.fit()
# Odds ratios
odds_ratios = np.exp(results.params)
odds_ci = np.exp(results.conf_int())
# Classification metrics
from sklearn.metrics import classification_report, roc_auc_score
probs = results.predict(X)
predictions = (probs > 0.5).astype(int)
print(classification_report(y, predictions))
print(f"AUC: {roc_auc_score(y, probs):.4f}")
# ROC curve
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
fpr, tpr, thresholds = roc_curve(y, probs)
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
```
### Poisson Regression (Count Data)
```python
# Fit Poisson model
X = sm.add_constant(X_data)
model = sm.GLM(y_counts, X, family=sm.families.Poisson())
results = model.fit()
# Rate ratios
rate_ratios = np.exp(results.params)
print("Rate ratios:", rate_ratios)
# Check overdispersion
dispersion = results.pearson_chi2 / results.df_resid
if dispersion > 1.5:
print(f"Overdispersion detected ({dispersion:.2f}). Consider Negative Binomial.")
```
### Gamma Regression (Cost/Duration Data)
```python
# Fit Gamma model with log link
X = sm.add_constant(X_data)
model = sm.GLM(y_cost, X,
family=sm.families.Gamma(link=sm.families.links.Log()))
results = model.fit()
# Multiplicative effects
effects = np.exp(results.params)
print("Multiplicative effects on mean:", effects)
```
## Best Practices
1. **Check distribution assumptions**: Plot histograms and Q-Q plots of response
2. **Verify link function**: Use canonical links unless there's a reason not to
3. **Examine residuals**: Deviance residuals should be approximately normal
4. **Test for overdispersion**: Especially for Poisson models
5. **Use offsets appropriately**: For rate modeling with varying exposure
6. **Consider robust SEs**: When variance assumptions questionable
7. **Compare models**: Use AIC/BIC for non-nested, LR test for nested
8. **Interpret on original scale**: Transform coefficients (e.g., exp for log link)
9. **Check influential observations**: Use Cook's distance
10. **Validate predictions**: Use cross-validation or holdout set
## Common Pitfalls
1. **Forgetting to add constant**: No intercept term
2. **Using wrong family**: Check distribution of response
3. **Ignoring overdispersion**: Use Negative Binomial instead of Poisson
4. **Misinterpreting coefficients**: Remember link function transformation
5. **Not checking convergence**: IRLS may not converge; check warnings
6. **Complete separation in logistic**: Some categories perfectly predict outcome
7. **Using identity link with bounded outcomes**: May predict outside valid range
8. **Comparing models with different samples**: Use same observations
9. **Forgetting offset in rate models**: Must use log(exposure) as offset
10. **Not considering alternatives**: Mixed models, zero-inflation for complex data

@@ -0,0 +1,447 @@
# Linear Regression Models Reference
This document provides detailed guidance on linear regression models in statsmodels, including OLS, GLS, WLS, quantile regression, and specialized variants.
## Core Model Classes
### OLS (Ordinary Least Squares)
Assumes independent, identically distributed errors (Σ=I). Best for standard regression with homoscedastic errors.
**When to use:**
- Standard regression analysis
- Errors are independent and have constant variance
- No autocorrelation or heteroscedasticity
- Most common starting point
**Basic usage:**
```python
import statsmodels.api as sm
import numpy as np
# Prepare data - ALWAYS add constant for intercept
X = sm.add_constant(X_data) # Adds column of 1s for intercept
# Fit model
model = sm.OLS(y, X)
results = model.fit()
# View results
print(results.summary())
```
**Key results attributes:**
```python
results.params # Coefficients
results.bse # Standard errors
results.tvalues # T-statistics
results.pvalues # P-values
results.rsquared # R-squared
results.rsquared_adj # Adjusted R-squared
results.fittedvalues # Fitted values (predictions on training data)
results.resid # Residuals
results.conf_int() # Confidence intervals for parameters
```
**Prediction with confidence/prediction intervals:**
```python
# For in-sample predictions
pred = results.get_prediction(X)
pred_summary = pred.summary_frame()
print(pred_summary) # Contains mean, std, confidence intervals
# For out-of-sample predictions
X_new = sm.add_constant(X_new_data)
pred_new = results.get_prediction(X_new)
pred_summary = pred_new.summary_frame()
# Access intervals
mean_ci_lower = pred_summary["mean_ci_lower"]
mean_ci_upper = pred_summary["mean_ci_upper"]
obs_ci_lower = pred_summary["obs_ci_lower"] # Prediction intervals
obs_ci_upper = pred_summary["obs_ci_upper"]
```
**Formula API (R-style):**
```python
import statsmodels.formula.api as smf
# Automatic handling of categorical variables and interactions
formula = 'y ~ x1 + x2 + C(category) + x1:x2'
results = smf.ols(formula, data=df).fit()
```
### WLS (Weighted Least Squares)
Handles heteroscedastic errors (diagonal Σ) where variance differs across observations.
**When to use:**
- Known heteroscedasticity (non-constant error variance)
- Different observations have different reliability
- Weights are known or can be estimated
**Usage:**
```python
# If you know the weights (inverse variance)
weights = 1 / error_variance
model = sm.WLS(y, X, weights=weights)
results = model.fit()
# Common weight patterns:
# - 1/variance: when variance is known
# - n_i: sample size for grouped data
# - 1/x: when variance proportional to x
```
**Feasible WLS (estimating weights):**
```python
# Step 1: Fit OLS
ols_results = sm.OLS(y, X).fit()
# Step 2: Model log squared residuals to estimate the variance function
log_resid_sq = np.log(ols_results.resid ** 2)
variance_model = sm.OLS(log_resid_sq, X).fit()
# Step 3: Weight by the inverse of the estimated variance
weights = 1 / np.exp(variance_model.fittedvalues)
wls_results = sm.WLS(y, X, weights=weights).fit()
```
### GLS (Generalized Least Squares)
Handles arbitrary covariance structure (Σ). Superclass for other regression methods.
**When to use:**
- Known covariance structure
- Correlated errors
- More general than WLS
**Usage:**
```python
# Specify covariance structure
# Sigma should be (n x n) covariance matrix
model = sm.GLS(y, X, sigma=Sigma)
results = model.fit()
```
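As a concrete illustration, a sketch that builds an AR(1) covariance Σᵢⱼ = ρ^|i−j| (ρ is assumed known here, e.g. taken from a pilot OLS fit of the residuals):
```python
import numpy as np
import statsmodels.api as sm
rho = 0.5  # assumed AR(1) correlation of the errors
n = len(y)
# Sigma[i, j] = rho ** |i - j|
Sigma = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
gls_results = sm.GLS(y, X, sigma=Sigma).fit()
print(gls_results.summary())
```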
### GLSAR (GLS with Autoregressive Errors)
Feasible generalized least squares with AR(p) errors for time series data.
**When to use:**
- Time series regression with autocorrelated errors
- Need to account for serial correlation
- Violations of error independence
**Usage:**
```python
# AR(1) errors
model = sm.GLSAR(y, X, rho=1) # rho=1 for AR(1), rho=2 for AR(2), etc.
results = model.iterative_fit() # Iteratively estimates AR parameters
print(results.summary())
print(f"Estimated rho: {results.model.rho}")
```
### RLS (Recursive Least Squares)
Sequential parameter estimation, useful for adaptive or online learning.
**When to use:**
- Parameters change over time
- Online/streaming data
- Want to see parameter evolution
**Usage:**
```python
from statsmodels.regression.recursive_ls import RecursiveLS
model = RecursiveLS(y, X)
results = model.fit()
# Access time-varying parameters
params_over_time = results.recursive_coefficients
cusum = results.cusum # CUSUM statistic for structural breaks
```
### Rolling Regressions
Compute estimates across moving windows for time-varying parameter detection.
**When to use:**
- Parameters vary over time
- Want to detect structural changes
- Time series with evolving relationships
**Usage:**
```python
from statsmodels.regression.rolling import RollingOLS, RollingWLS
# Rolling OLS with 60-period window
rolling_model = RollingOLS(y, X, window=60)
rolling_results = rolling_model.fit()
# Extract time-varying parameters
rolling_params = rolling_results.params # DataFrame with parameters over time
rolling_rsquared = rolling_results.rsquared
# Plot parameter evolution
import matplotlib.pyplot as plt
rolling_params.plot()
plt.title('Time-Varying Coefficients')
plt.show()
```
### Quantile Regression
Analyzes conditional quantiles rather than conditional mean.
**When to use:**
- Interest in quantiles (median, 90th percentile, etc.)
- Robust to outliers (median regression)
- Distributional effects across quantiles
- Heterogeneous effects
**Usage:**
```python
from statsmodels.regression.quantile_regression import QuantReg
# Median regression (50th percentile)
model = QuantReg(y, X)
results_median = model.fit(q=0.5)
# Multiple quantiles
quantiles = [0.1, 0.25, 0.5, 0.75, 0.9]
results_dict = {}
for q in quantiles:
results_dict[q] = model.fit(q=q)
# Plot quantile-varying effects
import matplotlib.pyplot as plt
import pandas as pd
coef_dict = {q: res.params for q, res in results_dict.items()}
coef_df = pd.DataFrame(coef_dict).T
coef_df.plot()
plt.xlabel('Quantile')
plt.ylabel('Coefficient')
plt.show()
```
## Mixed Effects Models
For hierarchical/nested data with random effects.
**When to use:**
- Clustered/grouped data (students in schools, patients in hospitals)
- Repeated measures
- Need random effects to account for grouping
**Usage:**
```python
from statsmodels.regression.mixed_linear_model import MixedLM
# Random intercept model
model = MixedLM(y, X, groups=group_ids)
results = model.fit()
# Random intercept and slope
model = MixedLM(y, X, groups=group_ids, exog_re=X_random)
results = model.fit()
print(results.summary())
```
## Diagnostics and Model Assessment
### Residual Analysis
```python
# Basic residual plots
import matplotlib.pyplot as plt
# Residuals vs fitted
plt.scatter(results.fittedvalues, results.resid)
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.axhline(y=0, color='r', linestyle='--')
plt.title('Residuals vs Fitted')
plt.show()
# Q-Q plot for normality
from statsmodels.graphics.gofplots import qqplot
qqplot(results.resid, line='s')
plt.show()
# Histogram of residuals
plt.hist(results.resid, bins=30, edgecolor='black')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals')
plt.show()
```
### Specification Tests
```python
from statsmodels.stats.diagnostic import het_breuschpagan, het_white
from statsmodels.stats.stattools import durbin_watson, jarque_bera
# Heteroscedasticity tests
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(results.resid, X)
print(f"Breusch-Pagan test p-value: {lm_pval}")
# White test
white_test = het_white(results.resid, X)
print(f"White test p-value: {white_test[1]}")
# Autocorrelation
dw_stat = durbin_watson(results.resid)
print(f"Durbin-Watson statistic: {dw_stat}")
# DW ~ 2 indicates no autocorrelation
# DW < 2 suggests positive autocorrelation
# DW > 2 suggests negative autocorrelation
# Normality test
jb_stat, jb_pval, skew, kurtosis = jarque_bera(results.resid)
print(f"Jarque-Bera test p-value: {jb_pval}")
```
### Multicollinearity
```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd
# Calculate VIF for each variable
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)
# VIF > 10 indicates problematic multicollinearity
# VIF > 5 suggests moderate multicollinearity
# Condition number (from summary)
print(f"Condition number: {results.condition_number}")
# Condition number > 20 suggests multicollinearity
# Condition number > 30 indicates serious problems
```
### Influence Statistics
```python
from statsmodels.stats.outliers_influence import OLSInfluence
influence = results.get_influence()
# Leverage (hat values)
leverage = influence.hat_matrix_diag
# High leverage: > 2*p/n (p=predictors, n=observations)
# Cook's distance
cooks_d = influence.cooks_distance[0]
# Influential if Cook's D > 4/n
# DFFITS
dffits = influence.dffits[0]
# Influential if |DFFITS| > 2*sqrt(p/n)
# Create influence plot
from statsmodels.graphics.regressionplots import influence_plot
fig, ax = plt.subplots(figsize=(12, 8))
influence_plot(results, ax=ax)
plt.show()
```
### Hypothesis Testing
```python
# Test single coefficient
# H0: beta_i = 0 (automatically in summary)
# Test multiple restrictions using F-test
# Example: Test beta_1 = beta_2 = 0
R = [[0, 1, 0, 0], [0, 0, 1, 0]] # Restriction matrix
f_test = results.f_test(R)
print(f_test)
# Formula-based hypothesis testing
f_test = results.f_test("x1 = x2 = 0")
print(f_test)
# Test linear combination: beta_1 + beta_2 = 1
r_matrix = [[0, 1, 1, 0]]
q_matrix = [1] # RHS value
f_test = results.f_test((r_matrix, q_matrix))
print(f_test)
# Wald test (equivalent to F-test for linear restrictions)
wald_test = results.wald_test(R)
print(wald_test)
```
## Model Comparison
```python
# Compare nested models using likelihood ratio test (if using MLE)
from statsmodels.stats.anova import anova_lm
# Fit restricted and unrestricted models
model_restricted = sm.OLS(y, X_restricted).fit()
model_full = sm.OLS(y, X_full).fit()
# ANOVA table for model comparison
anova_results = anova_lm(model_restricted, model_full)
print(anova_results)
# AIC/BIC for non-nested model comparison
print(f"Model 1 AIC: {model1.aic}, BIC: {model1.bic}")
print(f"Model 2 AIC: {model2.aic}, BIC: {model2.bic}")
# Lower AIC/BIC indicates better model
```
## Robust Standard Errors
Handle heteroscedasticity or clustering without reweighting.
```python
# Heteroscedasticity-robust (HC) standard errors
results_hc = results.get_robustcov_results(cov_type='HC0') # White's
results_hc1 = results.get_robustcov_results(cov_type='HC1')
results_hc2 = results.get_robustcov_results(cov_type='HC2')
results_hc3 = results.get_robustcov_results(cov_type='HC3') # Most conservative
# Newey-West HAC (Heteroscedasticity and Autocorrelation Consistent)
results_hac = results.get_robustcov_results(cov_type='HAC', maxlags=4)
# Cluster-robust standard errors
results_cluster = results.get_robustcov_results(cov_type='cluster',
groups=cluster_ids)
# View robust results
print(results_hc3.summary())
```
## Best Practices
1. **Always add constant**: Use `sm.add_constant()` unless you specifically want to exclude the intercept
2. **Check assumptions**: Run diagnostic tests (heteroscedasticity, autocorrelation, normality)
3. **Use formula API for categorical variables**: `smf.ols()` handles categorical variables automatically
4. **Robust standard errors**: Use when heteroscedasticity detected but model specification is correct
5. **Model selection**: Use AIC/BIC for non-nested models, F-test/likelihood ratio for nested models
6. **Outliers and influence**: Always check Cook's distance and leverage
7. **Multicollinearity**: Check VIF and condition number before interpretation
8. **Time series**: Use `GLSAR` or robust HAC standard errors for autocorrelated errors
9. **Grouped data**: Consider mixed effects models or cluster-robust standard errors
10. **Quantile regression**: Use for robust estimation or when interested in distributional effects
## Common Pitfalls
1. **Forgetting to add constant**: Results in no-intercept model
2. **Ignoring heteroscedasticity**: Use WLS or robust standard errors
3. **Using OLS with autocorrelated errors**: Use GLSAR or HAC standard errors
4. **Over-interpreting with multicollinearity**: Check VIF first
5. **Not checking residuals**: Always plot residuals vs fitted values
6. **Diagnosing residuals in a transformed space**: Compute and inspect residuals from the original model, not from PCA or other projections
7. **Confusing prediction vs confidence intervals**: Prediction intervals are wider
8. **Not handling categorical variables properly**: Use formula API or manual dummy coding
9. **Comparing models with different sample sizes**: Ensure same observations used
10. **Ignoring influential observations**: Check Cook's distance and DFFITS

@@ -0,0 +1,859 @@
# Statistical Tests and Diagnostics Reference
This document provides comprehensive guidance on statistical tests, diagnostics, and tools available in statsmodels.
## Overview
Statsmodels provides extensive statistical testing capabilities:
- Residual diagnostics and specification tests
- Hypothesis testing (parametric and non-parametric)
- Goodness-of-fit tests
- Multiple comparisons and post-hoc tests
- Power and sample size calculations
- Robust covariance matrices
- Influence and outlier detection
## Residual Diagnostics
### Autocorrelation Tests
**Ljung-Box Test**: Tests for autocorrelation in residuals
```python
from statsmodels.stats.diagnostic import acorr_ljungbox
# Test residuals for autocorrelation
lb_test = acorr_ljungbox(residuals, lags=10, return_df=True)
print(lb_test)
# H0: No autocorrelation up to lag k
# If p-value < 0.05, reject H0 (autocorrelation present)
```
**Durbin-Watson Test**: Tests for first-order autocorrelation
```python
from statsmodels.stats.stattools import durbin_watson
dw_stat = durbin_watson(residuals)
print(f"Durbin-Watson: {dw_stat:.4f}")
# DW ≈ 2: no autocorrelation
# DW < 2: positive autocorrelation
# DW > 2: negative autocorrelation
# Exact critical values depend on n and k
```
**Breusch-Godfrey Test**: More general test for autocorrelation
```python
from statsmodels.stats.diagnostic import acorr_breusch_godfrey
bg_test = acorr_breusch_godfrey(results, nlags=5)
lm_stat, lm_pval, f_stat, f_pval = bg_test
print(f"LM statistic: {lm_stat:.4f}, p-value: {lm_pval:.4f}")
# H0: No autocorrelation up to lag k
```
### Heteroskedasticity Tests
**Breusch-Pagan Test**: Tests for heteroskedasticity
```python
from statsmodels.stats.diagnostic import het_breuschpagan
bp_test = het_breuschpagan(residuals, exog)
lm_stat, lm_pval, f_stat, f_pval = bp_test
print(f"Breusch-Pagan test p-value: {lm_pval:.4f}")
# H0: Homoskedasticity (constant variance)
# If p-value < 0.05, reject H0 (heteroskedasticity present)
```
**White Test**: More general test for heteroskedasticity
```python
from statsmodels.stats.diagnostic import het_white
white_test = het_white(residuals, exog)
lm_stat, lm_pval, f_stat, f_pval = white_test
print(f"White test p-value: {lm_pval:.4f}")
# H0: Homoskedasticity
```
**ARCH Test**: Tests for autoregressive conditional heteroskedasticity
```python
from statsmodels.stats.diagnostic import het_arch
arch_test = het_arch(residuals, nlags=5)
lm_stat, lm_pval, f_stat, f_pval = arch_test
print(f"ARCH test p-value: {lm_pval:.4f}")
# H0: No ARCH effects
# If significant, consider GARCH model
```
### Normality Tests
**Jarque-Bera Test**: Tests for normality using skewness and kurtosis
```python
from statsmodels.stats.stattools import jarque_bera
jb_stat, jb_pval, skew, kurtosis = jarque_bera(residuals)
print(f"Jarque-Bera statistic: {jb_stat:.4f}")
print(f"p-value: {jb_pval:.4f}")
print(f"Skewness: {skew:.4f}")
print(f"Kurtosis: {kurtosis:.4f}")
# H0: Residuals are normally distributed
# Normal: skewness ≈ 0, kurtosis ≈ 3
```
**Omnibus Test**: Another normality test (also based on skewness/kurtosis)
```python
from statsmodels.stats.stattools import omni_normtest
omni_stat, omni_pval = omni_normtest(residuals)
print(f"Omnibus test p-value: {omni_pval:.4f}")
# H0: Normality
```
**Anderson-Darling Test**: Distribution fit test
```python
from statsmodels.stats.diagnostic import normal_ad
ad_stat, ad_pval = normal_ad(residuals)
print(f"Anderson-Darling test p-value: {ad_pval:.4f}")
```
**Lilliefors Test**: Modified Kolmogorov-Smirnov test
```python
from statsmodels.stats.diagnostic import lilliefors
lf_stat, lf_pval = lilliefors(residuals, dist='norm')
print(f"Lilliefors test p-value: {lf_pval:.4f}")
```
### Linearity and Specification Tests
**Ramsey RESET Test**: Tests for functional form misspecification
```python
from statsmodels.stats.diagnostic import linear_reset
reset_test = linear_reset(results, power=2, use_f=True)
print(f"RESET test p-value: {reset_test.pvalue:.4f}")
# H0: Model is correctly specified (linear)
# If rejected, may need polynomial terms or transformations
```
**Harvey-Collier Test**: Tests for linearity
```python
from statsmodels.stats.diagnostic import linear_harvey_collier
hc_stat, hc_pval = linear_harvey_collier(results)
print(f"Harvey-Collier test p-value: {hc_pval:.4f}")
# H0: Linear specification is correct
```
## Multicollinearity Detection
**Variance Inflation Factor (VIF)**:
```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd
# Calculate VIF for each variable
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i)
for i in range(X.shape[1])]
print(vif_data.sort_values('VIF', ascending=False))
# Interpretation:
# VIF = 1: No correlation with other predictors
# VIF > 5: Moderate multicollinearity
# VIF > 10: Serious multicollinearity problem
# VIF > 20: Severe multicollinearity (consider removing variable)
```
**Condition Number**: From regression results
```python
print(f"Condition number: {results.condition_number:.2f}")
# Interpretation:
# < 10: No multicollinearity concern
# 10-30: Moderate multicollinearity
# > 30: Strong multicollinearity
# > 100: Severe multicollinearity
```
## Influence and Outlier Detection
### Leverage
High leverage points have extreme predictor values.
```python
import numpy as np
from statsmodels.stats.outliers_influence import OLSInfluence
influence = results.get_influence()
# Hat values (leverage)
leverage = influence.hat_matrix_diag
# Rule of thumb: leverage > 2*p/n or 3*p/n is high
# p = number of parameters, n = sample size
threshold = 2 * len(results.params) / len(y)
high_leverage = np.where(leverage > threshold)[0]
print(f"High leverage observations: {high_leverage}")
```
### Cook's Distance
Measures overall influence of each observation.
```python
# Cook's distance
cooks_d = influence.cooks_distance[0]
# Rule of thumb: Cook's D > 4/n is influential
threshold = 4 / len(y)
influential = np.where(cooks_d > threshold)[0]
print(f"Influential observations (Cook's D): {influential}")
# Plot
import matplotlib.pyplot as plt
plt.stem(range(len(cooks_d)), cooks_d)
plt.axhline(y=threshold, color='r', linestyle='--', label=f'Threshold (4/n)')
plt.xlabel('Observation')
plt.ylabel("Cook's Distance")
plt.legend()
plt.show()
```
### DFFITS
Measures influence on fitted value.
```python
# DFFITS
dffits = influence.dffits[0]
# Rule of thumb: |DFFITS| > 2*sqrt(p/n) is influential
p = len(results.params)
n = len(y)
threshold = 2 * np.sqrt(p / n)
influential_dffits = np.where(np.abs(dffits) > threshold)[0]
print(f"Influential observations (DFFITS): {influential_dffits}")
```
### DFBETAs
Measures influence on each coefficient.
```python
# DFBETAs (one for each parameter)
dfbetas = influence.dfbetas
# Rule of thumb: |DFBETA| > 2/sqrt(n)
threshold = 2 / np.sqrt(n)
for i, param_name in enumerate(results.params.index):
influential = np.where(np.abs(dfbetas[:, i]) > threshold)[0]
if len(influential) > 0:
print(f"Influential for {param_name}: {influential}")
```
### Influence Plot
```python
from statsmodels.graphics.regressionplots import influence_plot
fig, ax = plt.subplots(figsize=(12, 8))
influence_plot(results, ax=ax, criterion='cooks')
plt.show()
# Combines leverage, residuals, and Cook's distance
# Large bubbles = high Cook's distance
# Far from x=0 = high leverage
# Far from y=0 = large residual
```
### Studentized Residuals
```python
# Studentized residuals (outliers)
student_resid = influence.resid_studentized_internal
# External studentized residuals (more conservative)
student_resid_external = influence.resid_studentized_external
# Outliers: |studentized residual| > 3 (or > 2.5)
outliers = np.where(np.abs(student_resid_external) > 3)[0]
print(f"Outliers: {outliers}")
```
## Hypothesis Testing
### t-tests
**One-sample t-test**: Test if mean equals specific value
```python
from scipy import stats
# H0: population mean = mu_0
t_stat, p_value = stats.ttest_1samp(data, popmean=mu_0)
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
```
**Two-sample t-test**: Compare means of two groups
```python
# H0: mean1 = mean2 (equal variances)
t_stat, p_value = stats.ttest_ind(group1, group2)
# Welch's t-test (unequal variances)
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
```
**Paired t-test**: Compare paired observations
```python
# H0: mean difference = 0
t_stat, p_value = stats.ttest_rel(before, after)
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
```
### Proportion Tests
**One-proportion test**:
```python
from statsmodels.stats.proportion import proportions_ztest
# H0: proportion = p0
count = 45 # successes
nobs = 100 # total observations
p0 = 0.5 # hypothesized proportion
z_stat, p_value = proportions_ztest(count, nobs, value=p0)
print(f"z-statistic: {z_stat:.4f}")
print(f"p-value: {p_value:.4f}")
```
**Two-proportion test**:
```python
# H0: proportion1 = proportion2
counts = [45, 60]
nobs = [100, 120]
z_stat, p_value = proportions_ztest(counts, nobs)
print(f"z-statistic: {z_stat:.4f}")
print(f"p-value: {p_value:.4f}")
```
### Chi-square Tests
**Chi-square test of independence**:
```python
from scipy.stats import chi2_contingency
import pandas as pd
# Contingency table
contingency_table = pd.crosstab(variable1, variable2)
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
print(f"Chi-square statistic: {chi2:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
# H0: Variables are independent
```
**Chi-square goodness-of-fit**:
```python
from scipy.stats import chisquare
# Observed frequencies
observed = [20, 30, 25, 25]
# Expected frequencies (equal by default)
expected = [25, 25, 25, 25]
chi2, p_value = chisquare(observed, expected)
print(f"Chi-square statistic: {chi2:.4f}")
print(f"p-value: {p_value:.4f}")
# H0: Data follow the expected distribution
```
### Non-parametric Tests
**Mann-Whitney U test** (independent samples):
```python
from scipy.stats import mannwhitneyu
# H0: Distributions are equal
u_stat, p_value = mannwhitneyu(group1, group2, alternative='two-sided')
print(f"U statistic: {u_stat:.4f}")
print(f"p-value: {p_value:.4f}")
```
**Wilcoxon signed-rank test** (paired samples):
```python
from scipy.stats import wilcoxon
# H0: Median difference = 0
w_stat, p_value = wilcoxon(before, after)
print(f"W statistic: {w_stat:.4f}")
print(f"p-value: {p_value:.4f}")
```
**Kruskal-Wallis H test** (>2 groups):
```python
from scipy.stats import kruskal
# H0: All groups have same distribution
h_stat, p_value = kruskal(group1, group2, group3)
print(f"H statistic: {h_stat:.4f}")
print(f"p-value: {p_value:.4f}")
```
**Sign test**:
```python
from statsmodels.stats.descriptivestats import sign_test
# H0: Median = m0
result = sign_test(data, m0=0)
print(result)
```
### ANOVA
**One-way ANOVA**:
```python
from scipy.stats import f_oneway
# H0: All group means are equal
f_stat, p_value = f_oneway(group1, group2, group3)
print(f"F-statistic: {f_stat:.4f}")
print(f"p-value: {p_value:.4f}")
```
**Two-way ANOVA** (with statsmodels):
```python
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
# Fit model
model = ols('response ~ C(factor1) + C(factor2) + C(factor1):C(factor2)',
data=df).fit()
# ANOVA table
anova_table = anova_lm(model, typ=2)
print(anova_table)
```
**Repeated measures ANOVA**:
```python
from statsmodels.stats.anova import AnovaRM
# Requires long-format data
aovrm = AnovaRM(df, depvar='score', subject='subject_id', within=['time'])
results = aovrm.fit()
print(results.summary())
```
## Multiple Comparisons
### Post-hoc Tests
**Tukey's HSD** (Honest Significant Difference):
```python
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Perform Tukey HSD test
tukey = pairwise_tukeyhsd(data, groups, alpha=0.05)
print(tukey.summary())
# Plot confidence intervals
import matplotlib.pyplot as plt
tukey.plot_simultaneous()
plt.show()
```
**Bonferroni correction**:
```python
from statsmodels.stats.multitest import multipletests
# P-values from multiple tests
p_values = [0.01, 0.03, 0.04, 0.15, 0.001]
# Apply correction
reject, pvals_corrected, alphac_sidak, alphac_bonf = multipletests(
p_values,
alpha=0.05,
method='bonferroni'
)
print("Rejected:", reject)
print("Corrected p-values:", pvals_corrected)
```
**False Discovery Rate (FDR)**:
```python
# FDR correction (less conservative than Bonferroni)
reject, pvals_corrected, alphac_sidak, alphac_bonf = multipletests(
p_values,
alpha=0.05,
method='fdr_bh' # Benjamini-Hochberg
)
print("Rejected:", reject)
print("Corrected p-values:", pvals_corrected)
```
## Robust Covariance Matrices
### Heteroskedasticity-Consistent (HC) Standard Errors
```python
# After fitting OLS
results = sm.OLS(y, X).fit()
# HC0 (White's heteroskedasticity-consistent SEs)
results_hc0 = results.get_robustcov_results(cov_type='HC0')
# HC1 (degrees of freedom adjustment)
results_hc1 = results.get_robustcov_results(cov_type='HC1')
# HC2 (leverage adjustment)
results_hc2 = results.get_robustcov_results(cov_type='HC2')
# HC3 (most conservative, recommended for small samples)
results_hc3 = results.get_robustcov_results(cov_type='HC3')
print("Standard OLS SEs:", results.bse)
print("Robust HC3 SEs:", results_hc3.bse)
```
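Equivalently, robust covariance can be requested directly at fit time:
```python
# Same HC3 standard errors, requested when fitting
results_hc3_direct = sm.OLS(y, X).fit(cov_type='HC3')
print(results_hc3_direct.bse)
```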
### HAC (Heteroskedasticity and Autocorrelation Consistent)
**Newey-West standard errors**:
```python
# For time series with autocorrelation and heteroskedasticity
# (a common rule of thumb: maxlags ≈ floor(0.75 * T**(1/3)))
results_hac = results.get_robustcov_results(cov_type='HAC', maxlags=4)
print("HAC (Newey-West) SEs:", results_hac.bse)
print(results_hac.summary())
```
### Cluster-Robust Standard Errors
```python
# For clustered/grouped data
results_cluster = results.get_robustcov_results(
cov_type='cluster',
groups=cluster_ids
)
print("Cluster-robust SEs:", results_cluster.bse)
```
## Descriptive Statistics
**Basic descriptive statistics**:
```python
from statsmodels.stats.api import DescrStatsW
# Comprehensive descriptive stats
desc = DescrStatsW(data)
print("Mean:", desc.mean)
print("Std Dev:", desc.std)
print("Variance:", desc.var)
print("Confidence interval:", desc.tconfint_mean())
# Quantiles
print("Median:", desc.quantile(0.5))
print("IQR:", desc.quantile([0.25, 0.75]))
```
**Weighted statistics**:
```python
# With weights
desc_weighted = DescrStatsW(data, weights=weights)
print("Weighted mean:", desc_weighted.mean)
print("Weighted std:", desc_weighted.std)
```
**Compare two groups**:
```python
from statsmodels.stats.weightstats import CompareMeans
# Create comparison object
cm = CompareMeans(DescrStatsW(group1), DescrStatsW(group2))
# t-test
print("t-test:", cm.ttest_ind())
# Confidence interval for difference
print("CI for difference:", cm.tconfint_diff())
# Test for equal variances (Levene's test via scipy; CompareMeans itself
# does not provide a variance test)
from scipy.stats import levene
lev_stat, lev_p = levene(group1, group2)
print(f"Levene's test: stat={lev_stat:.4f}, p={lev_p:.4f}")
```
## Power Analysis and Sample Size
**Power for t-test**:
```python
from statsmodels.stats.power import tt_ind_solve_power
# Solve for sample size
effect_size = 0.5 # Cohen's d
alpha = 0.05
power = 0.8
n = tt_ind_solve_power(effect_size=effect_size,
alpha=alpha,
power=power,
alternative='two-sided')
print(f"Required sample size per group: {n:.0f}")
# Solve for power given n
power = tt_ind_solve_power(effect_size=0.5,
nobs1=50,
alpha=0.05,
alternative='two-sided')
print(f"Power: {power:.4f}")
```
**Power for proportion test**:
```python
from statsmodels.stats.power import zt_ind_solve_power
# For proportion tests (z-test)
effect_size = 0.3  # Standardized effect size (Cohen's h), not a raw difference in proportions
alpha = 0.05
power = 0.8
n = zt_ind_solve_power(effect_size=effect_size,
alpha=alpha,
power=power,
alternative='two-sided')
print(f"Required sample size per group: {n:.0f}")
```
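`zt_ind_solve_power` expects a standardized effect size; to convert raw proportions, statsmodels provides `proportion_effectsize` (Cohen's h):
```python
from statsmodels.stats.proportion import proportion_effectsize

# Cohen's h for comparing proportions 0.65 vs 0.50
es = proportion_effectsize(0.65, 0.50)
print(f"Cohen's h: {es:.4f}")
```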
**Power curves**:
```python
from statsmodels.stats.power import TTestIndPower
import matplotlib.pyplot as plt
# Create power analysis object
analysis = TTestIndPower()
# Plot power curves for different sample sizes
sample_sizes = range(10, 200, 10)
effect_sizes = [0.2, 0.5, 0.8] # Small, medium, large
fig, ax = plt.subplots(figsize=(10, 6))
for es in effect_sizes:
power = [analysis.solve_power(effect_size=es, nobs1=n, alpha=0.05)
for n in sample_sizes]
ax.plot(sample_sizes, power, label=f'Effect size = {es}')
ax.axhline(y=0.8, color='r', linestyle='--', label='Power = 0.8')
ax.set_xlabel('Sample size per group')
ax.set_ylabel('Power')
ax.set_title('Power Curves for Two-Sample t-test')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()
```
## Effect Sizes
**Cohen's d** (standardized mean difference):
```python
import numpy as np

def cohens_d(group1, group2):
    """Calculate Cohen's d for independent samples"""
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    # Pooled standard deviation
    pooled_std = np.sqrt(((n1-1)*var1 + (n2-1)*var2) / (n1+n2-2))
    # Cohen's d
    d = (np.mean(group1) - np.mean(group2)) / pooled_std
    return d
d = cohens_d(group1, group2)
print(f"Cohen's d: {d:.4f}")
# Interpretation:
# |d| < 0.2: negligible
# |d| ~ 0.2: small
# |d| ~ 0.5: medium
# |d| ~ 0.8: large
```
**Eta-squared** (for ANOVA):
```python
# From ANOVA table
# η² = SS_between / SS_total
def eta_squared(anova_table):
    # First row = effect of interest; use positional .iloc access
    return anova_table['sum_sq'].iloc[0] / anova_table['sum_sq'].sum()
# After running ANOVA
eta_sq = eta_squared(anova_table)
print(f"Eta-squared: {eta_sq:.4f}")
# Interpretation:
# 0.01: small effect
# 0.06: medium effect
# 0.14: large effect
```
## Contingency Tables and Association
**McNemar's test** (paired binary data):
```python
from statsmodels.stats.contingency_tables import mcnemar
# 2x2 contingency table
table = [[a, b],
[c, d]]
result = mcnemar(table, exact=True) # or exact=False for large samples
print(f"p-value: {result.pvalue:.4f}")
# H0: Marginal probabilities are equal
```
**Cochran-Mantel-Haenszel test**:
```python
from statsmodels.stats.contingency_tables import StratifiedTable
# For stratified 2x2 tables: tables_list holds one 2x2 array per stratum
strat_table = StratifiedTable(tables_list)
result = strat_table.test_null_odds()
print(f"p-value: {result.pvalue:.4f}")
```
## Treatment Effects and Causal Inference
**Propensity score matching**:
```python
# Estimate propensity scores with a logit model
ps_model = sm.Logit(treatment, X).fit()
propensity_scores = ps_model.predict(X)
# Use for matching or weighting
# (statsmodels has no built-in matcher; see the sketch below)
```
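A minimal greedy 1:1 nearest-neighbor matcher, assuming `treatment` and `propensity_scores` from the snippet above; the caliper value and the greedy strategy are illustrative choices, not a statsmodels API:
```python
import numpy as np

def nearest_neighbor_match(ps, treated_mask, caliper=0.05):
    """Greedy 1:1 matching on propensity scores within a caliper.

    Returns a list of (treated_index, control_index) pairs.
    """
    ps = np.asarray(ps, dtype=float)
    treated_idx = np.where(treated_mask)[0]
    available = set(np.where(~treated_mask)[0])
    pairs = []
    for i in treated_idx:
        if not available:
            break
        candidates = np.fromiter(available, dtype=int)
        # Closest control by propensity score distance
        j = candidates[np.argmin(np.abs(ps[candidates] - ps[i]))]
        if abs(ps[j] - ps[i]) <= caliper:
            pairs.append((i, j))
            available.remove(j)  # match each control at most once
    return pairs

pairs = nearest_neighbor_match(propensity_scores,
                               np.asarray(treatment).astype(bool))
print(f"Matched {len(pairs)} treated units")
```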
**Difference-in-differences**:
```python
# DiD specification: outcome ~ treatment * post
model = ols('outcome ~ treatment + post + treatment:post', data=df).fit()
# DiD estimate is the interaction coefficient
did_estimate = model.params['treatment:post']
print(f"DiD estimate: {did_estimate:.4f}")
```
## Best Practices
1. **Always check assumptions**: Test before interpreting results
2. **Report effect sizes**: Not just p-values
3. **Use appropriate tests**: Match test to data type and distribution
4. **Correct for multiple comparisons**: When conducting many tests
5. **Check sample size**: Ensure adequate power
6. **Visual inspection**: Plot data before testing
7. **Report confidence intervals**: Along with point estimates
8. **Consider alternatives**: Non-parametric when assumptions violated
9. **Robust standard errors**: Use when heteroskedasticity/autocorrelation present
10. **Document decisions**: Note which tests used and why
## Common Pitfalls
1. **Not checking test assumptions**: May invalidate results
2. **Multiple testing without correction**: Inflated Type I error
3. **Using parametric tests on non-normal data**: Consider non-parametric
4. **Ignoring heteroskedasticity**: Use robust SEs
5. **Confusing statistical and practical significance**: Check effect sizes
6. **Not reporting confidence intervals**: Only p-values insufficient
7. **Using wrong test**: Match test to research question
8. **Insufficient power**: Risk of Type II error (false negatives)
9. **p-hacking**: Testing many specifications until significant
10. **Overinterpreting p-values**: Remember limitations of NHST

View File

@@ -0,0 +1,716 @@
# Time Series Analysis Reference
This document provides comprehensive guidance on time series models in statsmodels, including ARIMA, state space models, VAR, exponential smoothing, and forecasting methods.
## Overview
Statsmodels offers extensive time series capabilities:
- **Univariate models**: AR, ARIMA, SARIMAX, Exponential Smoothing
- **Multivariate models**: VAR, VARMAX, Dynamic Factor Models
- **State space framework**: Custom models, Kalman filtering
- **Diagnostic tools**: ACF, PACF, stationarity tests, residual analysis
- **Forecasting**: Point forecasts and prediction intervals
## Univariate Time Series Models
### AutoReg (AR Model)
Autoregressive model: current value depends on past values.
**When to use:**
- Univariate time series
- Past values predict future
- Stationary series
**Model**: yₜ = c + φ₁yₜ₋₁ + φ₂yₜ₋₂ + ... + φₚyₜ₋ₚ + εₜ
```python
from statsmodels.tsa.ar_model import AutoReg
import pandas as pd
# Fit AR(p) model
model = AutoReg(y, lags=5) # AR(5)
results = model.fit()
print(results.summary())
```
**With exogenous regressors:**
```python
# AR with exogenous variables (ARX)
model = AutoReg(y, lags=5, exog=X_exog)
results = model.fit()
```
**Seasonal AR:**
```python
# Seasonal dummy terms (e.g., monthly data with yearly seasonality)
model = AutoReg(y, lags=12, seasonal=True, period=12)
results = model.fit()
```
### ARIMA (Autoregressive Integrated Moving Average)
Combines AR, differencing (I), and MA components.
**When to use:**
- Non-stationary time series (needs differencing)
- Past values and errors predict future
- Flexible model for many time series
**Model**: ARIMA(p,d,q)
- p: AR order (lags)
- d: differencing order (to achieve stationarity)
- q: MA order (lagged forecast errors)
```python
from statsmodels.tsa.arima.model import ARIMA
# Fit ARIMA(p,d,q)
model = ARIMA(y, order=(1, 1, 1)) # ARIMA(1,1,1)
results = model.fit()
print(results.summary())
```
**Choosing p, d, q:**
1. **Determine d (differencing order)**:
```python
from statsmodels.tsa.stattools import adfuller
# ADF test for stationarity
def check_stationarity(series):
result = adfuller(series)
print(f"ADF Statistic: {result[0]:.4f}")
print(f"p-value: {result[1]:.4f}")
if result[1] <= 0.05:
print("Series is stationary")
return True
else:
print("Series is non-stationary, needs differencing")
return False
# Test original series
if not check_stationarity(y):
# Difference once
y_diff = y.diff().dropna()
if not check_stationarity(y_diff):
# Difference again
y_diff2 = y_diff.diff().dropna()
check_stationarity(y_diff2)
```
2. **Determine p and q (ACF/PACF)**:
```python
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
# After differencing to stationarity
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
# ACF: helps determine q (MA order)
plot_acf(y_stationary, lags=40, ax=ax1)
ax1.set_title('Autocorrelation Function (ACF)')
# PACF: helps determine p (AR order)
plot_pacf(y_stationary, lags=40, ax=ax2)
ax2.set_title('Partial Autocorrelation Function (PACF)')
plt.tight_layout()
plt.show()
# Rules of thumb:
# - PACF cuts off at lag p → AR(p)
# - ACF cuts off at lag q → MA(q)
# - Both decay → ARMA(p,q)
```
3. **Model selection (AIC/BIC)**:
```python
# Grid search for best (p,q) given d
import numpy as np
best_aic = np.inf
best_order = None
for p in range(5):
for q in range(5):
try:
model = ARIMA(y, order=(p, d, q))
results = model.fit()
if results.aic < best_aic:
best_aic = results.aic
best_order = (p, d, q)
        except Exception:
            continue
print(f"Best order: {best_order} with AIC: {best_aic:.2f}")
```
### SARIMAX (Seasonal ARIMA with Exogenous Variables)
Extends ARIMA with seasonality and exogenous regressors.
**When to use:**
- Seasonal patterns (monthly, quarterly data)
- External variables influence series
- Most flexible univariate model
**Model**: SARIMAX(p,d,q)(P,D,Q,s)
- (p,d,q): Non-seasonal ARIMA
- (P,D,Q,s): Seasonal ARIMA with period s
```python
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Seasonal ARIMA for monthly data (s=12)
model = SARIMAX(y,
order=(1, 1, 1), # (p,d,q)
seasonal_order=(1, 1, 1, 12)) # (P,D,Q,s)
results = model.fit()
print(results.summary())
```
**With exogenous variables:**
```python
# SARIMAX with external predictors
model = SARIMAX(y,
exog=X_exog,
order=(1, 1, 1),
seasonal_order=(1, 1, 1, 12))
results = model.fit()
```
**Example: Monthly sales with trend and seasonality**
```python
# Typical for monthly data: (p,d,q)(P,D,Q,12)
# Start with (1,1,1)(1,1,1,12) or (0,1,1)(0,1,1,12)
model = SARIMAX(monthly_sales,
order=(0, 1, 1),
seasonal_order=(0, 1, 1, 12),
enforce_stationarity=False,
enforce_invertibility=False)
results = model.fit()
```
### Exponential Smoothing
Weighted averages of past observations with exponentially decreasing weights.
**When to use:**
- Simple, interpretable forecasts
- Trend and/or seasonality present
- No need for explicit model specification
**Types:**
- Simple Exponential Smoothing: no trend, no seasonality
- Holt's method: with trend
- Holt-Winters: with trend and seasonality
```python
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Simple exponential smoothing
model = ExponentialSmoothing(y, trend=None, seasonal=None)
results = model.fit()
# Holt's method (with trend)
model = ExponentialSmoothing(y, trend='add', seasonal=None)
results = model.fit()
# Holt-Winters (trend + seasonality)
model = ExponentialSmoothing(y,
trend='add', # 'add' or 'mul'
seasonal='add', # 'add' or 'mul'
seasonal_periods=12) # e.g., 12 for monthly
results = model.fit()
print(results.summary())
```
**Additive vs Multiplicative:**
```python
# Additive: constant seasonal variation
# yₜ = Level + Trend + Seasonal + Error
# Multiplicative: proportional seasonal variation
# yₜ = Level × Trend × Seasonal × Error
# Choose based on data:
# - Additive: seasonal variation constant over time
# - Multiplicative: seasonal variation increases with level
```
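One pragmatic way to choose, sketched here under the assumption that `y` is strictly positive (required for the multiplicative form), is to fit both and compare information criteria:
```python
from statsmodels.tsa.holtwinters import ExponentialSmoothing

fit_add = ExponentialSmoothing(y, trend='add', seasonal='add',
                               seasonal_periods=12).fit()
fit_mul = ExponentialSmoothing(y, trend='add', seasonal='mul',
                               seasonal_periods=12).fit()
# Lower AIC suggests the better-fitting seasonal specification
print(f"Additive AIC:       {fit_add.aic:.2f}")
print(f"Multiplicative AIC: {fit_mul.aic:.2f}")
```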
**Innovations state space (ETS):**
```python
from statsmodels.tsa.exponential_smoothing.ets import ETSModel
# More robust, state space formulation
model = ETSModel(y,
error='add', # 'add' or 'mul'
trend='add', # 'add', 'mul', or None
seasonal='add', # 'add', 'mul', or None
seasonal_periods=12)
results = model.fit()
```
## Multivariate Time Series
### VAR (Vector Autoregression)
System of equations where each variable depends on past values of all variables.
**When to use:**
- Multiple interrelated time series
- Bidirectional relationships
- Granger causality testing
**Model**: Each variable is AR on all variables:
- y₁ₜ = c₁ + φ₁₁y₁ₜ₋₁ + φ₁₂y₂ₜ₋₁ + ... + ε₁ₜ
- y₂ₜ = c₂ + φ₂₁y₁ₜ₋₁ + φ₂₂y₂ₜ₋₁ + ... + ε₂ₜ
```python
from statsmodels.tsa.api import VAR
import pandas as pd
# Data should be DataFrame with multiple columns
# Each column is a time series
df_multivariate = pd.DataFrame({'series1': y1, 'series2': y2, 'series3': y3})
# Fit VAR
model = VAR(df_multivariate)
# Select lag order using AIC/BIC
lag_order_results = model.select_order(maxlags=15)
print(lag_order_results.summary())
# Fit with optimal lags
results = model.fit(maxlags=5, ic='aic')
print(results.summary())
```
**Granger causality testing:**
```python
# Test if series1 Granger-causes series2
from statsmodels.tsa.stattools import grangercausalitytests
# Requires 2D array [series2, series1]
test_data = df_multivariate[['series2', 'series1']]
# Test up to max_lag
max_lag = 5
results = grangercausalitytests(test_data, max_lag)  # prints test results per lag
# P-values for each lag
for lag in range(1, max_lag + 1):
p_value = results[lag][0]['ssr_ftest'][1]
print(f"Lag {lag}: p-value = {p_value:.4f}")
```
**Impulse Response Functions (IRF):**
```python
# Trace effect of shock through system
irf = results.irf(10) # 10 periods ahead
# Plot IRFs
irf.plot(orth=True) # Orthogonalized (Cholesky decomposition)
plt.show()
# Cumulative effects
irf.plot_cum_effects(orth=True)
plt.show()
```
**Forecast Error Variance Decomposition:**
```python
# Contribution of each variable to forecast error variance
fevd = results.fevd(10) # 10 periods ahead
fevd.plot()
plt.show()
```
### VARMAX (VAR with Moving Average and Exogenous Variables)
Extends VAR with MA component and external regressors.
**When to use:**
- VAR inadequate (MA component needed)
- External variables affect system
- More flexible multivariate model
```python
from statsmodels.tsa.statespace.varmax import VARMAX
# VARMAX(p, q) with exogenous variables
model = VARMAX(df_multivariate,
order=(1, 1), # (p, q)
exog=X_exog)
results = model.fit()
print(results.summary())
```
## State Space Models
Flexible framework for custom time series models.
**When to use:**
- Custom model specification
- Unobserved components
- Kalman filtering/smoothing
- Missing data
```python
from statsmodels.tsa.statespace.mlemodel import MLEModel
# Extend MLEModel for custom state space models
# Example: Local level model (random walk + noise)
```
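As a concrete illustration, here is a minimal local level model (random walk plus noise) following the subclassing pattern from the statsmodels state space documentation; the parameter names and starting values are illustrative choices:
```python
import numpy as np
from statsmodels.tsa.statespace.mlemodel import MLEModel

class LocalLevel(MLEModel):
    """Local level: y_t = mu_t + eps_t,  mu_t = mu_{t-1} + eta_t."""

    def __init__(self, endog):
        super().__init__(endog, k_states=1, initialization='diffuse')
        self['design', 0, 0] = 1.0      # observation loads on the level
        self['transition', 0, 0] = 1.0  # level follows a random walk
        self['selection', 0, 0] = 1.0

    @property
    def param_names(self):
        return ['sigma2.measurement', 'sigma2.level']

    @property
    def start_params(self):
        return [np.nanvar(self.endog), 0.1 * np.nanvar(self.endog)]

    def transform_params(self, unconstrained):
        return unconstrained ** 2   # keep variances positive

    def untransform_params(self, constrained):
        return constrained ** 0.5

    def update(self, params, **kwargs):
        params = super().update(params, **kwargs)
        self['obs_cov', 0, 0] = params[0]    # measurement noise variance
        self['state_cov', 0, 0] = params[1]  # level innovation variance

results_ll = LocalLevel(y).fit(disp=False)
print(results_ll.summary())
```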
**Dynamic Factor Models:**
```python
from statsmodels.tsa.statespace.dynamic_factor import DynamicFactor
# Extract common factors from multiple time series
model = DynamicFactor(df_multivariate,
k_factors=2, # Number of factors
factor_order=2) # AR order of factors
results = model.fit()
# Estimated factors
factors = results.factors.filtered
```
## Forecasting
### Point Forecasts
```python
# ARIMA forecasting
model = ARIMA(y, order=(1, 1, 1))
results = model.fit()
# Forecast h steps ahead
h = 10
forecast = results.forecast(steps=h)
# With exogenous variables (SARIMAX)
model = SARIMAX(y, exog=X, order=(1, 1, 1))
results = model.fit()
# Need future exogenous values
forecast = results.forecast(steps=h, exog=X_future)
```
### Prediction Intervals
```python
# Get forecast with confidence intervals
forecast_obj = results.get_forecast(steps=h)
forecast_df = forecast_obj.summary_frame()
print(forecast_df)
# Contains: mean, mean_se, mean_ci_lower, mean_ci_upper
# Extract components
forecast_mean = forecast_df['mean']
forecast_ci_lower = forecast_df['mean_ci_lower']
forecast_ci_upper = forecast_df['mean_ci_upper']
# Plot
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 6))
plt.plot(y.index, y, label='Historical')
plt.plot(forecast_df.index, forecast_mean, label='Forecast', color='red')
plt.fill_between(forecast_df.index,
forecast_ci_lower,
forecast_ci_upper,
alpha=0.3, color='red', label='95% CI')
plt.legend()
plt.title('Forecast with Prediction Intervals')
plt.show()
```
### Dynamic vs Static Forecasts
```python
# Static (one-step-ahead, using actual values)
static_forecast = results.get_prediction(start=split_point, end=len(y)-1)
# Dynamic (multi-step, using predicted values)
dynamic_forecast = results.get_prediction(start=split_point,
end=len(y)-1,
dynamic=True)
# Plot comparison
fig, ax = plt.subplots(figsize=(12, 6))
y.plot(ax=ax, label='Actual')
static_forecast.predicted_mean.plot(ax=ax, label='Static forecast')
dynamic_forecast.predicted_mean.plot(ax=ax, label='Dynamic forecast')
ax.legend()
plt.show()
```
## Diagnostic Tests
### Stationarity Tests
```python
from statsmodels.tsa.stattools import adfuller, kpss
# Augmented Dickey-Fuller (ADF) test
# H0: unit root (non-stationary)
adf_result = adfuller(y, autolag='AIC')
print(f"ADF Statistic: {adf_result[0]:.4f}")
print(f"p-value: {adf_result[1]:.4f}")
if adf_result[1] <= 0.05:
print("Reject H0: Series is stationary")
else:
print("Fail to reject H0: Series is non-stationary")
# KPSS test
# H0: stationary (opposite of ADF)
kpss_result = kpss(y, regression='c', nlags='auto')
print(f"KPSS Statistic: {kpss_result[0]:.4f}")
print(f"p-value: {kpss_result[1]:.4f}")
if kpss_result[1] <= 0.05:
print("Reject H0: Series is non-stationary")
else:
print("Fail to reject H0: Series is stationary")
```
### Residual Diagnostics
```python
# Ljung-Box test for autocorrelation in residuals
from statsmodels.stats.diagnostic import acorr_ljungbox
lb_test = acorr_ljungbox(results.resid, lags=10, return_df=True)
print(lb_test)
# P-values > 0.05 indicate no significant autocorrelation (good)
# Plot residual diagnostics
results.plot_diagnostics(figsize=(12, 8))
plt.show()
# Components:
# 1. Standardized residuals over time
# 2. Histogram + KDE of residuals
# 3. Q-Q plot for normality
# 4. Correlogram (ACF of residuals)
```
### Heteroskedasticity Tests
```python
from statsmodels.stats.diagnostic import het_arch
# ARCH test for heteroskedasticity
arch_test = het_arch(results.resid, nlags=10)
print(f"ARCH test statistic: {arch_test[0]:.4f}")
print(f"p-value: {arch_test[1]:.4f}")
# If significant, consider GARCH model
```
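statsmodels does not ship a GARCH estimator; a common follow-up is the third-party `arch` package (assumed installed), e.g.:
```python
# pip install arch
from arch import arch_model

# GARCH(1,1) on the residuals; rescaling to percent often helps the optimizer
am = arch_model(100 * results.resid.dropna(), vol='GARCH', p=1, q=1)
garch_results = am.fit(disp='off')
print(garch_results.summary())
```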
## Seasonal Decomposition
```python
from statsmodels.tsa.seasonal import seasonal_decompose
# Decompose into trend, seasonal, residual
decomposition = seasonal_decompose(y,
model='additive', # or 'multiplicative'
period=12) # seasonal period
# Plot components
fig = decomposition.plot()
fig.set_size_inches(12, 8)
plt.show()
# Access components
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
# STL decomposition (more robust)
from statsmodels.tsa.seasonal import STL
stl = STL(y, seasonal=13) # seasonal must be odd
stl_result = stl.fit()
fig = stl_result.plot()
plt.show()
```
## Model Evaluation
### In-Sample Metrics
```python
# From results object
print(f"AIC: {results.aic:.2f}")
print(f"BIC: {results.bic:.2f}")
print(f"Log-likelihood: {results.llf:.2f}")
# MSE on training data
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y, results.fittedvalues)
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:.4f}")
# MAE
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y, results.fittedvalues)
print(f"MAE: {mae:.4f}")
```
### Out-of-Sample Evaluation
```python
# Train-test split for time series (no shuffle!)
train_size = int(0.8 * len(y))
y_train = y[:train_size]
y_test = y[train_size:]
# Fit on training data
model = ARIMA(y_train, order=(1, 1, 1))
results = model.fit()
# Forecast test period
forecast = results.forecast(steps=len(y_test))
# Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error
rmse = np.sqrt(mean_squared_error(y_test, forecast))
mae = mean_absolute_error(y_test, forecast)
mape = np.mean(np.abs((y_test - forecast) / y_test)) * 100
print(f"Test RMSE: {rmse:.4f}")
print(f"Test MAE: {mae:.4f}")
print(f"Test MAPE: {mape:.2f}%")
```
### Rolling Forecast
```python
# More realistic evaluation: rolling one-step-ahead forecasts
forecasts = []
for t in range(len(y_test)):
# Refit or update with new observation
y_current = y[:train_size + t]
model = ARIMA(y_current, order=(1, 1, 1))
fit = model.fit()
    # One-step-ahead forecast (use .iloc for positional access on the returned Series)
    fc = fit.forecast(steps=1).iloc[0]
forecasts.append(fc)
forecasts = np.array(forecasts)
rmse = np.sqrt(mean_squared_error(y_test, forecasts))
print(f"Rolling forecast RMSE: {rmse:.4f}")
```
### Cross-Validation
```python
# Time series cross-validation (expanding window)
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
rmse_scores = []
for train_idx, test_idx in tscv.split(y):
y_train_cv = y.iloc[train_idx]
y_test_cv = y.iloc[test_idx]
model = ARIMA(y_train_cv, order=(1, 1, 1))
results = model.fit()
forecast = results.forecast(steps=len(test_idx))
rmse = np.sqrt(mean_squared_error(y_test_cv, forecast))
rmse_scores.append(rmse)
print(f"CV RMSE: {np.mean(rmse_scores):.4f} ± {np.std(rmse_scores):.4f}")
```
## Advanced Topics
### ARDL (Autoregressive Distributed Lag)
Bridges univariate and multivariate time series.
```python
from statsmodels.tsa.ardl import ARDL
# ARDL(p, q) model
# y depends on its own lags and lags of X
model = ARDL(y, lags=2, exog=X, order=2)  # 'order' sets the lag order of exog
results = model.fit()
```
### Error Correction Models
For cointegrated series.
```python
from statsmodels.tsa.vector_ar.vecm import coint_johansen
# Test for cointegration: compare trace statistics (johansen_test.lr1)
# with critical values (johansen_test.cvt) to choose the cointegration rank
johansen_test = coint_johansen(df_multivariate, det_order=0, k_ar_diff=1)
# Fit VECM if cointegrated
from statsmodels.tsa.vector_ar.vecm import VECM
model = VECM(df_multivariate, k_ar_diff=1, coint_rank=1)
results = model.fit()
```
### Regime Switching Models
For structural breaks and regime changes.
```python
from statsmodels.tsa.regime_switching.markov_regression import MarkovRegression
# Markov switching model
model = MarkovRegression(y, k_regimes=2, order=1)
results = model.fit()
# Smoothed probabilities of regimes
regime_probs = results.smoothed_marginal_probabilities
```
## Best Practices
1. **Check stationarity**: Difference if needed, verify with ADF/KPSS tests
2. **Plot data**: Always visualize before modeling
3. **Identify seasonality**: Use appropriate seasonal models (SARIMAX, Holt-Winters)
4. **Model selection**: Use AIC/BIC and out-of-sample validation
5. **Residual diagnostics**: Check for autocorrelation, normality, heteroskedasticity
6. **Forecast evaluation**: Use rolling forecasts and proper time series CV
7. **Avoid overfitting**: Prefer simpler models, use information criteria
8. **Document assumptions**: Note any data transformations (log, differencing)
9. **Prediction intervals**: Always provide uncertainty estimates
10. **Refit regularly**: Update models as new data arrives
## Common Pitfalls
1. **Not checking stationarity**: Fit ARIMA on non-stationary data
2. **Data leakage**: Using future data in transformations
3. **Wrong seasonal period**: s=4 for quarterly, s=12 for monthly
4. **Overfitting**: Too many parameters relative to data
5. **Ignoring residual autocorrelation**: Model inadequate
6. **Using inappropriate metrics**: MAPE fails with zeros or negatives (a symmetric alternative is sketched after this list)
7. **Not handling missing data**: Affects model estimation
8. **Extrapolating exogenous variables**: Need future X values for SARIMAX
9. **Confusing static vs dynamic forecasts**: Dynamic more realistic for multi-step
10. **Not validating forecasts**: Always check out-of-sample performance
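For pitfall 6, a symmetric alternative is sMAPE; this is a hand-rolled sketch (denominator conventions vary across references), reusing `y_test` and `forecast` from the out-of-sample evaluation section:
```python
import numpy as np

def smape(actual, forecast):
    """Symmetric MAPE in percent; undefined where actual and forecast are both 0."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denom = (np.abs(actual) + np.abs(forecast)) / 2
    return np.mean(np.abs(actual - forecast) / denom) * 100

print(f"Test sMAPE: {smape(y_test, forecast):.2f}%")
```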