# Linear Regression Models Reference
This document provides detailed guidance on linear regression models in statsmodels, including OLS, GLS, WLS, quantile regression, and specialized variants.
## Core Model Classes
### OLS (Ordinary Least Squares)
Assumes independent, identically distributed errors (Σ=I). Best for standard regression with homoscedastic errors.
**When to use:**
- Standard regression analysis
- Errors are independent and have constant variance
- No autocorrelation or heteroscedasticity
- Most common starting point
**Basic usage:**
```python
import statsmodels.api as sm
import numpy as np
# Prepare data - ALWAYS add constant for intercept
X = sm.add_constant(X_data) # Adds column of 1s for intercept
# Fit model
model = sm.OLS(y, X)
results = model.fit()
# View results
print(results.summary())
```
**Key results attributes:**
```python
results.params # Coefficients
results.bse # Standard errors
results.tvalues # T-statistics
results.pvalues # P-values
results.rsquared # R-squared
results.rsquared_adj # Adjusted R-squared
results.fittedvalues # Fitted values (predictions on training data)
results.resid # Residuals
results.conf_int() # Confidence intervals for parameters
```
**Prediction with confidence/prediction intervals:**
```python
# For in-sample predictions
pred = results.get_prediction(X)
pred_summary = pred.summary_frame()
print(pred_summary) # Contains mean, std, confidence intervals
# For out-of-sample predictions
X_new = sm.add_constant(X_new_data)
pred_new = results.get_prediction(X_new)
pred_summary = pred_new.summary_frame()
# Access intervals
mean_ci_lower = pred_summary["mean_ci_lower"]
mean_ci_upper = pred_summary["mean_ci_upper"]
obs_ci_lower = pred_summary["obs_ci_lower"] # Prediction intervals
obs_ci_upper = pred_summary["obs_ci_upper"]
```
**Formula API (R-style):**
```python
import statsmodels.formula.api as smf
# Automatic handling of categorical variables and interactions
formula = 'y ~ x1 + x2 + C(category) + x1:x2'
results = smf.ols(formula, data=df).fit()
```
### WLS (Weighted Least Squares)
Handles heteroscedastic errors (diagonal Σ) where variance differs across observations.
**When to use:**
- Known heteroscedasticity (non-constant error variance)
- Different observations have different reliability
- Weights are known or can be estimated
**Usage:**
```python
# If you know the weights (inverse variance)
weights = 1 / error_variance
model = sm.WLS(y, X, weights=weights)
results = model.fit()
# Common weight patterns:
# - 1/variance: when variance is known
# - n_i: sample size for grouped data
# - 1/x: when variance proportional to x
```
**Feasible WLS (estimating weights):**
```python
# Step 1: Fit OLS
ols_results = sm.OLS(y, X).fit()
# Step 2: Model the log of the squared residuals to estimate the variance function
log_resid_sq = np.log(ols_results.resid ** 2)
variance_model = sm.OLS(log_resid_sq, X).fit()
# Step 3: Weight each observation by the inverse of its estimated variance
weights = 1 / np.exp(variance_model.fittedvalues)
wls_results = sm.WLS(y, X, weights=weights).fit()
```
### GLS (Generalized Least Squares)
Handles arbitrary covariance structure (Σ). Superclass for other regression methods.
**When to use:**
- Known covariance structure
- Correlated errors
- More general than WLS
**Usage:**
```python
# Specify covariance structure
# Sigma should be (n x n) covariance matrix
model = sm.GLS(y, X, sigma=Sigma)
results = model.fit()
```
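**Constructing a covariance matrix (sketch):** The structure of `Sigma` must come from knowledge of the error process. A minimal, illustrative sketch assuming AR(1) errors with an assumed correlation `rho = 0.5` (an illustrative value, not from any real dataset):
```python
import numpy as np
import statsmodels.api as sm
# Toy AR(1) covariance: Sigma[i, j] = rho ** |i - j|
rho = 0.5  # assumed, illustrative value
n = len(y)
Sigma = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
gls_results = sm.GLS(y, X, sigma=Sigma).fit()
print(gls_results.summary())
```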
### GLSAR (GLS with Autoregressive Errors)
Feasible generalized least squares with AR(p) errors for time series data.
**When to use:**
- Time series regression with autocorrelated errors
- Need to account for serial correlation
- Violations of error independence
**Usage:**
```python
# AR(1) errors
model = sm.GLSAR(y, X, rho=1) # rho=1 for AR(1), rho=2 for AR(2), etc.
results = model.iterative_fit() # Iteratively estimates AR parameters
print(results.summary())
print(f"Estimated rho: {results.model.rho}")
```
### RLS (Recursive Least Squares)
Sequential parameter estimation, useful for adaptive or online learning.
**When to use:**
- Parameters change over time
- Online/streaming data
- Want to see parameter evolution
**Usage:**
```python
from statsmodels.regression.recursive_ls import RecursiveLS
model = RecursiveLS(y, X)
results = model.fit()
# Access time-varying parameters
params_over_time = results.recursive_coefficients # Bunch; .filtered holds the coefficient paths
cusum = results.cusum # CUSUM statistic for structural breaks
```
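**Built-in diagnostic plots (sketch):** The `RecursiveLS` results object also exposes plotting helpers; a brief sketch, assuming `results` is the fitted object from above:
```python
import matplotlib.pyplot as plt
# Path of the first coefficient with confidence bands
results.plot_recursive_coefficient(variables=[0])
# CUSUM plot with 5% significance bounds for detecting structural breaks
results.plot_cusum(alpha=0.05)
plt.show()
```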
### Rolling Regressions
Compute estimates across moving windows for time-varying parameter detection.
**When to use:**
- Parameters vary over time
- Want to detect structural changes
- Time series with evolving relationships
**Usage:**
```python
from statsmodels.regression.rolling import RollingOLS, RollingWLS
# Rolling OLS with 60-period window
rolling_model = RollingOLS(y, X, window=60)
rolling_results = rolling_model.fit()
# Extract time-varying parameters
rolling_params = rolling_results.params # DataFrame with parameters over time
rolling_rsquared = rolling_results.rsquared
# Plot parameter evolution
import matplotlib.pyplot as plt
rolling_params.plot()
plt.title('Time-Varying Coefficients')
plt.show()
```
### Quantile Regression
Analyzes conditional quantiles rather than conditional mean.
**When to use:**
- Interest in quantiles (median, 90th percentile, etc.)
- Robust to outliers (median regression)
- Distributional effects across quantiles
- Heterogeneous effects
**Usage:**
```python
from statsmodels.regression.quantile_regression import QuantReg
# Median regression (50th percentile)
model = QuantReg(y, X)
results_median = model.fit(q=0.5)
# Multiple quantiles
quantiles = [0.1, 0.25, 0.5, 0.75, 0.9]
results_dict = {}
for q in quantiles:
    results_dict[q] = model.fit(q=q)
# Plot quantile-varying effects
import pandas as pd
import matplotlib.pyplot as plt
coef_dict = {q: res.params for q, res in results_dict.items()}
coef_df = pd.DataFrame(coef_dict).T
coef_df.plot()
plt.xlabel('Quantile')
plt.ylabel('Coefficient')
plt.show()
```
## Mixed Effects Models
For hierarchical/nested data with random effects.
**When to use:**
- Clustered/grouped data (students in schools, patients in hospitals)
- Repeated measures
- Need random effects to account for grouping
**Usage:**
```python
from statsmodels.regression.mixed_linear_model import MixedLM
# Random intercept model
model = MixedLM(y, X, groups=group_ids)
results = model.fit()
# Random intercept and slope
model = MixedLM(y, X, groups=group_ids, exog_re=X_random)
results = model.fit()
print(results.summary())
```
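**Formula interface (sketch):** For grouped data in a DataFrame, the formula API is often more convenient. A short sketch where the column names `y`, `x1`, `x2`, and `school` are illustrative placeholders:
```python
import statsmodels.formula.api as smf
# Random intercept per school plus a random slope on x1
model = smf.mixedlm("y ~ x1 + x2", data=df, groups=df["school"], re_formula="~x1")
results = model.fit()
print(results.summary())
```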
## Diagnostics and Model Assessment
### Residual Analysis
```python
# Basic residual plots
import matplotlib.pyplot as plt
# Residuals vs fitted
plt.scatter(results.fittedvalues, results.resid)
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.axhline(y=0, color='r', linestyle='--')
plt.title('Residuals vs Fitted')
plt.show()
# Q-Q plot for normality
from statsmodels.graphics.gofplots import qqplot
qqplot(results.resid, line='s')
plt.show()
# Histogram of residuals
plt.hist(results.resid, bins=30, edgecolor='black')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals')
plt.show()
```
### Specification Tests
```python
from statsmodels.stats.diagnostic import het_breuschpagan, het_white
from statsmodels.stats.stattools import durbin_watson, jarque_bera
# Heteroscedasticity tests
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(results.resid, X)
print(f"Breusch-Pagan test p-value: {lm_pval}")
# White test
white_test = het_white(results.resid, X)
print(f"White test p-value: {white_test[1]}")
# Autocorrelation
dw_stat = durbin_watson(results.resid)
print(f"Durbin-Watson statistic: {dw_stat}")
# DW ~ 2 indicates no autocorrelation
# DW < 2 suggests positive autocorrelation
# DW > 2 suggests negative autocorrelation
# Normality test
jb_stat, jb_pval, skew, kurtosis = jarque_bera(results.resid)
print(f"Jarque-Bera test p-value: {jb_pval}")
```
### Multicollinearity
```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd
# Calculate VIF for each variable (X is a DataFrame that includes the constant)
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)
# VIF > 10 indicates problematic multicollinearity
# VIF > 5 suggests moderate multicollinearity
# Condition number (from summary)
print(f"Condition number: {results.condition_number}")
# Condition number > 20 suggests multicollinearity
# Condition number > 30 indicates serious problems
```
### Influence Statistics
```python
from statsmodels.stats.outliers_influence import OLSInfluence
influence = results.get_influence()
# Leverage (hat values)
leverage = influence.hat_matrix_diag
# High leverage: > 2*p/n (p=predictors, n=observations)
# Cook's distance
cooks_d = influence.cooks_distance[0]
# Influential if Cook's D > 4/n
# DFFITS
dffits = influence.dffits[0]
# Influential if |DFFITS| > 2*sqrt(p/n)
# Create influence plot
from statsmodels.graphics.regressionplots import influence_plot
fig, ax = plt.subplots(figsize=(12, 8))
influence_plot(results, ax=ax)
plt.show()
```
### Hypothesis Testing
```python
# Test single coefficient
# H0: beta_i = 0 (automatically in summary)
# Test multiple restrictions using F-test
# Example: Test beta_1 = beta_2 = 0
R = [[0, 1, 0, 0], [0, 0, 1, 0]] # Restriction matrix (here: 4 parameters - const, x1, x2, x3)
f_test = results.f_test(R)
print(f_test)
# Formula-based hypothesis testing (requires named parameters, e.g. DataFrame exog or the formula API)
f_test = results.f_test("x1 = x2 = 0")
print(f_test)
# Test linear combination: beta_1 + beta_2 = 1
r_matrix = [[0, 1, 1, 0]]
q_matrix = [1] # RHS value
f_test = results.f_test((r_matrix, q_matrix))
print(f_test)
# Wald test (equivalent to F-test for linear restrictions)
wald_test = results.wald_test(R)
print(wald_test)
```
## Model Comparison
```python
# Compare nested OLS models with an F-test via anova_lm
# (for MLE-based models, use a likelihood ratio test instead; see sketch below)
from statsmodels.stats.anova import anova_lm
# Fit restricted and unrestricted models
model_restricted = sm.OLS(y, X_restricted).fit()
model_full = sm.OLS(y, X_full).fit()
# ANOVA table for model comparison
anova_results = anova_lm(model_restricted, model_full)
print(anova_results)
# AIC/BIC for non-nested model comparison (works for any two fitted results objects)
print(f"Restricted AIC: {model_restricted.aic}, BIC: {model_restricted.bic}")
print(f"Full AIC: {model_full.aic}, BIC: {model_full.bic}")
# Lower AIC/BIC indicates better model
```
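**Likelihood ratio test (sketch):** For the likelihood ratio comparison mentioned above, a minimal sketch using the models' log-likelihoods (asymptotically valid for nested models):
```python
from scipy import stats
# LR statistic: twice the gain in log-likelihood from the extra parameters
lr_stat = 2 * (model_full.llf - model_restricted.llf)
df_diff = model_full.df_model - model_restricted.df_model
lr_pvalue = stats.chi2.sf(lr_stat, df_diff)
print(f"LR statistic: {lr_stat:.3f}, p-value: {lr_pvalue:.4f}")
```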
## Robust Standard Errors
Handle heteroscedasticity or clustering without reweighting.
```python
# Heteroscedasticity-robust (HC) standard errors
results_hc = results.get_robustcov_results(cov_type='HC0') # White's
results_hc1 = results.get_robustcov_results(cov_type='HC1')
results_hc2 = results.get_robustcov_results(cov_type='HC2')
results_hc3 = results.get_robustcov_results(cov_type='HC3') # Most conservative
# Newey-West HAC (Heteroscedasticity and Autocorrelation Consistent)
results_hac = results.get_robustcov_results(cov_type='HAC', maxlags=4)
# Cluster-robust standard errors
results_cluster = results.get_robustcov_results(cov_type='cluster', groups=cluster_ids)
# View robust results
print(results_hc3.summary())
```
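**Robust errors at fit time (sketch):** The same covariance types can also be requested directly in `fit()`, so the summary is printed with robust standard errors from the start. A brief sketch:
```python
# Request the robust covariance when fitting instead of post-hoc
results_hc3 = sm.OLS(y, X).fit(cov_type='HC3')
results_hac = sm.OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': 4})
results_clu = sm.OLS(y, X).fit(cov_type='cluster', cov_kwds={'groups': cluster_ids})
print(results_hc3.summary())
```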
## Best Practices
1. **Always add constant**: Use `sm.add_constant()` unless you specifically want to exclude the intercept
2. **Check assumptions**: Run diagnostic tests (heteroscedasticity, autocorrelation, normality)
3. **Use formula API for categorical variables**: `smf.ols()` handles categorical variables automatically
4. **Robust standard errors**: Use when heteroscedasticity detected but model specification is correct
5. **Model selection**: Use AIC/BIC for non-nested models, F-test/likelihood ratio for nested models
6. **Outliers and influence**: Always check Cook's distance and leverage
7. **Multicollinearity**: Check VIF and condition number before interpretation
8. **Time series**: Use `GLSAR` or robust HAC standard errors for autocorrelated errors
9. **Grouped data**: Consider mixed effects models or cluster-robust standard errors
10. **Quantile regression**: Use for robust estimation or when interested in distributional effects
## Common Pitfalls
1. **Forgetting to add constant**: Results in no-intercept model
2. **Ignoring heteroscedasticity**: Use WLS or robust standard errors
3. **Using OLS with autocorrelated errors**: Use GLSAR or HAC standard errors
4. **Over-interpreting with multicollinearity**: Check VIF first
5. **Not checking residuals**: Always plot residuals vs fitted values
6. **Examining residuals in a transformed space (e.g., PCA/t-SNE)**: Compute and interpret residuals in the original variable space
7. **Confusing prediction vs confidence intervals**: Prediction intervals are wider (see the sketch after this list)
8. **Not handling categorical variables properly**: Use formula API or manual dummy coding
9. **Comparing models with different sample sizes**: Ensure same observations used
10. **Ignoring influential observations**: Check Cook's distance and DFFITS
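A small sketch of pitfall 7, comparing the two interval types in the `summary_frame()` output (`X`, `y`, and `results` as in the OLS example above):
```python
# mean_ci_* bounds the mean response at X; obs_ci_* bounds a new observation
frame = results.get_prediction(X).summary_frame()
print(frame[["mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]].head())
ci_width = frame["mean_ci_upper"] - frame["mean_ci_lower"]
pi_width = frame["obs_ci_upper"] - frame["obs_ci_lower"]
print((pi_width >= ci_width).all())  # prediction intervals are at least as wide
```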