# Linear Regression Models Reference

This document provides detailed guidance on linear regression models in statsmodels, including OLS, GLS, WLS, quantile regression, and specialized variants.

## Core Model Classes
### OLS (Ordinary Least Squares)

Assumes independent, identically distributed errors (Σ = I). Best for standard regression with homoscedastic errors.

**When to use:**

- Standard regression analysis
- Errors are independent and have constant variance
- No autocorrelation or heteroscedasticity
- Most common starting point

**Basic usage:**

```python
import statsmodels.api as sm
import numpy as np

# Prepare data - ALWAYS add constant for intercept
X = sm.add_constant(X_data)  # Adds column of 1s for intercept

# Fit model
model = sm.OLS(y, X)
results = model.fit()

# View results
print(results.summary())
```

**Key results attributes:**

```python
results.params        # Coefficients
results.bse           # Standard errors
results.tvalues       # T-statistics
results.pvalues       # P-values
results.rsquared      # R-squared
results.rsquared_adj  # Adjusted R-squared
results.fittedvalues  # Fitted values (predictions on training data)
results.resid         # Residuals
results.conf_int()    # Confidence intervals for parameters
```

**Prediction with confidence/prediction intervals:**

```python
# For in-sample predictions
pred = results.get_prediction(X)
pred_summary = pred.summary_frame()
print(pred_summary)  # Contains mean, std, confidence intervals

# For out-of-sample predictions
X_new = sm.add_constant(X_new_data)
pred_new = results.get_prediction(X_new)
pred_summary = pred_new.summary_frame()

# Access intervals
mean_ci_lower = pred_summary["mean_ci_lower"]
mean_ci_upper = pred_summary["mean_ci_upper"]
obs_ci_lower = pred_summary["obs_ci_lower"]  # Prediction intervals
obs_ci_upper = pred_summary["obs_ci_upper"]
```

**Formula API (R-style):**

```python
import statsmodels.formula.api as smf

# Automatic handling of categorical variables and interactions
formula = 'y ~ x1 + x2 + C(category) + x1:x2'
results = smf.ols(formula, data=df).fit()
```
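The formula interface uses patsy, so categorical coding can also be controlled inline. A short sketch (the DataFrame `df`, its `category` column, and the `"control"` reference level are illustrative assumptions, not from the original text):

```python
import statsmodels.formula.api as smf

# Treatment coding with an explicit reference level; `*` expands to
# main effects plus the interaction (x1 + x2 + x1:x2)
formula = 'y ~ x1 * x2 + C(category, Treatment(reference="control"))'
results = smf.ols(formula, data=df).fit()
print(results.summary())
```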
### WLS (Weighted Least Squares)

Handles heteroscedastic errors (diagonal Σ) where the error variance differs across observations.

**When to use:**

- Known heteroscedasticity (non-constant error variance)
- Different observations have different reliability
- Weights are known or can be estimated

**Usage:**

```python
# If you know the weights (inverse variance)
weights = 1 / error_variance
model = sm.WLS(y, X, weights=weights)
results = model.fit()

# Common weight patterns:
# - 1/variance: when the variance is known
# - n_i: sample size for grouped data
# - 1/x: when the variance is proportional to x
```

**Feasible WLS (estimating weights):**

```python
# Step 1: Fit OLS
ols_results = sm.OLS(y, X).fit()

# Step 2: Model the log of the squared residuals to estimate the variance function
log_resid_sq = np.log(ols_results.resid ** 2)
variance_model = sm.OLS(log_resid_sq, X).fit()

# Step 3: Use the inverse of the estimated variance as weights
weights = 1 / np.exp(variance_model.fittedvalues)
wls_results = sm.WLS(y, X, weights=weights).fit()
```
### GLS (Generalized Least Squares)

Handles an arbitrary error covariance structure (Σ). Superclass for the other regression classes.

**When to use:**

- Known covariance structure
- Correlated errors
- More general than WLS

**Usage:**

```python
# Specify covariance structure
# Sigma should be an (n x n) covariance matrix
model = sm.GLS(y, X, sigma=Sigma)
results = model.fit()
```
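A common special case is an AR(1) error structure, where entry (i, j) of Sigma is rho raised to the power |i - j|. A minimal sketch of building such a Sigma (the value `rho_hat = 0.6` is purely illustrative; in practice it might be estimated from OLS residuals):

```python
import numpy as np
from scipy.linalg import toeplitz

rho_hat = 0.6                   # assumed AR(1) coefficient, for illustration only
n = len(y)
lags = toeplitz(np.arange(n))   # matrix whose (i, j) entry is |i - j|
Sigma = rho_hat ** lags         # Sigma[i, j] = rho_hat ** |i - j|

gls_results = sm.GLS(y, X, sigma=Sigma).fit()
```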
### GLSAR (GLS with Autoregressive Errors)

Feasible generalized least squares with AR(p) errors for time series data.

**When to use:**

- Time series regression with autocorrelated errors
- Need to account for serial correlation
- Violations of error independence

**Usage:**

```python
# AR(1) errors
model = sm.GLSAR(y, X, rho=1)    # rho=1 for AR(1), rho=2 for AR(2), etc.
results = model.iterative_fit()  # Iteratively estimates AR parameters

print(results.summary())
print(f"Estimated rho: {results.model.rho}")
```
### RLS (Recursive Least Squares)

Sequential parameter estimation, useful for adaptive or online learning.

**When to use:**

- Parameters change over time
- Online/streaming data
- Want to see parameter evolution

**Usage:**

```python
from statsmodels.regression.recursive_ls import RecursiveLS

model = RecursiveLS(y, X)
results = model.fit()

# Access time-varying parameters
params_over_time = results.recursive_coefficients
cusum = results.cusum  # CUSUM statistic for structural breaks
```
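The results object also exposes plotting helpers (matplotlib required) that are usually more convenient than working with the raw arrays:

```python
# Path of the first coefficient over time, with confidence bands
fig = results.plot_recursive_coefficient(variables=[0])

# CUSUM plot with significance bounds, for spotting structural breaks
fig = results.plot_cusum()
```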
### Rolling Regressions

Compute estimates across moving windows for time-varying parameter detection.

**When to use:**

- Parameters vary over time
- Want to detect structural changes
- Time series with evolving relationships

**Usage:**

```python
from statsmodels.regression.rolling import RollingOLS, RollingWLS

# Rolling OLS with 60-period window
rolling_model = RollingOLS(y, X, window=60)
rolling_results = rolling_model.fit()

# Extract time-varying parameters
rolling_params = rolling_results.params  # DataFrame with parameters over time
rolling_rsquared = rolling_results.rsquared

# Plot parameter evolution
import matplotlib.pyplot as plt

rolling_params.plot()
plt.title('Time-Varying Coefficients')
plt.show()
```
### Quantile Regression

Analyzes conditional quantiles rather than the conditional mean.

**When to use:**

- Interest in quantiles (median, 90th percentile, etc.)
- Robust to outliers (median regression)
- Distributional effects across quantiles
- Heterogeneous effects

**Usage:**

```python
import pandas as pd
from statsmodels.regression.quantile_regression import QuantReg

# Median regression (50th percentile)
model = QuantReg(y, X)
results_median = model.fit(q=0.5)

# Multiple quantiles
quantiles = [0.1, 0.25, 0.5, 0.75, 0.9]
results_dict = {}
for q in quantiles:
    results_dict[q] = model.fit(q=q)

# Plot quantile-varying effects
import matplotlib.pyplot as plt

coef_dict = {q: res.params for q, res in results_dict.items()}
coef_df = pd.DataFrame(coef_dict).T
coef_df.plot()
plt.xlabel('Quantile')
plt.ylabel('Coefficient')
plt.show()
```
## Mixed Effects Models

For hierarchical/nested data with random effects.

**When to use:**

- Clustered/grouped data (students in schools, patients in hospitals)
- Repeated measures
- Need random effects to account for grouping

**Usage:**

```python
from statsmodels.regression.mixed_linear_model import MixedLM

# Random intercept model
model = MixedLM(y, X, groups=group_ids)
results = model.fit()

# Random intercept and slope
model = MixedLM(y, X, groups=group_ids, exog_re=X_random)
results = model.fit()

print(results.summary())
```
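The formula interface is often easier for mixed models. A short sketch (the DataFrame `df` and its columns `y`, `x`, and `group` are illustrative names):

```python
import statsmodels.formula.api as smf

# Random intercept per group
model = smf.mixedlm("y ~ x", data=df, groups=df["group"])
results = model.fit()

# Random intercept plus a random slope for x
model = smf.mixedlm("y ~ x", data=df, groups=df["group"], re_formula="~x")
results = model.fit()
```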
## Diagnostics and Model Assessment

### Residual Analysis

```python
# Basic residual plots
import matplotlib.pyplot as plt

# Residuals vs fitted
plt.scatter(results.fittedvalues, results.resid)
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.axhline(y=0, color='r', linestyle='--')
plt.title('Residuals vs Fitted')
plt.show()

# Q-Q plot for normality
from statsmodels.graphics.gofplots import qqplot

qqplot(results.resid, line='s')
plt.show()

# Histogram of residuals
plt.hist(results.resid, bins=30, edgecolor='black')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals')
plt.show()
```
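statsmodels also ships higher-level regression plot helpers that bundle several of these diagnostics. A sketch (here `'x1'` stands in for one of the exogenous column names, which depends on your data):

```python
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Fitted values, residuals, partial regression and CCPR plots for one regressor
fig = sm.graphics.plot_regress_exog(results, 'x1', fig=plt.figure(figsize=(12, 8)))
plt.show()

# Partial regression (added-variable) plots for all regressors
fig = sm.graphics.plot_partregress_grid(results)
plt.show()
```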
### Specification Tests

```python
from statsmodels.stats.diagnostic import het_breuschpagan, het_white
from statsmodels.stats.stattools import durbin_watson, jarque_bera

# Heteroscedasticity tests
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(results.resid, X)
print(f"Breusch-Pagan test p-value: {lm_pval}")

# White test
white_test = het_white(results.resid, X)
print(f"White test p-value: {white_test[1]}")

# Autocorrelation
dw_stat = durbin_watson(results.resid)
print(f"Durbin-Watson statistic: {dw_stat}")
# DW ~ 2 indicates no autocorrelation
# DW < 2 suggests positive autocorrelation
# DW > 2 suggests negative autocorrelation

# Normality test
jb_stat, jb_pval, skew, kurtosis = jarque_bera(results.resid)
print(f"Jarque-Bera test p-value: {jb_pval}")
```
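For serial correlation at lags beyond one, the Breusch-Godfrey and Ljung-Box tests complement Durbin-Watson. A sketch (the choice of 4 lags is illustrative):

```python
from statsmodels.stats.diagnostic import acorr_breusch_godfrey, acorr_ljungbox

# Breusch-Godfrey LM test for serial correlation up to 4 lags
bg_lm, bg_lm_pval, bg_f, bg_f_pval = acorr_breusch_godfrey(results, nlags=4)
print(f"Breusch-Godfrey p-value: {bg_lm_pval}")

# Ljung-Box test on the residuals (returns a table of statistics per lag)
print(acorr_ljungbox(results.resid, lags=[4]))
```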
### Multicollinearity

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate VIF for each variable (assumes X is a DataFrame)
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif_data)
# VIF > 10 indicates problematic multicollinearity
# VIF > 5 suggests moderate multicollinearity

# Condition number (also reported in the summary)
print(f"Condition number: {results.condition_number}")
# Condition number > 30 suggests multicollinearity
# Condition number > 100 indicates serious problems
```
### Influence Statistics

```python
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import OLSInfluence

# results.get_influence() returns an OLSInfluence instance
influence = results.get_influence()

# Leverage (hat values)
leverage = influence.hat_matrix_diag
# High leverage: > 2*p/n (p = number of predictors, n = number of observations)

# Cook's distance
cooks_d = influence.cooks_distance[0]
# Influential if Cook's D > 4/n

# DFFITS
dffits = influence.dffits[0]
# Influential if |DFFITS| > 2*sqrt(p/n)

# Create influence plot
from statsmodels.graphics.regressionplots import influence_plot

fig, ax = plt.subplots(figsize=(12, 8))
influence_plot(results, ax=ax)
plt.show()
```
### Hypothesis Testing

```python
# Test single coefficient
# H0: beta_i = 0 (reported automatically in the summary)

# Test multiple restrictions using an F-test
# Example: Test beta_1 = beta_2 = 0
R = [[0, 1, 0, 0], [0, 0, 1, 0]]  # Restriction matrix
f_test = results.f_test(R)
print(f_test)

# Formula-based hypothesis testing
f_test = results.f_test("x1 = x2 = 0")
print(f_test)

# Test linear combination: beta_1 + beta_2 = 1
r_matrix = [[0, 1, 1, 0]]
q_matrix = [1]  # RHS value
f_test = results.f_test((r_matrix, q_matrix))
print(f_test)

# Wald test (equivalent to the F-test for linear restrictions)
wald_test = results.wald_test(R)
print(wald_test)
```
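Single linear restrictions can also be tested with `t_test`, which additionally reports the estimate and standard error of the restriction. The coefficient names below assume a formula/DataFrame fit where the regressors are named `x1` and `x2`:

```python
# Same as the t-test shown in the summary table
print(results.t_test("x1 = 0"))

# Test equality of two coefficients
print(results.t_test("x1 = x2"))
```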
## Model Comparison

```python
# Compare nested models with an F-test (ANOVA table)
from statsmodels.stats.anova import anova_lm

# Fit restricted and unrestricted models
model_restricted = sm.OLS(y, X_restricted).fit()
model_full = sm.OLS(y, X_full).fit()

# ANOVA table for model comparison
anova_results = anova_lm(model_restricted, model_full)
print(anova_results)

# AIC/BIC for non-nested model comparison
print(f"Restricted model AIC: {model_restricted.aic}, BIC: {model_restricted.bic}")
print(f"Full model AIC: {model_full.aic}, BIC: {model_full.bic}")
# Lower AIC/BIC indicates a better model
```
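The results object also has convenience methods for nested comparisons, which avoid building the ANOVA table by hand:

```python
# F-test of the full model against the restricted model
f_stat, p_value, df_diff = model_full.compare_f_test(model_restricted)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")

# Likelihood-ratio version (based on the Gaussian log-likelihood)
lr_stat, lr_pval, lr_df = model_full.compare_lr_test(model_restricted)
print(f"LR = {lr_stat:.3f}, p = {lr_pval:.4f}")
```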
## Robust Standard Errors

Handle heteroscedasticity or clustering without reweighting the observations.

```python
# Heteroscedasticity-robust (HC) standard errors
results_hc = results.get_robustcov_results(cov_type='HC0')   # White's
results_hc1 = results.get_robustcov_results(cov_type='HC1')
results_hc2 = results.get_robustcov_results(cov_type='HC2')
results_hc3 = results.get_robustcov_results(cov_type='HC3')  # Most conservative

# Newey-West HAC (Heteroscedasticity and Autocorrelation Consistent)
results_hac = results.get_robustcov_results(cov_type='HAC', maxlags=4)

# Cluster-robust standard errors
results_cluster = results.get_robustcov_results(cov_type='cluster', groups=cluster_ids)

# View robust results
print(results_hc3.summary())
```
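Equivalently, the robust covariance can be requested at fit time via `cov_type` and `cov_kwds`:

```python
results_hc3 = sm.OLS(y, X).fit(cov_type='HC3')
results_hac = sm.OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': 4})
results_cluster = sm.OLS(y, X).fit(cov_type='cluster', cov_kwds={'groups': cluster_ids})
```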
## Best Practices

1. **Always add constant**: Use `sm.add_constant()` unless you specifically want to exclude the intercept
2. **Check assumptions**: Run diagnostic tests (heteroscedasticity, autocorrelation, normality); a minimal workflow sketch follows this list
3. **Use formula API for categorical variables**: `smf.ols()` handles categorical variables automatically
4. **Robust standard errors**: Use when heteroscedasticity is detected but the model specification is correct
5. **Model selection**: Use AIC/BIC for non-nested models, F-test/likelihood ratio for nested models
6. **Outliers and influence**: Always check Cook's distance and leverage
7. **Multicollinearity**: Check VIF and condition number before interpretation
8. **Time series**: Use `GLSAR` or robust HAC standard errors for autocorrelated errors
9. **Grouped data**: Consider mixed effects models or cluster-robust standard errors
10. **Quantile regression**: Use for robust estimation or when interested in distributional effects
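
Putting several of these practices together, a minimal sketch of a typical OLS workflow (the names `X_data` and `y` and the 0.05 cutoff are illustrative):

```python
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

X = sm.add_constant(X_data)   # 1. always add the intercept
results = sm.OLS(y, X).fit()  # baseline fit

# 2. check assumptions (here: heteroscedasticity)
_, bp_pval, _, _ = het_breuschpagan(results.resid, X)
if bp_pval < 0.05:
    # 4. refit with heteroscedasticity-robust standard errors
    results = sm.OLS(y, X).fit(cov_type='HC3')

print(results.summary())
```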
## Common Pitfalls

1. **Forgetting to add constant**: Results in a no-intercept model
2. **Ignoring heteroscedasticity**: Use WLS or robust standard errors
3. **Using OLS with autocorrelated errors**: Use GLSAR or HAC standard errors
4. **Over-interpreting with multicollinearity**: Check VIF first
5. **Not checking residuals**: Always plot residuals vs fitted values
6. **Computing residuals in a transformed space (e.g. PCA/t-SNE)**: Residuals should be examined in the original data space
7. **Confusing prediction vs confidence intervals**: Prediction intervals are wider
8. **Not handling categorical variables properly**: Use formula API or manual dummy coding
9. **Comparing models with different sample sizes**: Ensure the same observations are used
10. **Ignoring influential observations**: Check Cook's distance and DFFITS