Linear Regression Models Reference
This document provides detailed guidance on linear regression models in statsmodels, including OLS, GLS, WLS, quantile regression, and specialized variants.
Core Model Classes
OLS (Ordinary Least Squares)
Assumes independent, identically distributed errors (Σ=I). Best for standard regression with homoscedastic errors.
When to use:
- Standard regression analysis
- Errors are independent and have constant variance
- No autocorrelation or heteroscedasticity
- Most common starting point
Basic usage:
import statsmodels.api as sm
import numpy as np
# Prepare data - ALWAYS add constant for intercept
X = sm.add_constant(X_data) # Adds column of 1s for intercept
# Fit model
model = sm.OLS(y, X)
results = model.fit()
# View results
print(results.summary())
Key results attributes:
results.params # Coefficients
results.bse # Standard errors
results.tvalues # T-statistics
results.pvalues # P-values
results.rsquared # R-squared
results.rsquared_adj # Adjusted R-squared
results.fittedvalues # Fitted values (predictions on training data)
results.resid # Residuals
results.conf_int() # Confidence intervals for parameters
Prediction with confidence/prediction intervals:
# For in-sample predictions
pred = results.get_prediction(X)
pred_summary = pred.summary_frame()
print(pred_summary) # Contains mean, std, confidence intervals
# For out-of-sample predictions
X_new = sm.add_constant(X_new_data)
pred_new = results.get_prediction(X_new)
pred_summary = pred_new.summary_frame()
# Access intervals
mean_ci_lower = pred_summary["mean_ci_lower"]
mean_ci_upper = pred_summary["mean_ci_upper"]
obs_ci_lower = pred_summary["obs_ci_lower"] # Prediction intervals
obs_ci_upper = pred_summary["obs_ci_upper"]
Formula API (R-style):
import statsmodels.formula.api as smf
# Automatic handling of categorical variables and interactions
formula = 'y ~ x1 + x2 + C(category) + x1:x2'
results = smf.ols(formula, data=df).fit()
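A minimal end-to-end sketch of the formula interface (the DataFrame and its column names y, x1, x2, category are hypothetical placeholders):
import pandas as pd
import statsmodels.formula.api as smf
# Hypothetical data: numeric response, two numeric predictors, one categorical
df = pd.DataFrame({
    "y": [2.1, 3.4, 3.9, 5.0, 6.2, 6.8, 8.1, 9.0],
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [0.5, 0.1, 0.8, 0.3, 0.9, 0.4, 0.7, 0.2],
    "category": ["a", "b", "a", "b", "a", "b", "a", "b"],
})
results = smf.ols("y ~ x1 + x2 + C(category) + x1:x2", data=df).fit()
print(results.params)  # includes a dummy-coded C(category)[T.b] term
# Prediction takes a DataFrame with the same columns; no add_constant needed
print(results.predict(df.head(2)))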
WLS (Weighted Least Squares)
Handles heteroscedastic errors (diagonal Σ) where variance differs across observations.
When to use:
- Known heteroscedasticity (non-constant error variance)
- Different observations have different reliability
- Weights are known or can be estimated
Usage:
# If you know the weights (inverse variance)
weights = 1 / error_variance
model = sm.WLS(y, X, weights=weights)
results = model.fit()
# Common weight patterns:
# - 1/variance: when variance is known
# - n_i: sample size for grouped data
# - 1/x: when variance proportional to x
Feasible WLS (estimating weights):
# Step 1: Fit OLS
ols_results = sm.OLS(y, X).fit()
# Step 2: Model the log of squared residuals to estimate the variance function
log_resid_sq = np.log(ols_results.resid ** 2)
variance_model = sm.OLS(log_resid_sq, X).fit()
# Step 3: Weight by the inverse of the estimated variance
weights = 1 / np.exp(variance_model.fittedvalues)
wls_results = sm.WLS(y, X, weights=weights).fit()
GLS (Generalized Least Squares)
Handles an arbitrary error covariance structure (Σ); in statsmodels it is the superclass of the other regression classes (OLS, WLS, GLSAR).
When to use:
- Known covariance structure
- Correlated errors
- More general than WLS
Usage:
# Specify covariance structure
# Sigma should be (n x n) covariance matrix
model = sm.GLS(y, X, sigma=Sigma)
results = model.fit()
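A common special case is an AR(1) error structure with Corr(ε_i, ε_j) = ρ^|i−j|; the sketch below builds such a Σ, assuming ρ is known (or estimated from OLS residuals beforehand):
import numpy as np
from scipy.linalg import toeplitz
rho = 0.6  # assumed AR(1) correlation of the errors
n = len(y)
Sigma = rho ** toeplitz(np.arange(n))  # entry (i, j) equals rho**|i - j|
gls_results = sm.GLS(y, X, sigma=Sigma).fit()
print(gls_results.summary())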
GLSAR (GLS with Autoregressive Errors)
Feasible generalized least squares with AR(p) errors for time series data.
When to use:
- Time series regression with autocorrelated errors
- Need to account for serial correlation
- Violations of error independence
Usage:
# AR(1) errors
model = sm.GLSAR(y, X, rho=1) # rho sets the AR order: 1 for AR(1), 2 for AR(2), etc.
results = model.iterative_fit() # Iteratively estimates AR parameters
print(results.summary())
print(f"Estimated rho: {results.model.rho}")
RLS (Recursive Least Squares)
Sequential parameter estimation, useful for adaptive or online learning.
When to use:
- Parameters change over time
- Online/streaming data
- Want to see parameter evolution
Usage:
from statsmodels.regression.recursive_ls import RecursiveLS
model = RecursiveLS(y, X)
results = model.fit()
# Access time-varying parameters
params_over_time = results.recursive_coefficients.filtered # coefficient paths, one row per parameter
cusum = results.cusum # CUSUM statistic for structural breaks
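The results object also provides plotting helpers for these quantities; a brief sketch (method names as in recent statsmodels versions):
import matplotlib.pyplot as plt
# Coefficient paths with confidence bands, one panel per regressor
results.plot_recursive_coefficient(range(model.k_exog), alpha=0.05)
plt.show()
# CUSUM plot with significance bounds; excursions outside the bounds suggest structural breaks
results.plot_cusum()
plt.show()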
Rolling Regressions
Compute estimates across moving windows for time-varying parameter detection.
When to use:
- Parameters vary over time
- Want to detect structural changes
- Time series with evolving relationships
Usage:
from statsmodels.regression.rolling import RollingOLS, RollingWLS
# Rolling OLS with 60-period window
rolling_model = RollingOLS(y, X, window=60)
rolling_results = rolling_model.fit()
# Extract time-varying parameters
rolling_params = rolling_results.params # DataFrame with parameters over time
rolling_rsquared = rolling_results.rsquared
# Plot parameter evolution
import matplotlib.pyplot as plt
rolling_params.plot()
plt.title('Time-Varying Coefficients')
plt.show()
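RollingOLS also supports an expanding start-up scheme so early windows are not all missing; a sketch using the min_nobs and expanding options of the constructor:
# Expanding window once 30 observations are available, then a rolling 60-period window
expanding_model = RollingOLS(y, X, window=60, min_nobs=30, expanding=True)
expanding_results = expanding_model.fit()
print(expanding_results.params)  # rows before min_nobs are NaN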
Quantile Regression
Analyzes conditional quantiles rather than the conditional mean.
When to use:
- Interest in quantiles (median, 90th percentile, etc.)
- Robust to outliers (median regression)
- Distributional effects across quantiles
- Heterogeneous effects
Usage:
from statsmodels.regression.quantile_regression import QuantReg
# Median regression (50th percentile)
model = QuantReg(y, X)
results_median = model.fit(q=0.5)
# Multiple quantiles
quantiles = [0.1, 0.25, 0.5, 0.75, 0.9]
results_dict = {}
for q in quantiles:
    results_dict[q] = model.fit(q=q)
# Plot quantile-varying effects
import matplotlib.pyplot as plt
import pandas as pd
coef_dict = {q: res.params for q, res in results_dict.items()}
coef_df = pd.DataFrame(coef_dict).T
coef_df.plot()
plt.xlabel('Quantile')
plt.ylabel('Coefficient')
plt.show()
Mixed Effects Models
For hierarchical/nested data with random effects.
When to use:
- Clustered/grouped data (students in schools, patients in hospitals)
- Repeated measures
- Need random effects to account for grouping
Usage:
from statsmodels.regression.mixed_linear_model import MixedLM
# Random intercept model
model = MixedLM(y, X, groups=group_ids)
results = model.fit()
# Random intercept and slope
model = MixedLM(y, X, groups=group_ids, exog_re=X_random)
results = model.fit()
print(results.summary())
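The formula interface is often more convenient for grouped data; a hedged sketch assuming a long-format DataFrame df with columns y, x1, and group:
import statsmodels.formula.api as smf
# Random intercept per group
mixed_results = smf.mixedlm("y ~ x1", data=df, groups=df["group"]).fit()
print(mixed_results.summary())
# Random intercept and slope for x1 (re_formula defines the random-effects design)
mixed_results = smf.mixedlm("y ~ x1", data=df, groups=df["group"], re_formula="~x1").fit()
print(mixed_results.random_effects)  # per-group random-effect estimates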
Diagnostics and Model Assessment
Residual Analysis
# Basic residual plots
import matplotlib.pyplot as plt
# Residuals vs fitted
plt.scatter(results.fittedvalues, results.resid)
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.axhline(y=0, color='r', linestyle='--')
plt.title('Residuals vs Fitted')
plt.show()
# Q-Q plot for normality
from statsmodels.graphics.gofplots import qqplot
qqplot(results.resid, line='s')
plt.show()
# Histogram of residuals
plt.hist(results.resid, bins=30, edgecolor='black')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Distribution of Residuals')
plt.show()
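statsmodels also bundles several regression diagnostics into a single figure; for example (assuming the model was fit with a named regressor "x1", e.g. a DataFrame exog or the formula API):
# 2x2 panel: fitted values, residuals vs regressor, partial regression, CCPR
fig = sm.graphics.plot_regress_exog(results, "x1")
plt.show()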
Specification Tests
from statsmodels.stats.diagnostic import het_breuschpagan, het_white
from statsmodels.stats.stattools import durbin_watson, jarque_bera
# Heteroscedasticity tests
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(results.resid, X)
print(f"Breusch-Pagan test p-value: {lm_pval}")
# White test
white_test = het_white(results.resid, X)
print(f"White test p-value: {white_test[1]}")
# Autocorrelation
dw_stat = durbin_watson(results.resid)
print(f"Durbin-Watson statistic: {dw_stat}")
# DW ~ 2 indicates no autocorrelation
# DW < 2 suggests positive autocorrelation
# DW > 2 suggests negative autocorrelation
# Normality test
jb_stat, jb_pval, skew, kurtosis = jarque_bera(results.resid)
print(f"Jarque-Bera test p-value: {jb_pval}")
Multicollinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd
# Calculate VIF for each variable (assumes X is a pandas DataFrame)
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)
# VIF > 10 indicates problematic multicollinearity
# VIF > 5 suggests moderate multicollinearity
# Condition number (from summary)
print(f"Condition number: {results.condition_number}")
# Condition number > 20 suggests multicollinearity
# Condition number > 30 indicates serious problems
Influence Statistics
from statsmodels.stats.outliers_influence import OLSInfluence
influence = results.get_influence()  # returns an OLSInfluence instance
# Leverage (hat values)
leverage = influence.hat_matrix_diag
# High leverage: > 2*p/n (p=predictors, n=observations)
# Cook's distance
cooks_d = influence.cooks_distance[0]
# Influential if Cook's D > 4/n
# DFFITS
dffits = influence.dffits[0]
# Influential if |DFFITS| > 2*sqrt(p/n)
# Create influence plot
from statsmodels.graphics.regressionplots import influence_plot
fig, ax = plt.subplots(figsize=(12, 8))
influence_plot(results, ax=ax)
plt.show()
Hypothesis Testing
# Test single coefficient
# H0: beta_i = 0 (automatically in summary)
# Test multiple restrictions using F-test
# Example: Test beta_1 = beta_2 = 0 (assumes 4 parameters: const, x1, x2, x3)
R = [[0, 1, 0, 0], [0, 0, 1, 0]] # Restriction matrix, one row per restriction
f_test = results.f_test(R)
print(f_test)
# String-based hypothesis testing (requires named regressors, e.g. DataFrame exog or the formula API)
f_test = results.f_test("x1 = 0, x2 = 0")
print(f_test)
# Test linear combination: beta_1 + beta_2 = 1
r_matrix = [[0, 1, 1, 0]]
q_matrix = [1] # RHS value
f_test = results.f_test((r_matrix, q_matrix))
print(f_test)
# Wald test (equivalent to F-test for linear restrictions)
wald_test = results.wald_test(R)
print(wald_test)
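Single restrictions can also be tested with t statistics via t_test, which accepts the same string syntax (the names x1, x2 below assume a named exog or the formula API):
# Two-sided t-test of a single coefficient
print(results.t_test("x1 = 0"))
# Test a single linear combination, e.g. beta_1 + beta_2 = 1
print(results.t_test("x1 + x2 = 1"))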
Model Comparison
# Compare nested OLS models with an F-test (ANOVA table)
from statsmodels.stats.anova import anova_lm
# Fit restricted and unrestricted models
model_restricted = sm.OLS(y, X_restricted).fit()
model_full = sm.OLS(y, X_full).fit()
# ANOVA table for model comparison
anova_results = anova_lm(model_restricted, model_full)
print(anova_results)
# AIC/BIC for non-nested model comparison
print(f"Restricted model AIC: {model_restricted.aic}, BIC: {model_restricted.bic}")
print(f"Full model AIC: {model_full.aic}, BIC: {model_full.bic}")
# Lower AIC/BIC indicates better model
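For nested OLS models, the results object of the larger model also provides comparison helpers; a brief sketch:
# F-test and likelihood-ratio test of the restricted model against the full model
f_value, p_value, df_diff = model_full.compare_f_test(model_restricted)
print(f"F = {f_value:.3f}, p = {p_value:.4f}, df diff = {df_diff}")
lr_stat, lr_pval, lr_df = model_full.compare_lr_test(model_restricted)
print(f"LR = {lr_stat:.3f}, p = {lr_pval:.4f}")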
Robust Standard Errors
Adjust inference for heteroscedasticity, autocorrelation, or clustering without reweighting the observations or changing the coefficient estimates.
# Heteroscedasticity-robust (HC) standard errors
results_hc = results.get_robustcov_results(cov_type='HC0') # White's
results_hc1 = results.get_robustcov_results(cov_type='HC1')
results_hc2 = results.get_robustcov_results(cov_type='HC2')
results_hc3 = results.get_robustcov_results(cov_type='HC3') # Most conservative
# Newey-West HAC (Heteroscedasticity and Autocorrelation Consistent)
results_hac = results.get_robustcov_results(cov_type='HAC', maxlags=4)
# Cluster-robust standard errors
results_cluster = results.get_robustcov_results(cov_type='cluster', groups=cluster_ids)
# View robust results
print(results_hc3.summary())
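The same covariance options can also be requested directly at fit time rather than adjusting an existing result:
# Equivalent: pass cov_type (and cov_kwds) to fit()
results_hc3 = sm.OLS(y, X).fit(cov_type="HC3")
results_hac = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 4})
results_cluster = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": cluster_ids})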
Best Practices
- Always add constant: Use sm.add_constant() unless you specifically want to exclude the intercept
- Check assumptions: Run diagnostic tests (heteroscedasticity, autocorrelation, normality)
- Use formula API for categorical variables: smf.ols() handles categorical variables automatically
- Robust standard errors: Use when heteroscedasticity is detected but the model specification is correct
- Model selection: Use AIC/BIC for non-nested models, F-test/likelihood ratio for nested models
- Outliers and influence: Always check Cook's distance and leverage
- Multicollinearity: Check VIF and condition number before interpretation
- Time series: Use GLSAR or robust HAC standard errors for autocorrelated errors
- Grouped data: Consider mixed effects models or cluster-robust standard errors
- Quantile regression: Use for robust estimation or when interested in distributional effects
Common Pitfalls
- Forgetting to add constant: Results in no-intercept model
- Ignoring heteroscedasticity: Use WLS or robust standard errors
- Using OLS with autocorrelated errors: Use GLSAR or HAC standard errors
- Over-interpreting with multicollinearity: Check VIF first
- Not checking residuals: Always plot residuals vs fitted values
- Computing residuals in a transformed space (e.g., after t-SNE/PCA): diagnostics should use residuals from the original data space
- Confusing prediction vs confidence intervals: Prediction intervals are wider
- Not handling categorical variables properly: Use formula API or manual dummy coding
- Comparing models with different sample sizes: Ensure same observations used
- Ignoring influential observations: Check Cook's distance and DFFITS