Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/statsmodels/SKILL.md
+++ b/skills/statsmodels/SKILL.md
@@ -0,0 +1,608 @@
+---
+name: statsmodels
+description: "Statistical modeling toolkit. OLS, GLM, logistic, ARIMA, time series, hypothesis tests, diagnostics, AIC/BIC, for rigorous statistical inference and econometric analysis."
+---
+
+# Statsmodels: Statistical Modeling and Econometrics
+
+## Overview
+
+Statsmodels is Python's premier library for statistical modeling, providing tools for estimation, inference, and diagnostics across a wide range of statistical methods. Apply this skill for rigorous statistical analysis, from simple linear regression to complex time series models and econometric analyses.
+
+## When to Use This Skill
+
+This skill should be used when:
+- Fitting regression models (OLS, WLS, GLS, quantile regression)
+- Performing generalized linear modeling (logistic, Poisson, Gamma, etc.)
+- Analyzing discrete outcomes (binary, multinomial, count, ordinal)
+- Conducting time series analysis (ARIMA, SARIMAX, VAR, forecasting)
+- Running statistical tests and diagnostics
+- Testing model assumptions (heteroskedasticity, autocorrelation, normality)
+- Detecting outliers and influential observations
+- Comparing models (AIC/BIC, likelihood ratio tests)
+- Estimating causal effects
+- Producing publication-ready statistical tables and inference
+
+## Quick Start Guide
+
+### Linear Regression (OLS)
+
+```python
+import statsmodels.api as sm
+import numpy as np
+import pandas as pd
+
+# Prepare data - ALWAYS add constant for intercept
+X = sm.add_constant(X_data)
+
+# Fit OLS model
+model = sm.OLS(y, X)
+results = model.fit()
+
+# View comprehensive results
+print(results.summary())
+
+# Key results
+print(f"R-squared: {results.rsquared:.4f}")
+print(f"Coefficients:\\n{results.params}")
+print(f"P-values:\\n{results.pvalues}")
+
+# Predictions with confidence intervals
+predictions = results.get_prediction(X_new)
+pred_summary = predictions.summary_frame()
+print(pred_summary)  # includes mean, CI, prediction intervals
+
+# Diagnostics
+from statsmodels.stats.diagnostic import het_breuschpagan
+bp_test = het_breuschpagan(results.resid, X)
+print(f"Breusch-Pagan p-value: {bp_test[1]:.4f}")
+
+# Visualize residuals
+import matplotlib.pyplot as plt
+plt.scatter(results.fittedvalues, results.resid)
+plt.axhline(y=0, color='r', linestyle='--')
+plt.xlabel('Fitted values')
+plt.ylabel('Residuals')
+plt.show()
+```
+
+### Logistic Regression (Binary Outcomes)
+
+```python
+from statsmodels.discrete.discrete_model import Logit
+
+# Add constant
+X = sm.add_constant(X_data)
+
+# Fit logit model
+model = Logit(y_binary, X)
+results = model.fit()
+
+print(results.summary())
+
+# Odds ratios
+odds_ratios = np.exp(results.params)
+print("Odds ratios:\\n", odds_ratios)
+
+# Predicted probabilities
+probs = results.predict(X)
+
+# Binary predictions (0.5 threshold)
+predictions = (probs > 0.5).astype(int)
+
+# Model evaluation
+from sklearn.metrics import classification_report, roc_auc_score
+
+print(classification_report(y_binary, predictions))
+print(f"AUC: {roc_auc_score(y_binary, probs):.4f}")
+
+# Marginal effects
+marginal = results.get_margeff()
+print(marginal.summary())
+```
+
+### Time Series (ARIMA)
+
+```python
+from statsmodels.tsa.arima.model import ARIMA
+from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
+
+# Check stationarity
+from statsmodels.tsa.stattools import adfuller
+
+adf_result = adfuller(y_series)
+print(f"ADF p-value: {adf_result[1]:.4f}")
+
+if adf_result[1] > 0.05:
+    # Series is non-stationary, difference it
+    y_diff = y_series.diff().dropna()
+
+# Plot ACF/PACF to identify p, q
+fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
+plot_acf(y_diff, lags=40, ax=ax1)
+plot_pacf(y_diff, lags=40, ax=ax2)
+plt.show()
+
+# Fit ARIMA(p,d,q)
+model = ARIMA(y_series, order=(1, 1, 1))
+results = model.fit()
+
+print(results.summary())
+
+# Forecast
+forecast = results.forecast(steps=10)
+forecast_obj = results.get_forecast(steps=10)
+forecast_df = forecast_obj.summary_frame()
+
+print(forecast_df)  # includes mean and confidence intervals
+
+# Residual diagnostics
+results.plot_diagnostics(figsize=(12, 8))
+plt.show()
+```
+
+### Generalized Linear Models (GLM)
+
+```python
+import statsmodels.api as sm
+
+# Poisson regression for count data
+X = sm.add_constant(X_data)
+model = sm.GLM(y_counts, X, family=sm.families.Poisson())
+results = model.fit()
+
+print(results.summary())
+
+# Rate ratios (for Poisson with log link)
+rate_ratios = np.exp(results.params)
+print("Rate ratios:\\n", rate_ratios)
+
+# Check overdispersion
+overdispersion = results.pearson_chi2 / results.df_resid
+print(f"Overdispersion: {overdispersion:.2f}")
+
+if overdispersion > 1.5:
+    # Use Negative Binomial instead
+    from statsmodels.discrete.count_model import NegativeBinomial
+    nb_model = NegativeBinomial(y_counts, X)
+    nb_results = nb_model.fit()
+    print(nb_results.summary())
+```
+
+## Core Statistical Modeling Capabilities
+
+### 1. Linear Regression Models
+
+Comprehensive suite of linear models for continuous outcomes with various error structures.
+
+**Available models:**
+- **OLS**: Standard linear regression with i.i.d. errors
+- **WLS**: Weighted least squares for heteroskedastic errors
+- **GLS**: Generalized least squares for arbitrary covariance structure
+- **GLSAR**: GLS with autoregressive errors for time series
+- **Quantile Regression**: Conditional quantiles (robust to outliers)
+- **Mixed Effects**: Hierarchical/multilevel models with random effects
+- **Recursive/Rolling**: Time-varying parameter estimation
+
+**Key features:**
+- Comprehensive diagnostic tests
+- Robust standard errors (HC, HAC, cluster-robust)
+- Influence statistics (Cook's distance, leverage, DFFITS)
+- Hypothesis testing (F-tests, Wald tests)
+- Model comparison (AIC, BIC, likelihood ratio tests)
+- Prediction with confidence and prediction intervals
+
+**When to use:** Continuous outcome variable, want inference on coefficients, need diagnostics
+
+**Reference:** See `references/linear_models.md` for detailed guidance on model selection, diagnostics, and best practices.
+
+### 2. Generalized Linear Models (GLM)
+
+Flexible framework extending linear models to non-normal distributions.
+
+**Distribution families:**
+- **Binomial**: Binary outcomes or proportions (logistic regression)
+- **Poisson**: Count data
+- **Negative Binomial**: Overdispersed counts
+- **Gamma**: Positive continuous, right-skewed data
+- **Inverse Gaussian**: Positive continuous with specific variance structure
+- **Gaussian**: Equivalent to OLS
+- **Tweedie**: Flexible family for semi-continuous data
+
+**Link functions:**
+- Logit, Probit, Log, Identity, Inverse, Sqrt, CLogLog, Power
+- Choose based on interpretation needs and model fit
+
+**Key features:**
+- Maximum likelihood estimation via IRLS
+- Deviance and Pearson residuals
+- Goodness-of-fit statistics
+- Pseudo R-squared measures
+- Robust standard errors
+
+**When to use:** Non-normal outcomes, need flexible variance and link specifications
+
+**Reference:** See `references/glm.md` for family selection, link functions, interpretation, and diagnostics.
+
+### 3. Discrete Choice Models
+
+Models for categorical and count outcomes.
+
+**Binary models:**
+- **Logit**: Logistic regression (odds ratios)
+- **Probit**: Probit regression (normal distribution)
+
+**Multinomial models:**
+- **MNLogit**: Unordered categories (3+ levels)
+- **Conditional Logit**: Choice models with alternative-specific variables
+- **Ordered Model**: Ordinal outcomes (ordered categories)
+
+**Count models:**
+- **Poisson**: Standard count model
+- **Negative Binomial**: Overdispersed counts
+- **Zero-Inflated**: Excess zeros (ZIP, ZINB)
+- **Hurdle Models**: Two-stage models for zero-heavy data
+
+**Key features:**
+- Maximum likelihood estimation
+- Marginal effects at means or average marginal effects
+- Model comparison via AIC/BIC
+- Predicted probabilities and classification
+- Goodness-of-fit tests
+
+**When to use:** Binary, categorical, or count outcomes
+
+**Reference:** See `references/discrete_choice.md` for model selection, interpretation, and evaluation.
+
+### 4. Time Series Analysis
+
+Comprehensive time series modeling and forecasting capabilities.
+
+**Univariate models:**
+- **AutoReg (AR)**: Autoregressive models
+- **ARIMA**: Autoregressive integrated moving average
+- **SARIMAX**: Seasonal ARIMA with exogenous variables
+- **Exponential Smoothing**: Simple, Holt, Holt-Winters
+- **ETS**: Innovations state space models
+
+**Multivariate models:**
+- **VAR**: Vector autoregression
+- **VARMAX**: VAR with MA and exogenous variables
+- **Dynamic Factor Models**: Extract common factors
+- **VECM**: Vector error correction models (cointegration)
+
+**Advanced models:**
+- **State Space**: Kalman filtering, custom specifications
+- **Regime Switching**: Markov switching models
+- **ARDL**: Autoregressive distributed lag
+
+**Key features:**
+- ACF/PACF analysis for model identification
+- Stationarity tests (ADF, KPSS)
+- Forecasting with prediction intervals
+- Residual diagnostics (Ljung-Box, heteroskedasticity)
+- Granger causality testing
+- Impulse response functions (IRF)
+- Forecast error variance decomposition (FEVD)
+
+**When to use:** Time-ordered data, forecasting, understanding temporal dynamics
+
+**Reference:** See `references/time_series.md` for model selection, diagnostics, and forecasting methods.
+
+### 5. Statistical Tests and Diagnostics
+
+Extensive testing and diagnostic capabilities for model validation.
+
+**Residual diagnostics:**
+- Autocorrelation tests (Ljung-Box, Durbin-Watson, Breusch-Godfrey)
+- Heteroskedasticity tests (Breusch-Pagan, White, ARCH)
+- Normality tests (Jarque-Bera, Omnibus, Anderson-Darling, Lilliefors)
+- Specification tests (RESET, Harvey-Collier)
+
+**Influence and outliers:**
+- Leverage (hat values)
+- Cook's distance
+- DFFITS and DFBETAs
+- Studentized residuals
+- Influence plots
+
+**Hypothesis testing:**
+- t-tests (one-sample, two-sample, paired)
+- Proportion tests
+- Chi-square tests
+- Non-parametric tests (Mann-Whitney, Wilcoxon, Kruskal-Wallis)
+- ANOVA (one-way, two-way, repeated measures)
+
+**Multiple comparisons:**
+- Tukey's HSD
+- Bonferroni correction
+- False Discovery Rate (FDR)
+
+**Effect sizes and power:**
+- Cohen's d, eta-squared
+- Power analysis for t-tests, proportions
+- Sample size calculations
+
+**Robust inference:**
+- Heteroskedasticity-consistent SEs (HC0-HC3)
+- HAC standard errors (Newey-West)
+- Cluster-robust standard errors
+
+**When to use:** Validating assumptions, detecting problems, ensuring robust inference
+
+**Reference:** See `references/stats_diagnostics.md` for comprehensive testing and diagnostic procedures.
+
+## Formula API (R-style)
+
+Statsmodels supports R-style formulas for intuitive model specification:
+
+```python
+import statsmodels.formula.api as smf
+
+# OLS with formula
+results = smf.ols('y ~ x1 + x2 + x1:x2', data=df).fit()
+
+# Categorical variables (automatic dummy coding)
+results = smf.ols('y ~ x1 + C(category)', data=df).fit()
+
+# Interactions
+results = smf.ols('y ~ x1 * x2', data=df).fit()  # x1 + x2 + x1:x2
+
+# Polynomial terms
+results = smf.ols('y ~ x + I(x**2)', data=df).fit()
+
+# Logit
+results = smf.logit('y ~ x1 + x2 + C(group)', data=df).fit()
+
+# Poisson
+results = smf.poisson('count ~ x1 + x2', data=df).fit()
+
+# ARIMA (not available via formula, use regular API)
+```
+
+## Model Selection and Comparison
+
+### Information Criteria
+
+```python
+# Compare models using AIC/BIC
+models = {
+    'Model 1': model1_results,
+    'Model 2': model2_results,
+    'Model 3': model3_results
+}
+
+comparison = pd.DataFrame({
+    'AIC': {name: res.aic for name, res in models.items()},
+    'BIC': {name: res.bic for name, res in models.items()},
+    'Log-Likelihood': {name: res.llf for name, res in models.items()}
+})
+
+print(comparison.sort_values('AIC'))
+# Lower AIC/BIC indicates better model
+```
+
+### Likelihood Ratio Test (Nested Models)
+
+```python
+# For nested models (one is subset of the other)
+from scipy import stats
+
+lr_stat = 2 * (full_model.llf - reduced_model.llf)
+df = full_model.df_model - reduced_model.df_model
+p_value = 1 - stats.chi2.cdf(lr_stat, df)
+
+print(f"LR statistic: {lr_stat:.4f}")
+print(f"p-value: {p_value:.4f}")
+
+if p_value < 0.05:
+    print("Full model significantly better")
+else:
+    print("Reduced model preferred (parsimony)")
+```
+
+### Cross-Validation
+
+```python
+from sklearn.model_selection import KFold
+from sklearn.metrics import mean_squared_error
+
+kf = KFold(n_splits=5, shuffle=True, random_state=42)
+cv_scores = []
+
+for train_idx, val_idx in kf.split(X):
+    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
+    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
+
+    # Fit model
+    model = sm.OLS(y_train, X_train).fit()
+
+    # Predict
+    y_pred = model.predict(X_val)
+
+    # Score
+    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
+    cv_scores.append(rmse)
+
+print(f"CV RMSE: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}")
+```
+
+## Best Practices
+
+### Data Preparation
+
+1. **Always add constant**: Use `sm.add_constant()` unless excluding intercept
+2. **Check for missing values**: Handle or impute before fitting
+3. **Scale if needed**: Improves convergence, interpretation (but not required for tree models)
+4. **Encode categoricals**: Use formula API or manual dummy coding
+
+### Model Building
+
+1. **Start simple**: Begin with basic model, add complexity as needed
+2. **Check assumptions**: Test residuals, heteroskedasticity, autocorrelation
+3. **Use appropriate model**: Match model to outcome type (binary→Logit, count→Poisson)
+4. **Consider alternatives**: If assumptions violated, use robust methods or different model
+
+### Inference
+
+1. **Report effect sizes**: Not just p-values
+2. **Use robust SEs**: When heteroskedasticity or clustering present
+3. **Multiple comparisons**: Correct when testing many hypotheses
+4. **Confidence intervals**: Always report alongside point estimates
+
+### Model Evaluation
+
+1. **Check residuals**: Plot residuals vs fitted, Q-Q plot
+2. **Influence diagnostics**: Identify and investigate influential observations
+3. **Out-of-sample validation**: Test on holdout set or cross-validate
+4. **Compare models**: Use AIC/BIC for non-nested, LR test for nested
+
+### Reporting
+
+1. **Comprehensive summary**: Use `.summary()` for detailed output
+2. **Document decisions**: Note transformations, excluded observations
+3. **Interpret carefully**: Account for link functions (e.g., exp(β) for log link)
+4. **Visualize**: Plot predictions, confidence intervals, diagnostics
+
+## Common Workflows
+
+### Workflow 1: Linear Regression Analysis
+
+1. Explore data (plots, descriptives)
+2. Fit initial OLS model
+3. Check residual diagnostics
+4. Test for heteroskedasticity, autocorrelation
+5. Check for multicollinearity (VIF)
+6. Identify influential observations
+7. Refit with robust SEs if needed
+8. Interpret coefficients and inference
+9. Validate on holdout or via CV
+
+### Workflow 2: Binary Classification
+
+1. Fit logistic regression (Logit)
+2. Check for convergence issues
+3. Interpret odds ratios
+4. Calculate marginal effects
+5. Evaluate classification performance (AUC, confusion matrix)
+6. Check for influential observations
+7. Compare with alternative models (Probit)
+8. Validate predictions on test set
+
+### Workflow 3: Count Data Analysis
+
+1. Fit Poisson regression
+2. Check for overdispersion
+3. If overdispersed, fit Negative Binomial
+4. Check for excess zeros (consider ZIP/ZINB)
+5. Interpret rate ratios
+6. Assess goodness of fit
+7. Compare models via AIC
+8. Validate predictions
+
+### Workflow 4: Time Series Forecasting
+
+1. Plot series, check for trend/seasonality
+2. Test for stationarity (ADF, KPSS)
+3. Difference if non-stationary
+4. Identify p, q from ACF/PACF
+5. Fit ARIMA or SARIMAX
+6. Check residual diagnostics (Ljung-Box)
+7. Generate forecasts with confidence intervals
+8. Evaluate forecast accuracy on test set
+
+## Reference Documentation
+
+This skill includes comprehensive reference files for detailed guidance:
+
+### references/linear_models.md
+Detailed coverage of linear regression models including:
+- OLS, WLS, GLS, GLSAR, Quantile Regression
+- Mixed effects models
+- Recursive and rolling regression
+- Comprehensive diagnostics (heteroskedasticity, autocorrelation, multicollinearity)
+- Influence statistics and outlier detection
+- Robust standard errors (HC, HAC, cluster)
+- Hypothesis testing and model comparison
+
+### references/glm.md
+Complete guide to generalized linear models:
+- All distribution families (Binomial, Poisson, Gamma, etc.)
+- Link functions and when to use each
+- Model fitting and interpretation
+- Pseudo R-squared and goodness of fit
+- Diagnostics and residual analysis
+- Applications (logistic, Poisson, Gamma regression)
+
+### references/discrete_choice.md
+Comprehensive guide to discrete outcome models:
+- Binary models (Logit, Probit)
+- Multinomial models (MNLogit, Conditional Logit)
+- Count models (Poisson, Negative Binomial, Zero-Inflated, Hurdle)
+- Ordinal models
+- Marginal effects and interpretation
+- Model diagnostics and comparison
+
+### references/time_series.md
+In-depth time series analysis guidance:
+- Univariate models (AR, ARIMA, SARIMAX, Exponential Smoothing)
+- Multivariate models (VAR, VARMAX, Dynamic Factor)
+- State space models
+- Stationarity testing and diagnostics
+- Forecasting methods and evaluation
+- Granger causality, IRF, FEVD
+
+### references/stats_diagnostics.md
+Comprehensive statistical testing and diagnostics:
+- Residual diagnostics (autocorrelation, heteroskedasticity, normality)
+- Influence and outlier detection
+- Hypothesis tests (parametric and non-parametric)
+- ANOVA and post-hoc tests
+- Multiple comparisons correction
+- Robust covariance matrices
+- Power analysis and effect sizes
+
+**When to reference:**
+- Need detailed parameter explanations
+- Choosing between similar models
+- Troubleshooting convergence or diagnostic issues
+- Understanding specific test statistics
+- Looking for code examples for advanced features
+
+**Search patterns:**
+```bash
+# Find information about specific models
+grep -r "Quantile Regression" references/
+
+# Find diagnostic tests
+grep -r "Breusch-Pagan" references/stats_diagnostics.md
+
+# Find time series guidance
+grep -r "SARIMAX" references/time_series.md
+```
+
+## Common Pitfalls to Avoid
+
+1. **Forgetting constant term**: Always use `sm.add_constant()` unless no intercept desired
+2. **Ignoring assumptions**: Check residuals, heteroskedasticity, autocorrelation
+3. **Wrong model for outcome type**: Binary→Logit/Probit, Count→Poisson/NB, not OLS
+4. **Not checking convergence**: Look for optimization warnings
+5. **Misinterpreting coefficients**: Remember link functions (log, logit, etc.)
+6. **Using Poisson with overdispersion**: Check dispersion, use Negative Binomial if needed
+7. **Not using robust SEs**: When heteroskedasticity or clustering present
+8. **Overfitting**: Too many parameters relative to sample size
+9. **Data leakage**: Fitting on test data or using future information
+10. **Not validating predictions**: Always check out-of-sample performance
+11. **Comparing non-nested models**: Use AIC/BIC, not LR test
+12. **Ignoring influential observations**: Check Cook's distance and leverage
+13. **Multiple testing**: Correct p-values when testing many hypotheses
+14. **Not differencing time series**: Fit ARIMA on non-stationary data
+15. **Confusing prediction vs confidence intervals**: Prediction intervals are wider
+
+## Getting Help
+
+For detailed documentation and examples:
+- Official docs: https://www.statsmodels.org/stable/
+- User guide: https://www.statsmodels.org/stable/user-guide.html
+- Examples: https://www.statsmodels.org/stable/examples/index.html
+- API reference: https://www.statsmodels.org/stable/api.html