Initial commit
This commit is contained in:
716
skills/statsmodels/references/time_series.md
Normal file
716
skills/statsmodels/references/time_series.md
Normal file
@@ -0,0 +1,716 @@
|
||||
# Time Series Analysis Reference
|
||||
|
||||
This document provides comprehensive guidance on time series models in statsmodels, including ARIMA, state space models, VAR, exponential smoothing, and forecasting methods.
|
||||
|
||||
## Overview
|
||||
|
||||
Statsmodels offers extensive time series capabilities:
|
||||
- **Univariate models**: AR, ARIMA, SARIMAX, Exponential Smoothing
|
||||
- **Multivariate models**: VAR, VARMAX, Dynamic Factor Models
|
||||
- **State space framework**: Custom models, Kalman filtering
|
||||
- **Diagnostic tools**: ACF, PACF, stationarity tests, residual analysis
|
||||
- **Forecasting**: Point forecasts and prediction intervals
|
||||
|
||||
## Univariate Time Series Models
|
||||
|
||||
### AutoReg (AR Model)
|
||||
|
||||
Autoregressive model: current value depends on past values.
|
||||
|
||||
**When to use:**
|
||||
- Univariate time series
|
||||
- Past values predict future
|
||||
- Stationary series
|
||||
|
||||
**Model**: yₜ = c + φ₁yₜ₋₁ + φ₂yₜ₋₂ + ... + φₚyₜ₋ₚ + εₜ
|
||||
|
||||
```python
|
||||
from statsmodels.tsa.ar_model import AutoReg
|
||||
import pandas as pd
|
||||
|
||||
# Fit AR(p) model
|
||||
model = AutoReg(y, lags=5) # AR(5)
|
||||
results = model.fit()
|
||||
|
||||
print(results.summary())
|
||||
```
|
||||
|
||||
**With exogenous regressors:**
|
||||
```python
|
||||
# AR with exogenous variables (ARX)
|
||||
model = AutoReg(y, lags=5, exog=X_exog)
|
||||
results = model.fit()
|
||||
```
|
||||
|
||||
**Seasonal AR:**
|
||||
```python
|
||||
# Seasonal lags (e.g., monthly data with yearly seasonality)
|
||||
model = AutoReg(y, lags=12, seasonal=True)
|
||||
results = model.fit()
|
||||
```
|
||||
|
||||
### ARIMA (Autoregressive Integrated Moving Average)
|
||||
|
||||
Combines AR, differencing (I), and MA components.
|
||||
|
||||
**When to use:**
|
||||
- Non-stationary time series (needs differencing)
|
||||
- Past values and errors predict future
|
||||
- Flexible model for many time series
|
||||
|
||||
**Model**: ARIMA(p,d,q)
|
||||
- p: AR order (lags)
|
||||
- d: differencing order (to achieve stationarity)
|
||||
- q: MA order (lagged forecast errors)
|
||||
|
||||
```python
|
||||
from statsmodels.tsa.arima.model import ARIMA
|
||||
|
||||
# Fit ARIMA(p,d,q)
|
||||
model = ARIMA(y, order=(1, 1, 1)) # ARIMA(1,1,1)
|
||||
results = model.fit()
|
||||
|
||||
print(results.summary())
|
||||
```
|
||||
|
||||
**Choosing p, d, q:**
|
||||
|
||||
1. **Determine d (differencing order)**:
|
||||
```python
|
||||
from statsmodels.tsa.stattools import adfuller
|
||||
|
||||
# ADF test for stationarity
|
||||
def check_stationarity(series):
|
||||
result = adfuller(series)
|
||||
print(f"ADF Statistic: {result[0]:.4f}")
|
||||
print(f"p-value: {result[1]:.4f}")
|
||||
if result[1] <= 0.05:
|
||||
print("Series is stationary")
|
||||
return True
|
||||
else:
|
||||
print("Series is non-stationary, needs differencing")
|
||||
return False
|
||||
|
||||
# Test original series
|
||||
if not check_stationarity(y):
|
||||
# Difference once
|
||||
y_diff = y.diff().dropna()
|
||||
if not check_stationarity(y_diff):
|
||||
# Difference again
|
||||
y_diff2 = y_diff.diff().dropna()
|
||||
check_stationarity(y_diff2)
|
||||
```
|
||||
|
||||
2. **Determine p and q (ACF/PACF)**:
|
||||
```python
|
||||
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
# After differencing to stationarity
|
||||
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
|
||||
|
||||
# ACF: helps determine q (MA order)
|
||||
plot_acf(y_stationary, lags=40, ax=ax1)
|
||||
ax1.set_title('Autocorrelation Function (ACF)')
|
||||
|
||||
# PACF: helps determine p (AR order)
|
||||
plot_pacf(y_stationary, lags=40, ax=ax2)
|
||||
ax2.set_title('Partial Autocorrelation Function (PACF)')
|
||||
|
||||
plt.tight_layout()
|
||||
plt.show()
|
||||
|
||||
# Rules of thumb:
|
||||
# - PACF cuts off at lag p → AR(p)
|
||||
# - ACF cuts off at lag q → MA(q)
|
||||
# - Both decay → ARMA(p,q)
|
||||
```
|
||||
|
||||
3. **Model selection (AIC/BIC)**:
|
||||
```python
|
||||
# Grid search for best (p,q) given d
|
||||
import numpy as np
|
||||
|
||||
best_aic = np.inf
|
||||
best_order = None
|
||||
|
||||
for p in range(5):
|
||||
for q in range(5):
|
||||
try:
|
||||
model = ARIMA(y, order=(p, d, q))
|
||||
results = model.fit()
|
||||
if results.aic < best_aic:
|
||||
best_aic = results.aic
|
||||
best_order = (p, d, q)
|
||||
except:
|
||||
continue
|
||||
|
||||
print(f"Best order: {best_order} with AIC: {best_aic:.2f}")
|
||||
```
|
||||
|
||||
### SARIMAX (Seasonal ARIMA with Exogenous Variables)
|
||||
|
||||
Extends ARIMA with seasonality and exogenous regressors.
|
||||
|
||||
**When to use:**
|
||||
- Seasonal patterns (monthly, quarterly data)
|
||||
- External variables influence series
|
||||
- Most flexible univariate model
|
||||
|
||||
**Model**: SARIMAX(p,d,q)(P,D,Q,s)
|
||||
- (p,d,q): Non-seasonal ARIMA
|
||||
- (P,D,Q,s): Seasonal ARIMA with period s
|
||||
|
||||
```python
|
||||
from statsmodels.tsa.statespace.sarimax import SARIMAX
|
||||
|
||||
# Seasonal ARIMA for monthly data (s=12)
|
||||
model = SARIMAX(y,
|
||||
order=(1, 1, 1), # (p,d,q)
|
||||
seasonal_order=(1, 1, 1, 12)) # (P,D,Q,s)
|
||||
results = model.fit()
|
||||
|
||||
print(results.summary())
|
||||
```
|
||||
|
||||
**With exogenous variables:**
|
||||
```python
|
||||
# SARIMAX with external predictors
|
||||
model = SARIMAX(y,
|
||||
exog=X_exog,
|
||||
order=(1, 1, 1),
|
||||
seasonal_order=(1, 1, 1, 12))
|
||||
results = model.fit()
|
||||
```
|
||||
|
||||
**Example: Monthly sales with trend and seasonality**
|
||||
```python
|
||||
# Typical for monthly data: (p,d,q)(P,D,Q,12)
|
||||
# Start with (1,1,1)(1,1,1,12) or (0,1,1)(0,1,1,12)
|
||||
|
||||
model = SARIMAX(monthly_sales,
|
||||
order=(0, 1, 1),
|
||||
seasonal_order=(0, 1, 1, 12),
|
||||
enforce_stationarity=False,
|
||||
enforce_invertibility=False)
|
||||
results = model.fit()
|
||||
```
|
||||
|
||||
### Exponential Smoothing
|
||||
|
||||
Weighted averages of past observations with exponentially decreasing weights.
|
||||
|
||||
**When to use:**
|
||||
- Simple, interpretable forecasts
|
||||
- Trend and/or seasonality present
|
||||
- No need for explicit model specification
|
||||
|
||||
**Types:**
|
||||
- Simple Exponential Smoothing: no trend, no seasonality
|
||||
- Holt's method: with trend
|
||||
- Holt-Winters: with trend and seasonality
|
||||
|
||||
```python
|
||||
from statsmodels.tsa.holtwinters import ExponentialSmoothing
|
||||
|
||||
# Simple exponential smoothing
|
||||
model = ExponentialSmoothing(y, trend=None, seasonal=None)
|
||||
results = model.fit()
|
||||
|
||||
# Holt's method (with trend)
|
||||
model = ExponentialSmoothing(y, trend='add', seasonal=None)
|
||||
results = model.fit()
|
||||
|
||||
# Holt-Winters (trend + seasonality)
|
||||
model = ExponentialSmoothing(y,
|
||||
trend='add', # 'add' or 'mul'
|
||||
seasonal='add', # 'add' or 'mul'
|
||||
seasonal_periods=12) # e.g., 12 for monthly
|
||||
results = model.fit()
|
||||
|
||||
print(results.summary())
|
||||
```
|
||||
|
||||
**Additive vs Multiplicative:**
|
||||
```python
|
||||
# Additive: constant seasonal variation
|
||||
# yₜ = Level + Trend + Seasonal + Error
|
||||
|
||||
# Multiplicative: proportional seasonal variation
|
||||
# yₜ = Level × Trend × Seasonal × Error
|
||||
|
||||
# Choose based on data:
|
||||
# - Additive: seasonal variation constant over time
|
||||
# - Multiplicative: seasonal variation increases with level
|
||||
```
|
||||
|
||||
**Innovations state space (ETS):**
|
||||
```python
|
||||
from statsmodels.tsa.exponential_smoothing.ets import ETSModel
|
||||
|
||||
# More robust, state space formulation
|
||||
model = ETSModel(y,
|
||||
error='add', # 'add' or 'mul'
|
||||
trend='add', # 'add', 'mul', or None
|
||||
seasonal='add', # 'add', 'mul', or None
|
||||
seasonal_periods=12)
|
||||
results = model.fit()
|
||||
```
|
||||
|
||||
## Multivariate Time Series
|
||||
|
||||
### VAR (Vector Autoregression)
|
||||
|
||||
System of equations where each variable depends on past values of all variables.
|
||||
|
||||
**When to use:**
|
||||
- Multiple interrelated time series
|
||||
- Bidirectional relationships
|
||||
- Granger causality testing
|
||||
|
||||
**Model**: Each variable is AR on all variables:
|
||||
- y₁ₜ = c₁ + φ₁₁y₁ₜ₋₁ + φ₁₂y₂ₜ₋₁ + ... + ε₁ₜ
|
||||
- y₂ₜ = c₂ + φ₂₁y₁ₜ₋₁ + φ₂₂y₂ₜ₋₁ + ... + ε₂ₜ
|
||||
|
||||
```python
|
||||
from statsmodels.tsa.api import VAR
|
||||
import pandas as pd
|
||||
|
||||
# Data should be DataFrame with multiple columns
|
||||
# Each column is a time series
|
||||
df_multivariate = pd.DataFrame({'series1': y1, 'series2': y2, 'series3': y3})
|
||||
|
||||
# Fit VAR
|
||||
model = VAR(df_multivariate)
|
||||
|
||||
# Select lag order using AIC/BIC
|
||||
lag_order_results = model.select_order(maxlags=15)
|
||||
print(lag_order_results.summary())
|
||||
|
||||
# Fit with optimal lags
|
||||
results = model.fit(maxlags=5, ic='aic')
|
||||
print(results.summary())
|
||||
```
|
||||
|
||||
**Granger causality testing:**
|
||||
```python
|
||||
# Test if series1 Granger-causes series2
|
||||
from statsmodels.tsa.stattools import grangercausalitytests
|
||||
|
||||
# Requires 2D array [series2, series1]
|
||||
test_data = df_multivariate[['series2', 'series1']]
|
||||
|
||||
# Test up to max_lag
|
||||
max_lag = 5
|
||||
results = grangercausalitytests(test_data, max_lag, verbose=True)
|
||||
|
||||
# P-values for each lag
|
||||
for lag in range(1, max_lag + 1):
|
||||
p_value = results[lag][0]['ssr_ftest'][1]
|
||||
print(f"Lag {lag}: p-value = {p_value:.4f}")
|
||||
```
|
||||
|
||||
**Impulse Response Functions (IRF):**
|
||||
```python
|
||||
# Trace effect of shock through system
|
||||
irf = results.irf(10) # 10 periods ahead
|
||||
|
||||
# Plot IRFs
|
||||
irf.plot(orth=True) # Orthogonalized (Cholesky decomposition)
|
||||
plt.show()
|
||||
|
||||
# Cumulative effects
|
||||
irf.plot_cum_effects(orth=True)
|
||||
plt.show()
|
||||
```
|
||||
|
||||
**Forecast Error Variance Decomposition:**
|
||||
```python
|
||||
# Contribution of each variable to forecast error variance
|
||||
fevd = results.fevd(10) # 10 periods ahead
|
||||
fevd.plot()
|
||||
plt.show()
|
||||
```
|
||||
|
||||
### VARMAX (VAR with Moving Average and Exogenous Variables)
|
||||
|
||||
Extends VAR with MA component and external regressors.
|
||||
|
||||
**When to use:**
|
||||
- VAR inadequate (MA component needed)
|
||||
- External variables affect system
|
||||
- More flexible multivariate model
|
||||
|
||||
```python
|
||||
from statsmodels.tsa.statespace.varmax import VARMAX
|
||||
|
||||
# VARMAX(p, q) with exogenous variables
|
||||
model = VARMAX(df_multivariate,
|
||||
order=(1, 1), # (p, q)
|
||||
exog=X_exog)
|
||||
results = model.fit()
|
||||
|
||||
print(results.summary())
|
||||
```
|
||||
|
||||
## State Space Models
|
||||
|
||||
Flexible framework for custom time series models.
|
||||
|
||||
**When to use:**
|
||||
- Custom model specification
|
||||
- Unobserved components
|
||||
- Kalman filtering/smoothing
|
||||
- Missing data
|
||||
|
||||
```python
|
||||
from statsmodels.tsa.statespace.mlemodel import MLEModel
|
||||
|
||||
# Extend MLEModel for custom state space models
|
||||
# Example: Local level model (random walk + noise)
|
||||
```
|
||||
|
||||
**Dynamic Factor Models:**
|
||||
```python
|
||||
from statsmodels.tsa.statespace.dynamic_factor import DynamicFactor
|
||||
|
||||
# Extract common factors from multiple time series
|
||||
model = DynamicFactor(df_multivariate,
|
||||
k_factors=2, # Number of factors
|
||||
factor_order=2) # AR order of factors
|
||||
results = model.fit()
|
||||
|
||||
# Estimated factors
|
||||
factors = results.factors.filtered
|
||||
```
|
||||
|
||||
## Forecasting
|
||||
|
||||
### Point Forecasts
|
||||
|
||||
```python
|
||||
# ARIMA forecasting
|
||||
model = ARIMA(y, order=(1, 1, 1))
|
||||
results = model.fit()
|
||||
|
||||
# Forecast h steps ahead
|
||||
h = 10
|
||||
forecast = results.forecast(steps=h)
|
||||
|
||||
# With exogenous variables (SARIMAX)
|
||||
model = SARIMAX(y, exog=X, order=(1, 1, 1))
|
||||
results = model.fit()
|
||||
|
||||
# Need future exogenous values
|
||||
forecast = results.forecast(steps=h, exog=X_future)
|
||||
```
|
||||
|
||||
### Prediction Intervals
|
||||
|
||||
```python
|
||||
# Get forecast with confidence intervals
|
||||
forecast_obj = results.get_forecast(steps=h)
|
||||
forecast_df = forecast_obj.summary_frame()
|
||||
|
||||
print(forecast_df)
|
||||
# Contains: mean, mean_se, mean_ci_lower, mean_ci_upper
|
||||
|
||||
# Extract components
|
||||
forecast_mean = forecast_df['mean']
|
||||
forecast_ci_lower = forecast_df['mean_ci_lower']
|
||||
forecast_ci_upper = forecast_df['mean_ci_upper']
|
||||
|
||||
# Plot
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
plt.figure(figsize=(12, 6))
|
||||
plt.plot(y.index, y, label='Historical')
|
||||
plt.plot(forecast_df.index, forecast_mean, label='Forecast', color='red')
|
||||
plt.fill_between(forecast_df.index,
|
||||
forecast_ci_lower,
|
||||
forecast_ci_upper,
|
||||
alpha=0.3, color='red', label='95% CI')
|
||||
plt.legend()
|
||||
plt.title('Forecast with Prediction Intervals')
|
||||
plt.show()
|
||||
```
|
||||
|
||||
### Dynamic vs Static Forecasts
|
||||
|
||||
```python
|
||||
# Static (one-step-ahead, using actual values)
|
||||
static_forecast = results.get_prediction(start=split_point, end=len(y)-1)
|
||||
|
||||
# Dynamic (multi-step, using predicted values)
|
||||
dynamic_forecast = results.get_prediction(start=split_point,
|
||||
end=len(y)-1,
|
||||
dynamic=True)
|
||||
|
||||
# Plot comparison
|
||||
fig, ax = plt.subplots(figsize=(12, 6))
|
||||
y.plot(ax=ax, label='Actual')
|
||||
static_forecast.predicted_mean.plot(ax=ax, label='Static forecast')
|
||||
dynamic_forecast.predicted_mean.plot(ax=ax, label='Dynamic forecast')
|
||||
ax.legend()
|
||||
plt.show()
|
||||
```
|
||||
|
||||
## Diagnostic Tests
|
||||
|
||||
### Stationarity Tests
|
||||
|
||||
```python
|
||||
from statsmodels.tsa.stattools import adfuller, kpss
|
||||
|
||||
# Augmented Dickey-Fuller (ADF) test
|
||||
# H0: unit root (non-stationary)
|
||||
adf_result = adfuller(y, autolag='AIC')
|
||||
print(f"ADF Statistic: {adf_result[0]:.4f}")
|
||||
print(f"p-value: {adf_result[1]:.4f}")
|
||||
if adf_result[1] <= 0.05:
|
||||
print("Reject H0: Series is stationary")
|
||||
else:
|
||||
print("Fail to reject H0: Series is non-stationary")
|
||||
|
||||
# KPSS test
|
||||
# H0: stationary (opposite of ADF)
|
||||
kpss_result = kpss(y, regression='c', nlags='auto')
|
||||
print(f"KPSS Statistic: {kpss_result[0]:.4f}")
|
||||
print(f"p-value: {kpss_result[1]:.4f}")
|
||||
if kpss_result[1] <= 0.05:
|
||||
print("Reject H0: Series is non-stationary")
|
||||
else:
|
||||
print("Fail to reject H0: Series is stationary")
|
||||
```
|
||||
|
||||
### Residual Diagnostics
|
||||
|
||||
```python
|
||||
# Ljung-Box test for autocorrelation in residuals
|
||||
from statsmodels.stats.diagnostic import acorr_ljungbox
|
||||
|
||||
lb_test = acorr_ljungbox(results.resid, lags=10, return_df=True)
|
||||
print(lb_test)
|
||||
# P-values > 0.05 indicate no significant autocorrelation (good)
|
||||
|
||||
# Plot residual diagnostics
|
||||
results.plot_diagnostics(figsize=(12, 8))
|
||||
plt.show()
|
||||
|
||||
# Components:
|
||||
# 1. Standardized residuals over time
|
||||
# 2. Histogram + KDE of residuals
|
||||
# 3. Q-Q plot for normality
|
||||
# 4. Correlogram (ACF of residuals)
|
||||
```
|
||||
|
||||
### Heteroskedasticity Tests
|
||||
|
||||
```python
|
||||
from statsmodels.stats.diagnostic import het_arch
|
||||
|
||||
# ARCH test for heteroskedasticity
|
||||
arch_test = het_arch(results.resid, nlags=10)
|
||||
print(f"ARCH test statistic: {arch_test[0]:.4f}")
|
||||
print(f"p-value: {arch_test[1]:.4f}")
|
||||
|
||||
# If significant, consider GARCH model
|
||||
```
|
||||
|
||||
## Seasonal Decomposition
|
||||
|
||||
```python
|
||||
from statsmodels.tsa.seasonal import seasonal_decompose
|
||||
|
||||
# Decompose into trend, seasonal, residual
|
||||
decomposition = seasonal_decompose(y,
|
||||
model='additive', # or 'multiplicative'
|
||||
period=12) # seasonal period
|
||||
|
||||
# Plot components
|
||||
fig = decomposition.plot()
|
||||
fig.set_size_inches(12, 8)
|
||||
plt.show()
|
||||
|
||||
# Access components
|
||||
trend = decomposition.trend
|
||||
seasonal = decomposition.seasonal
|
||||
residual = decomposition.resid
|
||||
|
||||
# STL decomposition (more robust)
|
||||
from statsmodels.tsa.seasonal import STL
|
||||
|
||||
stl = STL(y, seasonal=13) # seasonal must be odd
|
||||
stl_result = stl.fit()
|
||||
|
||||
fig = stl_result.plot()
|
||||
plt.show()
|
||||
```
|
||||
|
||||
## Model Evaluation
|
||||
|
||||
### In-Sample Metrics
|
||||
|
||||
```python
|
||||
# From results object
|
||||
print(f"AIC: {results.aic:.2f}")
|
||||
print(f"BIC: {results.bic:.2f}")
|
||||
print(f"Log-likelihood: {results.llf:.2f}")
|
||||
|
||||
# MSE on training data
|
||||
from sklearn.metrics import mean_squared_error
|
||||
|
||||
mse = mean_squared_error(y, results.fittedvalues)
|
||||
rmse = np.sqrt(mse)
|
||||
print(f"RMSE: {rmse:.4f}")
|
||||
|
||||
# MAE
|
||||
from sklearn.metrics import mean_absolute_error
|
||||
mae = mean_absolute_error(y, results.fittedvalues)
|
||||
print(f"MAE: {mae:.4f}")
|
||||
```
|
||||
|
||||
### Out-of-Sample Evaluation
|
||||
|
||||
```python
|
||||
# Train-test split for time series (no shuffle!)
|
||||
train_size = int(0.8 * len(y))
|
||||
y_train = y[:train_size]
|
||||
y_test = y[train_size:]
|
||||
|
||||
# Fit on training data
|
||||
model = ARIMA(y_train, order=(1, 1, 1))
|
||||
results = model.fit()
|
||||
|
||||
# Forecast test period
|
||||
forecast = results.forecast(steps=len(y_test))
|
||||
|
||||
# Metrics
|
||||
from sklearn.metrics import mean_squared_error, mean_absolute_error
|
||||
|
||||
rmse = np.sqrt(mean_squared_error(y_test, forecast))
|
||||
mae = mean_absolute_error(y_test, forecast)
|
||||
mape = np.mean(np.abs((y_test - forecast) / y_test)) * 100
|
||||
|
||||
print(f"Test RMSE: {rmse:.4f}")
|
||||
print(f"Test MAE: {mae:.4f}")
|
||||
print(f"Test MAPE: {mape:.2f}%")
|
||||
```
|
||||
|
||||
### Rolling Forecast
|
||||
|
||||
```python
|
||||
# More realistic evaluation: rolling one-step-ahead forecasts
|
||||
forecasts = []
|
||||
|
||||
for t in range(len(y_test)):
|
||||
# Refit or update with new observation
|
||||
y_current = y[:train_size + t]
|
||||
model = ARIMA(y_current, order=(1, 1, 1))
|
||||
fit = model.fit()
|
||||
|
||||
# One-step forecast
|
||||
fc = fit.forecast(steps=1)[0]
|
||||
forecasts.append(fc)
|
||||
|
||||
forecasts = np.array(forecasts)
|
||||
|
||||
rmse = np.sqrt(mean_squared_error(y_test, forecasts))
|
||||
print(f"Rolling forecast RMSE: {rmse:.4f}")
|
||||
```
|
||||
|
||||
### Cross-Validation
|
||||
|
||||
```python
|
||||
# Time series cross-validation (expanding window)
|
||||
from sklearn.model_selection import TimeSeriesSplit
|
||||
|
||||
tscv = TimeSeriesSplit(n_splits=5)
|
||||
rmse_scores = []
|
||||
|
||||
for train_idx, test_idx in tscv.split(y):
|
||||
y_train_cv = y.iloc[train_idx]
|
||||
y_test_cv = y.iloc[test_idx]
|
||||
|
||||
model = ARIMA(y_train_cv, order=(1, 1, 1))
|
||||
results = model.fit()
|
||||
|
||||
forecast = results.forecast(steps=len(test_idx))
|
||||
rmse = np.sqrt(mean_squared_error(y_test_cv, forecast))
|
||||
rmse_scores.append(rmse)
|
||||
|
||||
print(f"CV RMSE: {np.mean(rmse_scores):.4f} ± {np.std(rmse_scores):.4f}")
|
||||
```
|
||||
|
||||
## Advanced Topics
|
||||
|
||||
### ARDL (Autoregressive Distributed Lag)
|
||||
|
||||
Bridges univariate and multivariate time series.
|
||||
|
||||
```python
|
||||
from statsmodels.tsa.ardl import ARDL
|
||||
|
||||
# ARDL(p, q) model
|
||||
# y depends on its own lags and lags of X
|
||||
model = ARDL(y, lags=2, exog=X, exog_lags=2)
|
||||
results = model.fit()
|
||||
```
|
||||
|
||||
### Error Correction Models
|
||||
|
||||
For cointegrated series.
|
||||
|
||||
```python
|
||||
from statsmodels.tsa.vector_ar.vecm import coint_johansen
|
||||
|
||||
# Test for cointegration
|
||||
johansen_test = coint_johansen(df_multivariate, det_order=0, k_ar_diff=1)
|
||||
|
||||
# Fit VECM if cointegrated
|
||||
from statsmodels.tsa.vector_ar.vecm import VECM
|
||||
|
||||
model = VECM(df_multivariate, k_ar_diff=1, coint_rank=1)
|
||||
results = model.fit()
|
||||
```
|
||||
|
||||
### Regime Switching Models
|
||||
|
||||
For structural breaks and regime changes.
|
||||
|
||||
```python
|
||||
from statsmodels.tsa.regime_switching.markov_regression import MarkovRegression
|
||||
|
||||
# Markov switching model
|
||||
model = MarkovRegression(y, k_regimes=2, order=1)
|
||||
results = model.fit()
|
||||
|
||||
# Smoothed probabilities of regimes
|
||||
regime_probs = results.smoothed_marginal_probabilities
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Check stationarity**: Difference if needed, verify with ADF/KPSS tests
|
||||
2. **Plot data**: Always visualize before modeling
|
||||
3. **Identify seasonality**: Use appropriate seasonal models (SARIMAX, Holt-Winters)
|
||||
4. **Model selection**: Use AIC/BIC and out-of-sample validation
|
||||
5. **Residual diagnostics**: Check for autocorrelation, normality, heteroskedasticity
|
||||
6. **Forecast evaluation**: Use rolling forecasts and proper time series CV
|
||||
7. **Avoid overfitting**: Prefer simpler models, use information criteria
|
||||
8. **Document assumptions**: Note any data transformations (log, differencing)
|
||||
9. **Prediction intervals**: Always provide uncertainty estimates
|
||||
10. **Refit regularly**: Update models as new data arrives
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
1. **Not checking stationarity**: Fit ARIMA on non-stationary data
|
||||
2. **Data leakage**: Using future data in transformations
|
||||
3. **Wrong seasonal period**: S=4 for quarterly, S=12 for monthly
|
||||
4. **Overfitting**: Too many parameters relative to data
|
||||
5. **Ignoring residual autocorrelation**: Model inadequate
|
||||
6. **Using inappropriate metrics**: MAPE fails with zeros or negatives
|
||||
7. **Not handling missing data**: Affects model estimation
|
||||
8. **Extrapolating exogenous variables**: Need future X values for SARIMAX
|
||||
9. **Confusing static vs dynamic forecasts**: Dynamic more realistic for multi-step
|
||||
10. **Not validating forecasts**: Always check out-of-sample performance
|
||||
Reference in New Issue
Block a user