# Time Series Analysis Reference

This document provides comprehensive guidance on time series models in statsmodels, including ARIMA, state space models, VAR, exponential smoothing, and forecasting methods.

## Overview

Statsmodels offers extensive time series capabilities:

- **Univariate models**: AR, ARIMA, SARIMAX, Exponential Smoothing
- **Multivariate models**: VAR, VARMAX, Dynamic Factor Models
- **State space framework**: Custom models, Kalman filtering
- **Diagnostic tools**: ACF, PACF, stationarity tests, residual analysis
- **Forecasting**: Point forecasts and prediction intervals
## Univariate Time Series Models

### AutoReg (AR Model)

Autoregressive model: the current value depends on past values.

**When to use:**

- Univariate time series
- Past values predict future
- Stationary series

**Model**: yₜ = c + φ₁yₜ₋₁ + φ₂yₜ₋₂ + ... + φₚyₜ₋ₚ + εₜ

```python
from statsmodels.tsa.ar_model import AutoReg
import pandas as pd

# Fit AR(p) model
model = AutoReg(y, lags=5)  # AR(5)
results = model.fit()

print(results.summary())
```
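Out-of-sample prediction from a fitted `AutoReg` works by index position; a minimal sketch, assuming `y` is a pandas Series:

```python
# Positions len(y) .. len(y)+9 are the next 10 out-of-sample steps
forecast = results.predict(start=len(y), end=len(y) + 9)
print(forecast)
```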
**With exogenous regressors:**

```python
# AR with exogenous variables (ARX)
model = AutoReg(y, lags=5, exog=X_exog)
results = model.fit()
```

**Seasonal AR:**

```python
# Seasonal dummy terms (e.g., monthly data with yearly seasonality);
# period is required unless y has a pandas index with a recognized frequency
model = AutoReg(y, lags=12, seasonal=True, period=12)
results = model.fit()
```
### ARIMA (Autoregressive Integrated Moving Average)

Combines AR, differencing (I), and MA components.

**When to use:**

- Non-stationary time series (needs differencing)
- Past values and errors predict future
- Flexible model for many time series

**Model**: ARIMA(p,d,q)

- p: AR order (lags)
- d: differencing order (to achieve stationarity)
- q: MA order (lagged forecast errors)

```python
from statsmodels.tsa.arima.model import ARIMA

# Fit ARIMA(p,d,q)
model = ARIMA(y, order=(1, 1, 1))  # ARIMA(1,1,1)
results = model.fit()

print(results.summary())
```
**Choosing p, d, q:**

1. **Determine d (differencing order)**:

```python
from statsmodels.tsa.stattools import adfuller

# ADF test for stationarity
def check_stationarity(series):
    result = adfuller(series)
    print(f"ADF Statistic: {result[0]:.4f}")
    print(f"p-value: {result[1]:.4f}")
    if result[1] <= 0.05:
        print("Series is stationary")
        return True
    else:
        print("Series is non-stationary, needs differencing")
        return False

# Test original series
if not check_stationarity(y):
    # Difference once
    y_diff = y.diff().dropna()
    if not check_stationarity(y_diff):
        # Difference again
        y_diff2 = y_diff.diff().dropna()
        check_stationarity(y_diff2)
```
2. **Determine p and q (ACF/PACF)**:

```python
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

# After differencing to stationarity
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))

# ACF: helps determine q (MA order)
plot_acf(y_stationary, lags=40, ax=ax1)
ax1.set_title('Autocorrelation Function (ACF)')

# PACF: helps determine p (AR order)
plot_pacf(y_stationary, lags=40, ax=ax2)
ax2.set_title('Partial Autocorrelation Function (PACF)')

plt.tight_layout()
plt.show()

# Rules of thumb:
# - PACF cuts off at lag p → AR(p)
# - ACF cuts off at lag q → MA(q)
# - Both decay → ARMA(p,q)
```
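For numeric values rather than plots, `statsmodels.tsa.stattools` exposes `acf` and `pacf` directly; a minimal sketch:

```python
from statsmodels.tsa.stattools import acf, pacf

# Numeric ACF/PACF values with 95% confidence intervals
acf_vals, acf_confint = acf(y_stationary, nlags=40, alpha=0.05)
pacf_vals, pacf_confint = pacf(y_stationary, nlags=40, alpha=0.05)
```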
3. **Model selection (AIC/BIC)**:

```python
# Grid search for best (p,q) given d (set from the stationarity tests above)
import numpy as np

best_aic = np.inf
best_order = None

for p in range(5):
    for q in range(5):
        try:
            model = ARIMA(y, order=(p, d, q))
            results = model.fit()
            if results.aic < best_aic:
                best_aic = results.aic
                best_order = (p, d, q)
        except Exception:
            continue

print(f"Best order: {best_order} with AIC: {best_aic:.2f}")
```
### SARIMAX (Seasonal ARIMA with Exogenous Variables)

Extends ARIMA with seasonality and exogenous regressors.

**When to use:**

- Seasonal patterns (monthly, quarterly data)
- External variables influence the series
- Most flexible univariate model

**Model**: SARIMAX(p,d,q)(P,D,Q,s)

- (p,d,q): Non-seasonal ARIMA
- (P,D,Q,s): Seasonal ARIMA with period s

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Seasonal ARIMA for monthly data (s=12)
model = SARIMAX(y,
                order=(1, 1, 1),               # (p,d,q)
                seasonal_order=(1, 1, 1, 12))  # (P,D,Q,s)
results = model.fit()

print(results.summary())
```
**With exogenous variables:**

```python
# SARIMAX with external predictors
model = SARIMAX(y,
                exog=X_exog,
                order=(1, 1, 1),
                seasonal_order=(1, 1, 1, 12))
results = model.fit()
```

**Example: Monthly sales with trend and seasonality**

```python
# Typical for monthly data: (p,d,q)(P,D,Q,12)
# Start with (1,1,1)(1,1,1,12) or (0,1,1)(0,1,1,12)

model = SARIMAX(monthly_sales,
                order=(0, 1, 1),
                seasonal_order=(0, 1, 1, 12),
                enforce_stationarity=False,
                enforce_invertibility=False)
results = model.fit()
```
### Exponential Smoothing

Weighted averages of past observations with exponentially decreasing weights.

**When to use:**

- Simple, interpretable forecasts
- Trend and/or seasonality present
- No need for explicit model specification

**Types:**

- Simple Exponential Smoothing: no trend, no seasonality
- Holt's method: with trend
- Holt-Winters: with trend and seasonality

```python
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Simple exponential smoothing
model = ExponentialSmoothing(y, trend=None, seasonal=None)
results = model.fit()

# Holt's method (with trend)
model = ExponentialSmoothing(y, trend='add', seasonal=None)
results = model.fit()

# Holt-Winters (trend + seasonality)
model = ExponentialSmoothing(y,
                             trend='add',          # 'add' or 'mul'
                             seasonal='add',       # 'add' or 'mul'
                             seasonal_periods=12)  # e.g., 12 for monthly
results = model.fit()

print(results.summary())
```
**Additive vs Multiplicative:**

```python
# Additive: constant seasonal variation
# yₜ = Level + Trend + Seasonal + Error

# Multiplicative: proportional seasonal variation
# yₜ = Level × Trend × Seasonal × Error

# Choose based on data:
# - Additive: seasonal variation constant over time
# - Multiplicative: seasonal variation increases with level
```
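When the choice is not obvious from a plot, one option is to fit both variants and compare information criteria; a minimal sketch, assuming `y` is strictly positive (which multiplicative components require):

```python
# Fit additive and multiplicative Holt-Winters and compare AIC
add_fit = ExponentialSmoothing(y, trend='add', seasonal='add',
                               seasonal_periods=12).fit()
mul_fit = ExponentialSmoothing(y, trend='add', seasonal='mul',
                               seasonal_periods=12).fit()
print(f"Additive AIC: {add_fit.aic:.2f}, Multiplicative AIC: {mul_fit.aic:.2f}")
```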
**Innovations state space (ETS):**

```python
from statsmodels.tsa.exponential_smoothing.ets import ETSModel

# More robust, state space formulation
model = ETSModel(y,
                 error='add',     # 'add' or 'mul'
                 trend='add',     # 'add', 'mul', or None
                 seasonal='add',  # 'add', 'mul', or None
                 seasonal_periods=12)
results = model.fit()
```
## Multivariate Time Series

### VAR (Vector Autoregression)

System of equations where each variable depends on past values of all variables.

**When to use:**

- Multiple interrelated time series
- Bidirectional relationships
- Granger causality testing

**Model**: Each variable is AR on all variables:

- y₁ₜ = c₁ + φ₁₁y₁ₜ₋₁ + φ₁₂y₂ₜ₋₁ + ... + ε₁ₜ
- y₂ₜ = c₂ + φ₂₁y₁ₜ₋₁ + φ₂₂y₂ₜ₋₁ + ... + ε₂ₜ

```python
from statsmodels.tsa.api import VAR
import pandas as pd

# Data should be a DataFrame with multiple columns
# Each column is a time series
df_multivariate = pd.DataFrame({'series1': y1, 'series2': y2, 'series3': y3})

# Fit VAR
model = VAR(df_multivariate)

# Select lag order using AIC/BIC
lag_order_results = model.select_order(maxlags=15)
print(lag_order_results.summary())

# Fit with optimal lags
results = model.fit(maxlags=5, ic='aic')
print(results.summary())
```
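Forecasting from a fitted VAR requires the last `k_ar` observations as the starting window; a minimal sketch:

```python
# Multi-step VAR forecast from the most recent k_ar observations
lag_order = results.k_ar
var_forecast = results.forecast(df_multivariate.values[-lag_order:], steps=10)

# Point forecasts with lower/upper interval bounds
point, lower, upper = results.forecast_interval(
    df_multivariate.values[-lag_order:], steps=10)
```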
**Granger causality testing:**

```python
# Test if series1 Granger-causes series2
from statsmodels.tsa.stattools import grangercausalitytests

# The second column is tested for Granger-causing the first
test_data = df_multivariate[['series2', 'series1']]

# Test up to max_lag (results_gc avoids shadowing the VAR results above)
max_lag = 5
results_gc = grangercausalitytests(test_data, max_lag)

# P-values for each lag
for lag in range(1, max_lag + 1):
    p_value = results_gc[lag][0]['ssr_ftest'][1]
    print(f"Lag {lag}: p-value = {p_value:.4f}")
```
**Impulse Response Functions (IRF):**

```python
# Trace the effect of a shock through the system
irf = results.irf(10)  # 10 periods ahead

# Plot IRFs
irf.plot(orth=True)  # Orthogonalized (Cholesky decomposition)
plt.show()

# Cumulative effects
irf.plot_cum_effects(orth=True)
plt.show()
```

**Forecast Error Variance Decomposition:**

```python
# Contribution of each variable to forecast error variance
fevd = results.fevd(10)  # 10 periods ahead
fevd.plot()
plt.show()
```
### VARMAX (VAR with Moving Average and Exogenous Variables)

Extends VAR with an MA component and external regressors.

**When to use:**

- VAR inadequate (MA component needed)
- External variables affect the system
- More flexible multivariate model

```python
from statsmodels.tsa.statespace.varmax import VARMAX

# VARMAX(p, q) with exogenous variables
model = VARMAX(df_multivariate,
               order=(1, 1),  # (p, q)
               exog=X_exog)
results = model.fit()

print(results.summary())
```
## State Space Models

Flexible framework for custom time series models.

**When to use:**

- Custom model specification
- Unobserved components
- Kalman filtering/smoothing
- Missing data

```python
from statsmodels.tsa.statespace.mlemodel import MLEModel

# Subclass MLEModel to define custom state space models
# Example: local level model (random walk + noise), sketched below
```
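A minimal sketch of that local level model, closely following the pattern in the statsmodels custom state space documentation (the class and parameter names here are illustrative):

```python
import numpy as np

class LocalLevel(MLEModel):
    """Local level model: y_t = mu_t + eps_t,  mu_t = mu_{t-1} + eta_t."""

    def __init__(self, endog):
        super().__init__(endog, k_states=1, initialization='diffuse')
        self['design', 0, 0] = 1.0      # y_t loads on the level
        self['transition', 0, 0] = 1.0  # level follows a random walk
        self['selection', 0, 0] = 1.0

    @property
    def param_names(self):
        return ['sigma2.measurement', 'sigma2.level']

    @property
    def start_params(self):
        return [np.nanvar(self.endog), np.nanvar(self.endog)]

    def transform_params(self, unconstrained):
        return unconstrained ** 2  # keep variances positive

    def untransform_params(self, constrained):
        return constrained ** 0.5

    def update(self, params, **kwargs):
        params = super().update(params, **kwargs)
        self['obs_cov', 0, 0] = params[0]    # measurement noise variance
        self['state_cov', 0, 0] = params[1]  # level innovation variance

results = LocalLevel(y).fit()
print(results.summary())
```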
**Dynamic Factor Models:**

```python
from statsmodels.tsa.statespace.dynamic_factor import DynamicFactor

# Extract common factors from multiple time series
model = DynamicFactor(df_multivariate,
                      k_factors=2,     # Number of factors
                      factor_order=2)  # AR order of factors
results = model.fit()

# Estimated factors
factors = results.factors.filtered
```
## Forecasting

### Point Forecasts

```python
# ARIMA forecasting
model = ARIMA(y, order=(1, 1, 1))
results = model.fit()

# Forecast h steps ahead
h = 10
forecast = results.forecast(steps=h)

# With exogenous variables (SARIMAX)
model = SARIMAX(y, exog=X, order=(1, 1, 1))
results = model.fit()

# Future exogenous values are required for the forecast horizon
forecast = results.forecast(steps=h, exog=X_future)
```
### Prediction Intervals

```python
# Get forecast with confidence intervals
forecast_obj = results.get_forecast(steps=h)
forecast_df = forecast_obj.summary_frame()

print(forecast_df)
# Contains: mean, mean_se, mean_ci_lower, mean_ci_upper

# Extract components
forecast_mean = forecast_df['mean']
forecast_ci_lower = forecast_df['mean_ci_lower']
forecast_ci_upper = forecast_df['mean_ci_upper']

# Plot
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
plt.plot(y.index, y, label='Historical')
plt.plot(forecast_df.index, forecast_mean, label='Forecast', color='red')
plt.fill_between(forecast_df.index,
                 forecast_ci_lower,
                 forecast_ci_upper,
                 alpha=0.3, color='red', label='95% CI')
plt.legend()
plt.title('Forecast with Prediction Intervals')
plt.show()
```
### Dynamic vs Static Forecasts

```python
# Static (one-step-ahead, using actual values)
static_forecast = results.get_prediction(start=split_point, end=len(y) - 1)

# Dynamic (multi-step, using predicted values)
dynamic_forecast = results.get_prediction(start=split_point,
                                          end=len(y) - 1,
                                          dynamic=True)

# Plot comparison
fig, ax = plt.subplots(figsize=(12, 6))
y.plot(ax=ax, label='Actual')
static_forecast.predicted_mean.plot(ax=ax, label='Static forecast')
dynamic_forecast.predicted_mean.plot(ax=ax, label='Dynamic forecast')
ax.legend()
plt.show()
```
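To quantify the difference, compare each against the held-out actuals; a minimal sketch, assuming `split_point` is an integer position in `y`:

```python
import numpy as np

actual = y.iloc[split_point:]
static_rmse = np.sqrt(np.mean((actual - static_forecast.predicted_mean) ** 2))
dynamic_rmse = np.sqrt(np.mean((actual - dynamic_forecast.predicted_mean) ** 2))
print(f"Static RMSE: {static_rmse:.4f}, Dynamic RMSE: {dynamic_rmse:.4f}")
```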
## Diagnostic Tests

### Stationarity Tests

```python
from statsmodels.tsa.stattools import adfuller, kpss

# Augmented Dickey-Fuller (ADF) test
# H0: unit root (non-stationary)
adf_result = adfuller(y, autolag='AIC')
print(f"ADF Statistic: {adf_result[0]:.4f}")
print(f"p-value: {adf_result[1]:.4f}")
if adf_result[1] <= 0.05:
    print("Reject H0: Series is stationary")
else:
    print("Fail to reject H0: Series is non-stationary")

# KPSS test
# H0: stationary (opposite of ADF)
kpss_result = kpss(y, regression='c', nlags='auto')
print(f"KPSS Statistic: {kpss_result[0]:.4f}")
print(f"p-value: {kpss_result[1]:.4f}")
if kpss_result[1] <= 0.05:
    print("Reject H0: Series is non-stationary")
else:
    print("Fail to reject H0: Series is stationary")
```

Because the null hypotheses point in opposite directions, the two tests are most informative together: if ADF rejects and KPSS does not, treat the series as stationary; if ADF fails to reject and KPSS rejects, difference the series; conflicting results warrant a closer look at trends and sample size.
### Residual Diagnostics

```python
# Ljung-Box test for autocorrelation in residuals
from statsmodels.stats.diagnostic import acorr_ljungbox

lb_test = acorr_ljungbox(results.resid, lags=10, return_df=True)
print(lb_test)
# P-values > 0.05 indicate no significant autocorrelation (good)

# Plot residual diagnostics
results.plot_diagnostics(figsize=(12, 8))
plt.show()

# Components:
# 1. Standardized residuals over time
# 2. Histogram + KDE of residuals
# 3. Q-Q plot for normality
# 4. Correlogram (ACF of residuals)
```
### Heteroskedasticity Tests

```python
from statsmodels.stats.diagnostic import het_arch

# Engle's ARCH test for conditional heteroskedasticity
arch_test = het_arch(results.resid, nlags=10)
print(f"ARCH LM statistic: {arch_test[0]:.4f}")
print(f"p-value: {arch_test[1]:.4f}")

# If significant, consider a GARCH model (see below)
```
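Statsmodels itself does not ship a GARCH estimator; a common choice is the separate `arch` package. A minimal sketch, assuming `arch` is installed and `results.resid` is a pandas Series:

```python
from arch import arch_model

# GARCH(1,1) on the residuals (rescaling the data can help the optimizer)
garch = arch_model(results.resid.dropna(), vol='GARCH', p=1, q=1)
garch_fit = garch.fit(disp='off')
print(garch_fit.summary())
```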
## Seasonal Decomposition

```python
from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose into trend, seasonal, residual
decomposition = seasonal_decompose(y,
                                   model='additive',  # or 'multiplicative'
                                   period=12)         # seasonal period

# Plot components
fig = decomposition.plot()
fig.set_size_inches(12, 8)
plt.show()

# Access components
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

# STL decomposition (more robust)
from statsmodels.tsa.seasonal import STL

stl = STL(y, seasonal=13)  # seasonal smoother length must be odd
stl_result = stl.fit()

fig = stl_result.plot()
plt.show()
```
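The STL components can be turned into a rough "seasonal strength" score (a heuristic from Hyndman & Athanasopoulos, *Forecasting: Principles and Practice*); a sketch:

```python
import numpy as np

# F_s = max(0, 1 - Var(remainder) / Var(seasonal + remainder))
# Values near 1 indicate strong seasonality, near 0 little to none
remainder = stl_result.resid
seasonal_comp = stl_result.seasonal
f_s = max(0.0, 1 - np.var(remainder) / np.var(seasonal_comp + remainder))
print(f"Seasonal strength: {f_s:.2f}")
```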
## Model Evaluation

### In-Sample Metrics

```python
import numpy as np

# From the results object
print(f"AIC: {results.aic:.2f}")
print(f"BIC: {results.bic:.2f}")
print(f"Log-likelihood: {results.llf:.2f}")

# MSE on training data
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y, results.fittedvalues)
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:.4f}")

# MAE
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y, results.fittedvalues)
print(f"MAE: {mae:.4f}")
```
### Out-of-Sample Evaluation

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Train-test split for time series (no shuffle!)
train_size = int(0.8 * len(y))
y_train = y[:train_size]
y_test = y[train_size:]

# Fit on training data
model = ARIMA(y_train, order=(1, 1, 1))
results = model.fit()

# Forecast the test period
forecast = results.forecast(steps=len(y_test))

# Metrics
rmse = np.sqrt(mean_squared_error(y_test, forecast))
mae = mean_absolute_error(y_test, forecast)
mape = np.mean(np.abs((y_test - forecast) / y_test)) * 100

print(f"Test RMSE: {rmse:.4f}")
print(f"Test MAE: {mae:.4f}")
print(f"Test MAPE: {mape:.2f}%")
```
### Rolling Forecast

```python
# More realistic evaluation: rolling one-step-ahead forecasts
forecasts = []

for t in range(len(y_test)):
    # Refit with each new observation
    y_current = y[:train_size + t]
    model = ARIMA(y_current, order=(1, 1, 1))
    fit = model.fit()

    # One-step forecast
    fc = fit.forecast(steps=1).iloc[0]
    forecasts.append(fc)

forecasts = np.array(forecasts)

rmse = np.sqrt(mean_squared_error(y_test, forecasts))
print(f"Rolling forecast RMSE: {rmse:.4f}")
```
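Refitting at every step is expensive; state space results also expose `append`, which extends a fitted model with new observations and reuses the existing parameter estimates when `refit=False`. A minimal sketch, assuming `y_test` is a pandas Series:

```python
# Faster rolling forecast: extend the fitted results instead of refitting
fit = ARIMA(y_train, order=(1, 1, 1)).fit()
forecasts = []

for t in range(len(y_test)):
    forecasts.append(fit.forecast(steps=1).iloc[0])
    # Append the observation that just became available
    fit = fit.append(y_test.iloc[t:t + 1], refit=False)

forecasts = np.array(forecasts)
```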
### Cross-Validation

```python
# Time series cross-validation (expanding window)
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
rmse_scores = []

for train_idx, test_idx in tscv.split(y):
    y_train_cv = y.iloc[train_idx]
    y_test_cv = y.iloc[test_idx]

    model = ARIMA(y_train_cv, order=(1, 1, 1))
    results = model.fit()

    forecast = results.forecast(steps=len(test_idx))
    rmse = np.sqrt(mean_squared_error(y_test_cv, forecast))
    rmse_scores.append(rmse)

print(f"CV RMSE: {np.mean(rmse_scores):.4f} ± {np.std(rmse_scores):.4f}")
```
## Advanced Topics

### ARDL (Autoregressive Distributed Lag)

Bridges univariate and multivariate time series.

```python
from statsmodels.tsa.ardl import ARDL

# ARDL(p, q) model
# y depends on its own lags and lags of X
model = ARDL(y, lags=2, exog=X, order=2)  # order sets the exog lag length
results = model.fit()
```
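Lag lengths can also be chosen automatically with `ardl_select_order`; a minimal sketch:

```python
from statsmodels.tsa.ardl import ardl_select_order

# Search over own lags up to 4 and exog lags up to 4, ranked by AIC
selection = ardl_select_order(y, maxlag=4, exog=X, maxorder=4, ic='aic')
results = selection.model.fit()
print(results.summary())
```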
### Error Correction Models

For cointegrated series.

```python
from statsmodels.tsa.vector_ar.vecm import coint_johansen

# Test for cointegration
johansen_test = coint_johansen(df_multivariate, det_order=0, k_ar_diff=1)

# Trace statistics vs critical values (90%, 95%, 99%) per cointegration rank
print(johansen_test.lr1)
print(johansen_test.cvt)

# Fit VECM if cointegrated
from statsmodels.tsa.vector_ar.vecm import VECM

model = VECM(df_multivariate, k_ar_diff=1, coint_rank=1)
results = model.fit()
```
### Regime Switching Models

For structural breaks and regime changes.

```python
from statsmodels.tsa.regime_switching.markov_regression import MarkovRegression

# Markov switching model with two regimes
model = MarkovRegression(y, k_regimes=2)
results = model.fit()

# Smoothed probabilities of regimes
regime_probs = results.smoothed_marginal_probabilities

# For regime-switching AR dynamics, use MarkovAutoregression instead
```
## Best Practices

1. **Check stationarity**: Difference if needed, verify with ADF/KPSS tests
2. **Plot data**: Always visualize before modeling
3. **Identify seasonality**: Use appropriate seasonal models (SARIMAX, Holt-Winters)
4. **Model selection**: Use AIC/BIC and out-of-sample validation
5. **Residual diagnostics**: Check for autocorrelation, normality, heteroskedasticity
6. **Forecast evaluation**: Use rolling forecasts and proper time series CV
7. **Avoid overfitting**: Prefer simpler models, use information criteria
8. **Document assumptions**: Note any data transformations (log, differencing)
9. **Prediction intervals**: Always provide uncertainty estimates
10. **Refit regularly**: Update models as new data arrives

## Common Pitfalls

1. **Not checking stationarity**: Fitting ARIMA on non-stationary data
2. **Data leakage**: Using future data in transformations
3. **Wrong seasonal period**: s=4 for quarterly, s=12 for monthly
4. **Overfitting**: Too many parameters relative to data
5. **Ignoring residual autocorrelation**: Sign of an inadequate model
6. **Using inappropriate metrics**: MAPE fails with zeros or negatives (see the sMAPE sketch after this list)
7. **Not handling missing data**: Affects model estimation
8. **Extrapolating exogenous variables**: Future X values are needed for SARIMAX forecasts
9. **Confusing static vs dynamic forecasts**: Dynamic is more realistic for multi-step
10. **Not validating forecasts**: Always check out-of-sample performance
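One common workaround for pitfall 6 is symmetric MAPE (sMAPE), which stays defined when actuals hit zero, though it is still unstable when both actual and forecast are near zero; a sketch:

```python
import numpy as np

def smape(actual, forecast):
    """Symmetric MAPE in percent; defined as long as |actual| + |forecast| > 0."""
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    denom = (np.abs(actual) + np.abs(forecast)) / 2
    return np.mean(np.abs(actual - forecast) / denom) * 100

print(f"Test sMAPE: {smape(y_test, forecast):.2f}%")
```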