Time Series Analysis Reference
This document provides comprehensive guidance on time series models in statsmodels, including ARIMA, state space models, VAR, exponential smoothing, and forecasting methods.
Overview
Statsmodels offers extensive time series capabilities:
- Univariate models: AR, ARIMA, SARIMAX, Exponential Smoothing
- Multivariate models: VAR, VARMAX, Dynamic Factor Models
- State space framework: Custom models, Kalman filtering
- Diagnostic tools: ACF, PACF, stationarity tests, residual analysis
- Forecasting: Point forecasts and prediction intervals
Univariate Time Series Models
AutoReg (AR Model)
Autoregressive model: current value depends on past values.
When to use:
- Univariate time series
- Past values predict future
- Stationary series
Model: yₜ = c + φ₁yₜ₋₁ + φ₂yₜ₋₂ + ... + φₚyₜ₋ₚ + εₜ
from statsmodels.tsa.ar_model import AutoReg
import pandas as pd
# Fit AR(p) model
model = AutoReg(y, lags=5) # AR(5)
results = model.fit()
print(results.summary())
With exogenous regressors:
# AR with exogenous variables (ARX)
model = AutoReg(y, lags=5, exog=X_exog)
results = model.fit()
Seasonal AR:
# Seasonal dummies plus 12 lags (e.g., monthly data with yearly seasonality)
# period is inferred from a dated pandas index; pass it explicitly otherwise
model = AutoReg(y, lags=12, seasonal=True, period=12)
results = model.fit()
ARIMA (Autoregressive Integrated Moving Average)
Combines AR, differencing (I), and MA components.
When to use:
- Non-stationary time series (needs differencing)
- Past values and errors predict future
- Flexible model for many time series
Model: ARIMA(p,d,q)
- p: AR order (lags)
- d: differencing order (to achieve stationarity)
- q: MA order (lagged forecast errors)
from statsmodels.tsa.arima.model import ARIMA
# Fit ARIMA(p,d,q)
model = ARIMA(y, order=(1, 1, 1)) # ARIMA(1,1,1)
results = model.fit()
print(results.summary())
Choosing p, d, q:
- Determine d (differencing order):
from statsmodels.tsa.stattools import adfuller
# ADF test for stationarity
def check_stationarity(series):
    result = adfuller(series)
    print(f"ADF Statistic: {result[0]:.4f}")
    print(f"p-value: {result[1]:.4f}")
    if result[1] <= 0.05:
        print("Series is stationary")
        return True
    else:
        print("Series is non-stationary, needs differencing")
        return False
# Test original series
if not check_stationarity(y):
    # Difference once
    y_diff = y.diff().dropna()
    if not check_stationarity(y_diff):
        # Difference again
        y_diff2 = y_diff.diff().dropna()
        check_stationarity(y_diff2)
- Determine p and q (ACF/PACF):
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
# After differencing to stationarity
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
# ACF: helps determine q (MA order)
plot_acf(y_stationary, lags=40, ax=ax1)
ax1.set_title('Autocorrelation Function (ACF)')
# PACF: helps determine p (AR order)
plot_pacf(y_stationary, lags=40, ax=ax2)
ax2.set_title('Partial Autocorrelation Function (PACF)')
plt.tight_layout()
plt.show()
# Rules of thumb:
# - PACF cuts off at lag p → AR(p)
# - ACF cuts off at lag q → MA(q)
# - Both decay → ARMA(p,q)
- Model selection (AIC/BIC):
# Grid search for best (p, q) given d
import numpy as np
best_aic = np.inf
best_order = None
for p in range(5):
    for q in range(5):
        try:
            model = ARIMA(y, order=(p, d, q))
            results = model.fit()
            if results.aic < best_aic:
                best_aic = results.aic
                best_order = (p, d, q)
        except Exception:
            continue
print(f"Best order: {best_order} with AIC: {best_aic:.2f}")
SARIMAX (Seasonal ARIMA with Exogenous Variables)
Extends ARIMA with seasonality and exogenous regressors.
When to use:
- Seasonal patterns (monthly, quarterly data)
- External variables influence series
- Most flexible univariate model
Model: SARIMAX(p,d,q)(P,D,Q,s)
- (p,d,q): Non-seasonal ARIMA
- (P,D,Q,s): Seasonal ARIMA with period s
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Seasonal ARIMA for monthly data (s=12)
model = SARIMAX(y,
                order=(1, 1, 1),               # (p, d, q)
                seasonal_order=(1, 1, 1, 12))  # (P, D, Q, s)
results = model.fit()
print(results.summary())
With exogenous variables:
# SARIMAX with external predictors
model = SARIMAX(y,
                exog=X_exog,
                order=(1, 1, 1),
                seasonal_order=(1, 1, 1, 12))
results = model.fit()
Example: Monthly sales with trend and seasonality
# Typical for monthly data: (p,d,q)(P,D,Q,12)
# Start with (1,1,1)(1,1,1,12) or (0,1,1)(0,1,1,12)
model = SARIMAX(monthly_sales,
                order=(0, 1, 1),
                seasonal_order=(0, 1, 1, 12),
                enforce_stationarity=False,
                enforce_invertibility=False)
results = model.fit()
Exponential Smoothing
Weighted averages of past observations with exponentially decreasing weights.
When to use:
- Simple, interpretable forecasts
- Trend and/or seasonality present
- No need for explicit model specification
Types:
- Simple Exponential Smoothing: no trend, no seasonality
- Holt's method: with trend
- Holt-Winters: with trend and seasonality
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Simple exponential smoothing
model = ExponentialSmoothing(y, trend=None, seasonal=None)
results = model.fit()
# Holt's method (with trend)
model = ExponentialSmoothing(y, trend='add', seasonal=None)
results = model.fit()
# Holt-Winters (trend + seasonality)
model = ExponentialSmoothing(y,
                             trend='add',          # 'add' or 'mul'
                             seasonal='add',       # 'add' or 'mul'
                             seasonal_periods=12)  # e.g., 12 for monthly
results = model.fit()
print(results.summary())
Additive vs Multiplicative:
# Additive: constant seasonal variation
# yₜ = Level + Trend + Seasonal + Error
# Multiplicative: proportional seasonal variation
# yₜ = Level × Trend × Seasonal × Error
# Choose based on data:
# - Additive: seasonal variation constant over time
# - Multiplicative: seasonal variation increases with level
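When a plot does not make the choice obvious, one pragmatic approach is to fit both variants and compare information criteria; a minimal sketch (assumes y is strictly positive, as the multiplicative form requires):
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Fit both seasonal forms and compare AIC (lower is better)
fit_add = ExponentialSmoothing(y, trend='add', seasonal='add',
                               seasonal_periods=12).fit()
fit_mul = ExponentialSmoothing(y, trend='add', seasonal='mul',
                               seasonal_periods=12).fit()
print(f"Additive AIC: {fit_add.aic:.2f}, multiplicative AIC: {fit_mul.aic:.2f}")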
Innovations state space (ETS):
from statsmodels.tsa.exponential_smoothing.ets import ETSModel
# More robust, state space formulation
model = ETSModel(y,
                 error='add',     # 'add' or 'mul'
                 trend='add',     # 'add', 'mul', or None
                 seasonal='add',  # 'add', 'mul', or None
                 seasonal_periods=12)
results = model.fit()
Multivariate Time Series
VAR (Vector Autoregression)
System of equations where each variable depends on past values of all variables.
When to use:
- Multiple interrelated time series
- Bidirectional relationships
- Granger causality testing
Model: each equation regresses one variable on its own lags and the lags of every other variable:
- y₁ₜ = c₁ + φ₁₁y₁ₜ₋₁ + φ₁₂y₂ₜ₋₁ + ... + ε₁ₜ
- y₂ₜ = c₂ + φ₂₁y₁ₜ₋₁ + φ₂₂y₂ₜ₋₁ + ... + ε₂ₜ
from statsmodels.tsa.api import VAR
import pandas as pd
# Data should be DataFrame with multiple columns
# Each column is a time series
df_multivariate = pd.DataFrame({'series1': y1, 'series2': y2, 'series3': y3})
# Fit VAR
model = VAR(df_multivariate)
# Select lag order using AIC/BIC
lag_order_results = model.select_order(maxlags=15)
print(lag_order_results.summary())
# Fit with optimal lags
results = model.fit(maxlags=5, ic='aic')
print(results.summary())
Granger causality testing:
# Test if series1 Granger-causes series2
from statsmodels.tsa.stattools import grangercausalitytests
# Column order matters: the test checks whether the second column
# Granger-causes the first, so [series2, series1] tests series1 → series2
test_data = df_multivariate[['series2', 'series1']]
# Test up to max_lag
max_lag = 5
results = grangercausalitytests(test_data, maxlag=max_lag)
# P-values for each lag
for lag in range(1, max_lag + 1):
    p_value = results[lag][0]['ssr_ftest'][1]
    print(f"Lag {lag}: p-value = {p_value:.4f}")
Impulse Response Functions (IRF):
# Trace effect of shock through system
irf = results.irf(10) # 10 periods ahead
# Plot IRFs
irf.plot(orth=True) # Orthogonalized (Cholesky decomposition)
plt.show()
# Cumulative effects
irf.plot_cum_effects(orth=True)
plt.show()
Forecast Error Variance Decomposition:
# Contribution of each variable to forecast error variance
fevd = results.fevd(10) # 10 periods ahead
fevd.plot()
plt.show()
VARMAX (VAR with Moving Average and Exogenous Variables)
Extends VAR with MA component and external regressors.
When to use:
- VAR inadequate (MA component needed)
- External variables affect system
- More flexible multivariate model
from statsmodels.tsa.statespace.varmax import VARMAX
# VARMAX(p, q) with exogenous variables
model = VARMAX(df_multivariate,
               order=(1, 1),  # (p, q)
               exog=X_exog)
results = model.fit()
print(results.summary())
State Space Models
Flexible framework for custom time series models.
When to use:
- Custom model specification
- Unobserved components
- Kalman filtering/smoothing
- Missing data
from statsmodels.tsa.statespace.mlemodel import MLEModel
# Extend MLEModel for custom state space models
# Example: Local level model (random walk + noise)
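A minimal sketch of that local level model, following the pattern used in the statsmodels custom state space documentation (parameter names and starting values here are illustrative choices):
import numpy as np
from statsmodels.tsa.statespace.mlemodel import MLEModel

class LocalLevel(MLEModel):
    """Local level model: yₜ = μₜ + εₜ, μₜ = μₜ₋₁ + ηₜ."""

    def __init__(self, endog):
        # One unobserved state (the level), diffuse initialization
        super().__init__(endog, k_states=1, initialization='diffuse')
        self['design', 0, 0] = 1.0      # observation loads on the level
        self['transition', 0, 0] = 1.0  # level follows a random walk
        self['selection', 0, 0] = 1.0

    @property
    def param_names(self):
        return ['sigma2.level', 'sigma2.irregular']

    @property
    def start_params(self):
        return np.array([1.0, 1.0])

    def transform_params(self, unconstrained):
        return unconstrained ** 2  # keep variances positive

    def untransform_params(self, constrained):
        return constrained ** 0.5

    def update(self, params, **kwargs):
        params = super().update(params, **kwargs)
        self['state_cov', 0, 0] = params[0]  # level (state) variance
        self['obs_cov', 0, 0] = params[1]    # measurement variance

ll_results = LocalLevel(y).fit()
print(ll_results.summary())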
Dynamic Factor Models:
from statsmodels.tsa.statespace.dynamic_factor import DynamicFactor
# Extract common factors from multiple time series
model = DynamicFactor(df_multivariate,
                      k_factors=2,     # Number of factors
                      factor_order=2)  # AR order of factors
results = model.fit()
# Estimated factors
factors = results.factors.filtered
Forecasting
Point Forecasts
# ARIMA forecasting
model = ARIMA(y, order=(1, 1, 1))
results = model.fit()
# Forecast h steps ahead
h = 10
forecast = results.forecast(steps=h)
# With exogenous variables (SARIMAX)
model = SARIMAX(y, exog=X, order=(1, 1, 1))
results = model.fit()
# Need future exogenous values
forecast = results.forecast(steps=h, exog=X_future)
Prediction Intervals
# Get forecast with confidence intervals
forecast_obj = results.get_forecast(steps=h)
forecast_df = forecast_obj.summary_frame()
print(forecast_df)
# Contains: mean, mean_se, mean_ci_lower, mean_ci_upper
# Extract components
forecast_mean = forecast_df['mean']
forecast_ci_lower = forecast_df['mean_ci_lower']
forecast_ci_upper = forecast_df['mean_ci_upper']
# Plot
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 6))
plt.plot(y.index, y, label='Historical')
plt.plot(forecast_df.index, forecast_mean, label='Forecast', color='red')
plt.fill_between(forecast_df.index,
                 forecast_ci_lower,
                 forecast_ci_upper,
                 alpha=0.3, color='red', label='95% CI')
plt.legend()
plt.title('Forecast with Prediction Intervals')
plt.show()
Dynamic vs Static Forecasts
# Static (one-step-ahead, using actual values)
static_forecast = results.get_prediction(start=split_point, end=len(y)-1)
# Dynamic (multi-step, using predicted values)
dynamic_forecast = results.get_prediction(start=split_point,
                                          end=len(y)-1,
                                          dynamic=True)
# Plot comparison
fig, ax = plt.subplots(figsize=(12, 6))
y.plot(ax=ax, label='Actual')
static_forecast.predicted_mean.plot(ax=ax, label='Static forecast')
dynamic_forecast.predicted_mean.plot(ax=ax, label='Dynamic forecast')
ax.legend()
plt.show()
Diagnostic Tests
Stationarity Tests
from statsmodels.tsa.stattools import adfuller, kpss
# Augmented Dickey-Fuller (ADF) test
# H0: unit root (non-stationary)
adf_result = adfuller(y, autolag='AIC')
print(f"ADF Statistic: {adf_result[0]:.4f}")
print(f"p-value: {adf_result[1]:.4f}")
if adf_result[1] <= 0.05:
    print("Reject H0: Series is stationary")
else:
    print("Fail to reject H0: Series is non-stationary")
# KPSS test
# H0: stationary (opposite of ADF)
kpss_result = kpss(y, regression='c', nlags='auto')
print(f"KPSS Statistic: {kpss_result[0]:.4f}")
print(f"p-value: {kpss_result[1]:.4f}")
if kpss_result[1] <= 0.05:
    print("Reject H0: Series is non-stationary")
else:
    print("Fail to reject H0: Series is stationary")
Residual Diagnostics
# Ljung-Box test for autocorrelation in residuals
from statsmodels.stats.diagnostic import acorr_ljungbox
lb_test = acorr_ljungbox(results.resid, lags=10, return_df=True)
print(lb_test)
# P-values > 0.05 indicate no significant autocorrelation (good)
# Plot residual diagnostics
results.plot_diagnostics(figsize=(12, 8))
plt.show()
# Components:
# 1. Standardized residuals over time
# 2. Histogram + KDE of residuals
# 3. Q-Q plot for normality
# 4. Correlogram (ACF of residuals)
Heteroskedasticity Tests
from statsmodels.stats.diagnostic import het_arch
# ARCH test for heteroskedasticity
arch_test = het_arch(results.resid, nlags=10)
print(f"ARCH test statistic: {arch_test[0]:.4f}")
print(f"p-value: {arch_test[1]:.4f}")
# If significant, consider GARCH model
Seasonal Decomposition
from statsmodels.tsa.seasonal import seasonal_decompose
# Decompose into trend, seasonal, residual
decomposition = seasonal_decompose(y,
                                   model='additive',  # or 'multiplicative'
                                   period=12)         # seasonal period
# Plot components
fig = decomposition.plot()
fig.set_size_inches(12, 8)
plt.show()
# Access components
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
# STL decomposition (more robust)
from statsmodels.tsa.seasonal import STL
stl = STL(y, seasonal=13) # seasonal must be odd
stl_result = stl.fit()
fig = stl_result.plot()
plt.show()
Model Evaluation
In-Sample Metrics
# From results object
print(f"AIC: {results.aic:.2f}")
print(f"BIC: {results.bic:.2f}")
print(f"Log-likelihood: {results.llf:.2f}")
# MSE on training data
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y, results.fittedvalues)
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:.4f}")
# MAE
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y, results.fittedvalues)
print(f"MAE: {mae:.4f}")
Out-of-Sample Evaluation
# Train-test split for time series (no shuffle!)
train_size = int(0.8 * len(y))
y_train = y[:train_size]
y_test = y[train_size:]
# Fit on training data
model = ARIMA(y_train, order=(1, 1, 1))
results = model.fit()
# Forecast test period
forecast = results.forecast(steps=len(y_test))
# Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error
rmse = np.sqrt(mean_squared_error(y_test, forecast))
mae = mean_absolute_error(y_test, forecast)
mape = np.mean(np.abs((y_test - forecast) / y_test)) * 100
print(f"Test RMSE: {rmse:.4f}")
print(f"Test MAE: {mae:.4f}")
print(f"Test MAPE: {mape:.2f}%")
Rolling Forecast
# More realistic evaluation: rolling one-step-ahead forecasts
forecasts = []
for t in range(len(y_test)):
    # Refit using all data up to the current point
    y_current = y[:train_size + t]
    model = ARIMA(y_current, order=(1, 1, 1))
    fit = model.fit()
    # One-step-ahead forecast
    fc = fit.forecast(steps=1).iloc[0]
    forecasts.append(fc)
forecasts = np.array(forecasts)
rmse = np.sqrt(mean_squared_error(y_test, forecasts))
print(f"Rolling forecast RMSE: {rmse:.4f}")
Cross-Validation
# Time series cross-validation (expanding window)
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
rmse_scores = []
for train_idx, test_idx in tscv.split(y):
    y_train_cv = y.iloc[train_idx]
    y_test_cv = y.iloc[test_idx]
    model = ARIMA(y_train_cv, order=(1, 1, 1))
    results = model.fit()
    forecast = results.forecast(steps=len(test_idx))
    rmse = np.sqrt(mean_squared_error(y_test_cv, forecast))
    rmse_scores.append(rmse)
print(f"CV RMSE: {np.mean(rmse_scores):.4f} ± {np.std(rmse_scores):.4f}")
Advanced Topics
ARDL (Autoregressive Distributed Lag)
Bridges univariate and multivariate time series.
from statsmodels.tsa.ardl import ARDL
# ARDL(p, q) model
# y depends on its own lags (lags=p) and lags of X (order=q)
model = ARDL(y, lags=2, exog=X, order=2)
results = model.fit()
Error Correction Models
For cointegrated series.
from statsmodels.tsa.vector_ar.vecm import coint_johansen
# Test for cointegration
johansen_test = coint_johansen(df_multivariate, det_order=0, k_ar_diff=1)
# Fit VECM if cointegrated
from statsmodels.tsa.vector_ar.vecm import VECM
model = VECM(df_multivariate, k_ar_diff=1, coint_rank=1)
results = model.fit()
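coint_johansen returns raw statistics rather than a summary; one hedged way to read off the rank is the sequential trace test against the tabulated critical values (attribute names below are from the statsmodels implementation):
# Trace statistics vs. critical values (cvt columns: 90%, 95%, 99%)
trace_stats = johansen_test.lr1
crit_vals_95 = johansen_test.cvt[:, 1]
# Sequential test: stop at the first rank whose H0 is not rejected
coint_rank = 0
for r in range(len(trace_stats)):
    if trace_stats[r] > crit_vals_95[r]:
        coint_rank = r + 1
    else:
        break
print(f"Estimated cointegration rank at 5%: {coint_rank}")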
Regime Switching Models
For structural breaks and regime changes.
from statsmodels.tsa.regime_switching.markov_regression import MarkovRegression
# Markov switching regression (regime-dependent mean and variance)
model = MarkovRegression(y, k_regimes=2, switching_variance=True)
results = model.fit()
# For regime-switching AR dynamics, use MarkovAutoregression instead:
# from statsmodels.tsa.regime_switching.markov_autoregression import MarkovAutoregression
# model = MarkovAutoregression(y, k_regimes=2, order=1)
# Smoothed probabilities of regimes
regime_probs = results.smoothed_marginal_probabilities
Best Practices
- Check stationarity: Difference if needed, verify with ADF/KPSS tests
- Plot data: Always visualize before modeling
- Identify seasonality: Use appropriate seasonal models (SARIMAX, Holt-Winters)
- Model selection: Use AIC/BIC and out-of-sample validation
- Residual diagnostics: Check for autocorrelation, normality, heteroskedasticity
- Forecast evaluation: Use rolling forecasts and proper time series CV
- Avoid overfitting: Prefer simpler models, use information criteria
- Document assumptions: Note any data transformations (log, differencing)
- Prediction intervals: Always provide uncertainty estimates
- Refit regularly: Update models as new data arrives (see the sketch below)
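For state space results (ARIMA, SARIMAX), a fitted results object can be extended with new observations without re-estimating parameters via append; a minimal sketch, where y_new stands in for the newly observed values (an assumed name):
# refit=False keeps the existing parameter estimates;
# do a full refit periodically as enough new data accumulates
updated = results.append(y_new, refit=False)
next_forecast = updated.forecast(steps=12)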
Common Pitfalls
- Not checking stationarity: fitting ARIMA to non-stationary data without differencing
- Data leakage: Using future data in transformations
- Wrong seasonal period: use s=4 for quarterly data, s=12 for monthly
- Overfitting: Too many parameters relative to data
- Ignoring residual autocorrelation: Model inadequate
- Using inappropriate metrics: MAPE fails with zeros or negative values (see the sketch after this list)
- Not handling missing data: Affects model estimation
- Extrapolating exogenous variables: Need future X values for SARIMAX
- Confusing static vs dynamic forecasts: Dynamic more realistic for multi-step
- Not validating forecasts: Always check out-of-sample performance
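For the metrics pitfall above, one hedged alternative when zeros break MAPE is symmetric MAPE (sMAPE); it is still undefined when actual and forecast are both zero, so treat this as a sketch:
import numpy as np

def smape(actual, forecast):
    # Symmetric MAPE in percent; tolerates zeros in either series,
    # but is still undefined when actual and forecast are both zero
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denom = (np.abs(actual) + np.abs(forecast)) / 2
    return np.mean(np.abs(forecast - actual) / denom) * 100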