Discrete Choice Models Reference

This document provides comprehensive guidance on discrete choice models in statsmodels, including binary, multinomial, count, and ordinal models.

Overview

Discrete choice models handle outcomes that are:

  • Binary: 0/1, success/failure
  • Multinomial: Multiple unordered categories
  • Ordinal: Ordered categories
  • Count: Non-negative integers

All of these models are estimated by maximum likelihood and assume independent observations.

Binary Models

Logit (Logistic Regression)

Uses logistic distribution for binary outcomes.

When to use:

  • Binary classification (yes/no, success/failure)
  • Probability estimation for binary outcomes
  • Interpretable odds ratios

Model: P(Y=1|X) = 1 / (1 + exp(-Xβ))

import statsmodels.api as sm
from statsmodels.discrete.discrete_model import Logit

# Prepare data
X = sm.add_constant(X_data)

# Fit model
model = Logit(y, X)
results = model.fit()

print(results.summary())

Interpretation:

import numpy as np

# Odds ratios
odds_ratios = np.exp(results.params)
print("Odds ratios:", odds_ratios)

# For 1-unit increase in X, odds multiply by exp(β)
# OR > 1: increases odds of success
# OR < 1: decreases odds of success
# OR = 1: no effect

# Confidence intervals for odds ratios
odds_ci = np.exp(results.conf_int())
print("Odds ratio 95% CI:")
print(odds_ci)

Marginal effects:

# Average marginal effects (AME) - averaged over the sample (the default)
marginal_effects = results.get_margeff(at='overall')
print(marginal_effects.summary())

# Marginal effects at the means of the regressors (MEM)
marginal_effects_mem = results.get_margeff(at='mean')

# Marginal effects at representative values
# (atexog maps zero-based exog column indices to the values to hold fixed)
marginal_effects_custom = results.get_margeff(at='mean',
                                              atexog={1: 1, 2: 5})

Predictions:

# Predicted probabilities
probs = results.predict(X)

# Binary predictions (0.5 threshold)
predictions = (probs > 0.5).astype(int)

# Custom threshold
threshold = 0.3
predictions_custom = (probs > threshold).astype(int)

# For new data
X_new = sm.add_constant(X_new_data)
new_probs = results.predict(X_new)

Model evaluation:

from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, roc_curve)

# Classification report
print(classification_report(y, predictions))

# Confusion matrix
print(confusion_matrix(y, predictions))

# AUC-ROC
auc = roc_auc_score(y, probs)
print(f"AUC: {auc:.4f}")

# Pseudo R-squared
print(f"McFadden's Pseudo R²: {results.prsquared:.4f}")

Probit

Uses normal distribution for binary outcomes.

When to use:

  • Binary outcomes
  • Prefer normal distribution assumption
  • Field convention (econometrics often uses probit)

Model: P(Y=1|X) = Φ(Xβ), where Φ is standard normal CDF

from statsmodels.discrete.discrete_model import Probit

model = Probit(y, X)
results = model.fit()

print(results.summary())

Comparison with Logit:

  • Probit and Logit usually give similar results
  • Probit: symmetric, based on normal distribution
  • Logit: slightly heavier tails, easier interpretation (odds ratios)
  • Coefficients not directly comparable (scale difference)

# Marginal effects are comparable
logit_me = logit_results.get_margeff().margeff
probit_me = probit_results.get_margeff().margeff

print("Logit marginal effects:", logit_me)
print("Probit marginal effects:", probit_me)

Multinomial Models

MNLogit (Multinomial Logit)

For unordered categorical outcomes with 3+ categories.

When to use:

  • Multiple unordered categories (e.g., transportation mode, brand choice)
  • No natural ordering among categories
  • Need probabilities for each category

Model: P(Y=j|X) = exp(Xβⱼ) / Σₖ exp(Xβₖ)

from statsmodels.discrete.discrete_model import MNLogit

# y should be integers 0, 1, 2, ... for categories
model = MNLogit(y, X)
results = model.fit()

print(results.summary())

Interpretation:

# One category is reference (usually category 0)
# Coefficients represent log-odds relative to reference

# For category j vs reference:
# exp(β_j) = odds ratio of category j vs reference

# Predicted probabilities for each category
probs = results.predict(X)  # Shape: (n_samples, n_categories)

# Most likely category
predicted_categories = probs.argmax(axis=1)

Relative risk ratios:

# Exponentiate coefficients for relative risk ratios
import numpy as np

# results.params for MNLogit has one column per non-reference category
rrr = np.exp(results.params)
print("Relative risk ratios (vs. reference category):")
print(rrr)

Conditional Logit

For choice models where alternatives have characteristics.

When to use:

  • Alternative-specific regressors (vary across choices)
  • Panel data with choices
  • Discrete choice experiments

from statsmodels.discrete.conditional_models import ConditionalLogit

# Data structure: long format with choice indicator
model = ConditionalLogit(y_choice, X_alternatives, groups=individual_id)
results = model.fit()
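
A minimal sketch of the long-format layout this expects (hypothetical column names cost and travel_time; each individual contributes one row per alternative, and the response is 1 for the chosen alternative):

import numpy as np
import pandas as pd
from statsmodels.discrete.conditional_models import ConditionalLogit

# Hypothetical data: 3 individuals, 3 alternatives each
df = pd.DataFrame({
    'individual_id': np.repeat([1, 2, 3], 3),
    'chosen':        [1, 0, 0,  0, 1, 0,  0, 0, 1],
    'cost':          [2.0, 3.5, 1.5,  2.2, 3.0, 1.8,  2.5, 3.2, 1.6],
    'travel_time':   [30, 20, 45,  28, 22, 50,  35, 18, 40],
})

model = ConditionalLogit(df['chosen'], df[['cost', 'travel_time']],
                         groups=df['individual_id'])
results = model.fit()
print(results.summary())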

Count Models

Poisson

Standard model for count data.

When to use:

  • Count outcomes (events, occurrences)
  • Rare events
  • Mean ≈ variance

Model: P(Y=k|X) = exp(-λ) λᵏ / k!, where log(λ) = Xβ

from statsmodels.discrete.discrete_model import Poisson

model = Poisson(y_counts, X)
results = model.fit()

print(results.summary())

Interpretation:

# Rate ratios (incident rate ratios)
rate_ratios = np.exp(results.params)
print("Rate ratios:", rate_ratios)

# For 1-unit increase in X, expected count multiplies by exp(β)

Check overdispersion:

# Mean and variance should be similar for Poisson
print(f"Mean: {y_counts.mean():.2f}")
print(f"Variance: {y_counts.var():.2f}")

# Quick check: overdispersion if variance >> mean
# Rule of thumb: variance/mean > 1.5 suggests overdispersion
overdispersion_ratio = y_counts.var() / y_counts.mean()
print(f"Variance/Mean: {overdispersion_ratio:.2f}")

if overdispersion_ratio > 1.5:
    print("Consider Negative Binomial model")

With offset (for rates):

# When modeling rates with varying exposure
# log(λ) = log(exposure) + Xβ

model = Poisson(y_counts, X, offset=np.log(exposure))
results = model.fit()

Negative Binomial

For overdispersed count data (variance > mean).

When to use:

  • Count data with overdispersion
  • Excess variance not explained by Poisson
  • Heterogeneity in counts

Model: Adds dispersion parameter α to account for overdispersion

from statsmodels.discrete.discrete_model import NegativeBinomial

model = NegativeBinomial(y_counts, X)
results = model.fit()

print(results.summary())
print(f"Dispersion parameter alpha: {results.params['alpha']:.4f}")

Compare with Poisson:

# Fit both models
poisson_results = Poisson(y_counts, X).fit()
nb_results = NegativeBinomial(y_counts, X).fit()

# AIC comparison (lower is better)
print(f"Poisson AIC: {poisson_results.aic:.2f}")
print(f"Negative Binomial AIC: {nb_results.aic:.2f}")

# Likelihood ratio test of alpha = 0 (Poisson nested in Negative Binomial)
from scipy import stats
lr_stat = 2 * (nb_results.llf - poisson_results.llf)
# alpha = 0 sits on the boundary of the parameter space, so this chi2(1)
# p-value is conservative
lr_pval = 1 - stats.chi2.cdf(lr_stat, df=1)  # 1 extra parameter (alpha)
print(f"LR test p-value: {lr_pval:.4f}")

if lr_pval < 0.05:
    print("Negative Binomial significantly better")

Zero-Inflated Models

For count data with excess zeros.

When to use:

  • More zeros than expected from Poisson/NB
  • Two processes: one for zeros, one for counts
  • Examples: number of doctor visits, insurance claims

Models:

  • ZeroInflatedPoisson (ZIP)
  • ZeroInflatedNegativeBinomialP (ZINB)

from statsmodels.discrete.count_model import (ZeroInflatedPoisson,
                                               ZeroInflatedNegativeBinomialP)

# ZIP model
zip_model = ZeroInflatedPoisson(y_counts, X, exog_infl=X_inflation)
zip_results = zip_model.fit()

# ZINB model (for overdispersion + excess zeros)
zinb_model = ZeroInflatedNegativeBinomialP(y_counts, X, exog_infl=X_inflation)
zinb_results = zinb_model.fit()

print(zip_results.summary())

Two parts of the model:

# 1. Inflation model: probability of a structural (excess) zero
# 2. Count model: Poisson distribution for the remaining observations

# In-sample predicted probability of observing a zero (either source)
prob_zero = zip_results.predict(which='prob-zero')

# In-sample expected counts
predicted_counts = zip_results.predict(which='mean')

Hurdle Models

Two-stage model: whether any counts, then how many.

When to use:

  • Excess zeros
  • Different processes for zero vs positive counts
  • Zeros structurally different from positive values

from statsmodels.discrete.truncated_model import HurdleCountModel

# dist: distribution for the positive counts; zerodist: distribution for the zero hurdle
model = HurdleCountModel(y_counts, X,
                         dist='poisson',       # or 'negbin'
                         zerodist='poisson')
results = model.fit()

print(results.summary())

Ordinal Models

Ordered Logit/Probit

For ordered categorical outcomes.

When to use:

  • Ordered categories (e.g., low/medium/high, ratings 1-5)
  • Natural ordering matters
  • Want to respect ordinal structure

Model: Cumulative probability model with cutpoints

from statsmodels.miscmodels.ordinal_model import OrderedModel

# y should be ordered integers: 0, 1, 2, ...
model = OrderedModel(y_ordered, X, distr='logit')  # or 'probit'
results = model.fit(method='bfgs')

print(results.summary())

Interpretation:

# The last (n_categories - 1) parameters are threshold parameters;
# increments beyond the first are stored on the log scale internally
n_categories = len(np.unique(y_ordered))
threshold_params = results.params[-(n_categories - 1):]

# Actual cutpoints (includes -inf and +inf at the ends)
cutpoints = model.transform_threshold_params(results.params)
print("Cutpoints:", cutpoints)

# The remaining parameters are the slope coefficients
coefficients = results.params[:-(n_categories - 1)]
print("Coefficients:", coefficients)

# Predicted probabilities for each category
probs = results.predict(X)  # Shape: (n_samples, n_categories)

# Most likely category
predicted_categories = probs.argmax(axis=1)

Proportional odds assumption:

# Test if coefficients are same across cutpoints
# (Brant test - implement manually or check residuals)

# Check: model each cutpoint separately and compare coefficients
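
A minimal sketch of that check (assuming y_ordered holds integers 0..K-1 and X is the constant-free design matrix passed to OrderedModel): fit a separate binary logit for each cutpoint, predicting whether the response exceeds it, and compare the slope estimates, which should be roughly equal if proportional odds holds.

import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.discrete_model import Logit

X_c = sm.add_constant(X)                       # OrderedModel exog has no constant
categories = np.sort(np.unique(y_ordered))
for j in categories[:-1]:
    above = (y_ordered > j).astype(int)        # 1 if response exceeds cutpoint j
    fit = Logit(above, X_c).fit(disp=0)
    print(f"Cutpoint > {j}: slopes = {np.asarray(fit.params)[1:].round(3)}")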

Model Diagnostics

Goodness of Fit

# Pseudo R-squared (McFadden)
print(f"Pseudo R²: {results.prsquared:.4f}")

# AIC/BIC for model comparison
print(f"AIC: {results.aic:.2f}")
print(f"BIC: {results.bic:.2f}")

# Log-likelihood
print(f"Log-likelihood: {results.llf:.2f}")

# Likelihood ratio test vs null model
lr_stat = 2 * (results.llf - results.llnull)
from scipy import stats
lr_pval = 1 - stats.chi2.cdf(lr_stat, results.df_model)
print(f"LR test p-value: {lr_pval}")

Classification Metrics (Binary)

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Predictions
probs = results.predict(X)
predictions = (probs > 0.5).astype(int)

# Metrics
print(f"Accuracy: {accuracy_score(y, predictions):.4f}")
print(f"Precision: {precision_score(y, predictions):.4f}")
print(f"Recall: {recall_score(y, predictions):.4f}")
print(f"F1: {f1_score(y, predictions):.4f}")
print(f"AUC: {roc_auc_score(y, probs):.4f}")

Classification Metrics (Multinomial)

from sklearn.metrics import accuracy_score, classification_report, log_loss

# Predicted categories
probs = results.predict(X)
predictions = probs.argmax(axis=1)

# Accuracy
accuracy = accuracy_score(y, predictions)
print(f"Accuracy: {accuracy:.4f}")

# Classification report
print(classification_report(y, predictions))

# Log loss
logloss = log_loss(y, probs)
print(f"Log Loss: {logloss:.4f}")

Count Model Diagnostics

# Observed vs predicted frequencies
observed = pd.Series(y_counts).value_counts().sort_index()
predicted = results.predict(X)
predicted_counts = pd.Series(np.round(predicted)).value_counts().sort_index()

# Compare distributions
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
observed.plot(kind='bar', alpha=0.5, label='Observed', ax=ax)
predicted_counts.plot(kind='bar', alpha=0.5, label='Predicted', ax=ax)
ax.legend()
ax.set_xlabel('Count')
ax.set_ylabel('Frequency')
plt.show()

# Rootogram (often a clearer comparison for count models):
# statsmodels does not ship one, so a custom implementation is needed
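
A minimal hanging-rootogram sketch for a fitted Poisson model (assuming results, X, and y_counts from above): bars of square-root observed frequencies hang from the curve of square-root expected frequencies, so bars that fall short of or overshoot zero flag poorly fitted counts.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

mu = results.predict(X)                         # fitted means
ks = np.arange(int(np.max(y_counts)) + 1)

# Expected and observed frequency of each count k
expected = np.array([stats.poisson.pmf(k, mu).sum() for k in ks])
observed = np.array([np.sum(y_counts == k) for k in ks])

fig, ax = plt.subplots()
ax.bar(ks, np.sqrt(observed),
       bottom=np.sqrt(expected) - np.sqrt(observed),
       alpha=0.6, label='sqrt(observed), hanging')
ax.plot(ks, np.sqrt(expected), 'r-o', label='sqrt(expected)')
ax.axhline(0, color='black', linewidth=0.8)
ax.set_xlabel('Count')
ax.set_ylabel('sqrt(Frequency)')
ax.legend()
plt.show()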

Influence and Outliers

# Pearson residuals for a count model (variance equals the mean under Poisson)
mu = results.predict(X)
pearson_resid = (y_counts - mu) / np.sqrt(mu)

# For binary models, divide by np.sqrt(p * (1 - p)) instead

# Check for outliers (|residual| > 2)
outliers = np.where(np.abs(pearson_resid) > 2)[0]
print(f"Number of outliers: {len(outliers)}")

# Leverage (hat values) for logit/probit: see the GLM-based sketch below
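
For leverage on a binary model, one workaround (a sketch, not the only route) is to refit the same specification as a GLM with a Binomial family and use its influence measures:

import numpy as np
import statsmodels.api as sm

# Same specification as the Logit model, refit as a GLM
glm_results = sm.GLM(y, X, family=sm.families.Binomial()).fit()
influence = glm_results.get_influence()

hat_values = influence.hat_matrix_diag          # leverage
high_leverage = np.where(hat_values > 2 * X.shape[1] / len(y))[0]
print(f"High-leverage observations: {len(high_leverage)}")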

Hypothesis Testing

# Single parameter test (automatic in summary)

# Multiple parameters: Wald test
# Test H0: β₁ = β₂ = 0
R = [[0, 1, 0, 0], [0, 0, 1, 0]]
wald_test = results.wald_test(R)
print(wald_test)

# Likelihood ratio test for nested models
model_reduced = Logit(y, X_reduced).fit()
model_full = Logit(y, X_full).fit()

lr_stat = 2 * (model_full.llf - model_reduced.llf)
df = model_full.df_model - model_reduced.df_model
from scipy import stats
lr_pval = 1 - stats.chi2.cdf(lr_stat, df)
print(f"LR test p-value: {lr_pval:.4f}")

Model Selection and Comparison

# Fit multiple models
models = {
    'Logit': Logit(y, X).fit(),
    'Probit': Probit(y, X).fit(),
    # Add more models
}

# Compare AIC/BIC
comparison = pd.DataFrame({
    'AIC': {name: model.aic for name, model in models.items()},
    'BIC': {name: model.bic for name, model in models.items()},
    'Pseudo R²': {name: model.prsquared for name, model in models.items()}
})
print(comparison.sort_values('AIC'))

# Cross-validation for predictive performance
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Use sklearn wrapper or manual CV
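
A minimal manual cross-validation sketch for the binary case (assuming X already contains the constant and y is 0/1; sklearn only supplies the fold indices):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from statsmodels.discrete.discrete_model import Logit

X_arr, y_arr = np.asarray(X), np.asarray(y)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

aucs = []
for train_idx, test_idx in kf.split(X_arr):
    # Refit on the training fold, score on the held-out fold
    fit = Logit(y_arr[train_idx], X_arr[train_idx]).fit(disp=0)
    test_probs = fit.predict(X_arr[test_idx])
    aucs.append(roc_auc_score(y_arr[test_idx], test_probs))

print(f"Mean out-of-sample AUC: {np.mean(aucs):.4f}")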

Formula API

Use R-style formulas for easier specification.

import statsmodels.formula.api as smf

# Logit with formula
formula = 'y ~ x1 + x2 + C(category) + x1:x2'
results = smf.logit(formula, data=df).fit()

# MNLogit with formula
results = smf.mnlogit(formula, data=df).fit()

# Poisson with formula
results = smf.poisson(formula, data=df).fit()

# Negative Binomial with formula
results = smf.negativebinomial(formula, data=df).fit()

Common Applications

Binary Classification (Marketing Response)

# Predict customer purchase probability
X = sm.add_constant(customer_features)
model = Logit(purchased, X)
results = model.fit()

# Targeting: select top 20% likely to purchase
probs = results.predict(X)
top_20_pct_idx = np.argsort(probs)[-int(0.2*len(probs)):]

Multinomial Choice (Transportation Mode)

# Predict transportation mode choice
model = MNLogit(mode_choice, X)
results = model.fit()

# Predicted mode for new commuter
new_commuter = sm.add_constant(new_features)
mode_probs = results.predict(new_commuter)
predicted_mode = mode_probs.argmax(axis=1)

Count Data (Number of Doctor Visits)

# Model healthcare utilization
model = NegativeBinomial(num_visits, X)
results = model.fit()

# Expected visits for new patient
expected_visits = results.predict(new_patient_X)

Zero-Inflated (Insurance Claims)

# Many people have zero claims
# Zero-inflation: some never claim
# Count process: those who might claim

zip_model = ZeroInflatedPoisson(claims, X_count, exog_infl=X_inflation)
results = zip_model.fit()

# In-sample probability of zero claims (structural zeros plus chance zeros)
zero_claim_prob = results.predict(which='prob-zero')

# In-sample expected number of claims
expected_claims = results.predict(which='mean')

Best Practices

  1. Check data type: Ensure response matches model (binary, counts, categories)
  2. Add constant: Always use sm.add_constant() unless no intercept desired
  3. Scale continuous predictors: For better convergence and interpretation
  4. Check convergence: Look for convergence warnings
  5. Use formula API: For categorical variables and interactions
  6. Marginal effects: Report marginal effects, not just coefficients
  7. Model comparison: Use AIC/BIC and cross-validation
  8. Validate: Holdout set or cross-validation for predictive models
  9. Check overdispersion: For count models, test Poisson assumption
  10. Consider alternatives: Zero-inflation, hurdle models for excess zeros

Common Pitfalls

  1. Forgetting constant: No intercept term
  2. Perfect separation: Logit/probit may not converge (see the quick check after this list)
  3. Using Poisson with overdispersion: Check and use Negative Binomial
  4. Misinterpreting coefficients: Remember they're on log-odds/log scale
  5. Not checking convergence: Optimization may fail silently
  6. Wrong distribution: Match model to data type (binary/count/categorical)
  7. Ignoring excess zeros: Use ZIP/ZINB when appropriate
  8. Not validating predictions: Always check out-of-sample performance
  9. Comparing non-nested models: Use AIC/BIC, not likelihood ratio test
  10. Ordinal as nominal: Use OrderedModel for ordered categories
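
A quick single-variable check for the perfect-separation pitfall above (a sketch only; separation can also arise from linear combinations of predictors, which this will not catch):

import numpy as np

X_arr, y_arr = np.asarray(X), np.asarray(y)
for j in range(1, X_arr.shape[1]):              # skip the constant in column 0
    x0, x1 = X_arr[y_arr == 0, j], X_arr[y_arr == 1, j]
    # Ranges that do not overlap between classes imply perfect separation
    if x0.max() < x1.min() or x1.max() < x0.min():
        print(f"Column {j} perfectly separates the outcome")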