# Evaluation Metrics for Survival Models

## Overview

Evaluating survival models requires specialized metrics that account for censored data. scikit-survival provides three main categories of metrics:

1. Concordance Index (C-index)
2. Time-dependent ROC and AUC
3. Brier Score

## Concordance Index (C-index)

### What It Measures

The concordance index measures the rank correlation between predicted risk scores and observed event times. It represents the probability that, for a random pair of subjects, the model correctly orders their survival times.

**Range**: 0 to 1
- 0.5 = random predictions
- 1.0 = perfect concordance
- Typical good performance: 0.7-0.8
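
To make this concrete, here is a minimal sketch with made-up numbers: three uncensored subjects yield three comparable pairs, and the risk scores below order two of those pairs correctly.

```python
import numpy as np
from sksurv.metrics import concordance_index_censored

# Toy data (hypothetical): three uncensored subjects
event = np.array([True, True, True])
time = np.array([10.0, 20.0, 30.0])

# Higher risk should mean earlier event; these scores order the pairs
# (1,2) and (1,3) correctly but swap the pair (2,3)
risk_scores = np.array([0.9, 0.3, 0.5])

c_index = concordance_index_censored(event, time, risk_scores)[0]
print(f"C-index: {c_index:.3f}")  # 2 concordant pairs / 3 pairs ≈ 0.667
```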

### Two Implementations

#### Harrell's C-index (concordance_index_censored)

The traditional estimator: simpler, but with known limitations.

**When to Use:**
- Low censoring rates (< 40%)
- Quick evaluation during development
- Comparing models on the same dataset

**Limitations:**
- Becomes increasingly biased as censoring increases
- Tends to overestimate performance at high censoring rates (roughly 40% and above)

```python
from sksurv.metrics import concordance_index_censored

# Compute Harrell's C-index; returns a 5-tuple:
# (cindex, concordant, discordant, tied_risk, tied_time)
result = concordance_index_censored(y_test['event'], y_test['time'], risk_scores)
c_index = result[0]
print(f"Harrell's C-index: {c_index:.3f}")
```

#### Uno's C-index (concordance_index_ipcw)

An inverse probability of censoring weighted (IPCW) estimator that corrects for censoring bias.

**When to Use:**
- Moderate to high censoring rates (> 40%)
- You need unbiased estimates
- Comparing models across different datasets
- Publishing results (more robust)

**Advantages:**
- Remains stable even with high censoring
- More reliable estimates
- Less biased

```python
from sksurv.metrics import concordance_index_ipcw

# Compute Uno's C-index; the training data are required to estimate
# the censoring distribution used for the IPCW weights
c_index, concordant, discordant, tied_risk, tied_time = concordance_index_ipcw(
    y_train, y_test, risk_scores
)
print(f"Uno's C-index: {c_index:.3f}")
```

### Choosing Between Harrell's and Uno's

**Use Uno's C-index when:**
- Censoring rate > 40%
- You need the most accurate estimates
- Comparing models from different studies
- Publishing research

**Use Harrell's C-index when:**
- Censoring rates are low
- Making quick model comparisons during development
- Computational efficiency is critical
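
A quick way to apply this rule of thumb is to check the censoring rate before choosing an estimator. A minimal sketch, assuming `y` is a structured array with boolean `'event'` and numeric `'time'` fields as elsewhere in this document:

```python
import numpy as np

# Fraction of subjects whose event was not observed (censored)
censoring_rate = 1.0 - np.mean(y['event'])
print(f"Censoring rate: {censoring_rate:.1%}")

# Rule of thumb from above: prefer the IPCW estimator under heavy censoring
if censoring_rate > 0.4:
    print("High censoring: prefer concordance_index_ipcw (Uno)")
else:
    print("Low censoring: concordance_index_censored (Harrell) is fine")
```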

### Example Comparison

```python
from sksurv.metrics import concordance_index_censored, concordance_index_ipcw

# Harrell's C-index
harrell = concordance_index_censored(y_test['event'], y_test['time'], risk_scores)[0]

# Uno's C-index
uno = concordance_index_ipcw(y_train, y_test, risk_scores)[0]

print(f"Harrell's C-index: {harrell:.3f}")
print(f"Uno's C-index: {uno:.3f}")
```

## Time-Dependent ROC and AUC

### What It Measures

Time-dependent AUC evaluates model discrimination at specific time points. It distinguishes subjects who experience an event by time *t* from those who do not.

**Question answered**: "How well does the model predict who will have an event by time t?"

### When to Use

- Predicting event occurrence within specific time windows
- Clinical decision-making at specific timepoints (e.g., 5-year survival)
- You want to evaluate performance across different time horizons
- You need both discrimination and timing information

### Key Function: cumulative_dynamic_auc

```python
import matplotlib.pyplot as plt
from sksurv.metrics import cumulative_dynamic_auc

# Define evaluation times
times = [365, 730, 1095, 1460, 1825]  # 1, 2, 3, 4, 5 years

# Compute time-dependent AUC
auc, mean_auc = cumulative_dynamic_auc(
    y_train, y_test, risk_scores, times
)

# Plot AUC over time
plt.plot(times, auc, marker='o')
plt.xlabel('Time (days)')
plt.ylabel('Time-dependent AUC')
plt.title('Model Discrimination Over Time')
plt.show()

print(f"Mean AUC: {mean_auc:.3f}")
```

### Interpretation

- **AUC at time t**: Probability that the model correctly ranks a subject who had an event by time t above one who did not
- **Varying AUC over time**: Indicates that model performance changes with the time horizon; see the sketch below
- **Mean AUC**: Overall summary of discrimination across all time points
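
Because performance varies by horizon, it is often useful to locate where discrimination peaks and dips. A small sketch, reusing the `times` and `auc` arrays from the example above:

```python
import numpy as np

auc = np.asarray(auc)
best_t = times[int(np.argmax(auc))]   # horizon with strongest discrimination
worst_t = times[int(np.argmin(auc))]  # horizon with weakest discrimination
print(f"Best AUC {auc.max():.3f} at t={best_t}; worst AUC {auc.min():.3f} at t={worst_t}")
```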

### Example: Comparing Models

```python
import matplotlib.pyplot as plt

# Compare two models on the same evaluation times
auc1, mean_auc1 = cumulative_dynamic_auc(y_train, y_test, risk_scores1, times)
auc2, mean_auc2 = cumulative_dynamic_auc(y_train, y_test, risk_scores2, times)

plt.plot(times, auc1, marker='o', label='Model 1')
plt.plot(times, auc2, marker='s', label='Model 2')
plt.xlabel('Time (days)')
plt.ylabel('Time-dependent AUC')
plt.legend()
plt.show()
```

## Brier Score

### What It Measures

The Brier score extends mean squared error to survival data with censoring. It measures both discrimination (ranking) and calibration (accuracy of predicted probabilities).

**Formula**: BS(t) = (1/n) Σᵢ (I(Tᵢ > t) − Ŝ(t | xᵢ))²

where Ŝ(t | xᵢ) is the predicted survival probability at time t for subject i, and I(·) is the indicator function. This is the uncensored form; to handle censoring, scikit-survival weights each term by the inverse probability of censoring (IPCW). A worked example follows the range summary below.

**Range**: 0 to 1
- 0 = perfect predictions
- Lower is better
- Typical good performance: < 0.2
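
To make the formula concrete, here is a hand-computed example of the uncensored form with made-up probabilities (the library's `brier_score` additionally applies the IPCW weights mentioned above):

```python
import numpy as np

t = 365  # evaluation time
# Hypothetical predicted survival probabilities S(t | x_i) for three subjects
s_hat = np.array([0.9, 0.4, 0.7])
# Indicator I(T_i > t): subjects 1 and 3 survived past t, subject 2 did not
alive = np.array([1.0, 0.0, 1.0])

# Uncensored Brier score: mean squared difference
bs = np.mean((s_hat - alive) ** 2)
print(f"BS({t}) = {bs:.3f}")  # (0.1² + 0.4² + 0.3²) / 3 ≈ 0.087
```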

### When to Use

- You need calibration assessment (not just ranking)
- You want to evaluate predicted probabilities, not just risk scores
- Comparing models that output survival functions
- Clinical applications requiring probability estimates

### Key Functions

#### brier_score: Single Time Point

```python
from sksurv.metrics import brier_score

# Compute the Brier score at a specific time
time_point = 1825  # 5 years
surv_funcs = model.predict_survival_function(X_test)

# Extract the survival probability at time_point for each subject
surv_at_t = [fn(time_point) for fn in surv_funcs]

# brier_score returns (times, scores); index into the scores array
times_bs, scores = brier_score(y_train, y_test, surv_at_t, [time_point])
print(f"Brier score at {time_point} days: {scores[0]:.3f}")
```

#### integrated_brier_score: Summary Across Time

```python
import numpy as np
from sksurv.metrics import integrated_brier_score

# Define evaluation times and get each subject's survival function
times = [365, 730, 1095, 1460, 1825]
surv_funcs = model.predict_survival_function(X_test)

# Evaluate the survival functions at the evaluation times, giving an
# array of shape (n_samples, n_times) as integrated_brier_score expects
preds = np.asarray([[fn(t) for t in times] for fn in surv_funcs])

ibs = integrated_brier_score(y_train, y_test, preds, times)
print(f"Integrated Brier Score: {ibs:.3f}")
```

### Interpretation

- **Brier score at time t**: Expected squared difference between the predicted survival probability and the actual survival status at time t
- **Integrated Brier Score**: Weighted average of Brier scores across time (see the sketch below)
- **Lower values = better predictions**
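
The "weighted average" is, up to the censoring weights, a time integral of the per-time Brier scores. A sketch of the idea, reusing `preds` and `times` from the integrated_brier_score example above; this mirrors, but does not replace, the library function:

```python
import numpy as np
from sksurv.metrics import brier_score

# Per-time Brier scores on the same evaluation grid
eval_times, scores = brier_score(y_train, y_test, preds, times)

# Trapezoidal time-average, normalized by the width of the interval
ibs_manual = np.trapz(scores, eval_times) / (eval_times[-1] - eval_times[0])
print(f"Manually integrated Brier score: {ibs_manual:.3f}")
```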

### Comparison with a Null Model

Always compare against a baseline (e.g., Kaplan-Meier):

```python
import numpy as np
from sksurv.nonparametric import kaplan_meier_estimator

# Compute the Kaplan-Meier baseline on the training data
time_km, surv_km = kaplan_meier_estimator(y_train['event'], y_train['time'])

# KM estimate of S(time_point): last step at or before time_point
# (1.0 if time_point precedes all observed times)
mask = time_km <= time_point
km_at_t = surv_km[mask][-1] if mask.any() else 1.0

# The KM baseline assigns the same survival probability to every test subject
surv_km_test = np.full(len(X_test), km_at_t)

bs_km = brier_score(y_train, y_test, surv_km_test, [time_point])[1][0]
bs_model = brier_score(y_train, y_test, surv_at_t, [time_point])[1][0]

print(f"Kaplan-Meier Brier Score: {bs_km:.3f}")
print(f"Model Brier Score: {bs_model:.3f}")
print(f"Improvement: {(bs_km - bs_model) / bs_km * 100:.1f}%")
```

## Using Metrics with Cross-Validation

### Concordance Index Scorer

```python
from sklearn.model_selection import cross_val_score
from sksurv.metrics import as_concordance_index_ipcw_scorer

# Wrap the estimator so that its score() method returns Uno's C-index
wrapped_model = as_concordance_index_ipcw_scorer(model)

# Perform cross-validation; no separate scoring argument is needed
scores = cross_val_score(wrapped_model, X, y, cv=5)
print(f"Mean C-index: {scores.mean():.3f} (±{scores.std():.3f})")
```

### Integrated Brier Score Scorer

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sksurv.metrics import as_integrated_brier_score_scorer

# Define time points for evaluation (quartiles of observed event times)
times = np.percentile(y['time'][y['event']], [25, 50, 75])

# Wrap the estimator; it must implement predict_survival_function.
# The wrapper's score() returns the *negative* IBS, so that higher
# scores are better, per scikit-learn convention.
wrapped_model = as_integrated_brier_score_scorer(model, times)

# Perform cross-validation
scores = cross_val_score(wrapped_model, X, y, cv=5)
print(f"Mean negative IBS: {scores.mean():.3f} (±{scores.std():.3f})")
```

## Model Selection with GridSearchCV

```python
from sklearn.model_selection import GridSearchCV
from sksurv.ensemble import RandomSurvivalForest
from sksurv.metrics import as_concordance_index_ipcw_scorer

# The wrapped model exposes its parameters under the 'estimator__' prefix
param_grid = {
    'estimator__n_estimators': [100, 200, 300],
    'estimator__min_samples_split': [10, 20, 30],
    'estimator__max_depth': [None, 10, 20]
}

# Wrap the estimator so that score() returns Uno's C-index
wrapped_model = as_concordance_index_ipcw_scorer(
    RandomSurvivalForest(random_state=42)
)

# Perform grid search
cv = GridSearchCV(
    wrapped_model,
    param_grid,
    cv=5,
    n_jobs=-1
)
cv.fit(X, y)

print(f"Best parameters: {cv.best_params_}")
print(f"Best C-index: {cv.best_score_:.3f}")
```

## Comprehensive Model Evaluation

### Recommended Evaluation Pipeline

```python
import numpy as np

from sksurv.metrics import (
    concordance_index_censored,
    concordance_index_ipcw,
    cumulative_dynamic_auc,
    integrated_brier_score
)


def evaluate_survival_model(model, X_train, X_test, y_train, y_test):
    """Comprehensive evaluation of a fitted survival model."""

    # Get predictions
    risk_scores = model.predict(X_test)
    surv_funcs = model.predict_survival_function(X_test)

    # 1. Concordance index (both versions)
    c_harrell = concordance_index_censored(y_test['event'], y_test['time'], risk_scores)[0]
    c_uno = concordance_index_ipcw(y_train, y_test, risk_scores)[0]

    # 2. Time-dependent AUC at the quartiles of observed event times
    times = np.percentile(y_test['time'][y_test['event']], [25, 50, 75])
    auc, mean_auc = cumulative_dynamic_auc(y_train, y_test, risk_scores, times)

    # 3. Integrated Brier score (survival functions evaluated at `times`)
    preds = np.asarray([[fn(t) for t in times] for fn in surv_funcs])
    ibs = integrated_brier_score(y_train, y_test, preds, times)

    # Print results
    print("=" * 50)
    print("Model Evaluation Results")
    print("=" * 50)
    print(f"Harrell's C-index: {c_harrell:.3f}")
    print(f"Uno's C-index: {c_uno:.3f}")
    print(f"Mean AUC: {mean_auc:.3f}")
    print(f"Integrated Brier: {ibs:.3f}")
    print("=" * 50)

    return {
        'c_harrell': c_harrell,
        'c_uno': c_uno,
        'mean_auc': mean_auc,
        'ibs': ibs,
        'time_auc': dict(zip(times, auc))
    }


# Use the evaluation function
results = evaluate_survival_model(model, X_train, X_test, y_train, y_test)
```

## Choosing the Right Metric

### Decision Guide

**Use C-index (Uno's) when:**
- The primary goal is ranking/discrimination
- You don't need calibrated probabilities
- You are in a standard survival analysis setting
- Most common choice

**Use Time-dependent AUC when:**
- You need discrimination at specific time points
- Clinical decisions are made at specific horizons
- You want to understand how performance varies over time

**Use Brier Score when:**
- You need calibrated probability estimates
- Both discrimination AND calibration are important
- Clinical decision-making requires probabilities
- You want a comprehensive assessment

**Best Practice**: Report multiple metrics for comprehensive evaluation. At minimum, report:
- Uno's C-index (discrimination)
- Integrated Brier Score (discrimination + calibration)
- Time-dependent AUC at clinically relevant time points