Initial commit
This commit is contained in:
397
skills/scikit-survival/references/competing-risks.md
Normal file
397
skills/scikit-survival/references/competing-risks.md
Normal file
@@ -0,0 +1,397 @@
|
||||
# Competing Risks Analysis
|
||||
|
||||
## Overview
|
||||
|
||||
Competing risks occur when subjects can experience one of several mutually exclusive events (event types). When one event occurs, it prevents ("competes with") the occurrence of other events.
|
||||
|
||||
### Examples of Competing Risks
|
||||
|
||||
**Medical Research:**
|
||||
- Death from cancer vs. death from cardiovascular disease vs. death from other causes
|
||||
- Relapse vs. death without relapse in cancer studies
|
||||
- Different types of infections in transplant patients
|
||||
|
||||
**Other Applications:**
|
||||
- Job termination: retirement vs. resignation vs. termination for cause
|
||||
- Equipment failure: different failure modes
|
||||
- Customer churn: different reasons for leaving
|
||||
|
||||
### Key Concept: Cumulative Incidence Function (CIF)
|
||||
|
||||
The **Cumulative Incidence Function (CIF)** represents the probability of experiencing a specific event type by time *t*, accounting for the presence of competing risks.
|
||||
|
||||
**CIF_k(t) = P(T ≤ t, event type = k)**
|
||||
|
||||
This differs from the Kaplan-Meier estimator, which would overestimate event probabilities when competing risks are present.
|
||||
|
||||
## When to Use Competing Risks Analysis
|
||||
|
||||
**Use competing risks when:**
|
||||
- Multiple mutually exclusive event types exist
|
||||
- Occurrence of one event prevents others
|
||||
- Need to estimate probability of specific event types
|
||||
- Want to understand how covariates affect different event types
|
||||
|
||||
**Don't use when:**
|
||||
- Only one event type of interest (standard survival analysis)
|
||||
- Events are not mutually exclusive (use recurrent events methods)
|
||||
- Competing events are extremely rare (can treat as censoring)
|
||||
|
||||
## Cumulative Incidence with Competing Risks
|
||||
|
||||
### cumulative_incidence_competing_risks Function
|
||||
|
||||
Estimates the cumulative incidence function for each event type.
|
||||
|
||||
```python
|
||||
from sksurv.nonparametric import cumulative_incidence_competing_risks
|
||||
from sksurv.datasets import load_leukemia
|
||||
|
||||
# Load data with competing risks
|
||||
X, y = load_leukemia()
|
||||
# y has event types: 0=censored, 1=relapse, 2=death
|
||||
|
||||
# Compute cumulative incidence for each event type
|
||||
# Returns: time points, CIF for event 1, CIF for event 2, ...
|
||||
time_points, cif_1, cif_2 = cumulative_incidence_competing_risks(y)
|
||||
|
||||
# Plot cumulative incidence functions
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
plt.figure(figsize=(10, 6))
|
||||
plt.step(time_points, cif_1, where='post', label='Relapse', linewidth=2)
|
||||
plt.step(time_points, cif_2, where='post', label='Death in remission', linewidth=2)
|
||||
plt.xlabel('Time (weeks)')
|
||||
plt.ylabel('Cumulative Incidence')
|
||||
plt.title('Competing Risks: Relapse vs Death')
|
||||
plt.legend()
|
||||
plt.grid(True, alpha=0.3)
|
||||
plt.show()
|
||||
```
|
||||
|
||||
### Interpretation
|
||||
|
||||
- **CIF at time t**: Probability of experiencing that specific event by time t
|
||||
- **Sum of all CIFs**: Total probability of experiencing any event (all cause)
|
||||
- **1 - sum of CIFs**: Probability of being event-free and uncensored
|
||||
|
||||
## Data Format for Competing Risks
|
||||
|
||||
### Creating Structured Array with Event Types
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
from sksurv.util import Surv
|
||||
|
||||
# Event types: 0 = censored, 1 = event type 1, 2 = event type 2
|
||||
event_types = np.array([0, 1, 2, 1, 0, 2, 1])
|
||||
times = np.array([10.2, 5.3, 8.1, 3.7, 12.5, 6.8, 4.2])
|
||||
|
||||
# Create survival array
|
||||
# For competing risks: event=True if any event occurred
|
||||
# Store event type separately or encode in the event field
|
||||
y = Surv.from_arrays(
|
||||
event=(event_types > 0), # True if any event
|
||||
time=times
|
||||
)
|
||||
|
||||
# Keep event_types for distinguishing between event types
|
||||
```
|
||||
|
||||
### Converting Data with Event Types
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
from sksurv.util import Surv
|
||||
|
||||
# Assume data has: time, event_type columns
|
||||
# event_type: 0=censored, 1=type1, 2=type2, etc.
|
||||
|
||||
df = pd.read_csv('competing_risks_data.csv')
|
||||
|
||||
# Create survival outcome
|
||||
y = Surv.from_arrays(
|
||||
event=(df['event_type'] > 0),
|
||||
time=df['time']
|
||||
)
|
||||
|
||||
# Store event types
|
||||
event_types = df['event_type'].values
|
||||
```
|
||||
|
||||
## Comparing Cumulative Incidence Between Groups
|
||||
|
||||
### Stratified Analysis
|
||||
|
||||
```python
|
||||
from sksurv.nonparametric import cumulative_incidence_competing_risks
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
# Split by treatment group
|
||||
mask_treatment = X['treatment'] == 'A'
|
||||
mask_control = X['treatment'] == 'B'
|
||||
|
||||
y_treatment = y[mask_treatment]
|
||||
y_control = y[mask_control]
|
||||
|
||||
# Compute CIF for each group
|
||||
time_trt, cif1_trt, cif2_trt = cumulative_incidence_competing_risks(y_treatment)
|
||||
time_ctl, cif1_ctl, cif2_ctl = cumulative_incidence_competing_risks(y_control)
|
||||
|
||||
# Plot comparison
|
||||
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
|
||||
|
||||
# Event type 1
|
||||
ax1.step(time_trt, cif1_trt, where='post', label='Treatment', linewidth=2)
|
||||
ax1.step(time_ctl, cif1_ctl, where='post', label='Control', linewidth=2)
|
||||
ax1.set_xlabel('Time')
|
||||
ax1.set_ylabel('Cumulative Incidence')
|
||||
ax1.set_title('Event Type 1')
|
||||
ax1.legend()
|
||||
ax1.grid(True, alpha=0.3)
|
||||
|
||||
# Event type 2
|
||||
ax2.step(time_trt, cif2_trt, where='post', label='Treatment', linewidth=2)
|
||||
ax2.step(time_ctl, cif2_ctl, where='post', label='Control', linewidth=2)
|
||||
ax2.set_xlabel('Time')
|
||||
ax2.set_ylabel('Cumulative Incidence')
|
||||
ax2.set_title('Event Type 2')
|
||||
ax2.legend()
|
||||
ax2.grid(True, alpha=0.3)
|
||||
|
||||
plt.tight_layout()
|
||||
plt.show()
|
||||
```
|
||||
|
||||
## Statistical Testing with Competing Risks
|
||||
|
||||
### Gray's Test
|
||||
|
||||
Compare cumulative incidence functions between groups using Gray's test (available in other packages like lifelines).
|
||||
|
||||
```python
|
||||
# Note: Gray's test not directly available in scikit-survival
|
||||
# Consider using lifelines or other packages
|
||||
|
||||
# from lifelines.statistics import multivariate_logrank_test
|
||||
# result = multivariate_logrank_test(times, groups, events, event_of_interest=1)
|
||||
```
|
||||
|
||||
## Modeling with Competing Risks
|
||||
|
||||
### Approach 1: Cause-Specific Hazard Models
|
||||
|
||||
Fit separate Cox models for each event type, treating other event types as censored.
|
||||
|
||||
```python
|
||||
from sksurv.linear_model import CoxPHSurvivalAnalysis
|
||||
from sksurv.util import Surv
|
||||
|
||||
# Separate outcome for each event type
|
||||
# Event type 1: treat type 2 as censored
|
||||
y_event1 = Surv.from_arrays(
|
||||
event=(event_types == 1),
|
||||
time=times
|
||||
)
|
||||
|
||||
# Event type 2: treat type 1 as censored
|
||||
y_event2 = Surv.from_arrays(
|
||||
event=(event_types == 2),
|
||||
time=times
|
||||
)
|
||||
|
||||
# Fit cause-specific models
|
||||
cox_event1 = CoxPHSurvivalAnalysis()
|
||||
cox_event1.fit(X, y_event1)
|
||||
|
||||
cox_event2 = CoxPHSurvivalAnalysis()
|
||||
cox_event2.fit(X, y_event2)
|
||||
|
||||
# Interpret coefficients for each event type
|
||||
print("Event Type 1 (e.g., Relapse):")
|
||||
print(cox_event1.coef_)
|
||||
|
||||
print("\nEvent Type 2 (e.g., Death):")
|
||||
print(cox_event2.coef_)
|
||||
```
|
||||
|
||||
**Interpretation:**
|
||||
- Separate model for each competing event
|
||||
- Coefficients show effect on cause-specific hazard for that event type
|
||||
- A covariate may increase risk for one event type but decrease for another
|
||||
|
||||
### Approach 2: Fine-Gray Sub-distribution Hazard Model
|
||||
|
||||
Models the cumulative incidence directly (not available directly in scikit-survival, but can use other packages).
|
||||
|
||||
```python
|
||||
# Note: Fine-Gray model not directly in scikit-survival
|
||||
# Consider using lifelines or rpy2 to access R's cmprsk package
|
||||
|
||||
# from lifelines import CRCSplineFitter
|
||||
# crc = CRCSplineFitter()
|
||||
# crc.fit(df, event_col='event', duration_col='time')
|
||||
```
|
||||
|
||||
## Practical Example: Complete Competing Risks Analysis
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
import matplotlib.pyplot as plt
|
||||
from sksurv.nonparametric import cumulative_incidence_competing_risks
|
||||
from sksurv.linear_model import CoxPHSurvivalAnalysis
|
||||
from sksurv.util import Surv
|
||||
|
||||
# Simulate competing risks data
|
||||
np.random.seed(42)
|
||||
n = 200
|
||||
|
||||
# Create features
|
||||
age = np.random.normal(60, 10, n)
|
||||
treatment = np.random.choice(['A', 'B'], n)
|
||||
|
||||
# Simulate event times and types
|
||||
# Event types: 0=censored, 1=relapse, 2=death
|
||||
times = np.random.exponential(100, n)
|
||||
event_types = np.zeros(n, dtype=int)
|
||||
|
||||
# Higher age increases both events, treatment A reduces relapse
|
||||
for i in range(n):
|
||||
if times[i] < 150: # Event occurred
|
||||
# Probability of each event type
|
||||
p_relapse = 0.6 if treatment[i] == 'B' else 0.4
|
||||
event_types[i] = 1 if np.random.rand() < p_relapse else 2
|
||||
else:
|
||||
times[i] = 150 # Censored at study end
|
||||
|
||||
# Create DataFrame
|
||||
df = pd.DataFrame({
|
||||
'time': times,
|
||||
'event_type': event_types,
|
||||
'age': age,
|
||||
'treatment': treatment
|
||||
})
|
||||
|
||||
# Encode treatment
|
||||
df['treatment_A'] = (df['treatment'] == 'A').astype(int)
|
||||
|
||||
# 1. OVERALL CUMULATIVE INCIDENCE
|
||||
print("=" * 60)
|
||||
print("OVERALL CUMULATIVE INCIDENCE")
|
||||
print("=" * 60)
|
||||
|
||||
y_all = Surv.from_arrays(event=(df['event_type'] > 0), time=df['time'])
|
||||
time_points, cif_relapse, cif_death = cumulative_incidence_competing_risks(y_all)
|
||||
|
||||
plt.figure(figsize=(10, 6))
|
||||
plt.step(time_points, cif_relapse, where='post', label='Relapse', linewidth=2)
|
||||
plt.step(time_points, cif_death, where='post', label='Death', linewidth=2)
|
||||
plt.xlabel('Time (days)')
|
||||
plt.ylabel('Cumulative Incidence')
|
||||
plt.title('Competing Risks: Relapse vs Death')
|
||||
plt.legend()
|
||||
plt.grid(True, alpha=0.3)
|
||||
plt.show()
|
||||
|
||||
print(f"5-year relapse incidence: {cif_relapse[-1]:.2%}")
|
||||
print(f"5-year death incidence: {cif_death[-1]:.2%}")
|
||||
|
||||
# 2. STRATIFIED BY TREATMENT
|
||||
print("\n" + "=" * 60)
|
||||
print("CUMULATIVE INCIDENCE BY TREATMENT")
|
||||
print("=" * 60)
|
||||
|
||||
for trt in ['A', 'B']:
|
||||
mask = df['treatment'] == trt
|
||||
y_trt = Surv.from_arrays(
|
||||
event=(df.loc[mask, 'event_type'] > 0),
|
||||
time=df.loc[mask, 'time']
|
||||
)
|
||||
time_trt, cif1_trt, cif2_trt = cumulative_incidence_competing_risks(y_trt)
|
||||
print(f"\nTreatment {trt}:")
|
||||
print(f" 5-year relapse: {cif1_trt[-1]:.2%}")
|
||||
print(f" 5-year death: {cif2_trt[-1]:.2%}")
|
||||
|
||||
# 3. CAUSE-SPECIFIC MODELS
|
||||
print("\n" + "=" * 60)
|
||||
print("CAUSE-SPECIFIC HAZARD MODELS")
|
||||
print("=" * 60)
|
||||
|
||||
X = df[['age', 'treatment_A']]
|
||||
|
||||
# Model for relapse (event type 1)
|
||||
y_relapse = Surv.from_arrays(
|
||||
event=(df['event_type'] == 1),
|
||||
time=df['time']
|
||||
)
|
||||
cox_relapse = CoxPHSurvivalAnalysis()
|
||||
cox_relapse.fit(X, y_relapse)
|
||||
|
||||
print("\nRelapse Model:")
|
||||
print(f" Age: HR = {np.exp(cox_relapse.coef_[0]):.3f}")
|
||||
print(f" Treatment A: HR = {np.exp(cox_relapse.coef_[1]):.3f}")
|
||||
|
||||
# Model for death (event type 2)
|
||||
y_death = Surv.from_arrays(
|
||||
event=(df['event_type'] == 2),
|
||||
time=df['time']
|
||||
)
|
||||
cox_death = CoxPHSurvivalAnalysis()
|
||||
cox_death.fit(X, y_death)
|
||||
|
||||
print("\nDeath Model:")
|
||||
print(f" Age: HR = {np.exp(cox_death.coef_[0]):.3f}")
|
||||
print(f" Treatment A: HR = {np.exp(cox_death.coef_[1]):.3f}")
|
||||
|
||||
print("\n" + "=" * 60)
|
||||
```
|
||||
|
||||
## Important Considerations
|
||||
|
||||
### Censoring in Competing Risks
|
||||
|
||||
- **Administrative censoring**: Subject still at risk at end of study
|
||||
- **Loss to follow-up**: Subject leaves study before event
|
||||
- **Competing event**: Other event occurred - NOT censored for CIF, but censored for cause-specific models
|
||||
|
||||
### Choosing Between Cause-Specific and Sub-distribution Models
|
||||
|
||||
**Cause-Specific Hazard Models:**
|
||||
- Easier to interpret
|
||||
- Direct effect on hazard rate
|
||||
- Better for understanding etiology
|
||||
- Can fit with scikit-survival
|
||||
|
||||
**Fine-Gray Sub-distribution Models:**
|
||||
- Models cumulative incidence directly
|
||||
- Better for prediction and risk stratification
|
||||
- More appropriate for clinical decision-making
|
||||
- Requires other packages
|
||||
|
||||
### Common Mistakes
|
||||
|
||||
**Mistake 1**: Using Kaplan-Meier to estimate event-specific probabilities
|
||||
- **Wrong**: Kaplan-Meier for event type 1, treating type 2 as censored
|
||||
- **Correct**: Cumulative incidence function accounting for competing risks
|
||||
|
||||
**Mistake 2**: Ignoring competing risks when they're substantial
|
||||
- If competing event rate > 10-20%, should use competing risks methods
|
||||
|
||||
**Mistake 3**: Confusing cause-specific and sub-distribution hazards
|
||||
- They answer different questions
|
||||
- Use appropriate model for your research question
|
||||
|
||||
## Summary
|
||||
|
||||
**Key Functions:**
|
||||
- `cumulative_incidence_competing_risks`: Estimate CIF for each event type
|
||||
- Fit separate Cox models for cause-specific hazards
|
||||
- Use stratified analysis to compare groups
|
||||
|
||||
**Best Practices:**
|
||||
1. Always plot cumulative incidence functions
|
||||
2. Report both event-specific and overall incidence
|
||||
3. Use cause-specific models in scikit-survival
|
||||
4. Consider Fine-Gray models (other packages) for prediction
|
||||
5. Be explicit about which events are competing vs censored
|
||||
182
skills/scikit-survival/references/cox-models.md
Normal file
182
skills/scikit-survival/references/cox-models.md
Normal file
@@ -0,0 +1,182 @@
|
||||
# Cox Proportional Hazards Models
|
||||
|
||||
## Overview
|
||||
|
||||
Cox proportional hazards models are semi-parametric models that relate covariates to the time of an event. The hazard function for individual *i* is expressed as:
|
||||
|
||||
**h_i(t) = h_0(t) × exp(β^T x_i)**
|
||||
|
||||
where:
|
||||
- h_0(t) is the baseline hazard function (unspecified)
|
||||
- β is the vector of coefficients
|
||||
- x_i is the covariate vector for individual *i*
|
||||
|
||||
The key assumption is that the hazard ratio between two individuals is constant over time (proportional hazards).
|
||||
|
||||
## CoxPHSurvivalAnalysis
|
||||
|
||||
Basic Cox proportional hazards model for survival analysis.
|
||||
|
||||
### When to Use
|
||||
- Standard survival analysis with censored data
|
||||
- Need interpretable coefficients (log hazard ratios)
|
||||
- Proportional hazards assumption holds
|
||||
- Dataset has relatively few features
|
||||
|
||||
### Key Parameters
|
||||
- `alpha`: Regularization parameter (default: 0, no regularization)
|
||||
- `ties`: Method for handling tied event times ('breslow' or 'efron')
|
||||
- `n_iter`: Maximum number of iterations for optimization
|
||||
|
||||
### Example Usage
|
||||
```python
|
||||
from sksurv.linear_model import CoxPHSurvivalAnalysis
|
||||
from sksurv.datasets import load_gbsg2
|
||||
|
||||
# Load data
|
||||
X, y = load_gbsg2()
|
||||
|
||||
# Fit Cox model
|
||||
estimator = CoxPHSurvivalAnalysis()
|
||||
estimator.fit(X, y)
|
||||
|
||||
# Get coefficients (log hazard ratios)
|
||||
coefficients = estimator.coef_
|
||||
|
||||
# Predict risk scores
|
||||
risk_scores = estimator.predict(X)
|
||||
```
|
||||
|
||||
## CoxnetSurvivalAnalysis
|
||||
|
||||
Cox model with elastic net penalty for feature selection and regularization.
|
||||
|
||||
### When to Use
|
||||
- High-dimensional data (many features)
|
||||
- Need automatic feature selection
|
||||
- Want to handle multicollinearity
|
||||
- Require sparse models
|
||||
|
||||
### Penalty Types
|
||||
- **Ridge (L2)**: alpha_min_ratio=1.0, l1_ratio=0
|
||||
- Shrinks all coefficients
|
||||
- Good when all features are relevant
|
||||
|
||||
- **Lasso (L1)**: l1_ratio=1.0
|
||||
- Performs feature selection (sets coefficients to zero)
|
||||
- Good for sparse models
|
||||
|
||||
- **Elastic Net**: 0 < l1_ratio < 1
|
||||
- Combination of L1 and L2
|
||||
- Balances feature selection and grouping
|
||||
|
||||
### Key Parameters
|
||||
- `l1_ratio`: Balance between L1 and L2 penalty (0=Ridge, 1=Lasso)
|
||||
- `alpha_min_ratio`: Ratio of smallest to largest penalty in regularization path
|
||||
- `n_alphas`: Number of alphas along regularization path
|
||||
- `fit_baseline_model`: Whether to fit unpenalized baseline model
|
||||
|
||||
### Example Usage
|
||||
```python
|
||||
from sksurv.linear_model import CoxnetSurvivalAnalysis
|
||||
|
||||
# Fit with elastic net penalty
|
||||
estimator = CoxnetSurvivalAnalysis(l1_ratio=0.5, alpha_min_ratio=0.01)
|
||||
estimator.fit(X, y)
|
||||
|
||||
# Access regularization path
|
||||
alphas = estimator.alphas_
|
||||
coefficients_path = estimator.coef_path_
|
||||
|
||||
# Predict with specific alpha
|
||||
risk_scores = estimator.predict(X, alpha=0.1)
|
||||
```
|
||||
|
||||
### Cross-Validation for Alpha Selection
|
||||
```python
|
||||
from sklearn.model_selection import GridSearchCV
|
||||
from sksurv.metrics import concordance_index_censored
|
||||
|
||||
# Define parameter grid
|
||||
param_grid = {'l1_ratio': [0.1, 0.5, 0.9],
|
||||
'alpha_min_ratio': [0.01, 0.001]}
|
||||
|
||||
# Grid search with C-index
|
||||
cv = GridSearchCV(CoxnetSurvivalAnalysis(),
|
||||
param_grid,
|
||||
scoring='concordance_index_ipcw',
|
||||
cv=5)
|
||||
cv.fit(X, y)
|
||||
|
||||
# Best parameters
|
||||
best_params = cv.best_params_
|
||||
```
|
||||
|
||||
## IPCRidge
|
||||
|
||||
Inverse probability of censoring weighted Ridge regression for accelerated failure time models.
|
||||
|
||||
### When to Use
|
||||
- Prefer accelerated failure time (AFT) framework over proportional hazards
|
||||
- Need to model how features accelerate/decelerate survival time
|
||||
- High censoring rates
|
||||
- Want regularization with Ridge penalty
|
||||
|
||||
### Key Difference from Cox Models
|
||||
AFT models assume features multiply survival time by a constant factor, rather than multiplying the hazard rate. The model predicts log survival time directly.
|
||||
|
||||
### Example Usage
|
||||
```python
|
||||
from sksurv.linear_model import IPCRidge
|
||||
|
||||
# Fit IPCRidge model
|
||||
estimator = IPCRidge(alpha=1.0)
|
||||
estimator.fit(X, y)
|
||||
|
||||
# Predict log survival time
|
||||
log_time = estimator.predict(X)
|
||||
```
|
||||
|
||||
## Model Comparison and Selection
|
||||
|
||||
### Choosing Between Models
|
||||
|
||||
**Use CoxPHSurvivalAnalysis when:**
|
||||
- Small to moderate number of features
|
||||
- Want interpretable hazard ratios
|
||||
- Standard survival analysis setting
|
||||
|
||||
**Use CoxnetSurvivalAnalysis when:**
|
||||
- High-dimensional data (p >> n)
|
||||
- Need feature selection
|
||||
- Want to identify important predictors
|
||||
- Presence of multicollinearity
|
||||
|
||||
**Use IPCRidge when:**
|
||||
- AFT framework is more appropriate
|
||||
- High censoring rates
|
||||
- Want to model time directly rather than hazard
|
||||
|
||||
### Checking Proportional Hazards Assumption
|
||||
|
||||
The proportional hazards assumption should be verified using:
|
||||
- Schoenfeld residuals
|
||||
- Log-log survival plots
|
||||
- Statistical tests (available in other packages like lifelines)
|
||||
|
||||
If violated, consider:
|
||||
- Stratification by violating covariates
|
||||
- Time-varying coefficients
|
||||
- Alternative models (AFT, parametric models)
|
||||
|
||||
## Interpretation
|
||||
|
||||
### Cox Model Coefficients
|
||||
- Positive coefficient: increased hazard (shorter survival)
|
||||
- Negative coefficient: decreased hazard (longer survival)
|
||||
- Hazard ratio = exp(β) for one-unit increase in covariate
|
||||
- Example: β=0.693 → HR=2.0 (doubles the hazard)
|
||||
|
||||
### Risk Scores
|
||||
- Higher risk score = higher risk of event = shorter expected survival
|
||||
- Risk scores are relative; use survival functions for absolute predictions
|
||||
494
skills/scikit-survival/references/data-handling.md
Normal file
494
skills/scikit-survival/references/data-handling.md
Normal file
@@ -0,0 +1,494 @@
|
||||
# Data Handling and Preprocessing
|
||||
|
||||
## Understanding Survival Data
|
||||
|
||||
### The Surv Object
|
||||
|
||||
Survival data in scikit-survival is represented using structured arrays with two fields:
|
||||
- **event**: Boolean indicating whether the event occurred (True) or was censored (False)
|
||||
- **time**: Time to event or censoring
|
||||
|
||||
```python
|
||||
from sksurv.util import Surv
|
||||
|
||||
# Create survival outcome from separate arrays
|
||||
event = np.array([True, False, True, False, True])
|
||||
time = np.array([5.2, 10.1, 3.7, 8.9, 6.3])
|
||||
|
||||
y = Surv.from_arrays(event=event, time=time)
|
||||
print(y.dtype) # [('event', '?'), ('time', '<f8')]
|
||||
```
|
||||
|
||||
### Types of Censoring
|
||||
|
||||
**Right Censoring** (most common):
|
||||
- Subject didn't experience event by end of study
|
||||
- Subject lost to follow-up
|
||||
- Subject withdrew from study
|
||||
|
||||
**Left Censoring**:
|
||||
- Event occurred before observation began
|
||||
- Rare in practice
|
||||
|
||||
**Interval Censoring**:
|
||||
- Event occurred in a known time interval
|
||||
- Requires specialized methods
|
||||
|
||||
scikit-survival primarily handles right-censored data.
|
||||
|
||||
## Loading Data
|
||||
|
||||
### Built-in Datasets
|
||||
|
||||
```python
|
||||
from sksurv.datasets import (
|
||||
load_aids,
|
||||
load_breast_cancer,
|
||||
load_gbsg2,
|
||||
load_veterans_lung_cancer,
|
||||
load_whas500
|
||||
)
|
||||
|
||||
# Load dataset
|
||||
X, y = load_breast_cancer()
|
||||
|
||||
# X is pandas DataFrame with features
|
||||
# y is structured array with 'event' and 'time'
|
||||
print(f"Features shape: {X.shape}")
|
||||
print(f"Number of events: {y['event'].sum()}")
|
||||
print(f"Censoring rate: {1 - y['event'].mean():.2%}")
|
||||
```
|
||||
|
||||
### Loading Custom Data
|
||||
|
||||
#### From Pandas DataFrame
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
from sksurv.util import Surv
|
||||
|
||||
# Load data
|
||||
df = pd.read_csv('survival_data.csv')
|
||||
|
||||
# Separate features and outcome
|
||||
X = df.drop(['time', 'event'], axis=1)
|
||||
y = Surv.from_dataframe('event', 'time', df)
|
||||
```
|
||||
|
||||
#### From CSV with Surv.from_arrays
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
from sksurv.util import Surv
|
||||
|
||||
# Load data
|
||||
df = pd.read_csv('survival_data.csv')
|
||||
|
||||
# Create feature matrix
|
||||
X = df.drop(['time', 'event'], axis=1)
|
||||
|
||||
# Create survival outcome
|
||||
y = Surv.from_arrays(
|
||||
event=df['event'].astype(bool),
|
||||
time=df['time'].astype(float)
|
||||
)
|
||||
```
|
||||
|
||||
### Loading ARFF Files
|
||||
|
||||
```python
|
||||
from sksurv.io import loadarff
|
||||
|
||||
# Load ARFF format (Weka format)
|
||||
data = loadarff('survival_data.arff')
|
||||
|
||||
# Extract X and y
|
||||
X = data[0] # pandas DataFrame
|
||||
y = data[1] # structured array
|
||||
```
|
||||
|
||||
## Data Preprocessing
|
||||
|
||||
### Handling Categorical Variables
|
||||
|
||||
#### Method 1: OneHotEncoder (scikit-survival)
|
||||
|
||||
```python
|
||||
from sksurv.preprocessing import OneHotEncoder
|
||||
import pandas as pd
|
||||
|
||||
# Identify categorical columns
|
||||
categorical_cols = ['gender', 'race', 'treatment']
|
||||
|
||||
# One-hot encode
|
||||
encoder = OneHotEncoder()
|
||||
X_encoded = encoder.fit_transform(X[categorical_cols])
|
||||
|
||||
# Combine with numerical features
|
||||
numerical_cols = [col for col in X.columns if col not in categorical_cols]
|
||||
X_processed = pd.concat([X[numerical_cols], X_encoded], axis=1)
|
||||
```
|
||||
|
||||
#### Method 2: encode_categorical
|
||||
|
||||
```python
|
||||
from sksurv.preprocessing import encode_categorical
|
||||
|
||||
# Automatically encode all categorical columns
|
||||
X_encoded = encode_categorical(X)
|
||||
```
|
||||
|
||||
#### Method 3: Pandas get_dummies
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
|
||||
# One-hot encode categorical variables
|
||||
X_encoded = pd.get_dummies(X, drop_first=True)
|
||||
```
|
||||
|
||||
### Standardization
|
||||
|
||||
Standardization is important for:
|
||||
- Cox models with regularization
|
||||
- SVMs
|
||||
- Models sensitive to feature scales
|
||||
|
||||
```python
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
|
||||
# Standardize features
|
||||
scaler = StandardScaler()
|
||||
X_scaled = scaler.fit_transform(X)
|
||||
|
||||
# Convert back to DataFrame
|
||||
X_scaled = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)
|
||||
```
|
||||
|
||||
### Handling Missing Data
|
||||
|
||||
#### Check for Missing Values
|
||||
|
||||
```python
|
||||
# Check missing values
|
||||
missing = X.isnull().sum()
|
||||
print(missing[missing > 0])
|
||||
|
||||
# Visualize missing data
|
||||
import seaborn as sns
|
||||
sns.heatmap(X.isnull(), cbar=False)
|
||||
```
|
||||
|
||||
#### Imputation Strategies
|
||||
|
||||
```python
|
||||
from sklearn.impute import SimpleImputer
|
||||
|
||||
# Mean imputation for numerical features
|
||||
num_imputer = SimpleImputer(strategy='mean')
|
||||
X_num = X.select_dtypes(include=[np.number])
|
||||
X_num_imputed = num_imputer.fit_transform(X_num)
|
||||
|
||||
# Most frequent for categorical
|
||||
cat_imputer = SimpleImputer(strategy='most_frequent')
|
||||
X_cat = X.select_dtypes(include=['object', 'category'])
|
||||
X_cat_imputed = cat_imputer.fit_transform(X_cat)
|
||||
```
|
||||
|
||||
#### Advanced Imputation
|
||||
|
||||
```python
|
||||
from sklearn.experimental import enable_iterative_imputer
|
||||
from sklearn.impute import IterativeImputer
|
||||
|
||||
# Iterative imputation
|
||||
imputer = IterativeImputer(random_state=42)
|
||||
X_imputed = imputer.fit_transform(X)
|
||||
```
|
||||
|
||||
### Feature Selection
|
||||
|
||||
#### Variance Threshold
|
||||
|
||||
```python
|
||||
from sklearn.feature_selection import VarianceThreshold
|
||||
|
||||
# Remove low variance features
|
||||
selector = VarianceThreshold(threshold=0.01)
|
||||
X_selected = selector.fit_transform(X)
|
||||
|
||||
# Get selected feature names
|
||||
selected_features = X.columns[selector.get_support()]
|
||||
```
|
||||
|
||||
#### Univariate Feature Selection
|
||||
|
||||
```python
|
||||
from sklearn.feature_selection import SelectKBest
|
||||
from sksurv.util import Surv
|
||||
|
||||
# Select top k features
|
||||
selector = SelectKBest(k=10)
|
||||
X_selected = selector.fit_transform(X, y)
|
||||
|
||||
# Get selected features
|
||||
selected_features = X.columns[selector.get_support()]
|
||||
```
|
||||
|
||||
## Complete Preprocessing Pipeline
|
||||
|
||||
### Using sklearn Pipeline
|
||||
|
||||
```python
|
||||
from sklearn.pipeline import Pipeline
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.impute import SimpleImputer
|
||||
from sksurv.linear_model import CoxPHSurvivalAnalysis
|
||||
|
||||
# Create preprocessing and modeling pipeline
|
||||
pipeline = Pipeline([
|
||||
('imputer', SimpleImputer(strategy='mean')),
|
||||
('scaler', StandardScaler()),
|
||||
('model', CoxPHSurvivalAnalysis())
|
||||
])
|
||||
|
||||
# Fit pipeline
|
||||
pipeline.fit(X, y)
|
||||
|
||||
# Predict
|
||||
predictions = pipeline.predict(X_test)
|
||||
```
|
||||
|
||||
### Custom Preprocessing Function
|
||||
|
||||
```python
|
||||
def preprocess_survival_data(X, y=None, scaler=None, encoder=None):
|
||||
"""
|
||||
Complete preprocessing pipeline for survival data
|
||||
|
||||
Parameters:
|
||||
-----------
|
||||
X : DataFrame
|
||||
Feature matrix
|
||||
y : structured array, optional
|
||||
Survival outcome (for filtering invalid samples)
|
||||
scaler : StandardScaler, optional
|
||||
Fitted scaler (for test data)
|
||||
encoder : OneHotEncoder, optional
|
||||
Fitted encoder (for test data)
|
||||
|
||||
Returns:
|
||||
--------
|
||||
X_processed : DataFrame
|
||||
Processed features
|
||||
scaler : StandardScaler
|
||||
Fitted scaler
|
||||
encoder : OneHotEncoder
|
||||
Fitted encoder
|
||||
"""
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sksurv.preprocessing import encode_categorical
|
||||
|
||||
# 1. Handle missing values
|
||||
# Remove rows with missing outcome
|
||||
if y is not None:
|
||||
mask = np.isfinite(y['time']) & (y['time'] > 0)
|
||||
X = X[mask]
|
||||
y = y[mask]
|
||||
|
||||
# Impute missing features
|
||||
X = X.fillna(X.median())
|
||||
|
||||
# 2. Encode categorical variables
|
||||
if encoder is None:
|
||||
X_processed = encode_categorical(X)
|
||||
encoder = None # encode_categorical doesn't return encoder
|
||||
else:
|
||||
X_processed = encode_categorical(X)
|
||||
|
||||
# 3. Standardize numerical features
|
||||
if scaler is None:
|
||||
scaler = StandardScaler()
|
||||
X_processed = pd.DataFrame(
|
||||
scaler.fit_transform(X_processed),
|
||||
columns=X_processed.columns,
|
||||
index=X_processed.index
|
||||
)
|
||||
else:
|
||||
X_processed = pd.DataFrame(
|
||||
scaler.transform(X_processed),
|
||||
columns=X_processed.columns,
|
||||
index=X_processed.index
|
||||
)
|
||||
|
||||
if y is not None:
|
||||
return X_processed, y, scaler, encoder
|
||||
else:
|
||||
return X_processed, scaler, encoder
|
||||
|
||||
# Usage
|
||||
X_train_processed, y_train_processed, scaler, encoder = preprocess_survival_data(X_train, y_train)
|
||||
X_test_processed, _, _ = preprocess_survival_data(X_test, scaler=scaler, encoder=encoder)
|
||||
```
|
||||
|
||||
## Data Quality Checks
|
||||
|
||||
### Validate Survival Data
|
||||
|
||||
```python
|
||||
def validate_survival_data(y):
|
||||
"""Check survival data quality"""
|
||||
|
||||
# Check for negative times
|
||||
if np.any(y['time'] <= 0):
|
||||
print("WARNING: Found non-positive survival times")
|
||||
print(f"Negative times: {np.sum(y['time'] <= 0)}")
|
||||
|
||||
# Check for missing values
|
||||
if np.any(~np.isfinite(y['time'])):
|
||||
print("WARNING: Found missing survival times")
|
||||
print(f"Missing times: {np.sum(~np.isfinite(y['time']))}")
|
||||
|
||||
# Censoring rate
|
||||
censor_rate = 1 - y['event'].mean()
|
||||
print(f"Censoring rate: {censor_rate:.2%}")
|
||||
|
||||
if censor_rate > 0.7:
|
||||
print("WARNING: High censoring rate (>70%)")
|
||||
print("Consider using Uno's C-index instead of Harrell's")
|
||||
|
||||
# Event rate
|
||||
print(f"Number of events: {y['event'].sum()}")
|
||||
print(f"Number of censored: {(~y['event']).sum()}")
|
||||
|
||||
# Time statistics
|
||||
print(f"Median time: {np.median(y['time']):.2f}")
|
||||
print(f"Time range: [{np.min(y['time']):.2f}, {np.max(y['time']):.2f}]")
|
||||
|
||||
# Use validation
|
||||
validate_survival_data(y)
|
||||
```
|
||||
|
||||
### Check for Sufficient Events
|
||||
|
||||
```python
|
||||
def check_events_per_feature(X, y, min_events_per_feature=10):
|
||||
"""
|
||||
Check if there are sufficient events per feature.
|
||||
Rule of thumb: at least 10 events per feature for Cox models.
|
||||
"""
|
||||
n_events = y['event'].sum()
|
||||
n_features = X.shape[1]
|
||||
events_per_feature = n_events / n_features
|
||||
|
||||
print(f"Number of events: {n_events}")
|
||||
print(f"Number of features: {n_features}")
|
||||
print(f"Events per feature: {events_per_feature:.1f}")
|
||||
|
||||
if events_per_feature < min_events_per_feature:
|
||||
print(f"WARNING: Low events per feature ratio (<{min_events_per_feature})")
|
||||
print("Consider:")
|
||||
print(" - Feature selection")
|
||||
print(" - Regularization (CoxnetSurvivalAnalysis)")
|
||||
print(" - Collecting more data")
|
||||
|
||||
return events_per_feature
|
||||
|
||||
# Use check
|
||||
check_events_per_feature(X, y)
|
||||
```
|
||||
|
||||
## Train-Test Split
|
||||
|
||||
### Random Split
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import train_test_split
|
||||
|
||||
# Split data
|
||||
X_train, X_test, y_train, y_test = train_test_split(
|
||||
X, y, test_size=0.2, random_state=42
|
||||
)
|
||||
```
|
||||
|
||||
### Stratified Split
|
||||
|
||||
Ensure similar censoring rates and time distributions:
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import train_test_split
|
||||
|
||||
# Create stratification labels
|
||||
# Stratify by event status and time quartiles
|
||||
time_quartiles = pd.qcut(y['time'], q=4, labels=False)
|
||||
strat_labels = y['event'].astype(int) * 10 + time_quartiles
|
||||
|
||||
# Stratified split
|
||||
X_train, X_test, y_train, y_test = train_test_split(
|
||||
X, y, test_size=0.2, stratify=strat_labels, random_state=42
|
||||
)
|
||||
|
||||
# Verify similar distributions
|
||||
print("Training set:")
|
||||
print(f" Censoring rate: {1 - y_train['event'].mean():.2%}")
|
||||
print(f" Median time: {np.median(y_train['time']):.2f}")
|
||||
|
||||
print("Test set:")
|
||||
print(f" Censoring rate: {1 - y_test['event'].mean():.2%}")
|
||||
print(f" Median time: {np.median(y_test['time']):.2f}")
|
||||
```
|
||||
|
||||
## Working with Time-Varying Covariates
|
||||
|
||||
Note: scikit-survival doesn't directly support time-varying covariates. For such data, consider:
|
||||
1. Time-stratified analysis
|
||||
2. Landmarking approach
|
||||
3. Using other packages (e.g., lifelines)
|
||||
|
||||
## Summary: Complete Data Preparation Workflow
|
||||
|
||||
```python
|
||||
from sksurv.util import Surv
|
||||
from sklearn.model_selection import train_test_split
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sksurv.preprocessing import encode_categorical
|
||||
import pandas as pd
|
||||
import numpy as np
|
||||
|
||||
# 1. Load data
|
||||
df = pd.read_csv('data.csv')
|
||||
|
||||
# 2. Create survival outcome
|
||||
y = Surv.from_dataframe('event', 'time', df)
|
||||
|
||||
# 3. Prepare features
|
||||
X = df.drop(['event', 'time'], axis=1)
|
||||
|
||||
# 4. Validate data
|
||||
validate_survival_data(y)
|
||||
check_events_per_feature(X, y)
|
||||
|
||||
# 5. Handle missing values
|
||||
X = X.fillna(X.median())
|
||||
|
||||
# 6. Split data
|
||||
X_train, X_test, y_train, y_test = train_test_split(
|
||||
X, y, test_size=0.2, random_state=42
|
||||
)
|
||||
|
||||
# 7. Encode categorical variables
|
||||
X_train = encode_categorical(X_train)
|
||||
X_test = encode_categorical(X_test)
|
||||
|
||||
# 8. Standardize
|
||||
scaler = StandardScaler()
|
||||
X_train_scaled = scaler.fit_transform(X_train)
|
||||
X_test_scaled = scaler.transform(X_test)
|
||||
|
||||
# Convert back to DataFrames
|
||||
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
|
||||
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)
|
||||
|
||||
# Now ready for modeling!
|
||||
```
|
||||
327
skills/scikit-survival/references/ensemble-models.md
Normal file
327
skills/scikit-survival/references/ensemble-models.md
Normal file
@@ -0,0 +1,327 @@
|
||||
# Ensemble Models for Survival Analysis
|
||||
|
||||
## Random Survival Forests
|
||||
|
||||
### Overview
|
||||
|
||||
Random Survival Forests extend the random forest algorithm to survival analysis with censored data. They build multiple decision trees on bootstrap samples and aggregate predictions.
|
||||
|
||||
### How They Work
|
||||
|
||||
1. **Bootstrap Sampling**: Each tree is built on a different bootstrap sample of the training data
|
||||
2. **Feature Randomness**: At each node, only a random subset of features is considered for splitting
|
||||
3. **Survival Function Estimation**: At terminal nodes, Kaplan-Meier and Nelson-Aalen estimators compute survival functions
|
||||
4. **Ensemble Aggregation**: Final predictions average survival functions across all trees
|
||||
|
||||
### When to Use
|
||||
|
||||
- Complex non-linear relationships between features and survival
|
||||
- No assumptions about functional form needed
|
||||
- Want robust predictions with minimal tuning
|
||||
- Need feature importance estimates
|
||||
- Have sufficient sample size (typically n > 100)
|
||||
|
||||
### Key Parameters
|
||||
|
||||
- `n_estimators`: Number of trees (default: 100)
|
||||
- More trees = more stable predictions but slower
|
||||
- Typical range: 100-1000
|
||||
|
||||
- `max_depth`: Maximum depth of trees
|
||||
- Controls tree complexity
|
||||
- None = nodes expanded until pure or min_samples_split
|
||||
|
||||
- `min_samples_split`: Minimum samples to split a node (default: 6)
|
||||
- Larger values = more regularization
|
||||
|
||||
- `min_samples_leaf`: Minimum samples at leaf nodes (default: 3)
|
||||
- Prevents overfitting to small groups
|
||||
|
||||
- `max_features`: Number of features to consider at each split
|
||||
- 'sqrt': sqrt(n_features) - good default
|
||||
- 'log2': log2(n_features)
|
||||
- None: all features
|
||||
|
||||
- `n_jobs`: Number of parallel jobs (-1 uses all processors)
|
||||
|
||||
### Example Usage
|
||||
|
||||
```python
|
||||
from sksurv.ensemble import RandomSurvivalForest
|
||||
from sksurv.datasets import load_breast_cancer
|
||||
|
||||
# Load data
|
||||
X, y = load_breast_cancer()
|
||||
|
||||
# Fit Random Survival Forest
|
||||
rsf = RandomSurvivalForest(n_estimators=1000,
|
||||
min_samples_split=10,
|
||||
min_samples_leaf=15,
|
||||
max_features="sqrt",
|
||||
n_jobs=-1,
|
||||
random_state=42)
|
||||
rsf.fit(X, y)
|
||||
|
||||
# Predict risk scores
|
||||
risk_scores = rsf.predict(X)
|
||||
|
||||
# Predict survival functions
|
||||
surv_funcs = rsf.predict_survival_function(X)
|
||||
|
||||
# Predict cumulative hazard functions
|
||||
chf_funcs = rsf.predict_cumulative_hazard_function(X)
|
||||
```
|
||||
|
||||
### Feature Importance
|
||||
|
||||
**Important**: Built-in feature importance based on split impurity is not reliable for survival data. Use permutation-based feature importance instead.
|
||||
|
||||
```python
|
||||
from sklearn.inspection import permutation_importance
|
||||
from sksurv.metrics import concordance_index_censored
|
||||
|
||||
# Define scoring function
|
||||
def score_survival_model(model, X, y):
|
||||
prediction = model.predict(X)
|
||||
result = concordance_index_censored(y['event'], y['time'], prediction)
|
||||
return result[0]
|
||||
|
||||
# Compute permutation importance
|
||||
perm_importance = permutation_importance(
|
||||
rsf, X, y,
|
||||
n_repeats=10,
|
||||
random_state=42,
|
||||
scoring=score_survival_model
|
||||
)
|
||||
|
||||
# Get feature importance
|
||||
feature_importance = perm_importance.importances_mean
|
||||
```
|
||||
|
||||
## Gradient Boosting Survival Analysis
|
||||
|
||||
### Overview
|
||||
|
||||
Gradient boosting builds an ensemble by sequentially adding weak learners that correct errors of previous learners. The model is: **f(x) = Σ β_m g(x; θ_m)**
|
||||
|
||||
### Model Types
|
||||
|
||||
#### GradientBoostingSurvivalAnalysis
|
||||
|
||||
Uses regression trees as base learners. Can capture complex non-linear relationships.
|
||||
|
||||
**When to Use:**
|
||||
- Need to model complex non-linear relationships
|
||||
- Want high predictive performance
|
||||
- Have sufficient data to avoid overfitting
|
||||
- Can tune hyperparameters carefully
|
||||
|
||||
#### ComponentwiseGradientBoostingSurvivalAnalysis
|
||||
|
||||
Uses component-wise least squares as base learners. Produces linear models with automatic feature selection.
|
||||
|
||||
**When to Use:**
|
||||
- Want interpretable linear model
|
||||
- Need automatic feature selection (like Lasso)
|
||||
- Have high-dimensional data
|
||||
- Prefer sparse models
|
||||
|
||||
### Loss Functions
|
||||
|
||||
#### Cox's Partial Likelihood (default)
|
||||
|
||||
Maintains proportional hazards framework but replaces linear model with additive ensemble model.
|
||||
|
||||
**Appropriate for:**
|
||||
- Standard survival analysis settings
|
||||
- When proportional hazards is reasonable
|
||||
- Most use cases
|
||||
|
||||
#### Accelerated Failure Time (AFT)
|
||||
|
||||
Assumes features accelerate or decelerate survival time by a constant factor. Loss function: **(1/n) Σ ω_i (log y_i - f(x_i))²**
|
||||
|
||||
**Appropriate for:**
|
||||
- AFT framework preferred over proportional hazards
|
||||
- Want to model time directly
|
||||
- Need to interpret effects on survival time
|
||||
|
||||
### Regularization Strategies
|
||||
|
||||
Three main techniques prevent overfitting:
|
||||
|
||||
1. **Learning Rate** (`learning_rate < 1`)
|
||||
- Shrinks contribution of each base learner
|
||||
- Smaller values need more iterations but better generalization
|
||||
- Typical range: 0.01 - 0.1
|
||||
|
||||
2. **Dropout** (`dropout_rate > 0`)
|
||||
- Randomly drops previous learners during training
|
||||
- Forces learners to be more robust
|
||||
- Typical range: 0.01 - 0.2
|
||||
|
||||
3. **Subsampling** (`subsample < 1`)
|
||||
- Uses random subset of data for each iteration
|
||||
- Adds randomness and reduces overfitting
|
||||
- Typical range: 0.5 - 0.9
|
||||
|
||||
**Recommendation**: Combine small learning rate with early stopping for best performance.
|
||||
|
||||
### Key Parameters
|
||||
|
||||
- `loss`: Loss function ('coxph' or 'ipcwls')
|
||||
- `learning_rate`: Shrinks contribution of each tree (default: 0.1)
|
||||
- `n_estimators`: Number of boosting iterations (default: 100)
|
||||
- `subsample`: Fraction of samples for each iteration (default: 1.0)
|
||||
- `dropout_rate`: Dropout rate for learners (default: 0.0)
|
||||
- `max_depth`: Maximum depth of trees (default: 3)
|
||||
- `min_samples_split`: Minimum samples to split node (default: 2)
|
||||
- `min_samples_leaf`: Minimum samples at leaf (default: 1)
|
||||
- `max_features`: Features to consider at each split
|
||||
|
||||
### Example Usage
|
||||
|
||||
```python
|
||||
from sksurv.ensemble import GradientBoostingSurvivalAnalysis
|
||||
from sklearn.model_selection import train_test_split
|
||||
|
||||
# Split data
|
||||
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
|
||||
|
||||
# Fit gradient boosting model
|
||||
gbs = GradientBoostingSurvivalAnalysis(
|
||||
loss='coxph',
|
||||
learning_rate=0.05,
|
||||
n_estimators=200,
|
||||
subsample=0.8,
|
||||
dropout_rate=0.1,
|
||||
max_depth=3,
|
||||
random_state=42
|
||||
)
|
||||
gbs.fit(X_train, y_train)
|
||||
|
||||
# Predict risk scores
|
||||
risk_scores = gbs.predict(X_test)
|
||||
|
||||
# Predict survival functions
|
||||
surv_funcs = gbs.predict_survival_function(X_test)
|
||||
|
||||
# Predict cumulative hazard functions
|
||||
chf_funcs = gbs.predict_cumulative_hazard_function(X_test)
|
||||
```
|
||||
|
||||
### Early Stopping
|
||||
|
||||
Use validation set to prevent overfitting:
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import train_test_split
|
||||
|
||||
# Create train/validation split
|
||||
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
|
||||
|
||||
# Fit with early stopping
|
||||
gbs = GradientBoostingSurvivalAnalysis(
|
||||
n_estimators=1000,
|
||||
learning_rate=0.01,
|
||||
max_depth=3,
|
||||
validation_fraction=0.2,
|
||||
n_iter_no_change=10,
|
||||
random_state=42
|
||||
)
|
||||
gbs.fit(X_tr, y_tr)
|
||||
|
||||
# Number of iterations used
|
||||
print(f"Used {gbs.n_estimators_} iterations")
|
||||
```
|
||||
|
||||
### Hyperparameter Tuning
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import GridSearchCV
|
||||
|
||||
param_grid = {
|
||||
'learning_rate': [0.01, 0.05, 0.1],
|
||||
'n_estimators': [100, 200, 300],
|
||||
'max_depth': [3, 5, 7],
|
||||
'subsample': [0.8, 1.0]
|
||||
}
|
||||
|
||||
cv = GridSearchCV(
|
||||
GradientBoostingSurvivalAnalysis(),
|
||||
param_grid,
|
||||
scoring='concordance_index_ipcw',
|
||||
cv=5,
|
||||
n_jobs=-1
|
||||
)
|
||||
cv.fit(X, y)
|
||||
|
||||
best_model = cv.best_estimator_
|
||||
```
|
||||
|
||||
## ComponentwiseGradientBoostingSurvivalAnalysis
|
||||
|
||||
### Overview
|
||||
|
||||
Uses component-wise least squares, producing sparse linear models with automatic feature selection similar to Lasso.
|
||||
|
||||
### When to Use
|
||||
|
||||
- Want interpretable linear model
|
||||
- Need automatic feature selection
|
||||
- Have high-dimensional data with many irrelevant features
|
||||
- Prefer coefficient-based interpretation
|
||||
|
||||
### Example Usage
|
||||
|
||||
```python
|
||||
from sksurv.ensemble import ComponentwiseGradientBoostingSurvivalAnalysis
|
||||
|
||||
# Fit componentwise boosting
|
||||
cgbs = ComponentwiseGradientBoostingSurvivalAnalysis(
|
||||
loss='coxph',
|
||||
learning_rate=0.1,
|
||||
n_estimators=100
|
||||
)
|
||||
cgbs.fit(X, y)
|
||||
|
||||
# Get selected features and coefficients
|
||||
coef = cgbs.coef_
|
||||
selected_features = [i for i, c in enumerate(coef) if c != 0]
|
||||
```
|
||||
|
||||
## ExtraSurvivalTrees
|
||||
|
||||
Extremely randomized survival trees - similar to Random Survival Forest but with additional randomness in split selection.
|
||||
|
||||
### When to Use
|
||||
|
||||
- Want even more regularization than Random Survival Forest
|
||||
- Have limited data
|
||||
- Need faster training
|
||||
|
||||
### Key Difference
|
||||
|
||||
Instead of finding the best split for selected features, it randomly selects split points, adding more diversity to the ensemble.
|
||||
|
||||
```python
|
||||
from sksurv.ensemble import ExtraSurvivalTrees
|
||||
|
||||
est = ExtraSurvivalTrees(n_estimators=100, random_state=42)
|
||||
est.fit(X, y)
|
||||
```
|
||||
|
||||
## Model Comparison
|
||||
|
||||
| Model | Complexity | Interpretability | Performance | Speed |
|
||||
|-------|-----------|------------------|-------------|-------|
|
||||
| Random Survival Forest | Medium | Low | High | Medium |
|
||||
| GradientBoostingSurvivalAnalysis | High | Low | Highest | Slow |
|
||||
| ComponentwiseGradientBoostingSurvivalAnalysis | Low | High | Medium | Fast |
|
||||
| ExtraSurvivalTrees | Medium | Low | Medium-High | Fast |
|
||||
|
||||
**General Recommendations:**
|
||||
- **Best overall performance**: GradientBoostingSurvivalAnalysis with tuning
|
||||
- **Best balance**: RandomSurvivalForest
|
||||
- **Best interpretability**: ComponentwiseGradientBoostingSurvivalAnalysis
|
||||
- **Fastest training**: ExtraSurvivalTrees
|
||||
378
skills/scikit-survival/references/evaluation-metrics.md
Normal file
378
skills/scikit-survival/references/evaluation-metrics.md
Normal file
@@ -0,0 +1,378 @@
|
||||
# Evaluation Metrics for Survival Models
|
||||
|
||||
## Overview
|
||||
|
||||
Evaluating survival models requires specialized metrics that account for censored data. scikit-survival provides three main categories of metrics:
|
||||
1. Concordance Index (C-index)
|
||||
2. Time-dependent ROC and AUC
|
||||
3. Brier Score
|
||||
|
||||
## Concordance Index (C-index)
|
||||
|
||||
### What It Measures
|
||||
|
||||
The concordance index measures the rank correlation between predicted risk scores and observed event times. It represents the probability that, for a random pair of subjects, the model correctly orders their survival times.
|
||||
|
||||
**Range**: 0 to 1
|
||||
- 0.5 = random predictions
|
||||
- 1.0 = perfect concordance
|
||||
- Typical good performance: 0.7-0.8
|
||||
|
||||
### Two Implementations
|
||||
|
||||
#### Harrell's C-index (concordance_index_censored)
|
||||
|
||||
The traditional estimator, simpler but has limitations.
|
||||
|
||||
**When to Use:**
|
||||
- Low censoring rates (< 40%)
|
||||
- Quick evaluation during development
|
||||
- Comparing models on same dataset
|
||||
|
||||
**Limitations:**
|
||||
- Becomes increasingly biased with high censoring rates
|
||||
- Overestimates performance starting at approximately 49% censoring
|
||||
|
||||
```python
|
||||
from sksurv.metrics import concordance_index_censored
|
||||
|
||||
# Compute Harrell's C-index
|
||||
result = concordance_index_censored(y_test['event'], y_test['time'], risk_scores)
|
||||
c_index = result[0]
|
||||
print(f"Harrell's C-index: {c_index:.3f}")
|
||||
```
|
||||
|
||||
#### Uno's C-index (concordance_index_ipcw)
|
||||
|
||||
Inverse probability of censoring weighted (IPCW) estimator that corrects for censoring bias.
|
||||
|
||||
**When to Use:**
|
||||
- Moderate to high censoring rates (> 40%)
|
||||
- Need unbiased estimates
|
||||
- Comparing models across different datasets
|
||||
- Publishing results (more robust)
|
||||
|
||||
**Advantages:**
|
||||
- Remains stable even with high censoring
|
||||
- More reliable estimates
|
||||
- Less biased
|
||||
|
||||
```python
|
||||
from sksurv.metrics import concordance_index_ipcw
|
||||
|
||||
# Compute Uno's C-index
|
||||
# Requires training data for IPCW calculation
|
||||
c_index, concordant, discordant, tied_risk = concordance_index_ipcw(
|
||||
y_train, y_test, risk_scores
|
||||
)
|
||||
print(f"Uno's C-index: {c_index:.3f}")
|
||||
```
|
||||
|
||||
### Choosing Between Harrell's and Uno's
|
||||
|
||||
**Use Uno's C-index when:**
|
||||
- Censoring rate > 40%
|
||||
- Need most accurate estimates
|
||||
- Comparing models from different studies
|
||||
- Publishing research
|
||||
|
||||
**Use Harrell's C-index when:**
|
||||
- Low censoring rates
|
||||
- Quick model comparisons during development
|
||||
- Computational efficiency is critical
|
||||
|
||||
### Example Comparison
|
||||
|
||||
```python
|
||||
from sksurv.metrics import concordance_index_censored, concordance_index_ipcw
|
||||
|
||||
# Harrell's C-index
|
||||
harrell = concordance_index_censored(y_test['event'], y_test['time'], risk_scores)[0]
|
||||
|
||||
# Uno's C-index
|
||||
uno = concordance_index_ipcw(y_train, y_test, risk_scores)[0]
|
||||
|
||||
print(f"Harrell's C-index: {harrell:.3f}")
|
||||
print(f"Uno's C-index: {uno:.3f}")
|
||||
```
|
||||
|
||||
## Time-Dependent ROC and AUC
|
||||
|
||||
### What It Measures
|
||||
|
||||
Time-dependent AUC evaluates model discrimination at specific time points. It distinguishes subjects who experience events by time *t* from those who don't.
|
||||
|
||||
**Question answered**: "How well does the model predict who will have an event by time t?"
|
||||
|
||||
### When to Use
|
||||
|
||||
- Predicting event occurrence within specific time windows
|
||||
- Clinical decision-making at specific timepoints (e.g., 5-year survival)
|
||||
- Want to evaluate performance across different time horizons
|
||||
- Need both discrimination and timing information
|
||||
|
||||
### Key Function: cumulative_dynamic_auc
|
||||
|
||||
```python
|
||||
from sksurv.metrics import cumulative_dynamic_auc
|
||||
|
||||
# Define evaluation times
|
||||
times = [365, 730, 1095, 1460, 1825] # 1, 2, 3, 4, 5 years
|
||||
|
||||
# Compute time-dependent AUC
|
||||
auc, mean_auc = cumulative_dynamic_auc(
|
||||
y_train, y_test, risk_scores, times
|
||||
)
|
||||
|
||||
# Plot AUC over time
|
||||
import matplotlib.pyplot as plt
|
||||
plt.plot(times, auc, marker='o')
|
||||
plt.xlabel('Time (days)')
|
||||
plt.ylabel('Time-dependent AUC')
|
||||
plt.title('Model Discrimination Over Time')
|
||||
plt.show()
|
||||
|
||||
print(f"Mean AUC: {mean_auc:.3f}")
|
||||
```
|
||||
|
||||
### Interpretation
|
||||
|
||||
- **AUC at time t**: Probability model correctly ranks a subject who had event by time t above one who didn't
|
||||
- **Varying AUC over time**: Indicates model performance changes with time horizon
|
||||
- **Mean AUC**: Overall summary of discrimination across all time points
|
||||
|
||||
### Example: Comparing Models
|
||||
|
||||
```python
|
||||
# Compare two models
|
||||
auc1, mean_auc1 = cumulative_dynamic_auc(y_train, y_test, risk_scores1, times)
|
||||
auc2, mean_auc2 = cumulative_dynamic_auc(y_train, y_test, risk_scores2, times)
|
||||
|
||||
plt.plot(times, auc1, marker='o', label='Model 1')
|
||||
plt.plot(times, auc2, marker='s', label='Model 2')
|
||||
plt.xlabel('Time (days)')
|
||||
plt.ylabel('Time-dependent AUC')
|
||||
plt.legend()
|
||||
plt.show()
|
||||
```
|
||||
|
||||
## Brier Score
|
||||
|
||||
### What It Measures
|
||||
|
||||
Brier score extends mean squared error to survival data with censoring. It measures both discrimination (ranking) and calibration (accuracy of predicted probabilities).
|
||||
|
||||
**Formula**: **(1/n) Σ (S(t|x_i) - I(T_i > t))²**
|
||||
|
||||
where S(t|x_i) is predicted survival probability at time t for subject i.
|
||||
|
||||
**Range**: 0 to 1
|
||||
- 0 = perfect predictions
|
||||
- Lower is better
|
||||
- Typical good performance: < 0.2
|
||||
|
||||
### When to Use
|
||||
|
||||
- Need calibration assessment (not just ranking)
|
||||
- Want to evaluate predicted probabilities, not just risk scores
|
||||
- Comparing models that output survival functions
|
||||
- Clinical applications requiring probability estimates
|
||||
|
||||
### Key Functions
|
||||
|
||||
#### brier_score: Single Time Point
|
||||
|
||||
```python
|
||||
from sksurv.metrics import brier_score
|
||||
|
||||
# Compute Brier score at specific time
|
||||
time_point = 1825 # 5 years
|
||||
surv_probs = model.predict_survival_function(X_test)
|
||||
# Extract survival probability at time_point for each subject
|
||||
surv_at_t = [fn(time_point) for fn in surv_probs]
|
||||
|
||||
bs = brier_score(y_train, y_test, surv_at_t, time_point)[1]
|
||||
print(f"Brier score at {time_point} days: {bs:.3f}")
|
||||
```
|
||||
|
||||
#### integrated_brier_score: Summary Across Time
|
||||
|
||||
```python
|
||||
from sksurv.metrics import integrated_brier_score
|
||||
|
||||
# Compute integrated Brier score
|
||||
times = [365, 730, 1095, 1460, 1825]
|
||||
surv_probs = model.predict_survival_function(X_test)
|
||||
|
||||
ibs = integrated_brier_score(y_train, y_test, surv_probs, times)
|
||||
print(f"Integrated Brier Score: {ibs:.3f}")
|
||||
```
|
||||
|
||||
### Interpretation
|
||||
|
||||
- **Brier score at time t**: Expected squared difference between predicted and actual survival at time t
|
||||
- **Integrated Brier Score**: Weighted average of Brier scores across time
|
||||
- **Lower values = better predictions**
|
||||
|
||||
### Comparison with Null Model
|
||||
|
||||
Always compare against a baseline (e.g., Kaplan-Meier):
|
||||
|
||||
```python
|
||||
from sksurv.nonparametric import kaplan_meier_estimator
|
||||
|
||||
# Compute Kaplan-Meier baseline
|
||||
time_km, surv_km = kaplan_meier_estimator(y_train['event'], y_train['time'])
|
||||
|
||||
# Predict with KM for each test subject
|
||||
surv_km_test = [surv_km[time_km <= time_point][-1] if any(time_km <= time_point) else 1.0
|
||||
for _ in range(len(X_test))]
|
||||
|
||||
bs_km = brier_score(y_train, y_test, surv_km_test, time_point)[1]
|
||||
bs_model = brier_score(y_train, y_test, surv_at_t, time_point)[1]
|
||||
|
||||
print(f"Kaplan-Meier Brier Score: {bs_km:.3f}")
|
||||
print(f"Model Brier Score: {bs_model:.3f}")
|
||||
print(f"Improvement: {(bs_km - bs_model) / bs_km * 100:.1f}%")
|
||||
```

## Using Metrics with Cross-Validation

### Concordance Index Scorer

```python
from sklearn.model_selection import cross_val_score
from sksurv.metrics import as_concordance_index_ipcw_scorer

# Wrap the estimator so that its score method computes Uno's C-index
wrapped_model = as_concordance_index_ipcw_scorer(model)

# Perform cross-validation; the wrapper's score method is used by default
scores = cross_val_score(wrapped_model, X, y, cv=5)
print(f"Mean C-index: {scores.mean():.3f} (±{scores.std():.3f})")
```

### Integrated Brier Score Scorer

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sksurv.metrics import as_integrated_brier_score_scorer

# Define time points for evaluation (quartiles of the observed event times)
times = np.percentile(y['time'][y['event']], [25, 50, 75])

# Wrap the estimator; the model must implement predict_survival_function
wrapped_model = as_integrated_brier_score_scorer(model, times)

# Perform cross-validation; the score is the negative IBS, so higher is better
scores = cross_val_score(wrapped_model, X, y, cv=5)
print(f"Mean negative IBS: {scores.mean():.3f} (±{scores.std():.3f})")
```

## Model Selection with GridSearchCV

```python
from sklearn.model_selection import GridSearchCV
from sksurv.ensemble import RandomSurvivalForest
from sksurv.metrics import as_concordance_index_ipcw_scorer

# Parameter names gain the 'estimator__' prefix because the model is wrapped
param_grid = {
    'estimator__n_estimators': [100, 200, 300],
    'estimator__min_samples_split': [10, 20, 30],
    'estimator__max_depth': [None, 10, 20]
}

# Wrap the estimator so that its score method computes Uno's C-index
wrapped_model = as_concordance_index_ipcw_scorer(RandomSurvivalForest(random_state=42))

# Perform grid search
cv = GridSearchCV(
    wrapped_model,
    param_grid,
    cv=5,
    n_jobs=-1
)
cv.fit(X, y)

print(f"Best parameters: {cv.best_params_}")
print(f"Best C-index: {cv.best_score_:.3f}")
```

## Comprehensive Model Evaluation

### Recommended Evaluation Pipeline

```python
import numpy as np
from sksurv.metrics import (
    concordance_index_censored,
    concordance_index_ipcw,
    cumulative_dynamic_auc,
    integrated_brier_score
)

def evaluate_survival_model(model, X_train, X_test, y_train, y_test):
    """Comprehensive evaluation of a fitted survival model."""

    # Get predictions
    risk_scores = model.predict(X_test)
    surv_funcs = model.predict_survival_function(X_test)

    # 1. Concordance index (both versions)
    c_harrell = concordance_index_censored(y_test['event'], y_test['time'], risk_scores)[0]
    c_uno = concordance_index_ipcw(y_train, y_test, risk_scores)[0]

    # 2. Time-dependent AUC at the quartiles of observed event times
    times = np.percentile(y_test['time'][y_test['event']], [25, 50, 75])
    auc, mean_auc = cumulative_dynamic_auc(y_train, y_test, risk_scores, times)

    # 3. Integrated Brier score (survival functions evaluated on the time grid)
    preds = np.asarray([[fn(t) for t in times] for fn in surv_funcs])
    ibs = integrated_brier_score(y_train, y_test, preds, times)

    # Print results
    print("=" * 50)
    print("Model Evaluation Results")
    print("=" * 50)
    print(f"Harrell's C-index: {c_harrell:.3f}")
    print(f"Uno's C-index: {c_uno:.3f}")
    print(f"Mean AUC: {mean_auc:.3f}")
    print(f"Integrated Brier: {ibs:.3f}")
    print("=" * 50)

    return {
        'c_harrell': c_harrell,
        'c_uno': c_uno,
        'mean_auc': mean_auc,
        'ibs': ibs,
        'time_auc': dict(zip(times, auc))
    }

# Use the evaluation function
results = evaluate_survival_model(model, X_train, X_test, y_train, y_test)
```

## Choosing the Right Metric

### Decision Guide

**Use C-index (Uno's) when:**
- Primary goal is ranking/discrimination
- Don't need calibrated probabilities
- Standard survival analysis setting
- Most common choice

**Use Time-dependent AUC when:**
- Need discrimination at specific time points
- Clinical decisions at specific horizons
- Want to understand how performance varies over time

**Use Brier Score when:**
- Need calibrated probability estimates
- Both discrimination AND calibration important
- Clinical decision-making requiring probabilities
- Want comprehensive assessment

**Best Practice**: Report multiple metrics for comprehensive evaluation. At minimum, report:
- Uno's C-index (discrimination)
- Integrated Brier Score (discrimination + calibration)
- Time-dependent AUC at clinically relevant time points

411
skills/scikit-survival/references/svm-models.md
Normal file
@@ -0,0 +1,411 @@

# Survival Support Vector Machines

## Overview

Survival Support Vector Machines (SVMs) adapt the traditional SVM framework to survival analysis with censored data. They optimize a ranking objective that encourages correct ordering of survival times.

### Core Idea

SVMs for survival analysis learn a function f(x) that produces risk scores, where the optimization ensures that subjects with shorter survival times receive higher risk scores than those with longer times.
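
The ranking objective can be made concrete with a toy check on hypothetical predictions; the times, event indicators, and scores below are made up for illustration.

```python
import numpy as np

# A pair (i, j) is comparable when subject i has an observed event and a
# shorter time than subject j; a good model gives i the higher risk score.
time = np.array([5.0, 8.0, 12.0])
event = np.array([True, True, False])
risk = np.array([2.1, 1.0, 0.3])   # hypothetical f(x) values

for i in range(len(time)):
    for j in range(len(time)):
        if event[i] and time[i] < time[j]:
            print(f"pair ({i}, {j}): correctly ordered = {risk[i] > risk[j]}")
```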

## When to Use Survival SVMs

**Appropriate for:**
- Medium-sized datasets (typically 100-10,000 samples)
- Need for non-linear decision boundaries (kernel SVMs)
- Want margin-based learning with regularization
- Have well-defined feature space

**Not ideal for:**
- Very large datasets (>100,000 samples) - ensemble methods may be faster
- Need interpretable coefficients - use Cox models instead
- Require survival function estimates - use Random Survival Forest
- Very high dimensional data - use regularized Cox or gradient boosting

## Model Types

### FastSurvivalSVM

Linear survival SVM with an efficient training algorithm that avoids explicitly enumerating all comparable pairs, which keeps training fast on larger datasets.

**When to Use:**
- Linear relationships expected
- Large datasets where speed matters
- Want fast training and prediction

**Key Parameters:**
- `alpha`: Regularization parameter (default: 1.0)
  - Higher = more regularization
- `rank_ratio`: Trade-off between ranking and regression objectives (default: 1.0 = pure ranking)
- `max_iter`: Maximum iterations (default: 20)
- `tol`: Tolerance for stopping criterion (default: 1e-5)

```python
from sksurv.svm import FastSurvivalSVM

# Fit linear survival SVM
estimator = FastSurvivalSVM(alpha=1.0, max_iter=100, tol=1e-5, random_state=42)
estimator.fit(X, y)

# Predict risk scores
risk_scores = estimator.predict(X_test)
```

### FastKernelSurvivalSVM

Kernel survival SVM for non-linear relationships.

**When to Use:**
- Non-linear relationships between features and survival
- Medium-sized datasets
- Can afford longer training time for better performance

**Kernel Options:**
- `'linear'`: Linear kernel, equivalent to FastSurvivalSVM
- `'poly'`: Polynomial kernel
- `'rbf'`: Radial basis function (Gaussian) kernel - most common
- `'sigmoid'`: Sigmoid kernel
- Custom kernel function

**Key Parameters:**
- `alpha`: Regularization parameter (default: 1.0)
- `kernel`: Kernel function (default: 'rbf')
- `gamma`: Kernel coefficient for rbf, poly, sigmoid
- `degree`: Degree for polynomial kernel
- `coef0`: Independent term for poly and sigmoid
- `rank_ratio`: Trade-off parameter (default: 1.0)
- `max_iter`: Maximum iterations (default: 20)

```python
from sksurv.svm import FastKernelSurvivalSVM

# Fit RBF kernel survival SVM
estimator = FastKernelSurvivalSVM(
    alpha=1.0,
    kernel='rbf',
    gamma='scale',
    max_iter=50,
    random_state=42
)
estimator.fit(X, y)

# Predict risk scores
risk_scores = estimator.predict(X_test)
```

### HingeLossSurvivalSVM

Survival SVM using hinge loss, more similar to classification SVM.

**When to Use:**
- Want hinge loss instead of squared hinge
- Sparse solutions desired
- Similar behavior to classification SVMs

**Key Parameters:**
- `alpha`: Regularization parameter
- `fit_intercept`: Whether to fit intercept term (default: False)

```python
from sksurv.svm import HingeLossSurvivalSVM

# Fit hinge loss SVM
estimator = HingeLossSurvivalSVM(alpha=1.0, fit_intercept=False, random_state=42)
estimator.fit(X, y)

# Predict risk scores
risk_scores = estimator.predict(X_test)
```

### NaiveSurvivalSVM

Original formulation of survival SVM using quadratic programming.

**When to Use:**
- Small datasets
- Research/benchmarking purposes
- Other methods don't converge

**Limitations:**
- Slower than Fast variants
- Less scalable

```python
from sksurv.svm import NaiveSurvivalSVM

# Fit naive SVM (slower)
estimator = NaiveSurvivalSVM(alpha=1.0, random_state=42)
estimator.fit(X, y)

# Predict
risk_scores = estimator.predict(X_test)
```

### MinlipSurvivalAnalysis

Survival analysis model based on minimizing the Lipschitz constant of the prediction function (the MINLIP approach).

**When to Use:**
- Want a different optimization objective
- Research applications
- Alternative to standard survival SVMs

```python
from sksurv.svm import MinlipSurvivalAnalysis

# Fit Minlip model
estimator = MinlipSurvivalAnalysis(alpha=1.0, random_state=42)
estimator.fit(X, y)

# Predict
risk_scores = estimator.predict(X_test)
```

## Hyperparameter Tuning

### Tuning Alpha (Regularization)

```python
from sklearn.model_selection import GridSearchCV
from sksurv.metrics import as_concordance_index_ipcw_scorer
from sksurv.svm import FastSurvivalSVM

# Parameter names gain the 'estimator__' prefix because the model is wrapped
param_grid = {
    'estimator__alpha': [0.1, 0.5, 1.0, 5.0, 10.0, 50.0]
}

# Grid search with Uno's C-index as the selection criterion
cv = GridSearchCV(
    as_concordance_index_ipcw_scorer(FastSurvivalSVM()),
    param_grid,
    cv=5,
    n_jobs=-1
)
cv.fit(X, y)

print(f"Best alpha: {cv.best_params_['estimator__alpha']}")
print(f"Best C-index: {cv.best_score_:.3f}")
```

### Tuning Kernel Parameters

```python
from sklearn.model_selection import GridSearchCV
from sksurv.metrics import as_concordance_index_ipcw_scorer
from sksurv.svm import FastKernelSurvivalSVM

# Define parameter grid for the kernel SVM (note the 'estimator__' prefix)
param_grid = {
    'estimator__alpha': [0.1, 1.0, 10.0],
    'estimator__gamma': ['scale', 'auto', 0.001, 0.01, 0.1, 1.0]
}

# Grid search
cv = GridSearchCV(
    as_concordance_index_ipcw_scorer(FastKernelSurvivalSVM(kernel='rbf')),
    param_grid,
    cv=5,
    n_jobs=-1
)
cv.fit(X, y)

print(f"Best parameters: {cv.best_params_}")
print(f"Best C-index: {cv.best_score_:.3f}")
```

## Clinical Kernel Transform

### ClinicalKernelTransform

Kernel designed for clinical data that mixes continuous, ordinal, and nominal variables; each variable contributes a comparable similarity, which can improve predictions in medical applications and allows clinical covariates to be combined with kernels on molecular data.

**Use Case:**
- Have clinical variables of mixed type (age, stage, grade, etc.), possibly alongside high-dimensional molecular data (gene expression, genomics)
- Standard kernels (e.g., RBF on standardized features) treat every variable as continuous
- Want to integrate heterogeneous data types

**Key Parameters:**
- `fit_once`: Whether to fit the kernel once and reuse it during cross-validation (default: False)
- The kernel expects a pandas DataFrame; nominal columns should use the category dtype

A minimal sketch using the closely related `clinical_kernel` helper, which computes the same kernel as a precomputed matrix; the column names and train/test variables below are assumed for illustration:

```python
from sksurv.kernels import clinical_kernel
from sksurv.svm import FastKernelSurvivalSVM

# Clinical covariates as a pandas DataFrame (assumed column names); nominal
# columns such as stage and grade should already use the category dtype
clinical_cols = ['age', 'stage', 'grade']
X_clin_train = X_train[clinical_cols]
X_clin_test = X_test[clinical_cols]

# Precompute the clinical kernel on the training data and fit an SVM on it
kernel_train = clinical_kernel(X_clin_train)
estimator = FastKernelSurvivalSVM(kernel='precomputed', random_state=42)
estimator.fit(kernel_train, y_train)

# At prediction time, the kernel is computed between test and training samples
kernel_test = clinical_kernel(X_clin_test, X_clin_train)
risk_scores = estimator.predict(kernel_test)
```

`ClinicalKernelTransform` provides the same kernel as a scikit-learn transformer for use inside pipelines and grid search.

## Practical Examples

### Example 1: Linear SVM with Cross-Validation

```python
from sksurv.svm import FastSurvivalSVM
from sklearn.model_selection import cross_val_score
from sksurv.metrics import as_concordance_index_ipcw_scorer
from sklearn.preprocessing import StandardScaler

# Standardize features (important for SVMs!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create the model and wrap it so that score() computes Uno's C-index
svm = FastSurvivalSVM(alpha=1.0, max_iter=100, random_state=42)
wrapped_svm = as_concordance_index_ipcw_scorer(svm)

# Cross-validation
scores = cross_val_score(wrapped_svm, X_scaled, y, cv=5, n_jobs=-1)

print(f"Mean C-index: {scores.mean():.3f} (±{scores.std():.3f})")
```

### Example 2: Kernel SVM with Different Kernels

```python
from sksurv.svm import FastKernelSurvivalSVM
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sksurv.metrics import concordance_index_ipcw

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Compare different kernels
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
results = {}

for kernel in kernels:
    # Fit model
    svm = FastKernelSurvivalSVM(kernel=kernel, alpha=1.0, random_state=42)
    svm.fit(X_train_scaled, y_train)

    # Predict
    risk_scores = svm.predict(X_test_scaled)

    # Evaluate
    c_index = concordance_index_ipcw(y_train, y_test, risk_scores)[0]
    results[kernel] = c_index

    print(f"{kernel:10s}: C-index = {c_index:.3f}")

# Best kernel
best_kernel = max(results, key=results.get)
print(f"\nBest kernel: {best_kernel} (C-index = {results[best_kernel]:.3f})")
```

### Example 3: Full Pipeline with Hyperparameter Tuning

```python
from sksurv.svm import FastKernelSurvivalSVM
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sksurv.metrics import as_concordance_index_ipcw_scorer, concordance_index_ipcw

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', FastKernelSurvivalSVM(kernel='rbf'))
])

# Wrap the pipeline so that its score method computes Uno's C-index;
# parameter names therefore gain the 'estimator__' prefix
wrapped_pipeline = as_concordance_index_ipcw_scorer(pipeline)

# Define parameter grid
param_grid = {
    'estimator__svm__alpha': [0.1, 1.0, 10.0],
    'estimator__svm__gamma': ['scale', 0.01, 0.1, 1.0]
}

# Grid search
cv = GridSearchCV(
    wrapped_pipeline,
    param_grid,
    cv=5,
    n_jobs=-1,
    verbose=1
)
cv.fit(X_train, y_train)

# Best model
best_model = cv.best_estimator_
print(f"Best parameters: {cv.best_params_}")
print(f"Best CV C-index: {cv.best_score_:.3f}")

# Evaluate on the test set
risk_scores = best_model.predict(X_test)
c_index = concordance_index_ipcw(y_train, y_test, risk_scores)[0]
print(f"Test C-index: {c_index:.3f}")
```

## Important Considerations

### Feature Scaling

**CRITICAL**: Always standardize features before using SVMs!

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

### Computational Complexity

- **FastSurvivalSVM**: O(n × p) per iteration - fast
- **FastKernelSurvivalSVM**: O(n² × p) - slower, scales quadratically
- **NaiveSurvivalSVM**: O(n³) - very slow for large datasets

For large datasets (>10,000 samples), prefer:
- FastSurvivalSVM (linear)
- Gradient Boosting
- Random Survival Forest
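
As a rough, hardware-dependent illustration of how the linear and kernel variants scale, the following sketch times both on synthetic data; the sample size, feature count, and censoring rate are arbitrary choices.

```python
import time
import numpy as np
from sksurv.svm import FastSurvivalSVM, FastKernelSurvivalSVM
from sksurv.util import Surv

# Synthetic data: the kernel model must build an n-by-n kernel matrix,
# so its training cost grows much faster with n than the linear model's.
rng = np.random.default_rng(0)
n_samples, n_features = 2000, 20
X_syn = rng.normal(size=(n_samples, n_features))
y_syn = Surv.from_arrays(event=rng.random(n_samples) < 0.7,
                         time=rng.exponential(365.0, n_samples))

for model in (FastSurvivalSVM(max_iter=50, random_state=0),
              FastKernelSurvivalSVM(kernel='rbf', max_iter=50, random_state=0)):
    start = time.perf_counter()
    model.fit(X_syn, y_syn)
    print(f"{type(model).__name__}: {time.perf_counter() - start:.2f} s")
```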

### When SVMs May Not Be the Best Choice

- **Very large datasets**: Ensemble methods are faster
- **Need survival functions**: Use Random Survival Forest or Cox models
- **Need interpretability**: Use Cox models
- **Very high dimensional**: Use penalized Cox (Coxnet) or gradient boosting with feature selection

## Model Selection Guide

| Model | Speed | Non-linearity | Scalability | Interpretability |
|-------|-------|---------------|-------------|------------------|
| FastSurvivalSVM | Fast | No | High | Medium |
| FastKernelSurvivalSVM | Medium | Yes | Medium | Low |
| HingeLossSurvivalSVM | Fast | No | High | Medium |
| NaiveSurvivalSVM | Slow | No | Low | Medium |

**General Recommendations:**
- Start with **FastSurvivalSVM** for a baseline
- Try **FastKernelSurvivalSVM** with RBF if non-linearity is expected
- Use grid search to tune alpha and gamma
- Always standardize features
- Compare with Random Survival Forest and Gradient Boosting