Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/scikit-survival/references/competing-risks.md
+++ b/skills/scikit-survival/references/competing-risks.md
@@ -0,0 +1,397 @@
+# Competing Risks Analysis
+
+## Overview
+
+Competing risks occur when subjects can experience one of several mutually exclusive events (event types). When one event occurs, it prevents ("competes with") the occurrence of other events.
+
+### Examples of Competing Risks
+
+**Medical Research:**
+- Death from cancer vs. death from cardiovascular disease vs. death from other causes
+- Relapse vs. death without relapse in cancer studies
+- Different types of infections in transplant patients
+
+**Other Applications:**
+- Job termination: retirement vs. resignation vs. termination for cause
+- Equipment failure: different failure modes
+- Customer churn: different reasons for leaving
+
+### Key Concept: Cumulative Incidence Function (CIF)
+
+The **Cumulative Incidence Function (CIF)** represents the probability of experiencing a specific event type by time *t*, accounting for the presence of competing risks.
+
+**CIF_k(t) = P(T ≤ t, event type = k)**
+
+This differs from the complement of the Kaplan-Meier estimator (1 - KM). Kaplan-Meier treats competing events as censoring, so 1 - KM overestimates the probability of the event of interest when competing risks are present.
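+
+To make the overestimation concrete, here is a minimal from-scratch sketch in plain Python. It is not the scikit-survival implementation; the function name `naive_km_and_cif` and the toy data are made up for illustration. It computes, for one event type, both the naive 1 - KM estimate (other event types treated as censored) and the nonparametric CIF at the end of follow-up.
+
+```python
+from itertools import groupby
+
+
+def naive_km_and_cif(times, events, cause):
+    """Return (1 - KM with other causes censored, CIF) for `cause` at the last time.
+
+    events uses 0 for censored and positive integers for event types.
+    """
+    n_at_risk = len(times)
+    surv_any = 1.0    # probability of being free of every event type so far
+    surv_cause = 1.0  # naive Kaplan-Meier survival for `cause` only
+    cif = 0.0
+    ordered = sorted(zip(times, events))
+    for _, group in groupby(ordered, key=lambda pair: pair[0]):
+        codes = [code for _, code in group]
+        d_cause = codes.count(cause)
+        d_any = sum(1 for code in codes if code != 0)
+        # CIF increment: event-free just before t, times the hazard of `cause` at t
+        cif += surv_any * d_cause / n_at_risk
+        surv_any *= 1.0 - d_any / n_at_risk
+        surv_cause *= 1.0 - d_cause / n_at_risk
+        n_at_risk -= len(codes)
+    return 1.0 - surv_cause, cif
+
+
+# Toy data: 0 = censored, 1 and 2 = competing event types
+times = [1, 2, 3, 4, 5, 6]
+events = [1, 2, 1, 2, 1, 0]
+
+naive_1, cif_1 = naive_km_and_cif(times, events, cause=1)
+naive_2, cif_2 = naive_km_and_cif(times, events, cause=2)
+print(f"Event 1: 1 - KM = {naive_1:.4f}, CIF = {cif_1:.4f}")
+print(f"Event 2: 1 - KM = {naive_2:.4f}, CIF = {cif_2:.4f}")
+```
+
+On this toy data the two naive estimates add up to more than 1, which cannot be a valid set of probabilities for mutually exclusive events, while the two CIFs add up to 5/6, the overall probability of any event.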
+
+## When to Use Competing Risks Analysis
+
+**Use competing risks when:**
+- Multiple mutually exclusive event types exist
+- Occurrence of one event prevents others
+- Need to estimate probability of specific event types
+- Want to understand how covariates affect different event types
+
+**Don't use when:**
+- Only one event type of interest (standard survival analysis)
+- Events are not mutually exclusive (use recurrent events methods)
+- Competing events are extremely rare (can treat as censoring)
+
+## Cumulative Incidence with Competing Risks
+
+### cumulative_incidence_competing_risks Function
+
+Estimates the cumulative incidence function for each event type.
+
+```python
+from sksurv.nonparametric import cumulative_incidence_competing_risks
+from sksurv.datasets import load_bmt
+import matplotlib.pyplot as plt
+
+# Load a dataset with competing risks (bone marrow transplant data)
+_, y = load_bmt()
+# status codes: 0=censored, 1=transplant-related mortality, 2=relapse
+event = y["status"]
+time = y["ftime"]
+
+# Returns the unique time points and an array of shape (n_risks + 1, n_times):
+# row 0 is the cumulative incidence of any event, row k that of event type k
+time_points, cum_incidence = cumulative_incidence_competing_risks(event, time)
+
+# Plot cumulative incidence functions
+plt.figure(figsize=(10, 6))
+plt.step(time_points, cum_incidence[1], where='post', label='Transplant-related mortality', linewidth=2)
+plt.step(time_points, cum_incidence[2], where='post', label='Relapse', linewidth=2)
+plt.xlabel('Time')
+plt.ylabel('Cumulative Incidence')
+plt.title('Competing Risks: Transplant-Related Mortality vs Relapse')
+plt.legend()
+plt.grid(True, alpha=0.3)
+plt.show()
+```
+
+### Interpretation
+
+- **CIF at time t**: Probability of experiencing that specific event by time t
+- **Sum of all CIFs**: Total probability of experiencing any event by time t (all-cause incidence)
+- **1 - sum of CIFs**: Probability of remaining free of every event type up to time t
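+
+These relationships follow from the usual nonparametric (Aalen-Johansen) form of the estimator. As a sketch in standard notation (the symbols are introduced here, not taken from the scikit-survival documentation): let $d_{ki}$ be the number of events of type $k$ and $n_i$ the number of subjects at risk at the distinct time $t_i$, with $\hat{S}(t_0) = 1$.
+
+```latex
+\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{\sum_k d_{ki}}{n_i}\right)
+\qquad
+\widehat{\mathrm{CIF}}_k(t) = \sum_{t_i \le t} \hat{S}(t_{i-1}) \, \frac{d_{ki}}{n_i}
+\qquad
+\sum_k \widehat{\mathrm{CIF}}_k(t) = 1 - \hat{S}(t)
+```
+
+Here $\hat{S}(t)$ is the Kaplan-Meier estimate for the composite "any event" outcome, so one minus the sum of the CIFs is the estimated probability of being free of all event types at time $t$.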
+
+## Data Format for Competing Risks
+
+### Creating Structured Array with Event Types
+
+```python
+import numpy as np
+from sksurv.util import Surv
+
+# Event types: 0 = censored, 1 = event type 1, 2 = event type 2
+event_types = np.array([0, 1, 2, 1, 0, 2, 1])
+times = np.array([10.2, 5.3, 8.1, 3.7, 12.5, 6.8, 4.2])
+
+# Create survival array for the any-event outcome
+# (event=True if an event of any type occurred)
+y = Surv.from_arrays(
+    event=(event_types > 0),  # True if any event
+    time=times
+)
+
+# The boolean event field does not record which event occurred, so keep the
+# integer event_types array to distinguish between event types
+
+### Converting Data with Event Types
+
+```python
+import pandas as pd
+from sksurv.util import Surv
+
+# Assume data has: time, event_type columns
+# event_type: 0=censored, 1=type1, 2=type2, etc.
+
+df = pd.read_csv('competing_risks_data.csv')
+
+# Create survival outcome
+y = Surv.from_arrays(
+    event=(df['event_type'] > 0),
+    time=df['time']
+)
+
+# Keep the integer event codes; the boolean outcome above does not retain them
+event_types = df['event_type'].to_numpy()
+```
+
+## Comparing Cumulative Incidence Between Groups
+
+### Stratified Analysis
+
+```python
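+from sksurv.nonparametric import cumulative_incidence_competing_risks
+import matplotlib.pyplot as plt
+
+# Split by treatment group (assumes X has a 'treatment' column and that the
+# integer event_types and times arrays are aligned with the rows of X)
+mask_treatment = (X['treatment'] == 'A').to_numpy()
+mask_control = (X['treatment'] == 'B').to_numpy()
+
+# Compute CIF for each group; the function takes integer event codes and times
+# and returns an array whose row k is the CIF of event type k (row 0 = any event)
+time_trt, cif_trt = cumulative_incidence_competing_risks(
+    event_types[mask_treatment], times[mask_treatment]
+)
+time_ctl, cif_ctl = cumulative_incidence_competing_risks(
+    event_types[mask_control], times[mask_control]
+)
+
+# Assumes both event types occur in each group
+cif1_trt, cif2_trt = cif_trt[1], cif_trt[2]
+cif1_ctl, cif2_ctl = cif_ctl[1], cif_ctl[2]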
+
+# Plot comparison
+fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
+
+# Event type 1
+ax1.step(time_trt, cif1_trt, where='post', label='Treatment', linewidth=2)
+ax1.step(time_ctl, cif1_ctl, where='post', label='Control', linewidth=2)
+ax1.set_xlabel('Time')
+ax1.set_ylabel('Cumulative Incidence')
+ax1.set_title('Event Type 1')
+ax1.legend()
+ax1.grid(True, alpha=0.3)
+
+# Event type 2
+ax2.step(time_trt, cif2_trt, where='post', label='Treatment', linewidth=2)
+ax2.step(time_ctl, cif2_ctl, where='post', label='Control', linewidth=2)
+ax2.set_xlabel('Time')
+ax2.set_ylabel('Cumulative Incidence')
+ax2.set_title('Event Type 2')
+ax2.legend()
+ax2.grid(True, alpha=0.3)
+
+plt.tight_layout()
+plt.show()
+```
+
+## Statistical Testing with Competing Risks
+
+### Gray's Test
+
+Gray's test compares cumulative incidence functions between groups. It is not available in scikit-survival; a reference implementation is the `cuminc` function in R's `cmprsk` package.
+
+```python
+# Note: Gray's test is not available in scikit-survival.
+# One option is calling R's cmprsk::cuminc through rpy2.
+#
+# A log-rank test on a cause-specific outcome (sksurv.compare.compare_survival,
+# with competing events treated as censored) compares cause-specific hazards,
+# not cumulative incidence functions, so it answers a different question.
+```
+
+## Modeling with Competing Risks
+
+### Approach 1: Cause-Specific Hazard Models
+
+Fit separate Cox models for each event type, treating other event types as censored.
+
+```python
+from sksurv.linear_model import CoxPHSurvivalAnalysis
+from sksurv.util import Surv
+
+# Separate outcome for each event type
+# Event type 1: treat type 2 as censored
+y_event1 = Surv.from_arrays(
+    event=(event_types == 1),
+    time=times
+)
+
+# Event type 2: treat type 1 as censored
+y_event2 = Surv.from_arrays(
+    event=(event_types == 2),
+    time=times
+)
+
+# Fit cause-specific models
+cox_event1 = CoxPHSurvivalAnalysis()
+cox_event1.fit(X, y_event1)
+
+cox_event2 = CoxPHSurvivalAnalysis()
+cox_event2.fit(X, y_event2)
+
+# Interpret coefficients for each event type
+print("Event Type 1 (e.g., Relapse):")
+print(cox_event1.coef_)
+
+print("\nEvent Type 2 (e.g., Death):")
+print(cox_event2.coef_)
+```
+
+**Interpretation:**
+- Separate model for each competing event
+- Coefficients show effect on cause-specific hazard for that event type
+- A covariate may increase risk for one event type but decrease for another
+
+### Approach 2: Fine-Gray Sub-distribution Hazard Model
+
+Models the cumulative incidence directly through the sub-distribution hazard. It is not available in scikit-survival, but other packages provide it.
+
+```python
+# Note: the Fine-Gray model is not available in scikit-survival.
+# One option is calling R through rpy2, e.g. cmprsk::crr or survival::finegray.
+```
+
+## Practical Example: Complete Competing Risks Analysis
+
+```python
+import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+from sksurv.nonparametric import cumulative_incidence_competing_risks
+from sksurv.linear_model import CoxPHSurvivalAnalysis
+from sksurv.util import Surv
+
+# Simulate competing risks data
+np.random.seed(42)
+n = 200
+
+# Create features
+age = np.random.normal(60, 10, n)
+treatment = np.random.choice(['A', 'B'], n)
+
+# Simulate event times and types
+# Event types: 0=censored, 1=relapse, 2=death
+times = np.random.exponential(100, n)
+event_types = np.zeros(n, dtype=int)
+
+# Age has no effect in this simulation; treatment only changes which event
+# type occurs (relapse is less likely under treatment A than under B)
+for i in range(n):
+    if times[i] < 150:  # Event occurred
+        # Probability of each event type
+        p_relapse = 0.6 if treatment[i] == 'B' else 0.4
+        event_types[i] = 1 if np.random.rand() < p_relapse else 2
+    else:
+        times[i] = 150  # Censored at study end
+
+# Create DataFrame
+df = pd.DataFrame({
+    'time': times,
+    'event_type': event_types,
+    'age': age,
+    'treatment': treatment
+})
+
+# Encode treatment
+df['treatment_A'] = (df['treatment'] == 'A').astype(int)
+
+# 1. OVERALL CUMULATIVE INCIDENCE
+print("=" * 60)
+print("OVERALL CUMULATIVE INCIDENCE")
+print("=" * 60)
+
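+# The function takes integer event codes (0 = censored) and event times
+event_all = df['event_type'].to_numpy()
+time_all = df['time'].to_numpy()
+# Row 0 is the any-event incidence; rows 1 and 2 are relapse and death
+time_points, cum_inc = cumulative_incidence_competing_risks(event_all, time_all)
+cif_relapse, cif_death = cum_inc[1], cum_inc[2]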
+
+plt.figure(figsize=(10, 6))
+plt.step(time_points, cif_relapse, where='post', label='Relapse', linewidth=2)
+plt.step(time_points, cif_death, where='post', label='Death', linewidth=2)
+plt.xlabel('Time (days)')
+plt.ylabel('Cumulative Incidence')
+plt.title('Competing Risks: Relapse vs Death')
+plt.legend()
+plt.grid(True, alpha=0.3)
+plt.show()
+
+print(f"Relapse incidence at end of follow-up (150 days): {cif_relapse[-1]:.2%}")
+print(f"Death incidence at end of follow-up (150 days): {cif_death[-1]:.2%}")
+
+# 2. STRATIFIED BY TREATMENT
+print("\n" + "=" * 60)
+print("CUMULATIVE INCIDENCE BY TREATMENT")
+print("=" * 60)
+
+for trt in ['A', 'B']:
+    mask = df['treatment'] == trt
+    # Pass integer event codes and times; row k of the result is event type k
+    time_trt, cum_inc_trt = cumulative_incidence_competing_risks(
+        df.loc[mask, 'event_type'].to_numpy(),
+        df.loc[mask, 'time'].to_numpy()
+    )
+    cif1_trt, cif2_trt = cum_inc_trt[1], cum_inc_trt[2]
+    print(f"\nTreatment {trt}:")
+    print(f"  Relapse at end of follow-up: {cif1_trt[-1]:.2%}")
+    print(f"  Death at end of follow-up: {cif2_trt[-1]:.2%}")
+
+# 3. CAUSE-SPECIFIC MODELS
+print("\n" + "=" * 60)
+print("CAUSE-SPECIFIC HAZARD MODELS")
+print("=" * 60)
+
+X = df[['age', 'treatment_A']]
+
+# Model for relapse (event type 1)
+y_relapse = Surv.from_arrays(
+    event=(df['event_type'] == 1),
+    time=df['time']
+)
+cox_relapse = CoxPHSurvivalAnalysis()
+cox_relapse.fit(X, y_relapse)
+
+print("\nRelapse Model:")
+print(f"  Age:        HR = {np.exp(cox_relapse.coef_[0]):.3f}")
+print(f"  Treatment A: HR = {np.exp(cox_relapse.coef_[1]):.3f}")
+
+# Model for death (event type 2)
+y_death = Surv.from_arrays(
+    event=(df['event_type'] == 2),
+    time=df['time']
+)
+cox_death = CoxPHSurvivalAnalysis()
+cox_death.fit(X, y_death)
+
+print("\nDeath Model:")
+print(f"  Age:        HR = {np.exp(cox_death.coef_[0]):.3f}")
+print(f"  Treatment A: HR = {np.exp(cox_death.coef_[1]):.3f}")
+
+print("\n" + "=" * 60)
+```
+
+## Important Considerations
+
+### Censoring in Competing Risks
+
+- **Administrative censoring**: Subject still at risk at end of study
+- **Loss to follow-up**: Subject leaves study before event
+- **Competing event**: Other event occurred - NOT censored for CIF, but censored for cause-specific models
+
+### Choosing Between Cause-Specific and Sub-distribution Models
+
+**Cause-Specific Hazard Models:**
+- Easier to interpret
+- Direct effect on hazard rate
+- Better for understanding etiology
+- Can fit with scikit-survival
+
+**Fine-Gray Sub-distribution Models:**
+- Models cumulative incidence directly
+- Better for prediction and risk stratification
+- More appropriate for clinical decision-making
+- Requires other packages
+
+### Common Mistakes
+
+**Mistake 1**: Using Kaplan-Meier to estimate event-specific probabilities
+- **Wrong**: Kaplan-Meier for event type 1, treating type 2 as censored
+- **Correct**: Cumulative incidence function accounting for competing risks
+
+**Mistake 2**: Ignoring competing risks when they're substantial
+- Rule of thumb: if the competing event rate exceeds roughly 10-20%, use competing risks methods
+
+**Mistake 3**: Confusing cause-specific and sub-distribution hazards
+- They answer different questions
+- Use appropriate model for your research question
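+
+The difference between the two hazards lies in who counts as "at risk". As a sketch in standard notation (the symbols $T$ for the time of the first event and $D$ for its type are introduced here for illustration), the cause-specific hazard conditions on subjects who have had no event yet, while the sub-distribution hazard also keeps subjects who already had a competing event in the risk set:
+
+```latex
+\lambda_k^{\mathrm{cs}}(t) = \lim_{\Delta t \to 0}
+  \frac{P(t \le T < t + \Delta t,\ D = k \mid T \ge t)}{\Delta t}
+
+\lambda_k^{\mathrm{sd}}(t) = \lim_{\Delta t \to 0}
+  \frac{P\bigl(t \le T < t + \Delta t,\ D = k \mid T \ge t \ \text{or}\ (T < t,\ D \ne k)\bigr)}{\Delta t}
+
+\mathrm{CIF}_k(t) = 1 - \exp\left(-\int_0^t \lambda_k^{\mathrm{sd}}(s)\, ds\right)
+```
+
+The last identity is why a covariate effect on the sub-distribution hazard translates directly into an effect on the cumulative incidence, whereas the CIF of one event type depends on the cause-specific hazards of all event types together.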
+
+## Summary
+
+**Key Functions:**
+- `cumulative_incidence_competing_risks`: Estimate CIF for each event type
+- Fit separate Cox models for cause-specific hazards
+- Use stratified analysis to compare groups
+
+**Best Practices:**
+1. Always plot cumulative incidence functions
+2. Report both event-specific and overall incidence
+3. Use cause-specific models in scikit-survival
+4. Consider Fine-Gray models (other packages) for prediction
+5. Be explicit about which events are competing vs censored