24 KiB
Outcome Analysis and Statistical Methods Guide
Overview
Rigorous outcome analysis is essential for clinical decision support documents. This guide covers survival analysis, response assessment, statistical testing, and data visualization for patient cohort analyses and treatment evaluation.
Survival Analysis
Kaplan-Meier Method
Overview
- Non-parametric estimator of survival function from time-to-event data
- Handles censored observations (patients alive at last follow-up)
- Provides survival probability at each time point
- Generates characteristic step-function survival curves
Key Concepts
Censoring
- Right censoring: Most common - patient alive at last follow-up or study end
- Left censoring: Rare in clinical studies
- Interval censoring: Event occurred between two assessment times
- Informative vs non-informative: Censoring should be independent of outcome
Survival Function S(t)
- S(t) = Probability of surviving beyond time t
- S(0) = 1.0 (100% alive at time zero)
- S(t) decreases as time increases
- Step decreases at each event time
Median Survival
- Time point where S(t) = 0.50
- 50% of patients alive, 50% have had event
- Reported with 95% confidence interval
- "Not reached (NR)" if fewer than 50% events
Survival Rates at Fixed Time Points
- 1-year survival rate, 2-year survival rate, 5-year survival rate
- Read from K-M curve at specific time point
- Report with 95% CI: S(t) ± 1.96 × SE
Calculation Example
Time Events At Risk Survival Probability
0 0 100 1.000
3 2 100 0.980 (98/100)
5 1 95 0.970 (97/100 × 95/98)
8 3 87 0.936 (94/100 × 92/95 × 84/87)
...
Log-Rank Test
Purpose: Compare survival curves between two or more groups
Null Hypothesis: No difference in survival distributions between groups
Test Statistic
- Compares observed vs expected events in each group at each time point
- Weights all time points equally
- Follows chi-square distribution with df = k-1 (k groups)
Reporting
- Chi-square statistic, degrees of freedom, p-value
- Example: χ² = 6.82, df = 1, p = 0.009
- Interpretation: Significant difference in survival curves
Assumptions
- Censoring is non-informative and independent
- Proportional hazards (constant HR over time)
- If non-proportional, consider time-varying effects
Alternatives for Non-Proportional Hazards
- Gehan-Breslow test: Weights early events more heavily
- Peto-Peto test: Modifies Gehan-Breslow weighting
- Restricted mean survival time (RMST): Difference in area under K-M curve
Cox Proportional Hazards Regression
Purpose: Multivariable survival analysis, estimate hazard ratios adjusting for covariates
Model: h(t|X) = h₀(t) × exp(β₁X₁ + β₂X₂ + ... + βₚXₚ)
- h(t|X): Hazard rate for individual with covariates X
- h₀(t): Baseline hazard function (unspecified)
- exp(β): Hazard ratio for one-unit change in covariate
Hazard Ratio Interpretation
- HR = 1.0: No effect
- HR > 1.0: Increased risk (harmful)
- HR < 1.0: Decreased risk (beneficial)
- HR = 0.50: 50% reduction in hazard (risk of event)
Example Output
Variable HR 95% CI p-value
Treatment (B vs A) 0.62 0.43-0.89 0.010
Age (per 10 years) 1.15 1.02-1.30 0.021
ECOG PS (2 vs 0-1) 1.85 1.21-2.83 0.004
Biomarker+ (vs -) 0.71 0.48-1.05 0.089
Proportional Hazards Assumption
- Hazard ratio constant over time
- Test: Schoenfeld residuals, log-minus-log plots
- Violation: Time-varying effects, consider stratification or time-dependent covariates
Multivariable vs Univariable
- Univariable: One covariate at a time, unadjusted HRs
- Multivariable: Multiple covariates simultaneously, adjusted HRs
- Report both: Univariable for all variables, multivariable for final model
Model Selection
- Forward selection: Start with empty model, add significant variables
- Backward elimination: Start with all variables, remove non-significant
- Clinical judgment: Include known prognostic factors regardless of p-value
- Parsimony: Avoid overfitting, rule of thumb 1 variable per 10-15 events
Response Assessment
RECIST v1.1 (Response Evaluation Criteria in Solid Tumors)
Target Lesions
- Select up to 5 lesions total (maximum 2 per organ)
- Measurable: ≥10 mm longest diameter (≥15 mm for lymph nodes short axis)
- Sum of longest diameters (SLD) at baseline
Response Categories
Complete Response (CR)
- Disappearance of all target and non-target lesions
- Lymph nodes must regress to <10 mm short axis
- Confirmation required at ≥4 weeks
Partial Response (PR)
- ≥30% decrease in SLD from baseline
- No new lesions or unequivocal progression of non-target lesions
- Confirmation required at ≥4 weeks
Stable Disease (SD)
- Neither PR nor PD criteria met
- Minimum duration typically 6-8 weeks from baseline
Progressive Disease (PD)
- ≥20% increase in SLD AND ≥5 mm absolute increase from smallest SLD (nadir)
- OR appearance of new lesions
- OR unequivocal progression of non-target lesions
Example Calculation
Baseline SLD: 80 mm (4 target lesions)
Week 6 SLD: 52 mm
Percent change: (52 - 80)/80 × 100% = -35%
Classification: Partial Response (≥30% decrease)
Week 12 SLD: 48 mm (nadir)
Week 18 SLD: 62 mm
Percent change from nadir: (62 - 48)/48 × 100% = +29%
Absolute change: 62 - 48 = 14 mm
Classification: Progressive Disease (>20% AND ≥5 mm increase)
iRECIST (Immune RECIST)
Purpose: Account for atypical response patterns with immunotherapy
Modifications from RECIST v1.1
iUPD (Immune Unconfirmed Progressive Disease)
- Initial increase in tumor burden or new lesions
- Requires confirmation at next assessment (≥4 weeks later)
- Continue treatment if clinically stable
iCPD (Immune Confirmed Progressive Disease)
- Confirmed progression at repeat imaging
- Discontinue immunotherapy
Pseudoprogression
- Initial apparent progression followed by response
- Mechanism: Immune cell infiltration increases tumor size
- Incidence: 5-10% of patients on immunotherapy
- Management: Continue treatment if patient clinically stable
New Lesions
- Record size and location but continue treatment
- Do not automatically classify as PD
- Confirm progression if new lesions grow or additional new lesions appear
Other Response Criteria
Lugano Classification (Lymphoma)
- PET-based: Deauville 5-point scale
- Score 1-3: Negative (metabolic CR)
- Score 4-5: Positive (residual disease)
- CT-based: If PET not available
- Bone marrow: Required for staging in some lymphomas
RANO (Response Assessment in Neuro-Oncology)
- Glioblastoma-specific: Accounts for pseudoprogression with radiation/temozolomide
- Enhancing disease: Bidimensional measurements (product of perpendicular diameters)
- Non-enhancing disease: FLAIR changes assessed separately
- Corticosteroid dose: Must document, increase may indicate progression
mRECIST (Modified RECIST for HCC)
- Viable tumor: Enhancing portion only (arterial phase enhancement)
- Necrosis: Non-enhancing areas excluded from measurements
- Application: Hepatocellular carcinoma with arterial enhancement
Outcome Metrics
Efficacy Endpoints
Overall Survival (OS)
- Definition: Time from randomization/treatment start to death from any cause
- Advantages: Objective, not subject to assessment bias, regulatory gold standard
- Disadvantages: Requires long follow-up, affected by subsequent therapies
- Censoring: Last known alive date
- Analysis: Kaplan-Meier, log-rank test, Cox regression
Progression-Free Survival (PFS)
- Definition: Time from randomization to progression (RECIST) or death
- Advantages: Earlier readout than OS, direct treatment effect
- Disadvantages: Requires regular imaging, subject to assessment timing
- Censoring: Last tumor assessment without progression
- Sensitivity Analysis: Assess impact of censoring assumptions
Objective Response Rate (ORR)
- Definition: Proportion of patients achieving CR or PR (best response)
- Denominator: Evaluable patients (baseline measurable disease)
- Reporting: Percentage with 95% CI (exact binomial method)
- Duration: Time from first response to progression (DOR)
- Advantage: Binary endpoint, no censoring complications
Disease Control Rate (DCR)
- Definition: CR + PR + SD (stable disease ≥6-8 weeks)
- Less Stringent: Captures clinical benefit beyond objective response
- Reporting: Percentage with 95% CI
Duration of Response (DOR)
- Definition: Time from first CR or PR to progression (among responders only)
- Population: Subset analysis of responders
- Analysis: Kaplan-Meier among responders
- Reporting: Median DOR with 95% CI
Time to Treatment Failure (TTF)
- Definition: Time from start to discontinuation for any reason (progression, toxicity, death, patient choice)
- Advantage: Reflects real-world treatment duration
- Components: PFS + toxicity-related discontinuations
Safety Endpoints
Adverse Events (CTCAE v5.0)
Grading
- Grade 1: Mild, asymptomatic or mild symptoms, clinical intervention not indicated
- Grade 2: Moderate, minimal/local intervention indicated, age-appropriate ADL limitation
- Grade 3: Severe or medically significant, not immediately life-threatening, hospitalization/prolongation indicated, disabling, self-care ADL limitation
- Grade 4: Life-threatening consequences, urgent intervention indicated
- Grade 5: Death related to adverse event
Reporting Standards
Adverse Event Summary Table:
AE Term (MedDRA) Any Grade, n (%) Grade 3-4, n (%) Grade 5, n (%)
Trt A Trt B Trt A Trt B Trt A Trt B
─────────────────────────────────────────────────────────────────────────
Hematologic
Anemia 45 (90%) 42 (84%) 8 (16%) 6 (12%) 0 0
Neutropenia 35 (70%) 38 (76%) 15 (30%) 18 (36%) 0 0
Thrombocytopenia 28 (56%) 25 (50%) 6 (12%) 4 (8%) 0 0
Febrile neutropenia 4 (8%) 6 (12%) 4 (8%) 6 (12%) 0 0
Gastrointestinal
Nausea 42 (84%) 40 (80%) 2 (4%) 1 (2%) 0 0
Diarrhea 31 (62%) 28 (56%) 5 (10%) 3 (6%) 0 0
Mucositis 18 (36%) 15 (30%) 3 (6%) 2 (4%) 0 0
Any AE 50 (100%) 50 (100%) 38 (76%) 35 (70%) 1 (2%) 0
Serious Adverse Events (SAEs)
- SAE incidence and type
- Relationship to treatment (related vs unrelated)
- Outcome (resolved, ongoing, fatal)
- Causality assessment (definite, probable, possible, unlikely, unrelated)
Treatment Modifications
- Dose reductions: n (%), reason
- Dose delays: n (%), duration
- Discontinuations: n (%), reason (toxicity vs progression vs other)
- Relative dose intensity: (actual dose delivered / planned dose) × 100%
Statistical Analysis Methods
Comparing Continuous Outcomes
Independent Samples t-test
- Application: Compare means between two independent groups (normally distributed)
- Assumptions: Normal distribution, equal variances (or use Welch's t-test)
- Reporting: Mean ± SD for each group, mean difference (95% CI), t-statistic, df, p-value
- Example: Mean age 62.3 ± 8.4 vs 58.7 ± 9.1 years, difference 3.6 years (95% CI 0.2-7.0, p=0.038)
Mann-Whitney U Test (Wilcoxon Rank-Sum)
- Application: Compare medians between two groups (non-normal distribution)
- Non-parametric: No distributional assumptions
- Reporting: Median [IQR] for each group, median difference, U-statistic, p-value
- Example: Median time to response 6.2 [4.1-8.3] vs 8.5 [5.9-11.2] weeks, p=0.042
ANOVA (Analysis of Variance)
- Application: Compare means across three or more groups
- Output: F-statistic, p-value (overall test)
- Post-hoc: If significant, pairwise comparisons with Tukey or Bonferroni correction
- Example: Treatment effect varied by biomarker subgroup (F=4.32, df=2, p=0.016)
Comparing Categorical Outcomes
Chi-Square Test for Independence
- Application: Compare proportions between two or more groups
- Assumptions: Expected count ≥5 in at least 80% of cells
- Reporting: n (%) for each cell, χ², df, p-value
- Example: ORR 45% vs 30%, χ²=6.21, df=1, p=0.013
Fisher's Exact Test
- Application: 2×2 tables when expected count <5
- Exact p-value: No large-sample approximation
- Two-sided vs one-sided: Typically report two-sided
- Example: SAE rate 3/20 (15%) vs 8/22 (36%), Fisher's exact p=0.083
McNemar's Test
- Application: Paired categorical data (before/after, matched pairs)
- Example: Response before vs after treatment switch in same patients
Sample Size and Power
Power Analysis Components
- Alpha (α): Type I error rate, typically 0.05 (two-sided)
- Beta (β): Type II error rate, typically 0.10 or 0.20
- Power: 1 - β, typically 0.80 or 0.90 (80-90% power)
- Effect size: Expected difference (HR, mean difference, proportion difference)
- Sample size: Number of patients or events needed
Survival Study Sample Size
- Events-driven: Need sufficient events (deaths, progressions)
- Rule of thumb: 80% power requires approximately 165 events for HR=0.70 (α=0.05, two-sided)
- Accrual time + follow-up time determines calendar time
Response Rate Study
Example: Detect ORR difference 45% vs 30% (15 percentage points)
- α = 0.05 (two-sided)
- Power = 0.80
- Sample size: n = 94 per group (188 total)
- With 10% dropout: n = 105 per group (210 total)
Data Visualization
Survival Curves
Kaplan-Meier Plot Best Practices
# Key elements for publication-quality survival curve
1. X-axis: Time (months or years), starts at 0
2. Y-axis: Survival probability (0 to 1.0 or 0% to 100%)
3. Step function: Survival curve with steps at event times
4. 95% CI bands: Shaded region around survival curve (optional but recommended)
5. Number at risk table: Below x-axis showing n at risk at time intervals
6. Censoring marks: Vertical tick marks (|) at censored observations
7. Legend: Clearly identify each curve
8. Log-rank p-value: Prominently displayed
9. Median survival: Horizontal line at 0.50, labeled
10. Follow-up: Median follow-up time reported
Number at Risk Table Format
Number at risk
Group A 50 42 35 28 18 10 5
Group B 48 38 29 19 12 6 2
Time 0 6 12 18 24 30 36 (months)
Hazard Ratio Annotation
On plot: HR 0.62 (95% CI 0.43-0.89), p=0.010
Or in caption: Log-rank test p=0.010; Cox model HR=0.62 (95% CI 0.43-0.89)
Waterfall Plots
Purpose: Visualize individual patient responses to treatment
Construction
- X-axis: Individual patients (anonymized patient IDs)
- Y-axis: Best % change from baseline tumor burden
- Bars: Vertical bars, one per patient
- Positive values: Tumor growth
- Negative values: Tumor shrinkage
- Ordering: Sorted from best response (left) to worst (right)
- Color coding:
- Green/blue: CR or PR (≥30% decrease)
- Yellow: SD (-30% to +20%)
- Red: PD (≥20% increase)
- Reference lines: Horizontal lines at +20% (PD), -30% (PR)
- Annotations: Biomarker status, response duration (symbols)
Example Annotations
■ = Biomarker-positive
○ = Biomarker-negative
* = Ongoing response
† = Progressed
Forest Plots
Purpose: Display subgroup analyses with hazard ratios and confidence intervals
Construction
- Y-axis: Subgroup categories
- X-axis: Hazard ratio (log scale), vertical line at HR=1.0
- Points: HR estimate for each subgroup
- Horizontal lines: 95% confidence interval
- Square size: Proportional to sample size or precision
- Overall effect: Diamond at bottom, width represents 95% CI
Subgroups to Display
Subgroup n HR (95% CI) Favors A Favors B
──────────────────────────────────────────────────────────────────────────
Overall 300 0.65 (0.48-0.88) ●────┤
Age
<65 years 180 0.58 (0.39-0.86) ●────┤
≥65 years 120 0.78 (0.49-1.24) ●──────┤
Sex
Male 175 0.62 (0.43-0.90) ●────┤
Female 125 0.70 (0.44-1.12) ●─────┤
Biomarker Status
Positive 140 0.45 (0.28-0.72) ●───┤
Negative 160 0.89 (0.59-1.34) ●──────┤
p-interaction=0.041
0.25 0.5 1.0 2.0
Hazard Ratio
Interaction Testing
- Test whether treatment effect differs across subgroups
- p-interaction <0.05 suggests heterogeneity
- Pre-specify subgroups to avoid data mining
Spider Plots
Purpose: Display longitudinal tumor burden changes over time for individual patients
Construction
- X-axis: Time from treatment start (weeks or months)
- Y-axis: % change from baseline tumor burden
- Lines: One line per patient connecting assessments
- Color coding: By response category or biomarker status
- Reference lines: 0% (no change), +20% (PD threshold), -30% (PR threshold)
Clinical Insights
- Identify delayed responders (initial SD then PR)
- Detect early progression (rapid upward trajectory)
- Assess depth of response (maximum tumor shrinkage)
- Duration visualization (when lines cross PD threshold)
Swimmer Plots
Purpose: Display treatment duration and response for individual patients
Construction
- X-axis: Time from treatment start (weeks or months)
- Y-axis: Individual patients (one row per patient)
- Bars: Horizontal bars representing treatment duration
- Symbols:
- ● Start of treatment
- ▼ Ongoing treatment (arrow)
- ■ Progressive disease (end of bar)
- ◆ Death
- | Dose modification
- Color: Response status (CR=green, PR=blue, SD=yellow, PD=red)
Example
Patient ID |0 3 6 9 12 15 18 21 24 months
──────────────|──────────────────────────────────────────
Pt-001 ●═══PR═══════════|════════PR══════════▼
Pt-002 ●═══PR═══════════════PD■
Pt-003 ●══════SD══════════PD■
Pt-004 ●PR══════════════════════════════════PR▼
...
Confidence Intervals
Interpretation
95% Confidence Interval
- Range of plausible values for true population parameter
- If study repeated 100 times, 95 of the 95% CIs would contain true value
- Not: 95% probability true value within this interval (frequentist, not Bayesian)
Relationship to p-value
- If 95% CI excludes null value (HR=1.0, difference=0), p<0.05
- If 95% CI includes null value, p≥0.05
- CI provides more information: magnitude and precision of effect
Precision
- Narrow CI: High precision, large sample size
- Wide CI: Low precision, small sample size or high variability
- Example: HR 0.65 (95% CI 0.62-0.68) very precise; HR 0.65 (0.30-1.40) imprecise
Calculation Methods
Hazard Ratio CI
- From Cox regression output
- Standard error of log(HR) → exp(log(HR) ± 1.96×SE)
- Example: HR=0.62, SE(logHR)=0.185 → 95% CI (0.43, 0.89)
Survival Rate CI (Greenwood Formula)
- SE(S(t)) = S(t) × sqrt(Σ[d_i / (n_i × (n_i - d_i))])
- 95% CI: S(t) ± 1.96 × SE(S(t))
- Can use complementary log-log transformation for better properties
Proportion CI (Exact Binomial)
- For ORR, DCR: Use exact method (Clopper-Pearson) for small samples
- Wilson score interval: Better properties than normal approximation
- Example: 12/30 responses → ORR 40% (95% CI 22.7-59.4%)
Censoring and Missing Data
Types of Censoring
Right Censoring
- End of study: Patient alive at study termination (administrative censoring)
- Loss to follow-up: Patient stops attending visits
- Withdrawal: Patient withdraws consent
- Competing risk: Death from unrelated cause (in disease-specific survival)
Handling Censoring
- Assumption: Non-informative - censoring independent of event probability
- Sensitivity Analysis: Assess impact if assumption violated
- Best case: All censored patients never progress
- Worst case: All censored patients progress immediately after censoring
- Actual result should fall between best/worst case
Missing Data
Mechanisms
- MCAR (Missing Completely at Random): Missingness unrelated to any variable
- MAR (Missing at Random): Missingness related to observed but not unobserved variables
- NMAR (Not Missing at Random): Missingness related to the missing value itself
Handling Strategies
- Complete case analysis: Exclude patients with missing data (biased if not MCAR)
- Multiple imputation: Generate multiple plausible datasets, analyze each, pool results
- Maximum likelihood: Estimate parameters using all available data
- Sensitivity analysis: Assess robustness to missing data assumptions
Response Assessment Missing Data
- Unevaluable for response: Baseline measurable disease but post-baseline assessment missing
- Exclude from ORR denominator or count as non-responder (sensitivity analysis)
- PFS censoring: Last adequate tumor assessment date if later assessments missing
Reporting Standards
CONSORT Statement (RCTs)
Flow Diagram
- Assessed for eligibility (n=)
- Randomized (n=)
- Allocated to intervention (n=)
- Lost to follow-up (n=, reasons)
- Discontinued intervention (n=, reasons)
- Analyzed (n=)
Baseline Table
- Demographics and clinical characteristics
- Baseline prognostic factors
- Show balance between arms
Outcomes Table
- Primary endpoint results with CI and p-value
- Secondary endpoints
- Safety summary
STROBE Statement (Observational Studies)
Study Design: Cohort, case-control, or cross-sectional
Participants: Eligibility, sources, selection methods, sample size
Variables: Clearly define outcomes, exposures, predictors, confounders
Statistical Methods: Describe all methods, handling of missing data, sensitivity analyses
Results: Participant flow, descriptive data, outcome data, main results, other analyses
Reproducible Research Practices
Statistical Analysis Plan (SAP)
- Pre-specify all analyses before data lock
- Primary and secondary endpoints
- Analysis populations (ITT, per-protocol, safety)
- Statistical tests and models
- Subgroup analyses (pre-specified)
- Interim analyses (if planned)
- Multiple testing procedures
Transparency
- Report all pre-specified analyses
- Distinguish pre-specified from post-hoc exploratory
- Report both positive and negative results
- Provide access to anonymized individual patient data (when possible)
Software and Tools
R Packages for Survival Analysis
- survival: Core package (Surv, survfit, coxph, survdiff)
- survminer: Publication-ready Kaplan-Meier plots (ggsurvplot)
- rms: Regression modeling strategies
- flexsurv: Flexible parametric survival models
Python Libraries
- lifelines: Kaplan-Meier, Cox regression, survival curves
- scikit-survival: Machine learning for survival analysis
- matplotlib: Custom survival curve plotting
Statistical Software
- R: Most comprehensive for survival analysis
- Stata: Medical statistics, good for epidemiology
- SAS: Industry standard for clinical trials
- GraphPad Prism: User-friendly for basic analyses
- SPSS: Point-and-click interface, limited survival features