# Causal Inference & Root Cause Analysis Methodology

## Root Cause Analysis Workflow

Copy this checklist and track your progress:

```
Root Cause Analysis Progress:
- [ ] Step 1: Generate hypotheses using structured techniques
- [ ] Step 2: Build causal model and distinguish cause types
- [ ] Step 3: Gather evidence and assess temporal sequence
- [ ] Step 4: Test causality with Bradford Hill criteria
- [ ] Step 5: Verify root cause and check coherence
```

**Step 1: Generate hypotheses using structured techniques**

Use 5 Whys, Fishbone diagrams, timeline analysis, or differential diagnosis to systematically generate potential causes. See [Hypothesis Generation Techniques](#hypothesis-generation-techniques) for detailed methods.

**Step 2: Build causal model and distinguish cause types**

Map causal chains to distinguish root causes from proximate causes, symptoms, and confounders. See [Causal Model Building](#causal-model-building) for cause type definitions and chain mapping.

**Step 3: Gather evidence and assess temporal sequence**

Collect evidence for each hypothesis and verify temporal relationships (cause must precede effect). See [Evidence Assessment Methods](#evidence-assessment-methods) for essential tests and evidence hierarchy.

**Step 4: Test causality with Bradford Hill criteria**

Score hypotheses using the 9 Bradford Hill criteria (strength, consistency, specificity, temporality, dose-response, plausibility, coherence, experiment, analogy). See [Bradford Hill Criteria](#bradford-hill-criteria) for the scoring rubric.

**Step 5: Verify root cause and check coherence**

Ensure the identified root cause has strong evidence, fits with known facts, and addresses systemic issues. See [Quality Checklist](#quality-checklist) and [Worked Example](#worked-example-website-conversion-drop) for validation techniques.

---

## Hypothesis Generation Techniques

### 5 Whys (Trace Back to Root)

Start with the effect and ask "why" repeatedly until you reach the root cause.

**Process:**

```
Effect: {Observed problem}
Why? → {Proximate cause 1}
Why? → {Proximate cause 2}
Why? → {Intermediate cause}
Why? → {Deeper cause}
Why? → {Root cause}
```

**Example:**

```
Effect: Server crashed
Why? → Out of memory
Why? → Memory leak in new code
Why? → Connection pool not releasing connections
Why? → Error handling missing in new feature
Why? → Code review process didn't catch this edge case (ROOT CAUSE)
```

**When to stop**: When you reach a cause that, if fixed, would prevent recurrence.

**Pitfalls to avoid**:
- Stopping too early at proximate causes
- Following only one path (should explore multiple branches)
- Accepting vague answers ("because it broke")

---

### Fishbone Diagram (Categorize Causes)

Organize potential causes by category to ensure comprehensive coverage.

**Categories (6 Ms)**:

```
   Methods        Machines       Materials
      |               |               |
      |               |               |
      └───────────────┴───────────────┴─────────→ Effect
      |               |               |
      |               |               |
   Manpower       Measurement    Environment
```

**Adapted for software/product**:
- **People**: Skills, knowledge, communication, errors
- **Process**: Procedures, workflows, standards, review
- **Technology**: Tools, systems, infrastructure, dependencies
- **Environment**: External factors, timing, context

**Example (API outage)**:
- **People**: On-call engineer inexperienced with database
- **Process**: No rollback procedure documented
- **Technology**: Database connection pooling misconfigured
- **Environment**: Traffic spike coincided with deploy
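For teams that track hypotheses in scripts or tooling, the fishbone categories can be captured as a plain data structure so coverage gaps stay visible. The sketch below is illustrative only (the `hypothesis_backlog` helper is not from this document); it uses the adapted software/product categories and the API-outage entries above.

```python
from typing import Dict, List

# Fishbone categories adapted for software/product, populated with the
# API-outage example above. Entries are illustrative.
fishbone: Dict[str, List[str]] = {
    "People": ["On-call engineer inexperienced with database"],
    "Process": ["No rollback procedure documented"],
    "Technology": ["Database connection pooling misconfigured"],
    "Environment": ["Traffic spike coincided with deploy"],
}

def hypothesis_backlog(diagram: Dict[str, List[str]]) -> List[str]:
    """Flatten a fishbone diagram into a tagged hypothesis list,
    flagging empty categories so coverage gaps are obvious."""
    backlog: List[str] = []
    for category, causes in diagram.items():
        if not causes:
            backlog.append(f"[{category}] (no hypotheses yet -- coverage gap)")
        for cause in causes:
            backlog.append(f"[{category}] {cause}")
    return backlog

for item in hypothesis_backlog(fishbone):
    print(item)
```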
---

### Timeline Analysis (What Changed?)

Map the events leading up to the effect to identify temporal correlations.

**Process**:
1. Establish baseline period (normal operation)
2. Mark when the effect was first observed
3. List all changes in the days/hours leading up to the effect
4. Identify changes that temporally precede the effect

**Example**:

```
T-7 days: Normal operation (baseline)
T-3 days: Deployed new payment processor integration
T-2 days: Traffic increased 20%
T-1 day: First error logged (dismissed as fluke)
T-0 (14:00): Conversion rate dropped 30%
T+1 hour: Alert fired, investigation started
```

**Look for**: Changes, anomalies, or events right before the effect (temporal precedence).

**Red flags**:
- Multiple changes at once (hard to isolate cause)
- Long lag between change and effect (harder to connect)
- No changes before effect (may be external factor or accumulated technical debt)

---

### Differential Diagnosis

List all conditions/causes that could produce the observed symptoms.

**Process**:
1. List all possible causes (be comprehensive)
2. For each cause, list the symptoms it would produce
3. Compare predicted symptoms to observed symptoms
4. Eliminate causes that don't match

**Example (API returning 500 errors)**:

| Cause | Predicted Symptoms | Observed? |
|-------|-------------------|-----------|
| Code bug (logic error) | Specific endpoints fail, reproducible | ✓ Yes |
| Resource exhaustion (memory) | All endpoints slow, CPU/memory high | ✗ CPU normal |
| Dependency failure (database) | All database queries fail, connection errors | ✗ DB responsive |
| Configuration error | Service won't start or immediate failure | ✗ Started fine |
| DDoS attack | High traffic, distributed sources | ✗ Traffic normal |

**Conclusion**: Code bug most likely (matches symptoms, others ruled out).

---

## Causal Model Building

### Distinguish Cause Types

#### 1. Root Cause
- **Definition**: Fundamental underlying issue
- **Test**: If fixed, problem doesn't recur
- **Usually**: Structural, procedural, or systemic
- **Example**: "No code review process for infrastructure changes"

#### 2. Proximate Cause
- **Definition**: Immediate trigger
- **Test**: Directly precedes effect
- **Often**: Symptom of a deeper root cause
- **Example**: "Deployed buggy config file"

#### 3. Contributing Factor
- **Definition**: Makes effect more likely or severe
- **Test**: Not sufficient alone, but amplifies
- **Often**: Context or conditions
- **Example**: "High traffic amplified impact of bug"

#### 4. Coincidence
- **Definition**: Happened around the same time, no causal relationship
- **Test**: No mechanism, doesn't pass counterfactual
- **Example**: "Marketing campaign launched same day" (but unrelated to technical issue)

---

### Map Causal Chains

#### Linear Chain

```
Root Cause
  ↓ (mechanism)
Intermediate Cause
  ↓ (mechanism)
Proximate Cause
  ↓ (mechanism)
Observed Effect
```

**Example**:

```
No SSL cert renewal process (Root)
  ↓ (no automated reminders)
Cert expires unnoticed (Intermediate)
  ↓ (HTTPS handshake fails)
HTTPS connections fail (Proximate)
  ↓ (users can't connect)
Users can't access site (Effect)
```

#### Branching Chain (Multiple Paths)

```
      Root Cause
       ↙      ↘
   Path A     Path B
     ↓          ↓
  Effect A   Effect B
```

**Example**:

```
 Database migration error (Root)
        ↙            ↘
Missing indexes   Schema mismatch
      ↓                 ↓
 Slow queries       Type errors
```

#### Common Cause (Confounding)

```
    Confounder (Z)
      ↙       ↘
Variable X   Variable Y
```

X and Y correlate because Z causes both, but X doesn't cause Y.

**Example**:

```
     Hot weather (Confounder)
        ↙           ↘
Ice cream sales   Drownings
```

Ice cream doesn't cause drownings; both are caused by hot weather/summer season.
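The common-cause pattern is easy to demonstrate with synthetic data. Below is a minimal sketch (coefficients and noise levels are arbitrary assumptions, chosen only to make the pattern visible): two variables driven by the same confounder correlate strongly, yet the correlation largely disappears once the confounder is held roughly constant by stratifying on a narrow band.

```python
import numpy as np

rng = np.random.default_rng(0)

# Confounder Z (daily temperature) drives both X (ice cream sales) and
# Y (drownings); X and Y have no direct causal link.
n = 10_000
z = rng.normal(25, 5, n)                    # temperature
x = 100 + 10 * z + rng.normal(0, 10, n)     # ice cream sales
y = 2 + 0.3 * z + rng.normal(0, 0.8, n)     # drownings

print("corr(X, Y) overall:    ", round(float(np.corrcoef(x, y)[0, 1]), 2))

# Hold the confounder roughly constant by looking at a narrow temperature
# band; the spurious X-Y correlation should largely disappear.
band = (z > 24) & (z < 26)
print("corr(X, Y) within band:", round(float(np.corrcoef(x[band], y[band])[0, 1]), 2))
```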
---

## Evidence Assessment Methods

### Essential Tests

#### 1. Temporal Sequence (Must Pass)

**Rule**: Cause MUST precede effect. If effect happened before "cause," it's not causal.

**How to verify**:
- Create detailed timeline
- Mark exact timestamps of cause and effect
- Calculate lag (effect should follow cause)

**Example**:
- Migration deployed: March 10, 2:00 PM
- Queries slowed: March 10, 2:15 PM
- ✓ Cause precedes effect by 15 minutes

#### 2. Counterfactual (Strong Evidence)

**Question**: "What would have happened without the cause?"

**How to test**:
- **Control group**: Compare cases with cause vs without
- **Rollback**: Remove cause, see if effect disappears
- **Baseline comparison**: Compare to period before cause
- **A/B test**: Random assignment of cause

**Example**:
- Hypothesis: New feature X causes conversion drop
- Test: A/B test with X enabled vs disabled
- Result: No conversion difference → X not the cause

#### 3. Mechanism (Plausibility Test)

**Question**: Can you explain HOW cause produces effect?

**Requirements**:
- Pathway from cause to effect
- Each step makes sense
- Supported by theory or evidence

**Example - Strong**:
- Cause: Increased checkout latency (5 sec)
- Mechanism: Users abandon slow pages
- Evidence: Industry data shows 7+ sec → 20% abandon
- ✓ Clear, plausible mechanism

**Example - Weak**:
- Cause: Moon phase (full moon)
- Mechanism: ??? (no plausible pathway)
- Effect: Website traffic increase
- ✗ No mechanism, likely spurious correlation

---

### Evidence Hierarchy

**Strongest Evidence** (Gold Standard):

1. **Randomized Controlled Trial (RCT)**
   - Random assignment eliminates confounding
   - Compare treatment vs control group
   - Example: A/B test with random user assignment

**Strong Evidence**:

2. **Quasi-Experiment / Natural Experiment**
   - Some random-like assignment (not perfect)
   - Example: Policy implemented in one region, not another
   - Control for confounders statistically

**Medium Evidence**:

3. **Longitudinal Study (Before/After)**
   - Track same subjects over time
   - Controls for individual differences
   - Example: Metric before and after change

4. **Well-Controlled Observational Study**
   - Statistical controls for confounders
   - Multiple variables measured
   - Example: Regression analysis with covariates

**Weak Evidence**:

5. **Cross-Sectional Correlation**
   - Single point in time
   - Can't establish temporal order
   - Example: Survey at one moment

6. **Anecdote / Single Case**
   - May not generalize
   - Many confounders
   - Example: One user complaint

**Recommendation**: For high-stakes decisions, aim for quasi-experiment or better.
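At the top of the hierarchy, an A/B test reduces the counterfactual question to comparing two randomized groups. Below is a minimal sketch of that comparison using a two-proportion z-test; the helper function and the conversion counts are illustrative assumptions, not from this document.

```python
from math import erfc, sqrt

def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided two-proportion z-test: is group A's conversion rate
    different from group B's? Returns (z statistic, p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided normal tail probability
    return z, p_value

# Illustrative counts: feature X enabled (treatment) vs disabled (control).
z, p = two_proportion_ztest(conv_a=240, n_a=5000, conv_b=251, n_b=5000)
print(f"z = {z:.2f}, p = {p:.3f}")  # large p-value -> no evidence X changes conversion
```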
---

### Bradford Hill Criteria

Nine factors that strengthen causal claims. Score each on a 1-3 scale.

#### 1. Strength

**Question**: How strong is the association?

- **3**: Very strong correlation (r > 0.7, OR > 10)
- **2**: Moderate correlation (r = 0.3-0.7, OR = 2-10)
- **1**: Weak correlation (r < 0.3, OR < 2)

**Example**: 10x latency increase = strong association (3/3)

#### 2. Consistency

**Question**: Does the relationship replicate across contexts?

- **3**: Always consistent (multiple studies, settings)
- **2**: Mostly consistent (some exceptions)
- **1**: Rarely consistent (contradictory results)

**Example**: Latency affects conversion on all pages, devices, countries = consistent (3/3)

#### 3. Specificity

**Question**: Is the cause-effect relationship specific?

- **3**: Specific cause → specific effect
- **2**: Somewhat specific (some other effects)
- **1**: General/non-specific (cause has many effects)

**Example**: Missing index affects only queries on that table = specific (3/3)

#### 4. Temporality

**Question**: Does the cause precede the effect?

- **3**: Always (clear temporal sequence)
- **2**: Usually (mostly precedes)
- **1**: Unclear or reverse causation possible

**Example**: Migration (2:00 PM) before slow queries (2:15 PM) = always (3/3)

#### 5. Dose-Response

**Question**: Does more cause → more effect?

- **3**: Clear gradient (linear or monotonic)
- **2**: Some gradient (mostly true)
- **1**: No gradient (flat or random)

**Example**: Larger tables → slower queries = clear gradient (3/3)

#### 6. Plausibility

**Question**: Does the mechanism make sense?

- **3**: Very plausible (well-established theory)
- **2**: Somewhat plausible (reasonable)
- **1**: Implausible (contradicts theory)

**Example**: Index scans faster than table scans = well-established (3/3)

#### 7. Coherence

**Question**: Does it fit with existing knowledge?

- **3**: Fits well (no contradictions)
- **2**: Somewhat fits (minor contradictions)
- **1**: Contradicts existing knowledge

**Example**: Aligns with database optimization theory = fits well (3/3)

#### 8. Experiment

**Question**: Does intervention change the outcome?

- **3**: Strong experimental evidence (RCT)
- **2**: Some experimental evidence (quasi)
- **1**: No experimental evidence (observational only)

**Example**: Rollback restored performance = strong experiment (3/3)

#### 9. Analogy

**Question**: Do similar cause-effect patterns exist?

- **3**: Strong analogies (many similar cases)
- **2**: Some analogies (a few similar)
- **1**: No analogies (unique case)

**Example**: Similar patterns in other databases = some analogies (2/3)

**Scoring**:
- **Total 18-27**: Strong causal evidence
- **Total 12-17**: Moderate evidence
- **Total < 12**: Weak evidence (need more investigation)
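To apply the rubric the same way across analyses, the nine criterion scores can be totaled and mapped onto the bands above. This is a minimal sketch, not a prescribed tool; the function name and dictionary layout are illustrative conventions.

```python
from typing import Dict, Tuple

CRITERIA = (
    "strength", "consistency", "specificity", "temporality", "dose_response",
    "plausibility", "coherence", "experiment", "analogy",
)

def bradford_hill_total(scores: Dict[str, int]) -> Tuple[int, str]:
    """Sum the nine 1-3 criterion scores and map the total onto the bands:
    18-27 strong, 12-17 moderate, below 12 weak."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    if any(not 1 <= scores[c] <= 3 for c in CRITERIA):
        raise ValueError("each criterion must be scored 1-3")
    total = sum(scores[c] for c in CRITERIA)
    if total >= 18:
        band = "Strong causal evidence"
    elif total >= 12:
        band = "Moderate evidence"
    else:
        band = "Weak evidence (need more investigation)"
    return total, band

# Illustrative scores for a single hypothesis.
example = dict(strength=3, consistency=3, specificity=2, temporality=3,
               dose_response=3, plausibility=3, coherence=3, experiment=3, analogy=2)
print(bradford_hill_total(example))  # (25, 'Strong causal evidence')
```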
---

## Worked Example: Website Conversion Drop

### Problem

Website conversion rate dropped from 5% to 3% (40% relative drop) starting November 15th.

---

### 1. Effect Definition

**Effect**: Conversion rate dropped 40% (5% → 3%)

**Quantification**:
- Baseline: 5% (stable for 6 months prior)
- Current: 3% (observed for 2 weeks)
- Absolute drop: 2 percentage points
- Relative drop: 40%
- Impact: ~$280k/week revenue loss (~$40k/day, see Impact below)

**Timeline**:
- Last normal day: November 14th (5.1% conversion)
- First drop observed: November 15th (3.2% conversion)
- Ongoing: Yes (still at 3% as of November 29th)

**Impact**:
- 10,000 daily visitors
- 500 conversions → 300 conversions
- $200 average order value
- Loss: 200 conversions × $200 = $40k/day

---

### 2. Competing Hypotheses (Using Multiple Techniques)

#### Using 5 Whys:

```
Effect: Conversion dropped 40%
Why? → Users abandoning checkout
Why? → Checkout takes too long
Why? → Payment processor is slow
Why? → New processor has higher latency
Why? → Switched to cheaper processor (ROOT)
```

#### Using Fishbone Diagram:

**Technology**:
- New payment processor (Nov 10)
- New checkout UI (Nov 14)

**Process**:
- Lack of A/B testing
- No performance monitoring

**Environment**:
- Seasonal traffic shift (holiday season)

**People**:
- User behavior changes?

#### Using Timeline Analysis:

```
Nov 10: Switched to new payment processor
Nov 14: Deployed new checkout UI
Nov 15: Conversion drop observed (2:00 AM)
```

#### Competing Hypotheses Generated:

**Hypothesis 1: New checkout UI** (deployed Nov 14)
- Type: Proximate cause
- Evidence for: Timing matches (Nov 14 → Nov 15)
- Evidence against: A/B test showed no difference; rollback didn't fix

**Hypothesis 2: Payment processor switch** (Nov 10)
- Type: Root cause candidate
- Evidence for: New processor slower (400ms vs 100ms); timing precedes
- Evidence against: 5-day lag (why not immediate?)

**Hypothesis 3: Payment latency increase** (Nov 15)
- Type: Proximate cause/symptom
- Evidence for: Logs show 5-8 sec checkout (was 2-3 sec); user complaints
- Evidence against: None (strong evidence)

**Hypothesis 4: Seasonal traffic shift**
- Type: Confounder
- Evidence for: Holiday season
- Evidence against: Traffic composition unchanged

---

### 3. Causal Model

#### Causal Chain:

```
ROOT: Switched to cheaper payment processor (Nov 10)
  ↓ (mechanism: lower throughput processor)
INTERMEDIATE: Payment API latency under load
  ↓ (mechanism: slow API → page delays)
PROXIMATE: Checkout takes 5-8 seconds (Nov 15+)
  ↓ (mechanism: users abandon slow pages)
EFFECT: Conversion drops 40%
```

#### Why the Nov 15 lag?
- Nov 10-14: Weekday traffic (low, processor handled it)
- Nov 15: Weekend traffic spike (2x normal)
- Processor couldn't handle weekend load → latency spike

#### Confounders Ruled Out:
- UI change: Rollback had no effect
- Seasonality: Traffic patterns unchanged
- Marketing: No changes

---

### 4. Evidence Assessment

**Temporal Sequence**: ✓ PASS (3/3)
- Processor switch (Nov 10) → Latency (Nov 15) → Drop (Nov 15)
- Clear precedence

**Counterfactual**: ✓ PASS (3/3)
- Test: Switched back to old processor
- Result: Conversion recovered to 4.8%
- Strong evidence

**Mechanism**: ✓ PASS (3/3)
- Slow processor (400ms) → High load → 5-8 sec checkout → User abandonment
- Industry data: 7+ sec = 20% abandon
- Clear, plausible mechanism

**Dose-Response**: ✓ PASS (3/3)

| Checkout Time | Conversion |
|---------------|------------|
| <3 sec | 5% |
| 3-5 sec | 4% |
| 5-7 sec | 3% |
| >7 sec | 2% |

Clear gradient

**Consistency**: ✓ PASS (3/3)
- All devices (mobile, desktop, tablet)
- All payment methods
- All countries
- Consistent pattern

**Bradford Hill Score: 25/27** (Very Strong)

1. Strength: 3 (40% drop)
2. Consistency: 3 (all segments)
3. Specificity: 2 (latency affects other things)
4. Temporality: 3 (clear sequence)
5. Dose-response: 3 (gradient)
6. Plausibility: 3 (well-known)
7. Coherence: 3 (fits theory)
8. Experiment: 3 (rollback test)
9. Analogy: 2 (some similar cases)
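The dose-response gradient in the table above can be sanity-checked with a rank correlation. This is a minimal sketch: the bucket midpoints are assumptions made only for illustration, and with just four buckets it is a descriptive check of monotonicity rather than a significance test.

```python
from scipy.stats import spearmanr

# Approximate midpoints for the checkout-time buckets in the table above
# (<3, 3-5, 5-7, >7 seconds); the midpoints themselves are assumptions.
checkout_seconds = [2.0, 4.0, 6.0, 8.0]
conversion_rate = [0.05, 0.04, 0.03, 0.02]

rho, _ = spearmanr(checkout_seconds, conversion_rate)
print(f"Spearman rho = {rho:.2f}")  # -1.00: perfectly monotonic decreasing gradient
```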
---

### 5. Conclusion

**Root Cause**: Switched to cheaper payment processor with insufficient throughput for weekend traffic.

**Confidence**: High (95%+)

**Why high confidence**:
- Perfect temporal sequence
- Strong counterfactual (rollback restored)
- Clear mechanism
- Dose-response present
- Consistent across contexts
- No confounders
- Bradford Hill 25/27

**Why root (not symptom)**:
- Processor switch is the decision that created the problem
- Latency is a symptom of this decision
- Fixing latency alone treats the symptom
- Reverting the switch eliminates the problem

---

### 6. Recommended Actions

**Immediate**:
1. ✓ Reverted to old processor (Nov 28)
2. Negotiate better rates with old processor

**If keeping new processor**:
1. Add caching layer (reduce latency <2 sec)
2. Implement async checkout
3. Load test at 3x peak

**Validation**:
- A/B test: Old processor vs new+caching
- Monitor: Latency p95, conversion rate
- Success: Conversion >4.5%, latency <2 sec

---

### 7. Limitations

**Data limitations**:
- No abandonment reason tracking
- Processor metrics limited (black box)

**Analysis limitations**:
- Didn't test all latency optimizations
- Small sample for some device types

**Generalizability**:
- Specific to our checkout flow
- May not apply to other traffic patterns

---

## Quality Checklist

Before finalizing a root cause analysis, verify:

**Effect Definition**:
- [ ] Effect clearly described (not vague)
- [ ] Quantified with magnitude
- [ ] Timeline established (when started, duration)
- [ ] Baseline for comparison stated

**Hypothesis Generation**:
- [ ] Multiple hypotheses generated (3+ competing)
- [ ] Used systematic techniques (5 Whys, Fishbone, Timeline, Differential)
- [ ] Proximate vs root distinguished
- [ ] Confounders identified

**Causal Model**:
- [ ] Causal chain mapped (root → proximate → effect)
- [ ] Mechanisms explained for each link
- [ ] Confounding relationships noted
- [ ] Direct vs indirect causation clear

**Evidence Assessment**:
- [ ] Temporal sequence verified (cause before effect)
- [ ] Counterfactual tested (what if cause absent)
- [ ] Mechanism explained with supporting evidence
- [ ] Dose-response checked (more cause → more effect)
- [ ] Consistency assessed (holds across contexts)
- [ ] Confounders ruled out or controlled

**Conclusion**:
- [ ] Root cause clearly stated
- [ ] Confidence level stated with justification
- [ ] Distinguished from proximate causes/symptoms
- [ ] Alternative explanations acknowledged
- [ ] Limitations noted

**Recommendations**:
- [ ] Address root cause (not just symptoms)
- [ ] Actionable interventions proposed
- [ ] Validation tests specified
- [ ] Success metrics defined

**Minimum Standards**:
- For medium-stakes analyses (postmortems, feature issues): Score ≥ 3.5 on rubric
- For high-stakes analyses (infrastructure, safety): Score ≥ 4.0 on rubric
- Red flags: <3 on Temporal Sequence, Counterfactual, or Mechanism = weak causal claim
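The minimum standards and red flags can be encoded as a simple gate so every analysis is checked the same way. This is a minimal sketch: the function and argument names are illustrative, and `rubric_score` is whatever overall rubric score the Minimum Standards above refer to.

```python
from typing import Dict, List

def quality_gate(essential: Dict[str, int], rubric_score: float, high_stakes: bool) -> List[str]:
    """Return a list of blocking issues; an empty list means the analysis passes.
    `essential` holds the 1-3 scores for the three must-pass checks."""
    issues: List[str] = []
    for criterion in ("temporal_sequence", "counterfactual", "mechanism"):
        if essential.get(criterion, 0) < 3:
            issues.append(f"Red flag: {criterion} scored <3 -- weak causal claim")
    threshold = 4.0 if high_stakes else 3.5
    if rubric_score < threshold:
        issues.append(f"Rubric score {rubric_score} below minimum {threshold}")
    return issues

# Example: an analysis with strong essential tests and an illustrative rubric score.
print(quality_gate({"temporal_sequence": 3, "counterfactual": 3, "mechanism": 3},
                   rubric_score=4.2, high_stakes=True))  # [] -> passes the gate
```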