Causal Inference & Root Cause Analysis Methodology
Root Cause Analysis Workflow
Copy this checklist and track your progress:
Root Cause Analysis Progress:
- [ ] Step 1: Generate hypotheses using structured techniques
- [ ] Step 2: Build causal model and distinguish cause types
- [ ] Step 3: Gather evidence and assess temporal sequence
- [ ] Step 4: Test causality with Bradford Hill criteria
- [ ] Step 5: Verify root cause and check coherence
Step 1: Generate hypotheses using structured techniques
Use 5 Whys, Fishbone diagrams, timeline analysis, or differential diagnosis to systematically generate potential causes. See Hypothesis Generation Techniques for detailed methods.
Step 2: Build causal model and distinguish cause types
Map causal chains to distinguish root causes from proximate causes, symptoms, and confounders. See Causal Model Building for cause type definitions and chain mapping.
Step 3: Gather evidence and assess temporal sequence
Collect evidence for each hypothesis and verify temporal relationships (cause must precede effect). See Evidence Assessment Methods for essential tests and evidence hierarchy.
Step 4: Test causality with Bradford Hill criteria
Score hypotheses using the 9 Bradford Hill criteria (strength, consistency, specificity, temporality, dose-response, plausibility, coherence, experiment, analogy). See Bradford Hill Criteria for scoring rubric.
Step 5: Verify root cause and check coherence
Ensure the identified root cause has strong evidence, fits with known facts, and addresses systemic issues. See Quality Checklist and Worked Example for validation techniques.
Hypothesis Generation Techniques
5 Whys (Trace Back to Root)
Start with the effect and ask "why" repeatedly until you reach the root cause.
Process:
Effect: {Observed problem}
Why? → {Proximate cause 1}
Why? → {Proximate cause 2}
Why? → {Intermediate cause}
Why? → {Deeper cause}
Why? → {Root cause}
Example:
Effect: Server crashed
Why? → Out of memory
Why? → Memory leak in new code
Why? → Connection pool not releasing connections
Why? → Error handling missing in new feature
Why? → Code review process didn't catch this edge case (ROOT CAUSE)
When to stop: When you reach a cause that, if fixed, would prevent recurrence.
Pitfalls to avoid:
- Stopping too early at proximate causes
- Following only one path (should explore multiple branches)
- Accepting vague answers ("because it broke")
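To avoid the single-path pitfall, the why-chain can be kept as data so multiple branches stay visible. Below is a minimal Python sketch, assuming causes are recorded as a parent-to-children mapping; every string is the hypothetical server-crash example above, not a prescribed schema.
```python
# Minimal sketch of a 5 Whys trace kept as data so branches aren't lost.
# All effect/cause strings are hypothetical illustrations.

why_tree = {
    "Server crashed": ["Out of memory"],
    "Out of memory": ["Memory leak in new code"],
    "Memory leak in new code": ["Connection pool not releasing connections"],
    "Connection pool not releasing connections": ["Error handling missing in new feature"],
    "Error handling missing in new feature": ["Code review process didn't catch this edge case"],
}

def trace_whys(effect: str, tree: dict, depth: int = 0) -> None:
    """Print every branch from the observed effect down to its deepest causes."""
    prefix = "Effect: " if depth == 0 else "Why? -> "
    print("  " * depth + prefix + effect)
    for cause in tree.get(effect, []):
        trace_whys(cause, tree, depth + 1)

trace_whys("Server crashed", why_tree)
```
Because each node can have several children, a second candidate branch (e.g. an infrastructure cause) can be added without losing the first.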
Fishbone Diagram (Categorize Causes)
Organize potential causes by category to ensure comprehensive coverage.
Categories (6 Ms):
- Methods, Machines, Materials (upper branches)
- Manpower, Measurement, Environment (lower branches)
Each branch feeds into the central spine, which points to the Effect.
Adapted for software/product:
- People: Skills, knowledge, communication, errors
- Process: Procedures, workflows, standards, review
- Technology: Tools, systems, infrastructure, dependencies
- Environment: External factors, timing, context
Example (API outage):
- People: On-call engineer inexperienced with database
- Process: No rollback procedure documented
- Technology: Database connection pooling misconfigured
- Environment: Traffic spike coincided with deploy
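The fishbone structure maps naturally onto a small dictionary of candidate causes per category, which also makes coverage gaps visible. A minimal sketch, populated with the hypothetical API-outage entries above:
```python
# Minimal sketch: candidate causes grouped by fishbone category.
# Empty categories are flagged so coverage gaps stand out.

fishbone = {
    "People": ["On-call engineer inexperienced with database"],
    "Process": ["No rollback procedure documented"],
    "Technology": ["Database connection pooling misconfigured"],
    "Environment": ["Traffic spike coincided with deploy"],
}

for category, causes in fishbone.items():
    status = "; ".join(causes) if causes else "(no candidates yet -- revisit this category)"
    print(f"{category}: {status}")
```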
Timeline Analysis (What Changed?)
Map events leading up to effect to identify temporal correlations.
Process:
- Establish baseline period (normal operation)
- Mark when effect first observed
- List all changes in days/hours leading up to effect
- Identify changes that temporally precede effect
Example:
T-7 days: Normal operation (baseline)
T-3 days: Deployed new payment processor integration
T-2 days: Traffic increased 20%
T-1 day: First error logged (dismissed as fluke)
T-0 (14:00): Conversion rate dropped 30%
T+1 hour: Alert fired, investigation started
Look for: Changes, anomalies, or events right before effect (temporal precedence).
Red flags:
- Multiple changes at once (hard to isolate cause)
- Long lag between change and effect (harder to connect)
- No changes before effect (may be external factor or accumulated technical debt)
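A timeline like this can be screened programmatically: record every event with a timestamp and kind, then list only the changes and anomalies that precede the first observation of the effect. The sketch below uses hypothetical absolute dates in place of the relative T-offsets above.
```python
# Minimal sketch of timeline screening: which changes precede the effect?
# Dates and event descriptions are hypothetical illustrations.
from datetime import datetime

events = [
    (datetime(2024, 3, 3, 9, 0),   "Normal operation (baseline)",                 "baseline"),
    (datetime(2024, 3, 7, 16, 0),  "Deployed new payment processor integration",  "change"),
    (datetime(2024, 3, 8, 11, 0),  "Traffic increased 20%",                       "change"),
    (datetime(2024, 3, 9, 10, 0),  "First error logged",                          "anomaly"),
    (datetime(2024, 3, 10, 14, 0), "Conversion rate dropped 30%",                 "effect"),
]

effect_time = next(ts for ts, _, kind in events if kind == "effect")
candidates = [(ts, desc) for ts, desc, kind in events
              if kind in ("change", "anomaly") and ts < effect_time]

for ts, desc in candidates:
    print(f"{desc}: {effect_time - ts} before the effect")
```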
Differential Diagnosis
List all conditions/causes that could produce the observed symptoms.
Process:
- List all possible causes (be comprehensive)
- For each cause, list symptoms it would produce
- Compare predicted symptoms to observed symptoms
- Eliminate causes that don't match
Example (API returning 500 errors):
| Cause | Predicted Symptoms | Observed? |
|---|---|---|
| Code bug (logic error) | Specific endpoints fail, reproducible | ✓ Yes |
| Resource exhaustion (memory) | All endpoints slow, CPU/memory high | ✗ CPU normal |
| Dependency failure (database) | All database queries fail, connection errors | ✗ DB responsive |
| Configuration error | Service won't start or immediate failure | ✗ Started fine |
| DDoS attack | High traffic, distributed sources | ✗ Traffic normal |
Conclusion: Code bug most likely (matches symptoms, others ruled out).
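The same elimination step can be expressed as a set comparison: a hypothesis survives only if every symptom it predicts was actually observed. A minimal sketch, using the hypothetical API-500 symptoms from the table above:
```python
# Minimal sketch of differential diagnosis: compare predicted vs observed symptoms.
# A cause is kept only if every symptom it predicts was actually observed.
# Symptom strings are hypothetical illustrations.

observed = {"specific endpoints fail", "reproducible"}

predictions = {
    "Code bug (logic error)":        {"specific endpoints fail", "reproducible"},
    "Resource exhaustion (memory)":  {"all endpoints slow", "cpu/memory high"},
    "Dependency failure (database)": {"all database queries fail", "connection errors"},
    "Configuration error":           {"service won't start"},
    "DDoS attack":                   {"high traffic", "distributed sources"},
}

for cause, predicted in predictions.items():
    matched = predicted & observed
    verdict = "keep" if matched == predicted else "eliminate"
    print(f"{cause}: matched {len(matched)}/{len(predicted)} predicted symptoms -> {verdict}")
```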
Causal Model Building
Distinguish Cause Types
1. Root Cause
- Definition: Fundamental underlying issue
- Test: If fixed, problem doesn't recur
- Usually: Structural, procedural, or systemic
- Example: "No code review process for infrastructure changes"
2. Proximate Cause
- Definition: Immediate trigger
- Test: Directly precedes effect
- Often: Symptom of deeper root cause
- Example: "Deployed buggy config file"
3. Contributing Factor
- Definition: Makes effect more likely or severe
- Test: Not sufficient alone, but amplifies
- Often: Context or conditions
- Example: "High traffic amplified impact of bug"
4. Coincidence
- Definition: Happened around same time, no causal relationship
- Test: No mechanism, doesn't pass counterfactual
- Example: "Marketing campaign launched same day" (but unrelated to technical issue)
Map Causal Chains
Linear Chain
Root Cause
↓ (mechanism)
Intermediate Cause
↓ (mechanism)
Proximate Cause
↓ (mechanism)
Observed Effect
Example:
No SSL cert renewal process (Root)
↓ (no automated reminders)
Cert expires unnoticed (Intermediate)
↓ (HTTPS handshake fails)
HTTPS connections fail (Proximate)
↓ (users can't connect)
Users can't access site (Effect)
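A linear chain is easy to keep as ordered links of (cause, mechanism, effect), which forces a mechanism to be stated for every step. A minimal sketch, assuming that representation and reusing the SSL-cert example above:
```python
# Minimal sketch: a linear causal chain as ordered (cause, mechanism, effect) links.
# Each link's effect is the next link's cause; strings are the example above.

chain = [
    ("No SSL cert renewal process (root)", "no automated reminders", "Cert expires unnoticed"),
    ("Cert expires unnoticed", "HTTPS handshake fails", "HTTPS connections fail"),
    ("HTTPS connections fail", "users can't connect", "Users can't access site (effect)"),
]

for cause, mechanism, _ in chain:
    print(cause)
    print(f"  | ({mechanism})")
print(chain[-1][-1])  # final observed effect
```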
Branching Chain (Multiple Paths)
Root Cause
↙ ↘
Path A Path B
↓ ↓
Effect A Effect B
Example:
Database migration error (Root)
↙ ↘
Missing indexes Schema mismatch
↓ ↓
Slow queries Type errors
Common Cause (Confounding)
Confounder (Z)
↙ ↘
Variable X Variable Y
X and Y correlate because Z causes both, but X doesn't cause Y.
Example:
Hot weather (Confounder)
↙ ↘
Ice cream sales Drownings
Ice cream doesn't cause drownings; both are caused by hot weather/summer season.
Evidence Assessment Methods
Essential Tests
1. Temporal Sequence (Must Pass)
Rule: Cause MUST precede effect. If effect happened before "cause," it's not causal.
How to verify:
- Create detailed timeline
- Mark exact timestamps of cause and effect
- Calculate lag (effect should follow cause)
Example:
- Migration deployed: March 10, 2:00 PM
- Queries slowed: March 10, 2:15 PM
- ✓ Cause precedes effect by 15 minutes
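The check reduces to comparing two timestamps and failing fast when the lag is not positive. A minimal sketch, with a hypothetical year filled in for the March 10 times above:
```python
# Minimal sketch of the temporal-sequence test: cause must precede effect.
# Timestamps are hypothetical illustrations of the migration example.
from datetime import datetime

cause_time = datetime(2024, 3, 10, 14, 0)    # migration deployed
effect_time = datetime(2024, 3, 10, 14, 15)  # queries slowed

lag = effect_time - cause_time
if lag.total_seconds() <= 0:
    print("FAIL: effect precedes (or coincides with) the supposed cause -- not causal")
else:
    print(f"PASS: cause precedes effect by {lag}")
```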
2. Counterfactual (Strong Evidence)
Question: "What would have happened without the cause?"
How to test:
- Control group: Compare cases with cause vs without
- Rollback: Remove cause, see if effect disappears
- Baseline comparison: Compare to period before cause
- A/B test: Random assignment of cause
Example:
- Hypothesis: New feature X causes conversion drop
- Test: A/B test with X enabled vs disabled
- Result: No conversion difference → X not the cause
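When the counterfactual is run as an A/B test, the comparison is two conversion proportions. A minimal sketch, assuming hypothetical counts and using a two-proportion z-test as a rough significance check:
```python
# Minimal sketch of a counterfactual A/B comparison (feature X on vs off).
# Counts are hypothetical; |z| < 1.96 suggests no significant difference.
from math import sqrt

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

z = two_proportion_z(conv_a=480, n_a=10_000, conv_b=505, n_b=10_000)
print(f"z = {z:.2f}")  # here |z| < 1.96 -> no detectable effect -> X likely not the cause
```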
3. Mechanism (Plausibility Test)
Question: Can you explain HOW cause produces effect?
Requirements:
- Pathway from cause to effect
- Each step makes sense
- Supported by theory or evidence
Example - Strong:
- Cause: Increased checkout latency (5 sec)
- Mechanism: Users abandon slow pages
- Evidence: Industry data shows 7+ sec → 20% abandon
- ✓ Clear, plausible mechanism
Example - Weak:
- Cause: Moon phase (full moon)
- Mechanism: ??? (no plausible pathway)
- Effect: Website traffic increase
- ✗ No mechanism, likely spurious correlation
Evidence Hierarchy
Strongest Evidence (Gold Standard):
- Randomized Controlled Trial (RCT)
- Random assignment eliminates confounding
- Compare treatment vs control group
- Example: A/B test with random user assignment
Strong Evidence:
- Quasi-Experiment / Natural Experiment
- Some random-like assignment (not perfect)
- Example: Policy implemented in one region, not another
- Control for confounders statistically
Medium Evidence:
- Longitudinal Study (Before/After)
  - Track same subjects over time
  - Controls for individual differences
  - Example: Metric before and after change
- Well-Controlled Observational Study
  - Statistical controls for confounders
  - Multiple variables measured
  - Example: Regression analysis with covariates
Weak Evidence:
- Cross-Sectional Correlation
  - Single point in time
  - Can't establish temporal order
  - Example: Survey at one moment
- Anecdote / Single Case
  - May not generalize
  - Many confounders
  - Example: One user complaint
Recommendation: For high-stakes decisions, aim for quasi-experiment or better.
Bradford Hill Criteria
Nine factors that strengthen causal claims. Score each on a 1-3 scale.
1. Strength
Question: How strong is the association?
- 3: Very strong correlation (r > 0.7, OR > 10)
- 2: Moderate correlation (r = 0.3-0.7, OR = 2-10)
- 1: Weak correlation (r < 0.3, OR < 2)
Example: 10x latency increase = strong association (3/3)
2. Consistency
Question: Does relationship replicate across contexts?
- 3: Always consistent (multiple studies, settings)
- 2: Mostly consistent (some exceptions)
- 1: Rarely consistent (contradictory results)
Example: Latency affects conversion on all pages, devices, countries = consistent (3/3)
3. Specificity
Question: Is cause-effect relationship specific?
- 3: Specific cause → specific effect
- 2: Somewhat specific (some other effects)
- 1: General/non-specific (cause has many effects)
Example: Missing index affects only queries on that table = specific (3/3)
4. Temporality
Question: Does cause precede effect?
- 3: Always (clear temporal sequence)
- 2: Usually (mostly precedes)
- 1: Unclear or reverse causation possible
Example: Migration (2:00 PM) before slow queries (2:15 PM) = always (3/3)
5. Dose-Response
Question: Does more cause → more effect?
- 3: Clear gradient (linear or monotonic)
- 2: Some gradient (mostly true)
- 1: No gradient (flat or random)
Example: Larger tables → slower queries = clear gradient (3/3)
6. Plausibility
Question: Does mechanism make sense?
- 3: Very plausible (well-established theory)
- 2: Somewhat plausible (reasonable)
- 1: Implausible (contradicts theory)
Example: Index scans faster than table scans = well-established (3/3)
7. Coherence
Question: Fits with existing knowledge?
- 3: Fits well (no contradictions)
- 2: Somewhat fits (minor contradictions)
- 1: Contradicts existing knowledge
Example: Aligns with database optimization theory = fits well (3/3)
8. Experiment
Question: Does intervention change outcome?
- 3: Strong experimental evidence (RCT)
- 2: Some experimental evidence (quasi)
- 1: No experimental evidence (observational only)
Example: Rollback restored performance = strong experiment (3/3)
9. Analogy
Question: Similar cause-effect patterns exist?
- 3: Strong analogies (many similar cases)
- 2: Some analogies (a few similar)
- 1: No analogies (unique case)
Example: Similar patterns in other databases = some analogies (2/3)
Scoring:
- Total 18-27: Strong causal evidence
- Total 12-17: Moderate evidence
- Total < 12: Weak evidence (need more investigation)
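The scoring is a simple sum over the nine criteria plus a threshold lookup. A minimal sketch, pre-filled with the scores from the worked example later in this document:
```python
# Minimal sketch of Bradford Hill scoring: sum nine 1-3 scores and classify.
# The scores shown are from the worked example below.

scores = {
    "strength": 3, "consistency": 3, "specificity": 2,
    "temporality": 3, "dose_response": 3, "plausibility": 3,
    "coherence": 3, "experiment": 3, "analogy": 2,
}

total = sum(scores.values())
if total >= 18:
    verdict = "Strong causal evidence"
elif total >= 12:
    verdict = "Moderate evidence"
else:
    verdict = "Weak evidence (need more investigation)"

print(f"Bradford Hill total: {total}/27 -> {verdict}")
```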
Worked Example: Website Conversion Drop
Problem
Website conversion rate dropped from 5% to 3% (40% relative drop) starting November 15th.
1. Effect Definition
Effect: Conversion rate dropped 40% (5% → 3%)
Quantification:
- Baseline: 5% (stable for 6 months prior)
- Current: 3% (observed for 2 weeks)
- Absolute drop: 2 percentage points
- Relative drop: 40%
- Impact: ~$40k/day (~$280k/week) revenue loss
Timeline:
- Last normal day: November 14th (5.1% conversion)
- First drop observed: November 15th (3.2% conversion)
- Ongoing: Yes (still at 3% as of November 29th)
Impact:
- 10,000 daily visitors
- 500 conversions → 300 conversions
- $200 average order value
- Loss: 200 conversions × $200 = $40k/day
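The impact arithmetic above is worth keeping reproducible; a minimal sketch using only the figures stated in this example:
```python
# Minimal sketch of the impact arithmetic; all figures are from the worked example.
daily_visitors = 10_000
baseline_rate, current_rate = 0.05, 0.03
avg_order_value = 200

lost_conversions = daily_visitors * (baseline_rate - current_rate)  # ~200 per day
daily_loss = lost_conversions * avg_order_value                     # ~$40,000 per day
print(f"Lost conversions/day: {lost_conversions:.0f}")
print(f"Revenue loss: ${daily_loss:,.0f}/day (~${daily_loss * 7:,.0f}/week)")
```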
2. Competing Hypotheses (Using Multiple Techniques)
Using 5 Whys:
Effect: Conversion dropped 40%
Why? → Users abandoning checkout
Why? → Checkout takes too long
Why? → Payment processor is slow
Why? → New processor has higher latency
Why? → Switched to cheaper processor (ROOT)
Using Fishbone Diagram:
Technology:
- New payment processor (Nov 10)
- New checkout UI (Nov 14)
Process:
- Lack of A/B testing
- No performance monitoring
Environment:
- Seasonal traffic shift (holiday season)
People:
- User behavior changes?
Using Timeline Analysis:
Nov 10: Switched to new payment processor
Nov 14: Deployed new checkout UI
Nov 15: Conversion drop observed (2:00 AM)
Competing Hypotheses Generated:
Hypothesis 1: New checkout UI (deployed Nov 14)
- Type: Proximate cause
- Evidence for: Timing matches (Nov 14 → Nov 15)
- Evidence against: A/B test showed no difference; Rollback didn't fix
Hypothesis 2: Payment processor switch (Nov 10)
- Type: Root cause candidate
- Evidence for: New processor slower (400ms vs 100ms); Timing precedes
- Evidence against: 5-day lag (why not immediate?)
Hypothesis 3: Payment latency increase (Nov 15)
- Type: Proximate cause/symptom
- Evidence for: Logs show 5-8 sec checkout (was 2-3 sec); User complaints
- Evidence against: None (strong evidence)
Hypothesis 4: Seasonal traffic shift
- Type: Confounder
- Evidence for: Holiday season
- Evidence against: Traffic composition unchanged
3. Causal Model
Causal Chain:
ROOT: Switched to cheaper payment processor (Nov 10)
↓ (mechanism: lower throughput processor)
INTERMEDIATE: Payment API latency under load
↓ (mechanism: slow API → page delays)
PROXIMATE: Checkout takes 5-8 seconds (Nov 15+)
↓ (mechanism: users abandon slow pages)
EFFECT: Conversion drops 40%
Why Nov 15 lag?
- Nov 10-14: Weekday traffic (low, processor handled it)
- Nov 15: Weekend traffic spike (2x normal)
- Processor couldn't handle weekend load → latency spike
Confounders Ruled Out:
- UI change: Rollback had no effect
- Seasonality: Traffic patterns unchanged
- Marketing: No changes
4. Evidence Assessment
Temporal Sequence: ✓ PASS (3/3)
- Processor switch (Nov 10) → Latency (Nov 15) → Drop (Nov 15)
- Clear precedence
Counterfactual: ✓ PASS (3/3)
- Test: Switched back to old processor
- Result: Conversion recovered to 4.8%
- Strong evidence
Mechanism: ✓ PASS (3/3)
- Slow processor (400ms) → High load → 5-8 sec checkout → User abandonment
- Industry data: 7+ sec = 20% abandon
- Clear, plausible mechanism
Dose-Response: ✓ PASS (3/3)
| Checkout Time | Conversion |
|---|---|
| <3 sec | 5% |
| 3-5 sec | 4% |
| 5-7 sec | 3% |
| >7 sec | 2% |
Clear gradient
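The gradient check itself is just a monotonicity test over the bucketed rates. A minimal sketch, assuming the figures from the table above:
```python
# Minimal sketch of the dose-response check: conversion should fall as checkout time rises.
# Bucketed rates are taken from the table above.

buckets = [("<3 sec", 0.05), ("3-5 sec", 0.04), ("5-7 sec", 0.03), (">7 sec", 0.02)]
rates = [rate for _, rate in buckets]
monotonic = all(a > b for a, b in zip(rates, rates[1:]))
print("Clear dose-response gradient" if monotonic else "No clear gradient")
```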
Consistency: ✓ PASS (3/3)
- All devices (mobile, desktop, tablet)
- All payment methods
- All countries
- Consistent pattern
Bradford Hill Score: 25/27 (Very Strong)
- Strength: 3 (40% drop)
- Consistency: 3 (all segments)
- Specificity: 2 (latency affects other things)
- Temporality: 3 (clear sequence)
- Dose-response: 3 (gradient)
- Plausibility: 3 (well-known)
- Coherence: 3 (fits theory)
- Experiment: 3 (rollback test)
- Analogy: 2 (some similar cases)
5. Conclusion
Root Cause: Switched to cheaper payment processor with insufficient throughput for weekend traffic.
Confidence: High (95%+)
Why high confidence:
- Perfect temporal sequence
- Strong counterfactual (rollback restored)
- Clear mechanism
- Dose-response present
- Consistent across contexts
- No confounders
- Bradford Hill 25/27
Why root (not symptom):
- Processor switch is decision that created problem
- Latency is symptom of this decision
- Fixing latency alone treats symptom
- Reverting switch eliminates problem
6. Recommended Actions
Immediate:
- ✓ Reverted to old processor (Nov 28)
- Negotiate better rates with old processor
If keeping new processor:
- Add caching layer (reduce latency <2 sec)
- Implement async checkout
- Load test at 3x peak
Validation:
- A/B test: Old processor vs new+caching
- Monitor: Latency p95, conversion rate
- Success: Conversion >4.5%, latency <2 sec
7. Limitations
Data limitations:
- No abandonment reason tracking
- Processor metrics limited (black box)
Analysis limitations:
- Didn't test all latency optimizations
- Small sample for some device types
Generalizability:
- Specific to our checkout flow
- May not apply to other traffic patterns
Quality Checklist
Before finalizing root cause analysis, verify:
Effect Definition:
- Effect clearly described (not vague)
- Quantified with magnitude
- Timeline established (when started, duration)
- Baseline for comparison stated
Hypothesis Generation:
- Multiple hypotheses generated (3+ competing)
- Used systematic techniques (5 Whys, Fishbone, Timeline, Differential)
- Proximate vs root distinguished
- Confounders identified
Causal Model:
- Causal chain mapped (root → proximate → effect)
- Mechanisms explained for each link
- Confounding relationships noted
- Direct vs indirect causation clear
Evidence Assessment:
- Temporal sequence verified (cause before effect)
- Counterfactual tested (what if cause absent)
- Mechanism explained with supporting evidence
- Dose-response checked (more cause → more effect)
- Consistency assessed (holds across contexts)
- Confounders ruled out or controlled
Conclusion:
- Root cause clearly stated
- Confidence level stated with justification
- Distinguished from proximate causes/symptoms
- Alternative explanations acknowledged
- Limitations noted
Recommendations:
- Address root cause (not just symptoms)
- Actionable interventions proposed
- Validation tests specified
- Success metrics defined
Minimum Standards:
- For medium-stakes (postmortems, feature issues): Score ≥ 3.5 on rubric
- For high-stakes (infrastructure, safety): Score ≥ 4.0 on rubric
- Red flags: <3 on Temporal Sequence, Counterfactual, or Mechanism = weak causal claim