Causal Inference & Root Cause Analysis Methodology

Root Cause Analysis Workflow

Copy this checklist and track your progress:

Root Cause Analysis Progress:
- [ ] Step 1: Generate hypotheses using structured techniques
- [ ] Step 2: Build causal model and distinguish cause types
- [ ] Step 3: Gather evidence and assess temporal sequence
- [ ] Step 4: Test causality with Bradford Hill criteria
- [ ] Step 5: Verify root cause and check coherence

Step 1: Generate hypotheses using structured techniques

Use 5 Whys, Fishbone diagrams, timeline analysis, or differential diagnosis to systematically generate potential causes. See Hypothesis Generation Techniques for detailed methods.

Step 2: Build causal model and distinguish cause types

Map causal chains to distinguish root causes from proximate causes, symptoms, and confounders. See Causal Model Building for cause type definitions and chain mapping.

Step 3: Gather evidence and assess temporal sequence

Collect evidence for each hypothesis and verify temporal relationships (cause must precede effect). See Evidence Assessment Methods for essential tests and evidence hierarchy.

Step 4: Test causality with Bradford Hill criteria

Score hypotheses using the 9 Bradford Hill criteria (strength, consistency, specificity, temporality, dose-response, plausibility, coherence, experiment, analogy). See Bradford Hill Criteria for scoring rubric.

Step 5: Verify root cause and check coherence

Ensure the identified root cause has strong evidence, fits with known facts, and addresses systemic issues. See Quality Checklist and Worked Example for validation techniques.


Hypothesis Generation Techniques

5 Whys (Trace Back to Root)

Start with the effect and ask "why" repeatedly until you reach the root cause.

Process:

Effect: {Observed problem}
Why? → {Proximate cause 1}
Why? → {Proximate cause 2}
Why? → {Intermediate cause}
Why? → {Deeper cause}
Why? → {Root cause}

Example:

Effect: Server crashed
Why? → Out of memory
Why? → Memory leak in new code
Why? → Connection pool not releasing connections
Why? → Error handling missing in new feature
Why? → Code review process didn't catch this edge case (ROOT CAUSE)

When to stop: When you reach a cause that, if fixed, would prevent recurrence.

Pitfalls to avoid:

  • Stopping too early at proximate causes
  • Following only one path (should explore multiple branches)
  • Accepting vague answers ("because it broke")
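To keep a 5 Whys session branch-aware instead of collapsing it into a single chain, one option is to record it as a small tree. This is only a sketch; `WhyNode` and its methods are illustrative names, not a library API:

```python
from dataclasses import dataclass, field

@dataclass
class WhyNode:
    """One node in a 5 Whys tree; children let you explore multiple branches."""
    statement: str
    children: list["WhyNode"] = field(default_factory=list)

    def ask_why(self, answer: str) -> "WhyNode":
        child = WhyNode(answer)
        self.children.append(child)
        return child

    def leaves(self) -> list["WhyNode"]:
        """Candidate root causes are the leaves of the tree."""
        if not self.children:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]

# Usage: record the server-crash example above.
effect = WhyNode("Server crashed")
pool = effect.ask_why("Out of memory").ask_why("Memory leak in new code") \
             .ask_why("Connection pool not releasing connections")
pool.ask_why("Error handling missing in new feature") \
    .ask_why("Code review process didn't catch this edge case")
print([leaf.statement for leaf in effect.leaves()])
```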

Fishbone Diagram (Categorize Causes)

Organize potential causes by category to ensure comprehensive coverage.

Categories (6 Ms):

Methods         Machines        Materials
   |               |               |
   |               |               |
   └───────────────┴───────────────┴─────────→ Effect
   |               |               |
   |               |               |
Manpower      Measurement    Environment

Adapted for software/product:

  • People: Skills, knowledge, communication, errors
  • Process: Procedures, workflows, standards, review
  • Technology: Tools, systems, infrastructure, dependencies
  • Environment: External factors, timing, context

Example (API outage):

  • People: On-call engineer inexperienced with database
  • Process: No rollback procedure documented
  • Technology: Database connection pooling misconfigured
  • Environment: Traffic spike coincided with deploy

Timeline Analysis (What Changed?)

Map events leading up to effect to identify temporal correlations.

Process:

  1. Establish baseline period (normal operation)
  2. Mark when effect first observed
  3. List all changes in days/hours leading up to effect
  4. Identify changes that temporally precede effect

Example:

T-7 days: Normal operation (baseline)
T-3 days: Deployed new payment processor integration
T-2 days: Traffic increased 20%
T-1 day: First error logged (dismissed as fluke)
T-0 (14:00): Conversion rate dropped 30%
T+1 hour: Alert fired, investigation started

Look for: Changes, anomalies, or events right before effect (temporal precedence).

Red flags:

  • Multiple changes at once (hard to isolate cause)
  • Long lag between change and effect (harder to connect)
  • No changes before effect (may be external factor or accumulated technical debt)
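As a sketch of step 4, you can encode the change log and keep only events that precede the effect, sorted by recency with their lag, so long gaps stand out as red flags. The timestamps and event names below are hypothetical stand-ins for the example above:

```python
from datetime import datetime

# Hypothetical change log and effect time (dates invented for illustration).
changes = [
    ("Deployed new payment processor integration", datetime(2024, 11, 12, 9, 0)),
    ("Traffic increased 20%", datetime(2024, 11, 13, 0, 0)),
    ("First error logged", datetime(2024, 11, 14, 16, 30)),
]
effect_time = datetime(2024, 11, 15, 14, 0)  # T-0: conversion rate dropped

# Keep only changes that temporally precede the effect, newest first.
candidates = sorted(
    (c for c in changes if c[1] < effect_time),
    key=lambda c: c[1],
    reverse=True,
)
for name, ts in candidates:
    print(f"{ts:%b %d %H:%M}  (lag {effect_time - ts})  {name}")
```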

Differential Diagnosis

List all conditions/causes that could produce the observed symptoms.

Process:

  1. List all possible causes (be comprehensive)
  2. For each cause, list symptoms it would produce
  3. Compare predicted symptoms to observed symptoms
  4. Eliminate causes that don't match

Example (API returning 500 errors):

| Cause | Predicted Symptoms | Observed? |
|---|---|---|
| Code bug (logic error) | Specific endpoints fail, reproducible | ✓ Yes |
| Resource exhaustion (memory) | All endpoints slow, CPU/memory high | ✗ CPU normal |
| Dependency failure (database) | All database queries fail, connection errors | ✗ DB responsive |
| Configuration error | Service won't start or immediate failure | ✗ Started fine |
| DDoS attack | High traffic, distributed sources | ✗ Traffic normal |

Conclusion: Code bug most likely (matches symptoms, others ruled out).
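The elimination step can be made mechanical with a simple symptom-matching sketch; the symptom keys and predictions below are illustrative encodings of the table above, not a standard schema:

```python
# Observed symptoms for the API 500-error example (True = present).
observed = {
    "specific_endpoints_fail": True,
    "reproducible": True,
    "cpu_memory_high": False,
    "db_connection_errors": False,
    "failed_to_start": False,
    "traffic_spike": False,
}

# Symptoms each candidate cause would predict if it were the culprit.
predictions = {
    "Code bug (logic error)": ["specific_endpoints_fail", "reproducible"],
    "Resource exhaustion (memory)": ["cpu_memory_high"],
    "Dependency failure (database)": ["db_connection_errors"],
    "Configuration error": ["failed_to_start"],
    "DDoS attack": ["traffic_spike"],
}

# Keep causes whose predicted symptoms all match; rule out the rest.
for cause, symptoms in predictions.items():
    matches = all(observed.get(s, False) for s in symptoms)
    print(f"{'KEEP    ' if matches else 'RULE OUT'}  {cause}")
```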


Causal Model Building

Distinguish Cause Types

1. Root Cause

  • Definition: Fundamental underlying issue
  • Test: If fixed, problem doesn't recur
  • Usually: Structural, procedural, or systemic
  • Example: "No code review process for infrastructure changes"

2. Proximate Cause

  • Definition: Immediate trigger
  • Test: Directly precedes effect
  • Often: Symptom of deeper root cause
  • Example: "Deployed buggy config file"

3. Contributing Factor

  • Definition: Makes effect more likely or severe
  • Test: Not sufficient alone, but amplifies
  • Often: Context or conditions
  • Example: "High traffic amplified impact of bug"

4. Coincidence

  • Definition: Happened around same time, no causal relationship
  • Test: No mechanism, doesn't pass counterfactual
  • Example: "Marketing campaign launched same day" (but unrelated to technical issue)

Map Causal Chains

Linear Chain

Root Cause
    ↓ (mechanism)
Intermediate Cause
    ↓ (mechanism)
Proximate Cause
    ↓ (mechanism)
Observed Effect

Example:

No SSL cert renewal process (Root)
    ↓ (no automated reminders)
Cert expires unnoticed (Intermediate)
    ↓ (HTTPS handshake fails)
HTTPS connections fail (Proximate)
    ↓ (users can't connect)
Users can't access site (Effect)

Branching Chain (Multiple Paths)

        Root Cause
           ↙    ↘
    Path A      Path B
       ↓          ↓
    Effect A   Effect B

Example:

    Database migration error (Root)
           ↙              ↘
  Missing indexes    Schema mismatch
       ↓                    ↓
  Slow queries       Type errors

Common Cause (Confounding)

    Confounder (Z)
      ↙    ↘
Variable X   Variable Y

X and Y correlate because Z causes both, but X doesn't cause Y.

Example:

    Hot weather (Confounder)
      ↙              ↘
Ice cream sales   Drownings

Ice cream doesn't cause drownings; both are caused by hot weather (summer season).
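A quick simulation makes the pattern concrete: if a confounder drives both variables, they correlate strongly even though neither causes the other. The coefficients and noise levels below are invented purely for illustration:

```python
import random

random.seed(0)
temps = [random.uniform(10, 35) for _ in range(365)]        # confounder Z (temperature)
ice_cream = [5 * t + random.gauss(0, 20) for t in temps]    # X = f(Z) + noise
drownings = [0.3 * t + random.gauss(0, 2) for t in temps]   # Y = g(Z) + noise

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Strong positive correlation with no causal arrow between X and Y;
# comparing only days with similar temperatures makes it largely disappear.
print(f"corr(ice_cream, drownings) = {pearson(ice_cream, drownings):.2f}")
```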


Evidence Assessment Methods

Essential Tests

1. Temporal Sequence (Must Pass)

Rule: Cause MUST precede effect. If effect happened before "cause," it's not causal.

How to verify:

  • Create detailed timeline
  • Mark exact timestamps of cause and effect
  • Calculate lag (effect should follow cause)

Example:

  • Migration deployed: March 10, 2:00 PM
  • Queries slowed: March 10, 2:15 PM
  • ✓ Cause precedes effect by 15 minutes
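A minimal sketch of this check, using the migration example (the year is assumed for illustration):

```python
from datetime import datetime

cause_time = datetime(2024, 3, 10, 14, 0)    # migration deployed
effect_time = datetime(2024, 3, 10, 14, 15)  # queries slowed

# The candidate cause must strictly precede the effect; record the lag.
if cause_time >= effect_time:
    print("FAIL: effect precedes (or coincides with) the candidate cause -> not causal")
else:
    print(f"PASS: cause precedes effect by {effect_time - cause_time}")
```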

2. Counterfactual (Strong Evidence)

Question: "What would have happened without the cause?"

How to test:

  • Control group: Compare cases with cause vs without
  • Rollback: Remove cause, see if effect disappears
  • Baseline comparison: Compare to period before cause
  • A/B test: Random assignment of cause

Example:

  • Hypothesis: New feature X causes conversion drop
  • Test: A/B test with X enabled vs disabled
  • Result: No conversion difference → X not the cause
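One way to read such an A/B result is a two-proportion z-test on conversion with the suspected cause enabled vs disabled; the counts below are made up, and the helper is a sketch rather than a library function:

```python
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p_a, p_b, z, p_value

# Feature X enabled (A) vs disabled (B); counts are illustrative.
p_a, p_b, z, p = two_proportion_z(conv_a=480, n_a=10_000, conv_b=495, n_b=10_000)
print(f"A: {p_a:.2%}  B: {p_b:.2%}  z={z:.2f}  p={p:.3f}")
# A non-significant difference supports "X is not the cause of the drop".
```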

3. Mechanism (Plausibility Test)

Question: Can you explain HOW cause produces effect?

Requirements:

  • Pathway from cause to effect
  • Each step makes sense
  • Supported by theory or evidence

Example - Strong:

  • Cause: Increased checkout latency (5 sec)
  • Mechanism: Users abandon slow pages
  • Evidence: Industry data shows 7+ sec → 20% abandon
  • ✓ Clear, plausible mechanism

Example - Weak:

  • Cause: Moon phase (full moon)
  • Mechanism: ??? (no plausible pathway)
  • Effect: Website traffic increase
  • ✗ No mechanism, likely spurious correlation

Evidence Hierarchy

Strongest Evidence (Gold Standard):

  1. Randomized Controlled Trial (RCT)
    • Random assignment eliminates confounding
    • Compare treatment vs control group
    • Example: A/B test with random user assignment

Strong Evidence:

  1. Quasi-Experiment / Natural Experiment
    • Some random-like assignment (not perfect)
    • Example: Policy implemented in one region, not another
    • Control for confounders statistically

Medium Evidence:

  1. Longitudinal Study (Before/After)

    • Track same subjects over time
    • Controls for individual differences
    • Example: Metric before and after change
  2. Well-Controlled Observational Study

    • Statistical controls for confounders
    • Multiple variables measured
    • Example: Regression analysis with covariates

Weak Evidence:

  1. Cross-Sectional Correlation

    • Single point in time
    • Can't establish temporal order
    • Example: Survey at one moment
  2. Anecdote / Single Case

    • May not generalize
    • Many confounders
    • Example: One user complaint

Recommendation: For high-stakes decisions, aim for quasi-experiment or better.


Bradford Hill Criteria

Nine factors that strengthen causal claims. Score each criterion on a 1-3 scale.

1. Strength

Question: How strong is the association?

  • 3: Very strong correlation (r > 0.7, OR > 10)
  • 2: Moderate correlation (r = 0.3-0.7, OR = 2-10)
  • 1: Weak correlation (r < 0.3, OR < 2)

Example: 10x latency increase = strong association (3/3)
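For a 2x2 exposure/outcome table, the odds ratio (OR) referenced in the rubric can be computed directly; the counts below are illustrative:

```python
# 2x2 table: rows = exposed/unexposed to the candidate cause,
# columns = outcome present/absent. Counts are made up.
a, b = 90, 10   # exposed:   outcome yes / no
c, d = 30, 70   # unexposed: outcome yes / no

odds_ratio = (a / b) / (c / d)
print(f"OR = {odds_ratio:.1f}")   # 21.0 -> falls in the "very strong" band (OR > 10)
```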

2. Consistency

Question: Does relationship replicate across contexts?

  • 3: Always consistent (multiple studies, settings)
  • 2: Mostly consistent (some exceptions)
  • 1: Rarely consistent (contradictory results)

Example: Latency affects conversion on all pages, devices, countries = consistent (3/3)

3. Specificity

Question: Is cause-effect relationship specific?

  • 3: Specific cause → specific effect
  • 2: Somewhat specific (some other effects)
  • 1: General/non-specific (cause has many effects)

Example: Missing index affects only queries on that table = specific (3/3)

4. Temporality

Question: Does cause precede effect?

  • 3: Always (clear temporal sequence)
  • 2: Usually (mostly precedes)
  • 1: Unclear or reverse causation possible

Example: Migration (2:00 PM) before slow queries (2:15 PM) = always (3/3)

5. Dose-Response

Question: Does more cause → more effect?

  • 3: Clear gradient (linear or monotonic)
  • 2: Some gradient (mostly true)
  • 1: No gradient (flat or random)

Example: Larger tables → slower queries = clear gradient (3/3)
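A lightweight way to check for a gradient is to bucket the "dose" and confirm the effect moves monotonically across buckets; the bucket sizes and timings below are made up:

```python
# Average query time per table-size bucket (illustrative numbers).
buckets = {
    "10k rows": 12,     # ms
    "100k rows": 45,
    "1M rows": 310,
    "10M rows": 2800,
}

values = list(buckets.values())
monotonic = all(a <= b for a, b in zip(values, values[1:]))
print("Clear gradient (3/3)" if monotonic else "No clear gradient (1/3)")
```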

6. Plausibility

Question: Does mechanism make sense?

  • 3: Very plausible (well-established theory)
  • 2: Somewhat plausible (reasonable)
  • 1: Implausible (contradicts theory)

Example: Index scans faster than table scans = well-established (3/3)

7. Coherence

Question: Fits with existing knowledge?

  • 3: Fits well (no contradictions)
  • 2: Somewhat fits (minor contradictions)
  • 1: Contradicts existing knowledge

Example: Aligns with database optimization theory = fits well (3/3)

8. Experiment

Question: Does intervention change outcome?

  • 3: Strong experimental evidence (RCT)
  • 2: Some experimental evidence (quasi)
  • 1: No experimental evidence (observational only)

Example: Rollback restored performance = strong experiment (3/3)

9. Analogy

Question: Similar cause-effect patterns exist?

  • 3: Strong analogies (many similar cases)
  • 2: Some analogies (a few similar)
  • 1: No analogies (unique case)

Example: Similar patterns in other databases = some analogies (2/3)

Scoring:

  • Total 18-27: Strong causal evidence
  • Total 12-17: Moderate evidence
  • Total < 12: Weak evidence (need more investigation)
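A small tally sketch that applies this scoring; the per-criterion scores shown are the ones from the worked example later in this document:

```python
scores = {
    "strength": 3, "consistency": 3, "specificity": 2,
    "temporality": 3, "dose_response": 3, "plausibility": 3,
    "coherence": 3, "experiment": 3, "analogy": 2,
}
total = sum(scores.values())

# Map the total onto the bands above.
if total >= 18:
    verdict = "Strong causal evidence"
elif total >= 12:
    verdict = "Moderate evidence"
else:
    verdict = "Weak evidence (need more investigation)"

print(f"Bradford Hill total: {total}/27 -> {verdict}")
```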

Worked Example: Website Conversion Drop

Problem

Website conversion rate dropped from 5% to 3% (40% relative drop) starting November 15th.


1. Effect Definition

Effect: Conversion rate dropped 40% (5% → 3%)

Quantification:

  • Baseline: 5% (stable for 6 months prior)
  • Current: 3% (observed for 2 weeks)
  • Absolute drop: 2 percentage points
  • Relative drop: 40%
  • Impact: ~$40k/day (~$280k/week) revenue loss

Timeline:

  • Last normal day: November 14th (5.1% conversion)
  • First drop observed: November 15th (3.2% conversion)
  • Ongoing: Yes (still at 3% as of November 29th)

Impact:

  • 10,000 daily visitors
  • 500 conversions → 300 conversions
  • $200 average order value
  • Loss: 200 conversions × $200 = $40k/day
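The impact arithmetic above as a runnable sketch (all numbers come from the example):

```python
daily_visitors = 10_000
baseline_rate, current_rate = 0.05, 0.03
avg_order_value = 200

lost_conversions_per_day = daily_visitors * (baseline_rate - current_rate)  # 200
daily_loss = lost_conversions_per_day * avg_order_value                     # $40,000
print(f"Lost conversions/day: {lost_conversions_per_day:.0f}")
print(f"Revenue loss: ${daily_loss:,.0f}/day (~${daily_loss * 7:,.0f}/week)")
```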

2. Competing Hypotheses (Using Multiple Techniques)

Using 5 Whys:

Effect: Conversion dropped 40%
Why? → Users abandoning checkout
Why? → Checkout takes too long
Why? → Payment processor is slow
Why? → New processor has higher latency
Why? → Switched to cheaper processor (ROOT)

Using Fishbone Diagram:

Technology:

  • New payment processor (Nov 10)
  • New checkout UI (Nov 14)

Process:

  • Lack of A/B testing
  • No performance monitoring

Environment:

  • Seasonal traffic shift (holiday season)

People:

  • User behavior changes?

Using Timeline Analysis:

Nov 10: Switched to new payment processor
Nov 14: Deployed new checkout UI
Nov 15: Conversion drop observed (2:00 AM)

Competing Hypotheses Generated:

Hypothesis 1: New checkout UI (deployed Nov 14)

  • Type: Proximate cause
  • Evidence for: Timing matches (Nov 14 → Nov 15)
  • Evidence against: A/B test showed no difference; rollback didn't fix the drop

Hypothesis 2: Payment processor switch (Nov 10)

  • Type: Root cause candidate
  • Evidence for: New processor slower (400ms vs 100ms); Timing precedes
  • Evidence against: 5-day lag (why not immediate?)

Hypothesis 3: Payment latency increase (Nov 15)

  • Type: Proximate cause/symptom
  • Evidence for: Logs show 5-8 sec checkout (was 2-3 sec); User complaints
  • Evidence against: None (strong evidence)

Hypothesis 4: Seasonal traffic shift

  • Type: Confounder
  • Evidence for: Holiday season
  • Evidence against: Traffic composition unchanged

3. Causal Model

Causal Chain:

ROOT: Switched to cheaper payment processor (Nov 10)
    ↓ (mechanism: lower throughput processor)
INTERMEDIATE: Payment API latency spikes under load
    ↓ (mechanism: slow API → page delays)
PROXIMATE: Checkout takes 5-8 seconds (Nov 15+)
    ↓ (mechanism: users abandon slow pages)
EFFECT: Conversion drops 40%

Why Nov 15 lag?

  • Nov 10-14: Weekday traffic (low, processor handled it)
  • Nov 15: Weekend traffic spike (2x normal)
  • Processor couldn't handle weekend load → latency spike

Confounders Ruled Out:

  • UI change: Rollback had no effect
  • Seasonality: Traffic patterns unchanged
  • Marketing: No changes

4. Evidence Assessment

Temporal Sequence: ✓ PASS (3/3)

  • Processor switch (Nov 10) → Latency (Nov 15) → Drop (Nov 15)
  • Clear precedence

Counterfactual: ✓ PASS (3/3)

  • Test: Switched back to old processor
  • Result: Conversion recovered to 4.8%
  • Strong evidence

Mechanism: ✓ PASS (3/3)

  • Slow processor (400ms) → High load → 5-8 sec checkout → User abandonment
  • Industry data: 7+ sec = 20% abandon
  • Clear, plausible mechanism

Dose-Response: ✓ PASS (3/3)

| Checkout Time | Conversion |
|---|---|
| <3 sec | 5% |
| 3-5 sec | 4% |
| 5-7 sec | 3% |
| >7 sec | 2% |

Clear gradient

Consistency: ✓ PASS (3/3)

  • All devices (mobile, desktop, tablet)
  • All payment methods
  • All countries
  • Consistent pattern

Bradford Hill Score: 25/27 (Very Strong)

  1. Strength: 3 (40% drop)
  2. Consistency: 3 (all segments)
  3. Specificity: 2 (latency affects other things)
  4. Temporality: 3 (clear sequence)
  5. Dose-response: 3 (gradient)
  6. Plausibility: 3 (well-known)
  7. Coherence: 3 (fits theory)
  8. Experiment: 3 (rollback test)
  9. Analogy: 2 (some similar cases)

5. Conclusion

Root Cause: Switched to cheaper payment processor with insufficient throughput for weekend traffic.

Confidence: High (95%+)

Why high confidence:

  • Perfect temporal sequence
  • Strong counterfactual (rollback restored)
  • Clear mechanism
  • Dose-response present
  • Consistent across contexts
  • No confounders
  • Bradford Hill score 25/27

Why root (not symptom):

  • Processor switch is decision that created problem
  • Latency is symptom of this decision
  • Fixing latency alone treats symptom
  • Reverting switch eliminates problem

6. Recommendations

Immediate:

  1. ✓ Reverted to old processor (Nov 28)
  2. Negotiate better rates with old processor

If keeping new processor:

  1. Add caching layer (reduce latency <2 sec)
  2. Implement async checkout
  3. Load test at 3x peak

Validation:

  • A/B test: Old processor vs new+caching
  • Monitor: Latency p95, conversion rate
  • Success: Conversion >4.5%, latency <2 sec

7. Limitations

Data limitations:

  • No abandonment reason tracking
  • Processor metrics limited (black box)

Analysis limitations:

  • Didn't test all latency optimizations
  • Small sample for some device types

Generalizability:

  • Specific to our checkout flow
  • May not apply to other traffic patterns

Quality Checklist

Before finalizing root cause analysis, verify:

Effect Definition:

  • Effect clearly described (not vague)
  • Quantified with magnitude
  • Timeline established (when started, duration)
  • Baseline for comparison stated

Hypothesis Generation:

  • Multiple hypotheses generated (3+ competing)
  • Used systematic techniques (5 Whys, Fishbone, Timeline, Differential)
  • Proximate vs root distinguished
  • Confounders identified

Causal Model:

  • Causal chain mapped (root → proximate → effect)
  • Mechanisms explained for each link
  • Confounding relationships noted
  • Direct vs indirect causation clear

Evidence Assessment:

  • Temporal sequence verified (cause before effect)
  • Counterfactual tested (what if cause absent)
  • Mechanism explained with supporting evidence
  • Dose-response checked (more cause → more effect)
  • Consistency assessed (holds across contexts)
  • Confounders ruled out or controlled

Conclusion:

  • Root cause clearly stated
  • Confidence level stated with justification
  • Distinguished from proximate causes/symptoms
  • Alternative explanations acknowledged
  • Limitations noted

Recommendations:

  • Address root cause (not just symptoms)
  • Actionable interventions proposed
  • Validation tests specified
  • Success metrics defined

Minimum Standards:

  • For medium-stakes (postmortems, feature issues): Score ≥ 3.5 on rubric
  • For high-stakes (infrastructure, safety): Score ≥ 4.0 on rubric
  • Red flags: <3 on Temporal Sequence, Counterfactual, or Mechanism = weak causal claim