Causal Inference & Root Cause Analysis Methodology
Root Cause Analysis Workflow
Copy this checklist and track your progress:
Root Cause Analysis Progress:
- [ ] Step 1: Generate hypotheses using structured techniques
- [ ] Step 2: Build causal model and distinguish cause types
- [ ] Step 3: Gather evidence and assess temporal sequence
- [ ] Step 4: Test causality with Bradford Hill criteria
- [ ] Step 5: Verify root cause and check coherence
Step 1: Generate hypotheses using structured techniques
Use 5 Whys, Fishbone diagrams, timeline analysis, or differential diagnosis to systematically generate potential causes. See Hypothesis Generation Techniques for detailed methods.
Step 2: Build causal model and distinguish cause types
Map causal chains to distinguish root causes from proximate causes, symptoms, and confounders. See Causal Model Building for cause type definitions and chain mapping.
Step 3: Gather evidence and assess temporal sequence
Collect evidence for each hypothesis and verify temporal relationships (cause must precede effect). See Evidence Assessment Methods for essential tests and evidence hierarchy.
Step 4: Test causality with Bradford Hill criteria
Score hypotheses using the 9 Bradford Hill criteria (strength, consistency, specificity, temporality, dose-response, plausibility, coherence, experiment, analogy). See Bradford Hill Criteria for scoring rubric.
Step 5: Verify root cause and check coherence
Ensure the identified root cause has strong evidence, fits with known facts, and addresses systemic issues. See Quality Checklist and Worked Example for validation techniques.
Hypothesis Generation Techniques
5 Whys (Trace Back to Root)
Start with the effect and ask "why" repeatedly until you reach the root cause.
Process:
Effect: {Observed problem}
Why? → {Proximate cause 1}
Why? → {Proximate cause 2}
Why? → {Intermediate cause}
Why? → {Deeper cause}
Why? → {Root cause}
Example:
Effect: Server crashed
Why? → Out of memory
Why? → Memory leak in new code
Why? → Connection pool not releasing connections
Why? → Error handling missing in new feature
Why? → Code review process didn't catch this edge case (ROOT CAUSE)
When to stop: When you reach a cause that, if fixed, would prevent recurrence.
Pitfalls to avoid:
- Stopping too early at proximate causes
- Following only one path (should explore multiple branches)
- Accepting vague answers ("because it broke")
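To avoid the single-path pitfall, the why-chain can be kept as data so multiple branches stay visible. Below is a minimal Python sketch, assuming causes are recorded as a parent-to-children mapping; every string is the hypothetical server-crash example above, not a prescribed schema.
```python
# Minimal sketch of a 5 Whys trace kept as data so branches aren't lost.
# All effect/cause strings are hypothetical illustrations.

why_tree = {
    "Server crashed": ["Out of memory"],
    "Out of memory": ["Memory leak in new code"],
    "Memory leak in new code": ["Connection pool not releasing connections"],
    "Connection pool not releasing connections": ["Error handling missing in new feature"],
    "Error handling missing in new feature": ["Code review process didn't catch this edge case"],
}

def trace_whys(effect: str, tree: dict, depth: int = 0) -> None:
    """Print every branch from the observed effect down to its deepest causes."""
    prefix = "Effect: " if depth == 0 else "Why? -> "
    print("  " * depth + prefix + effect)
    for cause in tree.get(effect, []):
        trace_whys(cause, tree, depth + 1)

trace_whys("Server crashed", why_tree)
```
Because each node can have several children, a second candidate branch (e.g. an infrastructure cause) can be added without losing the first.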
Fishbone Diagram (Categorize Causes)
Organize potential causes by category to ensure comprehensive coverage.
Categories (6 Ms):
- Methods, Machines, Materials (upper branches)
- Manpower, Measurement, Environment (lower branches)
Each branch feeds into the central spine, which points to the Effect.
Adapted for software/product:
- People: Skills, knowledge, communication, errors
- Process: Procedures, workflows, standards, review
- Technology: Tools, systems, infrastructure, dependencies
- Environment: External factors, timing, context
Example (API outage):
- People: On-call engineer inexperienced with database
- Process: No rollback procedure documented
- Technology: Database connection pooling misconfigured
- Environment: Traffic spike coincided with deploy
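The fishbone structure maps naturally onto a small dictionary of candidate causes per category, which also makes coverage gaps visible. A minimal sketch, populated with the hypothetical API-outage entries above:
```python
# Minimal sketch: candidate causes grouped by fishbone category.
# Empty categories are flagged so coverage gaps stand out.

fishbone = {
    "People": ["On-call engineer inexperienced with database"],
    "Process": ["No rollback procedure documented"],
    "Technology": ["Database connection pooling misconfigured"],
    "Environment": ["Traffic spike coincided with deploy"],
}

for category, causes in fishbone.items():
    status = "; ".join(causes) if causes else "(no candidates yet -- revisit this category)"
    print(f"{category}: {status}")
```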
Timeline Analysis (What Changed?)
Map events leading up to effect to identify temporal correlations.
Process:
- Establish baseline period (normal operation)
- Mark when effect first observed
- List all changes in days/hours leading up to effect
- Identify changes that temporally precede effect
Example:
T-7 days: Normal operation (baseline)
T-3 days: Deployed new payment processor integration
T-2 days: Traffic increased 20%
T-1 day: First error logged (dismissed as fluke)
T-0 (14:00): Conversion rate dropped 30%
T+1 hour: Alert fired, investigation started
Look for: Changes, anomalies, or events right before effect (temporal precedence).
Red flags:
- Multiple changes at once (hard to isolate cause)
- Long lag between change and effect (harder to connect)
- No changes before effect (may be external factor or accumulated technical debt)
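A timeline like this can be screened programmatically: record every event with a timestamp and kind, then list only the changes and anomalies that precede the first observation of the effect. The sketch below uses hypothetical absolute dates in place of the relative T-offsets above.
```python
# Minimal sketch of timeline screening: which changes precede the effect?
# Dates and event descriptions are hypothetical illustrations.
from datetime import datetime

events = [
    (datetime(2024, 3, 3, 9, 0),   "Normal operation (baseline)",                 "baseline"),
    (datetime(2024, 3, 7, 16, 0),  "Deployed new payment processor integration",  "change"),
    (datetime(2024, 3, 8, 11, 0),  "Traffic increased 20%",                       "change"),
    (datetime(2024, 3, 9, 10, 0),  "First error logged",                          "anomaly"),
    (datetime(2024, 3, 10, 14, 0), "Conversion rate dropped 30%",                 "effect"),
]

effect_time = next(ts for ts, _, kind in events if kind == "effect")
candidates = [(ts, desc) for ts, desc, kind in events
              if kind in ("change", "anomaly") and ts < effect_time]

for ts, desc in candidates:
    print(f"{desc}: {effect_time - ts} before the effect")
```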
Differential Diagnosis
List all conditions/causes that could produce the observed symptoms.
Process:
- List all possible causes (be comprehensive)
- For each cause, list symptoms it would produce
- Compare predicted symptoms to observed symptoms
- Eliminate causes that don't match
Example (API returning 500 errors):
| Cause | Predicted Symptoms | Observed? |
|---|---|---|
| Code bug (logic error) | Specific endpoints fail, reproducible | ✓ Yes |
| Resource exhaustion (memory) | All endpoints slow, CPU/memory high | ✗ CPU normal |
| Dependency failure (database) | All database queries fail, connection errors | ✗ DB responsive |
| Configuration error | Service won't start or immediate failure | ✗ Started fine |
| DDoS attack | High traffic, distributed sources | ✗ Traffic normal |
Conclusion: Code bug most likely (matches symptoms, others ruled out).
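The same elimination step can be expressed as a set comparison: a hypothesis survives only if every symptom it predicts was actually observed. A minimal sketch, using the hypothetical API-500 symptoms from the table above:
```python
# Minimal sketch of differential diagnosis: compare predicted vs observed symptoms.
# A cause is kept only if every symptom it predicts was actually observed.
# Symptom strings are hypothetical illustrations.

observed = {"specific endpoints fail", "reproducible"}

predictions = {
    "Code bug (logic error)":        {"specific endpoints fail", "reproducible"},
    "Resource exhaustion (memory)":  {"all endpoints slow", "cpu/memory high"},
    "Dependency failure (database)": {"all database queries fail", "connection errors"},
    "Configuration error":           {"service won't start"},
    "DDoS attack":                   {"high traffic", "distributed sources"},
}

for cause, predicted in predictions.items():
    matched = predicted & observed
    verdict = "keep" if matched == predicted else "eliminate"
    print(f"{cause}: matched {len(matched)}/{len(predicted)} predicted symptoms -> {verdict}")
```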
Causal Model Building
Distinguish Cause Types
1. Root Cause
- Definition: Fundamental underlying issue
- Test: If fixed, problem doesn't recur
- Usually: Structural, procedural, or systemic
- Example: "No code review process for infrastructure changes"
2. Proximate Cause
- Definition: Immediate trigger
- Test: Directly precedes effect
- Often: Symptom of deeper root cause
- Example: "Deployed buggy config file"
3. Contributing Factor
- Definition: Makes effect more likely or severe
- Test: Not sufficient alone, but amplifies
- Often: Context or conditions
- Example: "High traffic amplified impact of bug"
4. Coincidence
- Definition: Happened around same time, no causal relationship
- Test: No mechanism, doesn't pass counterfactual
- Example: "Marketing campaign launched same day" (but unrelated to technical issue)
Map Causal Chains
Linear Chain
Root Cause
↓ (mechanism)
Intermediate Cause
↓ (mechanism)
Proximate Cause
↓ (mechanism)
Observed Effect
Example:
No SSL cert renewal process (Root)
↓ (no automated reminders)
Cert expires unnoticed (Intermediate)
↓ (HTTPS handshake fails)
HTTPS connections fail (Proximate)
↓ (users can't connect)
Users can't access site (Effect)
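A linear chain is easy to keep as ordered links of (cause, mechanism, effect), which forces a mechanism to be stated for every step. A minimal sketch, assuming that representation and reusing the SSL-cert example above:
```python
# Minimal sketch: a linear causal chain as ordered (cause, mechanism, effect) links.
# Each link's effect is the next link's cause; strings are the example above.

chain = [
    ("No SSL cert renewal process (root)", "no automated reminders", "Cert expires unnoticed"),
    ("Cert expires unnoticed", "HTTPS handshake fails", "HTTPS connections fail"),
    ("HTTPS connections fail", "users can't connect", "Users can't access site (effect)"),
]

for cause, mechanism, _ in chain:
    print(cause)
    print(f"  | ({mechanism})")
print(chain[-1][-1])  # final observed effect
```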
Branching Chain (Multiple Paths)
Root Cause
↙ ↘
Path A Path B
↓ ↓
Effect A Effect B
Example:
Database migration error (Root)
↙ ↘
Missing indexes Schema mismatch
↓ ↓
Slow queries Type errors
Common Cause (Confounding)
Confounder (Z)
↙ ↘
Variable X Variable Y
X and Y correlate because Z causes both, but X doesn't cause Y.
Example:
Hot weather (Confounder)
↙ ↘
Ice cream sales Drownings
Ice cream doesn't cause drownings; both are caused by hot weather/summer season.
Evidence Assessment Methods
Essential Tests
1. Temporal Sequence (Must Pass)
Rule: Cause MUST precede effect. If effect happened before "cause," it's not causal.
How to verify:
- Create detailed timeline
- Mark exact timestamps of cause and effect
- Calculate lag (effect should follow cause)
Example:
- Migration deployed: March 10, 2:00 PM
- Queries slowed: March 10, 2:15 PM
- ✓ Cause precedes effect by 15 minutes
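The check reduces to comparing two timestamps and failing fast when the lag is not positive. A minimal sketch, with a hypothetical year filled in for the March 10 times above:
```python
# Minimal sketch of the temporal-sequence test: cause must precede effect.
# Timestamps are hypothetical illustrations of the migration example.
from datetime import datetime

cause_time = datetime(2024, 3, 10, 14, 0)    # migration deployed
effect_time = datetime(2024, 3, 10, 14, 15)  # queries slowed

lag = effect_time - cause_time
if lag.total_seconds() <= 0:
    print("FAIL: effect precedes (or coincides with) the supposed cause -- not causal")
else:
    print(f"PASS: cause precedes effect by {lag}")
```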
2. Counterfactual (Strong Evidence)
Question: "What would have happened without the cause?"
How to test:
- Control group: Compare cases with cause vs without
- Rollback: Remove cause, see if effect disappears
- Baseline comparison: Compare to period before cause
- A/B test: Random assignment of cause
Example:
- Hypothesis: New feature X causes conversion drop
- Test: A/B test with X enabled vs disabled
- Result: No conversion difference → X not the cause
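When the counterfactual is run as an A/B test, the comparison is two conversion proportions. A minimal sketch, assuming hypothetical counts and using a two-proportion z-test as a rough significance check:
```python
# Minimal sketch of a counterfactual A/B comparison (feature X on vs off).
# Counts are hypothetical; |z| < 1.96 suggests no significant difference.
from math import sqrt

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

z = two_proportion_z(conv_a=480, n_a=10_000, conv_b=505, n_b=10_000)
print(f"z = {z:.2f}")  # here |z| < 1.96 -> no detectable effect -> X likely not the cause
```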
3. Mechanism (Plausibility Test)
Question: Can you explain HOW cause produces effect?
Requirements:
- Pathway from cause to effect
- Each step makes sense
- Supported by theory or evidence
Example - Strong:
- Cause: Increased checkout latency (5 sec)
- Mechanism: Users abandon slow pages
- Evidence: Industry data shows 7+ sec → 20% abandon
- ✓ Clear, plausible mechanism
Example - Weak:
- Cause: Moon phase (full moon)
- Mechanism: ??? (no plausible pathway)
- Effect: Website traffic increase
- ✗ No mechanism, likely spurious correlation
Evidence Hierarchy
Strongest Evidence (Gold Standard):
- Randomized Controlled Trial (RCT)
- Random assignment eliminates confounding
- Compare treatment vs control group
- Example: A/B test with random user assignment
Strong Evidence:
- Quasi-Experiment / Natural Experiment
- Some random-like assignment (not perfect)
- Example: Policy implemented in one region, not another
- Control for confounders statistically
Medium Evidence:
- Longitudinal Study (Before/After)
  - Track same subjects over time
  - Controls for individual differences
  - Example: Metric before and after change
- Well-Controlled Observational Study
  - Statistical controls for confounders
  - Multiple variables measured
  - Example: Regression analysis with covariates
Weak Evidence:
- Cross-Sectional Correlation
  - Single point in time
  - Can't establish temporal order
  - Example: Survey at one moment
- Anecdote / Single Case
  - May not generalize
  - Many confounders
  - Example: One user complaint
Recommendation: For high-stakes decisions, aim for quasi-experiment or better.
Bradford Hill Criteria
Nine factors that strengthen causal claims. Score each on a 1-3 scale.
1. Strength
Question: How strong is the association?
- 3: Very strong correlation (r > 0.7, OR > 10)
- 2: Moderate correlation (r = 0.3-0.7, OR = 2-10)
- 1: Weak correlation (r < 0.3, OR < 2)
Example: 10x latency increase = strong association (3/3)
2. Consistency
Question: Does relationship replicate across contexts?
- 3: Always consistent (multiple studies, settings)
- 2: Mostly consistent (some exceptions)
- 1: Rarely consistent (contradictory results)
Example: Latency affects conversion on all pages, devices, countries = consistent (3/3)
3. Specificity
Question: Is cause-effect relationship specific?
- 3: Specific cause → specific effect
- 2: Somewhat specific (some other effects)
- 1: General/non-specific (cause has many effects)
Example: Missing index affects only queries on that table = specific (3/3)
4. Temporality
Question: Does cause precede effect?
- 3: Always (clear temporal sequence)
- 2: Usually (mostly precedes)
- 1: Unclear or reverse causation possible
Example: Migration (2:00 PM) before slow queries (2:15 PM) = always (3/3)
5. Dose-Response
Question: Does more cause → more effect?
- 3: Clear gradient (linear or monotonic)
- 2: Some gradient (mostly true)
- 1: No gradient (flat or random)
Example: Larger tables → slower queries = clear gradient (3/3)
6. Plausibility
Question: Does mechanism make sense?
- 3: Very plausible (well-established theory)
- 2: Somewhat plausible (reasonable)
- 1: Implausible (contradicts theory)
Example: Index scans faster than table scans = well-established (3/3)
7. Coherence
Question: Fits with existing knowledge?
- 3: Fits well (no contradictions)
- 2: Somewhat fits (minor contradictions)
- 1: Contradicts existing knowledge
Example: Aligns with database optimization theory = fits well (3/3)
8. Experiment
Question: Does intervention change outcome?
- 3: Strong experimental evidence (RCT)
- 2: Some experimental evidence (quasi)
- 1: No experimental evidence (observational only)
Example: Rollback restored performance = strong experiment (3/3)
9. Analogy
Question: Similar cause-effect patterns exist?
- 3: Strong analogies (many similar cases)
- 2: Some analogies (a few similar)
- 1: No analogies (unique case)
Example: Similar patterns in other databases = some analogies (2/3)
Scoring:
- Total 18-27: Strong causal evidence
- Total 12-17: Moderate evidence
- Total < 12: Weak evidence (need more investigation)
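The scoring is a simple sum over the nine criteria plus a threshold lookup. A minimal sketch, pre-filled with the scores from the worked example later in this document:
```python
# Minimal sketch of Bradford Hill scoring: sum nine 1-3 scores and classify.
# The scores shown are from the worked example below.

scores = {
    "strength": 3, "consistency": 3, "specificity": 2,
    "temporality": 3, "dose_response": 3, "plausibility": 3,
    "coherence": 3, "experiment": 3, "analogy": 2,
}

total = sum(scores.values())
if total >= 18:
    verdict = "Strong causal evidence"
elif total >= 12:
    verdict = "Moderate evidence"
else:
    verdict = "Weak evidence (need more investigation)"

print(f"Bradford Hill total: {total}/27 -> {verdict}")
```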
Worked Example: Website Conversion Drop
Problem
Website conversion rate dropped from 5% to 3% (40% relative drop) starting November 15th.
1. Effect Definition
Effect: Conversion rate dropped 40% (5% → 3%)
Quantification:
- Baseline: 5% (stable for 6 months prior)
- Current: 3% (observed for 2 weeks)
- Absolute drop: 2 percentage points
- Relative drop: 40%
- Impact: ~$40k/day (~$280k/week) revenue loss
Timeline:
- Last normal day: November 14th (5.1% conversion)
- First drop observed: November 15th (3.2% conversion)
- Ongoing: Yes (still at 3% as of November 29th)
Impact:
- 10,000 daily visitors
- 500 conversions → 300 conversions
- $200 average order value
- Loss: 200 conversions × $200 = $40k/day
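The impact arithmetic above is worth keeping reproducible; a minimal sketch using only the figures stated in this example:
```python
# Minimal sketch of the impact arithmetic; all figures are from the worked example.
daily_visitors = 10_000
baseline_rate, current_rate = 0.05, 0.03
avg_order_value = 200

lost_conversions = daily_visitors * (baseline_rate - current_rate)  # ~200 per day
daily_loss = lost_conversions * avg_order_value                     # ~$40,000 per day
print(f"Lost conversions/day: {lost_conversions:.0f}")
print(f"Revenue loss: ${daily_loss:,.0f}/day (~${daily_loss * 7:,.0f}/week)")
```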
2. Competing Hypotheses (Using Multiple Techniques)
Using 5 Whys:
Effect: Conversion dropped 40%
Why? → Users abandoning checkout
Why? → Checkout takes too long
Why? → Payment processor is slow
Why? → New processor has higher latency
Why? → Switched to cheaper processor (ROOT)
Using Fishbone Diagram:
Technology:
- New payment processor (Nov 10)
- New checkout UI (Nov 14)
Process:
- Lack of A/B testing
- No performance monitoring
Environment:
- Seasonal traffic shift (holiday season)
People:
- User behavior changes?
Using Timeline Analysis:
Nov 10: Switched to new payment processor
Nov 14: Deployed new checkout UI
Nov 15: Conversion drop observed (2:00 AM)
Competing Hypotheses Generated:
Hypothesis 1: New checkout UI (deployed Nov 14)
- Type: Proximate cause
- Evidence for: Timing matches (Nov 14 → Nov 15)
- Evidence against: A/B test showed no difference; Rollback didn't fix
Hypothesis 2: Payment processor switch (Nov 10)
- Type: Root cause candidate
- Evidence for: New processor slower (400ms vs 100ms); Timing precedes
- Evidence against: 5-day lag (why not immediate?)
Hypothesis 3: Payment latency increase (Nov 15)
- Type: Proximate cause/symptom
- Evidence for: Logs show 5-8 sec checkout (was 2-3 sec); User complaints
- Evidence against: None (strong evidence)
Hypothesis 4: Seasonal traffic shift
- Type: Confounder
- Evidence for: Holiday season
- Evidence against: Traffic composition unchanged
3. Causal Model
Causal Chain:
ROOT: Switched to cheaper payment processor (Nov 10)
↓ (mechanism: lower throughput processor)
INTERMEDIATE: Payment API latency under load
↓ (mechanism: slow API → page delays)
PROXIMATE: Checkout takes 5-8 seconds (Nov 15+)
↓ (mechanism: users abandon slow pages)
EFFECT: Conversion drops 40%
Why Nov 15 lag?
- Nov 10-14: Weekday traffic (low, processor handled it)
- Nov 15: Weekend traffic spike (2x normal)
- Processor couldn't handle weekend load → latency spike
Confounders Ruled Out:
- UI change: Rollback had no effect
- Seasonality: Traffic patterns unchanged
- Marketing: No changes
4. Evidence Assessment
Temporal Sequence: ✓ PASS (3/3)
- Processor switch (Nov 10) → Latency (Nov 15) → Drop (Nov 15)
- Clear precedence
Counterfactual: ✓ PASS (3/3)
- Test: Switched back to old processor
- Result: Conversion recovered to 4.8%
- Strong evidence
Mechanism: ✓ PASS (3/3)
- Slow processor (400ms) → High load → 5-8 sec checkout → User abandonment
- Industry data: 7+ sec = 20% abandon
- Clear, plausible mechanism
Dose-Response: ✓ PASS (3/3)
| Checkout Time | Conversion |
|---|---|
| <3 sec | 5% |
| 3-5 sec | 4% |
| 5-7 sec | 3% |
| >7 sec | 2% |
Clear gradient
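The gradient check itself is just a monotonicity test over the bucketed rates. A minimal sketch, assuming the figures from the table above:
```python
# Minimal sketch of the dose-response check: conversion should fall as checkout time rises.
# Bucketed rates are taken from the table above.

buckets = [("<3 sec", 0.05), ("3-5 sec", 0.04), ("5-7 sec", 0.03), (">7 sec", 0.02)]
rates = [rate for _, rate in buckets]
monotonic = all(a > b for a, b in zip(rates, rates[1:]))
print("Clear dose-response gradient" if monotonic else "No clear gradient")
```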
Consistency: ✓ PASS (3/3)
- All devices (mobile, desktop, tablet)
- All payment methods
- All countries
- Consistent pattern
Bradford Hill Score: 25/27 (Very Strong)
- Strength: 3 (40% drop)
- Consistency: 3 (all segments)
- Specificity: 2 (latency affects other things)
- Temporality: 3 (clear sequence)
- Dose-response: 3 (gradient)
- Plausibility: 3 (well-known)
- Coherence: 3 (fits theory)
- Experiment: 3 (rollback test)
- Analogy: 2 (some similar cases)
5. Conclusion
Root Cause: Switched to cheaper payment processor with insufficient throughput for weekend traffic.
Confidence: High (95%+)
Why high confidence:
- Perfect temporal sequence
- Strong counterfactual (rollback restored)
- Clear mechanism
- Dose-response present
- Consistent across contexts
- No confounders
- Bradford Hill 25/27
Why root (not symptom):
- Processor switch is decision that created problem
- Latency is symptom of this decision
- Fixing latency alone treats symptom
- Reverting switch eliminates problem
6. Recommended Actions
Immediate:
- ✓ Reverted to old processor (Nov 28)
- Negotiate better rates with old processor
If keeping new processor:
- Add caching layer (reduce latency <2 sec)
- Implement async checkout
- Load test at 3x peak
Validation:
- A/B test: Old processor vs new+caching
- Monitor: Latency p95, conversion rate
- Success: Conversion >4.5%, latency <2 sec
7. Limitations
Data limitations:
- No abandonment reason tracking
- Processor metrics limited (black box)
Analysis limitations:
- Didn't test all latency optimizations
- Small sample for some device types
Generalizability:
- Specific to our checkout flow
- May not apply to other traffic patterns
Quality Checklist
Before finalizing root cause analysis, verify:
Effect Definition:
- Effect clearly described (not vague)
- Quantified with magnitude
- Timeline established (when started, duration)
- Baseline for comparison stated
Hypothesis Generation:
- Multiple hypotheses generated (3+ competing)
- Used systematic techniques (5 Whys, Fishbone, Timeline, Differential)
- Proximate vs root distinguished
- Confounders identified
Causal Model:
- Causal chain mapped (root → proximate → effect)
- Mechanisms explained for each link
- Confounding relationships noted
- Direct vs indirect causation clear
Evidence Assessment:
- Temporal sequence verified (cause before effect)
- Counterfactual tested (what if cause absent)
- Mechanism explained with supporting evidence
- Dose-response checked (more cause → more effect)
- Consistency assessed (holds across contexts)
- Confounders ruled out or controlled
Conclusion:
- Root cause clearly stated
- Confidence level stated with justification
- Distinguished from proximate causes/symptoms
- Alternative explanations acknowledged
- Limitations noted
Recommendations:
- Address root cause (not just symptoms)
- Actionable interventions proposed
- Validation tests specified
- Success metrics defined
Minimum Standards:
- For medium-stakes (postmortems, feature issues): Score ≥ 3.5 on rubric
- For high-stakes (infrastructure, safety): Score ≥ 4.0 on rubric
- Red flags: <3 on Temporal Sequence, Counterfactual, or Mechanism = weak causal claim