# Causal Inference & Root Cause Analysis Methodology
## Root Cause Analysis Workflow
Copy this checklist and track your progress:
```
Root Cause Analysis Progress:
- [ ] Step 1: Generate hypotheses using structured techniques
- [ ] Step 2: Build causal model and distinguish cause types
- [ ] Step 3: Gather evidence and assess temporal sequence
- [ ] Step 4: Test causality with Bradford Hill criteria
- [ ] Step 5: Verify root cause and check coherence
```
**Step 1: Generate hypotheses using structured techniques**
Use 5 Whys, Fishbone diagrams, timeline analysis, or differential diagnosis to systematically generate potential causes. See [Hypothesis Generation Techniques](#hypothesis-generation-techniques) for detailed methods.
**Step 2: Build causal model and distinguish cause types**
Map causal chains to distinguish root causes from proximate causes, symptoms, and confounders. See [Causal Model Building](#causal-model-building) for cause type definitions and chain mapping.
**Step 3: Gather evidence and assess temporal sequence**
Collect evidence for each hypothesis and verify temporal relationships (cause must precede effect). See [Evidence Assessment Methods](#evidence-assessment-methods) for essential tests and evidence hierarchy.
**Step 4: Test causality with Bradford Hill criteria**
Score hypotheses using the 9 Bradford Hill criteria (strength, consistency, specificity, temporality, dose-response, plausibility, coherence, experiment, analogy). See [Bradford Hill Criteria](#bradford-hill-criteria) for scoring rubric.
**Step 5: Verify root cause and check coherence**
Ensure the identified root cause has strong evidence, fits with known facts, and addresses systemic issues. See [Quality Checklist](#quality-checklist) and [Worked Example](#worked-example-website-conversion-drop) for validation techniques.
---
## Hypothesis Generation Techniques
### 5 Whys (Trace Back to Root)
Start with the effect and ask "why" repeatedly until you reach the root cause.
**Process:**
```
Effect: {Observed problem}
Why? → {Proximate cause 1}
Why? → {Proximate cause 2}
Why? → {Intermediate cause}
Why? → {Deeper cause}
Why? → {Root cause}
```
**Example:**
```
Effect: Server crashed
Why? → Out of memory
Why? → Memory leak in new code
Why? → Connection pool not releasing connections
Why? → Error handling missing in new feature
Why? → Code review process didn't catch this edge case (ROOT CAUSE)
```
**When to stop**: When you reach a cause that, if fixed, would prevent recurrence.
**Pitfalls to avoid**:
- Stopping too early at proximate causes
- Following only one path (should explore multiple branches)
- Accepting vague answers ("because it broke")
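To avoid following only one path, it can help to record the why-chain as a small tree rather than a flat list. A minimal Python sketch, based only on the server-crash example above (the `WhyNode`/`ask_why` names are illustrative, not part of the methodology):
```python
from dataclasses import dataclass, field

@dataclass
class WhyNode:
    """One cause in a 5 Whys tree; children are answers to 'why?'."""
    cause: str
    is_root_cause: bool = False  # set True when fixing this cause would prevent recurrence
    children: list["WhyNode"] = field(default_factory=list)

def ask_why(node: WhyNode, answer: str, is_root_cause: bool = False) -> WhyNode:
    """Attach one more 'why?' answer and return it so chains (and branches) can be built."""
    child = WhyNode(answer, is_root_cause)
    node.children.append(child)
    return child

# Build the server-crash chain from the example above; extra branches can be
# attached to any node to avoid the single-path pitfall.
effect = WhyNode("Server crashed")
w1 = ask_why(effect, "Out of memory")
w2 = ask_why(w1, "Memory leak in new code")
w3 = ask_why(w2, "Connection pool not releasing connections")
w4 = ask_why(w3, "Error handling missing in new feature")
ask_why(w4, "Code review process didn't catch this edge case", is_root_cause=True)
```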
---
### Fishbone Diagram (Categorize Causes)
Organize potential causes by category to ensure comprehensive coverage.
**Categories (6 Ms)**:
```
   Methods          Machines          Materials
      |                 |                 |
      |                 |                 |
      └─────────────────┴─────────────────┴──────────→ Effect
      |                 |                 |
      |                 |                 |
  Manpower         Measurement       Environment
```
**Adapted for software/product**:
- **People**: Skills, knowledge, communication, errors
- **Process**: Procedures, workflows, standards, review
- **Technology**: Tools, systems, infrastructure, dependencies
- **Environment**: External factors, timing, context
**Example (API outage)**:
- **People**: On-call engineer inexperienced with database
- **Process**: No rollback procedure documented
- **Technology**: Database connection pooling misconfigured
- **Environment**: Traffic spike coincided with deploy
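One lightweight way to work with the adapted categories is to keep the fishbone as plain structured data, so empty categories stand out as gaps in the brainstorm. A minimal sketch reusing the API-outage entries above (the dictionary layout is just one possible representation):
```python
# Fishbone categories adapted for software/product, stored as a simple dict.
fishbone: dict[str, list[str]] = {
    "People": ["On-call engineer inexperienced with database"],
    "Process": ["No rollback procedure documented"],
    "Technology": ["Database connection pooling misconfigured"],
    "Environment": ["Traffic spike coincided with deploy"],
}

# Flag categories that have not been explored yet (empty lists).
unexplored = [category for category, causes in fishbone.items() if not causes]
print("Categories still needing hypotheses:", unexplored or "none")
```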
---
### Timeline Analysis (What Changed?)
Map the events leading up to the effect to identify temporal correlations.
**Process**:
1. Establish baseline period (normal operation)
2. Mark when effect first observed
3. List all changes in days/hours leading up to effect
4. Identify changes that temporally precede effect
**Example**:
```
T-7 days: Normal operation (baseline)
T-3 days: Deployed new payment processor integration
T-2 days: Traffic increased 20%
T-1 day: First error logged (dismissed as fluke)
T-0 (14:00): Conversion rate dropped 30%
T+1 hour: Alert fired, investigation started
```
**Look for**: Changes, anomalies, or events right before effect (temporal precedence).
**Red flags**:
- Multiple changes at once (hard to isolate cause)
- Long lag between change and effect (harder to connect)
- No changes before effect (may be external factor or accumulated technical debt)
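The temporal-precedence screen can be made mechanical: list candidate changes with timestamps, drop anything that happened after the effect, and report the lag. A minimal sketch with illustrative timestamps standing in for the relative times above:
```python
from datetime import datetime

# Candidate changes and when the effect was first observed (illustrative timestamps).
changes = {
    "Deployed new payment processor integration": datetime(2025, 3, 7, 9, 0),
    "Traffic increased 20%": datetime(2025, 3, 8, 12, 0),
    "First error logged": datetime(2025, 3, 9, 16, 30),
}
effect_observed = datetime(2025, 3, 10, 14, 0)  # conversion rate dropped

# Temporal precedence screen: only changes that happened before the effect qualify.
for change, when in sorted(changes.items(), key=lambda item: item[1]):
    lag = effect_observed - when
    if lag.total_seconds() > 0:
        print(f"{change}: precedes effect by {lag}")
    else:
        print(f"{change}: happened after the effect -> not a candidate cause")
```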
---
### Differential Diagnosis
List all conditions/causes that could produce the observed symptoms.
**Process**:
1. List all possible causes (be comprehensive)
2. For each cause, list symptoms it would produce
3. Compare predicted symptoms to observed symptoms
4. Eliminate causes that don't match
**Example (API returning 500 errors)**:
| Cause | Predicted Symptoms | Observed? |
|-------|-------------------|-----------|
| Code bug (logic error) | Specific endpoints fail, reproducible | ✓ Yes |
| Resource exhaustion (memory) | All endpoints slow, CPU/memory high | ✗ CPU normal |
| Dependency failure (database) | All database queries fail, connection errors | ✗ DB responsive |
| Configuration error | Service won't start or immediate failure | ✗ Started fine |
| DDoS attack | High traffic, distributed sources | ✗ Traffic normal |
**Conclusion**: Code bug most likely (matches symptoms, others ruled out).
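The elimination logic in the table reduces to a symptom match: a cause survives only if every symptom it predicts agrees with what was observed. A minimal sketch of that filter, with the symptom flags taken from the table above:
```python
# Observed symptoms for the 500-error example (True = present, False = absent).
observed = {
    "specific endpoints fail": True,
    "high cpu/memory": False,
    "database connection errors": False,
    "service fails to start": False,
    "abnormal traffic": False,
}

# Each candidate cause predicts which symptoms should be present.
predictions = {
    "Code bug (logic error)": {"specific endpoints fail": True},
    "Resource exhaustion": {"high cpu/memory": True},
    "Dependency failure (database)": {"database connection errors": True},
    "Configuration error": {"service fails to start": True},
    "DDoS attack": {"abnormal traffic": True},
}

# Keep only the causes whose predictions match every relevant observation.
remaining = [
    cause for cause, predicted in predictions.items()
    if all(observed.get(symptom) == present for symptom, present in predicted.items())
]
print("Causes consistent with observations:", remaining)  # -> ['Code bug (logic error)']
```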
---
## Causal Model Building
### Distinguish Cause Types
#### 1. Root Cause
- **Definition**: Fundamental underlying issue
- **Test**: If fixed, problem doesn't recur
- **Usually**: Structural, procedural, or systemic
- **Example**: "No code review process for infrastructure changes"
#### 2. Proximate Cause
- **Definition**: Immediate trigger
- **Test**: Directly precedes effect
- **Often**: Symptom of deeper root cause
- **Example**: "Deployed buggy config file"
#### 3. Contributing Factor
- **Definition**: Makes effect more likely or severe
- **Test**: Not sufficient alone, but amplifies
- **Often**: Context or conditions
- **Example**: "High traffic amplified impact of bug"
#### 4. Coincidence
- **Definition**: Happened around same time, no causal relationship
- **Test**: No mechanism, doesn't pass counterfactual
- **Example**: "Marketing campaign launched same day" (but unrelated to technical issue)
---
### Map Causal Chains
#### Linear Chain
```
Root Cause
↓ (mechanism)
Intermediate Cause
↓ (mechanism)
Proximate Cause
↓ (mechanism)
Observed Effect
```
**Example**:
```
No SSL cert renewal process (Root)
↓ (no automated reminders)
Cert expires unnoticed (Intermediate)
↓ (HTTPS handshake fails)
HTTPS connections fail (Proximate)
↓ (users can't connect)
Users can't access site (Effect)
```
#### Branching Chain (Multiple Paths)
```
          Root Cause
          ↙        ↘
      Path A      Path B
         ↓           ↓
     Effect A     Effect B
```
**Example**:
```
      Database migration error (Root)
            ↙                ↘
    Missing indexes     Schema mismatch
           ↓                  ↓
      Slow queries        Type errors
```
#### Common Cause (Confounding)
```
        Confounder (Z)
         ↙         ↘
   Variable X    Variable Y
```
X and Y correlate because Z causes both, but X doesn't cause Y.
**Example**:
```
      Hot weather (Confounder)
          ↙             ↘
   Ice cream sales   Drownings
```
Ice cream doesn't cause drownings; both caused by hot weather/summer season.
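These three structures can be kept as a small directed graph, which turns "does X actually cause Y?" into a reachability check. A minimal sketch using the SSL-chain and hot-weather examples above (the `causes` helper is illustrative):
```python
# Directed edges: cause -> effect.
chain = {
    "No SSL cert renewal process": ["Cert expires unnoticed"],
    "Cert expires unnoticed": ["HTTPS connections fail"],
    "HTTPS connections fail": ["Users can't access site"],
}

confounded = {
    "Hot weather": ["Ice cream sales", "Drownings"],  # common cause Z
}

def causes(graph: dict[str, list[str]], x: str, y: str) -> bool:
    """True if there is a directed path x -> ... -> y in the graph."""
    frontier = [x]
    seen = set()
    while frontier:
        node = frontier.pop()
        if node == y:
            return True
        if node in seen:
            continue
        seen.add(node)
        frontier.extend(graph.get(node, []))
    return False

print(causes(chain, "No SSL cert renewal process", "Users can't access site"))  # True
print(causes(confounded, "Ice cream sales", "Drownings"))  # False: no X -> Y path, only a common cause
```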
---
## Evidence Assessment Methods
### Essential Tests
#### 1. Temporal Sequence (Must Pass)
**Rule**: Cause MUST precede effect. If effect happened before "cause," it's not causal.
**How to verify**:
- Create detailed timeline
- Mark exact timestamps of cause and effect
- Calculate lag (effect should follow cause)
**Example**:
- Migration deployed: March 10, 2:00 PM
- Queries slowed: March 10, 2:15 PM
- ✓ Cause precedes effect by 15 minutes
#### 2. Counterfactual (Strong Evidence)
**Question**: "What would have happened without the cause?"
**How to test**:
- **Control group**: Compare cases with cause vs without
- **Rollback**: Remove cause, see if effect disappears
- **Baseline comparison**: Compare to period before cause
- **A/B test**: Random assignment of cause
**Example**:
- Hypothesis: New feature X causes conversion drop
- Test: A/B test with X enabled vs disabled
- Result: No conversion difference → X not the cause
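When the counterfactual is tested with an A/B test, the comparison comes down to two conversion proportions. A minimal standard-library sketch of a two-proportion z-test; the visitor and conversion counts are made up for illustration:
```python
import math

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """z statistic for H0: the two groups convert at the same rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Feature X enabled vs disabled (illustrative counts).
z = two_proportion_z(conv_a=480, n_a=10_000, conv_b=495, n_b=10_000)
print(f"z = {z:.2f}")  # |z| < 1.96 -> no significant difference -> X unlikely to be the cause
```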
#### 3. Mechanism (Plausibility Test)
**Question**: Can you explain HOW cause produces effect?
**Requirements**:
- Pathway from cause to effect
- Each step makes sense
- Supported by theory or evidence
**Example - Strong**:
- Cause: Increased checkout latency (5 sec)
- Mechanism: Users abandon slow pages
- Evidence: Industry data shows 7+ sec → 20% abandon
- ✓ Clear, plausible mechanism
**Example - Weak**:
- Cause: Moon phase (full moon)
- Mechanism: ??? (no plausible pathway)
- Effect: Website traffic increase
- ✗ No mechanism, likely spurious correlation
---
### Evidence Hierarchy
**Strongest Evidence** (Gold Standard):
1. **Randomized Controlled Trial (RCT)**
- Random assignment eliminates confounding
- Compare treatment vs control group
- Example: A/B test with random user assignment
**Strong Evidence**:
2. **Quasi-Experiment / Natural Experiment**
- Some random-like assignment (not perfect)
- Example: Policy implemented in one region, not another
- Control for confounders statistically
**Medium Evidence**:
3. **Longitudinal Study (Before/After)**
- Track same subjects over time
- Controls for individual differences
- Example: Metric before and after change
4. **Well-Controlled Observational Study**
- Statistical controls for confounders
- Multiple variables measured
- Example: Regression analysis with covariates
**Weak Evidence**:
5. **Cross-Sectional Correlation**
- Single point in time
- Can't establish temporal order
- Example: Survey at one moment
6. **Anecdote / Single Case**
- May not generalize
- Many confounders
- Example: One user complaint
**Recommendation**: For high-stakes decisions, aim for quasi-experiment or better.
---
### Bradford Hill Criteria
Nine factors that strengthen causal claims. Score each on a 1-3 scale.
#### 1. Strength
**Question**: How strong is the association?
- **3**: Very strong correlation (r > 0.7, OR > 10)
- **2**: Moderate correlation (r = 0.3-0.7, OR = 2-10)
- **1**: Weak correlation (r < 0.3, OR < 2)
**Example**: 10x latency increase = strong association (3/3)
#### 2. Consistency
**Question**: Does relationship replicate across contexts?
- **3**: Always consistent (multiple studies, settings)
- **2**: Mostly consistent (some exceptions)
- **1**: Rarely consistent (contradictory results)
**Example**: Latency affects conversion on all pages, devices, countries = consistent (3/3)
#### 3. Specificity
**Question**: Is cause-effect relationship specific?
- **3**: Specific cause → specific effect
- **2**: Somewhat specific (some other effects)
- **1**: General/non-specific (cause has many effects)
**Example**: Missing index affects only queries on that table = specific (3/3)
#### 4. Temporality
**Question**: Does cause precede effect?
- **3**: Always (clear temporal sequence)
- **2**: Usually (mostly precedes)
- **1**: Unclear or reverse causation possible
**Example**: Migration (2:00 PM) before slow queries (2:15 PM) = always (3/3)
#### 5. Dose-Response
**Question**: Does more cause → more effect?
- **3**: Clear gradient (linear or monotonic)
- **2**: Some gradient (mostly true)
- **1**: No gradient (flat or random)
**Example**: Larger tables → slower queries = clear gradient (3/3)
#### 6. Plausibility
**Question**: Does mechanism make sense?
- **3**: Very plausible (well-established theory)
- **2**: Somewhat plausible (reasonable)
- **1**: Implausible (contradicts theory)
**Example**: Index scans faster than table scans = well-established (3/3)
#### 7. Coherence
**Question**: Fits with existing knowledge?
- **3**: Fits well (no contradictions)
- **2**: Somewhat fits (minor contradictions)
- **1**: Contradicts existing knowledge
**Example**: Aligns with database optimization theory = fits well (3/3)
#### 8. Experiment
**Question**: Does intervention change outcome?
- **3**: Strong experimental evidence (RCT)
- **2**: Some experimental evidence (quasi)
- **1**: No experimental evidence (observational only)
**Example**: Rollback restored performance = strong experiment (3/3)
#### 9. Analogy
**Question**: Similar cause-effect patterns exist?
- **3**: Strong analogies (many similar cases)
- **2**: Some analogies (a few similar)
- **1**: No analogies (unique case)
**Example**: Similar patterns in other databases = some analogies (2/3)
**Scoring**:
- **Total 18-27**: Strong causal evidence
- **Total 12-17**: Moderate evidence
- **Total < 12**: Weak evidence (need more investigation)
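Once each criterion has a 1-3 score, the banding above is easy to apply consistently. A minimal sketch; the example scores are the ones from the worked example later in this document:
```python
# Bradford Hill scores, each criterion rated 1-3 (values from the worked example below).
scores = {
    "strength": 3, "consistency": 3, "specificity": 2,
    "temporality": 3, "dose_response": 3, "plausibility": 3,
    "coherence": 3, "experiment": 3, "analogy": 2,
}

def classify(scores: dict[str, int]) -> tuple[int, str]:
    """Sum the nine criteria and map the total onto the bands above."""
    assert len(scores) == 9 and all(1 <= s <= 3 for s in scores.values())
    total = sum(scores.values())
    if total >= 18:
        band = "Strong causal evidence"
    elif total >= 12:
        band = "Moderate evidence"
    else:
        band = "Weak evidence (need more investigation)"
    return total, band

print(classify(scores))  # (25, 'Strong causal evidence')
```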
---
## Worked Example: Website Conversion Drop
### Problem
Website conversion rate dropped from 5% to 3% (40% relative drop) starting November 15th.
---
### 1. Effect Definition
**Effect**: Conversion rate dropped 40% (5% → 3%)
**Quantification**:
- Baseline: 5% (stable for 6 months prior)
- Current: 3% (observed for 2 weeks)
- Absolute drop: 2 percentage points
- Relative drop: 40%
- Impact: ~$40k/day (~$280k/week) revenue loss
**Timeline**:
- Last normal day: November 14th (5.1% conversion)
- First drop observed: November 15th (3.2% conversion)
- Ongoing: Yes (still at 3% as of November 29th)
**Impact**:
- 10,000 daily visitors
- 500 conversions → 300 conversions
- $200 average order value
- Loss: 200 conversions × $200 = $40k/day
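Spelling the impact arithmetic out keeps the daily and weekly figures consistent; a minimal sketch using only the numbers above:
```python
# Impact arithmetic from the figures above.
daily_visitors = 10_000
baseline_rate, current_rate = 0.05, 0.03
avg_order_value = 200

lost_conversions_per_day = daily_visitors * (baseline_rate - current_rate)  # 200
daily_loss = lost_conversions_per_day * avg_order_value                     # $40,000
print(f"Lost conversions/day: {lost_conversions_per_day:.0f}")
print(f"Revenue loss: ${daily_loss:,.0f}/day (~${daily_loss * 7:,.0f}/week)")
```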
---
### 2. Competing Hypotheses (Using Multiple Techniques)
#### Using 5 Whys:
```
Effect: Conversion dropped 40%
Why? → Users abandoning checkout
Why? → Checkout takes too long
Why? → Payment processor is slow
Why? → New processor has higher latency
Why? → Switched to cheaper processor (ROOT)
```
#### Using Fishbone Diagram:
**Technology**:
- New payment processor (Nov 10)
- New checkout UI (Nov 14)
**Process**:
- Lack of A/B testing
- No performance monitoring
**Environment**:
- Seasonal traffic shift (holiday season)
**People**:
- User behavior changes?
#### Using Timeline Analysis:
```
Nov 10: Switched to new payment processor
Nov 14: Deployed new checkout UI
Nov 15: Conversion drop observed (2:00 AM)
```
#### Competing Hypotheses Generated:
**Hypothesis 1: New checkout UI** (deployed Nov 14)
- Type: Proximate cause
- Evidence for: Timing matches (Nov 14 → Nov 15)
- Evidence against: A/B test showed no difference; Rollback didn't fix
**Hypothesis 2: Payment processor switch** (Nov 10)
- Type: Root cause candidate
- Evidence for: New processor slower (400ms vs 100ms); Timing precedes
- Evidence against: 5-day lag (why not immediate?)
**Hypothesis 3: Payment latency increase** (Nov 15)
- Type: Proximate cause/symptom
- Evidence for: Logs show 5-8 sec checkout (was 2-3 sec); User complaints
- Evidence against: None (strong evidence)
**Hypothesis 4: Seasonal traffic shift**
- Type: Confounder
- Evidence for: Holiday season
- Evidence against: Traffic composition unchanged
---
### 3. Causal Model
#### Causal Chain:
```
ROOT: Switched to cheaper payment processor (Nov 10)
↓ (mechanism: lower throughput processor)
INTERMEDIATE: Payment API latency under load
↓ (mechanism: slow API → page delays)
PROXIMATE: Checkout takes 5-8 seconds (Nov 15+)
↓ (mechanism: users abandon slow pages)
EFFECT: Conversion drops 40%
```
#### Why Nov 15 lag?
- Nov 10-14: Weekday traffic (low, processor handled it)
- Nov 15: Weekend traffic spike (2x normal)
- Processor couldn't handle weekend load → latency spike
#### Confounders Ruled Out:
- UI change: Rollback had no effect
- Seasonality: Traffic patterns unchanged
- Marketing: No changes
---
### 4. Evidence Assessment
**Temporal Sequence**: ✓ PASS (3/3)
- Processor switch (Nov 10) → Latency (Nov 15) → Drop (Nov 15)
- Clear precedence
**Counterfactual**: ✓ PASS (3/3)
- Test: Switched back to old processor
- Result: Conversion recovered to 4.8%
- Strong evidence
**Mechanism**: ✓ PASS (3/3)
- Slow processor (400ms) → High load → 5-8 sec checkout → User abandonment
- Industry data: 7+ sec = 20% abandon
- Clear, plausible mechanism
**Dose-Response**: ✓ PASS (3/3)
| Checkout Time | Conversion |
|---------------|------------|
| <3 sec | 5% |
| 3-5 sec | 4% |
| 5-7 sec | 3% |
| >7 sec | 2% |
Clear gradient
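The gradient claim can be checked directly: conversion should fall monotonically as checkout time increases across the buckets. A minimal sketch over the table above:
```python
# Conversion rate per checkout-time bucket (from the table above).
buckets = [("<3 sec", 0.05), ("3-5 sec", 0.04), ("5-7 sec", 0.03), (">7 sec", 0.02)]

rates = [rate for _, rate in buckets]
monotonic_decrease = all(earlier > later for earlier, later in zip(rates, rates[1:]))
print("Dose-response gradient present:", monotonic_decrease)  # True
```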
**Consistency**: ✓ PASS (3/3)
- All devices (mobile, desktop, tablet)
- All payment methods
- All countries
- Consistent pattern
**Bradford Hill Score: 25/27** (Very Strong)
1. Strength: 3 (40% drop)
2. Consistency: 3 (all segments)
3. Specificity: 2 (latency affects other things)
4. Temporality: 3 (clear sequence)
5. Dose-response: 3 (gradient)
6. Plausibility: 3 (well-known)
7. Coherence: 3 (fits theory)
8. Experiment: 3 (rollback test)
9. Analogy: 2 (some similar cases)
---
### 5. Conclusion
**Root Cause**: Switched to cheaper payment processor with insufficient throughput for weekend traffic.
**Confidence**: High (95%+)
**Why high confidence**:
- Perfect temporal sequence
- Strong counterfactual (rollback restored)
- Clear mechanism
- Dose-response present
- Consistent across contexts
- No confounders
- Bradford Hill 25/27
**Why root (not symptom)**:
- Processor switch is decision that created problem
- Latency is symptom of this decision
- Fixing latency alone treats symptom
- Reverting switch eliminates problem
---
### 6. Recommended Actions
**Immediate**:
1. ✓ Reverted to old processor (Nov 28)
2. Negotiate better rates with old processor
**If keeping new processor**:
1. Add caching layer (reduce latency <2 sec)
2. Implement async checkout
3. Load test at 3x peak
**Validation**:
- A/B test: Old processor vs new+caching
- Monitor: Latency p95, conversion rate
- Success: Conversion >4.5%, latency <2 sec
---
### 7. Limitations
**Data limitations**:
- No abandonment reason tracking
- Processor metrics limited (black box)
**Analysis limitations**:
- Didn't test all latency optimizations
- Small sample for some device types
**Generalizability**:
- Specific to our checkout flow
- May not apply to other traffic patterns
---
## Quality Checklist
Before finalizing root cause analysis, verify:
**Effect Definition**:
- [ ] Effect clearly described (not vague)
- [ ] Quantified with magnitude
- [ ] Timeline established (when started, duration)
- [ ] Baseline for comparison stated
**Hypothesis Generation**:
- [ ] Multiple hypotheses generated (3+ competing)
- [ ] Used systematic techniques (5 Whys, Fishbone, Timeline, Differential)
- [ ] Proximate vs root distinguished
- [ ] Confounders identified
**Causal Model**:
- [ ] Causal chain mapped (root → proximate → effect)
- [ ] Mechanisms explained for each link
- [ ] Confounding relationships noted
- [ ] Direct vs indirect causation clear
**Evidence Assessment**:
- [ ] Temporal sequence verified (cause before effect)
- [ ] Counterfactual tested (what if cause absent)
- [ ] Mechanism explained with supporting evidence
- [ ] Dose-response checked (more cause → more effect)
- [ ] Consistency assessed (holds across contexts)
- [ ] Confounders ruled out or controlled
**Conclusion**:
- [ ] Root cause clearly stated
- [ ] Confidence level stated with justification
- [ ] Distinguished from proximate causes/symptoms
- [ ] Alternative explanations acknowledged
- [ ] Limitations noted
**Recommendations**:
- [ ] Address root cause (not just symptoms)
- [ ] Actionable interventions proposed
- [ ] Validation tests specified
- [ ] Success metrics defined
**Minimum Standards**:
- For medium-stakes (postmortems, feature issues): Score ≥ 3.5 on rubric
- For high-stakes (infrastructure, safety): Score ≥ 4.0 on rubric
- Red flags: <3 on Temporal Sequence, Counterfactual, or Mechanism = weak causal claim