# Causal Inference & Root Cause Analysis Methodology
## Root Cause Analysis Workflow
Copy this checklist and track your progress:
```
Root Cause Analysis Progress:
- [ ] Step 1: Generate hypotheses using structured techniques
- [ ] Step 2: Build causal model and distinguish cause types
- [ ] Step 3: Gather evidence and assess temporal sequence
- [ ] Step 4: Test causality with Bradford Hill criteria
- [ ] Step 5: Verify root cause and check coherence
```
**Step 1: Generate hypotheses using structured techniques**
Use 5 Whys, Fishbone diagrams, timeline analysis, or differential diagnosis to systematically generate potential causes. See [Hypothesis Generation Techniques](#hypothesis-generation-techniques) for detailed methods.
**Step 2: Build causal model and distinguish cause types**
Map causal chains to distinguish root causes from proximate causes, symptoms, and confounders. See [Causal Model Building](#causal-model-building) for cause type definitions and chain mapping.
**Step 3: Gather evidence and assess temporal sequence**
Collect evidence for each hypothesis and verify temporal relationships (cause must precede effect). See [Evidence Assessment Methods](#evidence-assessment-methods) for essential tests and evidence hierarchy.
**Step 4: Test causality with Bradford Hill criteria**
Score hypotheses using the 9 Bradford Hill criteria (strength, consistency, specificity, temporality, dose-response, plausibility, coherence, experiment, analogy). See [Bradford Hill Criteria](#bradford-hill-criteria) for scoring rubric.
**Step 5: Verify root cause and check coherence**
Ensure the identified root cause has strong evidence, fits with known facts, and addresses systemic issues. See [Quality Checklist](#quality-checklist) and [Worked Example](#worked-example-website-conversion-drop) for validation techniques.
---
## Hypothesis Generation Techniques
### 5 Whys (Trace Back to Root)
Start with the effect and ask "why" repeatedly until you reach the root cause.
**Process:**
```
Effect: {Observed problem}
Why? → {Proximate cause 1}
Why? → {Proximate cause 2}
Why? → {Intermediate cause}
Why? → {Deeper cause}
Why? → {Root cause}
```
**Example:**
```
Effect: Server crashed
Why? → Out of memory
Why? → Memory leak in new code
Why? → Connection pool not releasing connections
Why? → Error handling missing in new feature
Why? → Code review process didn't catch this edge case (ROOT CAUSE)
```
**When to stop**: When you reach a cause that, if fixed, would prevent recurrence.
**Pitfalls to avoid**:
- Stopping too early at proximate causes
- Following only one path (should explore multiple branches)
- Accepting vague answers ("because it broke")
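To avoid following only one path, it can help to record the why-chain as a small tree rather than a flat list. A minimal Python sketch, based only on the server-crash example above (the `WhyNode`/`ask_why` names are illustrative, not part of the methodology):
```python
from dataclasses import dataclass, field

@dataclass
class WhyNode:
    """One cause in a 5 Whys tree; children are answers to 'why?'."""
    cause: str
    is_root_cause: bool = False  # set True when fixing this cause would prevent recurrence
    children: list["WhyNode"] = field(default_factory=list)

def ask_why(node: WhyNode, answer: str, is_root_cause: bool = False) -> WhyNode:
    """Attach one more 'why?' answer and return it so chains (and branches) can be built."""
    child = WhyNode(answer, is_root_cause)
    node.children.append(child)
    return child

# Build the server-crash chain from the example above; extra branches can be
# attached to any node to avoid the single-path pitfall.
effect = WhyNode("Server crashed")
w1 = ask_why(effect, "Out of memory")
w2 = ask_why(w1, "Memory leak in new code")
w3 = ask_why(w2, "Connection pool not releasing connections")
w4 = ask_why(w3, "Error handling missing in new feature")
ask_why(w4, "Code review process didn't catch this edge case", is_root_cause=True)
```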
---
### Fishbone Diagram (Categorize Causes)
Organize potential causes by category to ensure comprehensive coverage.
**Categories (6 Ms)**:
```
   Methods          Machines          Materials
      |                 |                 |
      |                 |                 |
      └─────────────────┴─────────────────┴──────────→ Effect
      |                 |                 |
      |                 |                 |
  Manpower         Measurement       Environment
```
**Adapted for software/product**:
- **People**: Skills, knowledge, communication, errors
- **Process**: Procedures, workflows, standards, review
- **Technology**: Tools, systems, infrastructure, dependencies
- **Environment**: External factors, timing, context
**Example (API outage)**:
- **People**: On-call engineer inexperienced with database
- **Process**: No rollback procedure documented
- **Technology**: Database connection pooling misconfigured
- **Environment**: Traffic spike coincided with deploy
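One lightweight way to work with the adapted categories is to keep the fishbone as plain structured data, so empty categories stand out as gaps in the brainstorm. A minimal sketch reusing the API-outage entries above (the dictionary layout is just one possible representation):
```python
# Fishbone categories adapted for software/product, stored as a simple dict.
fishbone: dict[str, list[str]] = {
    "People": ["On-call engineer inexperienced with database"],
    "Process": ["No rollback procedure documented"],
    "Technology": ["Database connection pooling misconfigured"],
    "Environment": ["Traffic spike coincided with deploy"],
}

# Flag categories that have not been explored yet (empty lists).
unexplored = [category for category, causes in fishbone.items() if not causes]
print("Categories still needing hypotheses:", unexplored or "none")
```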
---
### Timeline Analysis (What Changed?)
Map the events leading up to the effect to identify temporal correlations.
**Process**:
1. Establish baseline period (normal operation)
2. Mark when effect first observed
3. List all changes in days/hours leading up to effect
4. Identify changes that temporally precede effect
**Example**:
```
T-7 days: Normal operation (baseline)
T-3 days: Deployed new payment processor integration
T-2 days: Traffic increased 20%
T-1 day: First error logged (dismissed as fluke)
T-0 (14:00): Conversion rate dropped 30%
T+1 hour: Alert fired, investigation started
```
**Look for**: Changes, anomalies, or events right before effect (temporal precedence).
**Red flags**:
- Multiple changes at once (hard to isolate cause)
- Long lag between change and effect (harder to connect)
- No changes before effect (may be external factor or accumulated technical debt)
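The temporal-precedence screen can be made mechanical: list candidate changes with timestamps, drop anything that happened after the effect, and report the lag. A minimal sketch with illustrative timestamps standing in for the relative times above:
```python
from datetime import datetime

# Candidate changes and when the effect was first observed (illustrative timestamps).
changes = {
    "Deployed new payment processor integration": datetime(2025, 3, 7, 9, 0),
    "Traffic increased 20%": datetime(2025, 3, 8, 12, 0),
    "First error logged": datetime(2025, 3, 9, 16, 30),
}
effect_observed = datetime(2025, 3, 10, 14, 0)  # conversion rate dropped

# Temporal precedence screen: only changes that happened before the effect qualify.
for change, when in sorted(changes.items(), key=lambda item: item[1]):
    lag = effect_observed - when
    if lag.total_seconds() > 0:
        print(f"{change}: precedes effect by {lag}")
    else:
        print(f"{change}: happened after the effect -> not a candidate cause")
```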
---
### Differential Diagnosis
List all conditions/causes that could produce the observed symptoms.
**Process**:
1. List all possible causes (be comprehensive)
2. For each cause, list symptoms it would produce
3. Compare predicted symptoms to observed symptoms
4. Eliminate causes that don't match
**Example (API returning 500 errors)**:
| Cause | Predicted Symptoms | Observed? |
|-------|-------------------|-----------|
| Code bug (logic error) | Specific endpoints fail, reproducible | ✓ Yes |
| Resource exhaustion (memory) | All endpoints slow, CPU/memory high | ✗ CPU normal |
| Dependency failure (database) | All database queries fail, connection errors | ✗ DB responsive |
| Configuration error | Service won't start or immediate failure | ✗ Started fine |
| DDoS attack | High traffic, distributed sources | ✗ Traffic normal |
**Conclusion**: Code bug most likely (matches symptoms, others ruled out).
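The elimination logic in the table reduces to a symptom match: a cause survives only if every symptom it predicts agrees with what was observed. A minimal sketch of that filter, with the symptom flags taken from the table above:
```python
# Observed symptoms for the 500-error example (True = present, False = absent).
observed = {
    "specific endpoints fail": True,
    "high cpu/memory": False,
    "database connection errors": False,
    "service fails to start": False,
    "abnormal traffic": False,
}

# Each candidate cause predicts which symptoms should be present.
predictions = {
    "Code bug (logic error)": {"specific endpoints fail": True},
    "Resource exhaustion": {"high cpu/memory": True},
    "Dependency failure (database)": {"database connection errors": True},
    "Configuration error": {"service fails to start": True},
    "DDoS attack": {"abnormal traffic": True},
}

# Keep only the causes whose predictions match every relevant observation.
remaining = [
    cause for cause, predicted in predictions.items()
    if all(observed.get(symptom) == present for symptom, present in predicted.items())
]
print("Causes consistent with observations:", remaining)  # -> ['Code bug (logic error)']
```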
---
## Causal Model Building
### Distinguish Cause Types
#### 1. Root Cause
- **Definition**: Fundamental underlying issue
- **Test**: If fixed, problem doesn't recur
- **Usually**: Structural, procedural, or systemic
- **Example**: "No code review process for infrastructure changes"
#### 2. Proximate Cause
- **Definition**: Immediate trigger
- **Test**: Directly precedes effect
- **Often**: Symptom of deeper root cause
- **Example**: "Deployed buggy config file"
#### 3. Contributing Factor
- **Definition**: Makes effect more likely or severe
- **Test**: Not sufficient alone, but amplifies
- **Often**: Context or conditions
- **Example**: "High traffic amplified impact of bug"
#### 4. Coincidence
- **Definition**: Happened around same time, no causal relationship
- **Test**: No mechanism, doesn't pass counterfactual
- **Example**: "Marketing campaign launched same day" (but unrelated to technical issue)
---
### Map Causal Chains
#### Linear Chain
```
Root Cause
↓ (mechanism)
Intermediate Cause
↓ (mechanism)
Proximate Cause
↓ (mechanism)
Observed Effect
```
**Example**:
```
No SSL cert renewal process (Root)
↓ (no automated reminders)
Cert expires unnoticed (Intermediate)
↓ (HTTPS handshake fails)
HTTPS connections fail (Proximate)
↓ (users can't connect)
Users can't access site (Effect)
```
#### Branching Chain (Multiple Paths)
```
          Root Cause
          ↙        ↘
      Path A      Path B
         ↓           ↓
     Effect A     Effect B
```
**Example**:
```
      Database migration error (Root)
            ↙                ↘
    Missing indexes     Schema mismatch
           ↓                  ↓
      Slow queries        Type errors
```
#### Common Cause (Confounding)
```
        Confounder (Z)
         ↙         ↘
   Variable X    Variable Y
```
X and Y correlate because Z causes both, but X doesn't cause Y.
**Example**:
```
      Hot weather (Confounder)
          ↙             ↘
   Ice cream sales   Drownings
```
Ice cream doesn't cause drownings; both caused by hot weather/summer season.
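These three structures can be kept as a small directed graph, which turns "does X actually cause Y?" into a reachability check. A minimal sketch using the SSL-chain and hot-weather examples above (the `causes` helper is illustrative):
```python
# Directed edges: cause -> effect.
chain = {
    "No SSL cert renewal process": ["Cert expires unnoticed"],
    "Cert expires unnoticed": ["HTTPS connections fail"],
    "HTTPS connections fail": ["Users can't access site"],
}

confounded = {
    "Hot weather": ["Ice cream sales", "Drownings"],  # common cause Z
}

def causes(graph: dict[str, list[str]], x: str, y: str) -> bool:
    """True if there is a directed path x -> ... -> y in the graph."""
    frontier = [x]
    seen = set()
    while frontier:
        node = frontier.pop()
        if node == y:
            return True
        if node in seen:
            continue
        seen.add(node)
        frontier.extend(graph.get(node, []))
    return False

print(causes(chain, "No SSL cert renewal process", "Users can't access site"))  # True
print(causes(confounded, "Ice cream sales", "Drownings"))  # False: no X -> Y path, only a common cause
```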
---
## Evidence Assessment Methods
### Essential Tests
#### 1. Temporal Sequence (Must Pass)
**Rule**: Cause MUST precede effect. If effect happened before "cause," it's not causal.
**How to verify**:
- Create detailed timeline
- Mark exact timestamps of cause and effect
- Calculate lag (effect should follow cause)
**Example**:
- Migration deployed: March 10, 2:00 PM
- Queries slowed: March 10, 2:15 PM
- ✓ Cause precedes effect by 15 minutes
#### 2. Counterfactual (Strong Evidence)
**Question**: "What would have happened without the cause?"
**How to test**:
- **Control group**: Compare cases with cause vs without
- **Rollback**: Remove cause, see if effect disappears
- **Baseline comparison**: Compare to period before cause
- **A/B test**: Random assignment of cause
**Example**:
- Hypothesis: New feature X causes conversion drop
- Test: A/B test with X enabled vs disabled
- Result: No conversion difference → X not the cause
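When the counterfactual is tested with an A/B test, the comparison comes down to two conversion proportions. A minimal standard-library sketch of a two-proportion z-test; the visitor and conversion counts are made up for illustration:
```python
import math

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """z statistic for H0: the two groups convert at the same rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Feature X enabled vs disabled (illustrative counts).
z = two_proportion_z(conv_a=480, n_a=10_000, conv_b=495, n_b=10_000)
print(f"z = {z:.2f}")  # |z| < 1.96 -> no significant difference -> X unlikely to be the cause
```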
#### 3. Mechanism (Plausibility Test)
**Question**: Can you explain HOW cause produces effect?
**Requirements**:
- Pathway from cause to effect
- Each step makes sense
- Supported by theory or evidence
**Example - Strong**:
- Cause: Increased checkout latency (5 sec)
- Mechanism: Users abandon slow pages
- Evidence: Industry data shows 7+ sec → 20% abandon
- ✓ Clear, plausible mechanism
**Example - Weak**:
- Cause: Moon phase (full moon)
- Mechanism: ??? (no plausible pathway)
- Effect: Website traffic increase
- ✗ No mechanism, likely spurious correlation
---
### Evidence Hierarchy
**Strongest Evidence** (Gold Standard):
1. **Randomized Controlled Trial (RCT)**
- Random assignment eliminates confounding
- Compare treatment vs control group
- Example: A/B test with random user assignment
**Strong Evidence**:
2. **Quasi-Experiment / Natural Experiment**
- Some random-like assignment (not perfect)
- Example: Policy implemented in one region, not another
- Control for confounders statistically
**Medium Evidence**:
3. **Longitudinal Study (Before/After)**
- Track same subjects over time
- Controls for individual differences
- Example: Metric before and after change
4. **Well-Controlled Observational Study**
- Statistical controls for confounders
- Multiple variables measured
- Example: Regression analysis with covariates
**Weak Evidence**:
5. **Cross-Sectional Correlation**
- Single point in time
- Can't establish temporal order
- Example: Survey at one moment
6. **Anecdote / Single Case**
- May not generalize
- Many confounders
- Example: One user complaint
**Recommendation**: For high-stakes decisions, aim for quasi-experiment or better.
---
### Bradford Hill Criteria
Nine factors that strengthen causal claims. Score each on a 1-3 scale.
#### 1. Strength
**Question**: How strong is the association?
- **3**: Very strong correlation (r > 0.7, OR > 10)
- **2**: Moderate correlation (r = 0.3-0.7, OR = 2-10)
- **1**: Weak correlation (r < 0.3, OR < 2)
**Example**: 10x latency increase = strong association (3/3)
#### 2. Consistency
**Question**: Does relationship replicate across contexts?
- **3**: Always consistent (multiple studies, settings)
- **2**: Mostly consistent (some exceptions)
- **1**: Rarely consistent (contradictory results)
**Example**: Latency affects conversion on all pages, devices, countries = consistent (3/3)
#### 3. Specificity
**Question**: Is cause-effect relationship specific?
- **3**: Specific cause → specific effect
- **2**: Somewhat specific (some other effects)
- **1**: General/non-specific (cause has many effects)
**Example**: Missing index affects only queries on that table = specific (3/3)
#### 4. Temporality
**Question**: Does cause precede effect?
- **3**: Always (clear temporal sequence)
- **2**: Usually (mostly precedes)
- **1**: Unclear or reverse causation possible
**Example**: Migration (2:00 PM) before slow queries (2:15 PM) = always (3/3)
#### 5. Dose-Response
**Question**: Does more cause → more effect?
- **3**: Clear gradient (linear or monotonic)
- **2**: Some gradient (mostly true)
- **1**: No gradient (flat or random)
**Example**: Larger tables → slower queries = clear gradient (3/3)
#### 6. Plausibility
**Question**: Does mechanism make sense?
- **3**: Very plausible (well-established theory)
- **2**: Somewhat plausible (reasonable)
- **1**: Implausible (contradicts theory)
**Example**: Index scans faster than table scans = well-established (3/3)
#### 7. Coherence
**Question**: Fits with existing knowledge?
- **3**: Fits well (no contradictions)
- **2**: Somewhat fits (minor contradictions)
- **1**: Contradicts existing knowledge
**Example**: Aligns with database optimization theory = fits well (3/3)
#### 8. Experiment
**Question**: Does intervention change outcome?
- **3**: Strong experimental evidence (RCT)
- **2**: Some experimental evidence (quasi)
- **1**: No experimental evidence (observational only)
**Example**: Rollback restored performance = strong experiment (3/3)
#### 9. Analogy
**Question**: Similar cause-effect patterns exist?
- **3**: Strong analogies (many similar cases)
- **2**: Some analogies (a few similar)
- **1**: No analogies (unique case)
**Example**: Similar patterns in other databases = some analogies (2/3)
**Scoring**:
- **Total 18-27**: Strong causal evidence
- **Total 12-17**: Moderate evidence
- **Total < 12**: Weak evidence (need more investigation)
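Once each criterion has a 1-3 score, the banding above is easy to apply consistently. A minimal sketch; the example scores are the ones from the worked example later in this document:
```python
# Bradford Hill scores, each criterion rated 1-3 (values from the worked example below).
scores = {
    "strength": 3, "consistency": 3, "specificity": 2,
    "temporality": 3, "dose_response": 3, "plausibility": 3,
    "coherence": 3, "experiment": 3, "analogy": 2,
}

def classify(scores: dict[str, int]) -> tuple[int, str]:
    """Sum the nine criteria and map the total onto the bands above."""
    assert len(scores) == 9 and all(1 <= s <= 3 for s in scores.values())
    total = sum(scores.values())
    if total >= 18:
        band = "Strong causal evidence"
    elif total >= 12:
        band = "Moderate evidence"
    else:
        band = "Weak evidence (need more investigation)"
    return total, band

print(classify(scores))  # (25, 'Strong causal evidence')
```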
---
## Worked Example: Website Conversion Drop
### Problem
Website conversion rate dropped from 5% to 3% (40% relative drop) starting November 15th.
---
### 1. Effect Definition
**Effect**: Conversion rate dropped 40% (5% → 3%)
**Quantification**:
- Baseline: 5% (stable for 6 months prior)
- Current: 3% (observed for 2 weeks)
- Absolute drop: 2 percentage points
- Relative drop: 40%
- Impact: ~$40k/day (~$280k/week) revenue loss
**Timeline**:
- Last normal day: November 14th (5.1% conversion)
- First drop observed: November 15th (3.2% conversion)
- Ongoing: Yes (still at 3% as of November 29th)
**Impact**:
- 10,000 daily visitors
- 500 conversions → 300 conversions
- $200 average order value
- Loss: 200 conversions × $200 = $40k/day
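Spelling the impact arithmetic out keeps the daily and weekly figures consistent; a minimal sketch using only the numbers above:
```python
# Impact arithmetic from the figures above.
daily_visitors = 10_000
baseline_rate, current_rate = 0.05, 0.03
avg_order_value = 200

lost_conversions_per_day = daily_visitors * (baseline_rate - current_rate)  # 200
daily_loss = lost_conversions_per_day * avg_order_value                     # $40,000
print(f"Lost conversions/day: {lost_conversions_per_day:.0f}")
print(f"Revenue loss: ${daily_loss:,.0f}/day (~${daily_loss * 7:,.0f}/week)")
```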
---
### 2. Competing Hypotheses (Using Multiple Techniques)
#### Using 5 Whys:
```
Effect: Conversion dropped 40%
Why? → Users abandoning checkout
Why? → Checkout takes too long
Why? → Payment processor is slow
Why? → New processor has higher latency
Why? → Switched to cheaper processor (ROOT)
```
#### Using Fishbone Diagram:
**Technology**:
- New payment processor (Nov 10)
- New checkout UI (Nov 14)
**Process**:
- Lack of A/B testing
- No performance monitoring
**Environment**:
- Seasonal traffic shift (holiday season)
**People**:
- User behavior changes?
#### Using Timeline Analysis:
```
Nov 10: Switched to new payment processor
Nov 14: Deployed new checkout UI
Nov 15: Conversion drop observed (2:00 AM)
```
#### Competing Hypotheses Generated:
**Hypothesis 1: New checkout UI** (deployed Nov 14)
- Type: Proximate cause
- Evidence for: Timing matches (Nov 14 → Nov 15)
- Evidence against: A/B test showed no difference; Rollback didn't fix
**Hypothesis 2: Payment processor switch** (Nov 10)
- Type: Root cause candidate
- Evidence for: New processor slower (400ms vs 100ms); Timing precedes
- Evidence against: 5-day lag (why not immediate?)
**Hypothesis 3: Payment latency increase** (Nov 15)
- Type: Proximate cause/symptom
- Evidence for: Logs show 5-8 sec checkout (was 2-3 sec); User complaints
- Evidence against: None (strong evidence)
**Hypothesis 4: Seasonal traffic shift**
- Type: Confounder
- Evidence for: Holiday season
- Evidence against: Traffic composition unchanged
---
### 3. Causal Model
#### Causal Chain:
```
ROOT: Switched to cheaper payment processor (Nov 10)
↓ (mechanism: lower throughput processor)
INTERMEDIATE: Payment API latency under load
↓ (mechanism: slow API → page delays)
PROXIMATE: Checkout takes 5-8 seconds (Nov 15+)
↓ (mechanism: users abandon slow pages)
EFFECT: Conversion drops 40%
```
#### Why Nov 15 lag?
- Nov 10-14: Weekday traffic (low, processor handled it)
- Nov 15: Weekend traffic spike (2x normal)
- Processor couldn't handle weekend load → latency spike
#### Confounders Ruled Out:
- UI change: Rollback had no effect
- Seasonality: Traffic patterns unchanged
- Marketing: No changes
---
### 4. Evidence Assessment
**Temporal Sequence**: ✓ PASS (3/3)
- Processor switch (Nov 10) → Latency (Nov 15) → Drop (Nov 15)
- Clear precedence
**Counterfactual**: ✓ PASS (3/3)
- Test: Switched back to old processor
- Result: Conversion recovered to 4.8%
- Strong evidence
**Mechanism**: ✓ PASS (3/3)
- Slow processor (400ms) → High load → 5-8 sec checkout → User abandonment
- Industry data: 7+ sec = 20% abandon
- Clear, plausible mechanism
**Dose-Response**: ✓ PASS (3/3)
| Checkout Time | Conversion |
|---------------|------------|
| <3 sec | 5% |
| 3-5 sec | 4% |
| 5-7 sec | 3% |
| >7 sec | 2% |
Clear gradient
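The gradient claim can be checked directly: conversion should fall monotonically as checkout time increases across the buckets. A minimal sketch over the table above:
```python
# Conversion rate per checkout-time bucket (from the table above).
buckets = [("<3 sec", 0.05), ("3-5 sec", 0.04), ("5-7 sec", 0.03), (">7 sec", 0.02)]

rates = [rate for _, rate in buckets]
monotonic_decrease = all(earlier > later for earlier, later in zip(rates, rates[1:]))
print("Dose-response gradient present:", monotonic_decrease)  # True
```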
**Consistency**: ✓ PASS (3/3)
- All devices (mobile, desktop, tablet)
- All payment methods
- All countries
- Consistent pattern
**Bradford Hill Score: 25/27** (Very Strong)
1. Strength: 3 (40% drop)
2. Consistency: 3 (all segments)
3. Specificity: 2 (latency affects other things)
4. Temporality: 3 (clear sequence)
5. Dose-response: 3 (gradient)
6. Plausibility: 3 (well-known)
7. Coherence: 3 (fits theory)
8. Experiment: 3 (rollback test)
9. Analogy: 2 (some similar cases)
---
### 5. Conclusion
**Root Cause**: Switched to cheaper payment processor with insufficient throughput for weekend traffic.
**Confidence**: High (95%+)
**Why high confidence**:
- Perfect temporal sequence
- Strong counterfactual (rollback restored)
- Clear mechanism
- Dose-response present
- Consistent across contexts
- No confounders
- Bradford Hill 25/27
**Why root (not symptom)**:
- Processor switch is decision that created problem
- Latency is symptom of this decision
- Fixing latency alone treats symptom
- Reverting switch eliminates problem
---
### 6. Recommended Actions
**Immediate**:
1. ✓ Reverted to old processor (Nov 28)
2. Negotiate better rates with old processor
**If keeping new processor**:
1. Add caching layer (reduce latency <2 sec)
2. Implement async checkout
3. Load test at 3x peak
**Validation**:
- A/B test: Old processor vs new+caching
- Monitor: Latency p95, conversion rate
- Success: Conversion >4.5%, latency <2 sec
---
### 7. Limitations
**Data limitations**:
- No abandonment reason tracking
- Processor metrics limited (black box)
**Analysis limitations**:
- Didn't test all latency optimizations
- Small sample for some device types
**Generalizability**:
- Specific to our checkout flow
- May not apply to other traffic patterns
---
## Quality Checklist
Before finalizing root cause analysis, verify:
**Effect Definition**:
- [ ] Effect clearly described (not vague)
- [ ] Quantified with magnitude
- [ ] Timeline established (when started, duration)
- [ ] Baseline for comparison stated
**Hypothesis Generation**:
- [ ] Multiple hypotheses generated (3+ competing)
- [ ] Used systematic techniques (5 Whys, Fishbone, Timeline, Differential)
- [ ] Proximate vs root distinguished
- [ ] Confounders identified
**Causal Model**:
- [ ] Causal chain mapped (root → proximate → effect)
- [ ] Mechanisms explained for each link
- [ ] Confounding relationships noted
- [ ] Direct vs indirect causation clear
**Evidence Assessment**:
- [ ] Temporal sequence verified (cause before effect)
- [ ] Counterfactual tested (what if cause absent)
- [ ] Mechanism explained with supporting evidence
- [ ] Dose-response checked (more cause → more effect)
- [ ] Consistency assessed (holds across contexts)
- [ ] Confounders ruled out or controlled
**Conclusion**:
- [ ] Root cause clearly stated
- [ ] Confidence level stated with justification
- [ ] Distinguished from proximate causes/symptoms
- [ ] Alternative explanations acknowledged
- [ ] Limitations noted
**Recommendations**:
- [ ] Address root cause (not just symptoms)
- [ ] Actionable interventions proposed
- [ ] Validation tests specified
- [ ] Success metrics defined
**Minimum Standards**:
- For medium-stakes (postmortems, feature issues): Score ≥ 3.5 on rubric
- For high-stakes (infrastructure, safety): Score ≥ 4.0 on rubric
- Red flags: <3 on Temporal Sequence, Counterfactual, or Mechanism = weak causal claim