{
  "name": "Causal Inference & Root Cause Analysis Quality Rubric",
  "scale": {
    "min": 1,
    "max": 5,
    "description": "1=Poor, 2=Fair, 3=Good, 4=Very Good, 5=Excellent"
  },
  "criteria": [
    {
      "name": "Effect Definition Clarity",
      "description": "Effect/outcome is clearly defined, quantified, and temporally bounded",
      "scoring": {
        "1": "Effect vaguely described (e.g., 'things are slow'), no quantification or timeline",
        "2": "Effect described but lacks quantification or timeline details",
        "3": "Effect clearly described with either quantification or timeline",
        "4": "Effect clearly described with quantification and timeline, baseline comparison present",
        "5": "Effect precisely quantified with magnitude, timeline, baseline, and impact assessment"
      }
    },
    {
      "name": "Hypothesis Generation",
      "description": "Multiple competing hypotheses generated systematically (not just confirming first theory)",
      "scoring": {
        "1": "Single hypothesis stated without alternatives",
        "2": "2 hypotheses mentioned, one clearly favored without testing",
        "3": "3+ hypotheses listed, some testing of alternatives",
        "4": "Multiple hypotheses systematically generated using techniques (5 Whys, Fishbone, etc.)",
        "5": "Comprehensive hypothesis generation with proximate/root causes distinguished and confounders identified"
      }
    },
    {
      "name": "Root Cause Identification",
      "description": "Distinguishes root cause from proximate causes and symptoms",
      "scoring": {
        "1": "Confuses symptom with cause (e.g., 'app crashed because server returned error')",
        "2": "Identifies proximate cause but claims it as root without deeper investigation",
        "3": "Distinguishes proximate from root cause, but mechanism unclear",
        "4": "Clear root cause identified with explanation of why it's root (not symptom)",
        "5": "Root cause clearly identified with full causal chain from root → proximate → effect"
      }
    },
    {
      "name": "Causal Model Quality",
      "description": "Causal relationships mapped with mechanisms, confounders noted",
      "scoring": {
        "1": "No causal model, just list of correlations",
        "2": "Basic cause → effect stated without mechanisms or confounders",
        "3": "Causal chain sketched, mechanism mentioned but not detailed",
        "4": "Clear causal chain with mechanisms explained and confounders identified",
        "5": "Comprehensive causal model with chains, mechanisms, confounders, mediators/moderators mapped"
      }
    },
    {
      "name": "Temporal Sequence Verification",
      "description": "Verified that cause precedes effect (necessary for causation)",
      "scoring": {
        "1": "No temporal analysis, timeline unclear",
        "2": "Timeline mentioned but not used to test causation",
        "3": "Temporal sequence checked for main hypothesis",
        "4": "Temporal sequence verified for all hypotheses, rules out reverse causation",
        "5": "Detailed timeline analysis shows cause clearly precedes effect with lag explained"
      }
    },
    {
      "name": "Counterfactual Testing",
      "description": "Tests 'what if cause absent?' using control groups, rollbacks, or baseline comparisons",
      "scoring": {
        "1": "No counterfactual reasoning",
        "2": "Counterfactual mentioned but not tested",
        "3": "Basic counterfactual test (e.g., before/after comparison)",
        "4": "Strong counterfactual test (e.g., control group, rollback experiment, A/B test)",
        "5": "Multiple counterfactual tests with consistent results strengthening causal claim"
      }
    },
    {
      "name": "Mechanism Explanation",
      "description": "Explains HOW cause produces effect (not just THAT they correlate)",
      "scoring": {
        "1": "No mechanism, just correlation stated",
        "2": "Vague mechanism ('X affects Y somehow')",
        "3": "Basic mechanism explained ('X causes Y because...')",
        "4": "Clear mechanism with pathway and intermediate steps",
        "5": "Detailed mechanism with supporting evidence (logs, metrics, theory) and plausibility assessment"
      }
    },
    {
      "name": "Confounding Control",
      "description": "Identifies and controls for confounding variables (third factors causing both X and Y)",
      "scoring": {
        "1": "No mention of confounding, assumes correlation = causation",
        "2": "Aware of confounding but doesn't identify specific confounders",
        "3": "Identifies 1-2 potential confounders but doesn't control for them",
        "4": "Identifies confounders and attempts to control (stratification, regression, matching)",
        "5": "Comprehensive confounder identification with rigorous control methods and sensitivity analysis"
      }
    },
    {
      "name": "Evidence Quality & Strength",
      "description": "Uses high-quality evidence (experiments > observational > anecdotes) and assesses strength systematically",
      "scoring": {
        "1": "Relies solely on anecdotes or single observations",
        "2": "Uses weak evidence (cross-sectional correlation) without acknowledging limits",
        "3": "Uses moderate evidence (longitudinal data, multiple observations)",
        "4": "Uses strong evidence (quasi-experiments, well-controlled studies) with strength assessed",
        "5": "Uses highest-quality evidence (RCTs, multiple converging lines of evidence) with Bradford Hill criteria or similar framework"
      }
    },
    {
      "name": "Confidence & Limitations",
      "description": "States confidence level with justification, acknowledges alternative explanations and uncertainties",
      "scoring": {
        "1": "Overconfident claims without justification, no alternatives considered",
        "2": "States conclusion without confidence level or uncertainty",
        "3": "Mentions confidence level and 1 limitation",
        "4": "States justified confidence level, acknowledges alternatives and key limitations",
        "5": "Explicit confidence assessment with justification, comprehensive limitations, alternative explanations evaluated, unresolved uncertainties noted"
      }
    }
  ],
  "overall_assessment": {
    "thresholds": {
      "excellent": "Average score ≥ 4.5 (publication-quality causal analysis)",
      "very_good": "Average score ≥ 4.0 (high-stakes decisions - major product/engineering changes)",
      "good": "Average score ≥ 3.5 (medium-stakes decisions - feature launches, incident postmortems)",
      "acceptable": "Average score ≥ 3.0 (low-stakes decisions - exploratory analysis, hypothesis generation)",
      "needs_rework": "Average score < 3.0 (insufficient for decision-making, redo analysis)"
    },
    "stakes_guidance": {
      "low_stakes": "Exploratory root cause analysis, hypothesis generation: aim for ≥ 3.0",
      "medium_stakes": "Incident postmortems, feature failure analysis, process improvements: aim for ≥ 3.5",
      "high_stakes": "Major architectural decisions, safety-critical systems, policy evaluation: aim for ≥ 4.0"
    }
  },
  "common_failure_modes": [
    "Correlation-causation fallacy: Assuming X causes Y just because they correlate",
    "Post hoc ergo propter hoc: 'After this, therefore because of this' - temporal sequence ≠ causation",
    "Stopping at proximate cause: Identifying immediate trigger without tracing to root",
    "Cherry-picking evidence: Only considering evidence that confirms initial hypothesis",
    "Ignoring confounders: Not considering third variables that cause both X and Y",
    "No mechanism: Claiming causation without explaining how X produces Y",
    "Reverse causation: Assuming X causes Y when actually Y causes X",
    "Single-case fallacy: Generalizing from one observation without testing consistency"
  ],
  "usage_instructions": "Rate each criterion on 1-5 scale. Calculate average. For important decisions (postmortems, product changes), minimum score is 3.5. For high-stakes decisions (infrastructure, safety, policy), aim for ≥4.0. Red flags: score <3 on Temporal Sequence, Counterfactual Testing, or Mechanism Explanation means causal claim is weak. Red flag on Confounding Control means correlation may be spurious."
}
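The scoring procedure described in `usage_instructions` can be applied mechanically. A minimal sketch: the thresholds and red-flag criteria below are taken from the rubric itself, while the `assess` helper and the example score dictionary are illustrative, not part of the rubric.

```python
# Sketch: score an analysis against the rubric above.
# Thresholds mirror overall_assessment.thresholds; red-flag criteria
# mirror usage_instructions (full criterion names used throughout).

RED_FLAG_CRITERIA = {
    "Temporal Sequence Verification",
    "Counterfactual Testing",
    "Mechanism Explanation",
    "Confounding Control",   # flags possibly spurious correlation
}

def assess(scores):
    """scores maps each criterion name to a 1-5 rating."""
    avg = sum(scores.values()) / len(scores)
    if avg >= 4.5:
        band = "excellent"
    elif avg >= 4.0:
        band = "very_good"
    elif avg >= 3.5:
        band = "good"
    elif avg >= 3.0:
        band = "acceptable"
    else:
        band = "needs_rework"
    red_flags = sorted(c for c in RED_FLAG_CRITERIA if scores.get(c, 5) < 3)
    return {"average": round(avg, 2), "band": band, "red_flags": red_flags}

# Hypothetical example ratings, not taken from any real analysis:
example = {
    "Effect Definition Clarity": 5,
    "Hypothesis Generation": 4,
    "Root Cause Identification": 4,
    "Causal Model Quality": 4,
    "Temporal Sequence Verification": 5,
    "Counterfactual Testing": 2,   # below 3: causal claim is weak
    "Mechanism Explanation": 4,
    "Confounding Control": 3,
    "Evidence Quality & Strength": 3,
    "Confidence & Limitations": 4,
}
print(assess(example))
```

Note that a good average can coexist with a red flag, which is exactly why the rubric tracks both.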
# Root Cause Analysis: Database Query Performance Degradation

## Executive Summary

Database query latency increased 10x (p95: 50ms → 500ms) starting March 10th, impacting all API endpoints. Root cause: a database migration that inadvertently dropped indexes on frequently queried columns. Confidence: High (95%+). Resolution: restore indexes + optimize query patterns. Time to fix: 2 hours.

---

## 1. Effect Definition

**What happened**: Database query latency increased dramatically across all migrated tables

**Quantification**:
- **Baseline**: p50: 10ms, p95: 50ms, p99: 100ms (stable for 6 months)
- **Current**: p50: 100ms, p95: 500ms, p99: 2000ms (10x increase at p95)
- **Absolute increase**: +450ms at p95
- **Relative increase**: 900% at p95

**Timeline**:
- **First observed**: March 10th, 2:15 PM UTC
- **Duration**: Ongoing (March 10-12, 48 hours elapsed)
- **Baseline period**: Jan 1 - March 9 (stable)
- **Degradation start**: Exact timestamp March 10th 14:15:22 UTC

**Impact**:
- **Users affected**: All users (100% of traffic)
- **API endpoints affected**: All endpoints (database-dependent)
- **Severity**: High
  - 25% of API requests timing out (>5 sec)
  - User-visible page load delays
  - Support tickets increased 3x
  - Estimated revenue impact: $50k/day from abandoned transactions

**Context**:
- Database: PostgreSQL 14.7, 500GB data
- Application: REST API (Node.js), 10k req/min
- Recent changes: Database migration deployed March 10th 2:00 PM

---
## 2. Competing Hypotheses

### Hypothesis 1: Database migration introduced inefficient schema

- **Type**: Root cause candidate
- **Evidence for**:
  - **Timing**: Migration deployed March 10 2:00 PM, degradation started 2:15 PM (15 min after)
  - **Perfect temporal match**: Strongest temporal correlation
  - **Migration contents**: Added new columns, restructured indexes
- **Evidence against**:
  - None found; all available evidence supports this hypothesis

### Hypothesis 2: Traffic spike overloaded database

- **Type**: Confounder / alternative explanation
- **Evidence for**:
  - March 10 is typically a high-traffic day (weekend effect)
- **Evidence against**:
  - **Traffic unchanged**: Monitoring shows traffic at 10k req/min (same as baseline)
  - **No traffic spike at 2:15 PM**: Traffic flat throughout March 10
  - **Previous high-traffic days handled fine**: Traffic has been higher (15k req/min) without issues

### Hypothesis 3: Database server resource exhaustion

- **Type**: Proximate cause / symptom
- **Evidence for**:
  - CPU usage increased from 30% → 80% at 2:15 PM
  - Disk I/O increased from 100 IOPS → 5000 IOPS
  - Connection pool near saturation (95/100 connections)
- **Evidence against**:
  - **These are symptoms, not root**: Something CAUSED the increased resource usage
  - Resource exhaustion doesn't explain WHY queries became slow

### Hypothesis 4: Slow query introduced by application code change

- **Type**: Proximate cause candidate
- **Evidence for**:
  - Application deploy on March 9th (1 day prior)
- **Evidence against**:
  - **Timing mismatch**: Deploy was 24 hours before degradation
  - **No code changes to query logic**: Deploy only changed UI
  - **Query patterns unchanged**: Same queries, same frequency

### Hypothesis 5: Database server hardware issue

- **Type**: Alternative explanation
- **Evidence for**:
  - Occasional disk errors in system logs
- **Evidence against**:
  - **Disk errors present before March 10**: Noise, not new
  - **No hardware alerts**: Monitoring shows no hardware degradation
  - **Sudden onset**: Hardware failures are typically gradual

**Most likely root cause**: Database migration (Hypothesis 1)

**Confounders ruled out**:
- Traffic unchanged (Hypothesis 2)
- Application code unchanged (Hypothesis 4)
- Hardware stable (Hypothesis 5)

**Symptoms identified**:
- Resource exhaustion (Hypothesis 3) is a symptom, not the root cause

---
## 3. Causal Model

### Causal Chain: Root → Proximate → Effect

```
ROOT CAUSE:
Database migration removed indexes on user_id + created_at columns
(March 10, 2:00 PM deployment)
↓ (mechanism: queries now do full table scans instead of index scans)

INTERMEDIATE CAUSE:
Every query on users table must scan entire table (5M rows)
instead of using index (10-1000 rows)
↓ (mechanism: table scans require disk I/O, CPU cycles)

PROXIMATE CAUSE:
Database CPU at 80%, disk I/O at 5000 IOPS (50x increase)
Query execution time 10x slower
↓ (mechanism: queries queue up, connection pool saturates)

OBSERVED EFFECT:
API endpoints slow (p95: 500ms vs 50ms baseline)
25% of requests timeout
Users experience page load delays
```

### Why March 10 2:15 PM specifically?

- **Migration deployed**: March 10 2:00 PM
- **Migration applied**: 2:00-2:10 PM (10 min to run schema changes)
- **First slow queries**: 2:15 PM (first queries after migration completed)
- **5-minute lag**: Time for connection pool to cycle and pick up new schema

### Missing Index Details

**Migration removed these indexes**:
```sql
-- BEFORE (efficient):
CREATE INDEX idx_users_user_id_created_at ON users(user_id, created_at);
CREATE INDEX idx_transactions_user_id ON transactions(user_id);

-- AFTER (inefficient):
-- (indexes removed by mistake in migration)
```

**Impact**:
```sql
-- Common query pattern:
SELECT * FROM users WHERE user_id = 123 AND created_at > '2024-01-01';

-- BEFORE (with index): 5ms (index scan, 10 rows)
-- AFTER (without index): 500ms (full table scan, 5M rows)
```
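The 5ms-vs-500ms gap is consistent with a simple linear model of scan cost. A rough sketch, calibrated to the figures quoted above (5M rows, 500ms full scan); the fixed per-query overhead is an assumed constant, not a measured value:

```python
# Rough latency model: scan time is ~linear in rows touched.
# Calibrate per-row cost from the observed full scan (5M rows -> 500 ms),
# then predict the index-scan case (~10 matching rows).

SCAN_MS_PER_ROW = 500 / 5_000_000   # calibrated: 0.0001 ms per row

def query_ms(rows_touched, fixed_overhead_ms=4.0):
    # fixed_overhead_ms (parse/plan/network) is an assumed constant
    return fixed_overhead_ms + rows_touched * SCAN_MS_PER_ROW

full_scan = query_ms(5_000_000)   # no index: touch every row
index_scan = query_ms(10)         # with index: ~10 matching rows

print(f"full scan ~{full_scan:.0f} ms, index scan ~{index_scan:.1f} ms")
```

The model reproduces the quoted order of magnitude: roughly 500ms without the index versus single-digit milliseconds with it.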
### Confounders Ruled Out

**No confounding variables found**:
- **Traffic**: Controlled (unchanged)
- **Hardware**: Controlled (stable)
- **Code**: Controlled (no changes to queries)
- **External dependencies**: Controlled (no changes)

**Only variable that changed**: Database schema (migration)

---

## 4. Evidence Assessment

### Temporal Sequence: ✓ PASS (5/5)

**Timeline**:
```
March 9, 3:00 PM: Application deploy (UI changes only, no queries changed)
March 10, 2:00 PM: Database migration starts
March 10, 2:10 PM: Migration completes
March 10, 2:15 PM: First slow queries logged (p95: 500ms)
March 10, 2:20 PM: Alerting fires (p95 exceeds 200ms threshold)
```

**Verdict**: ✓ Cause (migration) clearly precedes effect (slow queries) by 5-15 minutes

**Why the 5-minute lag?**
- Connection pool refresh time
- Gradual connection cycling to the new schema
- The first slow queries at 2:15 PM came from connections that had picked up the new schema
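The precedence check above can be made mechanical. A small sketch using the timestamps from the timeline (the year is assumed, since the document gives only month and day):

```python
# Verify that the cause precedes the effect using the timeline timestamps.
from datetime import datetime

events = {
    "app_deploy":       datetime(2024, 3, 9, 15, 0),   # UI-only deploy
    "migration_start":  datetime(2024, 3, 10, 14, 0),
    "migration_done":   datetime(2024, 3, 10, 14, 10),
    "first_slow_query": datetime(2024, 3, 10, 14, 15),
    "alert_fired":      datetime(2024, 3, 10, 14, 20),
}

def precedes(cause, effect):
    return events[cause] < events[effect]

assert precedes("migration_done", "first_slow_query")
lag = events["first_slow_query"] - events["migration_done"]
print(f"lag between migration completion and first slow query: {lag}")
```

The same helper also rules out reverse causation: the slow queries never appear before the migration completes.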
---

### Strength of Association: ✓ PASS (5/5)

**Correlation strength**: Very strong (r = 0.99)

**Evidence**:
- **Before migration**: p95 latency stable at 50ms (6 months)
- **Immediately after migration**: p95 latency jumped to 500ms
- **10x increase**: Large effect size
- **All queries on migrated tables affected**: Every query against the migrated tables slowed, while control tables were unchanged

**Statistical significance**: p < 0.001 (highly significant)
- Comparing 1000 queries before (mean: 50ms) vs 1000 queries after (mean: 500ms)
- Effect size: Cohen's d = 5.2 (very large)
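For reference, the quoted effect size can be reproduced from the summary statistics. The pooled standard deviation is not reported above, so the ~87ms value here is an assumption chosen to match the stated d = 5.2:

```python
# Cohen's d for the before/after latency comparison.
# Means are from the text (50 ms vs 500 ms, n=1000 each); the pooled
# standard deviation (~87 ms) is an assumed value, not a reported one.

def cohens_d(mean_a, mean_b, pooled_sd):
    return abs(mean_b - mean_a) / pooled_sd

d = cohens_d(mean_a=50.0, mean_b=500.0, pooled_sd=86.5)
print(f"Cohen's d ~ {d:.1f}")
```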
---

### Dose-Response: ✓ PASS (4/5)

**Gradient observed**:
- **Table size vs latency**:
  - Small tables (<10k rows): 20ms → 50ms (2.5x increase)
  - Medium tables (100k rows): 50ms → 200ms (4x increase)
  - Large tables (5M rows): 50ms → 500ms (10x increase)

**Mechanism**: Larger tables → more rows to scan → longer queries

**Interpretation**: Clear dose-response relationship strengthens causation
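The gradient above can be verified directly from the quoted numbers with a monotonicity check on the before/after ratios:

```python
# Dose-response check: does the slowdown grow with table size?
# Figures are the before/after latencies quoted above.

observations = [
    # (rows, before_ms, after_ms)
    (10_000,    20, 50),
    (100_000,   50, 200),
    (5_000_000, 50, 500),
]

ratios = [after / before for _, before, after in observations]
assert ratios == sorted(ratios), "slowdown should increase with table size"
print(ratios)   # [2.5, 4.0, 10.0]
```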
---

### Counterfactual Test: ✓ PASS (5/5)

**Counterfactual question**: "What if the migration hadn't been deployed?"

**Test 1: Rollback Experiment** (strongest evidence)
- **Action**: Rolled back database migration March 11, 9:00 AM
- **Result**: Latency immediately returned to baseline (p95: 55ms)
- **Conclusion**: ✓ Removing the cause eliminates the effect (strong causation)

**Test 2: Control Query**
- **Tested**: Queries on tables NOT affected by migration (no index changes)
- **Result**: Latency unchanged (p95: 50ms before and after migration)
- **Conclusion**: ✓ Only migrated tables affected (specificity)

**Test 3: Historical Comparison**
- **Baseline period**: Jan-March 9 (no migration), p95: 50ms
- **Degradation period**: March 10-11 (migration active), p95: 500ms
- **Post-rollback**: March 11+ (migration reverted), p95: 55ms
- **Conclusion**: ✓ Pattern strongly implicates the migration

**Verdict**: Counterfactual tests strongly confirm causation
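The three comparison periods form a simple on/off pattern that can be checked programmatically; the p95 figures and the 200ms threshold are the ones reported in this document:

```python
# Counterfactual pattern: effect present only while the cause is present.
# p95 latencies per period and the 200 ms alert threshold are from the text.

p95 = {
    "baseline (no migration)":   50,
    "degraded (migration live)": 500,
    "post-rollback":             55,
}
cause_present = {
    "baseline (no migration)":   False,
    "degraded (migration live)": True,
    "post-rollback":             False,
}

THRESHOLD_MS = 200   # alerting threshold from the timeline

consistent = all((latency > THRESHOLD_MS) == cause_present[period]
                 for period, latency in p95.items())
print("effect tracks cause across all three periods:", consistent)
```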
---

### Mechanism: ✓ PASS (5/5)

**HOW the migration caused slow queries**:

1. **Migration removed indexes**:
   ```sql
   -- Migration accidentally dropped these indexes:
   DROP INDEX idx_users_user_id_created_at;
   DROP INDEX idx_transactions_user_id;
   ```

2. **Query planner changed strategy**:
   ```
   BEFORE (with index):
   EXPLAIN SELECT * FROM users WHERE user_id = 123 AND created_at > '2024-01-01';
   → Index Scan using idx_users_user_id_created_at (cost=0.43..8.45 rows=1)

   AFTER (without index):
   EXPLAIN SELECT * FROM users WHERE user_id = 123 AND created_at > '2024-01-01';
   → Seq Scan on users (cost=0.00..112000.00 rows=5000000)
   ```

3. **Full table scans require disk I/O**:
   - Index scan: Read 10-1000 rows (1-100 KB) from index + data pages
   - Full table scan: Read 5M rows (5 GB) from disk
   - **50x-500x more I/O**

4. **Increased I/O saturates CPU & disk**:
   - CPU: Scanning rows, filtering predicates (30% → 80%)
   - Disk: Reading table pages (100 IOPS → 5000 IOPS)

5. **Saturation causes queuing**:
   - Slow queries → connections held longer
   - Connection pool saturates (95/100)
   - New queries wait for available connections
   - Latency compounds

**Plausibility**: Very high
- **Established theory**: Index scans vs table scans (well-known database optimization)
- **Quantifiable impact**: Can calculate I/O difference (50x-500x)
- **Reproducible**: Same pattern in staging environment

**Supporting evidence**:
- EXPLAIN ANALYZE output shows table scans post-migration
- PostgreSQL logs show "sequential scan" warnings
- Disk I/O metrics show 50x increase correlated with migration

**Verdict**: ✓ Mechanism fully explained with strong supporting evidence

---
### Consistency: ✓ PASS (5/5)

**Relationship holds across contexts**:

1. **All affected tables show same pattern**:
   - users table: 50ms → 500ms
   - transactions table: 30ms → 300ms
   - orders table: 40ms → 400ms
   - **Consistent 10x degradation**

2. **All query types affected**:
   - SELECT: 10x slower
   - JOIN: 10x slower
   - Aggregations (COUNT, SUM): 10x slower

3. **Consistent across all environments**:
   - Production: 50ms → 500ms
   - Staging: 45ms → 450ms (when migration tested)
   - Dev: 40ms → 400ms

4. **Consistent across time**:
   - March 10 14:15 - March 11 9:00: p95 at 500ms
   - Every hour during this period: ~500ms (stable degradation)

5. **Replication**:
   - Tested in staging: Same migration → same 10x degradation
   - Rollback in staging: Latency restored
   - **Reproducible causal relationship**

**Verdict**: ✓ Extremely consistent pattern across tables, query types, environments, and time

---
### Confounding Control: ✓ PASS (4/5)

**Potential confounders identified and ruled out**:

#### Confounder 1: Traffic Spike

**Could a traffic spike explain both**:
- The timing of the degradation (coincidence with the migration)
- The slow queries (overload)

**Ruled out**:
- Traffic monitoring shows a flat 10k req/min (no spike)
- Even if traffic had spiked, it wouldn't explain why the migration rollback fixed the issue

#### Confounder 2: Hardware Degradation

**Could a hardware issue explain both**:
- The migration coincidentally deploying during the degradation
- The slow queries (hardware bottleneck)

**Ruled out**:
- Hardware metrics stable (CPU headroom, no disk errors)
- Rollback immediately fixed latency (hardware didn't suddenly improve)

#### Confounder 3: Application Code Change

**Could a code change explain both**:
- Buggy queries
- The migration deploying at the same time as the code

**Ruled out**:
- Code deploy was March 9 (24 hrs before degradation)
- No query changes in the code deploy (only UI)
- Rolling back the migration (not the code) fixed the issue

**Controlled variables**:
- ✓ Traffic (flat during period)
- ✓ Hardware (stable metrics)
- ✓ Code (no query changes)
- ✓ External dependencies (no changes)

**Verdict**: ✓ No confounders found; only the migration variable changed

---
### Bradford Hill Criteria: 25/27 (Very Strong)

| Criterion | Score | Justification |
|-----------|-------|---------------|
| **1. Strength** | 3/3 | 10x latency increase = very strong association |
| **2. Consistency** | 3/3 | Consistent across tables, queries, environments, time |
| **3. Specificity** | 3/3 | Only migrated tables affected; rollback restores; specific cause → specific effect |
| **4. Temporality** | 3/3 | Migration clearly precedes degradation by 5-15 min |
| **5. Dose-response** | 3/3 | Larger tables → greater latency increase (clear gradient) |
| **6. Plausibility** | 3/3 | Index vs table scan theory well-established |
| **7. Coherence** | 3/3 | Fits database optimization knowledge, no contradictions |
| **8. Experiment** | 3/3 | Rollback experiment: removing cause eliminates effect |
| **9. Analogy** | 1/3 | Similar patterns exist (missing indexes → slow queries) but not a perfect analogy |

**Total**: 25/27 = **Very strong causal evidence**

**Interpretation**: All criteria met except weak analogy (not critical). Strong case for causation.
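The 25/27 total can be reproduced by summing the table; the snake_case keys below are shorthand for the nine criteria above:

```python
# Tally the Bradford Hill scores from the table (each criterion 0-3).
scores = {
    "strength": 3, "consistency": 3, "specificity": 3,
    "temporality": 3, "dose_response": 3, "plausibility": 3,
    "coherence": 3, "experiment": 3, "analogy": 1,
}
total, maximum = sum(scores.values()), 3 * len(scores)
print(f"{total}/{maximum}")   # 25/27
```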
---

## 5. Conclusion

### Most Likely Root Cause

**Root cause**: Database migration removed indexes on `user_id` and `created_at` columns, forcing the query planner to use full table scans instead of efficient index scans.

**Confidence level**: **High (95%+)**

**Reasoning**:
1. **Perfect temporal sequence**: Migration (2:00 PM) → degradation (2:15 PM)
2. **Strong counterfactual test**: Rollback immediately restored performance
3. **Clear mechanism**: Index scans (fast) → table scans (slow) with 50x-500x more I/O
4. **Dose-response**: Larger tables show greater degradation
5. **Consistency**: Pattern holds across all tables, queries, environments
6. **No confounders**: Traffic, hardware, code all controlled
7. **Bradford Hill 25/27**: Very strong causal evidence
8. **Reproducible**: Same effect in staging environment

**Why this is the root cause (not a symptom)**:
- **Missing indexes** are the fundamental issue
- **High CPU/disk I/O** is a symptom of table scans
- **Slow queries** are a symptom of high resource usage
- Fixing the missing indexes eliminates all downstream symptoms

**Causal Mechanism**:
```
Missing indexes (root)
↓
Query planner uses table scans (mechanism)
↓
50x-500x more disk I/O (mechanism)
↓
CPU & disk saturation (symptom)
↓
Query queuing, connection pool saturation (symptom)
↓
10x latency increase (observed effect)
```

---
### Alternative Explanations (Ruled Out)

#### Alternative 1: Traffic Spike

**Why less likely**:
- Traffic monitoring shows flat 10k req/min (no spike)
- Previous traffic spikes (15k req/min) were handled without issue
- Rollback fixed latency without any change in traffic

#### Alternative 2: Hardware Degradation

**Why less likely**:
- Hardware metrics stable (no degradation)
- Sudden onset is inconsistent with hardware failure (usually gradual)
- Rollback immediately fixed the issue (hardware didn't change)

#### Alternative 3: Application Code Bug

**Why less likely**:
- Code deploy was 24 hours before degradation (timing mismatch)
- No query logic changes in the deploy
- Rolling back the migration (not the code) fixed the issue

---

### Unresolved Questions

1. **Why were the indexes removed?**
   - Migration script error? (likely)
   - Intentional optimization attempt gone wrong? (possible)
   - Need to review the migration PR and approval process

2. **How did this pass review?**
   - Were the indexes removed intentionally or accidentally?
   - Was the migration tested in staging before production?
   - Need process improvement

3. **Why didn't pre-deploy testing catch this?**
   - Staging environment testing missed it
   - Query performance tests were insufficient
   - Need better pre-deploy validation

---
## 6. Recommended Actions

### Immediate Actions (Address Root Cause)

**1. Re-add missing indexes** (DONE - March 11, 9:00 AM)
```sql
CREATE INDEX idx_users_user_id_created_at
ON users(user_id, created_at);

CREATE INDEX idx_transactions_user_id
ON transactions(user_id);
```
- **Result**: Latency restored to 55ms (within 5ms of baseline)
- **Time to fix**: 15 minutes (index creation)

**2. Validate index coverage** (IN PROGRESS)
- Audit all tables for missing indexes
- Compare production indexes to staging/dev
- Document expected indexes per table

### Preventive Actions (Process Improvements)

**1. Improve migration review process**
- **Require EXPLAIN ANALYZE before/after** for all migrations
- **Staging performance tests mandatory** (query latency benchmarks)
- **Index change review**: Any index drop requires extra approval

**2. Add pre-deploy validation**
- Automated query performance regression tests
- Alert if any query is >2x slower in staging
- Block deployment if performance degrades >20%
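The blocking rule could be sketched as a simple gate function in CI; the function name and return values are illustrative, not an existing integration:

```python
# Sketch of a pre-deploy performance gate: compare staging p95 latency
# before and after a migration, and block the deploy on regression.
# The 20% threshold mirrors the rule stated above.

def gate(before_p95_ms, after_p95_ms):
    change = after_p95_ms / before_p95_ms - 1.0
    if change > 0.20:
        return "block"   # degradation > 20% blocks the deploy
    return "pass"

assert gate(50, 55) == "pass"    # +10%: within tolerance
assert gate(50, 500) == "block"  # the March 10 regression would be caught
print("gate checks passed")
```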
**3. Improve monitoring & alerting**
- Alert on index usage changes (track `pg_stat_user_indexes`)
- Alert on query plan changes (seq scan warnings)
- Dashboards for index hit rate, table scan frequency

**4. Database migration checklist**
- [ ] EXPLAIN ANALYZE on affected queries
- [ ] Staging performance tests passed
- [ ] Index usage reviewed
- [ ] Rollback plan documented
- [ ] Monitoring in place

### Validation Tests (Confirm Fix)

**1. Performance benchmark** (DONE)
- **Test**: Run 1000 queries pre-fix vs post-fix
- **Result**:
  - Pre-fix (migration): p95 = 500ms
  - Post-fix (indexes restored): p95 = 55ms
- **Conclusion**: ✓ Fix successful

**2. Load test** (DONE)
- **Test**: 15k req/min (1.5x normal traffic)
- **Result**: p95 = 60ms (acceptable, <10% degradation)
- **Conclusion**: ✓ System can handle load with indexes

**3. Index usage monitoring** (ONGOING)
- **Metrics**: `pg_stat_user_indexes` shows indexes being used
- **Query plans**: EXPLAIN shows index scans (not seq scans)
- **Conclusion**: ✓ Indexes actively used

---
### Success Criteria

**Performance restored**:
- [x] p95 latency <100ms (achieved: 55ms)
- [x] p99 latency <200ms (achieved: 120ms)
- [x] CPU usage <50% (achieved: 35%)
- [x] Disk I/O <500 IOPS (achieved: 150 IOPS)
- [x] Connection pool utilization <70% (achieved: 45%)

**User impact resolved**:
- [x] Timeout rate <1% (achieved: 0.2%)
- [x] Support tickets normalized (dropped 80%)
- [x] Page load times back to normal

**Process improvements**:
- [x] Migration checklist created
- [x] Performance regression tests added to CI/CD
- [ ] Post-mortem doc written (IN PROGRESS)
- [ ] Team training on index optimization (SCHEDULED)

---
|
||||
|
||||
## 7. Limitations
|
||||
|
||||
### Data Limitations
|
||||
|
||||
**Missing data**:
|
||||
- **No query performance baselines in staging**: Can't compare staging pre/post migration
|
||||
- **Limited historical index usage data**: `pg_stat_user_indexes` only has 7 days retention
|
||||
- **No migration testing logs**: Can't determine if migration was tested in staging
|
||||
|
||||
**Measurement limitations**:
|
||||
- **Latency measured at application layer**: Database-internal latency not tracked separately
|
||||
- **No per-query latency breakdown**: Can't isolate which specific queries most affected
|
||||
|
||||
### Analysis Limitations
|
||||
|
||||
**Assumptions**:
|
||||
- **Assumed connection pool refresh time**: Estimated 5 min for connections to cycle to new schema (not measured)
|
||||
- **Didn't test other potential optimizations**: Only tested rollback, not alternative fixes (e.g., query rewriting)
|
||||
|
||||
**Alternative fixes not explored**:
|
||||
- Could queries be rewritten to work without indexes? (possible but not investigated)
|
||||
- Could connection pool be increased? (wouldn't fix root cause)
|
||||
|
||||
### Generalizability
|
||||
|
||||
**Context-specific**:
|
||||
- This analysis applies to PostgreSQL databases with similar query patterns
|
||||
- May not apply to other database systems (MySQL, MongoDB, etc.) with different query optimizers
|
||||
- Specific to tables with millions of rows (small tables less affected)
|
||||
|
||||
**Lessons learned**:
|
||||
- Index removal can cause 10x+ performance degradation for large tables
|
||||
- Migration testing in staging must include performance benchmarks
|
||||
- Rollback plans essential for database schema changes
---

## 8. Meta: Analysis Quality Self-Assessment

Using `rubric_causal_inference_root_cause.json`:

### Scores:

1. **Effect Definition Clarity**: 5/5 (precise quantification, timeline, baseline, impact)
2. **Hypothesis Generation**: 5/5 (5 hypotheses, systematic evaluation)
3. **Root Cause Identification**: 5/5 (root vs proximate distinguished, causal chain clear)
4. **Causal Model Quality**: 5/5 (full chain, mechanisms, confounders noted)
5. **Temporal Sequence Verification**: 5/5 (detailed timeline, lag explained)
6. **Counterfactual Testing**: 5/5 (rollback experiment + control queries)
7. **Mechanism Explanation**: 5/5 (detailed mechanism with EXPLAIN output as evidence)
8. **Confounding Control**: 4/5 (major confounders identified and ruled out)
9. **Evidence Quality & Strength**: 5/5 (quasi-experiment via rollback, Bradford Hill 25/27)
10. **Confidence & Limitations**: 5/5 (explicit confidence, limitations, alternatives evaluated)

**Average**: 4.9/5 - **Excellent** (publication-quality analysis)

**Assessment**: This root cause analysis exceeds the standard for high-stakes engineering decisions. Evidence is strong across all criteria, particularly counterfactual testing (rollback experiment) and mechanism explanation (query plans). Appropriate for postmortem documentation and process improvement decisions.
---

## Appendix: Supporting Evidence

### A. Query Plans Before/After

**BEFORE (with index)**:

```sql
EXPLAIN ANALYZE SELECT * FROM users WHERE user_id = 123 AND created_at > '2024-01-01';

Index Scan using idx_users_user_id_created_at on users
  (cost=0.43..8.45 rows=1 width=152)
  (actual time=0.025..0.030 rows=1 loops=1)
  Index Cond: ((user_id = 123) AND (created_at > '2024-01-01'::date))
Planning Time: 0.112 ms
Execution Time: 0.052 ms ← Fast
```

**AFTER (without index)**:

```sql
EXPLAIN ANALYZE SELECT * FROM users WHERE user_id = 123 AND created_at > '2024-01-01';

Seq Scan on users
  (cost=0.00..112000.00 rows=5000000 width=152)
  (actual time=0.025..485.234 rows=1 loops=1)
  Filter: ((user_id = 123) AND (created_at > '2024-01-01'::date))
  Rows Removed by Filter: 4999999
Planning Time: 0.108 ms
Execution Time: 485.267 ms ← ~9,300x slower
```

### B. Monitoring Metrics

**Latency (p95)**:
- March 9: 50ms (stable)
- March 10 14:00: 50ms (pre-migration)
- March 10 14:15: 500ms (post-migration) ← **10x jump**
- March 11 09:00: 550ms (still degraded)
- March 11 09:15: 55ms (rollback restored)

**Database CPU**:
- Baseline: 30%
- March 10 14:15: 80% ← Spike at migration time
- March 11 09:15: 35% (rollback restored)

**Disk I/O (IOPS)**:
- Baseline: 100 IOPS
- March 10 14:15: 5000 IOPS ← 50x increase
- March 11 09:15: 150 IOPS (rollback restored)
697
skills/causal-inference-root-cause/resources/methodology.md
Normal file
@@ -0,0 +1,697 @@

# Causal Inference & Root Cause Analysis Methodology

## Root Cause Analysis Workflow

Copy this checklist and track your progress:

```
Root Cause Analysis Progress:
- [ ] Step 1: Generate hypotheses using structured techniques
- [ ] Step 2: Build causal model and distinguish cause types
- [ ] Step 3: Gather evidence and assess temporal sequence
- [ ] Step 4: Test causality with Bradford Hill criteria
- [ ] Step 5: Verify root cause and check coherence
```

**Step 1: Generate hypotheses using structured techniques**

Use 5 Whys, Fishbone diagrams, timeline analysis, or differential diagnosis to systematically generate potential causes. See [Hypothesis Generation Techniques](#hypothesis-generation-techniques) for detailed methods.

**Step 2: Build causal model and distinguish cause types**

Map causal chains to distinguish root causes from proximate causes, symptoms, and confounders. See [Causal Model Building](#causal-model-building) for cause type definitions and chain mapping.

**Step 3: Gather evidence and assess temporal sequence**

Collect evidence for each hypothesis and verify temporal relationships (the cause must precede the effect). See [Evidence Assessment Methods](#evidence-assessment-methods) for essential tests and the evidence hierarchy.

**Step 4: Test causality with Bradford Hill criteria**

Score hypotheses using the nine Bradford Hill criteria (strength, consistency, specificity, temporality, dose-response, plausibility, coherence, experiment, analogy). See [Bradford Hill Criteria](#bradford-hill-criteria) for the scoring rubric.

**Step 5: Verify root cause and check coherence**

Ensure the identified root cause has strong evidence, fits with known facts, and addresses systemic issues. See [Quality Checklist](#quality-checklist) and the [Worked Example](#worked-example-website-conversion-drop) for validation techniques.

---

## Hypothesis Generation Techniques

### 5 Whys (Trace Back to Root)

Start with the effect and ask "why" repeatedly until you reach the root cause.

**Process:**
```
Effect: {Observed problem}
Why? → {Proximate cause 1}
Why? → {Proximate cause 2}
Why? → {Intermediate cause}
Why? → {Deeper cause}
Why? → {Root cause}
```

**Example:**
```
Effect: Server crashed
Why? → Out of memory
Why? → Memory leak in new code
Why? → Connection pool not releasing connections
Why? → Error handling missing in new feature
Why? → Code review process didn't catch this edge case (ROOT CAUSE)
```

**When to stop**: When you reach a cause that, if fixed, would prevent recurrence.

**Pitfalls to avoid**:
- Stopping too early at proximate causes
- Following only one path (explore multiple branches)
- Accepting vague answers ("because it broke")
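When recording an RCA programmatically, the why-chain above reduces to an ordered list; a minimal sketch (the structure and function name are ours, the strings are the example's):

```python
# A 5 Whys chain, ordered: observed effect first, deepest answer last.
why_chain = [
    "Server crashed",
    "Out of memory",
    "Memory leak in new code",
    "Connection pool not releasing connections",
    "Error handling missing in new feature",
    "Code review process didn't catch this edge case",
]

def root_cause(chain):
    """The last answer in the chain is the candidate root cause."""
    if len(chain) < 2:
        raise ValueError("need at least an effect and one 'why'")
    return chain[-1]

print(root_cause(why_chain))
```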

---

### Fishbone Diagram (Categorize Causes)

Organize potential causes by category to ensure comprehensive coverage.

**Categories (6 Ms)**:
```
Methods         Machines        Materials
   |               |               |
   |               |               |
   └───────────────┴───────────────┴─────────→ Effect
   |               |               |
   |               |               |
Manpower       Measurement     Environment
```

**Adapted for software/product**:
- **People**: Skills, knowledge, communication, errors
- **Process**: Procedures, workflows, standards, review
- **Technology**: Tools, systems, infrastructure, dependencies
- **Environment**: External factors, timing, context

**Example (API outage)**:
- **People**: On-call engineer inexperienced with the database
- **Process**: No rollback procedure documented
- **Technology**: Database connection pooling misconfigured
- **Environment**: Traffic spike coincided with the deploy

---

### Timeline Analysis (What Changed?)

Map the events leading up to the effect to identify temporal correlations.

**Process**:
1. Establish a baseline period (normal operation)
2. Mark when the effect was first observed
3. List all changes in the days/hours leading up to the effect
4. Identify changes that temporally precede the effect

**Example**:
```
T-7 days: Normal operation (baseline)
T-3 days: Deployed new payment processor integration
T-2 days: Traffic increased 20%
T-1 day:  First error logged (dismissed as fluke)
T-0 (14:00): Conversion rate dropped 30%
T+1 hour: Alert fired, investigation started
```

**Look for**: Changes, anomalies, or events right before the effect (temporal precedence).

**Red flags**:
- Multiple changes at once (hard to isolate the cause)
- Long lag between change and effect (harder to connect)
- No changes before the effect (may be an external factor or accumulated technical debt)

---

### Differential Diagnosis

List all conditions/causes that could produce the observed symptoms.

**Process**:
1. List all possible causes (be comprehensive)
2. For each cause, list the symptoms it would produce
3. Compare predicted symptoms to observed symptoms
4. Eliminate causes that don't match

**Example (API returning 500 errors)**:

| Cause | Predicted Symptoms | Observed? |
|-------|-------------------|-----------|
| Code bug (logic error) | Specific endpoints fail, reproducible | ✓ Yes |
| Resource exhaustion (memory) | All endpoints slow, CPU/memory high | ✗ CPU normal |
| Dependency failure (database) | All database queries fail, connection errors | ✗ DB responsive |
| Configuration error | Service won't start, or immediate failure | ✗ Started fine |
| DDoS attack | High traffic, distributed sources | ✗ Traffic normal |

**Conclusion**: Code bug most likely (matches symptoms; others ruled out).
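The elimination step lends itself to a small sketch (the symptom strings and hypothesis names are invented for illustration):

```python
# Each hypothesis predicts a set of symptoms; keep only the hypotheses
# whose predictions are all consistent with what was actually observed.
observed = {"specific endpoints fail", "reproducible"}

predictions = {
    "code bug": {"specific endpoints fail", "reproducible"},
    "memory exhaustion": {"all endpoints slow", "high cpu"},
    "database down": {"all queries fail", "connection errors"},
}

surviving = [
    cause for cause, symptoms in predictions.items()
    if symptoms <= observed  # every predicted symptom was observed
]
print(surviving)  # hypotheses that survive elimination
```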

---

## Causal Model Building

### Distinguish Cause Types

#### 1. Root Cause
- **Definition**: Fundamental underlying issue
- **Test**: If fixed, the problem doesn't recur
- **Usually**: Structural, procedural, or systemic
- **Example**: "No code review process for infrastructure changes"

#### 2. Proximate Cause
- **Definition**: Immediate trigger
- **Test**: Directly precedes the effect
- **Often**: A symptom of a deeper root cause
- **Example**: "Deployed buggy config file"

#### 3. Contributing Factor
- **Definition**: Makes the effect more likely or more severe
- **Test**: Not sufficient alone, but amplifies
- **Often**: Context or conditions
- **Example**: "High traffic amplified the impact of the bug"

#### 4. Coincidence
- **Definition**: Happened around the same time; no causal relationship
- **Test**: No mechanism; doesn't pass the counterfactual test
- **Example**: "Marketing campaign launched the same day" (but unrelated to the technical issue)

---

### Map Causal Chains

#### Linear Chain
```
Root Cause
  ↓ (mechanism)
Intermediate Cause
  ↓ (mechanism)
Proximate Cause
  ↓ (mechanism)
Observed Effect
```

**Example**:
```
No SSL cert renewal process (Root)
  ↓ (no automated reminders)
Cert expires unnoticed (Intermediate)
  ↓ (HTTPS handshake fails)
HTTPS connections fail (Proximate)
  ↓ (users can't connect)
Users can't access site (Effect)
```

#### Branching Chain (Multiple Paths)
```
        Root Cause
       ↙          ↘
   Path A        Path B
      ↓             ↓
  Effect A      Effect B
```

**Example**:
```
   Database migration error (Root)
       ↙                ↘
Missing indexes      Schema mismatch
      ↓                   ↓
 Slow queries        Type errors
```

#### Common Cause (Confounding)
```
   Confounder (Z)
     ↙       ↘
Variable X   Variable Y
```
X and Y correlate because Z causes both, but X doesn't cause Y.

**Example**:
```
    Hot weather (Confounder)
       ↙            ↘
Ice cream sales   Drownings
```
Ice cream doesn't cause drownings; both are caused by hot weather/summer season.

---

## Evidence Assessment Methods

### Essential Tests

#### 1. Temporal Sequence (Must Pass)
**Rule**: The cause MUST precede the effect. If the effect happened before the "cause," it's not causal.

**How to verify**:
- Create a detailed timeline
- Mark exact timestamps of cause and effect
- Calculate the lag (the effect should follow the cause)

**Example**:
- Migration deployed: March 10, 2:00 PM
- Queries slowed: March 10, 2:15 PM
- ✓ Cause precedes effect by 15 minutes
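The timestamp check is trivial to automate; a sketch using the example's times (the year is assumed):

```python
from datetime import datetime

cause = datetime(2024, 3, 10, 14, 0)    # migration deployed, 2:00 PM
effect = datetime(2024, 3, 10, 14, 15)  # queries slowed, 2:15 PM

lag = effect - cause
assert lag.total_seconds() > 0, "effect precedes 'cause' — not causal"
print(f"cause precedes effect by {lag}")
```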

#### 2. Counterfactual (Strong Evidence)
**Question**: "What would have happened without the cause?"

**How to test**:
- **Control group**: Compare cases with the cause vs without
- **Rollback**: Remove the cause, see if the effect disappears
- **Baseline comparison**: Compare to the period before the cause
- **A/B test**: Randomly assign the cause

**Example**:
- Hypothesis: New feature X causes the conversion drop
- Test: A/B test with X enabled vs disabled
- Result: No conversion difference → X is not the cause
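An A/B comparison reduces to comparing two proportions; a minimal sketch (the counts are invented, and a real analysis would add a significance test):

```python
def conversion_rate(conversions, visitors):
    return conversions / visitors

# Hypothetical A/B test: feature X disabled (control) vs enabled (treatment).
control = conversion_rate(480, 10_000)    # X disabled
treatment = conversion_rate(478, 10_000)  # X enabled

diff = treatment - control
print(f"difference: {diff:+.2%}")
# A negligible difference is evidence AGAINST X causing the drop.
```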

#### 3. Mechanism (Plausibility Test)
**Question**: Can you explain HOW the cause produces the effect?

**Requirements**:
- A pathway from cause to effect
- Each step makes sense
- Supported by theory or evidence

**Example - Strong**:
- Cause: Increased checkout latency (5 sec)
- Mechanism: Users abandon slow pages
- Evidence: Industry data shows 7+ sec → 20% abandonment
- ✓ Clear, plausible mechanism

**Example - Weak**:
- Cause: Moon phase (full moon)
- Mechanism: ??? (no plausible pathway)
- Effect: Website traffic increase
- ✗ No mechanism; likely a spurious correlation

---

### Evidence Hierarchy

**Strongest Evidence** (Gold Standard):

1. **Randomized Controlled Trial (RCT)**
   - Random assignment eliminates confounding
   - Compare treatment vs control group
   - Example: A/B test with random user assignment

**Strong Evidence**:

2. **Quasi-Experiment / Natural Experiment**
   - Some random-like assignment (not perfect)
   - Example: Policy implemented in one region but not another
   - Control for confounders statistically

**Medium Evidence**:

3. **Longitudinal Study (Before/After)**
   - Track the same subjects over time
   - Controls for individual differences
   - Example: Metric before and after a change

4. **Well-Controlled Observational Study**
   - Statistical controls for confounders
   - Multiple variables measured
   - Example: Regression analysis with covariates

**Weak Evidence**:

5. **Cross-Sectional Correlation**
   - Single point in time
   - Can't establish temporal order
   - Example: Survey at one moment

6. **Anecdote / Single Case**
   - May not generalize
   - Many confounders
   - Example: One user complaint

**Recommendation**: For high-stakes decisions, aim for a quasi-experiment or better.

---

### Bradford Hill Criteria

Nine factors that strengthen causal claims. Score each on a 1-3 scale.

#### 1. Strength
**Question**: How strong is the association?

- **3**: Very strong correlation (r > 0.7, OR > 10)
- **2**: Moderate correlation (r = 0.3-0.7, OR = 2-10)
- **1**: Weak correlation (r < 0.3, OR < 2)

**Example**: 10x latency increase = strong association (3/3)

#### 2. Consistency
**Question**: Does the relationship replicate across contexts?

- **3**: Always consistent (multiple studies, settings)
- **2**: Mostly consistent (some exceptions)
- **1**: Rarely consistent (contradictory results)

**Example**: Latency affects conversion on all pages, devices, and countries = consistent (3/3)

#### 3. Specificity
**Question**: Is the cause-effect relationship specific?

- **3**: Specific cause → specific effect
- **2**: Somewhat specific (some other effects)
- **1**: General/non-specific (the cause has many effects)

**Example**: Missing index affects only queries on that table = specific (3/3)

#### 4. Temporality
**Question**: Does the cause precede the effect?

- **3**: Always (clear temporal sequence)
- **2**: Usually (mostly precedes)
- **1**: Unclear, or reverse causation possible

**Example**: Migration (2:00 PM) before slow queries (2:15 PM) = always (3/3)

#### 5. Dose-Response
**Question**: Does more cause → more effect?

- **3**: Clear gradient (linear or monotonic)
- **2**: Some gradient (mostly true)
- **1**: No gradient (flat or random)

**Example**: Larger tables → slower queries = clear gradient (3/3)

#### 6. Plausibility
**Question**: Does the mechanism make sense?

- **3**: Very plausible (well-established theory)
- **2**: Somewhat plausible (reasonable)
- **1**: Implausible (contradicts theory)

**Example**: Index scans are faster than table scans = well-established (3/3)

#### 7. Coherence
**Question**: Does it fit with existing knowledge?

- **3**: Fits well (no contradictions)
- **2**: Somewhat fits (minor contradictions)
- **1**: Contradicts existing knowledge

**Example**: Aligns with database optimization theory = fits well (3/3)

#### 8. Experiment
**Question**: Does intervention change the outcome?

- **3**: Strong experimental evidence (RCT)
- **2**: Some experimental evidence (quasi-experiment)
- **1**: No experimental evidence (observational only)

**Example**: Rollback restored performance = strong experiment (3/3)

#### 9. Analogy
**Question**: Do similar cause-effect patterns exist?

- **3**: Strong analogies (many similar cases)
- **2**: Some analogies (a few similar)
- **1**: No analogies (unique case)

**Example**: Similar patterns in other databases = some analogies (2/3)

**Scoring**:
- **Total 18-27**: Strong causal evidence
- **Total 12-17**: Moderate evidence
- **Total < 12**: Weak evidence (needs more investigation)
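The scoring rubric above can be wrapped in a few lines; a sketch (the scores shown are from the worked example below, and the function name is ours):

```python
def interpret_bradford_hill(scores):
    """Sum per-criterion scores (1-3 each) and map the total
    to the rubric's evidence bands."""
    assert len(scores) == 9 and all(1 <= s <= 3 for s in scores.values())
    total = sum(scores.values())
    if total >= 18:
        verdict = "Strong causal evidence"
    elif total >= 12:
        verdict = "Moderate evidence"
    else:
        verdict = "Weak evidence (needs more investigation)"
    return total, verdict

scores = {
    "strength": 3, "consistency": 3, "specificity": 2,
    "temporality": 3, "dose_response": 3, "plausibility": 3,
    "coherence": 3, "experiment": 3, "analogy": 2,
}
print(interpret_bradford_hill(scores))  # (25, 'Strong causal evidence')
```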

---

## Worked Example: Website Conversion Drop

### Problem

Website conversion rate dropped from 5% to 3% (a 40% relative drop) starting November 15th.

---

### 1. Effect Definition

**Effect**: Conversion rate dropped 40% (5% → 3%)

**Quantification**:
- Baseline: 5% (stable for the 6 months prior)
- Current: 3% (observed for 2 weeks)
- Absolute drop: 2 percentage points
- Relative drop: 40%
- Impact: ~$280k/week revenue loss

**Timeline**:
- Last normal day: November 14th (5.1% conversion)
- First drop observed: November 15th (3.2% conversion)
- Ongoing: Yes (still at 3% as of November 29th)

**Impact**:
- 10,000 daily visitors
- 500 conversions → 300 conversions
- $200 average order value
- Loss: 200 conversions × $200 = $40k/day

---

### 2. Competing Hypotheses (Using Multiple Techniques)

#### Using 5 Whys:
```
Effect: Conversion dropped 40%
Why? → Users abandoning checkout
Why? → Checkout takes too long
Why? → Payment processor is slow
Why? → New processor has higher latency
Why? → Switched to cheaper processor (ROOT)
```

#### Using Fishbone Diagram:

**Technology**:
- New payment processor (Nov 10)
- New checkout UI (Nov 14)

**Process**:
- Lack of A/B testing
- No performance monitoring

**Environment**:
- Seasonal traffic shift (holiday season)

**People**:
- User behavior changes?

#### Using Timeline Analysis:
```
Nov 10: Switched to new payment processor
Nov 14: Deployed new checkout UI
Nov 15: Conversion drop observed (2:00 AM)
```

#### Competing Hypotheses Generated:

**Hypothesis 1: New checkout UI** (deployed Nov 14)
- Type: Proximate cause
- Evidence for: Timing matches (Nov 14 → Nov 15)
- Evidence against: A/B test showed no difference; rollback didn't fix it

**Hypothesis 2: Payment processor switch** (Nov 10)
- Type: Root cause candidate
- Evidence for: New processor slower (400ms vs 100ms); timing precedes the drop
- Evidence against: 5-day lag (why not immediate?)

**Hypothesis 3: Payment latency increase** (Nov 15)
- Type: Proximate cause/symptom
- Evidence for: Logs show 5-8 sec checkout (was 2-3 sec); user complaints
- Evidence against: None (strong evidence)

**Hypothesis 4: Seasonal traffic shift**
- Type: Confounder
- Evidence for: Holiday season
- Evidence against: Traffic composition unchanged

---

### 3. Causal Model

#### Causal Chain:
```
ROOT: Switched to cheaper payment processor (Nov 10)
  ↓ (mechanism: lower-throughput processor)
INTERMEDIATE: Payment API latency under load
  ↓ (mechanism: slow API → page delays)
PROXIMATE: Checkout takes 5-8 seconds (Nov 15+)
  ↓ (mechanism: users abandon slow pages)
EFFECT: Conversion drops 40%
```

#### Why the Nov 15 lag?
- Nov 10-14: Weekday traffic (low; the processor handled it)
- Nov 15: Weekend traffic spike (2x normal)
- The processor couldn't handle weekend load → latency spike

#### Confounders Ruled Out:
- UI change: Rollback had no effect
- Seasonality: Traffic patterns unchanged
- Marketing: No changes

---

### 4. Evidence Assessment

**Temporal Sequence**: ✓ PASS (3/3)
- Processor switch (Nov 10) → Latency (Nov 15) → Drop (Nov 15)
- Clear precedence

**Counterfactual**: ✓ PASS (3/3)
- Test: Switched back to the old processor
- Result: Conversion recovered to 4.8%
- Strong evidence

**Mechanism**: ✓ PASS (3/3)
- Slow processor (400ms) → high load → 5-8 sec checkout → user abandonment
- Industry data: 7+ sec = 20% abandonment
- Clear, plausible mechanism

**Dose-Response**: ✓ PASS (3/3)

| Checkout Time | Conversion |
|---------------|------------|
| <3 sec | 5% |
| 3-5 sec | 4% |
| 5-7 sec | 3% |
| >7 sec | 2% |

Clear gradient
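The gradient claim can be checked mechanically (bucket labels and rates are from the table above):

```python
# Conversion rate by checkout-time bucket, ordered fastest to slowest.
gradient = [("<3 sec", 0.05), ("3-5 sec", 0.04), ("5-7 sec", 0.03), (">7 sec", 0.02)]

rates = [r for _, r in gradient]
# A dose-response gradient here means conversion strictly decreases
# as checkout time increases.
is_monotonic = all(a > b for a, b in zip(rates, rates[1:]))
print(is_monotonic)  # True → clear gradient
```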

**Consistency**: ✓ PASS (3/3)
- All devices (mobile, desktop, tablet)
- All payment methods
- All countries
- Consistent pattern

**Bradford Hill Score: 25/27** (Very Strong)
1. Strength: 3 (40% drop)
2. Consistency: 3 (all segments)
3. Specificity: 2 (latency affects other things)
4. Temporality: 3 (clear sequence)
5. Dose-response: 3 (gradient)
6. Plausibility: 3 (well-known)
7. Coherence: 3 (fits theory)
8. Experiment: 3 (rollback test)
9. Analogy: 2 (some similar cases)

---

### 5. Conclusion

**Root Cause**: Switched to a cheaper payment processor with insufficient throughput for weekend traffic.

**Confidence**: High (95%+)

**Why high confidence**:
- Perfect temporal sequence
- Strong counterfactual (rollback restored conversion)
- Clear mechanism
- Dose-response present
- Consistent across contexts
- No confounders
- Bradford Hill 25/27

**Why root (not symptom)**:
- The processor switch is the decision that created the problem
- Latency is a symptom of that decision
- Fixing latency alone treats the symptom
- Reverting the switch eliminates the problem

---

### 6. Recommended Actions

**Immediate**:
1. ✓ Reverted to the old processor (Nov 28)
2. Negotiate better rates with the old processor

**If keeping the new processor**:
1. Add a caching layer (reduce latency to <2 sec)
2. Implement async checkout
3. Load test at 3x peak traffic

**Validation**:
- A/B test: Old processor vs new + caching
- Monitor: Latency p95, conversion rate
- Success: Conversion >4.5%, latency <2 sec

---

### 7. Limitations

**Data limitations**:
- No abandonment-reason tracking
- Processor metrics limited (black box)

**Analysis limitations**:
- Didn't test all latency optimizations
- Small sample for some device types

**Generalizability**:
- Specific to our checkout flow
- May not apply to other traffic patterns

---

## Quality Checklist

Before finalizing a root cause analysis, verify:

**Effect Definition**:
- [ ] Effect clearly described (not vague)
- [ ] Quantified with magnitude
- [ ] Timeline established (when it started, duration)
- [ ] Baseline for comparison stated

**Hypothesis Generation**:
- [ ] Multiple hypotheses generated (3+ competing)
- [ ] Used systematic techniques (5 Whys, Fishbone, Timeline, Differential)
- [ ] Proximate vs root distinguished
- [ ] Confounders identified

**Causal Model**:
- [ ] Causal chain mapped (root → proximate → effect)
- [ ] Mechanisms explained for each link
- [ ] Confounding relationships noted
- [ ] Direct vs indirect causation clear

**Evidence Assessment**:
- [ ] Temporal sequence verified (cause before effect)
- [ ] Counterfactual tested (what if the cause were absent)
- [ ] Mechanism explained with supporting evidence
- [ ] Dose-response checked (more cause → more effect)
- [ ] Consistency assessed (holds across contexts)
- [ ] Confounders ruled out or controlled

**Conclusion**:
- [ ] Root cause clearly stated
- [ ] Confidence level stated with justification
- [ ] Distinguished from proximate causes/symptoms
- [ ] Alternative explanations acknowledged
- [ ] Limitations noted

**Recommendations**:
- [ ] Address the root cause (not just symptoms)
- [ ] Actionable interventions proposed
- [ ] Validation tests specified
- [ ] Success metrics defined

**Minimum Standards**:
- For medium-stakes analyses (postmortems, feature issues): score ≥ 3.5 on the rubric
- For high-stakes analyses (infrastructure, safety): score ≥ 4.0 on the rubric
- Red flags: <3 on Temporal Sequence, Counterfactual, or Mechanism = weak causal claim
265
skills/causal-inference-root-cause/resources/template.md
Normal file
@@ -0,0 +1,265 @@

# Causal Inference & Root Cause Analysis Template

## Workflow

Copy this checklist and track your progress:

```
Root Cause Analysis Progress:
- [ ] Step 1: Define effect with quantification and timeline
- [ ] Step 2: Generate competing hypotheses systematically
- [ ] Step 3: Build causal model with mechanisms
- [ ] Step 4: Assess evidence using essential tests
- [ ] Step 5: Document conclusion with confidence level
```

**Step 1: Define effect with quantification and timeline**

Describe what happened specifically, quantify the magnitude (baseline vs current, change), establish the timeline (first observed, duration, context), and identify the impact (who's affected, severity, business impact). Use [Quick Template](#quick-template) section 1.

**Step 2: Generate competing hypotheses systematically**

List 3-5 potential causes, classify each as root/proximate/symptom, note evidence for and against each hypothesis, and identify potential confounders (third factors causing both X and Y). Use techniques from [Section-by-Section Guidance](#section-by-section-guidance): 5 Whys, Fishbone Diagram, Timeline Analysis.

**Step 3: Build causal model with mechanisms**

Draw the causal chain (Root → Intermediate → Proximate → Effect) with mechanisms explaining HOW each step leads to the next, and rule out confounders (distinguish Z causing both X and Y from X causing Y). See the [Causal Model](#3-causal-model) template structure.

**Step 4: Assess evidence using essential tests**

Check the temporal sequence (cause before effect?), test the counterfactual (what if the cause were absent?), explain the mechanism (HOW does it work?), look for dose-response (more cause → more effect?), verify consistency (does it hold across contexts?), and control for confounding. See [Evidence Assessment](#4-evidence-assessment) and [Section-by-Section Guidance](#4-evidence-assessment-1).

**Step 5: Document conclusion with confidence level**

State the root cause with a confidence level (High/Medium/Low) and justification, explain the complete causal mechanism, clarify why this is the root (not a symptom), note alternative explanations and why they are less likely, recommend interventions addressing the root cause, propose validation tests, and acknowledge limitations. Use the [Quality Checklist](#quality-checklist) to verify completeness.
## Quick Template

```markdown
# Root Cause Analysis: {Effect/Problem}

## 1. Effect Definition

**What happened**: {Clear description of the effect}

**Quantification**: {Magnitude, frequency, impact}
- Baseline: {Normal state}
- Current: {Observed state}
- Change: {Absolute and relative change}

**Timeline**:
- First observed: {Date/time}
- Duration: {Ongoing/resolved}
- Context: {What else was happening}

**Impact**:
- Who's affected: {Users, systems, metrics}
- Severity: {High/Medium/Low}
- Business impact: {Revenue, users, reputation}

---

## 2. Competing Hypotheses

Generate 3-5 potential causes systematically:

### Hypothesis 1: {Potential cause}
- **Type**: Root cause / Proximate cause / Symptom
- **Evidence for**: {Supporting data, observations}
- **Evidence against**: {Contradicting data}

### Hypothesis 2: {Potential cause}
- **Type**: Root cause / Proximate cause / Symptom
- **Evidence for**: {Supporting data}
- **Evidence against**: {Contradicting data}

### Hypothesis 3: {Potential cause}
- **Type**: Root cause / Proximate cause / Symptom
- **Evidence for**: {Supporting data}
- **Evidence against**: {Contradicting data}

**Potential confounders**: {Third factors that might cause both X and Y}

---
|
||||
|
||||
## 3. Causal Model
|
||||
|
||||
### Causal Chain (Root → Proximate → Effect)
|
||||
|
||||
```
|
||||
{Root Cause}
|
||||
↓ {mechanism: how does root cause lead to next step?}
|
||||
{Intermediate Cause}
|
||||
↓ {mechanism: how does this lead to proximate?}
|
||||
{Proximate Cause}
|
||||
↓ {mechanism: how does this produce observed effect?}
|
||||
{Observed Effect}
|
||||
```
|
||||
|
||||
**Mechanisms**: {Explain HOW each step leads to the next}
|
||||
|
||||
**Confounders ruled out**: {Z causes both X and Y, but X doesn't cause Y}
|
||||
|
||||
---
|
||||
|
||||
## 4. Evidence Assessment
|
||||
|
||||
### Temporal Sequence
|
||||
- [ ] Does cause precede effect? {Yes/No}
|
||||
- [ ] Timeline: {Cause at time T, effect at time T+lag}
|
||||
|
||||
### Counterfactual Test
|
||||
- [ ] "What if cause hadn't occurred?" → {Expected outcome}
|
||||
- [ ] Evidence: {Control group, rollback, baseline comparison}
|
||||
|
||||
### Mechanism
|
||||
- [ ] Can we explain HOW cause produces effect? {Yes/No}
|
||||
- [ ] Mechanism: {Detailed explanation with supporting evidence}
|
||||
|
||||
### Dose-Response
|
||||
- [ ] More cause → more effect? {Yes/No/Unknown}
|
||||
- [ ] Evidence: {Examples of gradient}
|
||||
|
||||
### Consistency
|
||||
- [ ] Does relationship hold across contexts? {Yes/No}
|
||||
- [ ] Evidence: {Different times, places, populations}
|
||||
|
||||
### Confounding Control
|
||||
- [ ] Identified confounders: {List third variables}
|
||||
- [ ] Controlled for: {How confounders were ruled out}
|
||||
|
||||
---
|
||||
|
||||
## 5. Conclusion
|
||||
|
||||
### Most Likely Root Cause
|
||||
|
||||
**Root cause**: {Identified root cause}
|
||||
|
||||
**Confidence level**: {High/Medium/Low}
|
||||
|
||||
**Justification**: {Why this confidence level based on evidence}
|
||||
|
||||
**Causal mechanism**: {Complete chain from root cause → effect}
|
||||
|
||||
**Why this is root cause (not symptom)**: {If fixed, problem wouldn't recur}
|
||||
|
||||
### Alternative Explanations
|
||||
|
||||
**Alternative 1**: {Other possible cause}
|
||||
- Why less likely: {Evidence against}
|
||||
|
||||
**Alternative 2**: {Other possible cause}
|
||||
- Why less likely: {Evidence against}
|
||||
|
||||
**Unresolved uncertainties**: {What we still don't know}
|
||||
|
||||
---
|
||||
|
||||
## 6. Recommended Actions
|
||||
|
||||
### Immediate Interventions (Address Root Cause)
|
||||
1. {Action to eliminate root cause}
|
||||
2. {Action to prevent recurrence}
|
||||
|
||||
### Tests to Validate Causal Claim
|
||||
1. {Experiment to confirm causation}
|
||||
2. {Observable prediction if theory is correct}
|
||||
|
||||
### Monitoring
|
||||
- Metrics to track: {Key indicators}
|
||||
- Expected change: {What should happen if intervention works}
|
||||
- Timeline: {When to expect results}
|
||||
|
||||
---
|
||||
|
||||
## 7. Limitations
|
||||
|
||||
**Data limitations**: {Missing data, measurement issues}
|
||||
|
||||
**Analysis limitations**: {Confounders not controlled, temporal gaps, sample size}
|
||||
|
||||
**Generalizability**: {Does this apply beyond this specific case?}
|
||||
```

---

## Section-by-Section Guidance

### 1. Effect Definition
**Goal**: Precisely specify what you're explaining
- Be specific (not "system slow" but "p95 latency 50ms → 500ms")
- Quantify magnitude and timeline
- Establish baseline for comparison

### 2. Competing Hypotheses
**Goal**: Generate multiple explanations (avoid confirmation bias)
- Use techniques: 5 Whys, Fishbone Diagram, Timeline Analysis (see `methodology.md`)
- Distinguish: Root cause (fundamental) vs Proximate cause (immediate trigger) vs Symptom
- Consider confounders (third variables causing both X and Y)
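The hypothesis bookkeeping above can be kept as plain data so no single theory is quietly favored. A minimal sketch, assuming a hypothetical database-latency incident; the field names and the for-minus-against scoring rule are a simplification for illustration, not a standard method:

```python
# Sketch: tracking competing hypotheses with explicit evidence for and against.
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    cause: str
    kind: str                      # "root" | "proximate" | "symptom"
    evidence_for: list = field(default_factory=list)
    evidence_against: list = field(default_factory=list)

    def score(self) -> int:
        # Crude tally: supporting minus contradicting observations.
        return len(self.evidence_for) - len(self.evidence_against)

hypotheses = [
    Hypothesis("Connection pool exhausted", "proximate",
               ["pool at 100% during incident"], []),
    Hypothesis("Missing index after schema migration", "root",
               ["migration deployed 1h before onset", "EXPLAIN shows seq scan"], []),
    Hypothesis("Network partition", "proximate", [], ["no packet loss in metrics"]),
]

for h in sorted(hypotheses, key=Hypothesis.score, reverse=True):
    print(f"{h.score():+d}  {h.kind:9s} {h.cause}")
```

Forcing an `evidence_against` field on every hypothesis is the point: it makes confirmation bias visible.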

### 3. Causal Model
**Goal**: Map how causes lead to effect
- Show causal chains with mechanisms (not just correlation)
- Identify confounding relationships
- Distinguish direct vs indirect causation

### 4. Evidence Assessment
**Goal**: Test whether X truly causes Y (not just correlates)

**Essential tests:**
- **Temporal sequence**: Cause must precede effect
- **Counterfactual**: What happens when cause is absent?
- **Mechanism**: HOW does cause produce effect?

**Strengthening evidence:**
- **Dose-response**: More cause → more effect?
- **Consistency**: Holds across contexts?
- **Confounding control**: Ruled out third variables?

### 5. Conclusion
**Goal**: State root cause with justified confidence
- Distinguish root from proximate causes
- State confidence level (High/Medium/Low) with reasoning
- Acknowledge alternative explanations and limitations

### 6. Recommendations
**Goal**: Actionable interventions based on root cause
- Address root cause (not just symptoms)
- Propose tests to validate causal claim
- Define success metrics

### 7. Limitations
**Goal**: Acknowledge uncertainties and boundaries
- Missing data or measurement issues
- Confounders not fully controlled
- Scope of generalizability

---

## Quick Reference

**For detailed techniques**: See `methodology.md`
- 5 Whys (trace back to root)
- Fishbone Diagram (categorize causes)
- Timeline Analysis (what changed when?)
- Bradford Hill Criteria (9 factors for causation)
- Evidence hierarchy (RCT > longitudinal > cross-sectional)
- Confounding control methods
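As a taste of the 5 Whys technique detailed in `methodology.md`, a chain can be kept as plain data so the trail from symptom to candidate root cause stays auditable. The example chain below is invented for illustration:

```python
# Sketch: a 5 Whys chain as plain data; each answer seeds the next question.
chain = [
    ("Why did checkout fail?",         "The payment API timed out."),
    ("Why did the API time out?",      "The database query took >30s."),
    ("Why was the query slow?",        "It did a full table scan."),
    ("Why a full table scan?",         "The index was dropped by a migration."),
    ("Why did the migration drop it?", "No review step checks index coverage."),
]
root_cause = chain[-1][1]
for i, (why, answer) in enumerate(chain, 1):
    print(f"{i}. {why} → {answer}")
print(f"Candidate root cause: {root_cause}")
```

Real chains should stop at a cause your team can actually fix, not at an arbitrary fifth "why".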

**For complete example**: See `examples/database-performance-degradation.md`
- Shows full analysis with all sections
- Demonstrates evidence assessment
- Includes Bradford Hill scoring

**Quality checklist**: Before finalizing, verify:
- [ ] Effect clearly defined with quantification
- [ ] Multiple hypotheses generated (3+ competing explanations)
- [ ] Root cause distinguished from proximate/symptoms
- [ ] Temporal sequence verified (cause before effect)
- [ ] Counterfactual tested (what if cause absent?)
- [ ] Mechanism explained (HOW, not just THAT)
- [ ] Confounders identified and controlled
- [ ] Confidence level stated with justification
- [ ] Alternative explanations noted
- [ ] Limitations acknowledged
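The checklist can also serve as a machine-checkable gate before an analysis is finalized. A minimal sketch; the keys mirror the checklist items, and the one unchecked value is illustrative:

```python
# Sketch: the pre-finalization checklist as data; any False blocks sign-off.
checklist = {
    "effect_quantified": True,
    "three_plus_hypotheses": True,
    "root_vs_proximate_distinguished": True,
    "temporal_sequence_verified": True,
    "counterfactual_tested": True,
    "mechanism_explained": True,
    "confounders_controlled": False,  # illustrative: one item still open
    "confidence_stated": True,
    "alternatives_noted": True,
    "limitations_acknowledged": True,
}
missing = [item for item, done in checklist.items() if not done]
print("READY" if not missing else f"INCOMPLETE: {', '.join(missing)}")
```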