Initial commit

2025-11-29 18:29:18 +08:00
commit 46dfc30864
25 changed files with 4683 additions and 0 deletions
--- a/skills/incident-response/reference/rca-techniques.md
+++ b/skills/incident-response/reference/rca-techniques.md
@@ -0,0 +1,260 @@
+# Root Cause Analysis Techniques
+
+Comprehensive methods for identifying root causes through 5 Whys, Fishbone Diagrams, Timeline Reconstruction, and data-driven hypothesis testing.
+
+## 5 Whys Technique
+
+### Method
+
+Ask "Why?" iteratively until you reach the root cause (typically 5 levels deep, but can be 3-7).
+
+**Rules**:
+1. Start with the problem statement
+2. Ask "Why did this happen?"
+3. Answer based on facts/data (not assumptions)
+4. Repeat for each answer until root cause found
+5. Root cause = **systemic issue** (not human error)
+
+### Example
+
+**Problem**: Database went down
+
+```
+Why 1: Why did the database go down?
+→ Because the primary database ran out of disk space
+
+Why 2: Why did it run out of disk space?
+→ Because PostgreSQL logs filled the entire disk (450GB)
+
+Why 3: Why did logs grow to 450GB?
+→ Because log rotation was disabled
+
+Why 4: Why was log rotation disabled?
+→ Because the `log_truncate_on_rotation` config was set to `off` during a migration
+
+Why 5: Why was this config change not caught?
+→ Because configuration changes are not code-reviewed and there was no disk monitoring alert
+
+ROOT CAUSE: Missing disk monitoring alerts + configuration change without code review
+```
+
+**Action Items**:
+- Add disk usage monitoring (>90% alert)
+- Require code review for all config changes
+- Enable log rotation on all databases
+
+---
+
+## Fishbone Diagram (Ishikawa)
+
+### Method
+
+Categorize contributing factors into major categories to identify root cause systematically.
+
+**Categories** (6M's):
+1. **Method** (Process)
+2. **Machine** (Technology)
+3. **Material** (Inputs/Data)
+4. **Measurement** (Monitoring)
+5. **Mother Nature** (Environment)
+6. **Manpower** (People/Skills)
+
+### Example
+
+**Problem**: API performance degraded (p95: 200ms → 2000ms)
+
+```
+                              API Performance Degraded
+                                        │
+        ┌───────────────────┬───────────┴───────────┬───────────────────┐
+        │                   │                       │                   │
+    METHOD              MACHINE                 MATERIAL          MEASUREMENT
+  (Process)          (Technology)               (Data)            (Monitoring)
+        │                   │                       │                   │
+   No memory          EventEmitter              Large dataset       No heap
+   profiling in       listeners leak            processing          snapshots
+   code review        (not removed)             (100K orders)       in CI/CD
+        │                   │                       │                   │
+   No long-running    Node.js v14               High traffic        No gradual
+   load tests         (old GC)                  spike (2x)          alerts
+   (only 5min)                                                      (1h → 2h)
+```
+
+**Root Causes Identified**:
+- **Machine**: EventEmitter leak (technical)
+- **Measurement**: No heap monitoring (monitoring gap)
+- **Method**: No memory profiling in code review (process gap)
+
+---
+
+## Timeline Reconstruction
+
+### Method
+
+Build chronological timeline of events to identify causation and correlation.
+
+**Steps**:
+1. Gather logs from all systems (with timestamps)
+2. Normalize to UTC
+3. Plot events chronologically
+4. Identify cause-and-effect relationships
+5. Find the triggering event
+
+### Example
+
+```
+12:00:00 - Normal operation (p95: 200ms, memory: 400MB)
+12:15:00 - Code deployment (v2.15.4)
+12:30:00 - Memory: 720MB (+80% in 15min) ⚠️
+12:45:00 - Memory: 1.2GB (+67% in 15min) ⚠️
+13:00:00 - Memory: 1.8GB (+50% in 15min) 🚨
+13:00:00 - p95 latency: 800ms (4x slower)
+13:15:00 - Memory: 2.3GB (limit reached)
+13:15:00 - Workers start OOMing
+13:20:00 - p95 latency: 2000ms (10x slower)
+13:30:00 - Alert fired: High latency
+14:00:00 - Alert fired: High memory
+
+CORRELATION:
+- Deployment at 12:15 → Memory growth starts at 12:30
+- Memory growth → Latency increase (correlated)
+- TRIGGER: Code deployment v2.15.4
+
+ACTION: Review code changes in v2.15.4
+```
+
+---
+
+## Contributing Factors Analysis
+
+### Levels of Causation
+
+**Immediate Cause** (What happened):
+- Direct technical failure
+- Example: EventEmitter listeners not removed
+
+**Underlying Conditions** (Why it was possible):
+- Missing safeguards
+- Example: No memory profiling in code review
+
+**Latent Failures** (Systemic weaknesses):
+- Organizational/process gaps
+- Example: No developer training on memory management
+
+### Example
+
+**Incident**: Memory leak in production
+
+```
+Immediate Cause:
+└─ Code: EventEmitter .on() used without .removeListener()
+
+Underlying Conditions:
+├─ No code review caught the issue
+├─ No memory profiling in CI/CD
+└─ Short load tests (5min) didn't reveal gradual leak
+
+Latent Failures:
+├─ Team lacks memory management training
+├─ No documentation on EventEmitter best practices
+└─ Culture of "ship fast, fix later"
+```
+
+---
+
+## Hypothesis Testing
+
+### Method
+
+Generate hypotheses, test with data, validate or reject.
+
+**Process**:
+1. Observe symptoms
+2. Generate hypotheses (educated guesses)
+3. Design experiments to test each hypothesis
+4. Collect data
+5. Accept or reject hypothesis
+6. Repeat until root cause found
+
+### Example
+
+**Symptom**: Checkout API slow (p95: 3000ms)
+
+**Hypothesis 1**: Database slow queries
+```
+Test: Check slow query log
+Data: All queries < 50ms ✅
+Result: REJECTED - database is fast
+```
+
+**Hypothesis 2**: External API slow
+```
+Test: Distributed tracing (Jaeger)
+Data: Fraud check API: 2750ms (91% of total time) 🚨
+Result: ACCEPTED - external API is bottleneck
+```
+
+**Hypothesis 3**: Network latency
+```
+Test: curl timing breakdown
+Data: DNS: 50ms, Connect: 30ms, Transfer: 2750ms
+Result: PARTIAL - transfer is slow (not DNS/connect)
+```
+
+**Root Cause**: External fraud check API slow (blocking checkout)
+
+---
+
+## Blameless RCA Principles
+
+### Core Tenets
+
+1. **Focus on Systems, Not People**
+   - ❌ "Engineer made a mistake"
+   - ✅ "Process didn't catch config error"
+
+2. **Assume Good Intent**
+   - Everyone did the best they could with information available
+   - Blame discourages honesty and learning
+
+3. **Multiple Contributing Factors**
+   - Never a single cause
+   - Usually 3-5 factors contribute
+
+4. **Actionable Improvements**
+   - Fix the system, not the person
+   - Concrete action items with owners
+
+### Example (Blameless vs Blame)
+
+**Blamefu (BAD)**:
+```
+Root Cause: Engineer Jane deployed code without testing
+Action Item: Remind Jane to test before deploying
+```
+
+**Blameless (GOOD)**:
+```
+Root Cause: Deployment process allowed untested code to reach production
+Contributing Factors:
+1. No automated tests in CI/CD
+2. Manual deployment process (prone to human error)
+3. No staging environment validation
+
+Action Items:
+1. Add automated tests to CI/CD (Owner: Mike, Due: Dec 20)
+2. Require staging deployment + validation before production (Owner: Sarah, Due: Dec 22)
+3. Implement deployment checklist (Owner: Alex, Due: Dec 18)
+```
+
+---
+
+## Related Documentation
+
+- **Examples**: [Examples Index](../examples/INDEX.md) - RCA examples from real incidents
+- **Severity Matrix**: [incident-severity-matrix.md](incident-severity-matrix.md) - When to perform RCA
+- **Templates**: [Postmortem Template](../templates/postmortem-template.md) - Structured RCA format
+
+---
+
+Return to [reference index](INDEX.md)