261 lines
7.3 KiB
Markdown
261 lines
7.3 KiB
Markdown
# Root Cause Analysis Techniques
|
|
|
|
Comprehensive methods for identifying root causes through 5 Whys, Fishbone Diagrams, Timeline Reconstruction, and data-driven hypothesis testing.
|
|
|
|
## 5 Whys Technique
|
|
|
|
### Method
|
|
|
|
Ask "Why?" iteratively until you reach the root cause (typically 5 levels deep, but can be 3-7).
|
|
|
|
**Rules**:
|
|
1. Start with the problem statement
|
|
2. Ask "Why did this happen?"
|
|
3. Answer based on facts/data (not assumptions)
|
|
4. Repeat for each answer until root cause found
|
|
5. Root cause = **systemic issue** (not human error)
|
|
|
|
### Example
|
|
|
|
**Problem**: Database went down
|
|
|
|
```
|
|
Why 1: Why did the database go down?
|
|
→ Because the primary database ran out of disk space
|
|
|
|
Why 2: Why did it run out of disk space?
|
|
→ Because PostgreSQL logs filled the entire disk (450GB)
|
|
|
|
Why 3: Why did logs grow to 450GB?
|
|
→ Because log rotation was disabled
|
|
|
|
Why 4: Why was log rotation disabled?
|
|
→ Because the `log_truncate_on_rotation` config was set to `off` during a migration
|
|
|
|
Why 5: Why was this config change not caught?
|
|
→ Because configuration changes are not code-reviewed and there was no disk monitoring alert
|
|
|
|
ROOT CAUSE: Missing disk monitoring alerts + configuration change without code review
|
|
```
|
|
|
|
**Action Items**:
|
|
- Add disk usage monitoring (>90% alert)
|
|
- Require code review for all config changes
|
|
- Enable log rotation on all databases
|
|
|
|
---
|
|
|
|
## Fishbone Diagram (Ishikawa)
|
|
|
|
### Method
|
|
|
|
Categorize contributing factors into major categories to identify root cause systematically.
|
|
|
|
**Categories** (6M's):
|
|
1. **Method** (Process)
|
|
2. **Machine** (Technology)
|
|
3. **Material** (Inputs/Data)
|
|
4. **Measurement** (Monitoring)
|
|
5. **Mother Nature** (Environment)
|
|
6. **Manpower** (People/Skills)
|
|
|
|
### Example
|
|
|
|
**Problem**: API performance degraded (p95: 200ms → 2000ms)
|
|
|
|
```
|
|
API Performance Degraded
|
|
│
|
|
┌───────────────────┬───────────┴───────────┬───────────────────┐
|
|
│ │ │ │
|
|
METHOD MACHINE MATERIAL MEASUREMENT
|
|
(Process) (Technology) (Data) (Monitoring)
|
|
│ │ │ │
|
|
No memory EventEmitter Large dataset No heap
|
|
profiling in listeners leak processing snapshots
|
|
code review (not removed) (100K orders) in CI/CD
|
|
│ │ │ │
|
|
No long-running Node.js v14 High traffic No gradual
|
|
load tests (old GC) spike (2x) alerts
|
|
(only 5min) (1h → 2h)
|
|
```
|
|
|
|
**Root Causes Identified**:
|
|
- **Machine**: EventEmitter leak (technical)
|
|
- **Measurement**: No heap monitoring (monitoring gap)
|
|
- **Method**: No memory profiling in code review (process gap)
|
|
|
|
---
|
|
|
|
## Timeline Reconstruction
|
|
|
|
### Method
|
|
|
|
Build chronological timeline of events to identify causation and correlation.
|
|
|
|
**Steps**:
|
|
1. Gather logs from all systems (with timestamps)
|
|
2. Normalize to UTC
|
|
3. Plot events chronologically
|
|
4. Identify cause-and-effect relationships
|
|
5. Find the triggering event
|
|
|
|
### Example
|
|
|
|
```
|
|
12:00:00 - Normal operation (p95: 200ms, memory: 400MB)
|
|
12:15:00 - Code deployment (v2.15.4)
|
|
12:30:00 - Memory: 720MB (+80% in 15min) ⚠️
|
|
12:45:00 - Memory: 1.2GB (+67% in 15min) ⚠️
|
|
13:00:00 - Memory: 1.8GB (+50% in 15min) 🚨
|
|
13:00:00 - p95 latency: 800ms (4x slower)
|
|
13:15:00 - Memory: 2.3GB (limit reached)
|
|
13:15:00 - Workers start OOMing
|
|
13:20:00 - p95 latency: 2000ms (10x slower)
|
|
13:30:00 - Alert fired: High latency
|
|
14:00:00 - Alert fired: High memory
|
|
|
|
CORRELATION:
|
|
- Deployment at 12:15 → Memory growth starts at 12:30
|
|
- Memory growth → Latency increase (correlated)
|
|
- TRIGGER: Code deployment v2.15.4
|
|
|
|
ACTION: Review code changes in v2.15.4
|
|
```
|
|
|
|
---
|
|
|
|
## Contributing Factors Analysis
|
|
|
|
### Levels of Causation
|
|
|
|
**Immediate Cause** (What happened):
|
|
- Direct technical failure
|
|
- Example: EventEmitter listeners not removed
|
|
|
|
**Underlying Conditions** (Why it was possible):
|
|
- Missing safeguards
|
|
- Example: No memory profiling in code review
|
|
|
|
**Latent Failures** (Systemic weaknesses):
|
|
- Organizational/process gaps
|
|
- Example: No developer training on memory management
|
|
|
|
### Example
|
|
|
|
**Incident**: Memory leak in production
|
|
|
|
```
|
|
Immediate Cause:
|
|
└─ Code: EventEmitter .on() used without .removeListener()
|
|
|
|
Underlying Conditions:
|
|
├─ No code review caught the issue
|
|
├─ No memory profiling in CI/CD
|
|
└─ Short load tests (5min) didn't reveal gradual leak
|
|
|
|
Latent Failures:
|
|
├─ Team lacks memory management training
|
|
├─ No documentation on EventEmitter best practices
|
|
└─ Culture of "ship fast, fix later"
|
|
```
|
|
|
|
---
|
|
|
|
## Hypothesis Testing
|
|
|
|
### Method
|
|
|
|
Generate hypotheses, test with data, validate or reject.
|
|
|
|
**Process**:
|
|
1. Observe symptoms
|
|
2. Generate hypotheses (educated guesses)
|
|
3. Design experiments to test each hypothesis
|
|
4. Collect data
|
|
5. Accept or reject hypothesis
|
|
6. Repeat until root cause found
|
|
|
|
### Example
|
|
|
|
**Symptom**: Checkout API slow (p95: 3000ms)
|
|
|
|
**Hypothesis 1**: Database slow queries
|
|
```
|
|
Test: Check slow query log
|
|
Data: All queries < 50ms ✅
|
|
Result: REJECTED - database is fast
|
|
```
|
|
|
|
**Hypothesis 2**: External API slow
|
|
```
|
|
Test: Distributed tracing (Jaeger)
|
|
Data: Fraud check API: 2750ms (91% of total time) 🚨
|
|
Result: ACCEPTED - external API is bottleneck
|
|
```
|
|
|
|
**Hypothesis 3**: Network latency
|
|
```
|
|
Test: curl timing breakdown
|
|
Data: DNS: 50ms, Connect: 30ms, Transfer: 2750ms
|
|
Result: PARTIAL - transfer is slow (not DNS/connect)
|
|
```
|
|
|
|
**Root Cause**: External fraud check API slow (blocking checkout)
|
|
|
|
---
|
|
|
|
## Blameless RCA Principles
|
|
|
|
### Core Tenets
|
|
|
|
1. **Focus on Systems, Not People**
|
|
- ❌ "Engineer made a mistake"
|
|
- ✅ "Process didn't catch config error"
|
|
|
|
2. **Assume Good Intent**
|
|
- Everyone did the best they could with information available
|
|
- Blame discourages honesty and learning
|
|
|
|
3. **Multiple Contributing Factors**
|
|
- Never a single cause
|
|
- Usually 3-5 factors contribute
|
|
|
|
4. **Actionable Improvements**
|
|
- Fix the system, not the person
|
|
- Concrete action items with owners
|
|
|
|
### Example (Blameless vs Blame)
|
|
|
|
**Blamefu (BAD)**:
|
|
```
|
|
Root Cause: Engineer Jane deployed code without testing
|
|
Action Item: Remind Jane to test before deploying
|
|
```
|
|
|
|
**Blameless (GOOD)**:
|
|
```
|
|
Root Cause: Deployment process allowed untested code to reach production
|
|
Contributing Factors:
|
|
1. No automated tests in CI/CD
|
|
2. Manual deployment process (prone to human error)
|
|
3. No staging environment validation
|
|
|
|
Action Items:
|
|
1. Add automated tests to CI/CD (Owner: Mike, Due: Dec 20)
|
|
2. Require staging deployment + validation before production (Owner: Sarah, Due: Dec 22)
|
|
3. Implement deployment checklist (Owner: Alex, Due: Dec 18)
|
|
```
|
|
|
|
---
|
|
|
|
## Related Documentation
|
|
|
|
- **Examples**: [Examples Index](../examples/INDEX.md) - RCA examples from real incidents
|
|
- **Severity Matrix**: [incident-severity-matrix.md](incident-severity-matrix.md) - When to perform RCA
|
|
- **Templates**: [Postmortem Template](../templates/postmortem-template.md) - Structured RCA format
|
|
|
|
---
|
|
|
|
Return to [reference index](INDEX.md)
|