Root Cause Analysis Techniques
Comprehensive methods for identifying root causes through 5 Whys, Fishbone Diagrams, Timeline Reconstruction, and data-driven hypothesis testing.
5 Whys Technique
Method
Ask "Why?" iteratively until you reach the root cause (typically 5 levels deep, but can be 3-7).
Rules:
- Start with the problem statement
- Ask "Why did this happen?"
- Answer based on facts/data (not assumptions)
- Repeat for each answer until root cause found
- Root cause = systemic issue (not human error)
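The rules above can be captured in a small data structure. This is a minimal Python sketch, not a prescribed tool: the `WhyStep` and `FiveWhys` names and the `evidence` field are assumptions introduced here for illustration. Each question is derived from the previous answer, and an answer without supporting evidence is refused.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WhyStep:
    question: str
    answer: str
    evidence: str  # log line, metric, or config diff backing the answer


@dataclass
class FiveWhys:
    problem: str
    steps: List[WhyStep] = field(default_factory=list)

    def ask(self, answer: str, evidence: str) -> WhyStep:
        """Record the next 'Why?' together with its fact-based answer."""
        if not evidence.strip():
            raise ValueError("answers must be backed by facts or data")
        # The first question targets the problem; later ones target the previous answer.
        subject = self.steps[-1].answer if self.steps else self.problem
        step = WhyStep(f"Why: {subject}?", answer, evidence)
        self.steps.append(step)
        return step

    def root_cause(self) -> str:
        """The last recorded answer is the current root-cause candidate."""
        if not self.steps:
            raise ValueError("no answers recorded yet")
        return self.steps[-1].answer
```

Calling `ask()` once per level keeps the chain explicit, and `root_cause()` can then be checked against the last rule: if the final answer names a person rather than a process or system, keep asking.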
Example
Problem: Database went down
Why 1: Why did the database go down?
→ Because the primary database ran out of disk space
Why 2: Why did it run out of disk space?
→ Because PostgreSQL logs filled the entire disk (450GB)
Why 3: Why did logs grow to 450GB?
→ Because log rotation was disabled
Why 4: Why was log rotation disabled?
→ Because the `log_truncate_on_rotation` config was set to `off` during a migration, so old log files were appended to instead of being overwritten
Why 5: Why was this config change not caught?
→ Because configuration changes are not code-reviewed and there was no disk monitoring alert
ROOT CAUSE: Missing disk monitoring alerts + configuration change without code review
Action Items:
- Add disk usage monitoring (>90% alert)
- Require code review for all config changes
- Enable log rotation on all databases
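As an illustration of the first action item, the check behind a disk usage alert can be sketched in a few lines of Python. The function names and the default path are assumptions made for this sketch. A real setup would normally express the same rule in its monitoring system instead of a script.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return 100.0 * usage.used / usage.total


def should_alert(used_percent: float, threshold: float = 90.0) -> bool:
    """True when usage exceeds the threshold (the >90% rule from the action item)."""
    return used_percent > threshold


if __name__ == "__main__":
    used = disk_usage_percent("/")
    status = "ALERT" if should_alert(used) else "ok"
    print(f"disk usage: {used:.1f}% ({status})")
```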
Fishbone Diagram (Ishikawa)
Method
Group contributing factors into major categories to identify root causes systematically.
Categories (the 6 Ms):
- Method (Process)
- Machine (Technology)
- Material (Inputs/Data)
- Measurement (Monitoring)
- Mother Nature (Environment)
- Manpower (People/Skills)
Example
Problem: API performance degraded (p95: 200ms → 2000ms)
                          API Performance Degraded
                                      │
        ┌───────────────────┬─────────┴─────────┬───────────────────┐
        │                   │                   │                   │
     METHOD              MACHINE            MATERIAL           MEASUREMENT
    (Process)         (Technology)           (Data)           (Monitoring)
        │                   │                   │                   │
    No memory         EventEmitter        Large dataset          No heap
  profiling in       listeners leak        processing           snapshots
   code review        (not removed)       (100K orders)         in CI/CD
        │                   │                   │                   │
 No long-running       Node.js v14        High traffic         No gradual
   load tests           (old GC)           spike (2x)            alerts
   (only 5min)                                                  (1h → 2h)
Root Causes Identified:
- Machine: EventEmitter leak (technical)
- Measurement: No heap monitoring (monitoring gap)
- Method: No memory profiling in code review (process gap)
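The fishbone from this example can also be kept as plain data, which makes it easy to spot categories nobody has examined yet. The sketch below is one possible representation. The `build_fishbone` and `uncovered` helpers are hypothetical names, and the factor strings are copied from the diagram.

```python
from collections import defaultdict

CATEGORIES = ("Method", "Machine", "Material", "Measurement", "Mother Nature", "Manpower")


def build_fishbone(factors):
    """Group (category, factor) pairs by category, rejecting unknown categories."""
    bones = defaultdict(list)
    for category, factor in factors:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        bones[category].append(factor)
    return dict(bones)


def uncovered(bones):
    """Categories with no factors yet, as prompts for further brainstorming."""
    return [c for c in CATEGORIES if c not in bones]


bones = build_fishbone([
    ("Method", "No memory profiling in code review"),
    ("Method", "No long-running load tests (only 5min)"),
    ("Machine", "EventEmitter listeners leak (not removed)"),
    ("Machine", "Node.js v14 (old GC)"),
    ("Material", "Large dataset processing (100K orders)"),
    ("Material", "High traffic spike (2x)"),
    ("Measurement", "No heap snapshots in CI/CD"),
    ("Measurement", "No gradual alerts"),
])
print(uncovered(bones))  # ['Mother Nature', 'Manpower']
```

An empty category does not prove there is nothing there. It only shows that the analysis has not looked at environment or people/skills factors yet.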
Timeline Reconstruction
Method
Build a chronological timeline of events to separate causation from mere correlation.
Steps:
- Gather logs from all systems (with timestamps)
- Normalize to UTC
- Plot events chronologically
- Identify cause-and-effect relationships
- Find the triggering event
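The first three steps can be sketched in Python as follows. This assumes each system exports events as ISO 8601 timestamps that carry a UTC offset. The `to_utc` and `build_timeline` helpers, the source names, and the calendar date are placeholders made up for illustration.

```python
from datetime import datetime, timezone


def to_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp with a UTC offset and convert it to UTC."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        # Without an offset the event cannot be placed reliably on a shared timeline.
        raise ValueError(f"timestamp has no UTC offset: {timestamp}")
    return parsed.astimezone(timezone.utc)


def build_timeline(*sources):
    """Merge (name, [(timestamp, message), ...]) sources into one list, oldest first."""
    events = [
        (to_utc(ts), name, message)
        for name, entries in sources
        for ts, message in entries
    ]
    return sorted(events, key=lambda event: event[0])


# Hypothetical inputs: the deploy log uses local time (+01:00), metrics use UTC.
timeline = build_timeline(
    ("metrics", [("2024-01-01T12:30:00+00:00", "Memory: 720MB")]),
    ("deploys", [("2024-01-01T13:15:00+01:00", "Code deployment (v2.15.4)")]),
)
for when, source, message in timeline:
    print(when.strftime("%H:%M:%S"), source, message)
```

After normalization the deployment sorts before the memory reading, even though its local timestamp looks later. That ordering mistake is the one that normalizing to UTC guards against.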
Example
12:00:00 - Normal operation (p95: 200ms, memory: 400MB)
12:15:00 - Code deployment (v2.15.4)
12:30:00 - Memory: 720MB (+80% in 15min) ⚠️
12:45:00 - Memory: 1.2GB (+67% in 15min) ⚠️
13:00:00 - Memory: 1.8GB (+50% in 15min) 🚨
13:00:00 - p95 latency: 800ms (4x slower)
13:15:00 - Memory: 2.3GB (limit reached)
13:15:00 - Workers start OOMing
13:20:00 - p95 latency: 2000ms (10x slower)
13:30:00 - Alert fired: High latency
14:00:00 - Alert fired: High memory
CORRELATION:
- Deployment at 12:15 → Memory growth starts at 12:30
- Memory growth → Latency increase (correlated)
- TRIGGER: Code deployment v2.15.4
ACTION: Review code changes in v2.15.4
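The growth percentages in the timeline can be reproduced with a short calculation. The sketch below assumes that memory was still at the 400MB baseline when v2.15.4 was deployed at 12:15 and that 1GB is counted as 1000MB. The function names are illustrative.

```python
def growth_percent(previous_mb: float, current_mb: float) -> float:
    """Percentage change from one memory sample to the next."""
    return 100.0 * (current_mb - previous_mb) / previous_mb


def growth_rates(samples_mb):
    """Growth between consecutive samples, rounded to whole percent."""
    return [round(growth_percent(a, b)) for a, b in zip(samples_mb, samples_mb[1:])]


# Readings at 12:15 (assumed baseline), 12:30, 12:45, and 13:00, in MB.
samples_mb = [400, 720, 1200, 1800]
print(growth_rates(samples_mb))  # [80, 67, 50]
```

Memory keeps growing in every interval after the deployment with no plateau, which is the pattern that points to a leak and not to a one-time increase in working set.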
Contributing Factors Analysis
Levels of Causation
Immediate Cause (What happened):
- Direct technical failure
- Example: EventEmitter listeners not removed
Underlying Conditions (Why it was possible):
- Missing safeguards
- Example: No memory profiling in code review
Latent Failures (Systemic weaknesses):
- Organizational/process gaps
- Example: No developer training on memory management
Example
Incident: Memory leak in production
Immediate Cause:
└─ Code: EventEmitter .on() used without .removeListener()
Underlying Conditions:
├─ No code review caught the issue
├─ No memory profiling in CI/CD
└─ Short load tests (5min) didn't reveal gradual leak
Latent Failures:
├─ Team lacks memory management training
├─ No documentation on EventEmitter best practices
└─ Culture of "ship fast, fix later"
Hypothesis Testing
Method
Generate hypotheses, test with data, validate or reject.
Process:
- Observe symptoms
- Generate hypotheses (educated guesses)
- Design experiments to test each hypothesis
- Collect data
- Accept or reject hypothesis
- Repeat until root cause found
Example
Symptom: Checkout API slow (p95: 3000ms)
Hypothesis 1: Database slow queries
Test: Check slow query log
Data: All queries < 50ms ✅
Result: REJECTED - database is fast
Hypothesis 2: External API slow
Test: Distributed tracing (Jaeger)
Data: Fraud check API: 2750ms (~92% of total time) 🚨
Result: ACCEPTED - external API is bottleneck
Hypothesis 3: Network latency
Test: curl timing breakdown
Data: DNS: 50ms, Connect: 30ms, Transfer: 2750ms
Result: PARTIAL - transfer is slow (not DNS/connect)
Root Cause: External fraud check API slow (blocking checkout)
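The accept/reject decision in this example can be made explicit by comparing each suspect's share of the total request time against a threshold. The following Python sketch uses the figures from the example. The `Hypothesis` class, the helper names, and the 50% threshold are assumptions chosen for illustration, and the 50ms database figure is the per-query upper bound from the slow query log.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    name: str
    measured_ms: float  # time attributed to this suspect in one request


def share_of_total(measured_ms: float, total_ms: float) -> float:
    """Percentage of the request time spent in one component."""
    return 100.0 * measured_ms / total_ms


def verdict(h: Hypothesis, total_ms: float, threshold_percent: float = 50.0) -> str:
    """Accept a hypothesis only if its component dominates the request time."""
    share = share_of_total(h.measured_ms, total_ms)
    return "ACCEPTED" if share >= threshold_percent else "REJECTED"


total_ms = 3000  # p95 of the checkout API
candidates = [
    Hypothesis("Database slow queries", 50),       # upper bound from the slow query log
    Hypothesis("External fraud check API", 2750),  # from the distributed trace
]
for h in candidates:
    share = share_of_total(h.measured_ms, total_ms)
    print(f"{h.name}: {share:.1f}% of total -> {verdict(h, total_ms)}")
```

Writing the threshold down before collecting data keeps the test honest: the hypothesis is judged against a criterion fixed in advance, not against whatever the data happens to show.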
Blameless RCA Principles
Core Tenets
- Focus on Systems, Not People
  - ❌ "Engineer made a mistake"
  - ✅ "Process didn't catch config error"
- Assume Good Intent
  - Everyone did the best they could with information available
  - Blame discourages honesty and learning
- Multiple Contributing Factors
  - Never a single cause
  - Usually 3-5 factors contribute
- Actionable Improvements
  - Fix the system, not the person
  - Concrete action items with owners
Example (Blameless vs Blame)
Blameful (BAD):
Root Cause: Engineer Jane deployed code without testing
Action Item: Remind Jane to test before deploying
Blameless (GOOD):
Root Cause: Deployment process allowed untested code to reach production
Contributing Factors:
1. No automated tests in CI/CD
2. Manual deployment process (prone to human error)
3. No staging environment validation
Action Items:
1. Add automated tests to CI/CD (Owner: Mike, Due: Dec 20)
2. Require staging deployment + validation before production (Owner: Sarah, Due: Dec 22)
3. Implement deployment checklist (Owner: Alex, Due: Dec 18)
Related Documentation
- Examples: Examples Index - RCA examples from real incidents
- Severity Matrix: incident-severity-matrix.md - When to perform RCA
- Templates: Postmortem Template - Structured RCA format