Root Cause Analysis Techniques

Comprehensive methods for identifying root causes through 5 Whys, Fishbone Diagrams, Timeline Reconstruction, and data-driven hypothesis testing.

5 Whys Technique

Method

Ask "Why?" iteratively until you reach the root cause (typically 5 levels deep, but can be 3-7).

Rules:

  1. Start with the problem statement
  2. Ask "Why did this happen?"
  3. Answer based on facts/data (not assumptions)
  4. Repeat for each answer until root cause found
  5. Root cause = systemic issue (not human error)

Example

Problem: Database went down

Why 1: Why did the database go down?
→ Because the primary database ran out of disk space

Why 2: Why did it run out of disk space?
→ Because PostgreSQL logs filled the entire disk (450GB)

Why 3: Why did logs grow to 450GB?
→ Because log rotation was disabled

Why 4: Why was log rotation disabled?
→ Because `log_rotation_age` and `log_rotation_size` were set to `0` during a migration, disabling rotation

Why 5: Why was this config change not caught?
→ Because configuration changes are not code-reviewed and there was no disk monitoring alert

ROOT CAUSE: Missing disk monitoring alerts + configuration change without code review

Action Items:

  • Add disk usage monitoring (>90% alert)
  • Require code review for all config changes
  • Enable log rotation on all databases
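
For the first action item, a minimal sketch of a disk-usage check, assuming a Node.js ≥ 18.15 runtime (`fs.statfs`); the mount point and `sendAlert` hook are placeholders for your environment:

```typescript
import { statfs } from "node:fs/promises";

const THRESHOLD = 0.9; // alert above 90% usage, per the action item

// Placeholder: wire this to your real paging integration (PagerDuty, etc.)
async function sendAlert(message: string): Promise<void> {
  console.error(`[ALERT] ${message}`);
}

async function checkDiskUsage(mountPoint: string): Promise<void> {
  const stats = await statfs(mountPoint);
  // bavail = blocks available to unprivileged users; blocks = total blocks
  const usedFraction = 1 - stats.bavail / stats.blocks;
  if (usedFraction > THRESHOLD) {
    await sendAlert(
      `Disk usage on ${mountPoint}: ${(usedFraction * 100).toFixed(1)}%`
    );
  }
}

// Run periodically (cron, k8s CronJob) against the database volume
checkDiskUsage("/var/lib/postgresql").catch(console.error);
```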

Fishbone Diagram (Ishikawa)

Method

Sort contributing factors into major categories to identify root causes systematically.

Categories (6M's):

  1. Method (Process)
  2. Machine (Technology)
  3. Material (Inputs/Data)
  4. Measurement (Monitoring)
  5. Mother Nature (Environment)
  6. Manpower (People/Skills)
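
One lightweight way to work through the categories is as a checklist data structure rather than a drawing; this sketch uses the factors from the example that follows:

```typescript
type SixM =
  | "Method"
  | "Machine"
  | "Material"
  | "Measurement"
  | "Mother Nature"
  | "Manpower";

interface Fishbone {
  problem: string;
  factors: Partial<Record<SixM, string[]>>;
}

// Factors taken from the API-degradation example below
const analysis: Fishbone = {
  problem: "API performance degraded (p95: 200ms -> 2000ms)",
  factors: {
    Method: ["No memory profiling in code review", "No long-running load tests (only 5min)"],
    Machine: ["EventEmitter listeners leak (not removed)", "Node.js v14 (old GC)"],
    Material: ["Large dataset processing (100K orders)", "High traffic spike (2x)"],
    Measurement: ["No heap snapshots in CI/CD", "No gradual alerts"],
  },
};

// Categories left empty (here: Mother Nature, Manpower) were considered and ruled out
for (const [category, items] of Object.entries(analysis.factors)) {
  console.log(`${category}: ${(items ?? []).join("; ")}`);
}
```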

Example

Problem: API performance degraded (p95: 200ms → 2000ms)

                              API Performance Degraded
                                        │
        ┌───────────────────┬───────────┴───────────┬───────────────────┐
        │                   │                       │                   │
    METHOD              MACHINE                 MATERIAL          MEASUREMENT
  (Process)          (Technology)               (Data)            (Monitoring)
        │                   │                       │                   │
   No memory          EventEmitter              Large dataset       No heap
   profiling in       listeners leak            processing          snapshots
   code review        (not removed)             (100K orders)       in CI/CD
        │                   │                       │                   │
   No long-running    Node.js v14               High traffic        No gradual
   load tests         (old GC)                  spike (2x)          alerts
   (only 5min)                                                      (1h → 2h)

Root Causes Identified:

  • Machine: EventEmitter leak (technical)
  • Measurement: No heap monitoring (monitoring gap)
  • Method: No memory profiling in code review (process gap)

Timeline Reconstruction

Method

Build a chronological timeline of events to identify causation and correlation.

Steps:

  1. Gather logs from all systems (with timestamps)
  2. Normalize to UTC
  3. Plot events chronologically
  4. Identify cause-and-effect relationships
  5. Find the triggering event
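
A minimal sketch of steps 2-3, assuming ISO-8601 timestamps in the source logs; the event fields and sources are illustrative:

```typescript
interface TimelineEvent {
  timestamp: Date; // Date stores an absolute instant, so comparisons are UTC-safe
  source: string;  // e.g. "app-logs", "deploy-bot", "alertmanager"
  message: string;
}

// Steps 2-3: normalize every timestamp to UTC, then sort chronologically
function buildTimeline(
  raw: { ts: string; source: string; message: string }[]
): TimelineEvent[] {
  return raw
    .map((e) => ({
      timestamp: new Date(e.ts), // honors any timezone offset in the ISO string
      source: e.source,
      message: e.message,
    }))
    .sort((a, b) => a.timestamp.getTime() - b.timestamp.getTime());
}

// Events arrive out of order and in mixed timezones; the timeline fixes both
const timeline = buildTimeline([
  { ts: "2025-11-29T14:15:00+01:00", source: "k8s", message: "Workers OOMKilled" },
  { ts: "2025-11-29T12:15:00Z", source: "deploy-bot", message: "Deployed v2.15.4" },
]);

for (const e of timeline) {
  console.log(`${e.timestamp.toISOString()} [${e.source}] ${e.message}`);
}
```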

Example

12:00:00 - Normal operation (p95: 200ms, memory: 400MB)
12:15:00 - Code deployment (v2.15.4)
12:30:00 - Memory: 720MB (+80% in 15min) ⚠️
12:45:00 - Memory: 1.2GB (+67% in 15min) ⚠️
13:00:00 - Memory: 1.8GB (+50% in 15min) 🚨
13:00:00 - p95 latency: 800ms (4x slower)
13:15:00 - Memory: 2.3GB (limit reached)
13:15:00 - Workers start OOMing
13:20:00 - p95 latency: 2000ms (10x slower)
13:30:00 - Alert fired: High latency
14:00:00 - Alert fired: High memory

CORRELATION:
- Deployment at 12:15 → Memory growth starts at 12:30
- Memory growth → Latency increase (correlated)
- TRIGGER: Code deployment v2.15.4

ACTION: Review code changes in v2.15.4

Contributing Factors Analysis

Levels of Causation

Immediate Cause (What happened):

  • Direct technical failure
  • Example: EventEmitter listeners not removed

Underlying Conditions (Why it was possible):

  • Missing safeguards
  • Example: No memory profiling in code review

Latent Failures (Systemic weaknesses):

  • Organizational/process gaps
  • Example: No developer training on memory management

Example

Incident: Memory leak in production

Immediate Cause:
└─ Code: EventEmitter .on() used without .removeListener()

Underlying Conditions:
├─ No code review caught the issue
├─ No memory profiling in CI/CD
└─ Short load tests (5min) didn't reveal gradual leak

Latent Failures:
├─ Team lacks memory management training
├─ No documentation on EventEmitter best practices
└─ Culture of "ship fast, fix later"
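
The immediate cause above is a common Node.js pattern. A minimal sketch of the leak and one way to fix it (the event name and handlers are illustrative):

```typescript
import { EventEmitter } from "node:events";

const bus = new EventEmitter();

// LEAK: a listener is added per request and never removed, so every
// closure (and anything it captures) stays reachable forever.
// Node warns after 10 listeners on one event (MaxListenersExceededWarning).
function watchOrderLeaky(orderId: string): void {
  bus.on("order-updated", (id: string) => {
    if (id === orderId) console.log(`order ${orderId} updated`);
  });
}

// FIX: keep a reference to the listener and remove it once it has done its job
function watchOrderFixed(orderId: string): void {
  const listener = (id: string): void => {
    if (id !== orderId) return;
    console.log(`order ${orderId} updated`);
    bus.removeListener("order-updated", listener); // .off() is an alias
  };
  bus.on("order-updated", listener);
}
```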

Hypothesis Testing

Method

Generate hypotheses, test with data, validate or reject.

Process:

  1. Observe symptoms
  2. Generate hypotheses (educated guesses)
  3. Design experiments to test each hypothesis
  4. Collect data
  5. Accept or reject hypothesis
  6. Repeat until root cause found
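
As a minimal sketch of step 3, a timing wrapper can test a hypothesis like "the external API is slow" without full distributed tracing; the fraud-check URL is illustrative, and global `fetch`/`performance` assume Node.js ≥ 18:

```typescript
// Time any async call so its share of the total can be compared
async function timeCall<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    console.log(`${label}: ${(performance.now() - start).toFixed(0)}ms`);
  }
}

async function checkout(orderId: string): Promise<void> {
  await timeCall("checkout total", async () => {
    // If this dominates the total, the external-API hypothesis is supported
    await timeCall("fraud check (external)", () =>
      fetch(`https://fraud.example.com/check/${orderId}`)
    );
    // ...remaining checkout steps, wrapped the same way
  });
}

checkout("order-123").catch(console.error);
```

If one sub-call accounts for most of the total time, accept the hypothesis; otherwise reject it and test the next one.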

Example

Symptom: Checkout API slow (p95: 3000ms)

Hypothesis 1: Database slow queries

Test: Check slow query log
Data: All queries < 50ms ✅
Result: REJECTED - database is fast

Hypothesis 2: External API slow

Test: Distributed tracing (Jaeger)
Data: Fraud check API: 2750ms (91% of total time) 🚨
Result: ACCEPTED - external API is bottleneck

Hypothesis 3: Network latency

Test: curl timing breakdown
Data: DNS: 50ms, Connect: 30ms, Transfer: 2750ms
Result: PARTIAL - transfer is slow (not DNS/connect)

Root Cause: External fraud check API slow (blocking checkout)


Blameless RCA Principles

Core Tenets

  1. Focus on Systems, Not People

    • "Engineer made a mistake"
    • "Process didn't catch config error"
  2. Assume Good Intent

    • Everyone did the best they could with the information available
    • Blame discourages honesty and learning
  3. Multiple Contributing Factors

    • Never a single cause
    • Usually 3-5 factors contribute
  4. Actionable Improvements

    • Fix the system, not the person
    • Concrete action items with owners

Example (Blameless vs Blame)

Blameful (BAD):

Root Cause: Engineer Jane deployed code without testing
Action Item: Remind Jane to test before deploying

Blameless (GOOD):

Root Cause: Deployment process allowed untested code to reach production
Contributing Factors:
1. No automated tests in CI/CD
2. Manual deployment process (prone to human error)
3. No staging environment validation

Action Items:
1. Add automated tests to CI/CD (Owner: Mike, Due: Dec 20)
2. Require staging deployment + validation before production (Owner: Sarah, Due: Dec 22)
3. Implement deployment checklist (Owner: Alex, Due: Dec 18)

