zhongwei/gh-greyhaven-ai-claude-code-config-grey-haven-plugins-incident-response

Zhongwei Li 46dfc30864 Initial commit

2025-11-29 18:29:18 +08:00

Root Cause Analysis (RCA) Methodology

Comprehensive guide to performing effective root cause analysis for software bugs and incidents.

What is Root Cause Analysis?

Definition: A systematic process for identifying the fundamental reason a problem occurred, rather than just treating its symptoms.

Goal: Find the root cause(s) to implement prevention strategies that stop recurrence.

Key Principle: Distinguish between:

Symptom: Observable error (e.g., "API returns 500 error")
Proximate Cause: Immediate trigger (e.g., "Database query timeout")
Root Cause: Fundamental reason (e.g., "Missing database index on frequently queried column")

The 5 Whys Technique

Overview

Method: Ask "Why?" five times (or more) to drill down from symptom to root cause.

Origin: Toyota Production System (Lean Manufacturing)

Best For: Sequential cause-effect chains

Example: Null Pointer Error

Problem: API endpoint returns 500 error

Why? → User object is null when accessing .name property
Why? → Database query returned null instead of user
Why? → User ID doesn't exist in database
Why? → Frontend sent incorrect user ID from stale cache
Why? → Cache invalidation not triggered after user deletion
ROOT CAUSE: Missing cache invalidation on user deletion
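
The root cause in this chain can be illustrated with a minimal sketch. The `UserStore` class, its in-memory dictionaries, and its method names are hypothetical stand-ins for a real database and cache (the example above places the stale cache in the frontend; the same idea applies to any cache layer). The only point is that the delete path also evicts the cached entry.

```python
from typing import Optional


class UserStore:
    """Hypothetical user store with a read-through cache (illustration only)."""

    def __init__(self) -> None:
        self._db: dict[int, dict] = {}     # stands in for the database
        self._cache: dict[int, dict] = {}  # stands in for the cache layer

    def add_user(self, user_id: int, name: str) -> None:
        self._db[user_id] = {"id": user_id, "name": name}

    def get_user(self, user_id: int) -> Optional[dict]:
        # Serve from cache when possible, otherwise fall back to the database.
        if user_id in self._cache:
            return self._cache[user_id]
        user = self._db.get(user_id)
        if user is not None:
            self._cache[user_id] = user
        return user

    def delete_user(self, user_id: int) -> None:
        self._db.pop(user_id, None)
        # The fix: evict the cached copy in the same operation as the delete.
        self._cache.pop(user_id, None)


store = UserStore()
store.add_user(1, "Ada")
store.get_user(1)      # warms the cache
store.delete_user(1)   # removes the record and the cached entry
result = store.get_user(1)
```

Without the eviction line in `delete_user`, the final lookup would still return the deleted user from the cache.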

Rules for Effective 5 Whys

Be Specific: "Database slow" → "Query takes 4.5s (target: <200ms)"
Use Data: Support each "Why" with evidence (logs, metrics, traces)
Don't Stop Too Early: Keep asking until you reach a process/policy root cause
Don't Blame People: Focus on processes, not individuals
May Need More/Fewer Than 5: Stop when you reach an actionable root cause

Template

**Problem Statement**: [Observable symptom with impact]

**Why #1**: [First level cause]
**Evidence**: [Logs, metrics, traces]

**Why #2**: [Deeper cause]
**Evidence**: [Supporting data]

**Why #3**: [Even deeper]
**Evidence**: [Supporting data]

**Why #4**: [Near root cause]
**Evidence**: [Supporting data]

**Why #5**: [Root cause]
**Evidence**: [Supporting data]

**Root Cause**: [Fundamental reason]
**Prevention**: [How to prevent recurrence]
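
If RCAs are generated or stored programmatically, the template above could be filled from structured data. This is a sketch only; the `WhyStep` and `FiveWhys` names are hypothetical, and the bracketed evidence strings are placeholders, not real data.

```python
from dataclasses import dataclass, field


@dataclass
class WhyStep:
    cause: str
    evidence: str


@dataclass
class FiveWhys:
    problem: str
    steps: list[WhyStep] = field(default_factory=list)
    root_cause: str = ""
    prevention: str = ""

    def render(self) -> str:
        """Render the analysis in the same layout as the template."""
        lines = [f"**Problem Statement**: {self.problem}", ""]
        for number, step in enumerate(self.steps, start=1):
            lines.append(f"**Why #{number}**: {step.cause}")
            lines.append(f"**Evidence**: {step.evidence}")
            lines.append("")
        lines.append(f"**Root Cause**: {self.root_cause}")
        lines.append(f"**Prevention**: {self.prevention}")
        return "\n".join(lines)


analysis = FiveWhys(
    problem="API endpoint returns 500 error",
    steps=[
        WhyStep("User object is null when accessing .name property", "[stack trace]"),
        WhyStep("Database query returned null instead of user", "[query log]"),
    ],
    root_cause="Missing cache invalidation on user deletion",
    prevention="[prevention action]",
)
report = analysis.render()
```

Because `steps` is a list, the chain can hold more or fewer than five entries.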

Fishbone (Ishikawa) Diagram

Overview

Method: Visual diagram categorizing potential causes into major categories.

Best For: Complex problems with multiple contributing factors

Categories (Software Context):

Code: Logic errors, missing validation, edge cases
Data: Invalid inputs, corrupt data, missing records
Infrastructure: Server issues, network problems, resource limits
Dependencies: Third-party APIs, libraries, services
Process: Deployment issues, configuration errors, environment mismatches
People: Knowledge gaps, communication failures, assumptions

Example Structure

                     Code
                      |
            Missing validation
                 /
                /
API 500 ─────────────── Infrastructure
Error            \
                  \
                Database timeout
                      |
                    Data

When to Use

Multiple potential root causes
Need stakeholder alignment on cause
Complex systems with many dependencies
Post-incident reviews with team

Timeline Analysis

Overview

Method: Create chronological sequence of events leading to incident.

Best For: Understanding cascading failures, race conditions, timing issues

Example

## Timeline: User Profile Page Crash

**T-00:05** - User updates profile information
**T-00:03** - Profile update succeeds, cache invalidation triggered
**T-00:02** - Cache clear initiated but takes 3s (network latency)
**T-00:00** - User refreshes page, cache still has old data
**T+00:01** - API fetches user from cache (stale)
**T+00:02** - Frontend renders with field that was deleted in update
**T+00:03** - JavaScript error: Cannot read property 'X' of undefined
**T+00:04** - Error boundary catches error, shows crash page

**Root Cause**: Cache invalidation is async and completes after page reload, causing stale data rendering.
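
A timeline like the one above is usually assembled from timestamps scattered across logs. The sketch below sorts events and labels them relative to the incident time. The function names are hypothetical, and the labels are assumed to be minutes and seconds (the example above does not state its unit).

```python
from datetime import datetime, timedelta


def relative_label(event_time: datetime, incident_time: datetime) -> str:
    """Format an event time as T-MM:SS or T+MM:SS relative to the incident."""
    delta_seconds = int((event_time - incident_time).total_seconds())
    # Zero is shown as T-00:00, matching the example timeline.
    sign = "+" if delta_seconds > 0 else "-"
    minutes, seconds = divmod(abs(delta_seconds), 60)
    return f"T{sign}{minutes:02d}:{seconds:02d}"


def build_timeline(events: list[tuple[datetime, str]], incident_time: datetime) -> list[str]:
    """Sort events chronologically and render one line per event."""
    ordered = sorted(events, key=lambda event: event[0])
    return [f"**{relative_label(when, incident_time)}** - {what}" for when, what in ordered]


incident = datetime(2025, 1, 1, 14, 32, 15)  # arbitrary example timestamp
events = [
    (incident + timedelta(seconds=1), "API fetches user from cache (stale)"),
    (incident - timedelta(seconds=5), "User updates profile information"),
    (incident, "User refreshes page, cache still has old data"),
]
timeline = build_timeline(events, incident)
```

Events can be supplied in any order; sorting by timestamp is what exposes the ordering problem between cache invalidation and page reload.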

Components

Element	Description	Example
Timestamp	Relative or absolute time	`T-00:05` or `14:32:15 UTC`
Event	What happened	`User clicked submit`
System State	Relevant state at time	`Cache: stale, DB: updated`
Decision Point	Branch in event chain	`If cache miss: fetch DB`

Distinguishing Root Cause from Symptoms

Symptom vs. Root Cause

Symptom	Root Cause
API returns 500 error	Missing error handling for null user
Database query slow	Missing index on `user_id` column
Memory leak in production	Circular reference in event listeners
User can't login	Session cookie expires after 5 minutes (should be 24h)

Test: The Prevention Question

Ask: "If I fix this, will the problem never happen again?"

If Yes → Likely root cause
If No → Still a symptom or contributing factor

Example:

"Add null check" → Prevents this specific null error, but why was data null? (Symptom fix)
"Add database foreign key constraint" → Prevents any invalid user_id from being stored (Root cause fix)

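The foreign key example can be tried with SQLite from the Python standard library. This is a minimal sketch: the `users` and `orders` tables are hypothetical, and a production database would have its own syntax and defaults. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " user_id INTEGER NOT NULL REFERENCES users(id)"
    ")"
)
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders (user_id) VALUES (1)")  # valid reference

try:
    # No user with id 999 exists, so the constraint should reject this row.
    conn.execute("INSERT INTO orders (user_id) VALUES (999)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The constraint removes the whole class of invalid references at write time, whereas a null check only handles one read path.
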
Contributing Factors vs. Root Cause

Multiple Contributing Factors

Complex incidents often have multiple contributing factors and one primary root cause.

Example: Data Loss Incident

**Primary Root Cause**: Database backup script fails silently (no monitoring)

**Contributing Factors**:
1. No backup validation process
2. Backup monitoring disabled in production
3. Backup script lacks error logging
4. No runbook for backup verification
5. Manual backup never tested

**Analysis**: All factors contributed, but root cause is silent failure. Fix that first.

Prioritization

Priority	Factor Type	Action
P0	Root cause	Fix immediately
P1	Major contributor	Fix in same release
P2	Minor contributor	Fix in next sprint
P3	Edge case	Backlog

RCA Documentation Format

Standard Structure

# Root Cause Analysis: [Title]

**Date**: YYYY-MM-DD
**Incident ID**: INC-12345
**Severity**: [SEV1/SEV2/SEV3]
**Participants**: [Names of people involved in RCA]

## Summary

[2-3 sentence overview of incident and root cause]

## Impact

- **Users Affected**: [Number or %]
- **Duration**: [Time from start to resolution]
- **Business Impact**: [Revenue loss, SLA breach, etc.]

## Timeline

[Chronological sequence of events]

## Root Cause

[Detailed explanation of fundamental cause]

### 5 Whys Analysis

[Step-by-step "Why?" chain]

## Contributing Factors

[List of factors that enabled or worsened the incident]

## Prevention

### Immediate Actions (Within 24h)
- [ ] Action 1
- [ ] Action 2

### Short-term Actions (Within 1 week)
- [ ] Action 1
- [ ] Action 2

### Long-term Actions (Within 1 month)
- [ ] Action 1
- [ ] Action 2

## Lessons Learned

[Key takeaways and process improvements]

Prevention Strategy Development

Fix Categories

Category	Description	Example
Technical	Code, config, infrastructure changes	Add database index, implement retry logic
Process	Changes to how work is done	Require code review for DB changes
Monitoring	Detect issues before they cause incidents	Alert on slow query thresholds
Testing	Catch issues before production	Add integration test for edge case
Documentation	Improve knowledge sharing	Document backup restoration procedure

Prevention Checklist

**Can we prevent the root cause?**
- [ ] Technical fix implemented
- [ ] Tests added to catch recurrence
- [ ] Monitoring added to detect early

**Can we detect it faster?**
- [ ] Alerts configured
- [ ] Logging improved
- [ ] Dashboards updated

**Can we mitigate impact?**
- [ ] Graceful degradation added
- [ ] Circuit breaker implemented
- [ ] Fallback logic added

**Can we recover faster?**
- [ ] Runbook created
- [ ] Automation added
- [ ] Team trained

**Can we prevent similar issues?**
- [ ] Pattern identified
- [ ] Linting rule added
- [ ] Architecture review scheduled
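
The mitigation items (graceful degradation, circuit breaker, fallback logic) can be combined in one small pattern. The sketch below is a simplified illustration, not a production implementation: the `CircuitBreaker` class, its thresholds, and the injectable `clock` parameter are all assumptions made for this example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stop calling a failing dependency for a cool-down period."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                return fallback()   # open: skip the dependency entirely
            self._opened_at = None  # cool-down elapsed: allow a trial call
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()
            return fallback()       # degrade gracefully instead of failing
        self._failures = 0          # success closes the circuit again
        return result


now = [0.0]  # fake clock so the example does not need to sleep
attempts = []


def flaky() -> str:
    attempts.append(1)
    raise ConnectionError("dependency down")


breaker = CircuitBreaker(failure_threshold=2, reset_after_seconds=10, clock=lambda: now[0])
results = [breaker.call(flaky, lambda: "fallback") for _ in range(4)]
now[0] = 11.0
recovered = breaker.call(lambda: "ok", lambda: "fallback")
```

After the second failure the breaker opens, so the third and fourth calls return the fallback without touching the dependency.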

Common RCA Pitfalls

Pitfall 1: Stopping Too Early

Bad:

Why? → User got 500 error
Root Cause: Server returned error

Prevention: Fix server error

Good:

Why? → User got 500 error
Why? → Server threw unhandled exception
Why? → Null pointer accessing user.email
Why? → User object was null
Why? → Database returned no user
Why? → User ID didn't exist
Why? → Frontend sent deleted user's ID
Why? → Frontend cache not invalidated after deletion
Root Cause: Missing cache invalidation on user deletion

Prevention: Trigger cache clear on user deletion, add cache TTL as safety
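
The "cache TTL as safety" part of that prevention could look like the following sketch. `TTLCache` is a hypothetical class written for this example, and the `clock` parameter exists only so expiry can be demonstrated without waiting.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Cache whose entries expire after a fixed time-to-live."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[Any, tuple[float, Any]] = {}

    def set(self, key: Any, value: Any) -> None:
        # Store the value together with its expiry time.
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key: Any) -> Any:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry so stale data cannot be served forever.
            del self._entries[key]
            return None
        return value


now = [0.0]  # fake clock for the demonstration
cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
cache.set("user:1", {"name": "Ada"})
fresh = cache.get("user:1")
now[0] = 61.0
expired = cache.get("user:1")
```

A TTL bounds how long a missed invalidation can serve stale data; it complements explicit invalidation rather than replacing it.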

Pitfall 2: Blaming People

Bad:

Root Cause: Developer forgot to add validation
Prevention: Tell developer to remember next time

Good:

Root Cause: No validation enforced at API boundary
Prevention:
- Use Pydantic for automatic validation
- Add linting rule to detect missing validation
- Update code review checklist
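
Validation at the API boundary can be sketched as follows. The text above names Pydantic; to stay dependency-free, this sketch swaps in a standard-library dataclass with manual checks. The `CreateUserRequest` model and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CreateUserRequest:
    """Hypothetical request model; fields are invented for illustration."""

    email: str
    age: int

    def __post_init__(self) -> None:
        if not isinstance(self.email, str) or "@" not in self.email:
            raise ValueError("email must contain '@'")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(self.age, bool) or not isinstance(self.age, int) or self.age < 0:
            raise ValueError("age must be a non-negative integer")


def parse_create_user(payload: dict) -> CreateUserRequest:
    """Validate raw input once, at the boundary, before business logic runs."""
    try:
        return CreateUserRequest(email=payload["email"], age=payload["age"])
    except KeyError as missing:
        raise ValueError(f"missing field: {missing.args[0]}") from None
```

Because every request passes through `parse_create_user`, validation no longer depends on an individual developer remembering to add it in each handler.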

Pitfall 3: Accepting "Human Error" as Root Cause

Bad:

Root Cause: Admin accidentally deleted production database
Prevention: Be more careful

Good:

Root Cause: Production database lacks deletion protection
Prevention:
- Enable RDS deletion protection
- Require MFA for production access
- Implement soft-delete instead of hard-delete
- Add "Are you sure?" prompt that requires typed confirmation
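
The soft-delete item can be sketched with SQLite from the standard library. The `users` table, the `deleted_at` column, and the helper names are assumptions for this example; real schemas will differ.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, deleted_at TEXT)"
)
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ada')")


def soft_delete_user(conn: sqlite3.Connection, user_id: int) -> None:
    # Mark the row as deleted instead of removing it.
    conn.execute(
        "UPDATE users SET deleted_at = ? WHERE id = ? AND deleted_at IS NULL",
        (datetime.now(timezone.utc).isoformat(), user_id),
    )


def restore_user(conn: sqlite3.Connection, user_id: int) -> None:
    # Recovery is a single update because the data was never removed.
    conn.execute("UPDATE users SET deleted_at = NULL WHERE id = ?", (user_id,))


def active_users(conn: sqlite3.Connection) -> list:
    return conn.execute(
        "SELECT id, name FROM users WHERE deleted_at IS NULL ORDER BY id"
    ).fetchall()


soft_delete_user(conn, 1)
after_delete = active_users(conn)
total_rows = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
restore_user(conn, 1)
after_restore = active_users(conn)
```

An accidental deletion then becomes a recoverable state change rather than data loss.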

Pitfall 4: Multiple Root Causes Without Prioritization

Bad:

Root Causes:
1. Missing error handling
2. No monitoring
3. Bad documentation
4. Insufficient testing
5. Poor communication

Prevention: Fix all of them

Good:

Primary Root Cause: Missing error handling (caused immediate incident)

Contributing Factors:
- No monitoring (delayed detection)
- Insufficient testing (didn't catch before deployment)

Prevention Priority:
1. Add error handling (prevents recurrence) - P0
2. Add monitoring (faster detection) - P1
3. Add tests (catch in CI) - P1
4. Improve docs (better response) - P2

Pitfall 5: Technical Fix Without Process Improvement

Bad:

Root Cause: Missing database index
Prevention: Add index

Good:

Root Cause: Missing database index causing slow queries
Prevention:
- Technical: Add index on user_id column
- Process: Require query performance review in code review
- Monitoring: Alert on queries >200ms
- Testing: Add performance test asserting query time stays under 200ms
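
One way to automate the testing item is to assert that the query plan uses the index. The sketch below uses SQLite's `EXPLAIN QUERY PLAN`; the `orders` table, the index name, and the `uses_index` helper are hypothetical. The plan text is meant for humans and its wording varies between SQLite versions, so the check only looks for the index name.

```python
import sqlite3


def uses_index(conn: sqlite3.Connection, sql: str, params: tuple, index_name: str) -> bool:
    """Return True if SQLite's query plan for `sql` mentions `index_name`."""
    plan_rows = conn.execute(f"EXPLAIN QUERY PLAN {sql}", params).fetchall()
    # The last column of each plan row is a textual description of the step.
    return any(index_name in row[-1] for row in plan_rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)")
query = "SELECT id, total FROM orders WHERE user_id = ?"

before = uses_index(conn, query, (1,), "idx_orders_user_id")
conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")
after = uses_index(conn, query, (1,), "idx_orders_user_id")
```

A check like this fails in CI if the index is dropped or the query changes in a way that bypasses it, without depending on timing measurements.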

RCA Review and Validation

Review Checklist

- [ ] Root cause clearly identified and evidence-based
- [ ] Timeline accurate and complete
- [ ] All contributing factors documented
- [ ] Prevention strategies are actionable
- [ ] Prevention strategies assigned owners and due dates
- [ ] Lessons learned documented
- [ ] Incident review meeting scheduled
- [ ] RCA shared with relevant teams

Validation Questions

Completeness: Does the RCA explain all observed symptoms?
Preventability: Will the proposed fixes prevent recurrence?
Testability: Can we verify the fixes work?
Generalizability: Are there similar issues we should address?
Sustainability: Will fixes remain effective long-term?

Best Practices

Do's

✅ Start RCA immediately after incident resolution
✅ Involve multiple people for diverse perspectives
✅ Use data to support each "Why" answer
✅ Focus on processes, not people
✅ Document everything even if it seems obvious
✅ Assign owners to all prevention actions
✅ Set deadlines for prevention implementation
✅ Follow up to ensure actions completed

Don'ts

❌ Don't rush - Thorough RCA takes time
❌ Don't blame - Focus on systemic issues
❌ Don't accept vague answers - "System was slow" → "Query took 4.5s"
❌ Don't stop at technical fixes - Address process and monitoring too
❌ Don't skip documentation - Future incidents benefit from past RCAs
❌ Don't forget to close the loop - Verify prevention actions worked

Quick Reference

Technique	Best For	Output
5 Whys	Sequential cause-effect chains	Linear cause chain → root cause
Fishbone	Multiple potential causes	Categorized causes diagram
Timeline	Cascading failures, timing issues	Chronological event sequence

Root Cause Type	Fix Strategy
Missing validation	Add validation at boundary + tests
Missing error handling	Add try/catch + logging + monitoring
Performance issue	Optimize + add performance test + alert
Configuration error	Fix config + add validation + documentation
Process gap	Update process + add checklist + training
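
As an illustration of the "Missing error handling" row, the sketch below pairs exception handling with logging and a counter. The function name, the logger name, and the `error_counts` counter (standing in for a real metrics client) are all assumptions.

```python
import logging
from collections import Counter
from typing import Optional

logger = logging.getLogger("rca_example")
error_counts: Counter = Counter()  # stands in for a metrics/monitoring client


def get_user_email(users: dict, user_id: int) -> Optional[str]:
    """Look up a user's email, handling the missing-user case explicitly."""
    try:
        return users[user_id]["email"]
    except KeyError:
        # Logging records the context; the counter gives monitoring a signal to alert on.
        logger.warning("user lookup failed: user_id=%s", user_id)
        error_counts["user_lookup_failed"] += 1
        return None
```

Calling `get_user_email({}, 42)` returns `None` instead of raising, and each failure increments the counter so a rising rate can be alerted on.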

Usage: When debugging is complete, perform RCA to understand why the bug existed and how to prevent similar issues. Use 5 Whys for most cases, Fishbone for complex multi-factor incidents, Timeline for cascading failures.
