Initial commit

Zhongwei Li
2025-11-29 18:29:18 +08:00
commit 46dfc30864
25 changed files with 4683 additions and 0 deletions

@@ -0,0 +1,82 @@
# Incident Response Reference
Quick reference guides for incident severity classification, communication templates, root cause analysis techniques, and runbook structure.
## Available References
### Incident Severity Matrix
**File**: [incident-severity-matrix.md](incident-severity-matrix.md)
Complete severity classification guide with examples:
- **SEV1 (Critical)**: Complete outage, all customers affected, revenue stopped
- **SEV2 (Major)**: Partial degradation, significant customer impact
- **SEV3 (Minor)**: Isolated issues, workarounds available
- **SEV4 (Cosmetic)**: UI issues, no functional impact
**Use when**: Classifying incident severity, determining escalation path
---
### Communication Templates
**File**: [communication-templates.md](communication-templates.md)
Ready-to-use templates for all incident communications:
- Internal updates (Slack, email)
- External communications (status page, customer emails)
- Executive briefings
- Post-incident summaries
- Postmortem distribution
**Use when**: Communicating during or after incidents
---
### Root Cause Analysis Techniques
**File**: [rca-techniques.md](rca-techniques.md)
Comprehensive RCA methodology guide:
- **5 Whys**: Iterative questioning to find root cause
- **Fishbone Diagrams**: Category-based analysis
- **Timeline Reconstruction**: Event sequencing and correlation
- **Contributing Factors Analysis**: Immediate vs underlying vs latent causes
- **Hypothesis Testing**: Data-driven validation
**Use when**: Conducting root cause analysis, writing postmortems
---
### Runbook Structure Guide
**File**: [runbook-structure-guide.md](runbook-structure-guide.md)
Best practices for writing effective runbooks:
- Standard runbook template
- Diagnostic procedures
- Remediation steps
- Escalation paths
- Success criteria
- Runbook maintenance
**Use when**: Creating or updating runbooks, automating diagnostics
---
## Quick Links
**By Use Case**:
- Need to classify incident severity → [Severity Matrix](incident-severity-matrix.md)
- Need to communicate during incident → [Communication Templates](communication-templates.md)
- Need to find root cause → [RCA Techniques](rca-techniques.md)
- Need to write a runbook → [Runbook Structure Guide](runbook-structure-guide.md)
**Related Documentation**:
- **Examples**: [Examples Index](../examples/INDEX.md) - Real-world incident examples
- **Templates**: [Templates Index](../templates/INDEX.md) - Incident timeline, postmortem templates
- **Main Agent**: [incident-responder.md](../incident-responder.md) - Incident responder agent
---
Return to [main agent](../incident-responder.md)

@@ -0,0 +1,213 @@
# Communication Templates
Copy-paste templates for incident communications across all channels and severity levels.
## Internal Communications
### SEV1 Incident Start
```
🚨 SEV1 INCIDENT DECLARED 🚨
Incident ID: INC-YYYY-MM-DD-XXX
Impact: [Brief description - e.g., "100% outage, database down"]
Affected Services: [List services]
Customer Impact: [All users / X% of users / Specific feature unavailable]
Roles:
- Incident Commander: @[name]
- Technical Lead: @[name]
- Communications Lead: @[name]
War Room:
- Slack: #incident-XXX
- Zoom: [link]
Status: Investigating
Next Update: [Time] (15 minutes or on status change)
Runbook: [Link if available]
```
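If SEV1 announcements are posted by tooling rather than typed by hand, the placeholders can be filled from a small structure. A minimal TypeScript sketch using the official `@slack/web-api` client; the token, channel name, and field names are assumptions to adapt to your workspace:
```typescript
import { WebClient } from "@slack/web-api";

// Token and channel name are placeholders; fill in your workspace values.
const slack = new WebClient(process.env.SLACK_BOT_TOKEN);

interface Sev1Announcement {
  incidentId: string;      // INC-YYYY-MM-DD-XXX
  impact: string;
  affectedServices: string[];
  customerImpact: string;
  commander: string;
  techLead: string;
  commsLead: string;
  warRoom: string;         // e.g. "#incident-XXX"
  zoomLink: string;
  nextUpdate: string;
}

// Render the SEV1 template above with the placeholders filled in.
function renderSev1(a: Sev1Announcement): string {
  return [
    "🚨 SEV1 INCIDENT DECLARED 🚨",
    `Incident ID: ${a.incidentId}`,
    `Impact: ${a.impact}`,
    `Affected Services: ${a.affectedServices.join(", ")}`,
    `Customer Impact: ${a.customerImpact}`,
    "Roles:",
    `- Incident Commander: @${a.commander}`,
    `- Technical Lead: @${a.techLead}`,
    `- Communications Lead: @${a.commsLead}`,
    "War Room:",
    `- Slack: ${a.warRoom}`,
    `- Zoom: ${a.zoomLink}`,
    "Status: Investigating",
    `Next Update: ${a.nextUpdate} (15 minutes or on status change)`,
  ].join("\n");
}

// Post the announcement to a broadcast channel (channel name is a placeholder).
async function announceSev1(a: Sev1Announcement): Promise<void> {
  await slack.chat.postMessage({ channel: "#incidents", text: renderSev1(a) });
}
```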
### SEV2 Incident Start
```
⚠️ SEV2 INCIDENT
Incident ID: INC-YYYY-MM-DD-XXX
Impact: [Brief description - e.g., "API degraded, 30% users affected"]
Symptoms: [What users are experiencing]
IC: @[name]
Status: Investigating [suspected cause]
Next Update: 30 minutes
```
### Incident Update
```
📊 UPDATE #[N] (T+[X] minutes)
Root Cause: [What we found OR "Still investigating"]
Mitigation: [What we're doing]
Impact: [Current status - improving/stable/worsening]
ETA: [Expected resolution time OR "Unknown"]
Next Update: [Time]
```
### Incident Resolved
```
🎉 INCIDENT RESOLVED (T+[X] minutes/hours)
Final Status: All services operational
Root Cause: [Brief summary]
Fix Applied: [What was done]
Monitoring: Ongoing for [duration]
Postmortem: Scheduled for [date/time]
Timeline: [Link to detailed timeline]
```
## External Communications
### Status Page - Investigating
```
🔴 INVESTIGATING - [Brief Title]
We are investigating reports of [issue description].
Our team is actively working to identify the cause.
Affected: [Service names]
Started: [HH:MM UTC]
Next Update: [HH:MM UTC]
```
### Status Page - Identified
```
🟡 IDENTIFIED - [Issue Title]
We have identified the issue as [brief cause].
Our team is implementing a fix.
Affected: [Service names]
Started: [HH:MM UTC]
Identified: [HH:MM UTC]
Est. Resolution: [HH:MM UTC]
```
### Status Page - Monitoring
```
🟢 MONITORING - [Issue Title] Resolved
The issue has been resolved and services are operating normally.
We are monitoring to ensure stability.
Started: [HH:MM UTC]
Resolved: [HH:MM UTC]
Duration: [X] minutes
```
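Status page updates can be pushed programmatically as well. A hedged TypeScript sketch against a Statuspage-style REST API; the endpoint, environment variables, and body shape are assumptions, so check your provider's documentation before relying on it:
```typescript
// Sketch only: the URL, auth header, env vars, and body shape mimic a
// Statuspage-style API but are assumptions; adjust to your provider.
type IncidentStatus = "investigating" | "identified" | "monitoring" | "resolved";

interface StatusUpdate {
  title: string;
  status: IncidentStatus;
  body: string; // e.g. the "Investigating" template text above
}

async function postStatusUpdate(update: StatusUpdate): Promise<void> {
  const pageId = process.env.STATUSPAGE_PAGE_ID;
  const apiKey = process.env.STATUSPAGE_API_KEY;
  const res = await fetch(`https://api.statuspage.io/v1/pages/${pageId}/incidents`, {
    method: "POST",
    headers: { Authorization: `OAuth ${apiKey}`, "Content-Type": "application/json" },
    body: JSON.stringify({
      incident: { name: update.title, status: update.status, body: update.body },
    }),
  });
  if (!res.ok) {
    throw new Error(`Status page update failed: ${res.status} ${await res.text()}`);
  }
}
```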
### Customer Email - SEV1 Postmortem
```
Subject: Service Disruption - [Date] Postmortem
Dear [Product Name] Customers,
On [Date] at [Time UTC], we experienced a service disruption that affected [all users / X% of users] for approximately [duration].
What Happened:
[2-3 sentence summary of the incident]
Impact:
- Duration: [X] minutes
- Affected Users: [percentage or description]
- Services Impacted: [list]
Root Cause:
[1-2 sentence explanation of root cause]
Resolution:
[1-2 sentences on how we fixed it]
Prevention:
We have implemented the following measures to prevent recurrence:
1. [Measure 1]
2. [Measure 2]
3. [Measure 3]
We sincerely apologize for the inconvenience and appreciate your patience.
[Team Name]
[Company Name]
```
## Executive Briefings
### Initial Notification (SEV1 only)
```
Subject: SEV1 Incident - [Brief Title]
Summary:
- Incident ID: INC-YYYY-MM-DD-XXX
- Started: [HH:MM UTC]
- Impact: [All users affected / X% affected / revenue stopped]
- Status: [Investigating / Mitigation in progress]
Current Situation:
[2-3 sentences explaining what's happening]
Response:
- IC: [Name]
- Team: [X] engineers actively working
- ETA: [Time if known, "Unknown" if not]
Business Impact:
- Revenue: [Estimated $ per hour OR "Minimal"]
- Customers: [Number affected]
- SLA: [Yes/No breach, details]
Next Update: [Time]
```
### Resolution Summary (Executive)
```
Subject: SEV1 Resolved - [Brief Title]
Incident INC-YYYY-MM-DD-XXX has been resolved after [duration].
Timeline:
- Started: [HH:MM UTC]
- Identified: [HH:MM UTC]
- Resolved: [HH:MM UTC]
- Total Duration: [X] minutes
Impact:
- Customers Affected: [Number / percentage]
- Revenue Loss: [$X estimated]
- SLA Breach: [Yes/No]
Root Cause:
[1-2 sentences]
Resolution:
[1-2 sentences on fix]
Prevention:
[2-3 key action items with owners and dates]
Postmortem: Scheduled for [date/time]
[IC Name]
```
## Related Documentation
- **Severity Matrix**: [incident-severity-matrix.md](incident-severity-matrix.md) - When to use each template
- **Examples**: [Examples Index](../examples/INDEX.md) - Real communications from incidents
- **Templates**: [Templates Index](../templates/INDEX.md) - Full incident timeline and postmortem templates
---
Return to [reference index](INDEX.md)

@@ -0,0 +1,260 @@
# Root Cause Analysis Techniques
Comprehensive methods for identifying root causes through 5 Whys, Fishbone Diagrams, Timeline Reconstruction, and data-driven hypothesis testing.
## 5 Whys Technique
### Method
Ask "Why?" iteratively until you reach the root cause (typically 5 levels deep, but can be 3-7).
**Rules**:
1. Start with the problem statement
2. Ask "Why did this happen?"
3. Answer based on facts/data (not assumptions)
4. Repeat for each answer until root cause found
5. Root cause = **systemic issue** (not human error)
### Example
**Problem**: Database went down
```
Why 1: Why did the database go down?
→ Because the primary database ran out of disk space
Why 2: Why did it run out of disk space?
→ Because PostgreSQL logs filled the entire disk (450GB)
Why 3: Why did logs grow to 450GB?
→ Because log rotation was disabled
Why 4: Why was log rotation disabled?
→ Because the `log_truncate_on_rotation` config was set to `off` during a migration
Why 5: Why was this config change not caught?
→ Because configuration changes are not code-reviewed and there was no disk monitoring alert
ROOT CAUSE: Missing disk monitoring alerts + configuration change without code review
```
**Action Items**:
- Add disk usage monitoring (>90% alert)
- Require code review for all config changes
- Enable log rotation on all databases
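Capturing the chain as data rather than free text keeps 5 Whys write-ups consistent across postmortems. A minimal TypeScript sketch; the types and field names are illustrative, not an established schema:
```typescript
// Each step records the question, a fact-based answer, and the evidence behind it.
interface WhyStep {
  question: string;
  answer: string;
  evidence: string; // link to logs, dashboards, or config diffs
}

interface FiveWhys {
  problem: string;
  steps: WhyStep[];    // typically 3-7 entries
  rootCause: string;   // must be systemic, not a person
  actionItems: string[];
}

// Render the chain in the same "Why N / →" format used above.
function renderFiveWhys(rca: FiveWhys): string {
  const lines = [`Problem: ${rca.problem}`];
  rca.steps.forEach((step, i) => {
    lines.push(`Why ${i + 1}: ${step.question}`);
    lines.push(`→ ${step.answer} (evidence: ${step.evidence})`);
  });
  lines.push(`ROOT CAUSE: ${rca.rootCause}`);
  return lines.join("\n");
}
```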
---
## Fishbone Diagram (Ishikawa)
### Method
Group contributing factors into major categories to identify the root cause systematically.
**Categories** (6M's):
1. **Method** (Process)
2. **Machine** (Technology)
3. **Material** (Inputs/Data)
4. **Measurement** (Monitoring)
5. **Mother Nature** (Environment)
6. **Manpower** (People/Skills)
### Example
**Problem**: API performance degraded (p95: 200ms → 2000ms)
```
API Performance Degraded
├─ METHOD (Process)
│  ├─ No memory profiling in code review
│  └─ No long-running load tests (only 5 min)
├─ MACHINE (Technology)
│  ├─ EventEmitter listeners leak (not removed)
│  └─ Node.js v14 (old GC)
├─ MATERIAL (Data)
│  ├─ Large dataset processing (100K orders)
│  └─ High traffic spike (2x, 1h → 2h)
└─ MEASUREMENT (Monitoring)
   ├─ No heap snapshots in CI/CD
   └─ No gradual-growth alerts
```
**Root Causes Identified**:
- **Machine**: EventEmitter leak (technical)
- **Measurement**: No heap monitoring (monitoring gap)
- **Method**: No memory profiling in code review (process gap)
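The same factors can be collected per category before (or instead of) drawing the diagram. A small TypeScript sketch using the 6M categories; the structure is illustrative and the example entries are taken from the incident above:
```typescript
type FishboneCategory =
  | "Method" | "Machine" | "Material"
  | "Measurement" | "Mother Nature" | "Manpower";

// Contributing factors grouped by 6M category for one problem statement.
interface Fishbone {
  problem: string;
  factors: Partial<Record<FishboneCategory, string[]>>;
}

const apiLatency: Fishbone = {
  problem: "API performance degraded (p95: 200ms → 2000ms)",
  factors: {
    Method: ["No memory profiling in code review", "Load tests only run for 5 minutes"],
    Machine: ["EventEmitter listeners never removed", "Node.js v14 (old GC)"],
    Material: ["Large dataset processing (100K orders)", "High traffic spike (2x)"],
    Measurement: ["No heap snapshots in CI/CD", "No gradual-growth alerts"],
  },
};

// Print each branch of the fishbone as an indented list.
for (const [category, items] of Object.entries(apiLatency.factors)) {
  console.log(category);
  items?.forEach((item) => console.log(`  - ${item}`));
}
```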
---
## Timeline Reconstruction
### Method
Build a chronological timeline of events to identify causation and correlation.
**Steps**:
1. Gather logs from all systems (with timestamps)
2. Normalize to UTC
3. Plot events chronologically
4. Identify cause-and-effect relationships
5. Find the triggering event
### Example
```
12:00:00 - Normal operation (p95: 200ms, memory: 400MB)
12:15:00 - Code deployment (v2.15.4)
12:30:00 - Memory: 720MB (+80% in 15min) ⚠️
12:45:00 - Memory: 1.2GB (+67% in 15min) ⚠️
13:00:00 - Memory: 1.8GB (+50% in 15min) 🚨
13:00:00 - p95 latency: 800ms (4x slower)
13:15:00 - Memory: 2.3GB (limit reached)
13:15:00 - Workers start OOMing
13:20:00 - p95 latency: 2000ms (10x slower)
13:30:00 - Alert fired: High latency
14:00:00 - Alert fired: High memory
CORRELATION:
- Deployment at 12:15 → Memory growth starts at 12:30
- Memory growth → Latency increase (correlated)
- TRIGGER: Code deployment v2.15.4
ACTION: Review code changes in v2.15.4
```
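Merging and ordering events from several systems is the tedious part of timeline reconstruction. A minimal TypeScript sketch that parses timestamps, sorts chronologically, and prints everything in UTC; the sources, dates, and field names are placeholders:
```typescript
interface TimelineEvent {
  timestamp: Date; // stored as an absolute instant; printed in UTC
  source: string;  // e.g. "deploys", "metrics", "alerts"
  message: string;
}

// Parse ISO-8601 timestamps (any offset) from several sources into one sorted list.
function buildTimeline(raw: { source: string; time: string; message: string }[]): TimelineEvent[] {
  return raw
    .map((e) => ({ timestamp: new Date(e.time), source: e.source, message: e.message }))
    .sort((a, b) => a.timestamp.getTime() - b.timestamp.getTime());
}

const timeline = buildTimeline([
  { source: "deploys", time: "2024-12-01T12:15:00+00:00", message: "Code deployment (v2.15.4)" },
  { source: "metrics", time: "2024-12-01T13:00:00+00:00", message: "p95 latency 800ms (4x slower)" },
  { source: "alerts", time: "2024-12-01T13:30:00+00:00", message: "Alert fired: High latency" },
]);

// Print everything in UTC so clocks from different systems line up.
for (const e of timeline) {
  console.log(`${e.timestamp.toISOString()}  [${e.source}]  ${e.message}`);
}
```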
---
## Contributing Factors Analysis
### Levels of Causation
**Immediate Cause** (What happened):
- Direct technical failure
- Example: EventEmitter listeners not removed
**Underlying Conditions** (Why it was possible):
- Missing safeguards
- Example: No memory profiling in code review
**Latent Failures** (Systemic weaknesses):
- Organizational/process gaps
- Example: No developer training on memory management
### Example
**Incident**: Memory leak in production
```
Immediate Cause:
└─ Code: EventEmitter .on() used without .removeListener()
Underlying Conditions:
├─ No code review caught the issue
├─ No memory profiling in CI/CD
└─ Short load tests (5min) didn't reveal gradual leak
Latent Failures:
├─ Team lacks memory management training
├─ No documentation on EventEmitter best practices
└─ Culture of "ship fast, fix later"
```
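Recording all three levels explicitly keeps a postmortem from stopping at the immediate cause. A small TypeScript sketch of that structure with a simple completeness check; the field names are illustrative and the entries mirror the example above:
```typescript
// Three levels of causation captured explicitly.
interface ContributingFactors {
  immediateCause: string;         // what happened
  underlyingConditions: string[]; // why it was possible
  latentFailures: string[];       // systemic weaknesses
}

const memoryLeak: ContributingFactors = {
  immediateCause: "EventEmitter .on() used without .removeListener()",
  underlyingConditions: [
    "No code review caught the issue",
    "No memory profiling in CI/CD",
    "Short load tests (5min) didn't reveal gradual leak",
  ],
  latentFailures: [
    "Team lacks memory management training",
    "Culture of 'ship fast, fix later'",
  ],
};

// An RCA that stops at the immediate cause is incomplete.
function isComplete(f: ContributingFactors): boolean {
  return f.underlyingConditions.length > 0 && f.latentFailures.length > 0;
}

console.log(isComplete(memoryLeak)); // true
```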
---
## Hypothesis Testing
### Method
Generate hypotheses, test with data, validate or reject.
**Process**:
1. Observe symptoms
2. Generate hypotheses (educated guesses)
3. Design experiments to test each hypothesis
4. Collect data
5. Accept or reject hypothesis
6. Repeat until root cause found
### Example
**Symptom**: Checkout API slow (p95: 3000ms)
**Hypothesis 1**: Database slow queries
```
Test: Check slow query log
Data: All queries < 50ms ✅
Result: REJECTED - database is fast
```
**Hypothesis 2**: External API slow
```
Test: Distributed tracing (Jaeger)
Data: Fraud check API: 2750ms (91% of total time) 🚨
Result: ACCEPTED - external API is bottleneck
```
**Hypothesis 3**: Network latency
```
Test: curl timing breakdown
Data: DNS: 50ms, Connect: 30ms, Transfer: 2750ms
Result: PARTIAL - transfer is slow (not DNS/connect)
```
**Root Cause**: External fraud check API slow (blocking checkout)
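Tracking the loop as data keeps rejected hypotheses visible in the postmortem instead of disappearing from the narrative. A minimal TypeScript sketch mirroring the checkout case above; the verdict values and shape are illustrative:
```typescript
type Verdict = "accepted" | "rejected" | "partial" | "untested";

interface Hypothesis {
  statement: string;
  test: string;   // how it was checked
  data: string;   // what the measurement showed
  verdict: Verdict;
}

const hypotheses: Hypothesis[] = [
  { statement: "Database slow queries", test: "Check slow query log",
    data: "All queries < 50ms", verdict: "rejected" },
  { statement: "External API slow", test: "Distributed tracing (Jaeger)",
    data: "Fraud check API: 2750ms (91% of total time)", verdict: "accepted" },
  { statement: "Network latency", test: "curl timing breakdown",
    data: "DNS 50ms, connect 30ms, transfer 2750ms", verdict: "partial" },
];

// Keep iterating until at least one hypothesis is accepted with supporting data.
const rootCause = hypotheses.find((h) => h.verdict === "accepted");
console.log(rootCause ? `Root cause candidate: ${rootCause.statement}` : "Keep investigating");
```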
---
## Blameless RCA Principles
### Core Tenets
1. **Focus on Systems, Not People**
- ❌ "Engineer made a mistake"
- ✅ "Process didn't catch config error"
2. **Assume Good Intent**
- Everyone did the best they could with information available
- Blame discourages honesty and learning
3. **Multiple Contributing Factors**
- Never a single cause
- Usually 3-5 factors contribute
4. **Actionable Improvements**
- Fix the system, not the person
- Concrete action items with owners
### Example (Blameless vs Blame)
**Blameful (BAD)**:
```
Root Cause: Engineer Jane deployed code without testing
Action Item: Remind Jane to test before deploying
```
**Blameless (GOOD)**:
```
Root Cause: Deployment process allowed untested code to reach production
Contributing Factors:
1. No automated tests in CI/CD
2. Manual deployment process (prone to human error)
3. No staging environment validation
Action Items:
1. Add automated tests to CI/CD (Owner: Mike, Due: Dec 20)
2. Require staging deployment + validation before production (Owner: Sarah, Due: Dec 22)
3. Implement deployment checklist (Owner: Alex, Due: Dec 18)
```
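Action items are easiest to follow up when owners and due dates can be checked mechanically. A small TypeScript sketch for tracking them; the fields, the assumed year on the due dates, and the overdue check are illustrative:
```typescript
interface ActionItem {
  description: string;
  owner: string;
  dueDate: Date;
  done: boolean;
}

// Due dates from the example above, with a year assumed for illustration.
const actionItems: ActionItem[] = [
  { description: "Add automated tests to CI/CD", owner: "Mike", dueDate: new Date("2024-12-20"), done: false },
  { description: "Require staging validation before production", owner: "Sarah", dueDate: new Date("2024-12-22"), done: false },
  { description: "Implement deployment checklist", owner: "Alex", dueDate: new Date("2024-12-18"), done: false },
];

// Surface anything past due so postmortem follow-ups don't silently lapse.
function overdue(items: ActionItem[], now: Date = new Date()): ActionItem[] {
  return items.filter((i) => !i.done && i.dueDate < now);
}

console.log(overdue(actionItems).map((i) => `${i.description} (owner: ${i.owner})`));
```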
---
## Related Documentation
- **Examples**: [Examples Index](../examples/INDEX.md) - RCA examples from real incidents
- **Severity Matrix**: [incident-severity-matrix.md](incident-severity-matrix.md) - When to perform RCA
- **Templates**: [Postmortem Template](../templates/postmortem-template.md) - Structured RCA format
---
Return to [reference index](INDEX.md)