Initial commit

Zhongwei Li
2025-11-29 18:29:18 +08:00
commit 46dfc30864
25 changed files with 4683 additions and 0 deletions

@@ -0,0 +1,82 @@
# Incident Response Reference
Quick reference guides for incident severity classification, communication templates, root cause analysis techniques, and runbook structure.
## Available References
### Incident Severity Matrix
**File**: [incident-severity-matrix.md](incident-severity-matrix.md)
Complete severity classification guide with examples:
- **SEV1 (Critical)**: Complete outage, all customers affected, revenue stopped
- **SEV2 (Major)**: Partial degradation, significant customer impact
- **SEV3 (Minor)**: Isolated issues, workarounds available
- **SEV4 (Cosmetic)**: UI issues, no functional impact
**Use when**: Classifying incident severity, determining escalation path
---
### Communication Templates
**File**: [communication-templates.md](communication-templates.md)
Ready-to-use templates for all incident communications:
- Internal updates (Slack, email)
- External communications (status page, customer emails)
- Executive briefings
- Post-incident summaries
- Postmortem distribution
**Use when**: Communicating during or after incidents
---
### Root Cause Analysis Techniques
**File**: [rca-techniques.md](rca-techniques.md)
Comprehensive RCA methodology guide:
- **5 Whys**: Iterative questioning to find root cause
- **Fishbone Diagrams**: Category-based analysis
- **Timeline Reconstruction**: Event sequencing and correlation
- **Contributing Factors Analysis**: Immediate vs underlying vs latent causes
- **Hypothesis Testing**: Data-driven validation
**Use when**: Conducting root cause analysis, writing postmortems
---
### Runbook Structure Guide
**File**: [runbook-structure-guide.md](runbook-structure-guide.md)
Best practices for writing effective runbooks:
- Standard runbook template
- Diagnostic procedures
- Remediation steps
- Escalation paths
- Success criteria
- Runbook maintenance
**Use when**: Creating or updating runbooks, automating diagnostics
---
## Quick Links
**By Use Case**:
- Need to classify incident severity → [Severity Matrix](incident-severity-matrix.md)
- Need to communicate during incident → [Communication Templates](communication-templates.md)
- Need to find root cause → [RCA Techniques](rca-techniques.md)
- Need to write a runbook → [Runbook Structure Guide](runbook-structure-guide.md)
**Related Documentation**:
- **Examples**: [Examples Index](../examples/INDEX.md) - Real-world incident examples
- **Templates**: [Templates Index](../templates/INDEX.md) - Incident timeline, postmortem templates
- **Main Agent**: [incident-responder.md](../incident-responder.md) - Incident responder agent
---
Return to [main agent](../incident-responder.md)

@@ -0,0 +1,213 @@
# Communication Templates
Copy-paste templates for incident communications across all channels and severity levels.
## Internal Communications
### SEV1 Incident Start
```
🚨 SEV1 INCIDENT DECLARED 🚨
Incident ID: INC-YYYY-MM-DD-XXX
Impact: [Brief description - e.g., "100% outage, database down"]
Affected Services: [List services]
Customer Impact: [All users / X% of users / Specific feature unavailable]
Roles:
- Incident Commander: @[name]
- Technical Lead: @[name]
- Communications Lead: @[name]
War Room:
- Slack: #incident-XXX
- Zoom: [link]
Status: Investigating
Next Update: [Time] (15 minutes or on status change)
Runbook: [Link if available]
```
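If SEV1 announcements are posted by tooling rather than typed by hand, the placeholders can be filled from a small structure. A minimal TypeScript sketch using the official `@slack/web-api` client; the token, channel name, and field names are assumptions to adapt to your workspace:
```typescript
import { WebClient } from "@slack/web-api";

// Token and channel name are placeholders; fill in your workspace values.
const slack = new WebClient(process.env.SLACK_BOT_TOKEN);

interface Sev1Announcement {
  incidentId: string;      // INC-YYYY-MM-DD-XXX
  impact: string;
  affectedServices: string[];
  customerImpact: string;
  commander: string;
  techLead: string;
  commsLead: string;
  warRoom: string;         // e.g. "#incident-XXX"
  zoomLink: string;
  nextUpdate: string;
}

// Render the SEV1 template above with the placeholders filled in.
function renderSev1(a: Sev1Announcement): string {
  return [
    "🚨 SEV1 INCIDENT DECLARED 🚨",
    `Incident ID: ${a.incidentId}`,
    `Impact: ${a.impact}`,
    `Affected Services: ${a.affectedServices.join(", ")}`,
    `Customer Impact: ${a.customerImpact}`,
    "Roles:",
    `- Incident Commander: @${a.commander}`,
    `- Technical Lead: @${a.techLead}`,
    `- Communications Lead: @${a.commsLead}`,
    "War Room:",
    `- Slack: ${a.warRoom}`,
    `- Zoom: ${a.zoomLink}`,
    "Status: Investigating",
    `Next Update: ${a.nextUpdate} (15 minutes or on status change)`,
  ].join("\n");
}

// Post the announcement to a broadcast channel (channel name is a placeholder).
async function announceSev1(a: Sev1Announcement): Promise<void> {
  await slack.chat.postMessage({ channel: "#incidents", text: renderSev1(a) });
}
```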
### SEV2 Incident Start
```
⚠️ SEV2 INCIDENT
Incident ID: INC-YYYY-MM-DD-XXX
Impact: [Brief description - e.g., "API degraded, 30% users affected"]
Symptoms: [What users are experiencing]
IC: @[name]
Status: Investigating [suspected cause]
Next Update: 30 minutes
```
### Incident Update
```
📊 UPDATE #[N] (T+[X] minutes)
Root Cause: [What we found OR "Still investigating"]
Mitigation: [What we're doing]
Impact: [Current status - improving/stable/worsening]
ETA: [Expected resolution time OR "Unknown"]
Next Update: [Time]
```
### Incident Resolved
```
🎉 INCIDENT RESOLVED (T+[X] minutes/hours)
Final Status: All services operational
Root Cause: [Brief summary]
Fix Applied: [What was done]
Monitoring: Ongoing for [duration]
Postmortem: Scheduled for [date/time]
Timeline: [Link to detailed timeline]
```
## External Communications
### Status Page - Investigating
```
🔴 INVESTIGATING - [Brief Title]
We are investigating reports of [issue description].
Our team is actively working to identify the cause.
Affected: [Service names]
Started: [HH:MM UTC]
Next Update: [HH:MM UTC]
```
### Status Page - Identified
```
🟡 IDENTIFIED - [Issue Title]
We have identified the issue as [brief cause].
Our team is implementing a fix.
Affected: [Service names]
Started: [HH:MM UTC]
Identified: [HH:MM UTC]
Est. Resolution: [HH:MM UTC]
```
### Status Page - Monitoring
```
🟢 MONITORING - [Issue Title] Resolved
The issue has been resolved and services are operating normally.
We are monitoring to ensure stability.
Started: [HH:MM UTC]
Resolved: [HH:MM UTC]
Duration: [X] minutes
```
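Status page updates can be pushed programmatically as well. A hedged TypeScript sketch against a Statuspage-style REST API; the endpoint, environment variables, and body shape are assumptions, so check your provider's documentation before relying on it:
```typescript
// Sketch only: the URL, auth header, env vars, and body shape mimic a
// Statuspage-style API but are assumptions; adjust to your provider.
type IncidentStatus = "investigating" | "identified" | "monitoring" | "resolved";

interface StatusUpdate {
  title: string;
  status: IncidentStatus;
  body: string; // e.g. the "Investigating" template text above
}

async function postStatusUpdate(update: StatusUpdate): Promise<void> {
  const pageId = process.env.STATUSPAGE_PAGE_ID;
  const apiKey = process.env.STATUSPAGE_API_KEY;
  const res = await fetch(`https://api.statuspage.io/v1/pages/${pageId}/incidents`, {
    method: "POST",
    headers: { Authorization: `OAuth ${apiKey}`, "Content-Type": "application/json" },
    body: JSON.stringify({
      incident: { name: update.title, status: update.status, body: update.body },
    }),
  });
  if (!res.ok) {
    throw new Error(`Status page update failed: ${res.status} ${await res.text()}`);
  }
}
```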
### Customer Email - SEV1 Postmortem
```
Subject: Service Disruption - [Date] Postmortem
Dear [Product Name] Customers,
On [Date] at [Time UTC], we experienced a service disruption that affected [all users / X% of users] for approximately [duration].
What Happened:
[2-3 sentence summary of the incident]
Impact:
- Duration: [X] minutes
- Affected Users: [percentage or description]
- Services Impacted: [list]
Root Cause:
[1-2 sentence explanation of root cause]
Resolution:
[1-2 sentences on how we fixed it]
Prevention:
We have implemented the following measures to prevent recurrence:
1. [Measure 1]
2. [Measure 2]
3. [Measure 3]
We sincerely apologize for the inconvenience and appreciate your patience.
[Team Name]
[Company Name]
```
## Executive Briefings
### Initial Notification (SEV1 only)
```
Subject: SEV1 Incident - [Brief Title]
Summary:
- Incident ID: INC-YYYY-MM-DD-XXX
- Started: [HH:MM UTC]
- Impact: [All users affected / X% affected / revenue stopped]
- Status: [Investigating / Mitigation in progress]
Current Situation:
[2-3 sentences explaining what's happening]
Response:
- IC: [Name]
- Team: [X] engineers actively working
- ETA: [Time if known, "Unknown" if not]
Business Impact:
- Revenue: [Estimated $ per hour OR "Minimal"]
- Customers: [Number affected]
- SLA: [Yes/No breach, details]
Next Update: [Time]
```
### Resolution Summary (Executive)
```
Subject: SEV1 Resolved - [Brief Title]
Incident INC-YYYY-MM-DD-XXX has been resolved after [duration].
Timeline:
- Started: [HH:MM UTC]
- Identified: [HH:MM UTC]
- Resolved: [HH:MM UTC]
- Total Duration: [X] minutes
Impact:
- Customers Affected: [Number / percentage]
- Revenue Loss: [$X estimated]
- SLA Breach: [Yes/No]
Root Cause:
[1-2 sentences]
Resolution:
[1-2 sentences on fix]
Prevention:
[2-3 key action items with owners and dates]
Postmortem: Scheduled for [date/time]
[IC Name]
```
## Related Documentation
- **Severity Matrix**: [incident-severity-matrix.md](incident-severity-matrix.md) - When to use each template
- **Examples**: [Examples Index](../examples/INDEX.md) - Real communications from incidents
- **Templates**: [Templates Index](../templates/INDEX.md) - Full incident timeline and postmortem templates
---
Return to [reference index](INDEX.md)

@@ -0,0 +1,260 @@
# Root Cause Analysis Techniques
Comprehensive methods for identifying root causes through 5 Whys, Fishbone Diagrams, Timeline Reconstruction, and data-driven hypothesis testing.
## 5 Whys Technique
### Method
Ask "Why?" iteratively until you reach the root cause (typically 5 levels deep, but can be 3-7).
**Rules**:
1. Start with the problem statement
2. Ask "Why did this happen?"
3. Answer based on facts/data (not assumptions)
4. Repeat for each answer until root cause found
5. Root cause = **systemic issue** (not human error)
### Example
**Problem**: Database went down
```
Why 1: Why did the database go down?
→ Because the primary database ran out of disk space
Why 2: Why did it run out of disk space?
→ Because PostgreSQL logs filled the entire disk (450GB)
Why 3: Why did logs grow to 450GB?
→ Because log rotation was disabled
Why 4: Why was log rotation disabled?
→ Because the `log_truncate_on_rotation` config was set to `off` during a migration
Why 5: Why was this config change not caught?
→ Because configuration changes are not code-reviewed and there was no disk monitoring alert
ROOT CAUSE: Missing disk monitoring alerts + configuration change without code review
```
**Action Items**:
- Add disk usage monitoring (>90% alert)
- Require code review for all config changes
- Enable log rotation on all databases
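Capturing the chain as data rather than free text keeps 5 Whys write-ups consistent across postmortems. A minimal TypeScript sketch; the types and field names are illustrative, not an established schema:
```typescript
// Each step records the question, a fact-based answer, and the evidence behind it.
interface WhyStep {
  question: string;
  answer: string;
  evidence: string; // link to logs, dashboards, or config diffs
}

interface FiveWhys {
  problem: string;
  steps: WhyStep[];    // typically 3-7 entries
  rootCause: string;   // must be systemic, not a person
  actionItems: string[];
}

// Render the chain in the same "Why N / →" format used above.
function renderFiveWhys(rca: FiveWhys): string {
  const lines = [`Problem: ${rca.problem}`];
  rca.steps.forEach((step, i) => {
    lines.push(`Why ${i + 1}: ${step.question}`);
    lines.push(`→ ${step.answer} (evidence: ${step.evidence})`);
  });
  lines.push(`ROOT CAUSE: ${rca.rootCause}`);
  return lines.join("\n");
}
```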
---
## Fishbone Diagram (Ishikawa)
### Method
Group contributing factors into major categories to identify the root cause systematically.
**Categories** (6M's):
1. **Method** (Process)
2. **Machine** (Technology)
3. **Material** (Inputs/Data)
4. **Measurement** (Monitoring)
5. **Mother Nature** (Environment)
6. **Manpower** (People/Skills)
### Example
**Problem**: API performance degraded (p95: 200ms → 2000ms)
```
API Performance Degraded
├─ METHOD (Process)
│  ├─ No memory profiling in code review
│  └─ No long-running load tests (only 5 min)
├─ MACHINE (Technology)
│  ├─ EventEmitter listeners leak (not removed)
│  └─ Node.js v14 (old GC)
├─ MATERIAL (Data)
│  ├─ Large dataset processing (100K orders)
│  └─ High traffic spike (2x, 1h → 2h)
└─ MEASUREMENT (Monitoring)
   ├─ No heap snapshots in CI/CD
   └─ No gradual-growth alerts
```
**Root Causes Identified**:
- **Machine**: EventEmitter leak (technical)
- **Measurement**: No heap monitoring (monitoring gap)
- **Method**: No memory profiling in code review (process gap)
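The same factors can be collected per category before (or instead of) drawing the diagram. A small TypeScript sketch using the 6M categories; the structure is illustrative and the example entries are taken from the incident above:
```typescript
type FishboneCategory =
  | "Method" | "Machine" | "Material"
  | "Measurement" | "Mother Nature" | "Manpower";

// Contributing factors grouped by 6M category for one problem statement.
interface Fishbone {
  problem: string;
  factors: Partial<Record<FishboneCategory, string[]>>;
}

const apiLatency: Fishbone = {
  problem: "API performance degraded (p95: 200ms → 2000ms)",
  factors: {
    Method: ["No memory profiling in code review", "Load tests only run for 5 minutes"],
    Machine: ["EventEmitter listeners never removed", "Node.js v14 (old GC)"],
    Material: ["Large dataset processing (100K orders)", "High traffic spike (2x)"],
    Measurement: ["No heap snapshots in CI/CD", "No gradual-growth alerts"],
  },
};

// Print each branch of the fishbone as an indented list.
for (const [category, items] of Object.entries(apiLatency.factors)) {
  console.log(category);
  items?.forEach((item) => console.log(`  - ${item}`));
}
```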
---
## Timeline Reconstruction
### Method
Build a chronological timeline of events to identify causation and correlation.
**Steps**:
1. Gather logs from all systems (with timestamps)
2. Normalize to UTC
3. Plot events chronologically
4. Identify cause-and-effect relationships
5. Find the triggering event
### Example
```
12:00:00 - Normal operation (p95: 200ms, memory: 400MB)
12:15:00 - Code deployment (v2.15.4)
12:30:00 - Memory: 720MB (+80% in 15min) ⚠️
12:45:00 - Memory: 1.2GB (+67% in 15min) ⚠️
13:00:00 - Memory: 1.8GB (+50% in 15min) 🚨
13:00:00 - p95 latency: 800ms (4x slower)
13:15:00 - Memory: 2.3GB (limit reached)
13:15:00 - Workers start OOMing
13:20:00 - p95 latency: 2000ms (10x slower)
13:30:00 - Alert fired: High latency
14:00:00 - Alert fired: High memory
CORRELATION:
- Deployment at 12:15 → Memory growth starts at 12:30
- Memory growth → Latency increase (correlated)
- TRIGGER: Code deployment v2.15.4
ACTION: Review code changes in v2.15.4
```
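Merging and ordering events from several systems is the tedious part of timeline reconstruction. A minimal TypeScript sketch that parses timestamps, sorts chronologically, and prints everything in UTC; the sources, dates, and field names are placeholders:
```typescript
interface TimelineEvent {
  timestamp: Date; // stored as an absolute instant; printed in UTC
  source: string;  // e.g. "deploys", "metrics", "alerts"
  message: string;
}

// Parse ISO-8601 timestamps (any offset) from several sources into one sorted list.
function buildTimeline(raw: { source: string; time: string; message: string }[]): TimelineEvent[] {
  return raw
    .map((e) => ({ timestamp: new Date(e.time), source: e.source, message: e.message }))
    .sort((a, b) => a.timestamp.getTime() - b.timestamp.getTime());
}

const timeline = buildTimeline([
  { source: "deploys", time: "2024-12-01T12:15:00+00:00", message: "Code deployment (v2.15.4)" },
  { source: "metrics", time: "2024-12-01T13:00:00+00:00", message: "p95 latency 800ms (4x slower)" },
  { source: "alerts", time: "2024-12-01T13:30:00+00:00", message: "Alert fired: High latency" },
]);

// Print everything in UTC so clocks from different systems line up.
for (const e of timeline) {
  console.log(`${e.timestamp.toISOString()}  [${e.source}]  ${e.message}`);
}
```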
---
## Contributing Factors Analysis
### Levels of Causation
**Immediate Cause** (What happened):
- Direct technical failure
- Example: EventEmitter listeners not removed
**Underlying Conditions** (Why it was possible):
- Missing safeguards
- Example: No memory profiling in code review
**Latent Failures** (Systemic weaknesses):
- Organizational/process gaps
- Example: No developer training on memory management
### Example
**Incident**: Memory leak in production
```
Immediate Cause:
└─ Code: EventEmitter .on() used without .removeListener()
Underlying Conditions:
├─ No code review caught the issue
├─ No memory profiling in CI/CD
└─ Short load tests (5min) didn't reveal gradual leak
Latent Failures:
├─ Team lacks memory management training
├─ No documentation on EventEmitter best practices
└─ Culture of "ship fast, fix later"
```
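Recording all three levels explicitly keeps a postmortem from stopping at the immediate cause. A small TypeScript sketch of that structure with a simple completeness check; the field names are illustrative and the entries mirror the example above:
```typescript
// Three levels of causation captured explicitly.
interface ContributingFactors {
  immediateCause: string;         // what happened
  underlyingConditions: string[]; // why it was possible
  latentFailures: string[];       // systemic weaknesses
}

const memoryLeak: ContributingFactors = {
  immediateCause: "EventEmitter .on() used without .removeListener()",
  underlyingConditions: [
    "No code review caught the issue",
    "No memory profiling in CI/CD",
    "Short load tests (5min) didn't reveal gradual leak",
  ],
  latentFailures: [
    "Team lacks memory management training",
    "Culture of 'ship fast, fix later'",
  ],
};

// An RCA that stops at the immediate cause is incomplete.
function isComplete(f: ContributingFactors): boolean {
  return f.underlyingConditions.length > 0 && f.latentFailures.length > 0;
}

console.log(isComplete(memoryLeak)); // true
```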
---
## Hypothesis Testing
### Method
Generate hypotheses, test with data, validate or reject.
**Process**:
1. Observe symptoms
2. Generate hypotheses (educated guesses)
3. Design experiments to test each hypothesis
4. Collect data
5. Accept or reject hypothesis
6. Repeat until root cause found
### Example
**Symptom**: Checkout API slow (p95: 3000ms)
**Hypothesis 1**: Database slow queries
```
Test: Check slow query log
Data: All queries < 50ms ✅
Result: REJECTED - database is fast
```
**Hypothesis 2**: External API slow
```
Test: Distributed tracing (Jaeger)
Data: Fraud check API: 2750ms (91% of total time) 🚨
Result: ACCEPTED - external API is bottleneck
```
**Hypothesis 3**: Network latency
```
Test: curl timing breakdown
Data: DNS: 50ms, Connect: 30ms, Transfer: 2750ms
Result: PARTIAL - transfer is slow (not DNS/connect)
```
**Root Cause**: External fraud check API slow (blocking checkout)
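Tracking the loop as data keeps rejected hypotheses visible in the postmortem instead of disappearing from the narrative. A minimal TypeScript sketch mirroring the checkout case above; the verdict values and shape are illustrative:
```typescript
type Verdict = "accepted" | "rejected" | "partial" | "untested";

interface Hypothesis {
  statement: string;
  test: string;   // how it was checked
  data: string;   // what the measurement showed
  verdict: Verdict;
}

const hypotheses: Hypothesis[] = [
  { statement: "Database slow queries", test: "Check slow query log",
    data: "All queries < 50ms", verdict: "rejected" },
  { statement: "External API slow", test: "Distributed tracing (Jaeger)",
    data: "Fraud check API: 2750ms (91% of total time)", verdict: "accepted" },
  { statement: "Network latency", test: "curl timing breakdown",
    data: "DNS 50ms, connect 30ms, transfer 2750ms", verdict: "partial" },
];

// Keep iterating until at least one hypothesis is accepted with supporting data.
const rootCause = hypotheses.find((h) => h.verdict === "accepted");
console.log(rootCause ? `Root cause candidate: ${rootCause.statement}` : "Keep investigating");
```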
---
## Blameless RCA Principles
### Core Tenets
1. **Focus on Systems, Not People**
- ❌ "Engineer made a mistake"
- ✅ "Process didn't catch config error"
2. **Assume Good Intent**
- Everyone did the best they could with information available
- Blame discourages honesty and learning
3. **Multiple Contributing Factors**
- Never a single cause
- Usually 3-5 factors contribute
4. **Actionable Improvements**
- Fix the system, not the person
- Concrete action items with owners
### Example (Blameless vs Blame)
**Blameful (BAD)**:
```
Root Cause: Engineer Jane deployed code without testing
Action Item: Remind Jane to test before deploying
```
**Blameless (GOOD)**:
```
Root Cause: Deployment process allowed untested code to reach production
Contributing Factors:
1. No automated tests in CI/CD
2. Manual deployment process (prone to human error)
3. No staging environment validation
Action Items:
1. Add automated tests to CI/CD (Owner: Mike, Due: Dec 20)
2. Require staging deployment + validation before production (Owner: Sarah, Due: Dec 22)
3. Implement deployment checklist (Owner: Alex, Due: Dec 18)
```
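Action items are easiest to follow up when owners and due dates can be checked mechanically. A small TypeScript sketch for tracking them; the fields, the assumed year on the due dates, and the overdue check are illustrative:
```typescript
interface ActionItem {
  description: string;
  owner: string;
  dueDate: Date;
  done: boolean;
}

// Due dates from the example above, with a year assumed for illustration.
const actionItems: ActionItem[] = [
  { description: "Add automated tests to CI/CD", owner: "Mike", dueDate: new Date("2024-12-20"), done: false },
  { description: "Require staging validation before production", owner: "Sarah", dueDate: new Date("2024-12-22"), done: false },
  { description: "Implement deployment checklist", owner: "Alex", dueDate: new Date("2024-12-18"), done: false },
];

// Surface anything past due so postmortem follow-ups don't silently lapse.
function overdue(items: ActionItem[], now: Date = new Date()): ActionItem[] {
  return items.filter((i) => !i.done && i.dueDate < now);
}

console.log(overdue(actionItems).map((i) => `${i.description} (owner: ${i.owner})`));
```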
---
## Related Documentation
- **Examples**: [Examples Index](../examples/INDEX.md) - RCA examples from real incidents
- **Severity Matrix**: [incident-severity-matrix.md](incident-severity-matrix.md) - When to perform RCA
- **Templates**: [Postmortem Template](../templates/postmortem-template.md) - Structured RCA format
---
Return to [reference index](INDEX.md)