Initial commit

Author: Zhongwei Li
Date: 2025-11-29 18:29:18 +08:00
Commit: 46dfc30864
25 changed files with 4683 additions and 0 deletions

@@ -0,0 +1,26 @@
# Incident Response Skill
Handle production incidents with SRE best practices including detection, investigation, mitigation, recovery, and postmortems.
## Description
Production incident response following SRE methodologies with incident timeline tracking, RCA documentation, and runbook updates.
## What's Included
- **Examples**: SEV1 incident handling, postmortem templates
- **Reference**: SRE best practices, incident severity levels
- **Templates**: Incident reports, RCA documents, runbook updates
## Use When
- Production outages
- SEV1/SEV2 incidents
- Postmortem creation
- Runbook updates
## Related Agents
- `incident-responder`
**Skill Version**: 1.0

@@ -0,0 +1,122 @@
# Incident Response Examples
Real-world production incident examples demonstrating systematic incident response, root cause analysis, mitigation strategies, and blameless postmortems.
## Available Examples
### SEV1: Critical Database Outage
**File**: [sev1-critical-database-outage.md](sev1-critical-database-outage.md)
Complete database failure causing total service outage:
- **Incident**: PostgreSQL primary failure, 100% error rate
- **Impact**: All services down, $50K revenue loss/hour
- **Root Cause**: Disk full on primary, replication lag spike
- **Resolution**: Promoted replica, cleared disk space, restored service
- **MTTR**: 45 minutes (detection → full recovery)
- **Prevention**: Disk monitoring alerts, automatic disk cleanup, replica promotion automation
**Key Learnings**:
- Importance of replica promotion runbooks
- Disk space monitoring thresholds
- Automated failover procedures
---
### SEV2: API Performance Degradation
**File**: [sev2-api-performance-degradation.md](sev2-api-performance-degradation.md)
Gradual performance degradation due to memory leak:
- **Incident**: API p95 latency 200ms → 5000ms over 2 hours
- **Impact**: 30% of users affected, slow page loads
- **Root Cause**: Memory leak in worker process, OOM killing workers
- **Resolution**: Identified leak with heap snapshot, deployed fix, restarted workers
- **MTTR**: 3 hours (detection → permanent fix)
- **Prevention**: Memory profiling in CI/CD, heap snapshot automation, worker restart automation
**Key Learnings**:
- Early detection through gradual alerts
- Heap snapshot analysis for memory leaks
- Temporary mitigation (worker restarts) vs permanent fix (see the sketch below)
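The leak followed the classic unremoved-listener pattern. A minimal TypeScript sketch of the anti-pattern and its fix — the event names and handlers are illustrative, not taken from the actual service:

```typescript
import { EventEmitter } from "node:events";

// Hypothetical job queue the API worker subscribes to.
const queue = new EventEmitter();

// LEAK: a listener is attached per request and never removed, so closures
// accumulate until the worker hits its memory limit and is OOM-killed.
function handleRequestLeaky(requestId: string): void {
  queue.on("job:done", (id: string) => {
    if (id === requestId) console.log(`request ${requestId} finished`);
  });
}

// FIX: detach the listener once its request has been handled.
function handleRequestFixed(requestId: string): void {
  const onDone = (id: string): void => {
    if (id !== requestId) return;
    console.log(`request ${requestId} finished`);
    queue.off("job:done", onDone);
  };
  queue.on("job:done", onDone);
}

handleRequestFixed("req-42");
queue.emit("job:done", "req-42");
console.log(queue.listenerCount("job:done")); // 0 — nothing left behind
```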
---
### SEV3: Feature Flag Misconfiguration
**File**: [sev3-feature-flag-misconfiguration.md](sev3-feature-flag-misconfiguration.md)
Feature flag enabled for the wrong audience, causing user confusion:
- **Incident**: Experimental feature shown to 20% of production users
- **Impact**: 200 support tickets, user confusion, no revenue impact
- **Root Cause**: Feature flag percentage set to 20% instead of 0%
- **Resolution**: Disabled flag, sent customer communication, updated flag process
- **MTTR**: 30 minutes (detection → resolution)
- **Prevention**: Feature flag code review, staging validation, gradual rollout process
**Key Learnings**:
- Feature flag validation before production (see the sketch below)
- Importance of clear documentation
- Quick rollback procedures
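A minimal sketch of the kind of CI validation the prevention items point at, assuming a rule that experimental flags must start at 0% rollout in production — the flag shape, flag key, and rule are assumptions, not the team's actual flag system:

```typescript
// Hypothetical flag definition shape; real systems (LaunchDarkly, Unleash, ...) differ.
interface FeatureFlag {
  key: string;
  environment: "staging" | "production";
  experimental: boolean;
  rolloutPercent: number; // 0–100
}

// Reject configs where an experimental flag is enabled for any production users.
function validateFlags(flags: FeatureFlag[]): string[] {
  const errors: string[] = [];
  for (const flag of flags) {
    if (flag.experimental && flag.environment === "production" && flag.rolloutPercent > 0) {
      errors.push(
        `${flag.key}: experimental flags must start at 0% in production (got ${flag.rolloutPercent}%)`
      );
    }
  }
  return errors;
}

// Example: the misconfiguration from this incident would fail the build.
const errors = validateFlags([
  { key: "new-checkout-flow", environment: "production", experimental: true, rolloutPercent: 20 },
]);
if (errors.length > 0) {
  console.error(errors.join("\n"));
  process.exit(1);
}
```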
---
### Distributed Tracing Investigation
**File**: [distributed-tracing-investigation.md](distributed-tracing-investigation.md)
Using Jaeger distributed tracing to find microservice bottleneck:
- **Incident**: Checkout API slow (3s p95), unclear which service
- **Investigation**: Used Jaeger to trace request flow across 7 microservices
- **Root Cause**: Payment service calling external API synchronously (2.8s)
- **Resolution**: Moved external API call to async background job
- **Impact**: p95 latency 3000ms → 150ms (95% reduction)
**Key Learnings**:
- Power of distributed tracing for microservices
- Synchronous external dependencies are dangerous
- Background jobs for non-critical operations (sketched below)
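A hedged before/after sketch of that change — the fraud-check URL and the in-memory queue are placeholders for the real external API and queueing system:

```typescript
// BEFORE: checkout blocks on the external fraud-check API (~2.8s at p95).
async function checkoutBlocking(orderId: string): Promise<void> {
  await fetch(`https://fraud.example.com/check?order=${orderId}`); // sits in the request path
  // ...finish checkout
}

// AFTER: checkout only enqueues the check; a worker calls the external API later.
type Job = { type: "fraud-check"; orderId: string };
const jobQueue: Job[] = []; // stand-in for a real queue (SQS, BullMQ, Cloudflare Queues, ...)

function checkoutFast(orderId: string): void {
  jobQueue.push({ type: "fraud-check", orderId }); // returns immediately
  // ...finish checkout; flag the order later if the background check fails
}

// Background worker drains the queue outside the request path.
async function fraudCheckWorker(): Promise<void> {
  for (const job of jobQueue.splice(0)) {
    await fetch(`https://fraud.example.com/check?order=${job.orderId}`);
  }
}
```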
---
### Cascade Failure Prevention
**File**: [cascade-failure-prevention.md](cascade-failure-prevention.md)
Preventing cascade failure through circuit breakers and bulkheads:
- **Incident**: Auth service down, caused all dependent services to fail
- **Impact**: Complete outage instead of graceful degradation
- **Root Cause**: No circuit breakers, all services retrying auth indefinitely
- **Resolution**: Implemented circuit breakers, bulkhead isolation, fallback logic
- **Prevention**: Circuit breaker pattern, timeout configuration, graceful degradation
**Key Learnings**:
- Circuit breakers prevent cascade failures (see the sketch after this list)
- Bulkhead isolation limits blast radius
- Fallback logic enables graceful degradation
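To make the pattern concrete, a minimal circuit-breaker sketch — the thresholds, auth URL, and fallback session are illustrative assumptions, not the values used in this incident:

```typescript
type State = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: State = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly maxFailures = 5,       // trip after this many consecutive failures
    private readonly resetAfterMs = 30_000  // allow a probe request after this cool-down
  ) {}

  async call<T>(fn: () => Promise<T>, fallback: () => T): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.openedAt < this.resetAfterMs) return fallback(); // fail fast, no retry storm
      this.state = "half-open"; // let a single probe through
    }
    try {
      const result = await fn();
      this.state = "closed";
      this.failures = 0;
      return result;
    } catch {
      this.failures += 1;
      if (this.failures >= this.maxFailures || this.state === "half-open") {
        this.state = "open";
        this.openedAt = Date.now();
      }
      return fallback(); // graceful degradation instead of cascading failure
    }
  }
}

// Usage: wrap calls to the auth service; degrade to an anonymous session when it is down.
const authBreaker = new CircuitBreaker();
const session = await authBreaker.call(
  () => fetch("https://auth.example.com/verify").then(r => r.json()),
  () => ({ userId: null, degraded: true }) // hypothetical fallback
);
console.log(session);
```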
---
## Learning Outcomes
After studying these examples, you will understand:
1. **Incident Classification**: How to assess severity (SEV1-SEV4) based on impact
2. **Incident Command**: Role of IC, communication protocols, timeline management
3. **Root Cause Analysis**: 5 Whys, timeline reconstruction, data-driven investigation
4. **Mitigation Strategies**: Immediate actions, temporary fixes, permanent solutions
5. **Blameless Postmortems**: Focus on systems not people, actionable items, continuous improvement
6. **Communication**: Internal updates, external communications, executive briefings
7. **Prevention**: Monitoring improvements, runbook automation, architectural changes
---
## Related Documentation
- **Reference**: [Reference Index](../reference/INDEX.md) - Severity matrix, communication templates, RCA techniques
- **Templates**: [Templates Index](../templates/INDEX.md) - Incident timeline, postmortem, runbook templates
- **Main Agent**: [incident-responder.md](../incident-responder.md) - Incident responder agent
---
Return to [main agent](../incident-responder.md)

@@ -0,0 +1,82 @@
# Incident Response Reference
Quick reference guides for incident severity classification, communication templates, root cause analysis techniques, and runbook structure.
## Available References
### Incident Severity Matrix
**File**: [incident-severity-matrix.md](incident-severity-matrix.md)
Complete severity classification guide with examples:
- **SEV1 (Critical)**: Complete outage, all customers affected, revenue stopped
- **SEV2 (Major)**: Partial degradation, significant customer impact
- **SEV3 (Minor)**: Isolated issues, workarounds available
- **SEV4 (Cosmetic)**: UI issues, no functional impact
**Use when**: Classifying incident severity, determining escalation path
---
### Communication Templates
**File**: [communication-templates.md](communication-templates.md)
Ready-to-use templates for all incident communications:
- Internal updates (Slack, email)
- External communications (status page, customer emails)
- Executive briefings
- Post-incident summaries
- Postmortem distribution
**Use when**: Communicating during or after incidents
---
### Root Cause Analysis Techniques
**File**: [rca-techniques.md](rca-techniques.md)
Comprehensive RCA methodology guide:
- **5 Whys**: Iterative questioning to find root cause
- **Fishbone Diagrams**: Category-based analysis
- **Timeline Reconstruction**: Event sequencing and correlation
- **Contributing Factors Analysis**: Immediate vs underlying vs latent causes
- **Hypothesis Testing**: Data-driven validation
**Use when**: Conducting root cause analysis, writing postmortems
---
### Runbook Structure Guide
**File**: [runbook-structure-guide.md](runbook-structure-guide.md)
Best practices for writing effective runbooks:
- Standard runbook template
- Diagnostic procedures
- Remediation steps
- Escalation paths
- Success criteria
- Runbook maintenance
**Use when**: Creating or updating runbooks, automating diagnostics
---
## Quick Links
**By Use Case**:
- Need to classify incident severity → [Severity Matrix](incident-severity-matrix.md)
- Need to communicate during incident → [Communication Templates](communication-templates.md)
- Need to find root cause → [RCA Techniques](rca-techniques.md)
- Need to write a runbook → [Runbook Structure Guide](runbook-structure-guide.md)
**Related Documentation**:
- **Examples**: [Examples Index](../examples/INDEX.md) - Real-world incident examples
- **Templates**: [Templates Index](../templates/INDEX.md) - Incident timeline, postmortem templates
- **Main Agent**: [incident-responder.md](../incident-responder.md) - Incident responder agent
---
Return to [main agent](../incident-responder.md)

@@ -0,0 +1,213 @@
# Communication Templates
Copy-paste templates for incident communications across all channels and severity levels.
## Internal Communications
### SEV1 Incident Start
```
🚨 SEV1 INCIDENT DECLARED 🚨
Incident ID: INC-YYYY-MM-DD-XXX
Impact: [Brief description - e.g., "100% outage, database down"]
Affected Services: [List services]
Customer Impact: [All users / X% of users / Specific feature unavailable]
Roles:
- Incident Commander: @[name]
- Technical Lead: @[name]
- Communications Lead: @[name]
War Room:
- Slack: #incident-XXX
- Zoom: [link]
Status: Investigating
Next Update: [Time] (15 minutes or on status change)
Runbook: [Link if available]
```
### SEV2 Incident Start
```
⚠️ SEV2 INCIDENT
Incident ID: INC-YYYY-MM-DD-XXX
Impact: [Brief description - e.g., "API degraded, 30% users affected"]
Symptoms: [What users are experiencing]
IC: @[name]
Status: Investigating [suspected cause]
Next Update: 30 minutes
```
### Incident Update
```
📊 UPDATE #[N] (T+[X] minutes)
Root Cause: [What we found OR "Still investigating"]
Mitigation: [What we're doing]
Impact: [Current status - improving/stable/worsening]
ETA: [Expected resolution time OR "Unknown"]
Next Update: [Time]
```
### Incident Resolved
```
🎉 INCIDENT RESOLVED (T+[X] minutes/hours)
Final Status: All services operational
Root Cause: [Brief summary]
Fix Applied: [What was done]
Monitoring: Ongoing for [duration]
Postmortem: Scheduled for [date/time]
Timeline: [Link to detailed timeline]
```
## External Communications
### Status Page - Investigating
```
🔴 INVESTIGATING - [Brief Title]
We are investigating reports of [issue description].
Our team is actively working to identify the cause.
Affected: [Service names]
Started: [HH:MM UTC]
Next Update: [HH:MM UTC]
```
### Status Page - Identified
```
🟡 IDENTIFIED - [Issue Title]
We have identified the issue as [brief cause].
Our team is implementing a fix.
Affected: [Service names]
Started: [HH:MM UTC]
Identified: [HH:MM UTC]
Est. Resolution: [HH:MM UTC]
```
### Status Page - Monitoring
```
🟢 MONITORING - [Issue Title] Resolved
The issue has been resolved and services are operating normally.
We are monitoring to ensure stability.
Started: [HH:MM UTC]
Resolved: [HH:MM UTC]
Duration: [X] minutes
```
### Customer Email - SEV1 Postmortem
```
Subject: Service Disruption - [Date] Postmortem
Dear [Product Name] Customers,
On [Date] at [Time UTC], we experienced a service disruption that affected [all users / X% of users] for approximately [duration].
What Happened:
[2-3 sentence summary of the incident]
Impact:
- Duration: [X] minutes
- Affected Users: [percentage or description]
- Services Impacted: [list]
Root Cause:
[1-2 sentence explanation of root cause]
Resolution:
[1-2 sentences on how we fixed it]
Prevention:
We have implemented the following measures to prevent recurrence:
1. [Measure 1]
2. [Measure 2]
3. [Measure 3]
We sincerely apologize for the inconvenience and appreciate your patience.
[Team Name]
[Company Name]
```
## Executive Briefings
### Initial Notification (SEV1 only)
```
Subject: SEV1 Incident - [Brief Title]
Summary:
- Incident ID: INC-YYYY-MM-DD-XXX
- Started: [HH:MM UTC]
- Impact: [All users affected / X% affected / revenue stopped]
- Status: [Investigating / Mitigation in progress]
Current Situation:
[2-3 sentences explaining what's happening]
Response:
- IC: [Name]
- Team: [X] engineers actively working
- ETA: [Time if known, "Unknown" if not]
Business Impact:
- Revenue: [Estimated $ per hour OR "Minimal"]
- Customers: [Number affected]
- SLA: [Yes/No breach, details]
Next Update: [Time]
```
### Resolution Summary (Executive)
```
Subject: SEV1 Resolved - [Brief Title]
Incident INC-YYYY-MM-DD-XXX has been resolved after [duration].
Timeline:
- Started: [HH:MM UTC]
- Identified: [HH:MM UTC]
- Resolved: [HH:MM UTC]
- Total Duration: [X] minutes
Impact:
- Customers Affected: [Number / percentage]
- Revenue Loss: [$X estimated]
- SLA Breach: [Yes/No]
Root Cause:
[1-2 sentences]
Resolution:
[1-2 sentences on fix]
Prevention:
[2-3 key action items with owners and dates]
Postmortem: Scheduled for [date/time]
[IC Name]
```
## Related Documentation
- **Severity Matrix**: [incident-severity-matrix.md](incident-severity-matrix.md) - When to use each template
- **Examples**: [Examples Index](../examples/INDEX.md) - Real communications from incidents
- **Templates**: [Templates Index](../templates/INDEX.md) - Full incident timeline and postmortem templates
---
Return to [reference index](INDEX.md)

@@ -0,0 +1,260 @@
# Root Cause Analysis Techniques
Comprehensive methods for identifying root causes through 5 Whys, Fishbone Diagrams, Timeline Reconstruction, and data-driven hypothesis testing.
## 5 Whys Technique
### Method
Ask "Why?" iteratively until you reach the root cause (typically 5 levels deep, but can be 3-7).
**Rules**:
1. Start with the problem statement
2. Ask "Why did this happen?"
3. Answer based on facts/data (not assumptions)
4. Repeat for each answer until root cause found
5. Root cause = **systemic issue** (not human error)
### Example
**Problem**: Database went down
```
Why 1: Why did the database go down?
→ Because the primary database ran out of disk space
Why 2: Why did it run out of disk space?
→ Because PostgreSQL logs filled the entire disk (450GB)
Why 3: Why did logs grow to 450GB?
→ Because log rotation was disabled
Why 4: Why was log rotation disabled?
→ Because `log_rotation_age` and `log_rotation_size` were both set to `0` (disabling automatic rotation) during a migration
Why 5: Why was this config change not caught?
→ Because configuration changes are not code-reviewed and there was no disk monitoring alert
ROOT CAUSE: Missing disk monitoring alerts + configuration change without code review
```
**Action Items**:
- Add disk usage monitoring (>90% alert; see the sketch below)
- Require code review for all config changes
- Enable log rotation on all databases
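As a sketch of the first action item, a small check that could run on a schedule and alert when usage crosses a threshold — the mount path, 90% threshold, and exit-code alerting are assumptions (requires Node 18.15+ for `statfsSync`):

```typescript
import { statfsSync } from "node:fs";

const THRESHOLD = 0.9; // alert at >90% usage (assumed threshold)
const MOUNT = "/var/lib/postgresql"; // hypothetical data directory

// Fraction of the filesystem in use (bavail = blocks available to non-root users).
function diskUsage(path: string): number {
  const s = statfsSync(path);
  return 1 - s.bavail / s.blocks;
}

const usage = diskUsage(MOUNT);
if (usage > THRESHOLD) {
  // Stand-in for a real alert (PagerDuty event, Prometheus alert, Slack webhook, ...).
  console.error(`DISK ALERT: ${MOUNT} at ${(usage * 100).toFixed(1)}% (threshold ${THRESHOLD * 100}%)`);
  process.exit(2);
}
console.log(`${MOUNT} at ${(usage * 100).toFixed(1)}% — OK`);
```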
---
## Fishbone Diagram (Ishikawa)
### Method
Categorize contributing factors into major categories to identify root cause systematically.
**Categories** (6M's):
1. **Method** (Process)
2. **Machine** (Technology)
3. **Material** (Inputs/Data)
4. **Measurement** (Monitoring)
5. **Mother Nature** (Environment)
6. **Manpower** (People/Skills)
### Example
**Problem**: API performance degraded (p95: 200ms → 2000ms)
```
API Performance Degraded
├─ METHOD (Process)
│   ├─ No memory profiling in code review
│   └─ No long-running load tests (only 5 min)
├─ MACHINE (Technology)
│   ├─ EventEmitter listener leak (not removed)
│   └─ Node.js v14 (old GC)
├─ MATERIAL (Data)
│   ├─ Large dataset processing (100K orders)
│   └─ High traffic spike (2x, 1h → 2h)
└─ MEASUREMENT (Monitoring)
    ├─ No heap snapshots in CI/CD
    └─ No gradual alerts
```
**Root Causes Identified**:
- **Machine**: EventEmitter leak (technical)
- **Measurement**: No heap monitoring (monitoring gap)
- **Method**: No memory profiling in code review (process gap)
---
## Timeline Reconstruction
### Method
Build chronological timeline of events to identify causation and correlation.
**Steps**:
1. Gather logs from all systems (with timestamps)
2. Normalize to UTC
3. Plot events chronologically
4. Identify cause-and-effect relationships
5. Find the triggering event
### Example
```
12:00:00 - Normal operation (p95: 200ms, memory: 400MB)
12:15:00 - Code deployment (v2.15.4)
12:30:00 - Memory: 720MB (+80% in 15min) ⚠️
12:45:00 - Memory: 1.2GB (+67% in 15min) ⚠️
13:00:00 - Memory: 1.8GB (+50% in 15min) 🚨
13:00:00 - p95 latency: 800ms (4x slower)
13:15:00 - Memory: 2.3GB (limit reached)
13:15:00 - Workers start OOMing
13:20:00 - p95 latency: 2000ms (10x slower)
13:30:00 - Alert fired: High latency
14:00:00 - Alert fired: High memory
CORRELATION:
- Deployment at 12:15 → Memory growth starts at 12:30
- Memory growth → Latency increase (correlated)
- TRIGGER: Code deployment v2.15.4
ACTION: Review code changes in v2.15.4
```
---
## Contributing Factors Analysis
### Levels of Causation
**Immediate Cause** (What happened):
- Direct technical failure
- Example: EventEmitter listeners not removed
**Underlying Conditions** (Why it was possible):
- Missing safeguards
- Example: No memory profiling in code review
**Latent Failures** (Systemic weaknesses):
- Organizational/process gaps
- Example: No developer training on memory management
### Example
**Incident**: Memory leak in production
```
Immediate Cause:
└─ Code: EventEmitter .on() used without .removeListener()
Underlying Conditions:
├─ No code review caught the issue
├─ No memory profiling in CI/CD
└─ Short load tests (5min) didn't reveal gradual leak
Latent Failures:
├─ Team lacks memory management training
├─ No documentation on EventEmitter best practices
└─ Culture of "ship fast, fix later"
```
---
## Hypothesis Testing
### Method
Generate hypotheses, test with data, validate or reject.
**Process**:
1. Observe symptoms
2. Generate hypotheses (educated guesses)
3. Design experiments to test each hypothesis
4. Collect data
5. Accept or reject hypothesis
6. Repeat until root cause found
### Example
**Symptom**: Checkout API slow (p95: 3000ms)
**Hypothesis 1**: Database slow queries
```
Test: Check slow query log
Data: All queries < 50ms ✅
Result: REJECTED - database is fast
```
**Hypothesis 2**: External API slow
```
Test: Distributed tracing (Jaeger)
Data: Fraud check API: 2750ms (91% of total time) 🚨
Result: ACCEPTED - external API is bottleneck
```
**Hypothesis 3**: Network latency
```
Test: curl timing breakdown
Data: DNS: 50ms, Connect: 30ms, Transfer: 2750ms
Result: PARTIAL - transfer is slow (not DNS/connect)
```
**Root Cause**: External fraud check API slow (blocking checkout)
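Hypotheses 2 and 3 can also be confirmed in code rather than with Jaeger or `curl`; a rough sketch that times the suspect dependency directly — the fraud-check URL is a placeholder:

```typescript
// Time an external dependency to confirm or reject the "external API is slow" hypothesis.
async function timeExternalCall(url: string, runs = 5): Promise<void> {
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await fetch(url);
    samples.push(performance.now() - start);
  }
  samples.sort((a, b) => a - b);
  // Rough p95 over a handful of samples (effectively the slowest run here).
  const p95 = samples[Math.min(samples.length - 1, Math.floor(samples.length * 0.95))];
  console.log(`${url}: min=${samples[0].toFixed(0)}ms p95=${p95.toFixed(0)}ms`);
}

// If this reports ~2750ms while DB queries stay <50ms, Hypothesis 2 is accepted.
await timeExternalCall("https://fraud.example.com/check"); // hypothetical endpoint
```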
---
## Blameless RCA Principles
### Core Tenets
1. **Focus on Systems, Not People**
   - ❌ "Engineer made a mistake"
   - ✅ "Process didn't catch config error"
2. **Assume Good Intent**
   - Everyone did the best they could with the information available
   - Blame discourages honesty and learning
3. **Multiple Contributing Factors**
   - Never a single cause
   - Usually 3-5 factors contribute
4. **Actionable Improvements**
   - Fix the system, not the person
   - Concrete action items with owners
### Example (Blameless vs Blame)
**Blameful (BAD)**:
```
Root Cause: Engineer Jane deployed code without testing
Action Item: Remind Jane to test before deploying
```
**Blameless (GOOD)**:
```
Root Cause: Deployment process allowed untested code to reach production
Contributing Factors:
1. No automated tests in CI/CD
2. Manual deployment process (prone to human error)
3. No staging environment validation
Action Items:
1. Add automated tests to CI/CD (Owner: Mike, Due: Dec 20)
2. Require staging deployment + validation before production (Owner: Sarah, Due: Dec 22)
3. Implement deployment checklist (Owner: Alex, Due: Dec 18)
```
---
## Related Documentation
- **Examples**: [Examples Index](../examples/INDEX.md) - RCA examples from real incidents
- **Severity Matrix**: [incident-severity-matrix.md](incident-severity-matrix.md) - When to perform RCA
- **Templates**: [Postmortem Template](../templates/postmortem-template.md) - Structured RCA format
---
Return to [reference index](INDEX.md)

@@ -0,0 +1,76 @@
# Incident Response Templates
Ready-to-use templates for incident timelines, blameless postmortems, and runbooks. Copy and fill in for your incidents.
## Available Templates
### Incident Timeline Template
**File**: [incident-timeline-template.md](incident-timeline-template.md)
Real-time incident tracking template:
- Incident overview (ID, severity, impact)
- Chronological timeline (minute-by-minute)
- Role assignments (IC, Tech Lead, Comms)
- Status updates
- Resolution summary
**Use when**: Tracking ongoing incident in real-time
---
### Postmortem Template
**File**: [postmortem-template.md](postmortem-template.md)
Blameless postmortem template:
- Executive summary
- Timeline reconstruction
- Root cause analysis (5 Whys)
- Contributing factors
- Action items with owners
- Lessons learned
**Use when**: Documenting incident after resolution (within 24-48 hours)
---
### Runbook Template
**File**: [runbook-template.md](runbook-template.md)
Standard runbook structure:
- Problem description
- Diagnostic steps with commands
- Mitigation procedures
- Escalation paths
- Success criteria
**Use when**: Creating new runbook or updating existing one
---
## Template Usage
**How to use**:
1. Copy template to your documentation system
2. Fill in all `[FILL IN]` sections
3. Remove optional sections if not applicable
4. Share with team for review
**When to create**:
- **Incident Timeline**: As soon as SEV1/SEV2 declared (real-time)
- **Postmortem**: Within 24-48 hours of incident resolution
- **Runbook**: After any new incident type or process improvement
---
## Related Documentation
- **Examples**: [Examples Index](../examples/INDEX.md) - See completed examples
- **Reference**: [Reference Index](../reference/INDEX.md) - RCA techniques, communication templates
- **Main Agent**: [incident-responder.md](../incident-responder.md) - Incident responder agent
---
Return to [main agent](../incident-responder.md)

@@ -0,0 +1,147 @@
# Incident Timeline: [INCIDENT TITLE]
**Incident ID**: INC-YYYY-MM-DD-XXX
**Severity**: [SEV1 / SEV2 / SEV3]
**Status**: [Investigating / Mitigating / Resolved / Monitoring]
**Started**: [YYYY-MM-DD HH:MM UTC]
---
## Incident Overview
**Impact**:
- Customer Impact: [All users / X% of users / Specific feature]
- Services Affected: [List affected services]
- Error Rate: [X%]
- Revenue Impact: [$X estimated]
**Symptoms**:
- [User-facing symptom 1]
- [User-facing symptom 2]
- [Metric: baseline → current]
---
## Team
**Incident Commander**: @[name]
**Technical Lead**: @[name]
**Communications Lead**: @[name]
**Scribe**: @[name]
**SMEs**: @[name1], @[name2]
**Channels**:
- Slack: #incident-XXX
- Zoom: [link]
- Status Page: [link]
---
## Timeline
| Time (UTC) | Event | Action Taken | Owner | Status |
|------------|-------|--------------|-------|--------|
| [HH:MM] | [Alert fired / Issue detected] | [What was done] | @[name] | 🔴 Started |
| [HH:MM] | [IC joined] | [Declared severity, assigned roles] | @[IC] | 🔴 Investigating |
| [HH:MM] | [Discovery] | [What was found] | @[name] | 🔴 Investigating |
| [HH:MM] | [Root cause identified] | [What the root cause is] | @[name] | 🟡 Identified |
| [HH:MM] | [Mitigation started] | [What fix is being applied] | @[name] | 🟡 Mitigating |
| [HH:MM] | [Mitigation complete] | [Verification of fix] | @[name] | 🟢 Mitigated |
| [HH:MM] | [Incident resolved] | [All checks passing] | @[IC] | 🟢 Resolved |
**Total Duration**: [X] minutes/hours
---
## Status Updates
### Update #1 ([HH:MM UTC] - T+[X] min)
**Status**: [Investigating / Mitigating]
**Root Cause**: [Known / Unknown - investigating X]
**Current Actions**: [What team is doing]
**Impact**: [Current impact status]
**ETA**: [Estimated resolution time OR "Unknown"]
**Next Update**: [Time]
### Update #2 ([HH:MM UTC] - T+[X] min)
[Same format as Update #1]
### Final Update ([HH:MM UTC] - T+[X] min)
**Status**: Resolved
**Root Cause**: [Brief summary]
**Fix Applied**: [What was done]
**Impact**: Resolved
**Monitoring**: [Ongoing monitoring period]
---
## Root Cause (Brief)
**Immediate Cause**: [What directly caused the issue]
**Contributing Factors**:
1. [Factor 1]
2. [Factor 2]
3. [Factor 3]
---
## Resolution Summary
**Temporary Fix** (if applicable):
- [What was done to quickly mitigate]
- [When it was applied]
**Permanent Fix**:
- [What was done for long-term solution]
- [When it was applied]
**Verification**:
- [How we confirmed the fix worked]
- [Metrics that returned to normal]
---
## Communications
### Internal
- [HH:MM] - SEV1 declared in #incidents
- [HH:MM] - Update #1 posted
- [HH:MM] - Update #2 posted
- [HH:MM] - Resolution announced
### External
- [HH:MM] - Status page: "Investigating"
- [HH:MM] - Status page: "Identified"
- [HH:MM] - Status page: "Monitoring"
- [HH:MM] - Status page: "Resolved"
- [HH:MM] - Customer email sent (if applicable)
### Executive
- [HH:MM] - Initial notification to CTO/CEO (SEV1 only)
- [HH:MM] - Resolution summary sent
---
## Next Steps
- [ ] Full postmortem scheduled: [Date/Time]
- [ ] Action items created in Linear
- [ ] Runbook updated with new learnings
- [ ] Monitoring improvements identified
---
## Notes
[Any additional context, observations, or learnings captured during the incident]
---
Return to [templates index](INDEX.md)

@@ -0,0 +1,187 @@
# Postmortem: [INCIDENT TITLE]
**Date**: [YYYY-MM-DD]
**Incident ID**: INC-YYYY-MM-DD-XXX
**Severity**: [SEV1 / SEV2 / SEV3]
**Author**: [Name]
**Reviewers**: [Names]
**Status**: [Draft / Final]
---
## Executive Summary
**What Happened**: [2-3 sentence summary of the incident]
**Impact**:
- **Duration**: [X] minutes/hours
- **Users Affected**: [All / X% / specific group]
- **Revenue Impact**: [$X estimated loss]
- **SLA Breach**: [Yes/No - details]
**Root Cause**: [1 sentence root cause]
**Resolution**: [1 sentence how it was fixed]
**Key Actions**: [3 most important action items]
---
## Timeline
| Time (UTC) | Event | Notes |
|------------|-------|-------|
| [HH:MM] | [Event] | [Context] |
| [HH:MM] | [Event] | [Context] |
| [HH:MM] | [Event] | [Context] |
**Duration Breakdown**:
- Detection → Identification: [X] minutes
- Identification → Mitigation: [X] minutes
- Mitigation → Full Resolution: [X] minutes
- **Total MTTR**: [X] minutes
---
## Root Cause Analysis (5 Whys)
**Why 1**: Why did [problem] happen?
→ [Answer based on facts]
**Why 2**: Why did [previous answer] happen?
→ [Answer based on facts]
**Why 3**: Why did [previous answer] happen?
→ [Answer based on facts]
**Why 4**: Why did [previous answer] happen?
→ [Answer based on facts]
**Why 5**: Why did [previous answer] happen?
→ [Answer based on facts]
**ROOT CAUSE**: [Final systemic issue identified]
---
## Contributing Factors
### Immediate Cause
[Direct technical cause of the incident]
### Underlying Conditions
1. [Condition that enabled the immediate cause]
2. [Condition that enabled the immediate cause]
### Latent Failures
1. [Organizational/process weakness]
2. [Organizational/process weakness]
---
## What Went Well ✅
1. [Something that worked well during response]
2. [Something that worked well during response]
3. [Something that worked well during response]
---
## What Went Wrong ❌
1. [Something that didn't work or was missing]
2. [Something that didn't work or was missing]
3. [Something that didn't work or was missing]
---
## Action Items
| Priority | Action | Owner | Due Date | Status | Link |
|----------|--------|-------|----------|--------|------|
| P0 | [Critical - do immediately] | @[name] | [Date] | [ ] | [Link] |
| P1 | [Important - do within 1 week] | @[name] | [Date] | [ ] | [Link] |
| P2 | [Nice to have - do within 1 month] | @[name] | [Date] | [ ] | [Link] |
### P0 Actions (Immediate)
- [ ] [Action 1] - @[owner] - [due date]
- [ ] [Action 2] - @[owner] - [due date]
### P1 Actions (Short-Term)
- [ ] [Action 1] - @[owner] - [due date]
- [ ] [Action 2] - @[owner] - [due date]
### P2 Actions (Long-Term)
- [ ] [Action 1] - @[owner] - [due date]
- [ ] [Action 2] - @[owner] - [due date]
---
## Lessons Learned
### Technical Learnings
1. [Technical insight gained]
2. [Technical insight gained]
### Process Learnings
1. [Process improvement identified]
2. [Process improvement identified]
### Communication Learnings
1. [Communication improvement identified]
2. [Communication improvement identified]
---
## Prevention Measures
### Immediate (Completed)
- [x] [What was done same day]
- [x] [What was done same day]
### Short-Term (1-2 weeks)
- [ ] [What will be done soon]
- [ ] [What will be done soon]
### Long-Term (1-3 months)
- [ ] [What will be done eventually]
- [ ] [What will be done eventually]
---
## Related Incidents
- [INC-YYYY-MM-DD-XXX] - [Brief description] - [Link]
- [INC-YYYY-MM-DD-XXX] - [Brief description] - [Link]
---
## Appendix
### Relevant Logs
```
[Paste key log entries]
```
### Metrics/Graphs
[Links to Grafana dashboards, screenshots]
### Commands Run
```bash
[Commands that were used during investigation/mitigation]
```
---
## Sign-Off
**Incident Commander**: [Name] - [Date]
**Technical Lead**: [Name] - [Date]
**Engineering Manager**: [Name] - [Date]
**Postmortem Review**: [Date/Time]
**Attendees**: [List of people who reviewed]
---
Return to [templates index](INDEX.md)

@@ -0,0 +1,255 @@
# Runbook: [PROBLEM TITLE]
**Alert**: [Alert name that triggers this runbook]
**Severity**: [SEV1 / SEV2 / SEV3]
**Owner**: [Team name]
**Last Updated**: [YYYY-MM-DD]
**Last Tested**: [YYYY-MM-DD]
---
## Problem Description
[2-3 sentence description of what this problem is]
**Symptoms**:
- [Observable symptom 1 - what users/operators see]
- [Observable symptom 2]
- [Observable symptom 3]
**Impact**:
- **Customer Impact**: [What users experience]
- **Business Impact**: [Revenue, SLA, compliance]
- **Affected Services**: [List of services]
---
## Prerequisites
**Required Access**:
- [ ] Kubernetes cluster access (`kubectl` configured)
- [ ] Database access (PlanetScale/PostgreSQL)
- [ ] Cloudflare Workers access (`wrangler` configured)
- [ ] Monitoring access (Grafana, Datadog)
**Required Tools**:
- [ ] `kubectl` v1.28+
- [ ] `wrangler` v3+
- [ ] `pscale` CLI
- [ ] `curl`, `jq`
---
## Diagnosis
### Step 1: [Check Initial Symptom]
**What to check**: [Describe what this step verifies]
```bash
# Command to run
[command]
# Expected output (healthy):
[what you should see if everything is fine]
# Problem indicator:
[what you see if there's an issue]
```
**Interpretation**:
- If [condition], then [conclusion]
- If [condition], then go to Step 2
---
### Step 2: [Verify Root Cause]
```bash
# Command to run
[command]
# Look for:
[what to look for in the output]
```
**Possible Causes**:
1. **[Cause 1]**: [How to identify] → Go to [Mitigation Option A](#option-a-cause-1)
2. **[Cause 2]**: [How to identify] → Go to [Mitigation Option B](#option-b-cause-2)
3. **[Cause 3]**: [How to identify] → Escalate to [team]
---
### Step 3: [Additional Verification]
[Only if needed for complex scenarios]
```bash
# Commands
[commands]
```
---
## Mitigation
### Option A: [Cause 1]
**When to use**: [Conditions when this mitigation applies]
**Steps**:
1. [Action 1]
```bash
[command]
```
2. [Action 2]
```bash
[command]
```
3. [Action 3]
```bash
[command]
```
**Verification**:
```bash
# Check that mitigation worked
[verification command]
# Expected result:
[what you should see]
```
**If mitigation fails**: [What to do next - usually escalate]
---
### Option B: [Cause 2]
[Same format as Option A]
---
## Rollback
**If mitigation makes things worse:**
```bash
# Rollback command 1
[command to undo action 1]
# Rollback command 2
[command to undo action 2]
```
---
## Verification & Monitoring
### Health Checks
After mitigation, verify these metrics return to normal:
```bash
# Check 1: Service health
curl https://api.greyhaven.io/health
# Expected: HTTP 200, {"status": "healthy"}
# Check 2: Error rate
# Grafana: Error Rate dashboard
# Expected: <0.1%
# Check 3: Latency
# Grafana: API Latency dashboard
# Expected: p95 <500ms
```
### Monitoring Period
Monitor for **[time period]** after mitigation (see the automation sketch after this checklist):
- [ ] Error rate stable (<0.1%)
- [ ] Latency normal (p95 <500ms)
- [ ] No new alerts
- [ ] User reports resolved
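These checks can be scripted so the monitoring period is not cut short; a minimal sketch, assuming the `/health` endpoint shown above and an arbitrary 15-minute window polled every 30 seconds:

```typescript
const HEALTH_URL = "https://api.greyhaven.io/health"; // endpoint from the health checks above
const WINDOW_MS = 15 * 60 * 1000; // assumed monitoring window
const INTERVAL_MS = 30 * 1000;    // poll every 30s

async function monitor(): Promise<void> {
  const deadline = Date.now() + WINDOW_MS;
  while (Date.now() < deadline) {
    try {
      const res = await fetch(HEALTH_URL);
      const body = (await res.json()) as { status?: string };
      if (!res.ok || body.status !== "healthy") {
        // Surface regressions immediately instead of at the end of the window.
        console.error(`UNHEALTHY at ${new Date().toISOString()}: HTTP ${res.status} ${JSON.stringify(body)}`);
        process.exit(1);
      }
    } catch (err) {
      console.error(`Health check failed: ${err}`);
      process.exit(1);
    }
    await new Promise(r => setTimeout(r, INTERVAL_MS));
  }
  console.log("Monitoring window complete — service stayed healthy.");
}

await monitor();
```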
---
## Escalation
**Escalate if**:
- Mitigation doesn't work after [X] minutes
- Root cause unclear after diagnosis
- Issue is [severity] and unresolved after [X] minutes
- Multiple services affected
**Escalation Path**:
```
0-15 min: @oncall-engineer
15-30 min: @team-lead
30-60 min: @engineering-manager
60+ min: @vp-engineering (SEV1 only)
```
**Escalation Contact**:
- Team Slack: #[team-channel]
- PagerDuty: [escalation policy]
- Oncall: @[oncall-alias]
---
## Common Mistakes
### Mistake 1: [Common Error]
**Wrong**:
```bash
[incorrect command or approach]
```
**Correct**:
```bash
[correct command or approach]
```
### Mistake 2: [Common Error]
[Description and correction]
---
## Related Documentation
- **Alert Definition**: [Link to alert config]
- **Monitoring Dashboard**: [Link to Grafana]
- **Architecture Doc**: [Link to system architecture]
- **Past Incidents**: [Links to similar incidents]
- **Postmortems**: [Links to related postmortems]
---
## Changelog
| Date | Author | Changes |
|------|--------|---------|
| [YYYY-MM-DD] | @[name] | Initial creation |
| [YYYY-MM-DD] | @[name] | Updated [what changed] |
---
## Testing Notes
**Last Test Date**: [YYYY-MM-DD]
**Test Result**: [Pass / Fail]
**Notes**: [What was learned from testing]
**How to Test**:
1. [Step to simulate failure in staging]
2. [Follow runbook]
3. [Verify recovery]
4. [Document time taken and any issues]
---
Return to [templates index](INDEX.md)