Initial commit

2025-11-29 18:29:18 +08:00
commit 46dfc30864
25 changed files with 4683 additions and 0 deletions
--- a/skills/incident-response/templates/INDEX.md
+++ b/skills/incident-response/templates/INDEX.md
@@ -0,0 +1,76 @@
+# Incident Response Templates
+
+Ready-to-use templates for incident timelines, blameless postmortems, and runbooks. Copy one and fill it in for each incident.
+
+## Available Templates
+
+### Incident Timeline Template
+
+**File**: [incident-timeline-template.md](incident-timeline-template.md)
+
+Real-time incident tracking template:
+- Incident overview (ID, severity, impact)
+- Chronological timeline (minute-by-minute)
+- Role assignments (IC, Tech Lead, Comms)
+- Status updates
+- Resolution summary
+
+**Use when**: Tracking an ongoing incident in real time
+
+---
+
+### Postmortem Template
+
+**File**: [postmortem-template.md](postmortem-template.md)
+
+Blameless postmortem template:
+- Executive summary
+- Timeline reconstruction
+- Root cause analysis (5 Whys)
+- Contributing factors
+- Action items with owners
+- Lessons learned
+
+**Use when**: Documenting an incident after resolution (within 24-48 hours)
+
+---
+
+### Runbook Template
+
+**File**: [runbook-template.md](runbook-template.md)
+
+Standard runbook structure:
+- Problem description
+- Diagnostic steps with commands
+- Mitigation procedures
+- Escalation paths
+- Success criteria
+
+**Use when**: Creating a new runbook or updating an existing one
+
+---
+
+## Template Usage
+
+**How to use**:
+1. Copy template to your documentation system
+2. Fill in every bracketed placeholder (for example `[INCIDENT TITLE]` or `[HH:MM]`)
+3. Remove optional sections that do not apply
+4. Share with team for review
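
Step 2 above can be partly checked before sharing. The following is a minimal sketch, assuming placeholders are the bracketed `[...]` fields used in these templates. `find_placeholders` is a hypothetical helper name, not part of this repository, and the pattern deliberately skips Markdown links and `[ ]` / `[x]` checkboxes.

```shell
# Hypothetical helper: list bracketed placeholders still left in a copied template.
# Prints "line:match" for each one; Markdown links and checkboxes are ignored.
find_placeholders() {
  # First grep: "[...]" that is not immediately followed by "(" (that would be a link).
  # Second grep: drop task-list checkboxes such as "[ ]" and "[x]".
  grep -noE '\[[^]]+\]([^(]|$)' "$1" | grep -vE '^[0-9]+:\[( |x)\]'
}

# Example (hypothetical file name): find_placeholders my-postmortem.md
```

An empty result suggests the copy is fully filled in. This is a heuristic: a legitimate bracketed phrase in the finished document would also be reported.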
+
+**When to create**:
+- **Incident Timeline**: As soon as a SEV1 or SEV2 is declared (updated in real time)
+- **Postmortem**: Within 24-48 hours of incident resolution
+- **Runbook**: After a new type of incident or a process change
+
+---
+
+## Related Documentation
+
+- **Examples**: [Examples Index](../examples/INDEX.md) - See completed examples
+- **Reference**: [Reference Index](../reference/INDEX.md) - RCA techniques, communication templates
+- **Main Agent**: [incident-responder.md](../incident-responder.md) - Incident responder agent
+
+---
+
+Return to [main agent](../incident-responder.md)
--- a/skills/incident-response/templates/incident-timeline-template.md
+++ b/skills/incident-response/templates/incident-timeline-template.md
@@ -0,0 +1,147 @@
+# Incident Timeline: [INCIDENT TITLE]
+
+**Incident ID**: INC-YYYY-MM-DD-XXX
+**Severity**: [SEV1 / SEV2 / SEV3]
+**Status**: [Investigating / Mitigating / Resolved / Monitoring]
+**Started**: [YYYY-MM-DD HH:MM UTC]
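
The ID format above can be checked mechanically. This is a small sketch, assuming `XXX` stands for a three-digit sequence number (the template does not say what it is); `valid_incident_id` is a hypothetical name.

```shell
# Hypothetical check for IDs shaped like INC-YYYY-MM-DD-XXX.
# Assumption: XXX is a three-digit sequence number.
valid_incident_id() {
  printf '%s\n' "$1" | grep -Eq '^INC-[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{3}$'
}

# Example: valid_incident_id "INC-2025-11-29-001" && echo "ok"
```

The check only validates the shape of the ID; it does not confirm that the date is a real calendar date.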
+
+---
+
+## Incident Overview
+
+**Impact**:
+- Customer Impact: [All users / X% of users / Specific feature]
+- Services Affected: [List affected services]
+- Error Rate: [X%]
+- Revenue Impact: [$X estimated]
+
+**Symptoms**:
+- [User-facing symptom 1]
+- [User-facing symptom 2]
+- [Metric: baseline → current]
+
+---
+
+## Team
+
+**Incident Commander**: @[name]
+**Technical Lead**: @[name]
+**Communications Lead**: @[name]
+**Scribe**: @[name]
+**SMEs**: @[name1], @[name2]
+
+**Channels**:
+- Slack: #incident-XXX
+- Zoom: [link]
+- Status Page: [link]
+
+---
+
+## Timeline
+
+| Time (UTC) | Event | Action Taken | Owner | Status |
+|------------|-------|--------------|-------|--------|
+| [HH:MM] | [Alert fired / Issue detected] | [What was done] | @[name] | 🔴 Started |
+| [HH:MM] | [IC joined] | [Declared severity, assigned roles] | @[IC] | 🔴 Investigating |
+| [HH:MM] | [Discovery] | [What was found] | @[name] | 🔴 Investigating |
+| [HH:MM] | [Root cause identified] | [What the root cause is] | @[name] | 🟡 Identified |
+| [HH:MM] | [Mitigation started] | [What fix is being applied] | @[name] | 🟡 Mitigating |
+| [HH:MM] | [Mitigation complete] | [Verification of fix] | @[name] | 🟢 Mitigated |
+| [HH:MM] | [Incident resolved] | [All checks passing] | @[IC] | 🟢 Resolved |
+
+**Total Duration**: [X] minutes/hours
+
+---
+
+## Status Updates
+
+### Update #1 ([HH:MM UTC] - T+[X] min)
+
+**Status**: [Investigating / Mitigating]
+**Root Cause**: [Known / Unknown - investigating X]
+**Current Actions**: [What team is doing]
+**Impact**: [Current impact status]
+**ETA**: [Estimated resolution time OR "Unknown"]
+**Next Update**: [Time]
+
+### Update #2 ([HH:MM UTC] - T+[X] min)
+
+[Same format as Update #1]
+
+### Final Update ([HH:MM UTC] - T+[X] min)
+
+**Status**: Resolved
+**Root Cause**: [Brief summary]
+**Fix Applied**: [What was done]
+**Impact**: Resolved
+**Monitoring**: [Ongoing monitoring period]
+
+---
+
+## Root Cause (Brief)
+
+**Immediate Cause**: [What directly caused the issue]
+
+**Contributing Factors**:
+1. [Factor 1]
+2. [Factor 2]
+3. [Factor 3]
+
+---
+
+## Resolution Summary
+
+**Temporary Fix** (if applicable):
+- [What was done to quickly mitigate]
+- [When it was applied]
+
+**Permanent Fix**:
+- [What was done for long-term solution]
+- [When it was applied]
+
+**Verification**:
+- [How we confirmed the fix worked]
+- [Metrics that returned to normal]
+
+---
+
+## Communications
+
+### Internal
+
+- [HH:MM] - SEV1 declared in #incidents
+- [HH:MM] - Update #1 posted
+- [HH:MM] - Update #2 posted
+- [HH:MM] - Resolution announced
+
+### External
+
+- [HH:MM] - Status page: "Investigating"
+- [HH:MM] - Status page: "Identified"
+- [HH:MM] - Status page: "Monitoring"
+- [HH:MM] - Status page: "Resolved"
+- [HH:MM] - Customer email sent (if applicable)
+
+### Executive
+
+- [HH:MM] - Initial notification to CTO/CEO (SEV1 only)
+- [HH:MM] - Resolution summary sent
+
+---
+
+## Next Steps
+
+- [ ] Full postmortem scheduled: [Date/Time]
+- [ ] Action items created in Linear
+- [ ] Runbook updated with new learnings
+- [ ] Monitoring improvements identified
+
+---
+
+## Notes
+
+[Any additional context, observations, or learnings captured during the incident]
+
+---
+
+Return to [templates index](INDEX.md)
--- a/skills/incident-response/templates/postmortem-template.md
+++ b/skills/incident-response/templates/postmortem-template.md
@@ -0,0 +1,187 @@
+# Postmortem: [INCIDENT TITLE]
+
+**Date**: [YYYY-MM-DD]
+**Incident ID**: INC-YYYY-MM-DD-XXX
+**Severity**: [SEV1 / SEV2 / SEV3]
+**Author**: [Name]
+**Reviewers**: [Names]
+**Status**: [Draft / Final]
+
+---
+
+## Executive Summary
+
+**What Happened**: [2-3 sentence summary of the incident]
+
+**Impact**:
+- **Duration**: [X] minutes/hours
+- **Users Affected**: [All / X% / specific group]
+- **Revenue Impact**: [$X estimated loss]
+- **SLA Breach**: [Yes/No - details]
+
+**Root Cause**: [1 sentence root cause]
+
+**Resolution**: [1 sentence how it was fixed]
+
+**Key Actions**: [3 most important action items]
+
+---
+
+## Timeline
+
+| Time (UTC) | Event | Notes |
+|------------|-------|-------|
+| [HH:MM] | [Event] | [Context] |
+| [HH:MM] | [Event] | [Context] |
+| [HH:MM] | [Event] | [Context] |
+
+**Duration Breakdown**:
+- Detection → Identification: [X] minutes
+- Identification → Mitigation: [X] minutes
+- Mitigation → Full Resolution: [X] minutes
+- **Total Time to Resolve**: [X] minutes
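
The breakdown above is plain arithmetic on the timeline's `HH:MM` timestamps. A minimal sketch, assuming all four timestamps are in UTC and the incident lasted less than 24 hours; the function names are illustrative only.

```shell
# Convert "HH:MM" to minutes since midnight.
to_minutes() (
  h=${1%%:*}; m=${1##*:}
  h=${h#0}; m=${m#0}   # strip one leading zero so "09" is not read as octal
  echo $(( ${h:-0} * 60 + ${m:-0} ))
)

# Minutes from one HH:MM to a later HH:MM, allowing a single midnight crossing.
minutes_between() (
  d=$(( $(to_minutes "$2") - $(to_minutes "$1") ))
  [ "$d" -lt 0 ] && d=$(( d + 1440 ))
  echo "$d"
)

# Arguments: detection, identification, mitigation, and resolution times.
duration_breakdown() {
  echo "Detection → Identification: $(minutes_between "$1" "$2") minutes"
  echo "Identification → Mitigation: $(minutes_between "$2" "$3") minutes"
  echo "Mitigation → Full Resolution: $(minutes_between "$3" "$4") minutes"
  echo "Total: $(minutes_between "$1" "$4") minutes"
}

# Example: duration_breakdown 09:00 09:12 09:40 10:05
```

For incidents that run longer than a day, full dates would be needed rather than `HH:MM` values alone.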
+
+---
+
+## Root Cause Analysis (5 Whys)
+
+**Why 1**: Why did [problem] happen?
+→ [Answer based on facts]
+
+**Why 2**: Why did [previous answer] happen?
+→ [Answer based on facts]
+
+**Why 3**: Why did [previous answer] happen?
+→ [Answer based on facts]
+
+**Why 4**: Why did [previous answer] happen?
+→ [Answer based on facts]
+
+**Why 5**: Why did [previous answer] happen?
+→ [Answer based on facts]
+
+**ROOT CAUSE**: [Final systemic issue identified]
+
+---
+
+## Contributing Factors
+
+### Immediate Cause
+[Direct technical cause of the incident]
+
+### Underlying Conditions
+1. [Condition that enabled the immediate cause]
+2. [Condition that enabled the immediate cause]
+
+### Latent Failures
+1. [Organizational/process weakness]
+2. [Organizational/process weakness]
+
+---
+
+## What Went Well ✅
+
+1. [Something that worked well during response]
+2. [Something that worked well during response]
+3. [Something that worked well during response]
+
+---
+
+## What Went Wrong ❌
+
+1. [Something that didn't work or was missing]
+2. [Something that didn't work or was missing]
+3. [Something that didn't work or was missing]
+
+---
+
+## Action Items
+
+| Priority | Action | Owner | Due Date | Status | Link |
+|----------|--------|-------|----------|--------|------|
+| P0 | [Critical - do immediately] | @[name] | [Date] | [ ] | [Link] |
+| P1 | [Important - do within 1 week] | @[name] | [Date] | [ ] | [Link] |
+| P2 | [Nice to have - do within 1 month] | @[name] | [Date] | [ ] | [Link] |
+
+### P0 Actions (Immediate)
+- [ ] [Action 1] - @[owner] - [due date]
+- [ ] [Action 2] - @[owner] - [due date]
+
+### P1 Actions (Short-Term)
+- [ ] [Action 1] - @[owner] - [due date]
+- [ ] [Action 2] - @[owner] - [due date]
+
+### P2 Actions (Long-Term)
+- [ ] [Action 1] - @[owner] - [due date]
+- [ ] [Action 2] - @[owner] - [due date]
+
+---
+
+## Lessons Learned
+
+### Technical Learnings
+1. [Technical insight gained]
+2. [Technical insight gained]
+
+### Process Learnings
+1. [Process improvement identified]
+2. [Process improvement identified]
+
+### Communication Learnings
+1. [Communication improvement identified]
+2. [Communication improvement identified]
+
+---
+
+## Prevention Measures
+
+### Immediate (Completed)
+- [x] [What was done same day]
+- [x] [What was done same day]
+
+### Short-Term (1-2 weeks)
+- [ ] [What will be done soon]
+- [ ] [What will be done soon]
+
+### Long-Term (1-3 months)
+- [ ] [What will be done eventually]
+- [ ] [What will be done eventually]
+
+---
+
+## Related Incidents
+
+- [INC-YYYY-MM-DD-XXX] - [Brief description] - [Link]
+- [INC-YYYY-MM-DD-XXX] - [Brief description] - [Link]
+
+---
+
+## Appendix
+
+### Relevant Logs
+```
+[Paste key log entries]
+```
+
+### Metrics/Graphs
+[Links to Grafana dashboards, screenshots]
+
+### Commands Run
+```bash
+[Commands that were used during investigation/mitigation]
+```
+
+---
+
+## Sign-Off
+
+**Incident Commander**: [Name] - [Date]
+**Technical Lead**: [Name] - [Date]
+**Engineering Manager**: [Name] - [Date]
+
+**Postmortem Review**: [Date/Time]
+**Attendees**: [List of people who reviewed]
+
+---
+
+Return to [templates index](INDEX.md)
--- a/skills/incident-response/templates/runbook-template.md
+++ b/skills/incident-response/templates/runbook-template.md
@@ -0,0 +1,255 @@
+# Runbook: [PROBLEM TITLE]
+
+**Alert**: [Alert name that triggers this runbook]
+**Severity**: [SEV1 / SEV2 / SEV3]
+**Owner**: [Team name]
+**Last Updated**: [YYYY-MM-DD]
+**Last Tested**: [YYYY-MM-DD]
+
+---
+
+## Problem Description
+
+[2-3 sentence description of what this problem is]
+
+**Symptoms**:
+- [Observable symptom 1 - what users/operators see]
+- [Observable symptom 2]
+- [Observable symptom 3]
+
+**Impact**:
+- **Customer Impact**: [What users experience]
+- **Business Impact**: [Revenue, SLA, compliance]
+- **Affected Services**: [List of services]
+
+---
+
+## Prerequisites
+
+**Required Access**:
+- [ ] Kubernetes cluster access (`kubectl` configured)
+- [ ] Database access (PlanetScale/PostgreSQL)
+- [ ] Cloudflare Workers access (`wrangler` configured)
+- [ ] Monitoring access (Grafana, Datadog)
+
+**Required Tools**:
+- [ ] `kubectl` v1.28+
+- [ ] `wrangler` v3+
+- [ ] `pscale` CLI
+- [ ] `curl`, `jq`
+
+---
+
+## Diagnosis
+
+### Step 1: [Check Initial Symptom]
+
+**What to check**: [Describe what this step verifies]
+
+```bash
+# Command to run
+[command]
+
+# Expected output (healthy):
+[what you should see if everything is fine]
+
+# Problem indicator:
+[what you see if there's an issue]
+```
+
+**Interpretation**:
+- If [condition], then [conclusion]
+- If [condition], then go to Step 2
+
+---
+
+### Step 2: [Verify Root Cause]
+
+```bash
+# Command to run
+[command]
+
+# Look for:
+[what to look for in the output]
+```
+
+**Possible Causes**:
+1. **[Cause 1]**: [How to identify] → Go to [Mitigation Option A](#option-a-cause-1)
+2. **[Cause 2]**: [How to identify] → Go to [Mitigation Option B](#option-b-cause-2)
+3. **[Cause 3]**: [How to identify] → Escalate to [team]
+
+---
+
+### Step 3: [Additional Verification]
+
+[Only if needed for complex scenarios]
+
+```bash
+# Commands
+[commands]
+```
+
+---
+
+## Mitigation
+
+### Option A: [Cause 1]
+
+**When to use**: [Conditions when this mitigation applies]
+
+**Steps**:
+1. [Action 1]
+   ```bash
+   [command]
+   ```
+
+2. [Action 2]
+   ```bash
+   [command]
+   ```
+
+3. [Action 3]
+   ```bash
+   [command]
+   ```
+
+**Verification**:
+```bash
+# Check that mitigation worked
+[verification command]
+
+# Expected result:
+[what you should see]
+```
+
+**If mitigation fails**: [What to do next - usually escalate]
+
+---
+
+### Option B: [Cause 2]
+
+[Same format as Option A]
+
+---
+
+## Rollback
+
+**If mitigation makes things worse:**
+
+```bash
+# Rollback command 1
+[command to undo action 1]
+
+# Rollback command 2
+[command to undo action 2]
+```
+
+---
+
+## Verification & Monitoring
+
+### Health Checks
+
+After mitigation, verify these metrics return to normal:
+
+```bash
+# Check 1: Service health (-i prints the status line and headers with the body)
+curl -sS -i https://api.greyhaven.io/health
+# Expected: HTTP 200, {"status": "healthy"}
+
+# Check 2: Error rate
+# Grafana: Error Rate dashboard
+# Expected: <0.1%
+
+# Check 3: Latency
+# Grafana: API Latency dashboard
+# Expected: p95 <500ms
+```
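
The three checks above can be combined into one pass/fail test. This is a sketch under stated assumptions: the error rate (in percent) and p95 latency (in ms) are read off the Grafana dashboards by the operator and passed in by hand, since the template names no query API. `health_status` and `within_thresholds` are hypothetical helper names.

```shell
# Hypothetical helper: print only the HTTP status code of a health endpoint.
health_status() {
  curl -sS -o /dev/null -w '%{http_code}' "$1"
}

# Succeeds only if error rate (percent) is below 0.1 and p95 latency (ms) is below 500.
# awk is used because shell arithmetic cannot compare fractional values.
within_thresholds() {
  awk -v err="$1" -v p95="$2" 'BEGIN { exit !(err < 0.1 && p95 < 500) }'
}

# Example:
#   [ "$(health_status https://api.greyhaven.io/health)" = "200" ] \
#     && within_thresholds 0.05 320 && echo "healthy"
```

Keeping the thresholds in one function makes it easier to keep them in step with the monitoring checklist if the limits change.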
+
+### Monitoring Period
+
+Monitor for **[time period]** after mitigation:
+- [ ] Error rate stable (<0.1%)
+- [ ] Latency normal (p95 <500ms)
+- [ ] No new alerts
+- [ ] User reports resolved
+
+---
+
+## Escalation
+
+**Escalate if**:
+- Mitigation doesn't work after [X] minutes
+- Root cause unclear after diagnosis
+- Issue is [severity] and unresolved after [X] minutes
+- Multiple services affected
+
+**Escalation Path**:
+```
+0-15 min:  @oncall-engineer
+15-30 min: @team-lead
+30-60 min: @engineering-manager
+60+ min:   @vp-engineering (SEV1 only)
+```
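
The escalation path above can be read as a lookup on elapsed time. A minimal sketch, assuming each window includes its lower bound and excludes its upper bound (the path does not say which side a boundary such as 15 minutes falls on), and that incidents below SEV1 stay with the engineering manager after 60 minutes. `escalation_contact` is a hypothetical name.

```shell
# Hypothetical lookup: who should be engaged after N minutes of an unresolved incident.
# Usage: escalation_contact <elapsed-minutes> [SEV1|SEV2|SEV3]
escalation_contact() (
  elapsed=$1
  severity=${2:-SEV2}
  if [ "$elapsed" -lt 15 ]; then
    echo "@oncall-engineer"
  elif [ "$elapsed" -lt 30 ]; then
    echo "@team-lead"
  elif [ "$elapsed" -lt 60 ] || [ "$severity" != "SEV1" ]; then
    echo "@engineering-manager"
  else
    echo "@vp-engineering"   # 60+ minutes, SEV1 only
  fi
)

# Example: escalation_contact 45 SEV1
```

The function body runs in a subshell, so `elapsed` and `severity` do not leak into the calling shell.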
+
+**Escalation Contact**:
+- Team Slack: #[team-channel]
+- PagerDuty: [escalation policy]
+- Oncall: @[oncall-alias]
+
+---
+
+## Common Mistakes
+
+### Mistake 1: [Common Error]
+
+**Wrong**:
+```bash
+[incorrect command or approach]
+```
+
+**Correct**:
+```bash
+[correct command or approach]
+```
+
+### Mistake 2: [Common Error]
+
+[Description and correction]
+
+---
+
+## Related Documentation
+
+- **Alert Definition**: [Link to alert config]
+- **Monitoring Dashboard**: [Link to Grafana]
+- **Architecture Doc**: [Link to system architecture]
+- **Past Incidents**: [Links to similar incidents]
+- **Postmortems**: [Links to related postmortems]
+
+---
+
+## Changelog
+
+| Date | Author | Changes |
+|------|--------|---------|
+| [YYYY-MM-DD] | @[name] | Initial creation |
+| [YYYY-MM-DD] | @[name] | Updated [what changed] |
+
+---
+
+## Testing Notes
+
+**Last Test Date**: [YYYY-MM-DD]
+**Test Result**: [Pass / Fail]
+**Notes**: [What was learned from testing]
+
+**How to Test**:
+1. [Step to simulate failure in staging]
+2. [Follow runbook]
+3. [Verify recovery]
+4. [Document time taken and any issues]
+
+---
+
+Return to [templates index](INDEX.md)