Initial commit

Zhongwei Li
2025-11-29 18:29:18 +08:00
commit 46dfc30864
25 changed files with 4683 additions and 0 deletions

@@ -0,0 +1,76 @@
# Incident Response Templates
Ready-to-use templates for incident timelines, blameless postmortems, and runbooks. Copy a template and fill it in for your incident.
## Available Templates
### Incident Timeline Template
**File**: [incident-timeline-template.md](incident-timeline-template.md)
Real-time incident tracking template:
- Incident overview (ID, severity, impact)
- Chronological timeline (minute-by-minute)
- Role assignments (IC, Tech Lead, Comms)
- Status updates
- Resolution summary
**Use when**: Tracking an ongoing incident in real time
---
### Postmortem Template
**File**: [postmortem-template.md](postmortem-template.md)
Blameless postmortem template:
- Executive summary
- Timeline reconstruction
- Root cause analysis (5 Whys)
- Contributing factors
- Action items with owners
- Lessons learned
**Use when**: Documenting an incident after resolution (within 24-48 hours)
---
### Runbook Template
**File**: [runbook-template.md](runbook-template.md)
Standard runbook structure:
- Problem description
- Diagnostic steps with commands
- Mitigation procedures
- Escalation paths
- Success criteria
**Use when**: Creating a new runbook or updating an existing one
---
## Template Usage
**How to use**:
1. Copy template to your documentation system
2. Fill in every bracketed placeholder (for example, `[INCIDENT TITLE]` or `@[name]`)
3. Remove optional sections that do not apply
4. Share with team for review
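
Before sharing a filled-in copy, it can help to check that no bracketed placeholder was missed. The following is a minimal sketch, not part of the templates themselves. The function name `find_placeholders` and the sample text are hypothetical. It assumes a placeholder is any `[bracketed text]` that is neither a Markdown link label nor a task-list checkbox (`[ ]` or lowercase `[x]`).

```python
import re

# Assumption: "[ ]" and lowercase "[x]" are task-list checkboxes, and a
# bracket immediately followed by "(" is a Markdown link label. Everything
# else in square brackets, including "[X]" as in "[X] minutes", is treated
# as an unfilled placeholder.
PLACEHOLDER = re.compile(r"\[(?![ x]\])([^\[\]\n]+)\](?!\()")


def find_placeholders(markdown_text: str) -> list[tuple[int, str]]:
    """Return (line_number, placeholder_text) pairs still left in the text."""
    found = []
    for line_number, line in enumerate(markdown_text.splitlines(), start=1):
        for match in PLACEHOLDER.finditer(line):
            found.append((line_number, match.group(1)))
    return found


# Hypothetical, partially filled-in document.
sample = "\n".join([
    "# Postmortem: [INCIDENT TITLE]",
    "**Duration**: [X] minutes/hours",
    "- [x] [What was done same day]",
    "- [ ] Rotate credentials",
    "Return to [templates index](INDEX.md)",
])

remaining = find_placeholders(sample)
for line_number, text in remaining:
    print(f"line {line_number}: [{text}] still needs a value")
```

An empty result suggests the copy is ready for review. A non-empty result lists the lines that still need attention.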
**When to create**:
- **Incident Timeline**: As soon as a SEV1 or SEV2 is declared (maintained in real time)
- **Postmortem**: Within 24-48 hours of incident resolution
- **Runbook**: After any new incident type or process improvement
---
## Related Documentation
- **Examples**: [Examples Index](../examples/INDEX.md) - See completed examples
- **Reference**: [Reference Index](../reference/INDEX.md) - RCA techniques, communication templates
- **Main Agent**: [incident-responder.md](../incident-responder.md) - Incident responder agent
---
Return to [main agent](../incident-responder.md)

@@ -0,0 +1,147 @@
# Incident Timeline: [INCIDENT TITLE]
**Incident ID**: INC-YYYY-MM-DD-XXX
**Severity**: [SEV1 / SEV2 / SEV3]
**Status**: [Investigating / Mitigating / Resolved / Monitoring]
**Started**: [YYYY-MM-DD HH:MM UTC]
---
## Incident Overview
**Impact**:
- Customer Impact: [All users / X% of users / Specific feature]
- Services Affected: [List affected services]
- Error Rate: [X%]
- Revenue Impact: [$X estimated]
**Symptoms**:
- [User-facing symptom 1]
- [User-facing symptom 2]
- [Metric: baseline → current]
---
## Team
**Incident Commander**: @[name]
**Technical Lead**: @[name]
**Communications Lead**: @[name]
**Scribe**: @[name]
**SMEs**: @[name1], @[name2]
**Channels**:
- Slack: #incident-XXX
- Zoom: [link]
- Status Page: [link]
---
## Timeline
| Time (UTC) | Event | Action Taken | Owner | Status |
|------------|-------|--------------|-------|--------|
| [HH:MM] | [Alert fired / Issue detected] | [What was done] | @[name] | 🔴 Started |
| [HH:MM] | [IC joined] | [Declared severity, assigned roles] | @[IC] | 🔴 Investigating |
| [HH:MM] | [Discovery] | [What was found] | @[name] | 🔴 Investigating |
| [HH:MM] | [Root cause identified] | [What the root cause is] | @[name] | 🟡 Identified |
| [HH:MM] | [Mitigation started] | [What fix is being applied] | @[name] | 🟡 Mitigating |
| [HH:MM] | [Mitigation complete] | [Verification of fix] | @[name] | 🟢 Mitigated |
| [HH:MM] | [Incident resolved] | [All checks passing] | @[IC] | 🟢 Resolved |
**Total Duration**: [X] minutes/hours
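
The total duration and the `T+[X] min` offsets used in status update headings are both differences between UTC timestamps. The following is a minimal sketch of that arithmetic. The helper names and the timestamps are hypothetical. Using full dates rather than `HH:MM` alone keeps the result correct when an incident crosses midnight.

```python
from datetime import datetime, timezone


def minutes_since(start: datetime, moment: datetime) -> int:
    """Whole minutes elapsed between the incident start and a later moment."""
    return int((moment - start).total_seconds() // 60)


def format_duration(total_minutes: int) -> str:
    """Render minutes as 'N minutes', or 'H h M min' from one hour upward."""
    if total_minutes < 60:
        return f"{total_minutes} minutes"
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} h {minutes} min"


# Hypothetical timestamps, all in UTC to match the Time (UTC) column.
started = datetime(2025, 1, 15, 23, 40, tzinfo=timezone.utc)
update_1 = datetime(2025, 1, 15, 23, 55, tzinfo=timezone.utc)
resolved = datetime(2025, 1, 16, 1, 12, tzinfo=timezone.utc)

update_offset = minutes_since(started, update_1)
total = format_duration(minutes_since(started, resolved))
print(f"Update #1 at T+{update_offset} min")
print(f"Total Duration: {total}")
```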
---
## Status Updates
### Update #1 ([HH:MM UTC] - T+[X] min)
**Status**: [Investigating / Mitigating]
**Root Cause**: [Known / Unknown - investigating X]
**Current Actions**: [What team is doing]
**Impact**: [Current impact status]
**ETA**: [Estimated resolution time OR "Unknown"]
**Next Update**: [Time]
### Update #2 ([HH:MM UTC] - T+[X] min)
[Same format as Update #1]
### Final Update ([HH:MM UTC] - T+[X] min)
**Status**: Resolved
**Root Cause**: [Brief summary]
**Fix Applied**: [What was done]
**Impact**: Resolved
**Monitoring**: [Ongoing monitoring period]
---
## Root Cause (Brief)
**Immediate Cause**: [What directly caused the issue]
**Contributing Factors**:
1. [Factor 1]
2. [Factor 2]
3. [Factor 3]
---
## Resolution Summary
**Temporary Fix** (if applicable):
- [What was done to quickly mitigate]
- [When it was applied]
**Permanent Fix**:
- [What was done for long-term solution]
- [When it was applied]
**Verification**:
- [How we confirmed the fix worked]
- [Metrics that returned to normal]
---
## Communications
### Internal
- [HH:MM] - SEV1 declared in #incidents
- [HH:MM] - Update #1 posted
- [HH:MM] - Update #2 posted
- [HH:MM] - Resolution announced
### External
- [HH:MM] - Status page: "Investigating"
- [HH:MM] - Status page: "Identified"
- [HH:MM] - Status page: "Monitoring"
- [HH:MM] - Status page: "Resolved"
- [HH:MM] - Customer email sent (if applicable)
### Executive
- [HH:MM] - Initial notification to CTO/CEO (SEV1 only)
- [HH:MM] - Resolution summary sent
---
## Next Steps
- [ ] Full postmortem scheduled: [Date/Time]
- [ ] Action items created in Linear
- [ ] Runbook updated with new learnings
- [ ] Monitoring improvements identified
---
## Notes
[Any additional context, observations, or learnings captured during the incident]
---
Return to [templates index](INDEX.md)

@@ -0,0 +1,187 @@
# Postmortem: [INCIDENT TITLE]
**Date**: [YYYY-MM-DD]
**Incident ID**: INC-YYYY-MM-DD-XXX
**Severity**: [SEV1 / SEV2 / SEV3]
**Author**: [Name]
**Reviewers**: [Names]
**Status**: [Draft / Final]
---
## Executive Summary
**What Happened**: [2-3 sentence summary of the incident]
**Impact**:
- **Duration**: [X] minutes/hours
- **Users Affected**: [All / X% / specific group]
- **Revenue Impact**: [$X estimated loss]
- **SLA Breach**: [Yes/No - details]
**Root Cause**: [1 sentence root cause]
**Resolution**: [1 sentence how it was fixed]
**Key Actions**: [3 most important action items]
---
## Timeline
| Time (UTC) | Event | Notes |
|------------|-------|-------|
| [HH:MM] | [Event] | [Context] |
| [HH:MM] | [Event] | [Context] |
| [HH:MM] | [Event] | [Context] |
**Duration Breakdown**:
- Detection → Identification: [X] minutes
- Identification → Mitigation: [X] minutes
- Mitigation → Full Resolution: [X] minutes
- **Total Time to Resolution**: [X] minutes
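
The breakdown above can be derived from four timestamps taken from the timeline. The following is a minimal sketch, assuming timestamps recorded at minute precision in UTC. The function name `duration_breakdown` and the sample times are hypothetical.

```python
from datetime import datetime, timezone


def duration_breakdown(detected, identified, mitigated, resolved):
    """Return each phase of the incident, and the total, in whole minutes."""
    def minutes(earlier, later):
        return int((later - earlier).total_seconds() // 60)

    return {
        "Detection -> Identification": minutes(detected, identified),
        "Identification -> Mitigation": minutes(identified, mitigated),
        "Mitigation -> Full Resolution": minutes(mitigated, resolved),
        # Measured end to end, so it stays correct even if a phase is edited.
        "Total": minutes(detected, resolved),
    }


def utc(hour, minute):
    """Hypothetical helper: a UTC timestamp on a single example day."""
    return datetime(2025, 3, 3, hour, minute, tzinfo=timezone.utc)


breakdown = duration_breakdown(utc(14, 2), utc(14, 20), utc(14, 41), utc(15, 5))
for phase, value in breakdown.items():
    print(f"{phase}: {value} minutes")
```

With minute-precision timestamps, the three phases add up to the total, which is a quick consistency check on the reconstructed timeline.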
---
## Root Cause Analysis (5 Whys)
**Why 1**: Why did [problem] happen?
→ [Answer based on facts]
**Why 2**: Why did [previous answer] happen?
→ [Answer based on facts]
**Why 3**: Why did [previous answer] happen?
→ [Answer based on facts]
**Why 4**: Why did [previous answer] happen?
→ [Answer based on facts]
**Why 5**: Why did [previous answer] happen?
→ [Answer based on facts]
**ROOT CAUSE**: [Final systemic issue identified]
---
## Contributing Factors
### Immediate Cause
[Direct technical cause of the incident]
### Underlying Conditions
1. [Condition that enabled the immediate cause]
2. [Condition that enabled the immediate cause]
### Latent Failures
1. [Organizational/process weakness]
2. [Organizational/process weakness]
---
## What Went Well ✅
1. [Something that worked well during response]
2. [Something that worked well during response]
3. [Something that worked well during response]
---
## What Went Wrong ❌
1. [Something that didn't work or was missing]
2. [Something that didn't work or was missing]
3. [Something that didn't work or was missing]
---
## Action Items
| Priority | Action | Owner | Due Date | Status | Link |
|----------|--------|-------|----------|--------|------|
| P0 | [Critical - do immediately] | @[name] | [Date] | [ ] | [Link] |
| P1 | [Important - do within 1 week] | @[name] | [Date] | [ ] | [Link] |
| P2 | [Nice to have - do within 1 month] | @[name] | [Date] | [ ] | [Link] |
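
The priority windows in the table translate into latest acceptable due dates. The following is a minimal sketch, assuming the windows are counted from the postmortem date and that "1 month" means 30 days. Both are assumptions, as are the names `DUE_WITHIN` and `latest_due_date`.

```python
from datetime import date, timedelta

# Assumption: windows mirror the table above; "1 month" is taken as 30 days.
DUE_WITHIN = {
    "P0": timedelta(days=0),   # critical: do immediately
    "P1": timedelta(days=7),   # important: within 1 week
    "P2": timedelta(days=30),  # nice to have: within 1 month
}


def latest_due_date(priority: str, postmortem_date: date) -> date:
    """Return the latest due date allowed for an action of this priority."""
    if priority not in DUE_WITHIN:
        raise ValueError(f"unknown priority: {priority!r}")
    return postmortem_date + DUE_WITHIN[priority]


postmortem_date = date(2025, 3, 3)  # hypothetical date
for priority in ("P0", "P1", "P2"):
    print(priority, latest_due_date(priority, postmortem_date).isoformat())
```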
### P0 Actions (Immediate)
- [ ] [Action 1] - @[owner] - [due date]
- [ ] [Action 2] - @[owner] - [due date]
### P1 Actions (Short-Term)
- [ ] [Action 1] - @[owner] - [due date]
- [ ] [Action 2] - @[owner] - [due date]
### P2 Actions (Long-Term)
- [ ] [Action 1] - @[owner] - [due date]
- [ ] [Action 2] - @[owner] - [due date]
---
## Lessons Learned
### Technical Learnings
1. [Technical insight gained]
2. [Technical insight gained]
### Process Learnings
1. [Process improvement identified]
2. [Process improvement identified]
### Communication Learnings
1. [Communication improvement identified]
2. [Communication improvement identified]
---
## Prevention Measures
### Immediate (Completed)
- [x] [What was done same day]
- [x] [What was done same day]
### Short-Term (1-2 weeks)
- [ ] [What will be done soon]
- [ ] [What will be done soon]
### Long-Term (1-3 months)
- [ ] [What will be done eventually]
- [ ] [What will be done eventually]
---
## Related Incidents
- [INC-YYYY-MM-DD-XXX] - [Brief description] - [Link]
- [INC-YYYY-MM-DD-XXX] - [Brief description] - [Link]
---
## Appendix
### Relevant Logs
```
[Paste key log entries]
```
### Metrics/Graphs
[Links to Grafana dashboards, screenshots]
### Commands Run
```bash
[Commands that were used during investigation/mitigation]
```
---
## Sign-Off
**Incident Commander**: [Name] - [Date]
**Technical Lead**: [Name] - [Date]
**Engineering Manager**: [Name] - [Date]
**Postmortem Review**: [Date/Time]
**Attendees**: [List of people who reviewed]
---
Return to [templates index](INDEX.md)

@@ -0,0 +1,255 @@
# Runbook: [PROBLEM TITLE]
**Alert**: [Alert name that triggers this runbook]
**Severity**: [SEV1 / SEV2 / SEV3]
**Owner**: [Team name]
**Last Updated**: [YYYY-MM-DD]
**Last Tested**: [YYYY-MM-DD]
---
## Problem Description
[2-3 sentence description of what this problem is]
**Symptoms**:
- [Observable symptom 1 - what users/operators see]
- [Observable symptom 2]
- [Observable symptom 3]
**Impact**:
- **Customer Impact**: [What users experience]
- **Business Impact**: [Revenue, SLA, compliance]
- **Affected Services**: [List of services]
---
## Prerequisites
**Required Access**:
- [ ] Kubernetes cluster access (`kubectl` configured)
- [ ] Database access (PlanetScale/PostgreSQL)
- [ ] Cloudflare Workers access (`wrangler` configured)
- [ ] Monitoring access (Grafana, Datadog)
**Required Tools**:
- [ ] `kubectl` v1.28+
- [ ] `wrangler` v3+
- [ ] `pscale` CLI
- [ ] `curl`, `jq`
---
## Diagnosis
### Step 1: [Check Initial Symptom]
**What to check**: [Describe what this step verifies]
```bash
# Command to run
[command]
# Expected output (healthy):
[what you should see if everything is fine]
# Problem indicator:
[what you see if there's an issue]
```
**Interpretation**:
- If [condition], then [conclusion]
- If [condition], then go to Step 2
---
### Step 2: [Verify Root Cause]
```bash
# Command to run
[command]
# Look for:
[what to look for in the output]
```
**Possible Causes**:
1. **[Cause 1]**: [How to identify] → Go to [Mitigation Option A](#option-a-cause-1)
2. **[Cause 2]**: [How to identify] → Go to [Mitigation Option B](#option-b-cause-2)
3. **[Cause 3]**: [How to identify] → Escalate to [team]
---
### Step 3: [Additional Verification]
[Only if needed for complex scenarios]
```bash
# Commands
[commands]
```
---
## Mitigation
### Option A: [Cause 1]
**When to use**: [Conditions when this mitigation applies]
**Steps**:
1. [Action 1]
```bash
[command]
```
2. [Action 2]
```bash
[command]
```
3. [Action 3]
```bash
[command]
```
**Verification**:
```bash
# Check that mitigation worked
[verification command]
# Expected result:
[what you should see]
```
**If mitigation fails**: [What to do next - usually escalate]
---
### Option B: [Cause 2]
[Same format as Option A]
---
## Rollback
**If mitigation makes things worse**:
```bash
# Rollback command 1
[command to undo action 1]
# Rollback command 2
[command to undo action 2]
```
---
## Verification & Monitoring
### Health Checks
After mitigation, verify these metrics return to normal:
```bash
# Check 1: Service health
# -i prints the response status line and headers along with the body
curl -i https://api.greyhaven.io/health
# Expected: HTTP 200 status line and body {"status": "healthy"}
# Check 2: Error rate
# Grafana: Error Rate dashboard
# Expected: <0.1%
# Check 3: Latency
# Grafana: API Latency dashboard
# Expected: p95 <500ms
```
### Monitoring Period
Monitor for **[time period]** after mitigation:
- [ ] Error rate stable (<0.1%)
- [ ] Latency normal (p95 <500ms)
- [ ] No new alerts
- [ ] User reports resolved
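
The thresholds in this checklist can also be evaluated mechanically once the values have been read from the health endpoint and the dashboards. The following is a minimal sketch that works on values already collected and makes no network or Grafana calls. The `HealthSample` type, the `failed_checks` function, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass

# Thresholds copied from the checklist above.
MAX_ERROR_RATE_PERCENT = 0.1
MAX_P95_LATENCY_MS = 500


@dataclass
class HealthSample:
    http_status: int           # status code returned by the health endpoint
    body_status: str           # value of the "status" field in the JSON body
    error_rate_percent: float  # read from the error rate dashboard
    p95_latency_ms: float      # read from the latency dashboard


def failed_checks(sample: HealthSample) -> list[str]:
    """Return the names of the checks that are not yet back to normal."""
    failures = []
    if sample.http_status != 200 or sample.body_status != "healthy":
        failures.append("service health")
    if sample.error_rate_percent >= MAX_ERROR_RATE_PERCENT:
        failures.append("error rate")
    if sample.p95_latency_ms >= MAX_P95_LATENCY_MS:
        failures.append("latency")
    return failures


# Hypothetical readings taken during the monitoring period.
recovered = HealthSample(200, "healthy", 0.04, 310.0)
degraded = HealthSample(200, "healthy", 0.8, 640.0)

print(failed_checks(recovered))
print(failed_checks(degraded))
```

An empty list means all three numeric checks pass. The "No new alerts" and "User reports resolved" items still need a human to confirm them.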
---
## Escalation
**Escalate if**:
- Mitigation doesn't work after [X] minutes
- Root cause unclear after diagnosis
- Issue is [severity] and unresolved after [X] minutes
- Multiple services affected
**Escalation Path**:
```
0-15 min: @oncall-engineer
15-30 min: @team-lead
30-60 min: @engineering-manager
60+ min: @vp-engineering (SEV1 only)
```
**Escalation Contacts**:
- Team Slack: #[team-channel]
- PagerDuty: [escalation policy]
- Oncall: @[oncall-alias]
---
## Common Mistakes
### Mistake 1: [Common Error]
**Wrong**:
```bash
[incorrect command or approach]
```
**Correct**:
```bash
[correct command or approach]
```
### Mistake 2: [Common Error]
[Description and correction]
---
## Related Documentation
- **Alert Definition**: [Link to alert config]
- **Monitoring Dashboard**: [Link to Grafana]
- **Architecture Doc**: [Link to system architecture]
- **Past Incidents**: [Links to similar incidents]
- **Postmortems**: [Links to related postmortems]
---
## Changelog
| Date | Author | Changes |
|------|--------|---------|
| [YYYY-MM-DD] | @[name] | Initial creation |
| [YYYY-MM-DD] | @[name] | Updated [what changed] |
---
## Testing Notes
**Last Test Date**: [YYYY-MM-DD]
**Test Result**: [Pass / Fail]
**Notes**: [What was learned from testing]
**How to Test**:
1. [Step to simulate failure in staging]
2. [Follow runbook]
3. [Verify recovery]
4. [Document time taken and any issues]
---
Return to [templates index](INDEX.md)