Files
gh-greyhaven-ai-claude-code…/skills/incident-response/templates/postmortem-template.md
2025-11-29 18:29:18 +08:00

188 lines
3.9 KiB
Markdown

# Postmortem: [INCIDENT TITLE]
**Date**: [YYYY-MM-DD]
**Incident ID**: INC-YYYY-MM-DD-XXX
**Severity**: [SEV1 / SEV2 / SEV3]
**Author**: [Name]
**Reviewers**: [Names]
**Status**: [Draft / Final]
---
## Executive Summary
**What Happened**: [2-3 sentence summary of the incident]
**Impact**:
- **Duration**: [X] minutes/hours
- **Users Affected**: [All / X% / specific group]
- **Revenue Impact**: [$X estimated loss]
- **SLA Breach**: [Yes/No - details]
**Root Cause**: [1 sentence root cause]
**Resolution**: [1 sentence how it was fixed]
**Key Actions**: [3 most important action items]
---
## Timeline
| Time (UTC) | Event | Notes |
|------------|-------|-------|
| [HH:MM] | [Event] | [Context] |
| [HH:MM] | [Event] | [Context] |
| [HH:MM] | [Event] | [Context] |
**Duration Breakdown**:
- Detection → Identification: [X] minutes
- Identification → Mitigation: [X] minutes
- Mitigation → Full Resolution: [X] minutes
- **Total MTTR**: [X] minutes
---
## Root Cause Analysis (5 Whys)
**Why 1**: Why did [problem] happen?
→ [Answer based on facts]
**Why 2**: Why did [previous answer] happen?
→ [Answer based on facts]
**Why 3**: Why did [previous answer] happen?
→ [Answer based on facts]
**Why 4**: Why did [previous answer] happen?
→ [Answer based on facts]
**Why 5**: Why did [previous answer] happen?
→ [Answer based on facts]
**ROOT CAUSE**: [Final systemic issue identified]
---
## Contributing Factors
### Immediate Cause
[Direct technical cause of the incident]
### Underlying Conditions
1. [Condition that enabled the immediate cause]
2. [Condition that enabled the immediate cause]
### Latent Failures
1. [Organizational/process weakness]
2. [Organizational/process weakness]
---
## What Went Well ✅
1. [Something that worked well during response]
2. [Something that worked well during response]
3. [Something that worked well during response]
---
## What Went Wrong ❌
1. [Something that didn't work or was missing]
2. [Something that didn't work or was missing]
3. [Something that didn't work or was missing]
---
## Action Items
| Priority | Action | Owner | Due Date | Status | Link |
|----------|--------|-------|----------|--------|------|
| P0 | [Critical - do immediately] | @[name] | [Date] | [ ] | [Link] |
| P1 | [Important - do within 1 week] | @[name] | [Date] | [ ] | [Link] |
| P2 | [Nice to have - do within 1 month] | @[name] | [Date] | [ ] | [Link] |
### P0 Actions (Immediate)
- [ ] [Action 1] - @[owner] - [due date]
- [ ] [Action 2] - @[owner] - [due date]
### P1 Actions (Short-Term)
- [ ] [Action 1] - @[owner] - [due date]
- [ ] [Action 2] - @[owner] - [due date]
### P2 Actions (Long-Term)
- [ ] [Action 1] - @[owner] - [due date]
- [ ] [Action 2] - @[owner] - [due date]
---
## Lessons Learned
### Technical Learnings
1. [Technical insight gained]
2. [Technical insight gained]
### Process Learnings
1. [Process improvement identified]
2. [Process improvement identified]
### Communication Learnings
1. [Communication improvement identified]
2. [Communication improvement identified]
---
## Prevention Measures
### Immediate (Completed)
- [x] [What was done same day]
- [x] [What was done same day]
### Short-Term (1-2 weeks)
- [ ] [What will be done soon]
- [ ] [What will be done soon]
### Long-Term (1-3 months)
- [ ] [What will be done eventually]
- [ ] [What will be done eventually]
---
## Related Incidents
- [INC-YYYY-MM-DD-XXX] - [Brief description] - [Link]
- [INC-YYYY-MM-DD-XXX] - [Brief description] - [Link]
---
## Appendix
### Relevant Logs
```
[Paste key log entries]
```
### Metrics/Graphs
[Links to Grafana dashboards, screenshots]
### Commands Run
```bash
[Commands that were used during investigation/mitigation]
```
---
## Sign-Off
**Incident Commander**: [Name] - [Date]
**Technical Lead**: [Name] - [Date]
**Engineering Manager**: [Name] - [Date]
**Postmortem Review**: [Date/Time]
**Attendees**: [List of people who reviewed]
---
Return to [templates index](INDEX.md)