# Postmortem: [INCIDENT TITLE]

**Date**: [YYYY-MM-DD]
**Incident ID**: INC-YYYY-MM-DD-XXX
**Severity**: [SEV1 / SEV2 / SEV3]
**Author**: [Name]
**Reviewers**: [Names]
**Status**: [Draft / Final]

---

## Executive Summary

**What Happened**: [2-3 sentence summary of the incident]

**Impact**:
- **Duration**: [X] minutes/hours
- **Users Affected**: [All / X% / specific group]
- **Revenue Impact**: [$X estimated loss]
- **SLA Breach**: [Yes/No - details]

**Root Cause**: [1 sentence root cause]

**Resolution**: [1 sentence how it was fixed]

**Key Actions**: [3 most important action items]

---

## Timeline

| Time (UTC) | Event | Notes |
|------------|-------|-------|
| [HH:MM] | [Event] | [Context] |
| [HH:MM] | [Event] | [Context] |
| [HH:MM] | [Event] | [Context] |

**Duration Breakdown**:
- Detection → Identification: [X] minutes
- Identification → Mitigation: [X] minutes
- Mitigation → Full Resolution: [X] minutes
- **Total MTTR**: [X] minutes

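
The duration breakdown above can be filled in from the timeline timestamps. The following is a minimal sketch, not part of the template itself: it assumes the four milestone times are taken from the Timeline table as `HH:MM` strings in UTC, that the incident lasted less than 24 hours, and that the total is the sum of the three phases as the list suggests. The function names and the example timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both given as 'HH:MM' (UTC).

    Assumption: if end is earlier than start, the incident crossed midnight.
    """
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        delta += timedelta(days=1)
    return int(delta.total_seconds() // 60)


def duration_breakdown(detected: str, identified: str,
                       mitigated: str, resolved: str) -> dict[str, int]:
    """Compute the three phase durations and their sum, in minutes."""
    phases = {
        "Detection → Identification": minutes_between(detected, identified),
        "Identification → Mitigation": minutes_between(identified, mitigated),
        "Mitigation → Full Resolution": minutes_between(mitigated, resolved),
    }
    # Assumption: the total is the sum of the three phases listed above.
    phases["Total MTTR"] = sum(phases.values())
    return phases


# Hypothetical timestamps, for illustration only.
breakdown = duration_breakdown("14:05", "14:20", "14:50", "15:35")
for phase, minutes in breakdown.items():
    print(f"- {phase}: {minutes} minutes")
```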

---

## Root Cause Analysis (5 Whys)

**Why 1**: Why did [problem] happen?
→ [Answer based on facts]

**Why 2**: Why did [previous answer] happen?
→ [Answer based on facts]

**Why 3**: Why did [previous answer] happen?
→ [Answer based on facts]

**Why 4**: Why did [previous answer] happen?
→ [Answer based on facts]

**Why 5**: Why did [previous answer] happen?
→ [Answer based on facts]

**ROOT CAUSE**: [Final systemic issue identified]

---

## Contributing Factors

### Immediate Cause
[Direct technical cause of the incident]

### Underlying Conditions
1. [Condition that enabled the immediate cause]
2. [Condition that enabled the immediate cause]

### Latent Failures
1. [Organizational/process weakness]
2. [Organizational/process weakness]

---

## What Went Well ✅

1. [Something that worked well during response]
2. [Something that worked well during response]
3. [Something that worked well during response]

---

## What Went Wrong ❌

1. [Something that didn't work or was missing]
2. [Something that didn't work or was missing]
3. [Something that didn't work or was missing]

---

## Action Items

| Priority | Action | Owner | Due Date | Status | Link |
|----------|--------|-------|----------|--------|------|
| P0 | [Critical - do immediately] | @[name] | [Date] | [ ] | [Link] |
| P1 | [Important - do within 1 week] | @[name] | [Date] | [ ] | [Link] |
| P2 | [Nice to have - do within 1 month] | @[name] | [Date] | [ ] | [Link] |

### P0 Actions (Immediate)
- [ ] [Action 1] - @[owner] - [due date]
- [ ] [Action 2] - @[owner] - [due date]

### P1 Actions (Short-Term)
- [ ] [Action 1] - @[owner] - [due date]
- [ ] [Action 2] - @[owner] - [due date]

### P2 Actions (Long-Term)
- [ ] [Action 1] - @[owner] - [due date]
- [ ] [Action 2] - @[owner] - [due date]

---

## Lessons Learned

### Technical Learnings
1. [Technical insight gained]
2. [Technical insight gained]

### Process Learnings
1. [Process improvement identified]
2. [Process improvement identified]

### Communication Learnings
1. [Communication improvement identified]
2. [Communication improvement identified]

---

## Prevention Measures

### Immediate (Completed)
- [x] [What was done same day]
- [x] [What was done same day]

### Short-Term (1-2 weeks)
- [ ] [What will be done soon]
- [ ] [What will be done soon]

### Long-Term (1-3 months)
- [ ] [What will be done eventually]
- [ ] [What will be done eventually]

---

## Related Incidents

- [INC-YYYY-MM-DD-XXX] - [Brief description] - [Link]
- [INC-YYYY-MM-DD-XXX] - [Brief description] - [Link]

---

## Appendix

### Relevant Logs
```
[Paste key log entries]
```

### Metrics/Graphs
[Links to Grafana dashboards, screenshots]

### Commands Run
```bash
[Commands that were used during investigation/mitigation]
```

---

## Sign-Off

**Incident Commander**: [Name] - [Date]
**Technical Lead**: [Name] - [Date]
**Engineering Manager**: [Name] - [Date]

**Postmortem Review**: [Date/Time]
**Attendees**: [List of people who reviewed]

---

Return to [templates index](INDEX.md)