Postmortem: [INCIDENT TITLE]
Date: [YYYY-MM-DD]
Incident ID: INC-YYYY-MM-DD-XXX
Severity: [SEV1 / SEV2 / SEV3]
Author: [Name]
Reviewers: [Names]
Status: [Draft / Final]
Executive Summary
What Happened: [2-3 sentence summary of the incident]
Impact:
- Duration: [X] minutes/hours
- Users Affected: [All / X% / specific group]
- Revenue Impact: [$X estimated loss]
- SLA Breach: [Yes/No - details]
Root Cause: [1 sentence root cause]
Resolution: [1 sentence how it was fixed]
Key Actions: [3 most important action items]
Timeline
| Time (UTC) | Event | Notes |
|---|---|---|
| [HH:MM] | [Event] | [Context] |
| [HH:MM] | [Event] | [Context] |
| [HH:MM] | [Event] | [Context] |
Duration Breakdown:
- Detection → Identification: [X] minutes
- Identification → Mitigation: [X] minutes
- Mitigation → Full Resolution: [X] minutes
- Total MTTR (Detection → Full Resolution): [X] minutes
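The duration breakdown can be derived from the timeline timestamps rather than filled in by hand. The following is a minimal sketch, assuming four UTC timestamps (detection, identification, mitigation, full resolution) have been picked out of the timeline; the function name, dictionary keys, and sample times are hypothetical and only illustrate the arithmetic.

```python
from datetime import datetime


def duration_breakdown(detected, identified, mitigated, resolved):
    """Return the length of each phase in whole minutes.

    All four arguments are datetime objects in the same timezone (UTC in
    this template). Full datetimes are used instead of bare HH:MM values
    so that incidents spanning midnight are handled correctly.
    """
    def minutes(start, end):
        return int((end - start).total_seconds() // 60)

    return {
        "detection_to_identification": minutes(detected, identified),
        "identification_to_mitigation": minutes(identified, mitigated),
        "mitigation_to_resolution": minutes(mitigated, resolved),
        "total_mttr": minutes(detected, resolved),
    }


# Hypothetical timestamps, for illustration only.
breakdown = duration_breakdown(
    datetime(2024, 1, 15, 14, 5),
    datetime(2024, 1, 15, 14, 20),
    datetime(2024, 1, 15, 14, 50),
    datetime(2024, 1, 15, 15, 35),
)
for phase, mins in breakdown.items():
    print(f"{phase}: {mins} minutes")
```

With these sample values the three phases are 15, 30, and 45 minutes, and the total is 90 minutes, which is the sum of the phases.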
Root Cause Analysis (5 Whys)
Why 1: Why did [problem] happen? → [Answer based on facts]
Why 2: Why did [previous answer] happen? → [Answer based on facts]
Why 3: Why did [previous answer] happen? → [Answer based on facts]
Why 4: Why did [previous answer] happen? → [Answer based on facts]
Why 5: Why did [previous answer] happen? → [Answer based on facts]
ROOT CAUSE: [Final systemic issue identified]
Contributing Factors
Immediate Cause
[Direct technical cause of the incident]
Underlying Conditions
- [Condition that enabled the immediate cause]
- [Condition that enabled the immediate cause]
Latent Failures
- [Organizational/process weakness]
- [Organizational/process weakness]
What Went Well ✅
- [Something that worked well during response]
- [Something that worked well during response]
- [Something that worked well during response]
What Went Wrong ❌
- [Something that didn't work or was missing]
- [Something that didn't work or was missing]
- [Something that didn't work or was missing]
Action Items
| Priority | Action | Owner | Due Date | Status | Link |
|---|---|---|---|---|---|
| P0 | [Critical - do immediately] | @[name] | [Date] | [ ] | [Link] |
| P1 | [Important - do within 1 week] | @[name] | [Date] | [ ] | [Link] |
| P2 | [Nice to have - do within 1 month] | @[name] | [Date] | [ ] | [Link] |
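The priority windows in the table above (immediately, within 1 week, within 1 month) can be turned into latest acceptable due dates. This is a minimal sketch, assuming "immediately" means the same day and "1 month" means 30 days; both interpretations, along with the names `DUE_WINDOWS` and `latest_due_date`, are assumptions made for illustration and not part of the template.

```python
from datetime import date, timedelta

# Assumed mapping from priority to its maximum window.
# P0 = same day, P1 = 7 days, P2 = 30 days (an approximation of "1 month").
DUE_WINDOWS = {
    "P0": timedelta(days=0),
    "P1": timedelta(days=7),
    "P2": timedelta(days=30),
}


def latest_due_date(priority, postmortem_date):
    """Return the latest due date allowed for an action of this priority."""
    return postmortem_date + DUE_WINDOWS[priority]


# Hypothetical postmortem date, for illustration only.
review_day = date(2024, 1, 15)
for priority in ("P0", "P1", "P2"):
    print(priority, latest_due_date(priority, review_day).isoformat())
```

A check like this can flag action items whose Due Date column falls outside the window implied by their priority.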
P0 Actions (Immediate)
- [Action 1] - @[owner] - [due date]
- [Action 2] - @[owner] - [due date]
P1 Actions (Short-Term)
- [Action 1] - @[owner] - [due date]
- [Action 2] - @[owner] - [due date]
P2 Actions (Long-Term)
- [Action 1] - @[owner] - [due date]
- [Action 2] - @[owner] - [due date]
Lessons Learned
Technical Learnings
- [Technical insight gained]
- [Technical insight gained]
Process Learnings
- [Process improvement identified]
- [Process improvement identified]
Communication Learnings
- [Communication improvement identified]
- [Communication improvement identified]
Prevention Measures
Immediate (Completed)
- [What was done same day]
- [What was done same day]
Short-Term (1-2 weeks)
- [What will be done soon]
- [What will be done soon]
Long-Term (1-3 months)
- [What will be done eventually]
- [What will be done eventually]
Related Incidents
- [INC-YYYY-MM-DD-XXX] - [Brief description] - [Link]
- [INC-YYYY-MM-DD-XXX] - [Brief description] - [Link]
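Incident IDs in the `INC-YYYY-MM-DD-XXX` format can be checked before they are linked. The sketch below assumes `XXX` is a three-digit sequence number, which the template does not state; the function name `parse_incident_id` and the sample IDs are hypothetical.

```python
import re
from datetime import date

# Assumption: XXX is exactly three digits (a per-day sequence number).
INCIDENT_ID_PATTERN = re.compile(r"^INC-(\d{4})-(\d{2})-(\d{2})-(\d{3})$")


def parse_incident_id(incident_id):
    """Return (incident_date, sequence) for a valid ID, or None otherwise."""
    match = INCIDENT_ID_PATTERN.match(incident_id)
    if match is None:
        return None
    year, month, day, sequence = (int(part) for part in match.groups())
    try:
        incident_date = date(year, month, day)
    except ValueError:
        # The pattern matched, but the date itself is impossible (e.g. month 13).
        return None
    return incident_date, sequence


# Hypothetical IDs, for illustration only.
print(parse_incident_id("INC-2024-01-15-001"))
print(parse_incident_id("INC-2024-13-01-001"))
```

The first call returns the parsed date and sequence number; the second returns `None` because month 13 is not a valid date.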
Appendix
Relevant Logs
[Paste key log entries]
Metrics/Graphs
[Links to Grafana dashboards, screenshots]
Commands Run
[Commands that were used during investigation/mitigation]
Sign-Off
Incident Commander: [Name] - [Date]
Technical Lead: [Name] - [Date]
Engineering Manager: [Name] - [Date]
Postmortem Review: [Date/Time]
Attendees: [List of people who reviewed]