# Postmortem Template

## Table of Contents

1. [Workflow](#workflow)
2. [Postmortem Document Template](#postmortem-document-template)
3. [Guidance for Each Section](#guidance-for-each-section)
4. [Quick Patterns](#quick-patterns)
5. [Quality Checklist](#quality-checklist)

## Workflow

Copy this checklist and track your progress:

```
Postmortem Progress:
- [ ] Step 1: Assemble timeline and quantify impact
- [ ] Step 2: Conduct root cause analysis
- [ ] Step 3: Define corrective and preventive actions
- [ ] Step 4: Document and share postmortem
- [ ] Step 5: Track action items to completion
```

**Step 1: Assemble timeline and quantify impact**

Fill out the [Incident Summary](#incident-summary) and [Timeline](#timeline) sections, and quantify impact in the [Impact](#impact) section. Gather logs, metrics, alerts, and witness accounts.

**Step 2: Conduct root cause analysis**

Use 5 Whys or a fishbone diagram to identify root causes. Fill out the [Root Cause Analysis](#root-cause-analysis) section. See [Guidance: Root Cause](#root-cause) for techniques.

**Step 3: Define corrective and preventive actions**

For each root cause, identify actions. Fill out the [Corrective Actions](#corrective-actions) section with SMART actions (specific, measurable, assigned, realistic, time-bound).

**Step 4: Document and share postmortem**

Complete the [What Went Well](#what-went-well) and [Lessons Learned](#lessons-learned) sections. Share with the team, stakeholders, and the broader organization. Present in a meeting for discussion.

**Step 5: Track action items to completion**

Add actions to the project tracker. Assign owners, set deadlines. Review progress weekly. Close the postmortem when all actions are complete. Use the [Quality Checklist](#quality-checklist) to validate.

---

## Postmortem Document Template

### Incident Summary

**Incident ID**: [Unique identifier, e.g., INC-2024-001]
**Date**: [When incident occurred, e.g., 2024-03-05]
**Duration**: [How long, e.g., 2 hours 15 minutes]
**Severity**: [Critical / High / Medium / Low]
**Status**: [Resolved / Investigating / Monitoring]

**Summary**: [1-2 sentence description of what happened]

**Impact Summary**:
- **Users Affected**: [Number or percentage]
- **Duration**: [Downtime or degradation period]
- **Revenue Impact**: [Estimated $ loss, if applicable]
- **SLA Impact**: [SLA breach? Refunds owed?]
- **Reputation Impact**: [Customer complaints, social media, press]

**Owner**: [Postmortem author]
**Contributors**: [Who participated in the postmortem]
**Date Written**: [When the postmortem was created]

---

### Timeline

**Detection to Resolution**: [14:05 - 16:05 UTC (2 hours)]

| Time (UTC) | Event | Source | Action Taken |
|------------|-------|--------|--------------|
| 14:00 | [Normal operation] | - | - |
| 14:05 | [Deployment started] | CI/CD | [Config change deployed] |
| 14:07 | [Error spike detected] | Monitoring | [Errors jumped from 0 to 500/min] |
| 14:10 | [Alert fired] | PagerDuty | [On-call engineer paged] |
| 14:15 | [Engineer joins incident] | Slack | [Investigation begins] |
| 14:20 | [Root cause identified] | Logs | [Bad config found] |
| 14:25 | [Mitigation attempt #1] | Manual | [Tried config hotfix - failed] |
| 14:45 | [Mitigation attempt #2] | Manual | [Scaled up instances - no effect] |
| 15:30 | [Rollback decision] | Incident Commander | [Decided to roll back deployment] |
| 15:45 | [Rollback executed] | CI/CD | [Rolled back to previous version] |
| 16:05 | [Service restored] | Monitoring | [Errors returned to 0, traffic normal] |
| 16:30 | [All-clear declared] | Incident Commander | [Incident closed] |

**Key Observations** (one way to compute these is sketched after this list):
- Detection lag: [X minutes between occurrence and detection]
- Response time: [X minutes from alert to engineer joining]
- Diagnosis time: [X minutes to identify root cause]
- Mitigation time: [X minutes from mitigation start to resolution]
- Total incident duration: [X hours Y minutes]

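As an illustration only (not part of the template itself), here is a minimal Python sketch that derives these observation metrics from incident timestamps. The values below are the example times from the timeline table above; replace them with your incident's actual times.

```python
from datetime import datetime, timezone

def ts(hhmm: str) -> datetime:
    """Parse an HH:MM string as a UTC time on the (example) incident date."""
    return datetime.strptime(f"2024-03-05 {hhmm}", "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)

def minutes(delta) -> int:
    return round(delta.total_seconds() / 60)

# Example values taken from the timeline table above.
occurred   = ts("14:05")  # bad config deployed
detected   = ts("14:07")  # error spike visible in monitoring
alerted    = ts("14:10")  # PagerDuty page fired
responded  = ts("14:15")  # on-call engineer joined the incident
diagnosed  = ts("14:20")  # root cause identified
mitigating = ts("14:25")  # first mitigation attempt
restored   = ts("16:05")  # service restored

print(f"Detection lag:   {minutes(detected - occurred)} min")
print(f"Response time:   {minutes(responded - alerted)} min")
print(f"Diagnosis time:  {minutes(diagnosed - responded)} min")
print(f"Mitigation time: {minutes(restored - mitigating)} min")
print(f"Total duration:  {minutes(restored - occurred)} min")
```
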
---

### Impact

**User Impact**: [50K users (20%), complete unavailability, global]
**Business**: [$20K revenue loss, SLA breach (99.9% → 2hr down), $15K credits] (see the error-budget sketch below the table)
**Operational**: [150 support tickets, 10 person-hours incident response]
**Reputation**: [50 negative tweets, 3 Enterprise escalations]

| Metric | Baseline | During | Post |
|--------|----------|--------|------|
| Uptime | 99.99% | 0% | 99.99% |
| Errors | <0.1% | 100% | <0.1% |
| Users | 250K | 0 | 240K |

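To make the SLA line concrete, here is a small illustrative calculation, assuming the 99.9% monthly availability SLA mentioned in the Business line and a 30-day month, showing why a 2-hour outage is a breach:

```python
# Error-budget arithmetic for an assumed 99.9% monthly availability SLA (illustrative only).
sla = 0.999
minutes_per_month = 30 * 24 * 60               # 43,200 minutes in a 30-day month
error_budget = (1 - sla) * minutes_per_month   # 43.2 minutes of allowed downtime per month
outage_minutes = 2 * 60                        # this incident: roughly 120 minutes of downtime

print(f"Monthly error budget: {error_budget:.1f} min")
print(f"Outage consumed:      {outage_minutes} min "
      f"({outage_minutes / error_budget:.1f}x the monthly budget)")
# A single 2-hour outage burns ~2.8x the entire monthly budget, hence the SLA breach and credits.
```
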
---

### Root Cause Analysis

#### 5 Whys

**Problem Statement**: Database connection pool exhausted, causing a 2-hour service outage affecting 50K users.

1. **Why did the service go down?**
   - Database connection pool exhausted (all connections in use, new requests rejected)

2. **Why did the connection pool exhaust?**
   - Connection pool size was set to 10, but traffic required 100+ connections

3. **Why was the pool size set to 10?**
   - Configuration template had the wrong value (10 instead of 100)

4. **Why did the template have the wrong value?**
   - A new team member created the template without referencing production values

5. **Why didn't anyone catch this before deployment?**
   - No staging environment with production-like load to test configs
   - No automated config validation in the deployment pipeline
   - No peer review required for configuration changes

**Root Causes Identified**:
1. **Primary**: Missing config validation in the deployment pipeline
2. **Contributing**: No staging environment with prod-like load
3. **Contributing**: Unclear onboarding process for infrastructure changes
4. **Contributing**: No peer review requirement for config changes

#### Contributing Factors

**Technical**:
- Deployment pipeline lacks config validation
- No staging environment matching production scale
- Monitoring detected the symptom (errors) but not the root cause (pool exhaustion)

**Process**:
- Onboarding doesn't cover production config management
- No peer review process for infrastructure changes
- Runbook for rollback was outdated, which delayed recovery by 45 minutes

**Organizational**:
- Pressure to deploy quickly (feature deadline) led to skipping validation steps
- Team understaffed (3 people covering on-call for a 50-person engineering org)

---

### Corrective Actions

**Immediate Fixes** (Completed or in-flight):
- [x] Roll back to previous config (Completed 16:05 UTC)
- [x] Manually set connection pool to 100 on all instances (Completed 16:30 UTC)
- [ ] Update deployment runbook with rollback steps (Owner: Sam, Due: Mar 8)

**Short-Term Improvements** (Next 2-4 weeks):
- [ ] Add config validation to the deployment pipeline (Owner: Alex, Due: Mar 15; see the validation sketch after this list)
  - Validate connection pool size is between 50-500
  - Validate all required config keys are present
  - Fail the deployment if validation fails
- [ ] Create staging environment with production-like load (Owner: Jordan, Due: Mar 30)
  - 10% of prod traffic
  - Same config as prod
  - Automated smoke tests after each deployment
- [ ] Require peer review for all infrastructure changes (Owner: Morgan, Due: Mar 10)
  - Update GitHub settings to require 1 approval
  - Add CODEOWNERS for infra configs
- [ ] Improve monitoring to detect pool exhaustion (Owner: Casey, Due: Mar 20)
  - Alert when pool utilization >80%
  - Dashboard showing pool metrics

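As one illustration of the config-validation action above, here is a minimal, hypothetical Python check a deployment pipeline could run before rollout. The key names (`db.pool_size`, `db.host`, `db.timeout_seconds`) and the bounds are assumptions for the example, not the team's actual schema.

```python
import sys

# Hypothetical required keys and bounds; adjust to the real config schema.
REQUIRED_KEYS = {"db.pool_size", "db.host", "db.timeout_seconds"}
POOL_SIZE_RANGE = (50, 500)

def validate_config(config: dict) -> list:
    """Return a list of validation errors; an empty list means the config passes."""
    errors = [f"missing required key: {key}" for key in sorted(REQUIRED_KEYS - config.keys())]
    pool_size = config.get("db.pool_size")
    if isinstance(pool_size, int) and not POOL_SIZE_RANGE[0] <= pool_size <= POOL_SIZE_RANGE[1]:
        errors.append(f"db.pool_size={pool_size} is outside the allowed range {POOL_SIZE_RANGE}")
    return errors

if __name__ == "__main__":
    # The config that caused this incident would fail here (pool size 10 < 50).
    candidate = {"db.pool_size": 10, "db.host": "db.internal", "db.timeout_seconds": 30}
    problems = validate_config(candidate)
    for problem in problems:
        print(f"CONFIG VALIDATION FAILED: {problem}")
    sys.exit(1 if problems else 0)  # a non-zero exit code fails the deployment step
```

In a real pipeline this would run as a pre-deploy step and block the rollout whenever it exits non-zero.
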

**Long-Term Investments** (Next 1-3 months):
- [ ] Implement canary deployments (Owner: Alex, Due: May 1; see the evaluation sketch after this list)
  - Deploy to 5% of traffic first
  - Auto-rollback if error rate >0.5%
- [ ] Expand on-call rotation (Owner: Morgan, Due: Apr 15)
  - Hire 2 more SREs
  - Train 3 backend engineers for on-call
  - Reduce on-call burden from 1:3 to 1:8
- [ ] Onboarding program for infrastructure changes (Owner: Jordan, Due: Apr 30)
  - Checklist for new team members
  - Buddy system for first 3 infra changes
  - Production access granted after completion

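A hedged sketch of the canary evaluation logic described above. The 5% traffic split and 0.5% error-rate threshold come from the action item; the metric-fetching function is a stand-in you would wire to your actual monitoring system.

```python
CANARY_TRAFFIC_PERCENT = 5      # portion of traffic routed to the new version first
ERROR_RATE_THRESHOLD = 0.005    # 0.5%: above this, the canary is rolled back automatically

def fetch_error_rate(deployment: str) -> float:
    """Stand-in for a query to the metrics backend; returns the error rate as a fraction."""
    return 0.012  # example value (1.2% errors), which triggers a rollback below

def evaluate_canary(deployment: str) -> str:
    """Decide whether to promote or roll back a canary based on its observed error rate."""
    error_rate = fetch_error_rate(deployment)
    if error_rate > ERROR_RATE_THRESHOLD:
        return "rollback"   # auto-rollback, as described in the action item above
    return "promote"        # healthy canary: continue the rollout beyond the initial 5%

if __name__ == "__main__":
    print(evaluate_canary("api-v2-canary"))  # -> "rollback" with the example error rate
```
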

**Tracking**:
- Actions added to JIRA project: INCIDENT-2024-001
- Reviewed in the weekly engineering standup
- Postmortem closed when all actions are complete (target: May 1)

---

### What Went Well

**Detection & Alerting**:
- ✓ Monitoring detected the error spike within 3 minutes of occurrence
- ✓ PagerDuty alert fired correctly and paged on-call within 5 minutes
- ✓ Escalation path worked (on-call → senior eng → incident commander)

**Incident Response**:
- ✓ Team assembled quickly (5 engineers joined within 10 minutes)
- ✓ Clear incident commander (Morgan) coordinated the response
- ✓ Regular status updates posted to the #incidents Slack channel every 15 min
- ✓ Stakeholder communication: execs notified within 20 min, status page updated

**Communication**:
- ✓ Blameless tone maintained throughout (no finger-pointing)
- ✓ External communication was transparent (status page, Twitter updates)
- ✓ Customer support briefed on the issue and provided with talking points

**Technical**:
- ✓ Rollback process worked once initiated (though delayed by an unclear runbook)
- ✓ No data loss or corruption during the incident
- ✓ Automated alerts prevented a longer outage (manual detection could have taken hours)

---

### Lessons Learned

**What We Learned**:
1. **Staging environments matter**: Could have caught this with prod-like staging
2. **Config is code**: Needs the same rigor as code (validation, review, testing)
3. **Runbooks rot**: Outdated runbook delayed recovery by 45 min; keep them fresh
4. **Pressure kills quality**: Rushing to hit a deadline led to skipping validation
5. **Monitoring gaps**: Detected the symptom (errors) but not the cause (pool exhaustion); need better metrics

**Knowledge Gaps Identified**:
- New team members don't know production config patterns
- Not everyone knows how to roll back deployments
- Connection pool sizing is not well understood across the team

**Process Improvements Needed**:
- Peer review for all infrastructure changes (not just code)
- Automated config validation before deployment
- Regular runbook review/update (quarterly)
- Staging environment for all changes

**Cultural Observations**:
- Team responded well under pressure (good collaboration, clear roles)
- Blameless culture held up (no one blamed the new team member)
- But: pressure to ship fast is real and leads to shortcuts

---

### Appendix

**Related Incidents**:
- [INC-2024-002]: Similar config error 2 weeks later (different service)
- [INC-2023-045]: Previous database connection issue (6 months ago)

**References**:
- Logs: [Link to log aggregator query]
- Metrics: [Link to Grafana dashboard]
- Incident Slack thread: [#incidents, Mar 5 14:10-16:30]
- Status page: [Link to status.company.com incident #123]
- Customer impact: [Link to support ticket analysis]

**Contacts**:
- Incident Commander: Morgan (morgan@company.com)
- Postmortem Author: Alex (alex@company.com)
- Escalation: VP Engineering (vpe@company.com)

---

## Guidance for Each Section

### Incident Summary
- **Severity Levels**:
  - Critical: Total outage, data loss, security breach
  - High: Partial outage, significant degradation, many users affected
  - Medium: Minor degradation, limited users affected
  - Low: Minimal impact, caught before customer impact
- **Summary**: 1-2 sentences max. Example: "Database connection pool exhaustion caused a 2-hour outage affecting 50K users due to an incorrect config deployment."

### Timeline
- **Timestamps**: Use UTC or clearly state the timezone. Be specific (14:05, not "afternoon").
- **Key Events**: Detection, escalation, investigation milestones, mitigation attempts, resolution.
- **Sources**: Where the info came from (monitoring, logs, person X, Slack, etc.).
- **Objectivity**: Facts only, no blame. "Engineer X deployed", not "Engineer X mistakenly deployed".

### Impact
- **Quantify**: Numbers > adjectives. "50K users", not "many users". "$20K", not "significant".
- **Multiple Dimensions**: Users, revenue, SLA, reputation, team time, opportunity cost.
- **Metrics Table**: Before/during/after values for key metrics show the impact clearly.

### Root Cause
- **5 Whys**: Stop when you reach a fixable systemic issue, not "person made a mistake".
- **Multiple Causes**: Most incidents have more than one cause. List primary + contributing.
- **System Focus**: "Deployment pipeline allowed bad config", not "Person deployed bad config".

### Corrective Actions
- **SMART**: Specific (what exactly), Measurable (how to verify), Assigned (who owns), Realistic (achievable), Time-bound (when done). A structured sketch of such an action item follows this list.
- **Categorize**: Immediate (days), Short-term (weeks), Long-term (months).
- **Prioritize**: High impact, low effort first. Don't create more than 10 actions (they won't get done).
- **Track**: Add to the project tracker, review weekly, don't let actions languish.

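As an optional illustration of the SMART criteria above, here is a hypothetical Python structure for an action item. The field names and the effort cap are assumptions, not a prescribed schema, but a structure like this makes it easy to spot a vague action (no owner, no check, no deadline) before it enters the tracker.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    """A corrective action shaped to satisfy the SMART criteria."""
    description: str      # Specific: what exactly will change
    success_check: str    # Measurable: how we verify it is done
    owner: str            # Assigned: a single accountable person
    due: date             # Time-bound: a concrete deadline
    effort_days: int      # Realistic: rough effort, used to sanity-check scope

    def is_smart(self) -> bool:
        """Reject vague or unowned actions (the 20-day cap is an arbitrary example threshold)."""
        return bool(self.description and self.success_check and self.owner) and self.effort_days <= 20

# Example drawn from this postmortem's short-term improvements:
validation = ActionItem(
    description="Add config validation to the deployment pipeline",
    success_check="Deployment fails when db.pool_size is outside 50-500",
    owner="Alex",
    due=date(2024, 3, 15),
    effort_days=5,
)
assert validation.is_smart()
```
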
### What Went Well
- **Balance**: Postmortems aren't just about failures; recognize what worked.
- **Examples**: Good communication, fast response, effective teamwork, working alerting.
- **Build On**: These are strengths to maintain and amplify.

### Lessons Learned
- **Generalize**: Extract principles applicable beyond this incident.
- **Share**: These are org-wide learnings; share them broadly.
- **Action**: Connect lessons to corrective actions.

---

## Quick Patterns

### Production Outage Postmortem
```
Summary: [Service] down for [X hours] affecting [Y users] due to [root cause]
Timeline: Detection → Investigation → Mitigation attempts → Resolution
Impact: Users, revenue, SLA, reputation
Root Cause: 5 Whys to system issue
Actions: Immediate fixes, monitoring improvements, process changes, long-term infrastructure
```

### Security Incident Postmortem
```
Summary: [Breach type] exposed [data/systems] due to [vulnerability]
Timeline: Breach → Detection (often delayed) → Containment → Remediation → Recovery
Impact: Data exposed, compliance risk, reputation, remediation cost
Root Cause: Missing security controls, access management, unpatched vulnerabilities
Actions: Security audit, access review, patch management, training, compliance reporting
```

### Product Launch Failure Postmortem
```
Summary: [Feature] launched to [<X% adoption] vs [Y% expected] due to [reasons]
Timeline: Planning → Development → Launch → Metrics review → Pivot decision
Impact: Revenue miss, wasted effort, opportunity cost, team morale
Root Cause: Poor problem validation, wrong solution, inadequate testing, misaligned expectations
Actions: Improve discovery, user research, validation, stakeholder alignment
```

### Project Deadline Miss Postmortem
```
Summary: [Project] delivered [X months late] due to [root causes]
Timeline: Kickoff → Milestones → Delays → Delivery
Impact: Customer impact, revenue delay, team burnout, opportunity cost
Root Cause: Poor estimation, scope creep, technical challenges, resource constraints
Actions: Improve estimation, scope control, technical spikes, resource planning
```

---

## Quality Checklist

Before finalizing the postmortem, verify:

**Completeness**:
- [ ] Incident summary with severity, impact summary, dates
- [ ] Timeline with timestamps, events, actions taken
- [ ] Impact quantified (users, duration, revenue, metrics)
- [ ] Root cause analysis (5 Whys or equivalent)
- [ ] Corrective actions with SMART criteria (specific, measurable, assigned, realistic, time-bound)
- [ ] What went well section
- [ ] Lessons learned documented

**Quality**:
- [ ] Blameless tone (focus on systems, not individuals)
- [ ] Timeline uses specific timestamps (not vague times)
- [ ] Impact quantified with numbers (not adjectives like "many")
- [ ] Root cause goes beyond surface symptoms (asked "Why?" 5 times)
- [ ] Actions are specific, owned, and time-bound (not vague "improve X")
- [ ] Actions tracked in a project management tool

**Timeliness**:
- [ ] Postmortem written within 48 hours of incident resolution
- [ ] Shared with team and stakeholders within 72 hours
- [ ] Presented in a team meeting for discussion
- [ ] Action items in the tracker with owners and deadlines

**Learning**:
- [ ] Lessons learned extracted and generalized
- [ ] Shared broadly for organizational learning
- [ ] Action items address root causes (not just symptoms)
- [ ] Follow-up review scheduled (2-4 weeks out) to check progress

**Overall Assessment**:
- [ ] Would all actions, once completed, prevent a similar incident?
- [ ] Are we attacking root causes or just symptoms?
- [ ] Is the tone blameless and constructive?
- [ ] Will the team actually complete these actions? (realistic, prioritized)