Postmortem Template

Table of Contents

  1. Workflow
  2. Postmortem Document Template
  3. Guidance for Each Section
  4. Quick Patterns
  5. Quality Checklist

Workflow

Copy this checklist and track your progress:

Postmortem Progress:
- [ ] Step 1: Assemble timeline and quantify impact
- [ ] Step 2: Conduct root cause analysis
- [ ] Step 3: Define corrective and preventive actions
- [ ] Step 4: Document and share postmortem
- [ ] Step 5: Track action items to completion

Step 1: Assemble timeline and quantify impact

Fill out Incident Summary and Timeline sections. Quantify impact in Impact section. Gather logs, metrics, alerts, and witness accounts.

Step 2: Conduct root cause analysis

Use 5 Whys or fishbone diagram to identify root causes. Fill out Root Cause Analysis section. See Guidance: Root Cause for techniques.

Step 3: Define corrective and preventive actions

For each root cause, identify actions. Fill out Corrective Actions section with SMART actions (specific, measurable, assigned, realistic, time-bound).

Step 4: Document and share postmortem

Complete What Went Well and Lessons Learned sections. Share with team, stakeholders, and broader organization. Present in meeting for discussion.

Step 5: Track action items to completion

Add actions to project tracker. Assign owners, set deadlines. Review progress weekly. Close postmortem when all actions complete. Use Quality Checklist to validate.


Postmortem Document Template

Incident Summary

Incident ID: [Unique identifier, e.g., INC-2024-001]
Date: [When incident occurred, e.g., 2024-03-05]
Duration: [How long, e.g., 2 hours 15 minutes]
Severity: [Critical / High / Medium / Low]
Status: [Resolved / Investigating / Monitoring]

Summary: [1-2 sentence description of what happened]

Impact Summary:

  • Users Affected: [Number or percentage]
  • Duration: [Downtime or degradation period]
  • Revenue Impact: [Estimated $ loss, if applicable]
  • SLA Impact: [SLA breach? Refunds owed?]
  • Reputation Impact: [Customer complaints, social media, press]

Owner: [Postmortem author]
Contributors: [Who participated in postmortem]
Date Written: [When postmortem was created]


Timeline

Detection to Resolution: [14:05 - 16:05 UTC (2 hours)]

| Time (UTC) | Event | Source | Action Taken |
|------------|-------|--------|--------------|
| 14:00 | [Normal operation] | - | - |
| 14:05 | [Deployment started] | CI/CD | [Config change deployed] |
| 14:07 | [Error spike detected] | Monitoring | [Errors jumped from 0 to 500/min] |
| 14:10 | [Alert fired] | PagerDuty | [On-call engineer paged] |
| 14:15 | [Engineer joins incident] | Slack | [Investigation begins] |
| 14:20 | [Root cause identified] | Logs | [Bad config found] |
| 14:25 | [Mitigation attempt #1] | Manual | [Tried config hotfix - failed] |
| 14:45 | [Mitigation attempt #2] | Manual | [Scaled up instances - no effect] |
| 15:30 | [Rollback decision] | Incident Commander | [Decided to roll back deployment] |
| 15:45 | [Rollback executed] | CI/CD | [Rolled back to previous version] |
| 16:05 | [Service restored] | Monitoring | [Errors returned to 0, traffic normal] |
| 16:30 | [All-clear declared] | Incident Commander | [Incident closed] |

Key Observations:

  • Detection lag: [X minutes between occurrence and detection]
  • Response time: [X minutes from alert to engineer joining]
  • Diagnosis time: [X minutes to identify root cause]
  • Mitigation time: [X minutes from mitigation start to resolution]
  • Total incident duration: [X hours Y minutes]

Impact

User Impact: [50K users (20%), complete unavailability, global]
Business: [$20K revenue loss, SLA breach (99.9% → 2hr down), $15K credits]
Operational: [150 support tickets, 10 person-hours incident response]
Reputation: [50 negative tweets, 3 Enterprise escalations]

| Metric | Baseline | During Incident | Post-Incident |
|--------|----------|-----------------|---------------|
| Uptime | 99.99% | 0% | 99.99% |
| Errors | <0.1% | 100% | <0.1% |
| Users | 250K | 0 | 240K |

Root Cause Analysis

5 Whys

Problem Statement: Database connection pool exhausted, causing 2-hour service outage affecting 50K users.

  1. Why did the service go down?

    • Database connection pool exhausted (all connections in use, new requests rejected)
  2. Why did the connection pool exhaust?

    • Connection pool size was set to 10, but traffic required 100+ connections
  3. Why was the pool size set to 10?

    • Configuration template had wrong value (10 instead of 100)
  4. Why did the template have the wrong value?

    • New team member created template without referencing production values
  5. Why didn't anyone catch this before deployment?

    • No staging environment with production-like load to test configs
    • No automated config validation in deployment pipeline
    • No peer review required for configuration changes

Root Causes Identified:

  1. Primary: Missing config validation in deployment pipeline
  2. Contributing: No staging environment with prod-like load
  3. Contributing: Unclear onboarding process for infrastructure changes
  4. Contributing: No peer review requirement for config changes

Contributing Factors

Technical:

  • Deployment pipeline lacks config validation
  • No staging environment matching production scale
  • Monitoring detected symptom (errors) but not root cause (pool exhaustion)

Process:

  • Onboarding doesn't cover production config management
  • No peer review process for infrastructure changes
  • Runbook for rollback was outdated, delayed recovery by 45 minutes

Organizational:

  • Pressure to deploy quickly (feature deadline) led to skipping validation steps
  • Team understaffed (3 people covering on-call for 50-person engineering org)

Corrective Actions

Immediate Fixes (Completed or in-flight):

  • Rollback to previous config (Completed 16:05 UTC)
  • Manually set connection pool to 100 on all instances (Completed 16:30 UTC)
  • Update deployment runbook with rollback steps (Owner: Sam, Due: Mar 8)

Short-Term Improvements (Next 2-4 weeks):

  • Add config validation to deployment pipeline (Owner: Alex, Due: Mar 15) - see the validation sketch after this list
    • Validate connection pool size is between 50-500
    • Validate all required config keys present
    • Fail deployment if validation fails
  • Create staging environment with production-like load (Owner: Jordan, Due: Mar 30)
    • 10% of prod traffic
    • Same config as prod
    • Automated smoke tests after each deployment
  • Require peer review for all infrastructure changes (Owner: Morgan, Due: Mar 10)
    • Update GitHub settings to require 1 approval
    • Add CODEOWNERS for infra configs
  • Improve monitoring to detect pool exhaustion (Owner: Casey, Due: Mar 20) - see the alert sketch after this list
    • Alert when pool utilization >80%
    • Dashboard showing pool metrics
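
A minimal sketch of the config validation step referenced above, assuming configs ship as JSON; the key names, entry point, and pool-size bounds (50-500) are assumptions drawn from this postmortem and would need to match the real config schema and pipeline:

```python
"""Hypothetical pre-deployment config check; a non-zero exit fails the pipeline."""
import json
import sys

REQUIRED_KEYS = {"db_connection_pool_size", "db_host", "db_port"}  # assumed key names
POOL_MIN, POOL_MAX = 50, 500  # bounds from the corrective action above


def validate_config(path: str) -> list[str]:
    with open(path) as f:
        config = json.load(f)
    errors = []
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        errors.append(f"missing required keys: {sorted(missing)}")
    pool_size = config.get("db_connection_pool_size")
    if pool_size is not None and not (isinstance(pool_size, int) and POOL_MIN <= pool_size <= POOL_MAX):
        errors.append(f"db_connection_pool_size={pool_size!r} outside [{POOL_MIN}, {POOL_MAX}]")
    return errors


if __name__ == "__main__":
    problems = validate_config(sys.argv[1])
    if problems:
        print("\n".join(problems))
        sys.exit(1)  # fail the deployment
```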
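
For the pool-exhaustion alert, a hedged sketch of the check behind the ">80% utilization" threshold; `get_pool_stats` and `page_oncall` are hypothetical stand-ins for whatever metrics and paging hooks the team actually uses:

```python
"""Hypothetical watcher that pages when pool utilization crosses the threshold."""
import time

UTILIZATION_THRESHOLD = 0.8  # from the corrective action above


def watch_pool(get_pool_stats, page_oncall, interval_seconds: int = 60) -> None:
    while True:
        in_use, pool_size = get_pool_stats()  # e.g. (85, 100)
        utilization = in_use / pool_size
        if utilization > UTILIZATION_THRESHOLD:
            page_oncall(f"DB pool at {utilization:.0%} ({in_use}/{pool_size} connections)")
        time.sleep(interval_seconds)
```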

Long-Term Investments (Next 1-3 months):

  • Implement canary deployments (Owner: Alex, Due: May 1) - see the canary gate sketch after this list
    • Deploy to 5% of traffic first
    • Auto-rollback if error rate >0.5%
  • Expand on-call rotation (Owner: Morgan, Due: Apr 15)
    • Hire 2 more SREs
    • Train 3 backend engineers for on-call
    • Reduce on-call burden from 1:3 to 1:8
  • Onboarding program for infrastructure changes (Owner: Jordan, Due: Apr 30)
    • Checklist for new team members
    • Buddy system for first 3 infra changes
    • Production access granted after completion
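
A minimal sketch of the canary gate referenced above: hold the new version at a small slice of traffic, then promote or roll back based on the 0.5% error-rate threshold. `fetch_error_rate`, `promote`, and `rollback` are hypothetical hooks into the team's deployment tooling:

```python
"""Hypothetical canary evaluation step run after the 5% rollout."""

ERROR_RATE_THRESHOLD = 0.005  # 0.5%, from the corrective action above


def evaluate_canary(fetch_error_rate, promote, rollback, window_minutes: int = 15) -> None:
    error_rate = fetch_error_rate(window_minutes)  # fraction of failed requests on the canary
    if error_rate > ERROR_RATE_THRESHOLD:
        rollback()  # auto-rollback: shift traffic back to the previous version
    else:
        promote()  # expand the rollout to 100%
```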

Tracking:

  • Actions added to JIRA project: INCIDENT-2024-001
  • Reviewed in weekly engineering standup
  • Postmortem closed when all actions complete (target: May 1)

What Went Well

Detection & Alerting:

  • ✓ Monitoring detected error spike within 3 minutes of occurrence
  • ✓ PagerDuty alert fired correctly, paged on-call within 5 minutes
  • ✓ Escalation path worked (on-call → senior eng → incident commander)

Incident Response:

  • ✓ Team assembled quickly (5 engineers joined within 10 minutes)
  • ✓ Clear incident commander (Morgan) coordinated response
  • ✓ Regular status updates posted to #incidents Slack channel every 15 min
  • ✓ Stakeholder communication: Execs notified within 20 min, status page updated

Communication:

  • ✓ Blameless tone maintained throughout (no finger-pointing)
  • ✓ External communication transparent (status page, Twitter updates)
  • ✓ Customer support briefed on issue and talking points provided

Technical:

  • ✓ Rollback process worked once initiated (though delayed by unclear runbook)
  • ✓ No data loss or corruption during incident
  • ✓ Automated alerts prevented a longer outage (manual detection alone could have taken hours)

Lessons Learned

What We Learned:

  1. Staging environments matter: Could have caught this with prod-like staging
  2. Config is code: Needs same rigor as code (validation, review, testing)
  3. Runbooks rot: Outdated runbook delayed recovery by 45 min - need to keep fresh
  4. Pressure kills quality: Rushing to hit deadline led to skipping validation
  5. Monitoring gaps: Detected symptom (errors) but not cause (pool exhaustion) - need better metrics

Knowledge Gaps Identified:

  • New team members don't know production config patterns
  • Not everyone knows how to roll back deployments
  • Connection pool sizing not well understood across team

Process Improvements Needed:

  • Peer review for all infrastructure changes (not just code)
  • Automated config validation before deployment
  • Regular runbook review/update (quarterly)
  • Staging environment for all changes

Cultural Observations:

  • Team responded well under pressure (good collaboration, clear roles)
  • Blameless culture held up (no one blamed new team member)
  • But: Pressure to ship fast is real and leads to shortcuts

Appendix

Related Incidents:

  • [INC-2024-002]: Similar config error 2 weeks later (different service)
  • [INC-2023-045]: Previous database connection issue (6 months ago)

References:

  • Logs: [Link to log aggregator query]
  • Metrics: [Link to Grafana dashboard]
  • Incident Slack thread: [#incidents, Mar 5 14:10-16:30]
  • Status page: [Link to status.company.com incident #123]
  • Customer impact: [Link to support ticket analysis]

Contacts:


Guidance for Each Section

Incident Summary

  • Severity Levels:
    • Critical: Total outage, data loss, security breach
    • High: Partial outage, significant degradation, many users affected
    • Medium: Minor degradation, limited users affected
    • Low: Minimal impact, caught before customer impact
  • Summary: 1-2 sentences max. Example: "Database connection pool exhaustion caused 2-hour outage affecting 50K users due to incorrect config deployment."

Timeline

  • Timestamps: Use UTC or clearly state timezone. Be specific (14:05, not "afternoon").
  • Key Events: Detection, escalation, investigation milestones, mitigation attempts, resolution.
  • Sources: Where info came from (monitoring, logs, person X, Slack, etc.).
  • Objectivity: Facts only, no blame. "Engineer X deployed" not "Engineer X mistakenly deployed".

Impact

  • Quantify: Numbers > adjectives. "50K users" not "many users". "$20K" not "significant".
  • Multiple Dimensions: Users, revenue, SLA, reputation, team time, opportunity cost.
  • Metrics Table: Before/during/after for key metrics shows impact clearly.

Root Cause

  • 5 Whys: Stop when you reach fixable systemic issue, not "person made mistake".
  • Multiple Causes: Most incidents have >1 cause. List primary + contributing.
  • System Focus: "Deployment pipeline allowed bad config" not "Person deployed bad config".

Corrective Actions

  • SMART: Specific (what exactly), Measurable (how verify), Assigned (who owns), Realistic (achievable), Time-bound (when done).
  • Categorize: Immediate (days), Short-term (weeks), Long-term (months).
  • Prioritize: High impact, low effort first. Don't create >10 actions (won't get done).
  • Track: Add to project tracker, review weekly, don't let languish.

What Went Well

  • Balance: Postmortems aren't just about failures; recognize what worked.
  • Examples: Good communication, fast response, effective teamwork, working alerting.
  • Build On: These are strengths to maintain and amplify.

Lessons Learned

  • Generalize: Extract principles applicable beyond this incident.
  • Share: These are org-wide learnings, share broadly.
  • Action: Connect lessons to corrective actions.

Quick Patterns

Production Outage Postmortem

Summary: [Service] down for [X hours] affecting [Y users] due to [root cause]
Timeline: Detection → Investigation → Mitigation attempts → Resolution
Impact: Users, revenue, SLA, reputation
Root Cause: 5 Whys to system issue
Actions: Immediate fixes, monitoring improvements, process changes, long-term infrastructure

Security Incident Postmortem

Summary: [Breach type] exposed [data/systems] due to [vulnerability]
Timeline: Breach → Detection (often delayed) → Containment → Remediation → Recovery
Impact: Data exposed, compliance risk, reputation, remediation cost
Root Cause: Missing security controls, access management, unpatched vulnerabilities
Actions: Security audit, access review, patch management, training, compliance reporting

Product Launch Failure Postmortem

Summary: [Feature] launched to [<X% adoption] vs [Y% expected] due to [reasons]
Timeline: Planning → Development → Launch → Metrics review → Pivot decision
Impact: Revenue miss, wasted effort, opportunity cost, team morale
Root Cause: Poor problem validation, wrong solution, inadequate testing, misaligned expectations
Actions: Improve discovery, user research, validation, stakeholder alignment

Project Deadline Miss Postmortem

Summary: [Project] delivered [X months late] due to [root causes]
Timeline: Kickoff → Milestones → Delays → Delivery
Impact: Customer impact, revenue delay, team burnout, opportunity cost
Root Cause: Poor estimation, scope creep, technical challenges, resource constraints
Actions: Improve estimation, scope control, technical spikes, resource planning

Quality Checklist

Before finalizing postmortem, verify:

Completeness:

  • Incident summary with severity, impact summary, dates
  • Timeline with timestamps, events, actions taken
  • Impact quantified (users, duration, revenue, metrics)
  • Root cause analysis (5 Whys or equivalent)
  • Corrective actions with SMART criteria (specific, measurable, assigned, realistic, time-bound)
  • What went well section
  • Lessons learned documented

Quality:

  • Blameless tone (focus on systems, not individuals)
  • Timeline uses specific timestamps (not vague times)
  • Impact quantified with numbers (not adjectives like "many")
  • Root cause goes beyond surface symptoms (asked "Why?" 5 times)
  • Actions are specific, owned, and time-bound (not vague "improve X")
  • Actions tracked in project management tool

Timeliness:

  • Postmortem written within 48 hours of incident resolution
  • Shared with team and stakeholders within 72 hours
  • Presented in team meeting for discussion
  • Action items in tracker with owners and deadlines

Learning:

  • Lessons learned extracted and generalized
  • Shared broadly for organizational learning
  • Action items address root causes (not just symptoms)
  • Follow-up review scheduled (2-4 weeks) to check progress

Overall Assessment:

  • Would completing all actions prevent a similar incident?
  • Are we attacking root causes or just symptoms?
  • Is tone blameless and constructive?
  • Will team actually complete these actions? (realistic, prioritized)