Postmortem Methodology
Table of Contents
- Blameless Culture
- Root Cause Analysis Techniques
- Corrective Action Frameworks
- Incident Response Patterns
- Postmortem Facilitation
- Organizational Learning
1. Blameless Culture
Core Principles
Humans Err, Systems Should Be Resilient
- Assumption: People will make mistakes. Design systems to tolerate errors.
- Example: Deployment requires 2 approvals → reduces error likelihood
- Example: Canary deployments → limits error blast radius
Second Victim Phenomenon
- First victim: Customers affected by incident
- Second victim: Engineer whose action triggered the incident (guilt, anxiety, fear)
- Blameful culture: Second victim punished → hides future issues, leaves company
- Blameless culture: Learn together → transparency, improvement
Just Culture vs Blameless
- Blameless: Focus on system improvements, not individual accountability
- Just Culture: Distinguish reckless behavior (punish) from honest mistakes (learn)
- Example Reckless: Intentionally bypassing safeguards, ignoring warnings
- Example Honest: Misunderstanding, reasonable decision with information at hand
Language Patterns
Blameful → Blameless Reframing:
- ❌ "Engineer X caused outage" → ✓ "Deployment process allowed bad config to reach production"
- ❌ "PM didn't validate" → ✓ "Requirements validation process was missing"
- ❌ "Designer made error" → ✓ "Design review didn't catch accessibility issue"
- ❌ "Careless mistake" → ✓ "System lacked error detection at this step"
System-Focused Questions:
- "What allowed this to happen?" (not "Who did this?")
- "What gaps in our process enabled this?" (not "Why didn't you check?")
- "How can we prevent recurrence?" (not "How do we prevent X from doing this again?")
Building Blameless Culture
Leadership Modeling:
- Leaders admit own mistakes publicly
- Leaders ask "How did our systems fail?" not "Who messed up?"
- Leaders reward transparency (sharing incidents, near-misses)
Psychological Safety:
- Safe to raise concerns, report issues, admit mistakes
- No punishment for honest errors (distinguish from recklessness)
- Celebrate learning, not just success
Metrics to Track:
- Near-miss reporting rate (high = good, means people feel safe reporting)
- Postmortem completion rate (all incidents get postmortem)
- Action item completion rate (learnings turn into improvements)
- Repeat incident rate (same root cause recurring = not learning)
2. Root Cause Analysis Techniques
5 Whys
Process:
- State problem clearly
- Ask "Why?" → Answer
- Ask "Why?" on answer → Next answer
- Repeat until root cause (fixable at system/org level)
- Typically 5 iterations, but can be 3-7
Example:
Problem: Website down
- Why? Database connection failed
- Why? Connection pool exhausted
- Why? Pool size too small (10 vs 100 needed)
- Why? Config template had wrong value
- Why? No validation in deployment pipeline
Root Cause: Deployment pipeline lacked config validation (fixable!)
Tips:
- Avoid "human error" as an answer; keep asking why the error was possible
- Stop when you reach an actionable system change
- Multiple causal paths may emerge; explore them all (one chain is sketched below)
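To make the chain auditable, here is a minimal sketch of recording the website-down walkthrough above as data; the `FiveWhys` class and its guard against terminal "human error" answers are illustrative, not a prescribed tool:

```python
from dataclasses import dataclass, field

@dataclass
class FiveWhys:
    """Record a 5 Whys chain; the last answer is the candidate root cause."""
    problem: str
    whys: list[str] = field(default_factory=list)

    def ask(self, answer: str) -> None:
        # Per the tips above: "human error" is never a terminal answer.
        if "human error" in answer.lower():
            raise ValueError("Keep asking why the error was possible")
        self.whys.append(answer)

    @property
    def root_cause(self) -> str:
        return self.whys[-1] if self.whys else self.problem

chain = FiveWhys("Website down")
for answer in [
    "Database connection failed",
    "Connection pool exhausted",
    "Pool size too small (10 vs 100 needed)",
    "Config template had wrong value",
    "No validation in deployment pipeline",
]:
    chain.ask(answer)
print(chain.root_cause)  # -> No validation in deployment pipeline
```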
Fishbone Diagram (Ishikawa)
Structure: Fish skeleton with problem at head, causes as bones
Categories (6M):
- Methods (Processes): Missing step, unclear procedure, no validation
- Machines (Technology): System limits, infrastructure failures, bugs
- Materials (Data/Inputs): Bad data, missing info, incorrect assumptions
- Measurements (Monitoring): Blind spots, delayed detection, wrong metrics
- Mother Nature (Environment): External dependencies, third-party failures, regulations
- Manpower (People): Training gaps, staffing, knowledge silos (use with care: focus on systemic issues, not individuals)
When to Use: Complex incidents with multiple contributing factors
Process:
- Draw fish skeleton with problem at head
- Brainstorm causes in each category
- Vote on most likely root causes
- Investigate top 3-5 causes
- Confirm with evidence (logs, metrics)
Fault Tree Analysis
Structure: Tree with failure at top, causes below, gates connecting
Gates:
- AND Gate: All inputs required for failure (redundancy protects)
- OR Gate: Any input sufficient for failure (single point of failure)
Example:
Service Down (OR gate)
├── Database Failure (AND gate)
│   ├── Primary DB down
│   └── Replica promotion failed
└── Application Crash (OR gate)
    ├── Memory leak
    ├── Uncaught exception
    └── OOM killer triggered
Use: Identify single points of failure, prioritize redundancy
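A fault tree like the one above maps directly onto code. A minimal sketch, with an assumed `Gate` class whose leaves are booleans (True meaning that failure occurred):

```python
from dataclasses import dataclass

@dataclass
class Gate:
    kind: str     # "AND" (all inputs required) or "OR" (any input sufficient)
    inputs: list  # child Gates, or bools marking whether a leaf failure occurred

    def failed(self) -> bool:
        states = [i.failed() if isinstance(i, Gate) else i for i in self.inputs]
        return all(states) if self.kind == "AND" else any(states)

# The example tree: primary DB is down, but replica promotion worked.
service_down = Gate("OR", [
    Gate("AND", [True, False]),         # Database Failure branch
    Gate("OR", [False, False, False]),  # Application Crash branch
])
print(service_down.failed())  # False: the AND gate (redundancy) held
```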
Swiss Cheese Model
Concept: Layers of defense (safeguards) with holes (gaps)
Incident occurs when: Holes align across all layers
Example Layers:
- Design: Architecture with failover
- Training: Team knows how to respond
- Process: Peer review required
- Technology: Automated tests
- Monitoring: Alerts fire when issue occurs
Analysis: Identify which layers had holes for this incident, then plug them
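As a back-of-envelope check: if the layers are assumed independent and each has some probability of missing an issue, the chance of full alignment is their product. The probabilities below are invented for illustration:

```python
# Assumed probability that each defense layer misses the issue (a "hole").
layers = {
    "design_failover": 0.10,
    "training": 0.30,
    "peer_review": 0.20,
    "automated_tests": 0.15,
    "monitoring": 0.25,
}

p_incident = 1.0
for p_miss in layers.values():
    p_incident *= p_miss  # the holes must align across every layer

print(f"P(incident slips through all layers) = {p_incident:.6f}")  # 0.000225
```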
Contributing Factors Framework
Categorize causes:
Immediate Cause: Proximate trigger
- Example: Config with wrong value deployed
Enabling Causes: Allowed immediate cause
- Example: No config validation, no peer review
Systemic Causes: Organizational issues
- Example: Pressure to ship fast, understaffed team, no time for rigor
Action: Address all levels, not just the immediate cause
3. Corrective Action Frameworks
Hierarchy of Controls
From Most to Least Effective:
1. Elimination: Remove hazard entirely
- Example: Deprecate risky feature, sunset legacy system
- Most effective but often not feasible
2. Substitution: Replace with safer alternative
- Example: Use managed service vs self-hosted database
- Reduces risk substantially
3. Engineering Controls: Add safeguards
- Example: Rate limits, circuit breakers, automated testing, canary deployments
- Effective and feasible
4. Administrative Controls: Improve processes
- Example: Runbooks, checklists, peer review, approval gates
- Relies on compliance
5. Training: Educate people
- Example: Onboarding, workshops, documentation
- Least effective alone, use with other controls
Apply: Start at top of hierarchy, work down until feasible solution found
SMART Actions
Criteria:
- Specific: "Add config validation" not "Improve deployments"
- Measurable: "Reduce MTTR from 2hr to 30min" not "Faster response"
- Assignable: Name person, not "team" or "we"
- Realistic: Given constraints (time, budget, skills)
- Time-bound: Explicit deadline
Bad: "Improve monitoring" Good: "Add alert for connection pool >80% utilization by Mar 15 (Owner: Alex)"
Prevention vs Detection vs Mitigation
Prevention: Stop incident from occurring
- Example: Config validation, automated testing, peer review
- Best ROI but can't prevent everything
Detection: Identify incident quickly
- Example: Monitoring, alerting, anomaly detection
- Reduces time to mitigation
Mitigation: Limit impact when incident occurs
- Example: Circuit breakers, graceful degradation, failover, rollback
- Reduces blast radius
Balanced Portfolio: Invest in all three
Action Prioritization
Impact vs Effort Matrix:
- High Impact, Low Effort: Do immediately (Quick wins)
- High Impact, High Effort: Schedule as project (Strategic)
- Low Impact, Low Effort: Do if capacity (Fill-ins)
- Low Impact, High Effort: Skip (Not worth it)
Risk-Based: Prioritize by Likelihood × Impact of recurrence
- Critical incident (total outage) likely to recur → Top priority
- Minor issue (UI glitch) unlikely to recur → Lower priority
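A minimal sketch of both prioritization schemes, assuming 1-5 scales for likelihood, impact, and effort:

```python
def risk_score(likelihood: float, impact: float) -> float:
    """Likelihood and impact of recurrence on 1-5 scales; higher = do first."""
    return likelihood * impact

def bucket(impact: float, effort: float, threshold: float = 3.0) -> str:
    """Place an action in the Impact vs Effort matrix (1-5 scales assumed)."""
    if impact >= threshold:
        return "Quick win" if effort < threshold else "Strategic project"
    return "Fill-in" if effort < threshold else "Skip"

print(bucket(impact=5, effort=1))          # Quick win
print(risk_score(likelihood=4, impact=5))  # 20 -> top priority
```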
4. Incident Response Patterns
Incident Severity Levels
Sev 1 (Critical):
- Total outage, data loss, security breach, >50% users affected
- Response: Immediate, all-hands, exec notification
- SLA: Acknowledge <15min, resolve <4hr
Sev 2 (High):
- Partial outage, degraded performance, 10-50% users affected
- Response: On-call team, incident commander assigned
- SLA: Acknowledge <30min, resolve <24hr
Sev 3 (Medium):
- Minor degradation, <10% users affected, non-critical feature down
- Response: On-call investigates, may defer to business hours
- SLA: Acknowledge <2hr, resolve <48hr
Sev 4 (Low):
- Minimal impact, cosmetic issues, internal tools only
- Response: Create ticket, prioritize in backlog
- SLA: No SLA, best-effort
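These SLAs can live in code so paging logic reads from one table. A sketch mirroring the values above; the `SevPolicy` shape is an assumption, and exec notification follows the Sev 1/2 trigger described under Communication Patterns below:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SevPolicy:
    ack_minutes: int | None    # None = no SLA (best-effort)
    resolve_hours: int | None
    notify_exec: bool          # executive updates are triggered for Sev 1/2

SLA = {
    1: SevPolicy(ack_minutes=15, resolve_hours=4, notify_exec=True),
    2: SevPolicy(ack_minutes=30, resolve_hours=24, notify_exec=True),
    3: SevPolicy(ack_minutes=120, resolve_hours=48, notify_exec=False),
    4: SevPolicy(ack_minutes=None, resolve_hours=None, notify_exec=False),
}

print(SLA[1])  # SevPolicy(ack_minutes=15, resolve_hours=4, notify_exec=True)
```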
Incident Command Structure (ICS)
Roles:
- Incident Commander (IC): Coordinates response, makes decisions, external communication
- Technical Lead: Diagnoses issue, implements fix, coordinates engineers
- Communication Lead: Status updates, stakeholder briefing, customer communication
- Scribe: Documents timeline, decisions, actions in incident log
Why Structure: Prevents chaos, clear ownership, scales to large incidents
Rotation: IC role rotates across senior engineers (training, burnout prevention)
Communication Patterns
Internal Updates (Slack, incident channel):
- Frequency: Every 15-30 min
- Format: Status, progress, next steps, blockers
- Example: "Update 14:30: Root cause identified (bad config). Initiating rollback. ETA resolution: 15:00."
External Updates (Status page, social media):
- Frequency: At detection, every hour, at resolution
- Tone: Transparent, apologetic, no technical jargon
- Example: "We're currently experiencing issues with login. Our team is actively working on a fix. We'll update every hour."
Executive Updates:
- Trigger: Sev 1/2 within 30 min
- Format: Impact, root cause (if known), ETA, what we're doing
- Frequency: Every 30-60 min until resolved
Postmortem Timing
When to Conduct:
- Immediately (within 48 hours): Sev 1/2 incidents
- Weekly batching: Sev 3 incidents (batch review)
- Monthly: Sev 4 or pattern analysis (recurring issues)
- Pre-mortem: Before major launch (imagine failure, identify risks)
Who Attends:
- All incident responders
- Service owners
- Optional: Stakeholders, leadership (for major incidents)
5. Postmortem Facilitation
Facilitation Tips
Set Tone:
- Open: "We're here to learn, not blame. Everything shared is blameless."
- Emphasize: Focus on systems and processes, not individuals
- Encourage: Transparency, even uncomfortable truths
Structure Meeting (60-90 min):
- Recap timeline (10 min): Walk through what happened
- Impact review (5 min): Quantify damage
- Root cause brainstorm (20 min): Fishbone or 5 Whys as group
- Corrective actions (20 min): Brainstorm actions for each root cause
- Prioritization (10 min): Vote on top 5 actions
- Assign owners (5 min): Who owns what, by when
Facilitation Techniques:
- Round-robin: Everyone speaks, no dominance
- Silent brainstorm: Write ideas on sticky notes, then share (prevents groupthink)
- Dot voting: Each person gets 3 votes for top priorities
- Parking lot: Capture off-topic items for later
Red Flags (intervene if you see):
- Blame language ("X caused this") → Reframe to system focus
- Defensiveness ("I had to rush because...") → Acknowledge pressure, focus on fixing system
- Rabbit holes (long technical debates) → Park for offline discussion
Follow-Up
- Document: Assign someone to write up the postmortem (usually the IC or Technical Lead)
- Share: Distribute to team, stakeholders, company-wide (transparency)
- Present: 10-min presentation in all-hands or team meeting (visibility)
- Track: Add action items to the project tracker; review weekly in standup
- Close: Mark the postmortem complete when all actions are done
Postmortem Review Meetings
Monthly Postmortem Review (Team-level):
- Review all postmortems from last month
- Identify patterns (repeated root causes)
- Assess action item completion rate
- Celebrate improvements
Quarterly Postmortem Review (Org-level):
- Aggregate data: Incident frequency, severity, MTTR
- Trends: Are we getting better? (MTTR decreasing? Repeat incidents decreasing?)
- Investment decisions: Where to invest in reliability
6. Organizational Learning
Learning Metrics
Track:
- Incident frequency: Total incidents per month (decreasing over time?)
- MTTR (Mean Time To Resolve): Average time from detection to resolution (improving?)
- Repeat incidents: Same root cause recurring (learning gap if high)
- Action completion rate: % of postmortem actions completed (accountability)
- Near-miss reporting: # of near-misses reported (psychological safety indicator)
Goals:
- Reduce incident frequency: -10% per quarter
- Reduce MTTR: -20% per quarter
- Reduce repeat incidents: <5% of total
- Action completion: >90%
- Near-miss reporting: Increasing (good sign)
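A minimal sketch of computing these metrics from incident records; the `Incident` fields are assumptions about what the postmortem database stores:

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

@dataclass
class Incident:
    detected: datetime
    resolved: datetime
    root_cause: str      # category label, used for repeat detection
    actions_total: int   # postmortem action items created
    actions_done: int    # action items completed

def mttr_hours(incidents: list[Incident]) -> float:
    """Mean time from detection to resolution, in hours."""
    return mean((i.resolved - i.detected).total_seconds() / 3600 for i in incidents)

def repeat_rate(incidents: list[Incident]) -> float:
    """Share of incidents repeating an earlier root cause (goal: <5%)."""
    counts = Counter(i.root_cause for i in incidents)
    repeats = sum(c - 1 for c in counts.values())  # occurrences beyond the first
    return repeats / len(incidents)

def action_completion(incidents: list[Incident]) -> float:
    """Completed postmortem actions over created ones (goal: >90%)."""
    total = sum(i.actions_total for i in incidents)
    return sum(i.actions_done for i in incidents) / total if total else 1.0
```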
Knowledge Sharing
Postmortem Database:
- Centralized repository (Confluence, Notion, wiki)
- Searchable by: Date, service, root cause category, severity
- Template: Consistent format for easy scanning
Learning Sessions:
- Monthly "Failure Fridays": Review interesting postmortems
- Quarterly "Top 10 Incidents": Review worst incidents, learnings
- Blameless tone: Celebrate transparency, not just success
Cross-Team Sharing:
- Share postmortems beyond immediate team
- Tag related teams (if payment incident, notify fintech team)
- Prevent Team A from repeating Team B's mistake
Continuous Improvement
Feedback Loops:
- Postmortem → Corrective actions → Completion → Measure impact → Repeat
- Quarterly review: Are actions working? (MTTR decreasing? Repeats decreasing?)
Runbook Updates:
- After every incident: Update runbook with learnings
- Quarterly: Review all runbooks, refresh stale content
- Metric: Runbook age (>6 months old = needs refresh)
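A sketch of the staleness check, assuming a simple name-to-date inventory of runbooks:

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=183)  # ~6 months, per the metric above

# Hypothetical inventory: runbook name -> last-updated date
runbooks = {
    "db-failover": date(2024, 1, 10),
    "rollback": date(2024, 11, 2),
}

stale = [name for name, updated in runbooks.items()
         if date.today() - updated > MAX_AGE]
print("Needs refresh:", stale)
```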
Process Evolution:
- Learn from postmortem process itself
- Survey: "Was this postmortem useful? What would improve it?"
- Iterate: Improve template, facilitation, tracking
Quick Reference
Blameless Language Patterns
- ❌ "Person caused" → ✓ "Process allowed"
- ❌ "Didn't check" → ✓ "Validation missing"
- ❌ "Mistake" → ✓ "Gap in system"
Root Cause Techniques
- 5 Whys: Simple incidents, single cause
- Fishbone: Complex incidents, multiple causes
- Fault Tree: Identify single points of failure
- Swiss Cheese: Multiple safeguard failures
Corrective Action Hierarchy
- Eliminate (remove hazard)
- Substitute (safer alternative)
- Engineering controls (safeguards)
- Administrative (processes)
- Training (education)
Incident Response
- Sev 1: Total outage, <15min ack, <4hr resolve
- Sev 2: Partial, <30min ack, <24hr resolve
- Sev 3: Minor, <2hr ack, <48hr resolve
Success Metrics
- ✓ Incident frequency decreasing
- ✓ MTTR improving
- ✓ Repeat incidents <5%
- ✓ Action completion >90%
- ✓ Near-miss reporting increasing