# Postmortem Methodology

## Table of Contents

1. [Blameless Culture](#1-blameless-culture)
2. [Root Cause Analysis Techniques](#2-root-cause-analysis-techniques)
3. [Corrective Action Frameworks](#3-corrective-action-frameworks)
4. [Incident Response Patterns](#4-incident-response-patterns)
5. [Postmortem Facilitation](#5-postmortem-facilitation)
6. [Organizational Learning](#6-organizational-learning)

---

## 1. Blameless Culture

### Core Principles

**Humans Err, Systems Should Be Resilient**
- Assumption: People will make mistakes. Design systems to tolerate errors.
- Example: Deployment requires 2 approvals → reduces error likelihood
- Example: Canary deployments → limit error blast radius

**Second Victim Phenomenon**
- First victim: Customers affected by the incident
- Second victim: The engineer whose action triggered it (guilt, anxiety, fear)
- Blameful culture: The second victim is punished → hides future issues, leaves the company
- Blameless culture: Learn together → transparency, improvement

**Just Culture vs Blameless**
- **Blameless**: Focus on system improvements, not individual accountability
- **Just Culture**: Distinguish reckless behavior (punish) from honest mistakes (learn)
- Example (reckless): Intentionally bypassing safeguards, ignoring warnings
- Example (honest): A misunderstanding, or a reasonable decision given the information at hand

### Language Patterns

**Blameful → Blameless Reframing**:
- ❌ "Engineer X caused outage" → ✓ "Deployment process allowed bad config to reach production"
- ❌ "PM didn't validate" → ✓ "Requirements validation process was missing"
- ❌ "Designer made error" → ✓ "Design review didn't catch accessibility issue"
- ❌ "Careless mistake" → ✓ "System lacked error detection at this step"

**System-Focused Questions**:
- "What allowed this to happen?" (not "Who did this?")
- "What gaps in our process enabled this?" (not "Why didn't you check?")
- "How can we prevent recurrence?" (not "How do we prevent X from doing this again?")

### Building Blameless Culture

**Leadership Modeling**:
- Leaders admit their own mistakes publicly
- Leaders ask "How did our systems fail?" not "Who messed up?"
- Leaders reward transparency (sharing incidents and near-misses)

**Psychological Safety**:
- Safe to raise concerns, report issues, admit mistakes
- No punishment for honest errors (distinguished from recklessness)
- Celebrate learning, not just success

**Metrics to Track**:
- Near-miss reporting rate (high = good; people feel safe reporting)
- Postmortem completion rate (every incident gets a postmortem)
- Action item completion rate (learnings turn into improvements)
- Repeat incident rate (same root cause recurring = not learning)

---

## 2. Root Cause Analysis Techniques

### 5 Whys

**Process**:
1. State the problem clearly
2. Ask "Why?" → answer
3. Ask "Why?" of that answer → next answer
4. Repeat until you reach a root cause that is fixable at the system/org level
5. Typically 5 iterations, but 3-7 is common

**Example**:
Problem: Website down
1. Why? Database connection failed
2. Why? Connection pool exhausted
3. Why? Pool size too small (10 vs 100 needed)
4. Why? Config template had wrong value
5. Why? No validation in deployment pipeline

**Root Cause**: Deployment pipeline lacked config validation (fixable!)

**Tips**:
- Avoid "human error" as an answer; keep asking why the error was possible
- Stop when you reach an actionable system change
- Multiple paths may emerge; explore all of them

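As a rough illustration of how a finished chain might be recorded in the postmortem (the class and field names below are invented, not part of any standard tooling):

```python
from dataclasses import dataclass, field


@dataclass
class FiveWhys:
    """A 5 Whys chain from symptom to system-level root cause."""
    problem: str
    whys: list[str] = field(default_factory=list)

    def ask(self, answer: str) -> "FiveWhys":
        """Append the answer to the latest 'Why?' and return self for chaining."""
        self.whys.append(answer)
        return self

    @property
    def root_cause(self) -> str:
        # Treat the last answer as the root cause; stop the chain only when it
        # names a fixable system/process gap, never "human error".
        return self.whys[-1] if self.whys else self.problem


analysis = (
    FiveWhys(problem="Website down")
    .ask("Database connection failed")
    .ask("Connection pool exhausted")
    .ask("Pool size too small (10 vs 100 needed)")
    .ask("Config template had wrong value")
    .ask("No config validation in deployment pipeline")
)
print(analysis.root_cause)  # "No config validation in deployment pipeline"
```
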
### Fishbone Diagram (Ishikawa)

**Structure**: Fish skeleton with the problem at the head and causes as the bones

**Categories** (6M):
- **Methods** (Processes): Missing step, unclear procedure, no validation
- **Machines** (Technology): System limits, infrastructure failures, bugs
- **Materials** (Data/Inputs): Bad data, missing info, incorrect assumptions
- **Measurements** (Monitoring): Blind spots, delayed detection, wrong metrics
- **Mother Nature** (Environment): External dependencies, third-party failures, regulations
- **Manpower** (People): Training gaps, staffing, knowledge silos (use with care: keep the focus on systemic issues)

**When to Use**: Complex incidents with multiple contributing factors

**Process**:
1. Draw the fish skeleton with the problem at the head
2. Brainstorm causes in each category
3. Vote on the most likely root causes
4. Investigate the top 3-5 causes
5. Confirm with evidence (logs, metrics)

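A small illustrative sketch of steps 2-3: collect the brainstormed causes under the 6M categories, then tally dot votes to choose what to investigate first (all data here is made up):

```python
from collections import Counter

# Step 2: brainstormed causes grouped under the 6M categories (sample data).
fishbone = {
    "Methods": ["No config validation step", "Unclear rollback procedure"],
    "Machines": ["Connection pool limit too low"],
    "Materials": ["Stale config template"],
    "Measurements": ["No alert on pool utilization"],
    "Mother Nature": ["Cloud provider maintenance window"],
    "Manpower": ["On-call unfamiliar with the service runbook"],
}
all_causes = [cause for causes in fishbone.values() for cause in causes]

# Step 3: each attendee spends 3 dot votes on the causes they find most likely.
ballots = [
    ["No config validation step", "No alert on pool utilization", "Connection pool limit too low"],
    ["No config validation step", "Stale config template", "No alert on pool utilization"],
]
votes = Counter(vote for ballot in ballots for vote in ballot if vote in all_causes)

# Step 4 starts from the top-voted causes.
for cause, count in votes.most_common(3):
    print(f"{count} votes: {cause}")
```
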
### Fault Tree Analysis

**Structure**: Tree with the failure at the top, causes below, and gates connecting them

**Gates**:
- **AND Gate**: All inputs required for failure (redundancy protects)
- **OR Gate**: Any input sufficient for failure (single point of failure)

**Example**:

```
Service Down (OR gate)
├── Database Failure (AND gate)
│   ├── Primary DB down
│   └── Replica promotion failed
└── Application Crash (OR gate)
    ├── Memory leak
    ├── Uncaught exception
    └── OOM killer triggered
```

**Use**: Identify single points of failure, prioritize redundancy

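The gate logic is mechanical enough to check in a few lines: an AND gate fails only when all of its children fail, an OR gate when any child does. The sketch below mirrors the example tree; the `Gate` class and event names are illustrative, not a real FTA library:

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class Gate:
    """A fault-tree node: an AND/OR gate over basic events or sub-gates."""
    name: str
    kind: str  # "AND" or "OR"
    children: list[Union["Gate", str]] = field(default_factory=list)

    def fails(self, failed_events: set[str]) -> bool:
        results = [
            child.fails(failed_events) if isinstance(child, Gate) else child in failed_events
            for child in self.children
        ]
        return all(results) if self.kind == "AND" else any(results)


service_down = Gate("Service Down", "OR", [
    Gate("Database Failure", "AND", ["Primary DB down", "Replica promotion failed"]),
    Gate("Application Crash", "OR", ["Memory leak", "Uncaught exception", "OOM killer triggered"]),
])

# Any single leaf under an OR path takes the service down: a single point of failure.
print(service_down.fails({"Memory leak"}))                                   # True
# The AND gate models redundancy: the primary going down alone is not enough.
print(service_down.fails({"Primary DB down"}))                               # False
print(service_down.fails({"Primary DB down", "Replica promotion failed"}))   # True
```
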
### Swiss Cheese Model

**Concept**: Layers of defense (safeguards), each with holes (gaps)

**Incident occurs when**: The holes align across all layers

**Example Layers**:
1. Design: Architecture with failover
2. Training: Team knows how to respond
3. Process: Peer review required
4. Technology: Automated tests
5. Monitoring: Alerts fire when the issue occurs

**Analysis**: Identify which layers had holes for this incident, then plug them

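A minimal sketch of that analysis for a single failure path, assuming each layer is recorded as either having caught the failure or having let it through (the data is invented for illustration):

```python
# One failure path through the defenses: True = the layer caught it,
# False = the hole in that layer lined up with the failure.
layers = [
    ("Design: failover in place", True),
    ("Training: team knows the response", True),
    ("Process: peer review required", False),   # reviewer skipped the config change
    ("Technology: automated tests", False),     # no test covered pool sizing
    ("Monitoring: alerts fire", False),         # no alert on pool utilization
]

holes = [name for name, caught in layers if not caught]
reaches_users = all(not caught for _, caught in layers)

print("Layers with holes to plug:", holes)
print("Holes aligned across every layer?", reaches_users)
```
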
### Contributing Factors Framework

**Categorize causes**:

**Immediate Cause**: The proximate trigger
- Example: Config with the wrong value deployed

**Enabling Causes**: Conditions that allowed the immediate cause
- Example: No config validation, no peer review

**Systemic Causes**: Organizational issues
- Example: Pressure to ship fast, understaffed team, no time for rigor

**Action**: Address all three levels, not just the immediate cause

---

## 3. Corrective Action Frameworks

### Hierarchy of Controls

**From Most to Least Effective**:

1. **Elimination**: Remove the hazard entirely
   - Example: Deprecate the risky feature, sunset the legacy system
   - Most effective, but often not feasible

2. **Substitution**: Replace with a safer alternative
   - Example: Use a managed service instead of a self-hosted database
   - Reduces risk substantially

3. **Engineering Controls**: Add safeguards
   - Example: Rate limits, circuit breakers, automated testing, canary deployments
   - Effective and usually feasible

4. **Administrative Controls**: Improve processes
   - Example: Runbooks, checklists, peer review, approval gates
   - Relies on compliance

5. **Training**: Educate people
   - Example: Onboarding, workshops, documentation
   - Least effective alone; combine with other controls

**Apply**: Start at the top of the hierarchy and work down until you find a feasible solution

### SMART Actions

**Criteria**:
- **Specific**: "Add config validation" not "Improve deployments"
- **Measurable**: "Reduce MTTR from 2hr to 30min" not "Faster response"
- **Assignable**: Name a person, not "the team" or "we"
- **Realistic**: Achievable given constraints (time, budget, skills)
- **Time-bound**: Explicit deadline

**Bad**: "Improve monitoring"
**Good**: "Add alert for connection pool >80% utilization by Mar 15 (Owner: Alex)"

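One way to make the criteria hard to skip is to capture action items as structured records and lint them before the meeting ends. The sketch below is a hypothetical illustration; the fields and checks are assumptions, not a standard:

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    description: str      # Specific: names the concrete change
    success_metric: str   # Measurable: how we will know it worked
    owner: str            # Assignable: a named person, not "the team"
    due: date             # Time-bound: an explicit deadline

    def smart_gaps(self, today: date) -> list[str]:
        """A cheap lint for obviously unmet SMART criteria (not a guarantee)."""
        gaps = []
        if len(self.description.split()) < 3:
            gaps.append("description too vague to be specific")
        if not self.success_metric.strip():
            gaps.append("no measurable success criterion")
        if self.owner.strip().lower() in {"", "team", "we", "everyone"}:
            gaps.append("not assigned to a named person")
        if self.due <= today:
            gaps.append("deadline missing or already past")
        return gaps


item = ActionItem(
    description="Add alert for connection pool >80% utilization",
    success_metric="Alert fires during the staging load test",
    owner="Alex",
    due=date(2026, 3, 15),
)
print(item.smart_gaps(today=date(2026, 2, 1)) or "Looks SMART")
```
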
### Prevention vs Detection vs Mitigation

**Prevention**: Stop the incident from occurring
- Example: Config validation, automated testing, peer review
- Best ROI, but you can't prevent everything

**Detection**: Identify the incident quickly
- Example: Monitoring, alerting, anomaly detection
- Reduces time to mitigation

**Mitigation**: Limit the impact when an incident occurs
- Example: Circuit breakers, graceful degradation, failover, rollback
- Reduces blast radius

**Balanced Portfolio**: Invest in all three

### Action Prioritization

**Impact vs Effort Matrix**:
- **High Impact, Low Effort**: Do immediately (quick wins)
- **High Impact, High Effort**: Schedule as a project (strategic)
- **Low Impact, Low Effort**: Do if there is capacity (fill-ins)
- **Low Impact, High Effort**: Skip (not worth it)

**Risk-Based**: Prioritize by likelihood × impact of recurrence (scored in the sketch below)
- A critical incident (total outage) likely to recur → top priority
- A minor issue (UI glitch) unlikely to recur → lower priority

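A tiny sketch of that scoring, using assumed 1-5 scales for likelihood and impact:

```python
# (proposed action, likelihood of recurrence, impact if it recurs), on 1-5 scales.
candidates = [
    ("Add config validation to the deployment pipeline", 4, 5),
    ("Write a runbook for connection pool exhaustion", 3, 3),
    ("Fix minor UI glitch in the admin panel", 2, 1),
]

# Priority score = likelihood x impact of recurrence; highest first.
for action, likelihood, impact in sorted(candidates, key=lambda c: c[1] * c[2], reverse=True):
    print(f"score={likelihood * impact:2d}  {action}")
```
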
---

## 4. Incident Response Patterns

### Incident Severity Levels

**Sev 1 (Critical)**:
- Total outage, data loss, security breach, >50% of users affected
- Response: Immediate, all-hands, exec notification
- SLA: Acknowledge <15min, resolve <4hr

**Sev 2 (High)**:
- Partial outage, degraded performance, 10-50% of users affected
- Response: On-call team, incident commander assigned
- SLA: Acknowledge <30min, resolve <24hr

**Sev 3 (Medium)**:
- Minor degradation, <10% of users affected, non-critical feature down
- Response: On-call investigates, may defer to business hours
- SLA: Acknowledge <2hr, resolve <48hr

**Sev 4 (Low)**:
- Minimal impact, cosmetic issues, internal tools only
- Response: Create a ticket, prioritize in the backlog
- SLA: None; best effort

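If these SLAs are wired into paging or ticketing tooling, the table can live as a simple lookup. The sketch below just mirrors the numbers above; the structure itself is an assumption:

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    response: str
    ack_sla: Optional[timedelta]      # None = no SLA, best effort
    resolve_sla: Optional[timedelta]


SEVERITY = {
    1: SeverityPolicy("Immediate, all-hands, exec notification",
                      timedelta(minutes=15), timedelta(hours=4)),
    2: SeverityPolicy("On-call team, incident commander assigned",
                      timedelta(minutes=30), timedelta(hours=24)),
    3: SeverityPolicy("On-call investigates, may defer to business hours",
                      timedelta(hours=2), timedelta(hours=48)),
    4: SeverityPolicy("Create a ticket, prioritize in backlog", None, None),
}

policy = SEVERITY[2]
print(f"Sev 2: acknowledge within {policy.ack_sla}, resolve within {policy.resolve_sla}")
```
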
### Incident Command Structure (ICS)

**Roles**:
- **Incident Commander (IC)**: Coordinates the response, makes decisions, handles external communication
- **Technical Lead**: Diagnoses the issue, implements the fix, coordinates engineers
- **Communication Lead**: Status updates, stakeholder briefing, customer communication
- **Scribe**: Documents the timeline, decisions, and actions in the incident log

**Why Structure**: Prevents chaos, gives clear ownership, scales to large incidents

**Rotation**: The IC role rotates across senior engineers (training, burnout prevention)

### Communication Patterns

**Internal Updates** (Slack, incident channel):
- Frequency: Every 15-30 min
- Format: Status, progress, next steps, blockers
- Example: "Update 14:30: Root cause identified (bad config). Initiating rollback. ETA resolution: 15:00."

**External Updates** (status page, social media):
- Frequency: At detection, every hour, at resolution
- Tone: Transparent, apologetic, no technical jargon
- Example: "We're currently experiencing issues with login. Our team is actively working on a fix. We'll update every hour."

**Executive Updates**:
- Trigger: Sev 1/2, within 30 min
- Format: Impact, root cause (if known), ETA, what we're doing
- Frequency: Every 30-60 min until resolved

### Postmortem Timing

**When to Conduct**:
- **Immediately** (within 48 hours): Sev 1/2 incidents
- **Weekly batching**: Sev 3 incidents (batch review)
- **Monthly**: Sev 4 or pattern analysis (recurring issues)
- **Pre-mortem**: Before a major launch (imagine the failure, identify risks)

**Who Attends**:
- All incident responders
- Service owners
- Optional: Stakeholders, leadership (for major incidents)

---

## 5. Postmortem Facilitation

### Facilitation Tips

**Set Tone**:
- Open with: "We're here to learn, not to blame. Everything shared is blameless."
- Emphasize: Focus on systems and processes, not individuals
- Encourage: Transparency, even uncomfortable truths

**Structure Meeting** (60-90 min):
1. **Recap timeline** (10 min): Walk through what happened
2. **Impact review** (5 min): Quantify the damage
3. **Root cause brainstorm** (20 min): Fishbone or 5 Whys as a group
4. **Corrective actions** (20 min): Brainstorm actions for each root cause
5. **Prioritization** (10 min): Vote on the top 5 actions
6. **Assign owners** (5 min): Who owns what, by when

**Facilitation Techniques**:
- **Round-robin**: Everyone speaks, no one dominates
- **Silent brainstorm**: Write ideas on sticky notes, then share (prevents groupthink)
- **Dot voting**: Each person gets 3 votes for top priorities
- **Parking lot**: Capture off-topic items for later

**Red Flags** (intervene if you see these):
- Blame language ("X caused this") → Reframe to a system focus
- Defensiveness ("I had to rush because...") → Acknowledge the pressure, refocus on fixing the system
- Rabbit holes (long technical debates) → Park for offline discussion

### Follow-Up

- **Document**: Assign someone to write up the postmortem (usually the IC or Technical Lead)
- **Share**: Distribute to the team, stakeholders, and company-wide (transparency)
- **Present**: Give a 10-min presentation in an all-hands or team meeting (visibility)
- **Track**: Add action items to the project tracker; review weekly in standup
- **Close**: Mark the postmortem complete only when all actions are done

### Postmortem Review Meetings

**Monthly Postmortem Review** (team-level):
- Review all postmortems from the last month
- Identify patterns (repeated root causes)
- Assess action item completion rate
- Celebrate improvements

**Quarterly Postmortem Review** (org-level):
- Aggregate data: Incident frequency, severity, MTTR
- Trends: Are we getting better? (MTTR decreasing? Repeat incidents decreasing?)
- Investment decisions: Where to invest in reliability

---

## 6. Organizational Learning

### Learning Metrics

**Track**:
- **Incident frequency**: Total incidents per month (decreasing over time?)
- **MTTR (Mean Time To Resolve)**: Average time from detection to resolution (improving?)
- **Repeat incidents**: Same root cause recurring (a learning gap if high)
- **Action completion rate**: % of postmortem actions completed (accountability)
- **Near-miss reporting**: Number of near-misses reported (psychological safety indicator)

**Goals**:
- Reduce incident frequency: -10% per quarter
- Reduce MTTR: -20% per quarter
- Reduce repeat incidents: <5% of total
- Action completion: >90%
- Near-miss reporting: Increasing (a good sign)

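A hedged sketch of computing three of these metrics from a list of incident records; the `Incident` fields and the sample data are invented for illustration:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    detected_at: datetime
    resolved_at: datetime
    root_cause: str
    actions_total: int
    actions_done: int


def mttr_hours(incidents: list[Incident]) -> float:
    """Mean time to resolve, in hours, across all incidents."""
    durations = [(i.resolved_at - i.detected_at).total_seconds() / 3600 for i in incidents]
    return sum(durations) / len(durations)


def repeat_rate(incidents: list[Incident]) -> float:
    """Share of incidents whose root cause already appeared in an earlier incident."""
    seen, repeats = set(), 0
    for inc in sorted(incidents, key=lambda i: i.detected_at):
        if inc.root_cause in seen:
            repeats += 1
        seen.add(inc.root_cause)
    return repeats / len(incidents)


def action_completion(incidents: list[Incident]) -> float:
    """Fraction of all postmortem action items that have been completed."""
    total = sum(i.actions_total for i in incidents)
    return sum(i.actions_done for i in incidents) / total if total else 1.0


incidents = [
    Incident(datetime(2025, 1, 3, 10), datetime(2025, 1, 3, 12), "missing config validation", 4, 4),
    Incident(datetime(2025, 2, 7, 9), datetime(2025, 2, 7, 9, 45), "missing config validation", 2, 1),
    Incident(datetime(2025, 2, 20, 14), datetime(2025, 2, 20, 15), "no alert on pool utilization", 3, 3),
]
print(f"MTTR: {mttr_hours(incidents):.2f} h, repeat rate: {repeat_rate(incidents):.0%}, "
      f"action completion: {action_completion(incidents):.0%}")
```
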
### Knowledge Sharing

**Postmortem Database**:
- Centralized repository (Confluence, Notion, wiki)
- Searchable by: Date, service, root cause category, severity (a filter sketch follows below)
- Template: Consistent format for easy scanning

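A minimal illustration of the "searchable" idea, with postmortems as plain records that can be filtered on those fields (the schema is an assumption, not tied to any particular tool):

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Postmortem:
    occurred_on: date
    service: str
    root_cause_category: str   # e.g. "config", "capacity", "dependency"
    severity: int


db = [
    Postmortem(date(2025, 1, 3), "checkout", "config", 1),
    Postmortem(date(2025, 2, 7), "checkout", "config", 2),
    Postmortem(date(2025, 3, 12), "search", "capacity", 3),
]

# Spot a repeating pattern: Sev 1/2 postmortems for one service in one category.
matches = [p for p in db
           if p.service == "checkout" and p.root_cause_category == "config" and p.severity <= 2]
print(f"{len(matches)} checkout incidents trace back to config issues")
```
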
**Learning Sessions**:
- Monthly "Failure Fridays": Review interesting postmortems
- Quarterly "Top 10 Incidents": Review the worst incidents and their learnings
- Blameless tone: Celebrate the transparency of sharing failures, not just successes

**Cross-Team Sharing**:
- Share postmortems beyond the immediate team
- Tag related teams (e.g. a payment incident → notify the fintech team)
- Prevents Team A from repeating Team B's mistake

### Continuous Improvement

**Feedback Loops**:
- Postmortem → corrective actions → completion → measure impact → repeat
- Quarterly review: Are the actions working? (MTTR decreasing? Repeats decreasing?)

**Runbook Updates**:
- After every incident: Update the runbook with learnings
- Quarterly: Review all runbooks, refresh stale content
- Metric: Runbook age (>6 months old = needs refresh; checked in the sketch below)

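A short sketch of that staleness check: flag any runbook not touched in roughly six months for the quarterly refresh (the record format is illustrative; the threshold follows the text above):

```python
from datetime import date, timedelta

STALE_AFTER = timedelta(days=182)  # roughly six months

# Runbook name -> date of last update (sample data).
runbooks = {
    "database-failover": date(2025, 9, 1),
    "pool-exhaustion": date(2024, 11, 20),
    "cache-warmup": date(2025, 2, 14),
}

today = date(2025, 10, 1)  # fixed for a reproducible example
stale = [name for name, updated in runbooks.items() if today - updated > STALE_AFTER]
print("Runbooks needing refresh:", stale)
```
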
**Process Evolution**:
- Learn from the postmortem process itself
- Survey: "Was this postmortem useful? What would improve it?"
- Iterate: Improve the template, facilitation, and tracking

---

## Quick Reference

### Blameless Language Patterns
- ❌ "Person caused" → ✓ "Process allowed"
- ❌ "Didn't check" → ✓ "Validation missing"
- ❌ "Mistake" → ✓ "Gap in system"

### Root Cause Techniques
- **5 Whys**: Simple incidents, single cause
- **Fishbone**: Complex incidents, multiple causes
- **Fault Tree**: Identify single points of failure
- **Swiss Cheese**: Multiple safeguard failures

### Corrective Action Hierarchy
1. Eliminate (remove the hazard)
2. Substitute (safer alternative)
3. Engineering controls (safeguards)
4. Administrative (processes)
5. Training (education)

### Incident Response
- **Sev 1**: Total outage, <15min ack, <4hr resolve
- **Sev 2**: Partial, <30min ack, <24hr resolve
- **Sev 3**: Minor, <2hr ack, <48hr resolve

### Success Metrics
- ✓ Incident frequency decreasing
- ✓ MTTR improving
- ✓ Repeat incidents <5%
- ✓ Action completion >90%
- ✓ Near-miss reporting increasing