Postmortem Methodology

Table of Contents

  1. Blameless Culture
  2. Root Cause Analysis Techniques
  3. Corrective Action Frameworks
  4. Incident Response Patterns
  5. Postmortem Facilitation
  6. Organizational Learning

1. Blameless Culture

Core Principles

Humans Err, Systems Should Be Resilient

  • Assumption: People will make mistakes. Design systems to tolerate errors.
  • Example: Deployment requires 2 approvals → reduces error likelihood
  • Example: Canary deployments → limits error blast radius

Second Victim Phenomenon

  • First victim: Customers affected by incident
  • Second victim: Engineer who made triggering action (guilt, anxiety, fear)
  • Blameful culture: Second victim punished → hides future issues, leaves company
  • Blameless culture: Learn together → transparency, improvement

Just Culture vs Blameless

  • Blameless: Focus on system improvements, not individual accountability
  • Just Culture: Distinguish reckless behavior (punish) from honest mistakes (learn)
  • Example Reckless: Intentionally bypassing safeguards, ignoring warnings
  • Example Honest: Misunderstanding, reasonable decision with information at hand

Language Patterns

Blameful → Blameless Reframing:

  • "Engineer X caused outage" → ✓ "Deployment process allowed bad config to reach production"
  • "PM didn't validate" → ✓ "Requirements validation process was missing"
  • "Designer made error" → ✓ "Design review didn't catch accessibility issue"
  • "Careless mistake" → ✓ "System lacked error detection at this step"

System-Focused Questions:

  • "What allowed this to happen?" (not "Who did this?")
  • "What gaps in our process enabled this?" (not "Why didn't you check?")
  • "How can we prevent recurrence?" (not "How do we prevent X from doing this again?")

Building Blameless Culture

Leadership Modeling:

  • Leaders admit own mistakes publicly
  • Leaders ask "How did our systems fail?" not "Who messed up?"
  • Leaders reward transparency (sharing incidents, near-misses)

Psychological Safety:

  • Safe to raise concerns, report issues, admit mistakes
  • No punishment for honest errors (distinguish from recklessness)
  • Celebrate learning, not just success

Metrics to Track:

  • Near-miss reporting rate (high = good, means people feel safe reporting)
  • Postmortem completion rate (all incidents get postmortem)
  • Action item completion rate (learnings turn into improvements)
  • Repeat incident rate (same root cause recurring = not learning)

2. Root Cause Analysis Techniques

5 Whys

Process:

  1. State problem clearly
  2. Ask "Why?" → Answer
  3. Ask "Why?" on answer → Next answer
  4. Repeat until root cause (fixable at system/org level)
  5. Typically 5 iterations, but can be 3-7

Example (Problem: Website down):

  1. Why? Database connection failed
  2. Why? Connection pool exhausted
  3. Why? Pool size too small (10 vs 100 needed)
  4. Why? Config template had wrong value
  5. Why? No validation in deployment pipeline

Root Cause: Deployment pipeline lacked config validation (fixable!)

Tips:

  • Avoid "human error" as an answer; keep asking why the error was possible
  • Stop when you reach an actionable system-level change
  • Multiple causal paths may emerge; explore them all
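
To make the technique concrete, here is a minimal Python sketch (illustrative names, not standard tooling) that captures the chain from the example above as structured data so it can be pasted straight into the postmortem:

```python
# Minimal sketch: capture a 5 Whys chain as structured data so the chain and
# root cause land in the postmortem write-up. Names here are illustrative.
from dataclasses import dataclass, field


@dataclass
class FiveWhys:
    problem: str
    answers: list[str] = field(default_factory=list)  # one answer per "Why?"

    def ask_why(self, answer: str) -> None:
        self.answers.append(answer)

    def render(self) -> str:
        lines = [f"Problem: {self.problem}"]
        lines += [f"{i}. Why? {a}" for i, a in enumerate(self.answers, start=1)]
        lines.append(f"Root cause: {self.answers[-1] if self.answers else 'TBD'}")
        return "\n".join(lines)


analysis = FiveWhys("Website down")
for answer in [
    "Database connection failed",
    "Connection pool exhausted",
    "Pool size too small (10 vs 100 needed)",
    "Config template had wrong value",
    "No validation in deployment pipeline",
]:
    analysis.ask_why(answer)

print(analysis.render())
```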

Fishbone Diagram (Ishikawa)

Structure: Fish skeleton with problem at head, causes as bones

Categories (6M):

  • Methods (Processes): Missing step, unclear procedure, no validation
  • Machines (Technology): System limits, infrastructure failures, bugs
  • Materials (Data/Inputs): Bad data, missing info, incorrect assumptions
  • Measurements (Monitoring): Blind spots, delayed detection, wrong metrics
  • Mother Nature (Environment): External dependencies, third-party failures, regulations
  • Manpower (People): Training gaps, staffing, knowledge silos (use with care: keep the focus on systemic issues, not individuals)

When to Use: Complex incidents with multiple contributing factors

Process:

  1. Draw fish skeleton with problem at head
  2. Brainstorm causes in each category
  3. Vote on most likely root causes
  4. Investigate top 3-5 causes
  5. Confirm with evidence (logs, metrics)

Fault Tree Analysis

Structure: Tree with failure at top, causes below, gates connecting

Gates:

  • AND Gate: All inputs required for failure (redundancy protects)
  • OR Gate: Any input sufficient for failure (single point of failure)

Example:

Service Down (OR gate)
├── Database Failure (AND gate)
│   ├── Primary DB down
│   └── Replica promotion failed
└── Application Crash (OR gate)
    ├── Memory leak
    ├── Uncaught exception
    └── OOM killer triggered

Use: Identify single points of failure and prioritize redundancy (see the sketch below)
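
A minimal Python sketch of the gate logic, using the tree above (the structure and names are illustrative):

```python
# Minimal fault-tree sketch: an AND gate fails only if every child fails,
# an OR gate fails if any child fails. Structure and names are illustrative.
from __future__ import annotations

from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    name: str
    failed: bool


@dataclass
class Gate:
    name: str
    kind: str  # "AND" or "OR"
    children: list[Union[Gate, Leaf]]


def failed(node: Union[Gate, Leaf]) -> bool:
    if isinstance(node, Leaf):
        return node.failed
    results = [failed(child) for child in node.children]
    return all(results) if node.kind == "AND" else any(results)


tree = Gate("Service Down", "OR", [
    Gate("Database Failure", "AND", [
        Leaf("Primary DB down", True),
        Leaf("Replica promotion failed", True),  # both required: redundancy protects
    ]),
    Gate("Application Crash", "OR", [
        Leaf("Memory leak", False),
        Leaf("Uncaught exception", False),
        Leaf("OOM killer triggered", False),
    ]),
])

print(failed(tree))  # True: every input of the AND branch failed
```

Leaves under an OR gate are single points of failure; leaves under an AND gate are protected by redundancy until their siblings also fail.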

Swiss Cheese Model

Concept: Layers of defense (safeguards) with holes (gaps)

Incident occurs when: Holes align across all layers

Example Layers:

  1. Design: Architecture with failover
  2. Training: Team knows how to respond
  3. Process: Peer review required
  4. Technology: Automated tests
  5. Monitoring: Alerts fire when issue occurs

Analysis: Identify which layers had holes for this incident, then plug those holes (see the sketch below)
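
A minimal Python sketch of that analysis, with illustrative layer names and values:

```python
# Minimal Swiss Cheese sketch: an incident reaches users only when every
# defensive layer has a hole for this failure path. Values are illustrative.
layers = {
    "Design (failover)": True,  # True = this layer had a hole in this incident
    "Training (response)": True,
    "Process (peer review)": True,
    "Technology (automated tests)": True,
    "Monitoring (alerts)": True,
}

incident_reached_users = all(layers.values())
holes_to_plug = [name for name, has_hole in layers.items() if has_hole]

print("Incident reached users:", incident_reached_users)
print("Layers to strengthen:", ", ".join(holes_to_plug))
```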

Contributing Factors Framework

Categorize causes:

Immediate Cause: Proximate trigger

  • Example: Config with wrong value deployed

Enabling Causes: Conditions that allowed the immediate cause

  • Example: No config validation, no peer review

Systemic Causes: Organizational issues

  • Example: Pressure to ship fast, understaffed team, no time for rigor

Action: Address all levels, not just the immediate cause


3. Corrective Action Frameworks

Hierarchy of Controls

From Most to Least Effective:

  1. Elimination: Remove hazard entirely
    • Example: Deprecate risky feature, sunset legacy system
    • Most effective but often not feasible
  2. Substitution: Replace with safer alternative
    • Example: Use managed service vs self-hosted database
    • Reduces risk substantially
  3. Engineering Controls: Add safeguards
    • Example: Rate limits, circuit breakers, automated testing, canary deployments
    • Effective and feasible
  4. Administrative Controls: Improve processes
    • Example: Runbooks, checklists, peer review, approval gates
    • Relies on compliance
  5. Training: Educate people
    • Example: Onboarding, workshops, documentation
    • Least effective alone, use with other controls

Apply: Start at the top of the hierarchy and work down until a feasible solution is found (an engineering-control sketch follows)
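
For instance, the engineering control suggested by the earlier 5 Whys example (a config validation gate in the deployment pipeline) might look roughly like this Python sketch; the keys, limits, and checks are hypothetical:

```python
# Minimal sketch of an engineering control: a config validation gate in the
# deployment pipeline. Keys, limits, and checks are hypothetical examples.
REQUIRED_KEYS = {"db_pool_size", "db_host", "request_timeout_s"}


def validate_config(config: dict) -> list[str]:
    """Return a list of validation errors; an empty list means safe to deploy."""
    errors = [f"missing key: {key}" for key in REQUIRED_KEYS - config.keys()]
    pool_size = config.get("db_pool_size", 0)
    if not 50 <= pool_size <= 500:  # guard against the "pool size too small" failure
        errors.append(f"db_pool_size={pool_size} outside expected range 50-500")
    return errors


config = {"db_pool_size": 10, "db_host": "db.internal", "request_timeout_s": 30}
problems = validate_config(config)
if problems:
    raise SystemExit("Blocking deploy:\n" + "\n".join(problems))
```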

SMART Actions

Criteria:

  • Specific: "Add config validation" not "Improve deployments"
  • Measurable: "Reduce MTTR from 2hr to 30min" not "Faster response"
  • Assignable: Name a person, not "team" or "we"
  • Realistic: Achievable given constraints (time, budget, skills)
  • Time-bound: Explicit deadline

Bad: "Improve monitoring" Good: "Add alert for connection pool >80% utilization by Mar 15 (Owner: Alex)"

Prevention vs Detection vs Mitigation

Prevention: Stop incident from occurring

  • Example: Config validation, automated testing, peer review
  • Best ROI but can't prevent everything

Detection: Identify incident quickly

  • Example: Monitoring, alerting, anomaly detection
  • Reduces time to mitigation

Mitigation: Limit impact when incident occurs

  • Example: Circuit breakers, graceful degradation, failover, rollback
  • Reduces blast radius

Balanced Portfolio: Invest in all three

Action Prioritization

Impact vs Effort Matrix:

  • High Impact, Low Effort: Do immediately (Quick wins)
  • High Impact, High Effort: Schedule as project (Strategic)
  • Low Impact, Low Effort: Do if capacity (Fill-ins)
  • Low Impact, High Effort: Skip (Not worth it)

Risk-Based: Prioritize by Likelihood × Impact of recurrence (see the sketch below)

  • Critical incident (total outage) likely to recur → Top priority
  • Minor issue (UI glitch) unlikely to recur → Lower priority
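
A minimal Python sketch of the risk-based scoring; the 1-5 scales and example items are assumptions:

```python
# Minimal sketch: rank corrective actions by likelihood x impact of recurrence.
# The 1-5 scales and the example items are assumptions for illustration.
actions = [
    {"name": "Add config validation to pipeline", "likelihood": 4, "impact": 5},
    {"name": "Fix tooltip alignment", "likelihood": 2, "impact": 1},
    {"name": "Add replica-promotion runbook", "likelihood": 3, "impact": 4},
]

for item in actions:
    item["risk"] = item["likelihood"] * item["impact"]

# Highest risk score first: these are the actions to do immediately.
for item in sorted(actions, key=lambda a: a["risk"], reverse=True):
    print(f'{item["risk"]:>2}  {item["name"]}')
```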

4. Incident Response Patterns

Incident Severity Levels

Sev 1 (Critical):

  • Total outage, data loss, security breach, >50% users affected
  • Response: Immediate, all-hands, exec notification
  • SLA: Acknowledge <15min, resolve <4hr

Sev 2 (High):

  • Partial outage, degraded performance, 10-50% users affected
  • Response: On-call team, incident commander assigned
  • SLA: Acknowledge <30min, resolve <24hr

Sev 3 (Medium):

  • Minor degradation, <10% users affected, non-critical feature down
  • Response: On-call investigates, may defer to business hours
  • SLA: Acknowledge <2hr, resolve <48hr

Sev 4 (Low):

  • Minimal impact, cosmetic issues, internal tools only
  • Response: Create ticket, prioritize in backlog
  • SLA: No SLA, best-effort
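
A minimal Python sketch that encodes the severity levels above as data, so paging or reporting logic can look up response targets (the structure is illustrative; the numbers come from the SLAs above):

```python
# Minimal sketch: the severity/SLA definitions above encoded as data so
# paging or reporting logic can look them up. Structure is illustrative.
from typing import Optional

SEVERITY_SLA = {
    1: {"ack_minutes": 15, "resolve_hours": 4, "response": "all-hands, exec notification"},
    2: {"ack_minutes": 30, "resolve_hours": 24, "response": "on-call team, incident commander"},
    3: {"ack_minutes": 120, "resolve_hours": 48, "response": "on-call, may defer to business hours"},
    4: {"ack_minutes": None, "resolve_hours": None, "response": "ticket, best-effort"},
}


def ack_deadline_minutes(severity: int) -> Optional[int]:
    """Return the acknowledgement SLA in minutes, or None if best-effort."""
    return SEVERITY_SLA[severity]["ack_minutes"]


print(ack_deadline_minutes(2))  # 30
```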

Incident Command Structure (ICS)

Roles:

  • Incident Commander (IC): Coordinates response, makes decisions, external communication
  • Technical Lead: Diagnoses issue, implements fix, coordinates engineers
  • Communication Lead: Status updates, stakeholder briefing, customer communication
  • Scribe: Documents timeline, decisions, actions in incident log

Why Structure: Prevents chaos, clear ownership, scales to large incidents

Rotation: IC role rotates across senior engineers (training, burnout prevention)

Communication Patterns

Internal Updates (Slack, incident channel):

  • Frequency: Every 15-30 min
  • Format: Status, progress, next steps, blockers
  • Example: "Update 14:30: Root cause identified (bad config). Initiating rollback. ETA resolution: 15:00."

External Updates (Status page, social media):

  • Frequency: At detection, every hour, at resolution
  • Tone: Transparent, apologetic, no technical jargon
  • Example: "We're currently experiencing issues with login. Our team is actively working on a fix. We'll update every hour."

Executive Updates:

  • Trigger: Sev 1/2 incidents; notify execs within 30 min
  • Format: Impact, root cause (if known), ETA, what we're doing
  • Frequency: Every 30-60 min until resolved

Postmortem Timing

When to Conduct:

  • Immediately (within 48 hours): Sev 1/2 incidents
  • Weekly batching: Sev 3 incidents (batch review)
  • Monthly: Sev 4 or pattern analysis (recurring issues)
  • Pre-mortem: Before major launch (imagine failure, identify risks)

Who Attends:

  • All incident responders
  • Service owners
  • Optional: Stakeholders, leadership (for major incidents)

5. Postmortem Facilitation

Facilitation Tips

Set Tone:

  • Open: "We're here to learn, not blame. Everything shared is blameless."
  • Emphasize: Focus on systems and processes, not individuals
  • Encourage: Transparency, even uncomfortable truths

Structure Meeting (60-90 min):

  1. Recap timeline (10 min): Walk through what happened
  2. Impact review (5 min): Quantify damage
  3. Root cause brainstorm (20 min): Fishbone or 5 Whys as group
  4. Corrective actions (20 min): Brainstorm actions for each root cause
  5. Prioritization (10 min): Vote on top 5 actions
  6. Assign owners (5 min): Who owns what, by when

Facilitation Techniques:

  • Round-robin: Everyone speaks, no dominance
  • Silent brainstorm: Write ideas on sticky notes, then share (prevents groupthink)
  • Dot voting: Each person gets 3 votes for top priorities
  • Parking lot: Capture off-topic items for later

Red Flags (intervene if you see):

  • Blame language ("X caused this") → Reframe to system focus
  • Defensiveness ("I had to rush because...") → Acknowledge pressure, focus on fixing system
  • Rabbit holes (long technical debates) → Park for offline discussion

Follow-Up

  • Document: Assign someone to write up postmortem (usually IC or Technical Lead)
  • Share: Distribute to team, stakeholders, company-wide (transparency)
  • Present: 10-min presentation in all-hands or team meeting (visibility)
  • Track: Add action items to project tracker, review weekly in standup
  • Close: Postmortem marked complete when all actions done

Postmortem Review Meetings

Monthly Postmortem Review (Team-level):

  • Review all postmortems from last month
  • Identify patterns (repeated root causes)
  • Assess action item completion rate
  • Celebrate improvements

Quarterly Postmortem Review (Org-level):

  • Aggregate data: Incident frequency, severity, MTTR
  • Trends: Are we getting better? (MTTR decreasing? Repeat incidents decreasing?)
  • Investment decisions: Where to invest in reliability

6. Organizational Learning

Learning Metrics

Track:

  • Incident frequency: Total incidents per month (decreasing over time?)
  • MTTR (Mean Time To Resolve): Average time from detection to resolution (improving?)
  • Repeat incidents: Same root cause recurring (learning gap if high)
  • Action completion rate: % of postmortem actions completed (accountability)
  • Near-miss reporting: # of near-misses reported (psychological safety indicator)

Goals:

  • Reduce incident frequency: -10% per quarter
  • Reduce MTTR: -20% per quarter
  • Reduce repeat incidents: <5% of total
  • Action completion: >90%
  • Near-miss reporting: Increasing (good sign)
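
A minimal Python sketch of computing the tracked metrics from incident records; the record format and the numbers are assumptions:

```python
# Minimal sketch: compute learning metrics from incident records.
# The record format and the example values are assumptions for illustration.
from datetime import datetime, timedelta

incidents = [
    {"detected": datetime(2025, 1, 3, 14, 0), "resolved": datetime(2025, 1, 3, 16, 0),
     "root_cause": "config validation missing", "actions_total": 4, "actions_done": 4},
    {"detected": datetime(2025, 1, 19, 9, 30), "resolved": datetime(2025, 1, 19, 10, 0),
     "root_cause": "config validation missing", "actions_total": 2, "actions_done": 1},
    {"detected": datetime(2025, 2, 2, 22, 0), "resolved": datetime(2025, 2, 3, 1, 0),
     "root_cause": "replica promotion failure", "actions_total": 3, "actions_done": 3},
]

# Mean time to resolve: average of (resolved - detected).
mttr = sum((i["resolved"] - i["detected"] for i in incidents), timedelta()) / len(incidents)

# Repeat incident rate: share of incidents whose root cause appears more than once.
causes = [i["root_cause"] for i in incidents]
repeat_rate = sum(causes.count(c) for c in set(causes) if causes.count(c) > 1) / len(incidents)

# Action completion rate across all postmortem action items.
completion = sum(i["actions_done"] for i in incidents) / sum(i["actions_total"] for i in incidents)

print(f"MTTR: {mttr}")                              # 1:50:00
print(f"Repeat incident rate: {repeat_rate:.0%}")   # 67%
print(f"Action completion rate: {completion:.0%}")  # 89%
```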

Knowledge Sharing

Postmortem Database:

  • Centralized repository (Confluence, Notion, wiki)
  • Searchable by: Date, service, root cause category, severity
  • Template: Consistent format for easy scanning

Learning Sessions:

  • Monthly "Failure Fridays": Review interesting postmortems
  • Quarterly "Top 10 Incidents": Review worst incidents, learnings
  • Blameless tone: Celebrate transparency and learning, not just success

Cross-Team Sharing:

  • Share postmortems beyond immediate team
  • Tag related teams (if payment incident, notify fintech team)
  • Prevent: Team A repeating Team B's mistake

Continuous Improvement

Feedback Loops:

  • Postmortem → Corrective actions → Completion → Measure impact → Repeat
  • Quarterly review: Are actions working? (MTTR decreasing? Repeats decreasing?)

Runbook Updates:

  • After every incident: Update runbook with learnings
  • Quarterly: Review all runbooks, refresh stale content
  • Metric: Runbook age (>6 months old = needs refresh)
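
A minimal Python sketch of that runbook-age check; the runbook names and dates are hypothetical:

```python
# Minimal sketch: flag runbooks not updated in roughly 6 months.
# Runbook names and dates are hypothetical examples.
from datetime import date, timedelta

runbooks = {
    "database-failover": date(2024, 7, 1),
    "login-service-rollback": date(2025, 1, 10),
}

STALE_AFTER = timedelta(days=183)  # roughly 6 months

for name, last_updated in runbooks.items():
    if date.today() - last_updated > STALE_AFTER:
        print(f"{name}: last updated {last_updated}, needs refresh")
```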

Process Evolution:

  • Learn from postmortem process itself
  • Survey: "Was this postmortem useful? What would improve it?"
  • Iterate: Improve template, facilitation, tracking

Quick Reference

Blameless Language Patterns

  • "Person caused" → ✓ "Process allowed"
  • "Didn't check" → ✓ "Validation missing"
  • "Mistake" → ✓ "Gap in system"

Root Cause Techniques

  • 5 Whys: Simple incidents, single cause
  • Fishbone: Complex incidents, multiple causes
  • Fault Tree: Identify single points of failure
  • Swiss Cheese: Multiple safeguard failures

Corrective Action Hierarchy

  1. Eliminate (remove hazard)
  2. Substitute (safer alternative)
  3. Engineering controls (safeguards)
  4. Administrative (processes)
  5. Training (education)

Incident Response

  • Sev 1: Total outage, <15min ack, <4hr resolve
  • Sev 2: Partial, <30min ack, <24hr resolve
  • Sev 3: Minor, <2hr ack, <48hr resolve

Success Metrics

  • ✓ Incident frequency decreasing
  • ✓ MTTR improving
  • ✓ Repeat incidents <5%
  • ✓ Action completion >90%
  • ✓ Near-miss reporting increasing