Postmortem Methodology
Table of Contents
- Blameless Culture
- Root Cause Analysis Techniques
- Corrective Action Frameworks
- Incident Response Patterns
- Postmortem Facilitation
- Organizational Learning
1. Blameless Culture
Core Principles
Humans Err, Systems Should Be Resilient
- Assumption: People will make mistakes. Design systems to tolerate errors.
- Example: Deployment requires 2 approvals → reduces error likelihood
- Example: Canary deployments → limits error blast radius
Second Victim Phenomenon
- First victim: Customers affected by incident
- Second victim: Engineer whose action triggered the incident (guilt, anxiety, fear)
- Blameful culture: Second victim punished → hides future issues, leaves company
- Blameless culture: Learn together → transparency, improvement
Just Culture vs Blameless
- Blameless: Focus on system improvements, not individual accountability
- Just Culture: Distinguish reckless behavior (punish) from honest mistakes (learn)
- Example Reckless: Intentionally bypassing safeguards, ignoring warnings
- Example Honest: Misunderstanding, reasonable decision with information at hand
Language Patterns
Blameful → Blameless Reframing:
- ❌ "Engineer X caused outage" → ✓ "Deployment process allowed bad config to reach production"
- ❌ "PM didn't validate" → ✓ "Requirements validation process was missing"
- ❌ "Designer made error" → ✓ "Design review didn't catch accessibility issue"
- ❌ "Careless mistake" → ✓ "System lacked error detection at this step"
System-Focused Questions:
- "What allowed this to happen?" (not "Who did this?")
- "What gaps in our process enabled this?" (not "Why didn't you check?")
- "How can we prevent recurrence?" (not "How do we prevent X from doing this again?")
Building Blameless Culture
Leadership Modeling:
- Leaders admit own mistakes publicly
- Leaders ask "How did our systems fail?" not "Who messed up?"
- Leaders reward transparency (sharing incidents, near-misses)
Psychological Safety:
- Safe to raise concerns, report issues, admit mistakes
- No punishment for honest errors (distinguish from recklessness)
- Celebrate learning, not just success
Metrics to Track:
- Near-miss reporting rate (high = good, means people feel safe reporting)
- Postmortem completion rate (all incidents get postmortem)
- Action item completion rate (learnings turn into improvements)
- Repeat incident rate (same root cause recurring = not learning)
2. Root Cause Analysis Techniques
5 Whys
Process:
- State problem clearly
- Ask "Why?" → Answer
- Ask "Why?" on answer → Next answer
- Repeat until root cause (fixable at system/org level)
- Typically 5 iterations, but can be 3-7
Example:
Problem: Website down
- Why? Database connection failed
- Why? Connection pool exhausted
- Why? Pool size too small (10 vs 100 needed)
- Why? Config template had wrong value
- Why? No validation in deployment pipeline
Root Cause: Deployment pipeline lacked config validation (fixable!)
Tips:
- Avoid "human error" as an answer; keep asking why the error was possible
- Stop when you reach an actionable system change
- Multiple causal paths may emerge; explore them all (one chain is sketched below)
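To make the chain auditable, here is a minimal sketch of recording the website-down walkthrough above as data; the `FiveWhys` class and its guard against terminal "human error" answers are illustrative, not a prescribed tool:

```python
from dataclasses import dataclass, field

@dataclass
class FiveWhys:
    """Record a 5 Whys chain; the last answer is the candidate root cause."""
    problem: str
    whys: list[str] = field(default_factory=list)

    def ask(self, answer: str) -> None:
        # Per the tips above: "human error" is never a terminal answer.
        if "human error" in answer.lower():
            raise ValueError("Keep asking why the error was possible")
        self.whys.append(answer)

    @property
    def root_cause(self) -> str:
        return self.whys[-1] if self.whys else self.problem

chain = FiveWhys("Website down")
for answer in [
    "Database connection failed",
    "Connection pool exhausted",
    "Pool size too small (10 vs 100 needed)",
    "Config template had wrong value",
    "No validation in deployment pipeline",
]:
    chain.ask(answer)
print(chain.root_cause)  # -> No validation in deployment pipeline
```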
Fishbone Diagram (Ishikawa)
Structure: Fish skeleton with problem at head, causes as bones
Categories (6M):
- Methods (Processes): Missing step, unclear procedure, no validation
- Machines (Technology): System limits, infrastructure failures, bugs
- Materials (Data/Inputs): Bad data, missing info, incorrect assumptions
- Measurements (Monitoring): Blind spots, delayed detection, wrong metrics
- Mother Nature (Environment): External dependencies, third-party failures, regulations
- Manpower (People): Training gaps, staffing, knowledge silos (use with care: focus on systemic issues, not individuals)
When to Use: Complex incidents with multiple contributing factors
Process:
- Draw fish skeleton with problem at head
- Brainstorm causes in each category
- Vote on most likely root causes
- Investigate top 3-5 causes
- Confirm with evidence (logs, metrics)
Fault Tree Analysis
Structure: Tree with failure at top, causes below, gates connecting
Gates:
- AND Gate: All inputs required for failure (redundancy protects)
- OR Gate: Any input sufficient for failure (single point of failure)
Example:
Service Down (OR gate)
├── Database Failure (AND gate)
│   ├── Primary DB down
│   └── Replica promotion failed
└── Application Crash (OR gate)
    ├── Memory leak
    ├── Uncaught exception
    └── OOM killer triggered
Use: Identify single points of failure, prioritize redundancy
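A fault tree like the one above maps directly onto code. A minimal sketch, with an assumed `Gate` class whose leaves are booleans (True meaning that failure occurred):

```python
from dataclasses import dataclass

@dataclass
class Gate:
    kind: str     # "AND" (all inputs required) or "OR" (any input sufficient)
    inputs: list  # child Gates, or bools marking whether a leaf failure occurred

    def failed(self) -> bool:
        states = [i.failed() if isinstance(i, Gate) else i for i in self.inputs]
        return all(states) if self.kind == "AND" else any(states)

# The example tree: primary DB is down, but replica promotion worked.
service_down = Gate("OR", [
    Gate("AND", [True, False]),         # Database Failure branch
    Gate("OR", [False, False, False]),  # Application Crash branch
])
print(service_down.failed())  # False: the AND gate (redundancy) held
```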
Swiss Cheese Model
Concept: Layers of defense (safeguards) with holes (gaps)
Incident occurs when: Holes align across all layers
Example Layers:
- Design: Architecture with failover
- Training: Team knows how to respond
- Process: Peer review required
- Technology: Automated tests
- Monitoring: Alerts fire when issue occurs
Analysis: Identify which layers had holes for this incident, then plug them
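As a back-of-envelope check: if the layers are assumed independent and each has some probability of missing an issue, the chance of full alignment is their product. The probabilities below are invented for illustration:

```python
# Assumed probability that each defense layer misses the issue (a "hole").
layers = {
    "design_failover": 0.10,
    "training": 0.30,
    "peer_review": 0.20,
    "automated_tests": 0.15,
    "monitoring": 0.25,
}

p_incident = 1.0
for p_miss in layers.values():
    p_incident *= p_miss  # the holes must align across every layer

print(f"P(incident slips through all layers) = {p_incident:.6f}")  # 0.000225
```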
Contributing Factors Framework
Categorize causes:
Immediate Cause: Proximate trigger
- Example: Config with wrong value deployed
Enabling Causes: Allowed immediate cause
- Example: No config validation, no peer review
Systemic Causes: Organizational issues
- Example: Pressure to ship fast, understaffed team, no time for rigor
Action: Address all levels, not just the immediate cause
3. Corrective Action Frameworks
Hierarchy of Controls
From Most to Least Effective:
1. Elimination: Remove hazard entirely
- Example: Deprecate risky feature, sunset legacy system
- Most effective but often not feasible
2. Substitution: Replace with safer alternative
- Example: Use managed service vs self-hosted database
- Reduces risk substantially
3. Engineering Controls: Add safeguards
- Example: Rate limits, circuit breakers, automated testing, canary deployments
- Effective and feasible
4. Administrative Controls: Improve processes
- Example: Runbooks, checklists, peer review, approval gates
- Relies on compliance
5. Training: Educate people
- Example: Onboarding, workshops, documentation
- Least effective alone, use with other controls
Apply: Start at top of hierarchy, work down until feasible solution found
SMART Actions
Criteria:
- Specific: "Add config validation" not "Improve deployments"
- Measurable: "Reduce MTTR from 2hr to 30min" not "Faster response"
- Assignable: Name person, not "team" or "we"
- Realistic: Given constraints (time, budget, skills)
- Time-bound: Explicit deadline
Bad: "Improve monitoring" Good: "Add alert for connection pool >80% utilization by Mar 15 (Owner: Alex)"
Prevention vs Detection vs Mitigation
Prevention: Stop incident from occurring
- Example: Config validation, automated testing, peer review
- Best ROI but can't prevent everything
Detection: Identify incident quickly
- Example: Monitoring, alerting, anomaly detection
- Reduces time to mitigation
Mitigation: Limit impact when incident occurs
- Example: Circuit breakers, graceful degradation, failover, rollback
- Reduces blast radius
Balanced Portfolio: Invest in all three
Action Prioritization
Impact vs Effort Matrix:
- High Impact, Low Effort: Do immediately (Quick wins)
- High Impact, High Effort: Schedule as project (Strategic)
- Low Impact, Low Effort: Do if capacity (Fill-ins)
- Low Impact, High Effort: Skip (Not worth it)
Risk-Based: Prioritize by Likelihood × Impact of recurrence
- Critical incident (total outage) likely to recur → Top priority
- Minor issue (UI glitch) unlikely to recur → Lower priority
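A minimal sketch of both prioritization schemes, assuming 1-5 scales for likelihood, impact, and effort:

```python
def risk_score(likelihood: float, impact: float) -> float:
    """Likelihood and impact of recurrence on 1-5 scales; higher = do first."""
    return likelihood * impact

def bucket(impact: float, effort: float, threshold: float = 3.0) -> str:
    """Place an action in the Impact vs Effort matrix (1-5 scales assumed)."""
    if impact >= threshold:
        return "Quick win" if effort < threshold else "Strategic project"
    return "Fill-in" if effort < threshold else "Skip"

print(bucket(impact=5, effort=1))          # Quick win
print(risk_score(likelihood=4, impact=5))  # 20 -> top priority
```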
4. Incident Response Patterns
Incident Severity Levels
Sev 1 (Critical):
- Total outage, data loss, security breach, >50% users affected
- Response: Immediate, all-hands, exec notification
- SLA: Acknowledge <15min, resolve <4hr
Sev 2 (High):
- Partial outage, degraded performance, 10-50% users affected
- Response: On-call team, incident commander assigned
- SLA: Acknowledge <30min, resolve <24hr
Sev 3 (Medium):
- Minor degradation, <10% users affected, non-critical feature down
- Response: On-call investigates, may defer to business hours
- SLA: Acknowledge <2hr, resolve <48hr
Sev 4 (Low):
- Minimal impact, cosmetic issues, internal tools only
- Response: Create ticket, prioritize in backlog
- SLA: No SLA, best-effort
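These SLAs can live in code so paging logic reads from one table. A sketch mirroring the values above; the `SevPolicy` shape is an assumption, and exec notification follows the Sev 1/2 trigger described under Communication Patterns below:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SevPolicy:
    ack_minutes: int | None    # None = no SLA (best-effort)
    resolve_hours: int | None
    notify_exec: bool          # executive updates are triggered for Sev 1/2

SLA = {
    1: SevPolicy(ack_minutes=15, resolve_hours=4, notify_exec=True),
    2: SevPolicy(ack_minutes=30, resolve_hours=24, notify_exec=True),
    3: SevPolicy(ack_minutes=120, resolve_hours=48, notify_exec=False),
    4: SevPolicy(ack_minutes=None, resolve_hours=None, notify_exec=False),
}

print(SLA[1])  # SevPolicy(ack_minutes=15, resolve_hours=4, notify_exec=True)
```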
Incident Command Structure (ICS)
Roles:
- Incident Commander (IC): Coordinates response, makes decisions, external communication
- Technical Lead: Diagnoses issue, implements fix, coordinates engineers
- Communication Lead: Status updates, stakeholder briefing, customer communication
- Scribe: Documents timeline, decisions, actions in incident log
Why Structure: Prevents chaos, clear ownership, scales to large incidents
Rotation: IC role rotates across senior engineers (training, burnout prevention)
Communication Patterns
Internal Updates (Slack, incident channel):
- Frequency: Every 15-30 min
- Format: Status, progress, next steps, blockers
- Example: "Update 14:30: Root cause identified (bad config). Initiating rollback. ETA resolution: 15:00."
External Updates (Status page, social media):
- Frequency: At detection, every hour, at resolution
- Tone: Transparent, apologetic, no technical jargon
- Example: "We're currently experiencing issues with login. Our team is actively working on a fix. We'll update every hour."
Executive Updates:
- Trigger: Sev 1/2 within 30 min
- Format: Impact, root cause (if known), ETA, what we're doing
- Frequency: Every 30-60 min until resolved
Postmortem Timing
When to Conduct:
- Immediately (within 48 hours): Sev 1/2 incidents
- Weekly batching: Sev 3 incidents (batch review)
- Monthly: Sev 4 or pattern analysis (recurring issues)
- Pre-mortem: Before major launch (imagine failure, identify risks)
Who Attends:
- All incident responders
- Service owners
- Optional: Stakeholders, leadership (for major incidents)
5. Postmortem Facilitation
Facilitation Tips
Set Tone:
- Open: "We're here to learn, not blame. Everything shared is blameless."
- Emphasize: Focus on systems and processes, not individuals
- Encourage: Transparency, even uncomfortable truths
Structure Meeting (60-90 min):
- Recap timeline (10 min): Walk through what happened
- Impact review (5 min): Quantify damage
- Root cause brainstorm (20 min): Fishbone or 5 Whys as group
- Corrective actions (20 min): Brainstorm actions for each root cause
- Prioritization (10 min): Vote on top 5 actions
- Assign owners (5 min): Who owns what, by when
Facilitation Techniques:
- Round-robin: Everyone speaks, no dominance
- Silent brainstorm: Write ideas on sticky notes, then share (prevents groupthink)
- Dot voting: Each person gets 3 votes for top priorities
- Parking lot: Capture off-topic items for later
Red Flags (intervene if you see):
- Blame language ("X caused this") → Reframe to system focus
- Defensiveness ("I had to rush because...") → Acknowledge pressure, focus on fixing system
- Rabbit holes (long technical debates) → Park for offline discussion
Follow-Up
- Document: Assign someone to write up the postmortem (usually the IC or Technical Lead)
- Share: Distribute to team, stakeholders, company-wide (transparency)
- Present: 10-min presentation in all-hands or team meeting (visibility)
- Track: Add action items to the project tracker; review weekly in standup
- Close: Mark the postmortem complete when all actions are done
Postmortem Review Meetings
Monthly Postmortem Review (Team-level):
- Review all postmortems from last month
- Identify patterns (repeated root causes)
- Assess action item completion rate
- Celebrate improvements
Quarterly Postmortem Review (Org-level):
- Aggregate data: Incident frequency, severity, MTTR
- Trends: Are we getting better? (MTTR decreasing? Repeat incidents decreasing?)
- Investment decisions: Where to invest in reliability
6. Organizational Learning
Learning Metrics
Track:
- Incident frequency: Total incidents per month (decreasing over time?)
- MTTR (Mean Time To Resolve): Average time from detection to resolution (improving?)
- Repeat incidents: Same root cause recurring (learning gap if high)
- Action completion rate: % of postmortem actions completed (accountability)
- Near-miss reporting: # of near-misses reported (psychological safety indicator)
Goals:
- Reduce incident frequency: -10% per quarter
- Reduce MTTR: -20% per quarter
- Reduce repeat incidents: <5% of total
- Action completion: >90%
- Near-miss reporting: Increasing (good sign)
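A minimal sketch of computing these metrics from incident records; the `Incident` fields are assumptions about what the postmortem database stores:

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

@dataclass
class Incident:
    detected: datetime
    resolved: datetime
    root_cause: str      # category label, used for repeat detection
    actions_total: int   # postmortem action items created
    actions_done: int    # action items completed

def mttr_hours(incidents: list[Incident]) -> float:
    """Mean time from detection to resolution, in hours."""
    return mean((i.resolved - i.detected).total_seconds() / 3600 for i in incidents)

def repeat_rate(incidents: list[Incident]) -> float:
    """Share of incidents repeating an earlier root cause (goal: <5%)."""
    counts = Counter(i.root_cause for i in incidents)
    repeats = sum(c - 1 for c in counts.values())  # occurrences beyond the first
    return repeats / len(incidents)

def action_completion(incidents: list[Incident]) -> float:
    """Completed postmortem actions over created ones (goal: >90%)."""
    total = sum(i.actions_total for i in incidents)
    return sum(i.actions_done for i in incidents) / total if total else 1.0
```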
Knowledge Sharing
Postmortem Database:
- Centralized repository (Confluence, Notion, wiki)
- Searchable by: Date, service, root cause category, severity
- Template: Consistent format for easy scanning
Learning Sessions:
- Monthly "Failure Fridays": Review interesting postmortems
- Quarterly "Top 10 Incidents": Review worst incidents, learnings
- Blameless tone: Celebrate transparency, not just success
Cross-Team Sharing:
- Share postmortems beyond immediate team
- Tag related teams (if payment incident, notify fintech team)
- Prevent Team A from repeating Team B's mistake
Continuous Improvement
Feedback Loops:
- Postmortem → Corrective actions → Completion → Measure impact → Repeat
- Quarterly review: Are actions working? (MTTR decreasing? Repeats decreasing?)
Runbook Updates:
- After every incident: Update runbook with learnings
- Quarterly: Review all runbooks, refresh stale content
- Metric: Runbook age (>6 months old = needs refresh)
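A sketch of the staleness check, assuming a simple name-to-date inventory of runbooks:

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=183)  # ~6 months, per the metric above

# Hypothetical inventory: runbook name -> last-updated date
runbooks = {
    "db-failover": date(2024, 1, 10),
    "rollback": date(2024, 11, 2),
}

stale = [name for name, updated in runbooks.items()
         if date.today() - updated > MAX_AGE]
print("Needs refresh:", stale)
```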
Process Evolution:
- Learn from postmortem process itself
- Survey: "Was this postmortem useful? What would improve it?"
- Iterate: Improve template, facilitation, tracking
Quick Reference
Blameless Language Patterns
- ❌ "Person caused" → ✓ "Process allowed"
- ❌ "Didn't check" → ✓ "Validation missing"
- ❌ "Mistake" → ✓ "Gap in system"
Root Cause Techniques
- 5 Whys: Simple incidents, single cause
- Fishbone: Complex incidents, multiple causes
- Fault Tree: Identify single points of failure
- Swiss Cheese: Multiple safeguard failures
Corrective Action Hierarchy
- Eliminate (remove hazard)
- Substitute (safer alternative)
- Engineering controls (safeguards)
- Administrative (processes)
- Training (education)
Incident Response
- Sev 1: Total outage, <15min ack, <4hr resolve
- Sev 2: Partial, <30min ack, <24hr resolve
- Sev 3: Minor, <2hr ack, <48hr resolve
Success Metrics
- ✓ Incident frequency decreasing
- ✓ MTTR improving
- ✓ Repeat incidents <5%
- ✓ Action completion >90%
- ✓ Near-miss reporting increasing