skills/postmortem/resources/methodology.md
# Postmortem Methodology

## Table of Contents

1. [Blameless Culture](#1-blameless-culture)
2. [Root Cause Analysis Techniques](#2-root-cause-analysis-techniques)
3. [Corrective Action Frameworks](#3-corrective-action-frameworks)
4. [Incident Response Patterns](#4-incident-response-patterns)
5. [Postmortem Facilitation](#5-postmortem-facilitation)
6. [Organizational Learning](#6-organizational-learning)

---
## 1. Blameless Culture

### Core Principles

**Humans Err, Systems Should Be Resilient**
- Assumption: People will make mistakes. Design systems to tolerate errors.
- Example: Deployment requires 2 approvals → reduces error likelihood
- Example: Canary deployments → limits error blast radius

**Second Victim Phenomenon**
- First victim: Customers affected by the incident
- Second victim: Engineer whose action triggered the incident (guilt, anxiety, fear)
- Blameful culture: Second victim punished → hides future issues, leaves company
- Blameless culture: Learn together → transparency, improvement

**Just Culture vs Blameless**
- **Blameless**: Focus on system improvements, not individual accountability
- **Just Culture**: Distinguish reckless behavior (punish) from honest mistakes (learn)
- Reckless example: Intentionally bypassing safeguards, ignoring warnings
- Honest example: Misunderstanding, or a reasonable decision given the information at hand

### Language Patterns

**Blameful → Blameless Reframing**:
- ❌ "Engineer X caused outage" → ✓ "Deployment process allowed bad config to reach production"
- ❌ "PM didn't validate" → ✓ "Requirements validation process was missing"
- ❌ "Designer made error" → ✓ "Design review didn't catch accessibility issue"
- ❌ "Careless mistake" → ✓ "System lacked error detection at this step"

**System-Focused Questions**:
- "What allowed this to happen?" (not "Who did this?")
- "What gaps in our process enabled this?" (not "Why didn't you check?")
- "How can we prevent recurrence?" (not "How do we prevent X from doing this again?")

### Building Blameless Culture

**Leadership Modeling**:
- Leaders admit their own mistakes publicly
- Leaders ask "How did our systems fail?" not "Who messed up?"
- Leaders reward transparency (sharing incidents, near-misses)

**Psychological Safety**:
- Safe to raise concerns, report issues, admit mistakes
- No punishment for honest errors (distinguished from recklessness)
- Celebrate learning, not just success

**Metrics to Track**:
- Near-miss reporting rate (high = good, means people feel safe reporting)
- Postmortem completion rate (all incidents get a postmortem)
- Action item completion rate (learnings turn into improvements)
- Repeat incident rate (same root cause recurring = not learning)

---
## 2. Root Cause Analysis Techniques

### 5 Whys

**Process**:
1. State the problem clearly
2. Ask "Why?" → Answer
3. Ask "Why?" of that answer → Next answer
4. Repeat until you reach a root cause (fixable at the system/org level)
5. Typically 5 iterations, but can be 3-7

**Example**:
Problem: Website down
1. Why? Database connection failed
2. Why? Connection pool exhausted
3. Why? Pool size too small (10 vs 100 needed)
4. Why? Config template had wrong value
5. Why? No validation in deployment pipeline

**Root Cause**: Deployment pipeline lacked config validation (fixable!)

**Tips**:
- Avoid "human error" as an answer - keep asking why the error was possible
- Stop when you reach an actionable system change
- Multiple causal paths may emerge - explore them all
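The why-chain above can be captured as plain data so the analysis travels with the postmortem write-up. A minimal sketch using only the standard library; the `five_whys` helper and record shape are illustrative, not a real tool:

```python
# Illustrative sketch: record a 5 Whys chain as data. The last answer
# is treated as the root cause (the actionable, system-level change).

def five_whys(problem: str, answers: list[str]) -> dict:
    """Build a why-chain from the problem statement and successive answers."""
    chain = []
    cause = problem
    for why, answer in enumerate(answers, start=1):
        chain.append({"why": why, "cause": cause, "answer": answer})
        cause = answer
    return {"problem": problem, "chain": chain, "root_cause": answers[-1]}

analysis = five_whys(
    "Website down",
    [
        "Database connection failed",
        "Connection pool exhausted",
        "Pool size too small (10 vs 100 needed)",
        "Config template had wrong value",
        "No validation in deployment pipeline",
    ],
)
print(analysis["root_cause"])  # the fixable, system-level cause
```

Storing the chain (rather than just the conclusion) lets a later reader audit each step of the reasoning.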
### Fishbone Diagram (Ishikawa)

**Structure**: Fish skeleton with the problem at the head, causes as the bones

**Categories** (6M):
- **Methods** (Processes): Missing step, unclear procedure, no validation
- **Machines** (Technology): System limits, infrastructure failures, bugs
- **Materials** (Data/Inputs): Bad data, missing info, incorrect assumptions
- **Measurements** (Monitoring): Blind spots, delayed detection, wrong metrics
- **Mother Nature** (Environment): External dependencies, third-party failures, regulations
- **Manpower** (People): Training gaps, staffing, knowledge silos (careful - focus on systemic issues)

**When to Use**: Complex incidents with multiple contributing factors

**Process**:
1. Draw the fish skeleton with the problem at the head
2. Brainstorm causes in each category
3. Vote on the most likely root causes
4. Investigate the top 3-5 causes
5. Confirm with evidence (logs, metrics)
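Steps 2-4 above amount to a brainstorm plus a vote tally. A hedged sketch of that tally; the cause names and votes are made up for illustration:

```python
# Illustrative sketch: tally dot votes on brainstormed fishbone causes
# to pick the top candidates to investigate (steps 3-4).
from collections import Counter

causes = {  # 6M category -> brainstormed causes
    "Methods": ["No config validation", "No peer review"],
    "Machines": ["Connection pool too small"],
    "Measurements": ["No alert on pool utilization"],
}
votes = [  # each participant spends 3 dot votes
    "No config validation", "No config validation", "Connection pool too small",
    "No config validation", "No alert on pool utilization", "Connection pool too small",
]
top = Counter(votes).most_common(3)
for cause, n in top:
    print(f"{n} votes: {cause}")
```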
### Fault Tree Analysis

**Structure**: Tree with the failure at the top, causes below, and logic gates connecting them

**Gates**:
- **AND Gate**: All inputs required for failure (redundancy protects)
- **OR Gate**: Any input sufficient for failure (single point of failure)

**Example**:
```
Service Down (OR gate)
├── Database Failure (AND gate)
│   ├── Primary DB down
│   └── Replica promotion failed
└── Application Crash (OR gate)
    ├── Memory leak
    ├── Uncaught exception
    └── OOM killer triggered
```

**Use**: Identify single points of failure, prioritize redundancy
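The example tree can be evaluated mechanically. A minimal sketch, assuming gates are `("AND" | "OR", children)` tuples and leaves are booleans meaning "this event occurred":

```python
# Minimal fault-tree evaluation: AND gates need every child to fail,
# OR gates need any one child to fail.

def failed(node) -> bool:
    if isinstance(node, bool):      # leaf event
        return node
    gate, children = node
    results = [failed(c) for c in children]
    return all(results) if gate == "AND" else any(results)

# Replica promotion succeeded (AND branch held), but an uncaught
# exception crashed the app (OR branch - a single point of failure):
tree = ("OR", [
    ("AND", [True, False]),         # Database Failure: primary down, promotion OK
    ("OR", [False, True, False]),   # Application Crash: uncaught exception
])
print(failed(tree))  # True
```

Note how the AND branch shows redundancy working: one failed input is not enough.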
### Swiss Cheese Model

**Concept**: Layers of defense (safeguards), each with holes (gaps)

**Incident occurs when**: Holes align across all layers

**Example Layers**:
1. Design: Architecture with failover
2. Training: Team knows how to respond
3. Process: Peer review required
4. Technology: Automated tests
5. Monitoring: Alerts fire when an issue occurs

**Analysis**: Identify which layers had holes for this incident, plug the holes
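The model can be sketched as a checklist: a failure reaches users only if every layer had a hole. The layer names and verdicts below are illustrative:

```python
# Illustrative sketch: each layer either stopped the failure (True) or
# had a hole (False). The incident reaches users only when all holes align.
layers = {
    "Design (failover)": False,             # failover used the same bad config
    "Process (peer review)": False,         # review skipped under time pressure
    "Technology (automated tests)": False,  # no test covered pool sizing
    "Monitoring (alerts)": False,           # no alert on pool utilization
}
incident_reached_users = not any(layers.values())
holes = [name for name, held in layers.items() if not held]
print(incident_reached_users, holes)
```

The `holes` list is exactly the "plug the holes" worklist for corrective actions.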
### Contributing Factors Framework

**Categorize causes**:

**Immediate Cause**: The proximate trigger
- Example: Config with wrong value deployed

**Enabling Causes**: What allowed the immediate cause
- Example: No config validation, no peer review

**Systemic Causes**: Organizational issues
- Example: Pressure to ship fast, understaffed team, no time for rigor

**Action**: Address all levels, not just the immediate cause

---
## 3. Corrective Action Frameworks

### Hierarchy of Controls

**From Most to Least Effective**:

1. **Elimination**: Remove the hazard entirely
   - Example: Deprecate risky feature, sunset legacy system
   - Most effective but often not feasible

2. **Substitution**: Replace with a safer alternative
   - Example: Use managed service vs self-hosted database
   - Reduces risk substantially

3. **Engineering Controls**: Add safeguards
   - Example: Rate limits, circuit breakers, automated testing, canary deployments
   - Effective and feasible

4. **Administrative Controls**: Improve processes
   - Example: Runbooks, checklists, peer review, approval gates
   - Relies on compliance

5. **Training**: Educate people
   - Example: Onboarding, workshops, documentation
   - Least effective alone, use with other controls

**Apply**: Start at the top of the hierarchy, work down until a feasible solution is found
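"Start at the top, work down" is a simple first-match walk. An illustrative sketch; the control names and `feasible` sets are assumptions for the example:

```python
# Illustrative sketch: pick the most effective control on the hierarchy
# that is feasible for this incident's corrective actions.
HIERARCHY = ["elimination", "substitution", "engineering", "administrative", "training"]

def choose_control(feasible: set[str]) -> str:
    """Return the highest-ranked control that is feasible."""
    for control in HIERARCHY:
        if control in feasible:
            return control
    raise ValueError("no feasible control proposed")

# Can't remove or replace the deploy pipeline, but can add validation
# (engineering), update the runbook (administrative), or train the team:
print(choose_control({"engineering", "administrative", "training"}))  # engineering
```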
### SMART Actions

**Criteria**:
- **Specific**: "Add config validation" not "Improve deployments"
- **Measurable**: "Reduce MTTR from 2hr to 30min" not "Faster response"
- **Assignable**: Name a person, not "team" or "we"
- **Realistic**: Given constraints (time, budget, skills)
- **Time-bound**: Explicit deadline

**Bad**: "Improve monitoring"
**Good**: "Add alert for connection pool >80% utilization by Mar 15 (Owner: Alex)"
### Prevention vs Detection vs Mitigation

**Prevention**: Stop the incident from occurring
- Example: Config validation, automated testing, peer review
- Best ROI but can't prevent everything

**Detection**: Identify the incident quickly
- Example: Monitoring, alerting, anomaly detection
- Reduces time to mitigation

**Mitigation**: Limit impact when an incident occurs
- Example: Circuit breakers, graceful degradation, failover, rollback
- Reduces blast radius

**Balanced Portfolio**: Invest in all three
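To make the mitigation category concrete, here is a minimal circuit-breaker sketch: after N consecutive failures the breaker opens and further calls fail fast instead of piling onto a struggling dependency. Illustrative only; production breakers also add timeouts and a half-open recovery state:

```python
# Minimal circuit-breaker sketch: open after N consecutive failures,
# then fail fast until reset. A success resets the failure count.
class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.failure_threshold

    def call(self, fn):
        if self.open:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0   # success resets the count
        return result

breaker = CircuitBreaker()
for _ in range(3):
    try:
        breaker.call(lambda: 1 / 0)   # dependency keeps failing
    except ZeroDivisionError:
        pass
print(breaker.open)  # True - further calls now fail fast
```

Failing fast limits blast radius: callers get an immediate error they can degrade gracefully on, rather than queueing behind a dead dependency.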
### Action Prioritization

**Impact vs Effort Matrix**:
- **High Impact, Low Effort**: Do immediately (Quick wins)
- **High Impact, High Effort**: Schedule as a project (Strategic)
- **Low Impact, Low Effort**: Do if capacity allows (Fill-ins)
- **Low Impact, High Effort**: Skip (Not worth it)

**Risk-Based**: Prioritize by Likelihood × Impact of recurrence
- Critical incident (total outage) likely to recur → Top priority
- Minor issue (UI glitch) unlikely to recur → Lower priority
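The risk-based rule is a one-line score and a sort. An illustrative sketch; the actions and their 1-5 scores are assumptions for the example:

```python
# Illustrative sketch: rank corrective actions by likelihood x impact
# of the recurrence each one prevents (scores on a 1-5 scale).
actions = [
    {"action": "Add config validation to pipeline", "likelihood": 4, "impact": 5},
    {"action": "Fix UI glitch on settings page",    "likelihood": 1, "impact": 1},
    {"action": "Raise connection pool size",        "likelihood": 3, "impact": 4},
]
for a in actions:
    a["risk"] = a["likelihood"] * a["impact"]

ranked = sorted(actions, key=lambda a: a["risk"], reverse=True)
for a in ranked:
    print(f'{a["risk"]:>2}  {a["action"]}')
```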
---
## 4. Incident Response Patterns

### Incident Severity Levels

**Sev 1 (Critical)**:
- Total outage, data loss, security breach, >50% of users affected
- Response: Immediate, all-hands, exec notification
- SLA: Acknowledge <15min, resolve <4hr

**Sev 2 (High)**:
- Partial outage, degraded performance, 10-50% of users affected
- Response: On-call team, incident commander assigned
- SLA: Acknowledge <30min, resolve <24hr

**Sev 3 (Medium)**:
- Minor degradation, <10% of users affected, non-critical feature down
- Response: On-call investigates, may defer to business hours
- SLA: Acknowledge <2hr, resolve <48hr

**Sev 4 (Low)**:
- Minimal impact, cosmetic issues, internal tools only
- Response: Create a ticket, prioritize in the backlog
- SLA: No SLA, best-effort
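The SLA table above can be encoded so tooling computes deadlines when an incident is opened. The values mirror the table; the `deadlines` helper itself is hypothetical:

```python
# Sketch: severity -> (acknowledge within, resolve within); None = best-effort.
from datetime import datetime, timedelta

SLA = {
    1: (timedelta(minutes=15), timedelta(hours=4)),
    2: (timedelta(minutes=30), timedelta(hours=24)),
    3: (timedelta(hours=2),    timedelta(hours=48)),
    4: None,
}

def deadlines(severity: int, opened: datetime):
    """Return (ack deadline, resolve deadline), or None for best-effort."""
    if SLA[severity] is None:
        return None
    ack, resolve = SLA[severity]
    return opened + ack, opened + resolve

opened = datetime(2025, 3, 1, 14, 0)
print(deadlines(1, opened))  # Sev 1: acknowledge by 14:15, resolve by 18:00
```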
### Incident Command Structure (ICS)

**Roles**:
- **Incident Commander (IC)**: Coordinates the response, makes decisions, handles external communication
- **Technical Lead**: Diagnoses the issue, implements the fix, coordinates engineers
- **Communication Lead**: Status updates, stakeholder briefing, customer communication
- **Scribe**: Documents timeline, decisions, and actions in the incident log

**Why Structure**: Prevents chaos, gives clear ownership, scales to large incidents

**Rotation**: The IC role rotates across senior engineers (training, burnout prevention)

### Communication Patterns

**Internal Updates** (Slack, incident channel):
- Frequency: Every 15-30 min
- Format: Status, progress, next steps, blockers
- Example: "Update 14:30: Root cause identified (bad config). Initiating rollback. ETA resolution: 15:00."

**External Updates** (Status page, social media):
- Frequency: At detection, every hour, at resolution
- Tone: Transparent, apologetic, no technical jargon
- Example: "We're currently experiencing issues with login. Our team is actively working on a fix. We'll update every hour."

**Executive Updates**:
- Trigger: Sev 1/2, within 30 min
- Format: Impact, root cause (if known), ETA, what we're doing
- Frequency: Every 30-60 min until resolved

### Postmortem Timing

**When to Conduct**:
- **Immediately** (within 48 hours): Sev 1/2 incidents
- **Weekly batching**: Sev 3 incidents (batch review)
- **Monthly**: Sev 4 or pattern analysis (recurring issues)
- **Pre-mortem**: Before a major launch (imagine the failure, identify risks)

**Who Attends**:
- All incident responders
- Service owners
- Optional: Stakeholders, leadership (for major incidents)

---
## 5. Postmortem Facilitation

### Facilitation Tips

**Set the Tone**:
- Open with: "We're here to learn, not blame. Everything shared is blameless."
- Emphasize: Focus on systems and processes, not individuals
- Encourage: Transparency, even uncomfortable truths

**Structure the Meeting** (60-90 min):
1. **Recap timeline** (10 min): Walk through what happened
2. **Impact review** (5 min): Quantify the damage
3. **Root cause brainstorm** (20 min): Fishbone or 5 Whys as a group
4. **Corrective actions** (20 min): Brainstorm actions for each root cause
5. **Prioritization** (10 min): Vote on the top 5 actions
6. **Assign owners** (5 min): Who owns what, by when

**Facilitation Techniques**:
- **Round-robin**: Everyone speaks, no one dominates
- **Silent brainstorm**: Write ideas on sticky notes, then share (prevents groupthink)
- **Dot voting**: Each person gets 3 votes for top priorities
- **Parking lot**: Capture off-topic items for later

**Red Flags** (intervene if you see them):
- Blame language ("X caused this") → Reframe to a system focus
- Defensiveness ("I had to rush because...") → Acknowledge the pressure, focus on fixing the system
- Rabbit holes (long technical debates) → Park for offline discussion

### Follow-Up

**Document**: Assign someone to write up the postmortem (usually the IC or Technical Lead)
**Share**: Distribute to the team, stakeholders, or company-wide (transparency)
**Present**: 10-min presentation in all-hands or a team meeting (visibility)
**Track**: Add action items to the project tracker, review weekly in standup
**Close**: The postmortem is marked complete when all actions are done

### Postmortem Review Meetings

**Monthly Postmortem Review** (Team-level):
- Review all postmortems from the last month
- Identify patterns (repeated root causes)
- Assess the action item completion rate
- Celebrate improvements

**Quarterly Postmortem Review** (Org-level):
- Aggregate data: Incident frequency, severity, MTTR
- Trends: Are we getting better? (MTTR decreasing? Repeat incidents decreasing?)
- Investment decisions: Where to invest in reliability

---
## 6. Organizational Learning

### Learning Metrics

**Track**:
- **Incident frequency**: Total incidents per month (decreasing over time?)
- **MTTR (Mean Time To Resolve)**: Average time from detection to resolution (improving?)
- **Repeat incidents**: Same root cause recurring (a learning gap if high)
- **Action completion rate**: % of postmortem actions completed (accountability)
- **Near-miss reporting**: Number of near-misses reported (psychological safety indicator)

**Goals**:
- Reduce incident frequency: -10% per quarter
- Reduce MTTR: -20% per quarter
- Reduce repeat incidents: <5% of total
- Action completion: >90%
- Near-miss reporting: Increasing (a good sign)
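Two of these metrics can be computed directly from incident records. A hedged sketch; the record shape (`detected`/`resolved` timestamps, a `root_cause` tag) is an assumption for illustration:

```python
# Illustrative sketch: compute MTTR and the repeat-incident rate from
# a month of incident records.
from datetime import datetime, timedelta
from collections import Counter

incidents = [
    {"detected": datetime(2025, 3, 1, 14, 0), "resolved": datetime(2025, 3, 1, 16, 0),
     "root_cause": "missing config validation"},
    {"detected": datetime(2025, 3, 8, 9, 0), "resolved": datetime(2025, 3, 8, 9, 30),
     "root_cause": "expired TLS certificate"},
    {"detected": datetime(2025, 3, 20, 11, 0), "resolved": datetime(2025, 3, 20, 12, 30),
     "root_cause": "missing config validation"},   # a repeat - learning gap
]

mttr = sum(((i["resolved"] - i["detected"]) for i in incidents), timedelta()) / len(incidents)
repeats = sum(n - 1 for n in Counter(i["root_cause"] for i in incidents).values() if n > 1)
repeat_rate = repeats / len(incidents)

print(mttr)                                   # 1:20:00 average
print(f"{repeat_rate:.0%} repeat incidents")  # 33% - well above the <5% goal
```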
### Knowledge Sharing

**Postmortem Database**:
- Centralized repository (Confluence, Notion, wiki)
- Searchable by: Date, service, root cause category, severity
- Template: Consistent format for easy scanning

**Learning Sessions**:
- Monthly "Failure Fridays": Review interesting postmortems
- Quarterly "Top 10 Incidents": Review the worst incidents and their learnings
- Blameless tone: Celebrate transparency, not just success

**Cross-Team Sharing**:
- Share postmortems beyond the immediate team
- Tag related teams (if it's a payment incident, notify the fintech team)
- Prevents Team A repeating Team B's mistake

### Continuous Improvement

**Feedback Loops**:
- Postmortem → Corrective actions → Completion → Measure impact → Repeat
- Quarterly review: Are the actions working? (MTTR decreasing? Repeats decreasing?)

**Runbook Updates**:
- After every incident: Update the runbook with learnings
- Quarterly: Review all runbooks, refresh stale content
- Metric: Runbook age (>6 months old = needs refresh)
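The runbook-age metric is a simple date comparison. An illustrative sketch; the runbook names, dates, and the 183-day approximation of "6 months" are assumptions for the example:

```python
# Illustrative sketch: flag runbooks not updated in ~6 months for a refresh.
from datetime import date, timedelta

SIX_MONTHS = timedelta(days=183)  # rough approximation for the example
today = date(2025, 3, 1)

runbooks = {  # runbook -> last updated
    "db-failover.md": date(2024, 7, 10),
    "rollback.md": date(2025, 1, 20),
}
stale = [name for name, updated in runbooks.items() if today - updated > SIX_MONTHS]
print(stale)  # ['db-failover.md']
```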
**Process Evolution**:
- Learn from the postmortem process itself
- Survey: "Was this postmortem useful? What would improve it?"
- Iterate: Improve the template, facilitation, and tracking

---
## Quick Reference

### Blameless Language Patterns
- ❌ "Person caused" → ✓ "Process allowed"
- ❌ "Didn't check" → ✓ "Validation missing"
- ❌ "Mistake" → ✓ "Gap in system"

### Root Cause Techniques
- **5 Whys**: Simple incidents, single cause
- **Fishbone**: Complex incidents, multiple causes
- **Fault Tree**: Identify single points of failure
- **Swiss Cheese**: Multiple safeguard failures

### Corrective Action Hierarchy
1. Eliminate (remove the hazard)
2. Substitute (safer alternative)
3. Engineering controls (safeguards)
4. Administrative (processes)
5. Training (education)

### Incident Response
- **Sev 1**: Total outage, <15min ack, <4hr resolve
- **Sev 2**: Partial, <30min ack, <24hr resolve
- **Sev 3**: Minor, <2hr ack, <48hr resolve

### Success Metrics
- ✓ Incident frequency decreasing
- ✓ MTTR improving
- ✓ Repeat incidents <5%
- ✓ Action completion >90%
- ✓ Near-miss reporting increasing