Initial commit

skills/postmortem/SKILL.md
---
name: postmortem
description: Use when analyzing failures, outages, incidents, or negative outcomes, conducting blameless postmortems, documenting root causes with 5 Whys or fishbone diagrams, identifying corrective actions with owners and timelines, learning from near-misses, establishing prevention strategies, or when user mentions postmortem, incident review, failure analysis, RCA, lessons learned, or after-action review.
---
# Postmortem

## Table of Contents

1. [Purpose](#purpose)
2. [When to Use](#when-to-use)
3. [What Is It?](#what-is-it)
4. [Workflow](#workflow)
5. [Common Patterns](#common-patterns)
6. [Root Cause Analysis Techniques](#root-cause-analysis-techniques)
7. [Corrective Actions Framework](#corrective-actions-framework)
8. [Guardrails](#guardrails)
9. [Quick Reference](#quick-reference)

## Purpose

Conduct blameless postmortems that transform failures into learning opportunities by documenting what happened, why it happened, the quantified impact, root cause analysis, and actionable preventions with clear ownership.

## When to Use

**Use this skill when:**

### Incident Context
- Production outage, system failure, or service degradation occurred
- Security breach, data loss, or compliance violation happened
- Product launch failed, project missed deadline, or initiative underperformed
- Customer-impacting bug, quality issue, or support crisis arose
- Near-miss incident that could have caused serious harm (proactive postmortem)

### Learning Goals
- Need to understand root cause (not just symptoms) to prevent recurrence
- Want to identify systemic issues vs. individual mistakes
- Must document timeline and impact for stakeholders or auditors
- Aim to improve processes, systems, or practices based on failure insights
- Building organizational learning culture (celebrate transparency, not blame)

### Timing
- **Immediately after** incident resolution (while memory fresh, within 48 hours)
- **Scheduled reviews** for recurring issues or chronic problems
- **Quarterly reviews** of all incidents to identify patterns
- **Pre-mortem** style: Before major launch, imagine it failed and write postmortem

**Do NOT use when:**
- Incident still ongoing (focus on resolution first, postmortem second)
- Looking to assign blame or punish individuals (antithesis of blameless culture)
- Issue is trivial with no learning value (reserved for significant incidents)

## What Is It?

A **postmortem** is a structured, blameless analysis of failures that answers:
- **What happened?** Timeline of events from detection to resolution
- **What was the impact?** Quantified harm (users affected, revenue lost, duration)
- **Why did it happen?** Root cause analysis using 5 Whys, fishbone diagrams, or fault trees
- **How do we prevent recurrence?** Actionable items with owners and deadlines
- **What went well?** Positive aspects of incident response

**Key Principles**:
- **Blameless**: Focus on systems/processes, not individuals. Humans err; systems should be resilient.
- **Actionable**: Corrective actions must be specific, owned, and tracked
- **Transparent**: Share widely to enable organizational learning
- **Timely**: Conduct while memory is fresh (within 48 hours of resolution)

**Quick Example:**

**Incident**: Database outage, 2-hour downtime, 50K users affected

**Timeline**:
- 14:05 - Automated deployment started (config change)
- 14:07 - Database connection pool exhausted, errors spike
- 14:10 - Alerts fired, on-call paged
- 14:15 - Engineer investigates, identifies bad config
- 15:30 - Rollback initiated (delayed by unclear runbook)
- 16:05 - Service restored

**Impact**: 2-hour outage, 50K users unable to access, estimated $20K revenue loss
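Key timeline metrics (detection lag, time to mitigate, total duration) fall straight out of the timestamps. A minimal sketch computing them for the example incident above; the event names are illustrative, not part of any template:

```python
from datetime import datetime

# Timestamps from the example timeline above (same-day, HH:MM).
events = {
    "deploy_started": "14:05",
    "errors_spiked": "14:07",
    "alert_fired": "14:10",
    "rollback_started": "15:30",
    "service_restored": "16:05",
}

def minutes_between(start: str, end: str) -> int:
    """Minutes elapsed between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

detection_lag = minutes_between(events["errors_spiked"], events["alert_fired"])
time_to_mitigate = minutes_between(events["alert_fired"], events["rollback_started"])
total_duration = minutes_between(events["deploy_started"], events["service_restored"])

print(f"Detection lag: {detection_lag} min")        # 3 min
print(f"Time to mitigate: {time_to_mitigate} min")  # 80 min
print(f"Total duration: {total_duration} min")      # 120 min
```

Quantifying the gaps this way makes the 75-minute investigation-to-rollback delay (the unclear runbook) jump out of the timeline.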

**Root Cause** (5 Whys):
1. Why outage? Bad config deployed
2. Why bad config? Connection pool size set to 10 (should be 100)
3. Why wrong value? Config templated incorrectly
4. Why template wrong? New team member unfamiliar with prod values
5. Why no catch? No staging environment testing of configs

**Corrective Actions**:
- [ ] Add config validation to deployment pipeline (Owner: Alex, Due: Mar 15)
- [ ] Create staging env with prod-like load (Owner: Jordan, Due: Mar 30)
- [ ] Update runbook with rollback steps (Owner: Sam, Due: Mar 10)
- [ ] Onboarding checklist: Review prod configs (Owner: Morgan, Due: Mar 5)

**What Went Well**: Alerts fired quickly, team responded within 5 minutes, good communication

## Workflow

Copy this checklist and track your progress:

```
Postmortem Progress:
- [ ] Step 1: Assemble timeline and quantify impact
- [ ] Step 2: Conduct root cause analysis
- [ ] Step 3: Define corrective and preventive actions
- [ ] Step 4: Document and share postmortem
- [ ] Step 5: Track action items to completion
```

**Step 1: Assemble timeline and quantify impact**

Gather facts: when detected, when started, key events, when resolved. Quantify impact: users affected, duration, revenue/SLA impact, customer complaints. For straightforward incidents use [resources/template.md](resources/template.md). For complex incidents with multiple causes or cascading failures, study [resources/methodology.md](resources/methodology.md) for advanced timeline reconstruction techniques.

**Step 2: Conduct root cause analysis**

Ask "Why?" 5 times to get from symptom to root cause, or use a fishbone diagram for complex incidents with multiple contributing factors. See [Root Cause Analysis Techniques](#root-cause-analysis-techniques) for guidance. Focus on system failures (process gaps, missing safeguards), not human errors.

**Step 3: Define corrective and preventive actions**

For each root cause, identify actions to prevent recurrence. Actions must be specific (not "improve testing"), owned (named person), and time-bound (deadline). Categorize as immediate fixes vs. long-term improvements. See [Corrective Actions Framework](#corrective-actions-framework) for the framework.

**Step 4: Document and share postmortem**

Create the postmortem document using the template. Include timeline, impact, root cause, actions, what went well. Share widely (engineering, product, leadership) to enable learning. Present in a team meeting for discussion. Archive in the knowledge base.

**Step 5: Track action items to completion**

Assign owners, set deadlines, add to project tracker. Review progress in standups or weekly meetings. Close postmortem only when all actions complete. Self-assess quality using [resources/evaluators/rubric_postmortem.json](resources/evaluators/rubric_postmortem.json). Minimum standard: ≥3.5 average score.
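One reasonable reading of the ≥3.5 self-assessment standard is a weighted average of per-criterion scores (1-5) using the weights from the rubric JSON. A minimal sketch; the short criterion keys and the scores themselves are hypothetical, only the weights come from the rubric:

```python
# Criterion weights from rubric_postmortem.json (keys abbreviated here).
weights = {
    "timeline": 1.2, "impact": 1.3, "root_cause": 1.4, "actions": 1.4,
    "blameless": 1.3, "timeliness": 1.0, "learning": 1.1, "completeness": 1.0,
}
# Hypothetical self-assessment scores on the 1-5 scale.
scores = {
    "timeline": 4, "impact": 3, "root_cause": 4, "actions": 4,
    "blameless": 5, "timeliness": 3, "learning": 3, "completeness": 4,
}

weighted_avg = sum(weights[c] * scores[c] for c in weights) / sum(weights.values())
print(f"Weighted average: {weighted_avg:.2f}")
print("Meets minimum standard" if weighted_avg >= 3.5 else "Revise before closing")
```

The heavier weights on root cause depth and corrective actions mean a shallow "Why?" chain drags the average down faster than a thin timeline does.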

## Common Patterns

### By Incident Type

**Production Outages** (system failures, downtime):
- Timeline: Detection → Investigation → Mitigation → Resolution
- Impact: Users affected, duration, SLA breach, revenue loss
- Root cause: Often config errors, deployment issues, infrastructure limits
- Actions: Improve monitoring, runbooks, rollback procedures, capacity planning

**Security Incidents** (breaches, vulnerabilities):
- Timeline: Breach occurrence → Detection (often delayed) → Containment → Remediation
- Impact: Data exposed, compliance risk, reputation damage
- Root cause: Missing security controls, access management gaps, unpatched vulnerabilities
- Actions: Security audits, access reviews, patch management, training

**Product/Project Failures** (launches, deadlines):
- Timeline: Planning → Execution → Launch/Deadline → Outcome vs. Expectations
- Impact: Revenue miss, user churn, wasted effort, opportunity cost
- Root cause: Poor requirements, unrealistic estimates, misalignment, inadequate testing
- Actions: Improve discovery, estimation, stakeholder alignment, validation processes

**Process Failures** (operational, procedural):
- Timeline: Process initiation → Breakdown point → Impact realization
- Impact: Delays, quality issues, rework, team frustration
- Root cause: Unclear process, missing steps, handoff failures, tooling gaps
- Actions: Document processes, automate workflows, improve communication, training

### By Root Cause Category

**Human Error** (surface cause, dig deeper):
- Don't stop at "person made mistake"
- Ask: Why was mistake possible? Why not caught? Why no safeguard?
- Actions: Reduce error likelihood (checklists, automation), increase error detection (testing, reviews), mitigate error impact (rollback, redundancy)

**Process Gap** (missing or unclear procedures):
- Symptoms: "Didn't know to do X", "Not in runbook", "First time"
- Actions: Document process, create checklist, formalize approval gates, onboarding

**Technical Debt** (deferred maintenance):
- Symptoms: "Known issue", "Fragile system", "Workaround failed"
- Actions: Prioritize tech debt, allocate 20% capacity, refactor, replace legacy systems

**External Dependencies** (third-party failures):
- Symptoms: "Vendor down", "API failed", "Partner issue"
- Actions: Add redundancy, circuit breakers, graceful degradation, SLA monitoring, vendor diversification

**Systemic Issues** (organizational, cultural):
- Symptoms: "Always rushed", "No time to test", "Pressure to ship"
- Actions: Address root organizational issues (unrealistic deadlines, resource constraints, incentive misalignment)

## Root Cause Analysis Techniques

**5 Whys**:
1. Start with problem statement
2. Ask "Why did this happen?" → Answer
3. Ask "Why did that happen?" → Answer
4. Repeat 5 times (or until root cause found)
5. Root cause: Fixable at organizational/system level

**Example**: Database outage → Why? Bad config → Why? Wrong value → Why? Template error → Why? New team member unfamiliar → Why? No config review in onboarding
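One way to keep a 5 Whys chain honest is to record it as an ordered list, where each entry must be the cause of the one before it. A minimal sketch using the example chain above; the list structure is illustrative, not a prescribed format:

```python
# The 5 Whys chain from the example: symptom first, systemic cause last.
whys = [
    "Database outage",
    "Bad config",
    "Wrong value",
    "Template error",
    "New team member unfamiliar",
    "No config review in onboarding",
]

# Each adjacent pair reads as "effect, because cause".
chain = [
    f"Why {i}: {effect} -> because {cause}"
    for i, (effect, cause) in enumerate(zip(whys[:-1], whys[1:]), start=1)
]
print("\n".join(chain))
print(f"Root cause (systemic, fixable): {whys[-1]}")
```

Forcing the chain into this shape makes it obvious when a "Why?" was skipped or when the final entry is still an individual action rather than a fixable system gap.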

**Fishbone Diagram** (Ishikawa):
- Categories: People, Process, Technology, Environment
- Brainstorm causes in each category
- Identify most likely root causes for investigation
- Useful for complex incidents with multiple contributing factors

**Fault Tree Analysis**:
- Top: Failure event (e.g., "System down")
- Gates: AND (all required) vs OR (any sufficient)
- Leaves: Base causes (e.g., "Config error" OR "Network failure")
- Trace path from failure to root causes
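The AND/OR gate structure above can be sketched as a small recursive evaluator: leaves are base causes you mark observed or not, and gates combine them. A hedged sketch under those assumptions; the tree and cause names are hypothetical:

```python
from typing import Union

# A node is either a leaf (base-cause name) or ("AND"/"OR", [children]).
Node = Union[str, tuple]

def evaluate(node: Node, observed: dict) -> bool:
    """True if this node's failure condition holds given observed base causes."""
    if isinstance(node, str):
        return observed.get(node, False)
    gate, children = node
    results = [evaluate(child, observed) for child in children]
    return all(results) if gate == "AND" else any(results)

# "System down" if (config error OR network failure) AND failover was missing.
tree = ("AND", [("OR", ["config_error", "network_failure"]), "failover_missing"])

observed = {"config_error": True, "failover_missing": True}
print(evaluate(tree, observed))  # True: config error + missing failover
```

The AND gate is what surfaces compound causes: with a working failover (`failover_missing` False), the same config error alone would not have taken the system down.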

## Corrective Actions Framework

**Types of Actions**:
- **Immediate Fixes**: Deployed within days (hotfix, manual process, workaround)
- **Short-term Improvements**: Completed within weeks (better monitoring, updated runbook, process change)
- **Long-term Investments**: Completed within months (architecture changes, new systems, cultural shifts)

**SMART Actions**:
- **Specific**: "Add config validation" not "Improve deploys"
- **Measurable**: "Reduce MTTR from 2hr to 30min" not "Faster response"
- **Assignable**: Named owner, not "team"
- **Realistic**: Given capacity and constraints
- **Time-bound**: Explicit deadline

**Prioritization**:
1. **High impact, low effort**: Do immediately
2. **High impact, high effort**: Schedule as strategic project
3. **Low impact, low effort**: Do if spare capacity
4. **Low impact, high effort**: Consider skipping (cost > benefit)
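Representing each corrective action as structured data makes the SMART requirements (named owner, explicit deadline) impossible to omit, and the impact/effort quadrant above becomes a simple lookup. A minimal sketch; the class, field names, and the example date are illustrative assumptions:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class CorrectiveAction:
    description: str   # specific, e.g. "Add config validation", not "Improve deploys"
    owner: str         # named person, not "team"
    due: date          # explicit deadline
    impact: str        # "high" or "low"
    effort: str        # "high" or "low"

    def recommendation(self) -> str:
        """Map the impact/effort quadrant to the scheduling guidance above."""
        return {
            ("high", "low"): "Do immediately",
            ("high", "high"): "Schedule as strategic project",
            ("low", "low"): "Do if spare capacity",
            ("low", "high"): "Consider skipping (cost > benefit)",
        }[(self.impact, self.effort)]

action = CorrectiveAction(
    description="Add config validation to deployment pipeline",
    owner="Alex", due=date(2025, 3, 15), impact="high", effort="low",
)
print(action.recommendation())  # Do immediately
```

Because `owner` and `due` are required fields, a vague "team should document" action simply cannot be constructed.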

**Prevention Hierarchy** (from most to least effective):
1. **Eliminate**: Remove hazard entirely (e.g., deprecate risky feature)
2. **Substitute**: Replace with safer alternative (e.g., use managed service vs self-host)
3. **Engineering controls**: Add safeguards (e.g., rate limits, circuit breakers, automated testing)
4. **Administrative controls**: Improve processes (e.g., runbooks, checklists, reviews)
5. **Training**: Educate people (least effective alone, combine with others)

## Guardrails

**Blameless Culture**:
- ❌ "Engineer caused outage by deploying bad config" → ✓ "Deployment pipeline allowed bad config to reach production"
- ❌ "PM didn't validate requirements" → ✓ "Requirements validation process missing"
- ❌ "Designer made mistake" → ✓ "Design review process didn't catch issue"
- Focus: What system/process failed? Not who made error.

**Root Cause Depth**:
- ❌ Stopping at surface: "Bug caused outage" → ✓ Deep analysis: "Bug deployed because testing gap, no staging env, rushed release pressure"
- ❌ Single cause: "Database failure" → ✓ Multiple causes: "Database + no failover + alerting delay + unclear runbook"
- Rule: Keep asking "Why?" until you reach actionable systemic improvements

**Actionability**:
- ❌ Vague: "Improve testing", "Better communication", "More careful" → ✓ Specific: "Add E2E test suite covering top 10 user flows by Apr 1 (Owner: Alex)"
- ❌ No owner: "Team should document" → ✓ Owned: "Sam documents incident response runbook by Mar 15"
- ❌ No deadline: "Eventually migrate" → ✓ Time-bound: "Complete migration by Q2 end"

**Impact Quantification**:
- ❌ Qualitative: "Many users affected", "Significant downtime" → ✓ Quantitative: "50K users (20% of base), 2-hour outage, $20K revenue loss"
- ❌ No metrics: "Bad customer experience" → ✓ Metrics: "NPS dropped from 50 to 30, 100 support tickets, 5 churned customers ($50K ARR)"

**Timeliness**:
- ❌ Wait 2 weeks → Memory fades, urgency lost → ✓ Conduct within 48 hours while fresh
- ❌ Never follow up → Actions forgotten → ✓ Track actions, review weekly, close when complete

## Quick Reference

**Resources**:
- [resources/template.md](resources/template.md) - Postmortem document structure and sections
- [resources/methodology.md](resources/methodology.md) - Blameless culture, root cause analysis techniques, corrective action frameworks
- [resources/evaluators/rubric_postmortem.json](resources/evaluators/rubric_postmortem.json) - Quality criteria for postmortems

**Success Criteria**:
- ✓ Timeline clear with timestamps and key events
- ✓ Impact quantified (users, duration, revenue, metrics)
- ✓ Root cause identified (systemic, not individual blame)
- ✓ Corrective actions SMART (specific, measurable, assigned, realistic, time-bound)
- ✓ Blameless tone (focus on systems/processes)
- ✓ Documented and shared within 48 hours
- ✓ Action items tracked to completion

**Common Mistakes**:
- ❌ Blame individuals → culture of fear, hide future issues
- ❌ Superficial root cause → doesn't prevent recurrence
- ❌ Vague actions → nothing actually improves
- ❌ No follow-through → actions never completed, same incident repeats
- ❌ Delayed postmortem → details forgotten, less useful
- ❌ Not sharing → no organizational learning
- ❌ Defensive tone → misses opportunity to improve

skills/postmortem/resources/evaluators/rubric_postmortem.json

{
  "name": "Postmortem Evaluator",
  "description": "Evaluate quality of postmortems—assessing timeline clarity, impact quantification, root cause depth, corrective action quality, blameless tone, timeliness, and organizational learning.",
  "version": "1.0.0",
  "criteria": [
    {
      "name": "Timeline Clarity & Completeness",
      "description": "Evaluates whether timeline has specific timestamps, key events, and clear sequence from detection to resolution",
      "weight": 1.2,
      "scale": {
        "1": {
          "label": "No timeline or vague",
          "description": "Timeline missing, or times vague ('afternoon', 'later'). Events out of order. Can't reconstruct what happened."
        },
        "2": {
          "label": "Incomplete timeline",
          "description": "Some timestamps but many missing. Key events absent (e.g., when detected, when mitigated). Hard to follow sequence."
        },
        "3": {
          "label": "Clear timeline with timestamps",
          "description": "Timeline with specific times (14:05 UTC). Key events: detection, investigation, mitigation, resolution. Sequence clear."
        },
        "4": {
          "label": "Detailed timeline",
          "description": "Comprehensive timeline with timestamps, events, who took action, data sources (logs, monitoring). Table format for scannability. Detection lag, response time, mitigation time quantified."
        },
        "5": {
          "label": "Exemplary timeline reconstruction",
          "description": "Timeline reconstructed from multiple sources (logs, metrics, Slack, interviews). Timestamps precise (minute-level). Key observations noted (detection lag X min, response time Y min, why delays occurred). Visual aids (timeline diagram). Shows thoroughness in fact-gathering."
        }
      }
    },
    {
      "name": "Impact Quantification",
      "description": "Evaluates whether impact is quantified across multiple dimensions (users, revenue, SLA, reputation)",
      "weight": 1.3,
      "scale": {
        "1": {
          "label": "No impact or qualitative only",
          "description": "Impact not quantified, or only qualitative ('many users', 'significant'). Can't assess severity."
        },
        "2": {
          "label": "Partial quantification",
          "description": "Some numbers (e.g., '2-hour outage') but incomplete. Missing key dimensions (users affected? revenue impact? SLA breach?)."
        },
        "3": {
          "label": "Multi-dimensional impact",
          "description": "Impact quantified: Users affected (50K), duration (2hr), revenue loss ($20K), SLA breach (yes/no). Multiple dimensions covered."
        },
        "4": {
          "label": "Comprehensive impact analysis",
          "description": "Impact across all dimensions: Users (50K, 20% of base), duration (2hr), revenue ($20K estimated), SLA breach (99.9% → 2hr down, $15K credits), reputation (social media, customer escalations), operational cost (person-hours, support tickets). Before/during/after metrics table."
        },
        "5": {
          "label": "Exceptional impact depth",
          "description": "Comprehensive impact with: User segmentation (which tiers affected most), geographic distribution, customer quotes/feedback, long-tail effects (churn risk, brand damage), opportunity cost (lost signups, abandoned carts). Metrics baseline/during/post in table. Impact tied to business context (e.g., '$20K = 2% of weekly revenue'). Shows deep understanding of business impact."
        }
      }
    },
    {
      "name": "Root Cause Depth",
      "description": "Evaluates whether root cause analysis goes beyond symptoms to systemic issues using 5 Whys or equivalent",
      "weight": 1.4,
      "scale": {
        "1": {
          "label": "No root cause or surface only",
          "description": "Root cause missing, or stops at surface symptom ('bug caused outage', 'person made mistake'). No depth."
        },
        "2": {
          "label": "Shallow root cause",
          "description": "Identified proximate cause ('bad config deployed') but didn't ask why. Stopped at individual action, not system issue."
        },
        "3": {
          "label": "System-level root cause",
          "description": "Used 5 Whys or equivalent to reach systemic root cause. Example: 'Deployment pipeline lacked config validation' (fixable). Identified 1-2 root causes."
        },
        "4": {
          "label": "Multi-factor root cause analysis",
          "description": "Identified primary root cause + 2-3 contributing factors using 5 Whys or fishbone diagram. Categorized: Technical, process, organizational. Example: Primary: no config validation. Contributing: no staging env, rushed timeline, unclear runbook."
        },
        "5": {
          "label": "Rigorous causal analysis",
          "description": "Comprehensive root cause using multiple techniques (5 Whys for primary, fishbone for contributing factors, Swiss cheese model for safeguard failures). Identified immediate, enabling, and systemic causes. Contributing factors categorized (technical, process, organizational). Evidence provided (logs, metrics confirming each cause). Shows deep thinking beyond obvious. Addresses 'why this, why now' systemically."
        }
      }
    },
    {
      "name": "Corrective Actions Quality",
      "description": "Evaluates whether corrective actions are SMART (specific, measurable, assigned, realistic, time-bound) and address root causes",
      "weight": 1.4,
      "scale": {
        "1": {
          "label": "No actions or vague",
          "description": "No corrective actions, or vague ('improve testing', 'be more careful'). Not actionable."
        },
        "2": {
          "label": "Generic actions",
          "description": "Actions listed but vague. Missing owner or deadline. Example: 'Better monitoring', 'Improve process' (what specifically?)."
        },
        "3": {
          "label": "SMART actions",
          "description": "Actions are specific ('Add config validation'), owned (Alex), and time-bound (Mar 15). Measurable. Realistic. 3-5 actions listed."
        },
        "4": {
          "label": "Comprehensive action plan",
          "description": "Actions categorized (immediate/short-term/long-term). Each action SMART. Address root causes (not just symptoms). Prioritized (high impact actions first). 5-8 actions total. Tracked in project management tool."
        },
        "5": {
          "label": "Strategic corrective action framework",
          "description": "Comprehensive action plan using hierarchy of controls (eliminate, substitute, engineering controls, administrative, training). Actions at all levels: immediate fixes (deployed), short-term (weeks), long-term (months). Each action: SMART + addresses specific root cause + rationale (why this action prevents recurrence). Prioritized by impact/effort. Tracked with clear accountability. Prevention/detection/mitigation balance. Shows strategic thinking about systemic improvement."
        }
      }
    },
    {
      "name": "Blameless Tone",
      "description": "Evaluates whether postmortem focuses on systems/processes (not individuals) and maintains constructive tone",
      "weight": 1.3,
      "scale": {
        "1": {
          "label": "Blameful or accusatory",
          "description": "Blames individuals ('Engineer X caused outage by...'). Names names. Accusatory tone. Creates fear."
        },
        "2": {
          "label": "Somewhat blameful",
          "description": "Implies blame ('X should have checked', 'Y made mistake'). Focus on individual actions, not systems."
        },
        "3": {
          "label": "Mostly blameless",
          "description": "Focus on systems ('Process allowed bad config'). Avoids blaming individuals. Constructive tone. Minor lapses."
        },
        "4": {
          "label": "Blameless and constructive",
          "description": "Consistently blameless throughout. System-focused language ('Process gap', 'Missing safeguard'). Acknowledges human factors without blame. Constructive framing ('opportunity to improve'). Celebrates what went well."
        },
        "5": {
          "label": "Exemplary blameless culture",
          "description": "Blameless tone modeled throughout. Explicitly acknowledges pressure, constraints, trade-offs that led to decisions. Reframes 'mistakes' as learning opportunities. Celebrates transparency (thank you for sharing). Uses second victim language (acknowledges engineer's stress). Shows deep understanding of blameless culture principles. Creates psychological safety."
        }
      }
    },
    {
      "name": "Timeliness & Follow-Through",
      "description": "Evaluates whether postmortem conducted promptly (within 48hr) and action items tracked to completion",
      "weight": 1.0,
      "scale": {
        "1": {
          "label": "Delayed or no follow-up",
          "description": "Postmortem written >1 week after incident (memory faded). No action tracking. Actions forgotten."
        },
        "2": {
          "label": "Late postmortem",
          "description": "Written 3-7 days after incident. Some actions tracked but many fall through cracks. Incomplete follow-up."
        },
        "3": {
          "label": "Timely postmortem",
          "description": "Written within 48-72 hours (memory fresh). Actions tracked in project tool. Some follow-up in standups."
        },
        "4": {
          "label": "Timely with tracking",
          "description": "Written within 48 hours. Actions added to tracker with owners and deadlines. Reviewed weekly in standups. Progress tracked. Most actions completed."
        },
        "5": {
          "label": "Rigorous timeliness and accountability",
          "description": "Postmortem written within 24-48 hours while memory fresh. Actions immediately added to tracker. Clear ownership (named individuals, not 'team'). Deadlines aligned to urgency. Weekly progress review in standups. Escalation if actions stalled. Postmortem closed only when all actions complete. Completion rate >90%. Shows commitment to follow-through and accountability."
        }
      }
    },
    {
      "name": "Learning & Sharing",
      "description": "Evaluates whether lessons extracted, documented, and shared for organizational learning",
      "weight": 1.1,
      "scale": {
        "1": {
          "label": "No learning or sharing",
          "description": "Lessons not documented. Postmortem not shared beyond immediate team. No organizational learning."
        },
        "2": {
          "label": "Minimal sharing",
          "description": "Lessons vaguely stated. Shared with immediate team only. Limited learning value."
        },
        "3": {
          "label": "Lessons documented and shared",
          "description": "Lessons learned section with 3-5 insights. Shared with team and stakeholders. Archived in knowledge base."
        },
        "4": {
          "label": "Comprehensive learning",
          "description": "Lessons extracted and generalized (applicable beyond this incident). Shared broadly (team, stakeholders, company-wide). Presented in team meeting for discussion. Archived in searchable postmortem database. Tagged by category (root cause type, service, severity)."
        },
        "5": {
          "label": "Exceptional organizational learning",
          "description": "Lessons deeply analyzed and generalized. 'What went well' celebrated (not just failures). Shared company-wide with context (all-hands presentation, newsletter, postmortem database). Cross-team learnings identified (tag related teams to prevent repeat). Patterns identified from multiple postmortems. Metrics tracked (incident frequency, MTTR, repeat rate). Shows commitment to learning culture."
        }
      }
    },
    {
      "name": "Completeness & Structure",
      "description": "Evaluates whether postmortem includes all key sections (summary, timeline, impact, root cause, actions, lessons)",
      "weight": 1.0,
      "scale": {
        "1": {
          "label": "Incomplete or unstructured",
          "description": "Missing major sections (no timeline or no root cause or no actions). Unstructured narrative."
        },
        "2": {
          "label": "Partial completeness",
          "description": "Has some sections but missing 1-2 key elements. Loose structure."
        },
        "3": {
          "label": "Complete with standard sections",
          "description": "Includes: Summary, timeline, impact, root cause, corrective actions. Follows template. Structured."
        },
        "4": {
          "label": "Comprehensive structure",
          "description": "All standard sections plus: What went well, lessons learned, appendix (links to logs/metrics/Slack). Well-structured, scannable. Easy to navigate."
        },
        "5": {
          "label": "Exemplary completeness",
          "description": "All sections comprehensive: Summary (clear), timeline (detailed), impact (multi-dimensional), root cause (deep), actions (SMART), what went well (celebrated), lessons (generalized), appendix (thorough references). Uses tables, formatting for scannability. Passes 'grandmother test' (non-expert can understand what happened). Shows attention to detail and communication quality."
        }
      }
    }
  ],
  "guidance": {
    "by_incident_type": {
      "production_outage": {
        "focus": "Prioritize timeline clarity (1.3x), impact quantification (1.4x), and root cause depth (1.4x). Outages need technical rigor.",
        "typical_scores": "Timeline 4+, impact 4+, root cause 4+, actions 4+. Blameless 3+ (pressure high during outages).",
        "red_flags": "Timeline vague, impact not quantified, root cause stops at 'bug' or 'config error', actions vague ('improve monitoring')"
      },
      "security_incident": {
        "focus": "Prioritize root cause depth (1.5x), corrective actions (1.5x), and completeness (1.3x). Security needs thoroughness.",
        "typical_scores": "Root cause 4+, actions 4+, completeness 4+. Timeline can be 3+ (detection often delayed in security).",
        "red_flags": "Root cause shallow (stops at 'vulnerability'), no security audit action, missing compliance considerations"
      },
      "product_failure": {
        "focus": "Prioritize impact quantification (1.3x), root cause (1.3x), and learning (1.3x). Product failures need business context.",
        "typical_scores": "Impact 4+ (business metrics), root cause 4+, learning 4+. Timeline can be 3+ (less time-critical).",
        "red_flags": "Impact qualitative only ('failed to meet expectations'), root cause blames individuals ('PM didn't validate'), no process improvements"
      },
      "project_deadline_miss": {
        "focus": "Prioritize root cause depth (1.3x), blameless tone (1.5x), and learning (1.3x). Project failures prone to blame.",
        "typical_scores": "Root cause 4+, blameless 4+ (critical), learning 4+. Timeline can be 3+ (long duration).",
        "red_flags": "Blames individuals ('X underestimated'), superficial root cause ('poor planning'), no systemic improvements"
      }
    },
    "by_severity": {
      "sev_1_critical": {
        "expectations": "All criteria 4+. High-severity incidents require rigorous postmortems. Timeline detailed, impact comprehensive, root cause deep, actions strategic.",
        "next_steps": "Share company-wide, present to leadership, track metrics (MTTR improvement), quarterly review of all Sev 1 incidents"
      },
      "sev_2_high": {
        "expectations": "All criteria 3.5+. Important incidents need thorough postmortems. Timeline clear, impact quantified, root cause systemic, actions SMART.",
        "next_steps": "Share with team and stakeholders, track actions, monthly review of Sev 2 trends"
      },
      "sev_3_medium": {
        "expectations": "All criteria 3+. Lower-severity incidents can have lighter postmortems. Focus: root cause and actions (learn and prevent).",
        "next_steps": "Share with immediate team, batch review monthly for patterns"
      }
    }
  },
|
||||
"common_failure_modes": {
|
||||
"superficial_root_cause": "Stops at symptom ('bug', 'config error') instead of asking why 5 times. Fix: Use 5 Whys until reach fixable system issue.",
|
||||
"vague_actions": "Actions like 'improve monitoring', 'better communication'. Fix: Make SMART (specific metric, owner, deadline).",
|
||||
"blame_individuals": "Postmortem blames person ('X caused outage'). Fix: Reframe to system focus ('Process allowed bad config').",
|
||||
"no_impact_quantification": "Impact qualitative ('many users', 'significant'). Fix: Quantify (50K users, 2hr, $20K revenue loss).",
|
||||
"delayed_postmortem": "Written >1 week after, memory faded. Fix: Conduct within 48 hours while fresh.",
|
||||
"no_follow_through": "Actions never completed, languish in backlog. Fix: Track in project tool, review weekly, escalate if stalled.",
|
||||
"not_shared": "Postmortem stays with immediate team, no org learning. Fix: Share company-wide, present in all-hands, archive in searchable database.",
|
||||
"missing_timeline": "No timeline or vague times. Fix: Reconstruct from logs/metrics, use specific timestamps (14:05 UTC)."
|
||||
},
|
||||
"excellence_indicators": [
|
||||
"Timeline reconstructed from multiple sources with minute-level precision and key observations (detection lag, response time)",
|
||||
"Impact quantified across all dimensions (users, revenue, SLA, reputation, operational) with before/during/after metrics",
|
||||
"Root cause uses multiple techniques (5 Whys, fishbone) to identify immediate, enabling, and systemic causes with evidence",
|
||||
"Corrective actions use hierarchy of controls (eliminate/substitute/engineering/admin/training) with SMART criteria and rationale",
|
||||
"Blameless tone maintained throughout with acknowledgment of pressure/constraints and celebration of transparency",
|
||||
"Postmortem written within 24-48 hours, actions tracked to >90% completion, closed only when all complete",
|
||||
"Lessons generalized and shared company-wide with cross-team tagging and pattern analysis across multiple incidents",
|
||||
"Complete structure with all sections (summary, timeline, impact, root cause, actions, what went well, lessons, appendix)",
|
||||
"Uses tables and formatting for scannability, passes 'grandmother test' (non-expert can understand)",
|
||||
"Celebrates what went well, not just failures (detection worked, team responded fast, good communication)"
|
||||
],
|
||||
"evaluation_notes": {
|
||||
"scoring": "Calculate weighted average across all criteria. Minimum passing score: 3.0 (basic quality). Production-ready target: 3.5+. Excellence threshold: 4.2+. For production outages, weight timeline, impact, and root cause higher. For security incidents, weight root cause and actions higher. For product failures, weight impact and learning higher.",
|
||||
"context": "Adjust expectations by severity. Sev 1 incidents need 4+ across all criteria (high rigor). Sev 2 incidents need 3.5+ (thorough). Sev 3 incidents need 3+ (lighter but still valuable). Different incident types need different emphasis: outages need technical rigor (timeline, root cause), security needs thoroughness (root cause, actions, completeness), product failures need business context (impact, learning).",
|
||||
"iteration": "Low scores indicate specific improvement areas. Priority order: 1) Fix blameful tone (critical for culture), 2) Deepen root cause (5 Whys to system level), 3) Make actions SMART (specific, owned, time-bound), 4) Quantify impact (numbers not adjectives), 5) Improve timeliness (within 48hr), 6) Complete timeline (specific timestamps), 7) Extract and share lessons (org learning), 8) Ensure completeness (all sections). Re-score after each improvement cycle."
|
||||
}
|
||||
}
|
440
skills/postmortem/resources/methodology.md
Normal file
@@ -0,0 +1,440 @@
# Postmortem Methodology

## Table of Contents
1. [Blameless Culture](#1-blameless-culture)
2. [Root Cause Analysis Techniques](#2-root-cause-analysis-techniques)
3. [Corrective Action Frameworks](#3-corrective-action-frameworks)
4. [Incident Response Patterns](#4-incident-response-patterns)
5. [Postmortem Facilitation](#5-postmortem-facilitation)
6. [Organizational Learning](#6-organizational-learning)

---

## 1. Blameless Culture

### Core Principles

**Humans Err, Systems Should Be Resilient**
- Assumption: People will make mistakes. Design systems to tolerate errors.
- Example: Deployment requires 2 approvals → reduces error likelihood
- Example: Canary deployments → limits error blast radius

**Second Victim Phenomenon**
- First victim: Customers affected by incident
- Second victim: Engineer who made the triggering action (guilt, anxiety, fear)
- Blameful culture: Second victim punished → hides future issues, leaves company
- Blameless culture: Learn together → transparency, improvement

**Just Culture vs Blameless**
- **Blameless**: Focus on system improvements, not individual accountability
- **Just Culture**: Distinguish reckless behavior (punish) from honest mistakes (learn)
- Example Reckless: Intentionally bypassing safeguards, ignoring warnings
- Example Honest: Misunderstanding, reasonable decision with information at hand

### Language Patterns

**Blameful → Blameless Reframing**:
- ❌ "Engineer X caused outage" → ✓ "Deployment process allowed bad config to reach production"
- ❌ "PM didn't validate" → ✓ "Requirements validation process was missing"
- ❌ "Designer made error" → ✓ "Design review didn't catch accessibility issue"
- ❌ "Careless mistake" → ✓ "System lacked error detection at this step"

**System-Focused Questions**:
- "What allowed this to happen?" (not "Who did this?")
- "What gaps in our process enabled this?" (not "Why didn't you check?")
- "How can we prevent recurrence?" (not "How do we prevent X from doing this again?")

### Building Blameless Culture

**Leadership Modeling**:
- Leaders admit their own mistakes publicly
- Leaders ask "How did our systems fail?" not "Who messed up?"
- Leaders reward transparency (sharing incidents, near-misses)

**Psychological Safety**:
- Safe to raise concerns, report issues, admit mistakes
- No punishment for honest errors (distinguish from recklessness)
- Celebrate learning, not just success

**Metrics to Track**:
- Near-miss reporting rate (high = good, means people feel safe reporting)
- Postmortem completion rate (all incidents get a postmortem)
- Action item completion rate (learnings turn into improvements)
- Repeat incident rate (same root cause recurring = not learning)

---

## 2. Root Cause Analysis Techniques

### 5 Whys

**Process**:
1. State problem clearly
2. Ask "Why?" → Answer
3. Ask "Why?" on answer → Next answer
4. Repeat until root cause (fixable at system/org level)
5. Typically 5 iterations, but can be 3-7

**Example**:
Problem: Website down
1. Why? Database connection failed
2. Why? Connection pool exhausted
3. Why? Pool size too small (10 vs 100 needed)
4. Why? Config template had wrong value
5. Why? No validation in deployment pipeline

**Root Cause**: Deployment pipeline lacked config validation (fixable!)

**Tips**:
- Avoid "human error" as an answer - keep asking why the error was possible
- Stop when you reach an actionable system change
- Multiple paths may emerge - explore all
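The stopping rule above can be sketched as a tiny helper that walks a why-chain until an answer is fixable at the system level (an illustrative sketch, not part of this skill's tooling; the function name and data shape are invented):

```python
# Minimal 5 Whys walk: each answer either ends at a fixable
# system issue (root cause) or prompts another "Why?".
def five_whys(problem: str, answers: list) -> str:
    """answers: (answer_text, is_system_fixable) pairs, in the order asked."""
    print(f"Problem: {problem}")
    for depth, (answer, fixable) in enumerate(answers, start=1):
        print(f"{depth}. Why? {answer}")
        if fixable:
            return answer  # stop at an actionable system change
    raise ValueError("No system-level root cause reached - keep asking why")

root = five_whys(
    "Website down",
    [
        ("Database connection failed", False),
        ("Connection pool exhausted", False),
        ("Pool size too small (10 vs 100 needed)", False),
        ("Config template had wrong value", False),
        ("No validation in deployment pipeline", True),
    ],
)
# root == "No validation in deployment pipeline"
```

The `is_system_fixable` flag is the judgment call the tips describe: "human error" never gets the flag; a missing safeguard does.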

### Fishbone Diagram (Ishikawa)

**Structure**: Fish skeleton with problem at head, causes as bones

**Categories** (6M):
- **Methods** (Processes): Missing step, unclear procedure, no validation
- **Machines** (Technology): System limits, infrastructure failures, bugs
- **Materials** (Data/Inputs): Bad data, missing info, incorrect assumptions
- **Measurements** (Monitoring): Blind spots, delayed detection, wrong metrics
- **Mother Nature** (Environment): External dependencies, third-party failures, regulations
- **Manpower** (People): Training gaps, staffing, knowledge silos (careful - focus on systemic issues)

**When to Use**: Complex incidents with multiple contributing factors

**Process**:
1. Draw fish skeleton with problem at head
2. Brainstorm causes in each category
3. Vote on most likely root causes
4. Investigate top 3-5 causes
5. Confirm with evidence (logs, metrics)

### Fault Tree Analysis

**Structure**: Tree with failure at top, causes below, gates connecting

**Gates**:
- **AND Gate**: All inputs required for failure (redundancy protects)
- **OR Gate**: Any input sufficient for failure (single point of failure)

**Example**:
```
Service Down (OR gate)
├── Database Failure (AND gate)
│   ├── Primary DB down
│   └── Replica promotion failed
└── Application Crash (OR gate)
    ├── Memory leak
    ├── Uncaught exception
    └── OOM killer triggered
```

**Use**: Identify single points of failure, prioritize redundancy
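The gate logic is small enough to make executable; the structure below mirrors the example tree (a hedged sketch - the dict encoding is an assumption, not a standard format):

```python
# Evaluate a fault tree: OR gates fail if any child fails,
# AND gates only if every child fails (redundancy protects).
def fails(node) -> bool:
    if isinstance(node, bool):       # leaf: True means the basic event occurred
        return node
    children = [fails(c) for c in node["children"]]
    return all(children) if node["gate"] == "AND" else any(children)

service_down = {
    "gate": "OR",
    "children": [
        # Database failure needs BOTH events (AND): primary down + failover broken
        {"gate": "AND", "children": [True, False]},   # replica promotion worked
        # Application crash: any single cause suffices (OR)
        {"gate": "OR", "children": [False, False, False]},
    ],
}
print(fails(service_down))  # False - redundancy held, no single point failed
```

Flipping any leaf under an OR gate to `True` fails the whole tree, which is exactly what makes OR subtrees the single points of failure to hunt for.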

### Swiss Cheese Model

**Concept**: Layers of defense (safeguards) with holes (gaps)

**Incident occurs when**: Holes align across all layers

**Example Layers**:
1. Design: Architecture with failover
2. Training: Team knows how to respond
3. Process: Peer review required
4. Technology: Automated tests
5. Monitoring: Alerts fire when issue occurs

**Analysis**: Identify which layers had holes for this incident, plug the holes

### Contributing Factors Framework

**Categorize causes**:

**Immediate Cause**: Proximate trigger
- Example: Config with wrong value deployed

**Enabling Causes**: Allowed immediate cause
- Example: No config validation, no peer review

**Systemic Causes**: Organizational issues
- Example: Pressure to ship fast, understaffed team, no time for rigor

**Action**: Address all levels, not just immediate

---

## 3. Corrective Action Frameworks

### Hierarchy of Controls

**From Most to Least Effective**:

1. **Elimination**: Remove hazard entirely
   - Example: Deprecate risky feature, sunset legacy system
   - Most effective but often not feasible

2. **Substitution**: Replace with safer alternative
   - Example: Use managed service vs self-hosted database
   - Reduces risk substantially

3. **Engineering Controls**: Add safeguards
   - Example: Rate limits, circuit breakers, automated testing, canary deployments
   - Effective and feasible

4. **Administrative Controls**: Improve processes
   - Example: Runbooks, checklists, peer review, approval gates
   - Relies on compliance

5. **Training**: Educate people
   - Example: Onboarding, workshops, documentation
   - Least effective alone, use with other controls

**Apply**: Start at top of hierarchy, work down until a feasible solution is found

### SMART Actions

**Criteria**:
- **Specific**: "Add config validation" not "Improve deployments"
- **Measurable**: "Reduce MTTR from 2hr to 30min" not "Faster response"
- **Assignable**: Name a person, not "team" or "we"
- **Realistic**: Given constraints (time, budget, skills)
- **Time-bound**: Explicit deadline

**Bad**: "Improve monitoring"
**Good**: "Add alert for connection pool >80% utilization by Mar 15 (Owner: Alex)"
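A lightweight lint can enforce the easy-to-check SMART criteria mechanically (a hypothetical helper; the field names and vague-word list are assumptions):

```python
# Flag postmortem actions that miss SMART basics: a named owner,
# an explicit deadline, and a description that isn't a vague verb phrase.
from dataclasses import dataclass
from datetime import date
from typing import Optional

VAGUE = {"improve", "better", "enhance", "optimize", "revisit"}

@dataclass
class Action:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None

def smart_problems(action: Action) -> list:
    problems = []
    if action.owner in (None, "", "team", "we"):
        problems.append("no named owner")
    if action.due is None:
        problems.append("no deadline")
    if action.description.split()[0].lower() in VAGUE:
        problems.append("vague description - name the concrete change")
    return problems

print(smart_problems(Action("Improve monitoring")))
# ['no named owner', 'no deadline', 'vague description - name the concrete change']
print(smart_problems(Action("Add alert for pool >80% utilization",
                            owner="Alex", due=date(2024, 3, 15))))
# []
```

Only Specific, Assignable, and Time-bound are machine-checkable here; Measurable and Realistic still need human review.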

### Prevention vs Detection vs Mitigation

**Prevention**: Stop incident from occurring
- Example: Config validation, automated testing, peer review
- Best ROI but can't prevent everything

**Detection**: Identify incident quickly
- Example: Monitoring, alerting, anomaly detection
- Reduces time to mitigation

**Mitigation**: Limit impact when incident occurs
- Example: Circuit breakers, graceful degradation, failover, rollback
- Reduces blast radius

**Balanced Portfolio**: Invest in all three

### Action Prioritization

**Impact vs Effort Matrix**:
- **High Impact, Low Effort**: Do immediately (Quick wins)
- **High Impact, High Effort**: Schedule as project (Strategic)
- **Low Impact, Low Effort**: Do if capacity allows (Fill-ins)
- **Low Impact, High Effort**: Skip (Not worth it)

**Risk-Based**: Prioritize by likelihood × impact of recurrence
- Critical incident (total outage) likely to recur → Top priority
- Minor issue (UI glitch) unlikely to recur → Lower priority
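The two-by-two matrix reduces to a one-line lookup (illustrative only; the labels come straight from the matrix above):

```python
# Map an action's estimated impact and effort onto the 2x2 quadrants.
def prioritize(impact: str, effort: str) -> str:
    quadrants = {
        ("high", "low"): "Quick win - do immediately",
        ("high", "high"): "Strategic - schedule as project",
        ("low", "low"): "Fill-in - do if capacity allows",
        ("low", "high"): "Skip - not worth it",
    }
    return quadrants[(impact, effort)]

print(prioritize("high", "low"))  # Quick win - do immediately
```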

---

## 4. Incident Response Patterns

### Incident Severity Levels

**Sev 1 (Critical)**:
- Total outage, data loss, security breach, >50% users affected
- Response: Immediate, all-hands, exec notification
- SLA: Acknowledge <15min, resolve <4hr

**Sev 2 (High)**:
- Partial outage, degraded performance, 10-50% users affected
- Response: On-call team, incident commander assigned
- SLA: Acknowledge <30min, resolve <24hr

**Sev 3 (Medium)**:
- Minor degradation, <10% users affected, non-critical feature down
- Response: On-call investigates, may defer to business hours
- SLA: Acknowledge <2hr, resolve <48hr

**Sev 4 (Low)**:
- Minimal impact, cosmetic issues, internal tools only
- Response: Create ticket, prioritize in backlog
- SLA: No SLA, best-effort
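On-call tooling often encodes this table directly; a minimal lookup might read (a hypothetical sketch - your paging tool's data model will differ):

```python
# Severity -> (acknowledge-by, resolve-by) targets in minutes; None = best-effort.
SLA = {
    1: (15, 4 * 60),
    2: (30, 24 * 60),
    3: (2 * 60, 48 * 60),
    4: (None, None),
}

def ack_breached(severity: int, minutes_to_ack: int) -> bool:
    ack_target, _ = SLA[severity]
    return ack_target is not None and minutes_to_ack > ack_target

print(ack_breached(1, 20))   # True - Sev 1 must be acknowledged within 15 min
print(ack_breached(4, 600))  # False - Sev 4 has no SLA
```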

### Incident Command Structure (ICS)

**Roles**:
- **Incident Commander (IC)**: Coordinates response, makes decisions, external communication
- **Technical Lead**: Diagnoses issue, implements fix, coordinates engineers
- **Communication Lead**: Status updates, stakeholder briefing, customer communication
- **Scribe**: Documents timeline, decisions, actions in incident log

**Why Structure**: Prevents chaos, clear ownership, scales to large incidents

**Rotation**: IC role rotates across senior engineers (training, burnout prevention)

### Communication Patterns

**Internal Updates** (Slack, incident channel):
- Frequency: Every 15-30 min
- Format: Status, progress, next steps, blockers
- Example: "Update 14:30: Root cause identified (bad config). Initiating rollback. ETA resolution: 15:00."

**External Updates** (Status page, social media):
- Frequency: At detection, every hour, at resolution
- Tone: Transparent, apologetic, no technical jargon
- Example: "We're currently experiencing issues with login. Our team is actively working on a fix. We'll update every hour."

**Executive Updates**:
- Trigger: Sev 1/2, within 30 min
- Format: Impact, root cause (if known), ETA, what we're doing
- Frequency: Every 30-60 min until resolved

### Postmortem Timing

**When to Conduct**:
- **Immediately** (within 48 hours): Sev 1/2 incidents
- **Weekly batching**: Sev 3 incidents (batch review)
- **Monthly**: Sev 4 or pattern analysis (recurring issues)
- **Pre-mortem**: Before major launch (imagine failure, identify risks)

**Who Attends**:
- All incident responders
- Service owners
- Optional: Stakeholders, leadership (for major incidents)

---

## 5. Postmortem Facilitation

### Facilitation Tips

**Set Tone**:
- Open: "We're here to learn, not blame. Everything shared is blameless."
- Emphasize: Focus on systems and processes, not individuals
- Encourage: Transparency, even uncomfortable truths

**Structure Meeting** (60-90 min):
1. **Recap timeline** (10 min): Walk through what happened
2. **Impact review** (5 min): Quantify damage
3. **Root cause brainstorm** (20 min): Fishbone or 5 Whys as a group
4. **Corrective actions** (20 min): Brainstorm actions for each root cause
5. **Prioritization** (10 min): Vote on top 5 actions
6. **Assign owners** (5 min): Who owns what, by when

**Facilitation Techniques**:
- **Round-robin**: Everyone speaks, no dominance
- **Silent brainstorm**: Write ideas on sticky notes, then share (prevents groupthink)
- **Dot voting**: Each person gets 3 votes for top priorities
- **Parking lot**: Capture off-topic items for later

**Red Flags** (intervene if you see):
- Blame language ("X caused this") → Reframe to system focus
- Defensiveness ("I had to rush because...") → Acknowledge pressure, focus on fixing the system
- Rabbit holes (long technical debates) → Park for offline discussion

### Follow-Up

**Document**: Assign someone to write up the postmortem (usually IC or Technical Lead)
**Share**: Distribute to team, stakeholders, company-wide (transparency)
**Present**: 10-min presentation in all-hands or team meeting (visibility)
**Track**: Add action items to project tracker, review weekly in standup
**Close**: Postmortem marked complete when all actions done

### Postmortem Review Meetings

**Monthly Postmortem Review** (Team-level):
- Review all postmortems from last month
- Identify patterns (repeated root causes)
- Assess action item completion rate
- Celebrate improvements

**Quarterly Postmortem Review** (Org-level):
- Aggregate data: Incident frequency, severity, MTTR
- Trends: Are we getting better? (MTTR decreasing? Repeat incidents decreasing?)
- Investment decisions: Where to invest in reliability

---

## 6. Organizational Learning

### Learning Metrics

**Track**:
- **Incident frequency**: Total incidents per month (decreasing over time?)
- **MTTR (Mean Time To Resolve)**: Average time from detection to resolution (improving?)
- **Repeat incidents**: Same root cause recurring (learning gap if high)
- **Action completion rate**: % of postmortem actions completed (accountability)
- **Near-miss reporting**: # of near-misses reported (psychological safety indicator)

**Goals**:
- Reduce incident frequency: -10% per quarter
- Reduce MTTR: -20% per quarter
- Reduce repeat incidents: <5% of total
- Action completion: >90%
- Near-miss reporting: Increasing (good sign)
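MTTR and the repeat-incident rate fall out of a simple aggregation over incident records (a sketch under assumed field names; the sample data is invented):

```python
# Compute MTTR and repeat-cause rate from minimal incident records.
from datetime import datetime

incidents = [  # hypothetical sample data
    {"detected": datetime(2024, 3, 5, 14, 7), "resolved": datetime(2024, 3, 5, 16, 5),
     "root_cause": "missing config validation"},
    {"detected": datetime(2024, 3, 12, 9, 0), "resolved": datetime(2024, 3, 12, 9, 45),
     "root_cause": "missing config validation"},
    {"detected": datetime(2024, 3, 20, 2, 30), "resolved": datetime(2024, 3, 20, 3, 0),
     "root_cause": "expired TLS certificate"},
]

# Mean detection-to-resolution time, in minutes.
mttr_min = sum((i["resolved"] - i["detected"]).total_seconds() / 60
               for i in incidents) / len(incidents)

# Fraction of distinct root causes that occurred more than once.
causes = [i["root_cause"] for i in incidents]
repeat_rate = sum(1 for c in set(causes) if causes.count(c) > 1) / len(set(causes))

print(f"MTTR: {mttr_min:.0f} min")
print(f"Repeat-cause rate: {repeat_rate:.0%}")
```

A rising repeat-cause rate is the clearest sign that postmortem actions are not landing.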

### Knowledge Sharing

**Postmortem Database**:
- Centralized repository (Confluence, Notion, wiki)
- Searchable by: Date, service, root cause category, severity
- Template: Consistent format for easy scanning

**Learning Sessions**:
- Monthly "Failure Fridays": Review interesting postmortems
- Quarterly "Top 10 Incidents": Review worst incidents and their learnings
- Blameless tone: Celebrate transparency, not just success

**Cross-Team Sharing**:
- Share postmortems beyond the immediate team
- Tag related teams (if payment incident, notify fintech team)
- Prevent: Team A repeating Team B's mistake

### Continuous Improvement

**Feedback Loops**:
- Postmortem → Corrective actions → Completion → Measure impact → Repeat
- Quarterly review: Are actions working? (MTTR decreasing? Repeats decreasing?)

**Runbook Updates**:
- After every incident: Update runbook with learnings
- Quarterly: Review all runbooks, refresh stale content
- Metric: Runbook age (>6 months old = needs refresh)

**Process Evolution**:
- Learn from the postmortem process itself
- Survey: "Was this postmortem useful? What would improve it?"
- Iterate: Improve template, facilitation, tracking

---

## Quick Reference

### Blameless Language Patterns
- ❌ "Person caused" → ✓ "Process allowed"
- ❌ "Didn't check" → ✓ "Validation missing"
- ❌ "Mistake" → ✓ "Gap in system"

### Root Cause Techniques
- **5 Whys**: Simple incidents, single cause
- **Fishbone**: Complex incidents, multiple causes
- **Fault Tree**: Identify single points of failure
- **Swiss Cheese**: Multiple safeguard failures

### Corrective Action Hierarchy
1. Eliminate (remove hazard)
2. Substitute (safer alternative)
3. Engineering controls (safeguards)
4. Administrative (processes)
5. Training (education)

### Incident Response
- **Sev 1**: Total outage, <15min ack, <4hr resolve
- **Sev 2**: Partial, <30min ack, <24hr resolve
- **Sev 3**: Minor, <2hr ack, <48hr resolve

### Success Metrics
- ✓ Incident frequency decreasing
- ✓ MTTR improving
- ✓ Repeat incidents <5%
- ✓ Action completion >90%
- ✓ Near-miss reporting increasing
396
skills/postmortem/resources/template.md
Normal file
@@ -0,0 +1,396 @@
# Postmortem Template

## Table of Contents
1. [Workflow](#workflow)
2. [Postmortem Document Template](#postmortem-document-template)
3. [Guidance for Each Section](#guidance-for-each-section)
4. [Quick Patterns](#quick-patterns)
5. [Quality Checklist](#quality-checklist)

## Workflow

Copy this checklist and track your progress:

```
Postmortem Progress:
- [ ] Step 1: Assemble timeline and quantify impact
- [ ] Step 2: Conduct root cause analysis
- [ ] Step 3: Define corrective and preventive actions
- [ ] Step 4: Document and share postmortem
- [ ] Step 5: Track action items to completion
```

**Step 1: Assemble timeline and quantify impact**

Fill out the [Incident Summary](#incident-summary) and [Timeline](#timeline) sections. Quantify impact in the [Impact](#impact) section. Gather logs, metrics, alerts, and witness accounts.

**Step 2: Conduct root cause analysis**

Use 5 Whys or a fishbone diagram to identify root causes. Fill out the [Root Cause Analysis](#root-cause-analysis) section. See [Guidance: Root Cause](#root-cause) for techniques.

**Step 3: Define corrective and preventive actions**

For each root cause, identify actions. Fill out the [Corrective Actions](#corrective-actions) section with SMART actions (specific, measurable, assigned, realistic, time-bound).

**Step 4: Document and share postmortem**

Complete the [What Went Well](#what-went-well) and [Lessons Learned](#lessons-learned) sections. Share with team, stakeholders, and the broader organization. Present in a meeting for discussion.

**Step 5: Track action items to completion**

Add actions to the project tracker. Assign owners, set deadlines. Review progress weekly. Close the postmortem when all actions are complete. Use the [Quality Checklist](#quality-checklist) to validate.

---

## Postmortem Document Template

### Incident Summary

**Incident ID**: [Unique identifier, e.g., INC-2024-001]
**Date**: [When incident occurred, e.g., 2024-03-05]
**Duration**: [How long, e.g., 2 hours 15 minutes]
**Severity**: [Critical / High / Medium / Low]
**Status**: [Resolved / Investigating / Monitoring]

**Summary**: [1-2 sentence description of what happened]

**Impact Summary**:
- **Users Affected**: [Number or percentage]
- **Duration**: [Downtime or degradation period]
- **Revenue Impact**: [Estimated $ loss, if applicable]
- **SLA Impact**: [SLA breach? Refunds owed?]
- **Reputation Impact**: [Customer complaints, social media, press]

**Owner**: [Postmortem author]
**Contributors**: [Who participated in postmortem]
**Date Written**: [When postmortem was created]

---

### Timeline

**Detection to Resolution**: [14:05 - 16:05 UTC (2 hours)]

| Time (UTC) | Event | Source | Action Taken |
|------------|-------|--------|--------------|
| 14:00 | [Normal operation] | - | - |
| 14:05 | [Deployment started] | CI/CD | [Config change deployed] |
| 14:07 | [Error spike detected] | Monitoring | [Errors jumped from 0 to 500/min] |
| 14:10 | [Alert fired] | PagerDuty | [On-call engineer paged] |
| 14:15 | [Engineer joins incident] | Slack | [Investigation begins] |
| 14:20 | [Root cause identified] | Logs | [Bad config found] |
| 14:25 | [Mitigation attempt #1] | Manual | [Tried config hotfix - failed] |
| 14:45 | [Mitigation attempt #2] | Manual | [Scaled up instances - no effect] |
| 15:30 | [Rollback decision] | Incident Commander | [Decided to rollback deployment] |
| 15:45 | [Rollback executed] | CI/CD | [Rolled back to previous version] |
| 16:05 | [Service restored] | Monitoring | [Errors returned to 0, traffic normal] |
| 16:30 | [All-clear declared] | Incident Commander | [Incident closed] |

**Key Observations**:
- Detection lag: [X minutes between occurrence and detection]
- Response time: [X minutes from alert to engineer joining]
- Diagnosis time: [X minutes to identify root cause]
- Mitigation time: [X minutes from mitigation start to resolution]
- Total incident duration: [X hours Y minutes]
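The key observations derive mechanically from the timeline timestamps; applied to the example table, the arithmetic looks like this (a sketch, not part of the template):

```python
# Derive the "Key Observations" figures from the example timeline above.
from datetime import datetime

def t(hhmm: str) -> datetime:
    return datetime.strptime(f"2024-03-05 {hhmm}", "%Y-%m-%d %H:%M")

occurred, detected, alerted = t("14:05"), t("14:07"), t("14:10")
joined, diagnosed = t("14:15"), t("14:20")
mitigation_start, resolved = t("14:25"), t("16:05")

def minutes(a: datetime, b: datetime) -> int:
    return int((b - a).total_seconds() // 60)

print("Detection lag:", minutes(occurred, detected), "min")            # 2 min
print("Response time:", minutes(alerted, joined), "min")               # 5 min
print("Diagnosis time:", minutes(joined, diagnosed), "min")            # 5 min
print("Mitigation time:", minutes(mitigation_start, resolved), "min")  # 100 min
print("Total duration:", minutes(occurred, resolved), "min")           # 120 min
```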

---

### Impact

**User Impact**: [50K users (20%), complete unavailability, global]
**Business**: [$20K revenue loss, SLA breach (99.9% → 2hr down), $15K credits]
**Operational**: [150 support tickets, 10 person-hours incident response]
**Reputation**: [50 negative tweets, 3 Enterprise escalations]

| Metric | Baseline | During | Post |
|--------|----------|--------|------|
| Uptime | 99.99% | 0% | 99.99% |
| Errors | <0.1% | 100% | <0.1% |
| Users | 250K | 0 | 240K |

---

### Root Cause Analysis

#### 5 Whys

**Problem Statement**: Database connection pool exhausted, causing 2-hour service outage affecting 50K users.

1. **Why did the service go down?**
   - Database connection pool exhausted (all connections in use, new requests rejected)

2. **Why did the connection pool exhaust?**
   - Connection pool size was set to 10, but traffic required 100+ connections

3. **Why was the pool size set to 10?**
   - Configuration template had the wrong value (10 instead of 100)

4. **Why did the template have the wrong value?**
   - New team member created the template without referencing production values

5. **Why didn't anyone catch this before deployment?**
   - No staging environment with production-like load to test configs
   - No automated config validation in deployment pipeline
   - No peer review required for configuration changes

**Root Causes Identified**:
1. **Primary**: Missing config validation in deployment pipeline
2. **Contributing**: No staging environment with prod-like load
3. **Contributing**: Unclear onboarding process for infrastructure changes
4. **Contributing**: No peer review requirement for config changes

#### Contributing Factors

**Technical**:
- Deployment pipeline lacks config validation
- No staging environment matching production scale
- Monitoring detected the symptom (errors) but not the root cause (pool exhaustion)

**Process**:
- Onboarding doesn't cover production config management
- No peer review process for infrastructure changes
- Runbook for rollback was outdated, delayed recovery by 45 minutes

**Organizational**:
- Pressure to deploy quickly (feature deadline) led to skipping validation steps
- Team understaffed (3 people covering on-call for 50-person engineering org)

---

### Corrective Actions

**Immediate Fixes** (Completed or in-flight):
- [x] Rollback to previous config (Completed 16:05 UTC)
- [x] Manually set connection pool to 100 on all instances (Completed 16:30 UTC)
- [ ] Update deployment runbook with rollback steps (Owner: Sam, Due: Mar 8)

**Short-Term Improvements** (Next 2-4 weeks):
- [ ] Add config validation to deployment pipeline (Owner: Alex, Due: Mar 15)
  - Validate connection pool size is between 50-500
  - Validate all required config keys present
  - Fail deployment if validation fails
- [ ] Create staging environment with production-like load (Owner: Jordan, Due: Mar 30)
  - 10% of prod traffic
  - Same config as prod
  - Automated smoke tests after each deployment
- [ ] Require peer review for all infrastructure changes (Owner: Morgan, Due: Mar 10)
  - Update GitHub settings to require 1 approval
  - Add CODEOWNERS for infra configs
- [ ] Improve monitoring to detect pool exhaustion (Owner: Casey, Due: Mar 20)
  - Alert when pool utilization >80%
  - Dashboard showing pool metrics

**Long-Term Investments** (Next 1-3 months):
- [ ] Implement canary deployments (Owner: Alex, Due: May 1)
  - Deploy to 5% of traffic first
  - Auto-rollback if error rate >0.5%
- [ ] Expand on-call rotation (Owner: Morgan, Due: Apr 15)
  - Hire 2 more SREs
  - Train 3 backend engineers for on-call
  - Reduce on-call burden from 1:3 to 1:8
- [ ] Onboarding program for infrastructure changes (Owner: Jordan, Due: Apr 30)
  - Checklist for new team members
  - Buddy system for first 3 infra changes
  - Production access granted after completion

**Tracking**:
- Actions added to JIRA project: INCIDENT-2024-001
- Reviewed in weekly engineering standup
- Postmortem closed when all actions complete (target: May 1)
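The first short-term action above could start as a small pipeline guard (a hypothetical sketch; the key names are invented, and the 50-500 bound comes from the example action):

```python
# Deployment-pipeline config guard: report missing keys or an
# out-of-range connection pool size (the gap behind this incident).
REQUIRED_KEYS = {"db_host", "db_port", "connection_pool_size"}  # assumed keys
POOL_MIN, POOL_MAX = 50, 500

def validate_config(config: dict) -> list:
    errors = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - config.keys())]
    pool = config.get("connection_pool_size")
    if isinstance(pool, int) and not POOL_MIN <= pool <= POOL_MAX:
        errors.append(f"connection_pool_size {pool} outside [{POOL_MIN}, {POOL_MAX}]")
    return errors

bad = {"db_host": "db.internal", "db_port": 5432, "connection_pool_size": 10}
print(validate_config(bad))
# ['connection_pool_size 10 outside [50, 500]']
```

In a real pipeline the deploy step would exit non-zero when this list is non-empty, which is what makes the deployment "fail if validation fails".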
|
||||
|
||||
---
|
||||
|
||||
### What Went Well
|
||||
|
||||
**Detection & Alerting**:
|
||||
- ✓ Monitoring detected error spike within 3 minutes of occurrence
|
||||
- ✓ PagerDuty alert fired correctly, paged on-call within 5 minutes
|
||||
- ✓ Escalation path worked (on-call → senior eng → incident commander)
|
||||
|
||||
**Incident Response**:
|
||||
- ✓ Team assembled quickly (5 engineers joined within 10 minutes)
|
||||
- ✓ Clear incident commander (Morgan) coordinated response
|
||||
- ✓ Regular status updates posted to #incidents Slack channel every 15 min
|
||||
- ✓ Stakeholder communication: Execs notified within 20 min, status page updated
|
||||
|
||||
**Communication**:
|
||||
- ✓ Blameless tone maintained throughout (no finger-pointing)
|
||||
- ✓ External communication transparent (status page, Twitter updates)
|
||||
- ✓ Customer support briefed on issue and talking points provided
|
||||
|
||||
**Technical**:
|
||||
- ✓ Rollback process worked once initiated (though delayed by unclear runbook)
|
||||
- ✓ No data loss or corruption during incident
|
||||
- ✓ Automated alerts prevented longer outage (if manual detection, could be hours)
|
||||
|
||||
---
|
||||
|
||||
### Lessons Learned

**What We Learned**:
1. **Staging environments matter**: A prod-like staging environment could have caught this.
2. **Config is code**: Config needs the same rigor as code (validation, review, testing).
3. **Runbooks rot**: An outdated runbook delayed recovery by 45 min - keep runbooks fresh.
4. **Pressure kills quality**: Rushing to hit the deadline led to skipping validation.
5. **Monitoring gaps**: We detected the symptom (errors) but not the cause (pool exhaustion) - we need better metrics.

**Knowledge Gaps Identified**:
- New team members don't know production config patterns
- Not everyone knows how to roll back deployments
- Connection pool sizing is not well understood across the team

**Process Improvements Needed**:
- Peer review for all infrastructure changes (not just code)
- Automated config validation before deployment
- Regular runbook review/update (quarterly)
- Staging environment for all changes

**Cultural Observations**:
- Team responded well under pressure (good collaboration, clear roles)
- Blameless culture held up (no one blamed the new team member)
- But: pressure to ship fast is real and leads to shortcuts

---

### Appendix

**Related Incidents**:
- [INC-2024-002]: Similar config error 2 weeks later (different service)
- [INC-2023-045]: Previous database connection issue (6 months ago)

**References**:
- Logs: [Link to log aggregator query]
- Metrics: [Link to Grafana dashboard]
- Incident Slack thread: [#incidents, Mar 5 14:10-16:30]
- Status page: [Link to status.company.com incident #123]
- Customer impact: [Link to support ticket analysis]

**Contacts**:
- Incident Commander: Morgan (morgan@company.com)
- Postmortem Author: Alex (alex@company.com)
- Escalation: VP Engineering (vpe@company.com)

---

## Guidance for Each Section

### Incident Summary
- **Severity Levels**:
  - Critical: Total outage, data loss, security breach
  - High: Partial outage, significant degradation, many users affected
  - Medium: Minor degradation, limited users affected
  - Low: Minimal impact, caught before customer impact
- **Summary**: 1-2 sentences max. Example: "Database connection pool exhaustion caused a 2-hour outage affecting 50K users due to an incorrect config deployment."

### Timeline
- **Timestamps**: Use UTC or clearly state the timezone. Be specific (14:05, not "afternoon").
- **Key Events**: Detection, escalation, investigation milestones, mitigation attempts, resolution.
- **Sources**: Where the information came from (monitoring, logs, person X, Slack, etc.).
- **Objectivity**: Facts only, no blame. "Engineer X deployed", not "Engineer X mistakenly deployed".

### Impact
- **Quantify**: Numbers > adjectives. "50K users", not "many users". "$20K", not "significant".
- **Multiple Dimensions**: Users, revenue, SLA, reputation, team time, opportunity cost.
- **Metrics Table**: Before/during/after values for key metrics show impact clearly.

### Root Cause
- **5 Whys**: Stop when you reach a fixable systemic issue, not "person made a mistake".
- **Multiple Causes**: Most incidents have more than one cause. List primary + contributing.
- **System Focus**: "Deployment pipeline allowed bad config", not "Person deployed bad config".

### Corrective Actions
- **SMART**: Specific (what exactly), Measurable (how to verify), Assigned (who owns), Realistic (achievable), Time-bound (when done).
- **Categorize**: Immediate (days), short-term (weeks), long-term (months).
- **Prioritize**: High impact, low effort first. Don't create more than 10 actions (they won't get done).
- **Track**: Add to a project tracker, review weekly, don't let actions languish.

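Most of the SMART criteria above can be checked mechanically before an action item enters the tracker. This is a sketch under assumptions: the dict shape and field names (`description`, `verification`, `owner`, `due`) are hypothetical, not a prescribed schema.

```python
from datetime import date

# Hypothetical action-item shape; field names are illustrative.
def smart_violations(action: dict) -> list[str]:
    """Flag missing SMART fields on a corrective action item."""
    problems = []
    if not action.get("description"):
        problems.append("not Specific: missing description")
    if not action.get("verification"):
        problems.append("not Measurable: no way to verify completion")
    if not action.get("owner"):
        problems.append("not Assigned: no owner")
    if not isinstance(action.get("due"), date):
        problems.append("not Time-bound: no due date")
    # "Realistic" can't be checked mechanically; review it by hand.
    return problems
```

Running this over a draft postmortem's action list gives a quick pass/fail before the review meeting; anything that returns a non-empty list goes back to its author.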
### What Went Well
- **Balance**: Postmortems aren't just about failures; recognize what worked.
- **Examples**: Good communication, fast response, effective teamwork, working alerting.
- **Build On**: These are strengths to maintain and amplify.

### Lessons Learned
- **Generalize**: Extract principles applicable beyond this incident.
- **Share**: These are org-wide learnings; share them broadly.
- **Action**: Connect lessons to corrective actions.

---

## Quick Patterns

### Production Outage Postmortem

```
Summary: [Service] down for [X hours] affecting [Y users] due to [root cause]
Timeline: Detection → Investigation → Mitigation attempts → Resolution
Impact: Users, revenue, SLA, reputation
Root Cause: 5 Whys to system issue
Actions: Immediate fixes, monitoring improvements, process changes, long-term infrastructure
```

### Security Incident Postmortem

```
Summary: [Breach type] exposed [data/systems] due to [vulnerability]
Timeline: Breach → Detection (often delayed) → Containment → Remediation → Recovery
Impact: Data exposed, compliance risk, reputation, remediation cost
Root Cause: Missing security controls, access management, unpatched vulnerabilities
Actions: Security audit, access review, patch management, training, compliance reporting
```

### Product Launch Failure Postmortem

```
Summary: [Feature] launched to [X% adoption] vs [Y% expected] due to [reasons]
Timeline: Planning → Development → Launch → Metrics review → Pivot decision
Impact: Revenue miss, wasted effort, opportunity cost, team morale
Root Cause: Poor problem validation, wrong solution, inadequate testing, misaligned expectations
Actions: Improve discovery, user research, validation, stakeholder alignment
```

### Project Deadline Miss Postmortem

```
Summary: [Project] delivered [X months late] due to [root causes]
Timeline: Kickoff → Milestones → Delays → Delivery
Impact: Customer impact, revenue delay, team burnout, opportunity cost
Root Cause: Poor estimation, scope creep, technical challenges, resource constraints
Actions: Improve estimation, scope control, technical spikes, resource planning
```

---

## Quality Checklist

Before finalizing the postmortem, verify:

**Completeness**:
- [ ] Incident summary with severity, impact summary, dates
- [ ] Timeline with timestamps, events, actions taken
- [ ] Impact quantified (users, duration, revenue, metrics)
- [ ] Root cause analysis (5 Whys or equivalent)
- [ ] Corrective actions meet SMART criteria (specific, measurable, assigned, realistic, time-bound)
- [ ] What went well section
- [ ] Lessons learned documented

**Quality**:
- [ ] Blameless tone (focus on systems, not individuals)
- [ ] Timeline uses specific timestamps (not vague times)
- [ ] Impact quantified with numbers (not adjectives like "many")
- [ ] Root cause goes beyond surface symptoms (asked "Why?" 5 times)
- [ ] Actions are specific, owned, and time-bound (not a vague "improve X")
- [ ] Actions tracked in project management tool

**Timeliness**:
- [ ] Postmortem written within 48 hours of incident resolution
- [ ] Shared with team and stakeholders within 72 hours
- [ ] Presented in team meeting for discussion
- [ ] Action items in tracker with owners and deadlines

**Learning**:
- [ ] Lessons learned extracted and generalized
- [ ] Shared broadly for organizational learning
- [ ] Action items address root causes (not just symptoms)
- [ ] Follow-up review scheduled (2-4 weeks) to check progress

**Overall Assessment**:
- [ ] Would this prevent a similar incident if all actions were completed?
- [ ] Are we attacking root causes or just symptoms?
- [ ] Is the tone blameless and constructive?
- [ ] Will the team actually complete these actions? (realistic, prioritized)