# Runbook: [PROBLEM TITLE]

**Alert:** [Alert name that triggers this runbook]
**Severity:** [SEV1 / SEV2 / SEV3]
**Owner:** [Team name]
**Last Updated:** [YYYY-MM-DD]
**Last Tested:** [YYYY-MM-DD]


## Problem Description

[2-3 sentence description of what this problem is]

**Symptoms:**

- [Observable symptom 1 - what users/operators see]
- [Observable symptom 2]
- [Observable symptom 3]

**Impact:**

- **Customer Impact:** [What users experience]
- **Business Impact:** [Revenue, SLA, compliance]
- **Affected Services:** [List of services]

## Prerequisites

**Required Access:**

- Kubernetes cluster access (kubectl configured)
- Database access (PlanetScale/PostgreSQL)
- Cloudflare Workers access (wrangler configured)
- Monitoring access (Grafana, Datadog)

**Required Tools** (a preflight sketch follows this list):

- kubectl v1.28+
- wrangler v3+
- pscale CLI
- curl, jq
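
Before working an incident, it can be worth a quick preflight to confirm the tools above are installed and that cluster access works. A minimal sketch; nothing below is prescribed by this template beyond the tool list itself:

```bash
#!/usr/bin/env bash
# Preflight: confirm required CLIs are installed before an incident starts.
set -euo pipefail

for tool in kubectl wrangler pscale curl jq; do
  command -v "$tool" >/dev/null 2>&1 || { echo "MISSING: $tool"; exit 1; }
done

kubectl version --client   # expect v1.28+
wrangler --version         # expect v3+

# Cheap check that kubectl is actually pointed at a reachable cluster.
if kubectl auth can-i get pods >/dev/null 2>&1; then
  echo "kubectl: cluster access OK"
else
  echo "kubectl: no cluster access - check your kubeconfig"
fi
```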

## Diagnosis

### Step 1: [Check Initial Symptom]

**What to check:** [Describe what this step verifies]

```bash
# Command to run
[command]

# Expected output (healthy):
[what you should see if everything is fine]

# Problem indicator:
[what you see if there's an issue]
```

**Interpretation:**

- If [condition], then [conclusion]
- If [condition], then go to Step 2
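
To make Step 1 concrete, suppose the alert concerns API pod health; a filled-in first check might look like the following (the `production` namespace and `app=api` label are illustrative placeholders, not part of the template):

```bash
# Hypothetical Step 1 for a pod-health alert: list pods and spot trouble.
kubectl get pods -n production -l app=api

# Healthy: every pod Running, READY n/n, restart counts stable.
# Problem indicators: CrashLoopBackOff, Pending, or climbing RESTARTS.
kubectl get pods -n production -l app=api \
  --field-selector=status.phase!=Running
```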

### Step 2: [Verify Root Cause]

```bash
# Command to run
[command]

# Look for:
[what to look for in the output]
```

**Possible Causes:**

1. [Cause 1]: [How to identify] → Go to Mitigation Option A
2. [Cause 2]: [How to identify] → Go to Mitigation Option B
3. [Cause 3]: [How to identify] → Escalate to [team]
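
Continuing the hypothetical pod-health example from Step 1, distinguishing between candidate causes usually comes down to recent events and logs (pod name and namespace are placeholders):

```bash
# Recent events for the suspect pod - OOMKilled, image pull errors,
# and failed probes all show up here.
kubectl describe pod <pod-name> -n production | tail -n 20

# Logs from the previous (crashed) container instance; a stack trace
# or config error here often separates Cause 1 from Cause 2.
kubectl logs <pod-name> -n production --previous --tail=50
```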

### Step 3: [Additional Verification]

[Only if needed for complex scenarios]

```bash
# Commands
[commands]
```

## Mitigation

### Option A: [Cause 1]

**When to use:** [Conditions when this mitigation applies]

**Steps:**

1. [Action 1]

   ```bash
   [command]
   ```

2. [Action 2]

   ```bash
   [command]
   ```

3. [Action 3]

   ```bash
   [command]
   ```

**Verification:**

```bash
# Check that mitigation worked
[verification command]

# Expected result:
[what you should see]
```

**If mitigation fails:** [What to do next - usually escalate]
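
As one example of how Option A might be filled in: when the cause is a wedged deployment, the mitigation often reduces to a rolling restart plus a status watch. A sketch, with the deployment name again a placeholder:

```bash
# Mitigation sketch: restart the suspect deployment and wait for rollout.
kubectl rollout restart deployment/api -n production
kubectl rollout status deployment/api -n production --timeout=5m

# Spot-check recovery against the health endpoint used later in this runbook.
curl -fsS https://api.greyhaven.io/health | jq .
```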


### Option B: [Cause 2]

[Same format as Option A]


## Rollback

If mitigation makes things worse:

```bash
# Rollback command 1
[command to undo action 1]

# Rollback command 2
[command to undo action 2]
```
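
For the common case where the mitigation was a deployment change, the undo commands tend to be rollout reversals. A sketch under that assumption (the deployment name is a placeholder; confirm the wrangler subcommand against your installed version):

```bash
# Revert the most recent Kubernetes rollout for the deployment.
kubectl rollout undo deployment/api -n production
kubectl rollout status deployment/api -n production --timeout=5m

# For Cloudflare Workers, wrangler v3 can roll back to the prior deployment.
wrangler rollback
```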

## Verification & Monitoring

### Health Checks

After mitigation, verify these metrics return to normal:

```bash
# Check 1: Service health
curl https://api.greyhaven.io/health
# Expected: HTTP 200, {"status": "healthy"}

# Check 2: Error rate
# Grafana: Error Rate dashboard
# Expected: <0.1%

# Check 3: Latency
# Grafana: API Latency dashboard
# Expected: p95 <500ms
```
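
Checks 2 and 3 live in Grafana, but Check 1 can be watched from a terminal during the monitoring period. A small polling sketch built only on the endpoint and expected body shown above:

```bash
# Poll the health endpoint once a minute; stop manually once stable.
while true; do
  status=$(curl -fsS https://api.greyhaven.io/health | jq -r '.status')
  echo "$(date -u +%H:%M:%S) status=${status:-unreachable}"
  [ "$status" = "healthy" ] || echo "WARNING: unhealthy or no response"
  sleep 60
done
```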

### Monitoring Period

Monitor for [time period] after mitigation:

- Error rate stable (<0.1%)
- Latency normal (p95 <500ms)
- No new alerts
- User reports resolved

## Escalation

**Escalate if:**

- Mitigation doesn't work after [X] minutes
- Root cause unclear after diagnosis
- Issue is [severity] and unresolved after [X] minutes
- Multiple services affected

**Escalation Path:**

```
0-15 min:  @oncall-engineer
15-30 min: @team-lead
30-60 min: @engineering-manager
60+ min:   @vp-engineering (SEV1 only)
```

**Escalation Contacts:**

- Team Slack: #[team-channel]
- PagerDuty: [escalation policy]
- Oncall: @[oncall-alias]

## Common Mistakes

### Mistake 1: [Common Error]

**Wrong:**

```bash
[incorrect command or approach]
```

**Correct:**

```bash
[correct command or approach]
```

### Mistake 2: [Common Error]

[Description and correction]


## Related Resources

- **Alert Definition:** [Link to alert config]
- **Monitoring Dashboard:** [Link to Grafana]
- **Architecture Doc:** [Link to system architecture]
- **Past Incidents:** [Links to similar incidents]
- **Postmortems:** [Links to related postmortems]

## Changelog

| Date | Author | Changes |
|------|--------|---------|
| [YYYY-MM-DD] | @[name] | Initial creation |
| [YYYY-MM-DD] | @[name] | Updated [what changed] |

## Testing Notes

**Last Test Date:** [YYYY-MM-DD]
**Test Result:** [Pass / Fail]
**Notes:** [What was learned from testing]

**How to Test** (a staging sketch follows this list):

1. [Step to simulate failure in staging]
2. [Follow runbook]
3. [Verify recovery]
4. [Document time taken and any issues]
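
One low-risk way to run such a test is to induce the failure deliberately in staging and time the recovery. A sketch, assuming a staging namespace and a scalable deployment (both names are placeholders):

```bash
# Simulate an outage in staging by scaling the service to zero.
kubectl scale deployment/api -n staging --replicas=0
start=$(date +%s)

# ...work through the Diagnosis and Mitigation sections above...

# Restore service and record time-to-recovery for the Testing Notes.
kubectl scale deployment/api -n staging --replicas=3
kubectl rollout status deployment/api -n staging --timeout=5m
echo "Recovery took $(( $(date +%s) - start ))s"
```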
