# Runbook: [PROBLEM TITLE]

**Alert:** [Alert name that triggers this runbook]
**Severity:** [SEV1 / SEV2 / SEV3]
**Owner:** [Team name]
**Last Updated:** [YYYY-MM-DD]
**Last Tested:** [YYYY-MM-DD]
## Problem Description
[2-3 sentence description of what this problem is]
**Symptoms:**
- [Observable symptom 1 - what users/operators see]
- [Observable symptom 2]
- [Observable symptom 3]
**Impact:**
- **Customer Impact:** [What users experience]
- **Business Impact:** [Revenue, SLA, compliance]
- **Affected Services:** [List of services]
## Prerequisites

**Required Access:**
- Kubernetes cluster access (`kubectl` configured)
- Database access (PlanetScale/PostgreSQL)
- Cloudflare Workers access (`wrangler` configured)
- Monitoring access (Grafana, Datadog)

**Required Tools:**
- `kubectl` v1.28+
- `wrangler` v3+
- `pscale` CLI
- `curl`, `jq`
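Before an incident, it is worth confirming the toolchain is actually in place. A minimal sanity-check sketch (version expectations mirror the list above; adjust to your environment):

```bash
# Verify the required CLIs exist on PATH (sketch; adapt as needed)
for tool in kubectl wrangler pscale curl jq; do
  command -v "$tool" >/dev/null 2>&1 || echo "MISSING: $tool"
done

kubectl version --client     # expect v1.28 or newer
wrangler --version           # expect v3 or newer
kubectl auth can-i get pods  # confirms cluster access, not just the binary
```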
## Diagnosis

### Step 1: [Check Initial Symptom]

**What to check:** [Describe what this step verifies]
```bash
# Command to run
[command]

# Expected output (healthy):
[what you should see if everything is fine]

# Problem indicator:
[what you see if there's an issue]
```
**Interpretation:**
- If [condition], then [conclusion]
- If [condition], then go to Step 2
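For illustration, a filled-in Step 1 might look like the following (hypothetical `api-gateway` deployment in a `production` namespace; substitute your own service):

```bash
# Command to run
kubectl get pods -n production -l app=api-gateway

# Expected output (healthy): all pods Running, READY 1/1, low RESTARTS
# Problem indicator: pods in CrashLoopBackOff or Pending, or climbing RESTARTS
```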
### Step 2: [Verify Root Cause]

```bash
# Command to run
[command]

# Look for:
[what to look for in the output]
```
**Possible Causes:**
- [Cause 1]: [How to identify] → Go to Mitigation Option A
- [Cause 2]: [How to identify] → Go to Mitigation Option B
- [Cause 3]: [How to identify] → Escalate to [team]
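As one concrete (hypothetical) instance of a root-cause check, scanning recent logs and deploy history for the same `api-gateway` example:

```bash
# Pull recent logs (kubectl picks one pod behind the deployment) and count error lines
kubectl logs -n production deploy/api-gateway --since=15m | grep -ci "error"

# Check whether the error spike lines up with a recent deploy
kubectl rollout history deployment/api-gateway -n production
```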
### Step 3: [Additional Verification]

[Only if needed for complex scenarios]

```bash
# Commands
[commands]
```
## Mitigation

### Option A: [Cause 1]

**When to use:** [Conditions when this mitigation applies]

**Steps:**
1. [Action 1]
   ```bash
   [command]
   ```
2. [Action 2]
   ```bash
   [command]
   ```
3. [Action 3]
   ```bash
   [command]
   ```
**Verification:**

```bash
# Check that mitigation worked
[verification command]

# Expected result:
[what you should see]
```

**If mitigation fails:** [What to do next - usually escalate]
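A common mitigation of this shape is a rolling restart after a bad config push; a hedged sketch using the same hypothetical deployment:

```bash
# 1. Trigger a rolling restart so pods pick up the corrected config
kubectl rollout restart deployment/api-gateway -n production

# 2. Watch the rollout; fail fast if it stalls
kubectl rollout status deployment/api-gateway -n production --timeout=5m

# Verification
curl -s https://api.greyhaven.io/health | jq -r .status   # expect: healthy
```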
### Option B: [Cause 2]
[Same format as Option A]
## Rollback

If mitigation makes things worse:

```bash
# Rollback command 1
[command to undo action 1]

# Rollback command 2
[command to undo action 2]
```
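For deploy-related mitigations, the platform-native undo is usually the safest rollback; a sketch with the same hypothetical names (the Workers command assumes Wrangler v3):

```bash
# Undo the most recent rollout (add --to-revision=N to target a specific one)
kubectl rollout undo deployment/api-gateway -n production
kubectl rollout status deployment/api-gateway -n production

# For Cloudflare Workers changes, roll back to the previous deployment
wrangler rollback
```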
## Verification & Monitoring

### Health Checks
After mitigation, verify these metrics return to normal:
```bash
# Check 1: Service health
curl https://api.greyhaven.io/health
# Expected: HTTP 200, {"status": "healthy"}

# Check 2: Error rate
# Grafana: Error Rate dashboard
# Expected: <0.1%

# Check 3: Latency
# Grafana: API Latency dashboard
# Expected: p95 <500ms
```
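Check 1 can be scripted for repeatability; a minimal sketch that polls the health endpoint above (the Grafana checks remain manual):

```bash
#!/usr/bin/env bash
# Poll the health endpoint until it reports healthy, or give up after 10 tries
for i in $(seq 1 10); do
  code=$(curl -s -o /tmp/health.json -w '%{http_code}' https://api.greyhaven.io/health)
  body=$(jq -r '.status' /tmp/health.json 2>/dev/null)
  if [ "$code" = "200" ] && [ "$body" = "healthy" ]; then
    echo "OK: service healthy"; exit 0
  fi
  echo "attempt $i: http=$code status=$body"; sleep 30
done
echo "FAIL: service still unhealthy"; exit 1
```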
### Monitoring Period
Monitor for [time period] after mitigation:
- Error rate stable (<0.1%)
- Latency normal (p95 <500ms)
- No new alerts
- User reports resolved
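During the monitoring window, a simple loop can keep the health signal visible in a terminal (a sketch; dashboards and alerts stay the source of truth):

```bash
# Re-check the health endpoint every 60 seconds during the monitoring period
watch -n 60 'curl -s https://api.greyhaven.io/health | jq .'
```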
## Escalation

**Escalate if:**
- Mitigation doesn't work after [X] minutes
- Root cause unclear after diagnosis
- Issue is [severity] and unresolved after [X] minutes
- Multiple services affected
**Escalation Path:**
- 0-15 min: @oncall-engineer
- 15-30 min: @team-lead
- 30-60 min: @engineering-manager
- 60+ min: @vp-engineering (SEV1 only)

**Escalation Contacts:**
- Team Slack: #[team-channel]
- PagerDuty: [escalation policy]
- Oncall: @[oncall-alias]
## Common Mistakes

### Mistake 1: [Common Error]

**Wrong:**

```bash
[incorrect command or approach]
```

**Correct:**

```bash
[correct command or approach]
```

### Mistake 2: [Common Error]

[Description and correction]
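As one hypothetical filled-in instance of this section, a mistake that appears in many Kubernetes runbooks:

```bash
# Wrong: deletes every pod at once, taking the service fully down
kubectl delete pods -n production -l app=api-gateway

# Correct: rolling restart that keeps capacity up while pods cycle
kubectl rollout restart deployment/api-gateway -n production
```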
## Related Documentation

- **Alert Definition:** [Link to alert config]
- **Monitoring Dashboard:** [Link to Grafana]
- **Architecture Doc:** [Link to system architecture]
- **Past Incidents:** [Links to similar incidents]
- **Postmortems:** [Links to related postmortems]
## Changelog
| Date | Author | Changes |
|---|---|---|
| [YYYY-MM-DD] | @[name] | Initial creation |
| [YYYY-MM-DD] | @[name] | Updated [what changed] |
## Testing Notes

**Last Test Date:** [YYYY-MM-DD]
**Test Result:** [Pass / Fail]
**Notes:** [What was learned from testing]

**How to Test:**
1. [Step to simulate failure in staging]
2. [Follow runbook]
3. [Verify recovery]
4. [Document time taken and any issues]
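One way to run the simulation step safely is against a staging copy of the service; a hedged sketch (hypothetical `staging` namespace and replica count):

```bash
# Simulate the failure: scale the deployment to zero, in staging only
kubectl scale deployment/api-gateway -n staging --replicas=0

# ...walk the Diagnosis and Mitigation sections above...

# Restore, then time how long recovery takes
kubectl scale deployment/api-gateway -n staging --replicas=3
kubectl rollout status deployment/api-gateway -n staging
```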