# Runbook: [PROBLEM TITLE]

**Alert**: [Alert name that triggers this runbook]
**Severity**: [SEV1 / SEV2 / SEV3]
**Owner**: [Team name]
**Last Updated**: [YYYY-MM-DD]
**Last Tested**: [YYYY-MM-DD]

---

## Problem Description

[2-3 sentence description of what this problem is]

**Symptoms**:
- [Observable symptom 1 - what users/operators see]
- [Observable symptom 2]
- [Observable symptom 3]

**Impact**:
- **Customer Impact**: [What users experience]
- **Business Impact**: [Revenue, SLA, compliance]
- **Affected Services**: [List of services]

---

## Prerequisites

**Required Access**:
- [ ] Kubernetes cluster access (`kubectl` configured)
- [ ] Database access (PlanetScale/PostgreSQL)
- [ ] Cloudflare Workers access (`wrangler` configured)
- [ ] Monitoring access (Grafana, Datadog)

**Required Tools** (a preflight check follows this list):
- [ ] `kubectl` v1.28+
- [ ] `wrangler` v3+
- [ ] `pscale` CLI
- [ ] `curl`, `jq`
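
A quick preflight sketch to confirm the tools are installed before an incident forces the issue (the version pins above are the team's targets, not enforced here):

```bash
# Preflight: confirm each required CLI is on PATH before you need it
for tool in kubectl wrangler pscale curl jq; do
  command -v "$tool" >/dev/null 2>&1 || echo "MISSING: $tool"
done

# Spot-check versions against the pins above
kubectl version --client
wrangler --version
```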

---

## Diagnosis

### Step 1: [Check Initial Symptom]

**What to check**: [Describe what this step verifies]

```bash
# Command to run
[command]

# Expected output (healthy):
[what you should see if everything is fine]

# Problem indicator:
[what you see if there's an issue]
```

**Interpretation**:
- If [condition], then [conclusion]
- If [condition], then go to Step 2
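
As an illustration, a completed Step 1 for a Kubernetes pod check might look like the following; the namespace `production` and label `app=api` are placeholders, not confirmed values for this service:

```bash
# Are the API pods healthy?
kubectl get pods -n production -l app=api

# Expected output (healthy): all pods Running, READY n/n
# Problem indicator: CrashLoopBackOff, ImagePullBackOff, or Pending pods
```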

---

### Step 2: [Verify Root Cause]

```bash
# Command to run
[command]

# Look for:
[what to look for in the output]
```

**Possible Causes**:
1. **[Cause 1]**: [How to identify] → Go to [Mitigation Option A](#option-a-cause-1)
2. **[Cause 2]**: [How to identify] → Go to [Mitigation Option B](#option-b-cause-2)
3. **[Cause 3]**: [How to identify] → Escalate to [team]
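
For illustration, a filled-in Step 2 that checks the Kubernetes side might read as follows (namespace and deployment names are placeholders):

```bash
# Recent cluster events, newest last
kubectl get events -n production --sort-by=.lastTimestamp | tail -20

# Recent application errors from the deployment's pods
kubectl logs deploy/api -n production --since=15m | grep -i error
```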

---

### Step 3: [Additional Verification]

[Only if needed for complex scenarios]

```bash
# Commands
[commands]
```
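
If the failure may sit at the edge rather than in the cluster, one additional check is tailing the Cloudflare Worker's live logs; the Worker name below is hypothetical:

```bash
# Stream live requests and exceptions from the Worker (name is a placeholder)
wrangler tail my-worker --format pretty
```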

---

## Mitigation

### Option A: [Cause 1]

**When to use**: [Conditions when this mitigation applies]

**Steps**:
1. [Action 1]
   ```bash
   [command]
   ```

2. [Action 2]
   ```bash
   [command]
   ```

3. [Action 3]
   ```bash
   [command]
   ```

**Verification**:
```bash
# Check that mitigation worked
[verification command]

# Expected result:
[what you should see]
```

**If mitigation fails**: [What to do next - usually escalate]
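
To make the format concrete, a filled-in Option A for a bad-deploy cause could look like this; `deployment/api` is hypothetical and these are standard kubectl rollout commands, not steps validated for this service:

```bash
# 1. Revert to the previous revision
kubectl rollout undo deployment/api -n production

# 2. Wait for the rollout to converge
kubectl rollout status deployment/api -n production --timeout=5m

# 3. Confirm pods are healthy again
kubectl get pods -n production -l app=api
```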

---

### Option B: [Cause 2]

[Same format as Option A]

---

## Rollback

**If mitigation makes things worse:**

```bash
# Rollback command 1
[command to undo action 1]

# Rollback command 2
[command to undo action 2]
```
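
If the mitigation involved a Workers deploy, wrangler v3 can revert to a prior deployment; a sketch, with the deployment ID left as a placeholder:

```bash
# Find the last known-good deployment ID
wrangler deployments list

# Revert the Worker to it
wrangler rollback [deployment-id]
```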

---

## Verification & Monitoring

### Health Checks

After mitigation, verify these metrics return to normal:

```bash
# Check 1: Service health
curl https://api.greyhaven.io/health
# Expected: HTTP 200, {"status": "healthy"}

# Check 2: Error rate
# Grafana: Error Rate dashboard
# Expected: <0.1%

# Check 3: Latency
# Grafana: API Latency dashboard
# Expected: p95 <500ms
```
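
The service-health curl can be scripted so failure is unambiguous. A minimal sketch using the endpoint above plus `jq` (both already in Required Tools):

```bash
# Exits non-zero unless the endpoint returns 2xx and the body has status == "healthy"
curl -fsS https://api.greyhaven.io/health | jq -e '.status == "healthy"'

# Optionally re-check every 30s during the monitoring period
watch -n 30 'curl -fsS https://api.greyhaven.io/health | jq .status'
```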

### Monitoring Period

Monitor for **[time period]** after mitigation:
- [ ] Error rate stable (<0.1%)
- [ ] Latency normal (p95 <500ms)
- [ ] No new alerts
- [ ] User reports resolved

---

## Escalation

**Escalate if**:
- Mitigation doesn't work after [X] minutes
- Root cause unclear after diagnosis
- Issue is [severity] and unresolved after [X] minutes
- Multiple services affected

**Escalation Path**:
```
0-15 min: @oncall-engineer
15-30 min: @team-lead
30-60 min: @engineering-manager
60+ min: @vp-engineering (SEV1 only)
```

**Escalation Contact**:
- Team Slack: #[team-channel]
- PagerDuty: [escalation policy]
- Oncall: @[oncall-alias]

---

## Common Mistakes

### Mistake 1: [Common Error]

**Wrong**:
```bash
[incorrect command or approach]
```

**Correct**:
```bash
[correct command or approach]
```
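
A hypothetical filled-in entry, to show the intended contrast (pod and deployment names are invented):

```bash
# Wrong: force-deleting skips graceful shutdown and can drop in-flight requests
kubectl delete pod api-7d4b9c-xk2p1 -n production --force --grace-period=0

# Correct: a rolling restart replaces pods without dropping traffic
kubectl rollout restart deployment/api -n production
```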

### Mistake 2: [Common Error]

[Description and correction]

---

## Related Documentation

- **Alert Definition**: [Link to alert config]
- **Monitoring Dashboard**: [Link to Grafana]
- **Architecture Doc**: [Link to system architecture]
- **Past Incidents**: [Links to similar incidents]
- **Postmortems**: [Links to related postmortems]

---

## Changelog

| Date | Author | Changes |
|------|--------|---------|
| [YYYY-MM-DD] | @[name] | Initial creation |
| [YYYY-MM-DD] | @[name] | Updated [what changed] |

---

## Testing Notes

**Last Test Date**: [YYYY-MM-DD]
**Test Result**: [Pass / Fail]
**Notes**: [What was learned from testing]

**How to Test** (a staging sketch follows this list):
1. [Step to simulate failure in staging]
2. [Follow the runbook]
3. [Verify recovery]
4. [Document time taken and any issues]
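
One way to carry out step 1 is to induce the failure artificially in staging; scaling a (hypothetical) deployment to zero simulates a hard outage, and scaling back restores it:

```bash
# Simulate an outage in staging (deployment name and replica count are placeholders)
kubectl scale deployment/api -n staging --replicas=0

# ...run through the runbook, then restore service:
kubectl scale deployment/api -n staging --replicas=3
```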

---

Return to [templates index](INDEX.md)