# Runbook: [PROBLEM TITLE]

**Alert**: [Alert name that triggers this runbook]
**Severity**: [SEV1 / SEV2 / SEV3]
**Owner**: [Team name]
**Last Updated**: [YYYY-MM-DD]
**Last Tested**: [YYYY-MM-DD]

---

## Problem Description

[2-3 sentence description of what this problem is]

**Symptoms**:
- [Observable symptom 1 - what users/operators see]
- [Observable symptom 2]
- [Observable symptom 3]

**Impact**:
- **Customer Impact**: [What users experience]
- **Business Impact**: [Revenue, SLA, compliance]
- **Affected Services**: [List of services]

---

## Prerequisites

**Required Access**:
- [ ] Kubernetes cluster access (`kubectl` configured)
- [ ] Database access (PlanetScale/PostgreSQL)
- [ ] Cloudflare Workers access (`wrangler` configured)
- [ ] Monitoring access (Grafana, Datadog)

**Required Tools** (a preflight check follows this list):
- [ ] `kubectl` v1.28+
- [ ] `wrangler` v3+
- [ ] `pscale` CLI
- [ ] `curl`, `jq`
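
A quick preflight sketch to confirm the tools are installed before an incident forces the issue (the version pins above are the team's targets, not enforced here):

```bash
# Preflight: confirm each required CLI is on PATH before you need it
for tool in kubectl wrangler pscale curl jq; do
  command -v "$tool" >/dev/null 2>&1 || echo "MISSING: $tool"
done

# Spot-check versions against the pins above
kubectl version --client
wrangler --version
```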

---

## Diagnosis

### Step 1: [Check Initial Symptom]

**What to check**: [Describe what this step verifies]

```bash
# Command to run
[command]

# Expected output (healthy):
[what you should see if everything is fine]

# Problem indicator:
[what you see if there's an issue]
```

**Interpretation**:
- If [condition], then [conclusion]
- If [condition], then go to Step 2
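
As an illustration, a completed Step 1 for a Kubernetes pod check might look like the following; the namespace `production` and label `app=api` are placeholders, not confirmed values for this service:

```bash
# Are the API pods healthy?
kubectl get pods -n production -l app=api

# Expected output (healthy): all pods Running, READY n/n
# Problem indicator: CrashLoopBackOff, ImagePullBackOff, or Pending pods
```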

---

### Step 2: [Verify Root Cause]

```bash
# Command to run
[command]

# Look for:
[what to look for in the output]
```

**Possible Causes**:
1. **[Cause 1]**: [How to identify] → Go to [Mitigation Option A](#option-a-cause-1)
2. **[Cause 2]**: [How to identify] → Go to [Mitigation Option B](#option-b-cause-2)
3. **[Cause 3]**: [How to identify] → Escalate to [team]
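
For illustration, a filled-in Step 2 that checks the Kubernetes side might read as follows (namespace and deployment names are placeholders):

```bash
# Recent cluster events, newest last
kubectl get events -n production --sort-by=.lastTimestamp | tail -20

# Recent application errors from the deployment's pods
kubectl logs deploy/api -n production --since=15m | grep -i error
```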

---

### Step 3: [Additional Verification]

[Only if needed for complex scenarios]

```bash
# Commands
[commands]
```
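
If the failure may sit at the edge rather than in the cluster, one additional check is tailing the Cloudflare Worker's live logs; the Worker name below is hypothetical:

```bash
# Stream live requests and exceptions from the Worker (name is a placeholder)
wrangler tail my-worker --format pretty
```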

---

## Mitigation

### Option A: [Cause 1]

**When to use**: [Conditions when this mitigation applies]

**Steps**:
1. [Action 1]
   ```bash
   [command]
   ```

2. [Action 2]
   ```bash
   [command]
   ```

3. [Action 3]
   ```bash
   [command]
   ```

**Verification**:
```bash
# Check that mitigation worked
[verification command]

# Expected result:
[what you should see]
```

**If mitigation fails**: [What to do next - usually escalate]
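
To make the format concrete, a filled-in Option A for a bad-deploy cause could look like this; `deployment/api` is hypothetical and these are standard kubectl rollout commands, not steps validated for this service:

```bash
# 1. Revert to the previous revision
kubectl rollout undo deployment/api -n production

# 2. Wait for the rollout to converge
kubectl rollout status deployment/api -n production --timeout=5m

# 3. Confirm pods are healthy again
kubectl get pods -n production -l app=api
```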

---

### Option B: [Cause 2]

[Same format as Option A]

---

## Rollback

**If mitigation makes things worse:**

```bash
# Rollback command 1
[command to undo action 1]

# Rollback command 2
[command to undo action 2]
```
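
If the mitigation involved a Workers deploy, wrangler v3 can revert to a prior deployment; a sketch, with the deployment ID left as a placeholder:

```bash
# Find the last known-good deployment ID
wrangler deployments list

# Revert the Worker to it
wrangler rollback [deployment-id]
```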

---

## Verification & Monitoring

### Health Checks

After mitigation, verify these metrics return to normal:

```bash
# Check 1: Service health
curl https://api.greyhaven.io/health
# Expected: HTTP 200, {"status": "healthy"}

# Check 2: Error rate
# Grafana: Error Rate dashboard
# Expected: <0.1%

# Check 3: Latency
# Grafana: API Latency dashboard
# Expected: p95 <500ms
```
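
The service-health curl can be scripted so failure is unambiguous. A minimal sketch using the endpoint above plus `jq` (both already in Required Tools):

```bash
# Exits non-zero unless the endpoint returns 2xx and the body has status == "healthy"
curl -fsS https://api.greyhaven.io/health | jq -e '.status == "healthy"'

# Optionally re-check every 30s during the monitoring period
watch -n 30 'curl -fsS https://api.greyhaven.io/health | jq .status'
```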

### Monitoring Period

Monitor for **[time period]** after mitigation:
- [ ] Error rate stable (<0.1%)
- [ ] Latency normal (p95 <500ms)
- [ ] No new alerts
- [ ] User reports resolved

---

## Escalation

**Escalate if**:
- Mitigation doesn't work after [X] minutes
- Root cause unclear after diagnosis
- Issue is [severity] and unresolved after [X] minutes
- Multiple services affected

**Escalation Path**:
```
0-15 min: @oncall-engineer
15-30 min: @team-lead
30-60 min: @engineering-manager
60+ min: @vp-engineering (SEV1 only)
```

**Escalation Contact**:
- Team Slack: #[team-channel]
- PagerDuty: [escalation policy]
- Oncall: @[oncall-alias]

---

## Common Mistakes

### Mistake 1: [Common Error]

**Wrong**:
```bash
[incorrect command or approach]
```

**Correct**:
```bash
[correct command or approach]
```
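
A hypothetical filled-in entry, to show the intended contrast (pod and deployment names are invented):

```bash
# Wrong: force-deleting skips graceful shutdown and can drop in-flight requests
kubectl delete pod api-7d4b9c-xk2p1 -n production --force --grace-period=0

# Correct: a rolling restart replaces pods without dropping traffic
kubectl rollout restart deployment/api -n production
```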

### Mistake 2: [Common Error]

[Description and correction]

---

## Related Documentation

- **Alert Definition**: [Link to alert config]
- **Monitoring Dashboard**: [Link to Grafana]
- **Architecture Doc**: [Link to system architecture]
- **Past Incidents**: [Links to similar incidents]
- **Postmortems**: [Links to related postmortems]

---

## Changelog

| Date | Author | Changes |
|------|--------|---------|
| [YYYY-MM-DD] | @[name] | Initial creation |
| [YYYY-MM-DD] | @[name] | Updated [what changed] |

---

## Testing Notes

**Last Test Date**: [YYYY-MM-DD]
**Test Result**: [Pass / Fail]
**Notes**: [What was learned from testing]

**How to Test** (a staging sketch follows this list):
1. [Step to simulate failure in staging]
2. [Follow the runbook]
3. [Verify recovery]
4. [Document time taken and any issues]
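
One way to carry out step 1 is to induce the failure artificially in staging; scaling a (hypothetical) deployment to zero simulates a hard outage, and scaling back restores it:

```bash
# Simulate an outage in staging (deployment name and replica count are placeholders)
kubectl scale deployment/api -n staging --replicas=0

# ...run through the runbook, then restore service:
kubectl scale deployment/api -n staging --replicas=3
```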

---

Return to [templates index](INDEX.md)