# Runbook: [PROBLEM TITLE]

**Alert:** [Alert name that triggers this runbook]
**Severity:** [SEV1 / SEV2 / SEV3]
**Owner:** [Team name]
**Last Updated:** [YYYY-MM-DD]
**Last Tested:** [YYYY-MM-DD]
## Problem Description
[2-3 sentence description of what this problem is]
**Symptoms:**
- [Observable symptom 1 - what users/operators see]
- [Observable symptom 2]
- [Observable symptom 3]
**Impact:**
- **Customer Impact:** [What users experience]
- **Business Impact:** [Revenue, SLA, compliance]
- **Affected Services:** [List of services]
## Prerequisites

**Required Access:**
- Kubernetes cluster access (`kubectl` configured)
- Database access (PlanetScale/PostgreSQL)
- Cloudflare Workers access (`wrangler` configured)
- Monitoring access (Grafana, Datadog)

**Required Tools:**
- `kubectl` v1.28+
- `wrangler` v3+
- `pscale` CLI
- `curl`, `jq`
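Before an incident, it is worth confirming the toolchain is actually in place. A minimal sanity-check sketch (version expectations mirror the list above; adjust to your environment):

```bash
# Verify the required CLIs exist on PATH (sketch; adapt as needed)
for tool in kubectl wrangler pscale curl jq; do
  command -v "$tool" >/dev/null 2>&1 || echo "MISSING: $tool"
done

kubectl version --client     # expect v1.28 or newer
wrangler --version           # expect v3 or newer
kubectl auth can-i get pods  # confirms cluster access, not just the binary
```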
## Diagnosis

### Step 1: [Check Initial Symptom]

**What to check:** [Describe what this step verifies]
```bash
# Command to run
[command]

# Expected output (healthy):
[what you should see if everything is fine]

# Problem indicator:
[what you see if there's an issue]
```
**Interpretation:**
- If [condition], then [conclusion]
- If [condition], then go to Step 2
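For illustration, a filled-in Step 1 might look like the following (hypothetical `api-gateway` deployment in a `production` namespace; substitute your own service):

```bash
# Command to run
kubectl get pods -n production -l app=api-gateway

# Expected output (healthy): all pods Running, READY 1/1, low RESTARTS
# Problem indicator: pods in CrashLoopBackOff or Pending, or climbing RESTARTS
```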
### Step 2: [Verify Root Cause]

```bash
# Command to run
[command]

# Look for:
[what to look for in the output]
```
**Possible Causes:**
- [Cause 1]: [How to identify] → Go to Mitigation Option A
- [Cause 2]: [How to identify] → Go to Mitigation Option B
- [Cause 3]: [How to identify] → Escalate to [team]
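As one concrete (hypothetical) instance of a root-cause check, scanning recent logs and deploy history for the same `api-gateway` example:

```bash
# Pull recent logs (kubectl picks one pod behind the deployment) and count error lines
kubectl logs -n production deploy/api-gateway --since=15m | grep -ci "error"

# Check whether the error spike lines up with a recent deploy
kubectl rollout history deployment/api-gateway -n production
```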
### Step 3: [Additional Verification]

[Only if needed for complex scenarios]

```bash
# Commands
[commands]
```
## Mitigation

### Option A: [Cause 1]

**When to use:** [Conditions when this mitigation applies]

**Steps:**
1. [Action 1]
   ```bash
   [command]
   ```
2. [Action 2]
   ```bash
   [command]
   ```
3. [Action 3]
   ```bash
   [command]
   ```
**Verification:**

```bash
# Check that mitigation worked
[verification command]

# Expected result:
[what you should see]
```

**If mitigation fails:** [What to do next - usually escalate]
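A common mitigation of this shape is a rolling restart after a bad config push; a hedged sketch using the same hypothetical deployment:

```bash
# 1. Trigger a rolling restart so pods pick up the corrected config
kubectl rollout restart deployment/api-gateway -n production

# 2. Watch the rollout; fail fast if it stalls
kubectl rollout status deployment/api-gateway -n production --timeout=5m

# Verification
curl -s https://api.greyhaven.io/health | jq -r .status   # expect: healthy
```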
### Option B: [Cause 2]
[Same format as Option A]
## Rollback

If mitigation makes things worse:

```bash
# Rollback command 1
[command to undo action 1]

# Rollback command 2
[command to undo action 2]
```
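For deploy-related mitigations, the platform-native undo is usually the safest rollback; a sketch with the same hypothetical names (the Workers command assumes Wrangler v3):

```bash
# Undo the most recent rollout (add --to-revision=N to target a specific one)
kubectl rollout undo deployment/api-gateway -n production
kubectl rollout status deployment/api-gateway -n production

# For Cloudflare Workers changes, roll back to the previous deployment
wrangler rollback
```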
## Verification & Monitoring

### Health Checks
After mitigation, verify these metrics return to normal:
```bash
# Check 1: Service health
curl https://api.greyhaven.io/health
# Expected: HTTP 200, {"status": "healthy"}

# Check 2: Error rate
# Grafana: Error Rate dashboard
# Expected: <0.1%

# Check 3: Latency
# Grafana: API Latency dashboard
# Expected: p95 <500ms
```
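Check 1 can be scripted for repeatability; a minimal sketch that polls the health endpoint above (the Grafana checks remain manual):

```bash
#!/usr/bin/env bash
# Poll the health endpoint until it reports healthy, or give up after 10 tries
for i in $(seq 1 10); do
  code=$(curl -s -o /tmp/health.json -w '%{http_code}' https://api.greyhaven.io/health)
  body=$(jq -r '.status' /tmp/health.json 2>/dev/null)
  if [ "$code" = "200" ] && [ "$body" = "healthy" ]; then
    echo "OK: service healthy"; exit 0
  fi
  echo "attempt $i: http=$code status=$body"; sleep 30
done
echo "FAIL: service still unhealthy"; exit 1
```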
### Monitoring Period
Monitor for [time period] after mitigation:
- Error rate stable (<0.1%)
- Latency normal (p95 <500ms)
- No new alerts
- User reports resolved
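During the monitoring window, a simple loop can keep the health signal visible in a terminal (a sketch; dashboards and alerts stay the source of truth):

```bash
# Re-check the health endpoint every 60 seconds during the monitoring period
watch -n 60 'curl -s https://api.greyhaven.io/health | jq .'
```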
## Escalation

**Escalate if:**
- Mitigation doesn't work after [X] minutes
- Root cause unclear after diagnosis
- Issue is [severity] and unresolved after [X] minutes
- Multiple services affected
**Escalation Path:**
- 0-15 min: @oncall-engineer
- 15-30 min: @team-lead
- 30-60 min: @engineering-manager
- 60+ min: @vp-engineering (SEV1 only)

**Escalation Contacts:**
- Team Slack: #[team-channel]
- PagerDuty: [escalation policy]
- Oncall: @[oncall-alias]
## Common Mistakes

### Mistake 1: [Common Error]

**Wrong:**

```bash
[incorrect command or approach]
```

**Correct:**

```bash
[correct command or approach]
```

### Mistake 2: [Common Error]

[Description and correction]
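As one hypothetical filled-in instance of this section, a mistake that appears in many Kubernetes runbooks:

```bash
# Wrong: deletes every pod at once, taking the service fully down
kubectl delete pods -n production -l app=api-gateway

# Correct: rolling restart that keeps capacity up while pods cycle
kubectl rollout restart deployment/api-gateway -n production
```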
## Related Documentation

- **Alert Definition:** [Link to alert config]
- **Monitoring Dashboard:** [Link to Grafana]
- **Architecture Doc:** [Link to system architecture]
- **Past Incidents:** [Links to similar incidents]
- **Postmortems:** [Links to related postmortems]
## Changelog
| Date | Author | Changes |
|---|---|---|
| [YYYY-MM-DD] | @[name] | Initial creation |
| [YYYY-MM-DD] | @[name] | Updated [what changed] |
## Testing Notes

**Last Test Date:** [YYYY-MM-DD]
**Test Result:** [Pass / Fail]
**Notes:** [What was learned from testing]

**How to Test:**
1. [Step to simulate failure in staging]
2. [Follow runbook]
3. [Verify recovery]
4. [Document time taken and any issues]
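One way to run the simulation step safely is against a staging copy of the service; a hedged sketch (hypothetical `staging` namespace and replica count):

```bash
# Simulate the failure: scale the deployment to zero, in staging only
kubectl scale deployment/api-gateway -n staging --replicas=0

# ...walk the Diagnosis and Mitigation sections above...

# Restore, then time how long recovery takes
kubectl scale deployment/api-gateway -n staging --replicas=3
kubectl rollout status deployment/api-gateway -n staging
```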