# Runbook: [Alert Name]
## Overview
**Alert Name**: [e.g., HighLatency, ServiceDown, ErrorBudgetBurn]
**Severity**: [Critical | Warning | Info]
**Team**: [e.g., Backend, Platform, Database]
**Component**: [e.g., API Gateway, User Service, PostgreSQL]
**What it means**: [One-line description of what this alert indicates]
**User impact**: [How does this affect users? High/Medium/Low]
**Urgency**: [How quickly must this be addressed? Immediate/Hours/Days]
---
## Alert Details
### When This Alert Fires
This alert fires when:
- [Specific condition, e.g., "P95 latency exceeds 500ms for 10 minutes"]
- [Any additional conditions]
### Symptoms
Users will experience:
- [ ] Slow response times
- [ ] Errors or failures
- [ ] Service unavailable
- [ ] [Other symptoms]
### Probable Causes
Common causes include:
1. **[Cause 1]**: [Description]
- Example: Database overload due to slow queries
2. **[Cause 2]**: [Description]
- Example: Memory leak causing OOM errors
3. **[Cause 3]**: [Description]
- Example: Upstream service degradation
---
## Investigation Steps
### 1. Check Service Health
**Dashboard**: [Link to primary dashboard]
**Key metrics to check**:
```promql
# Request rate
sum(rate(http_requests_total[5m]))
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Latency (p95, p99)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
**What to look for**:
- [ ] Has traffic spiked recently?
- [ ] Is error rate elevated?
- [ ] Are any endpoints particularly slow?
### 2. Check Recent Changes
**Deployments**:
```bash
# Kubernetes
kubectl rollout history deployment/[service-name] -n [namespace]
# Check when last deployed
kubectl get pods -n [namespace] -o wide | grep [service-name]
```
**What to look for**:
- [ ] Was there a recent deployment?
- [ ] Did alert start after deployment?
- [ ] Any configuration changes? (see the sketch below)
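To see what actually changed, compare rollout revisions and scan recent events. A sketch (the revision number and the `[service-name]`/`[namespace]` placeholders are illustrative):
```bash
# Show the pod template recorded for a given revision
kubectl rollout history deployment/[service-name] -n [namespace] --revision=2
# Recent events in the namespace, oldest first; look for config or scaling changes
kubectl get events -n [namespace] --sort-by=.lastTimestamp | tail -20
```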
### 3. Check Logs
**Log query** (adjust for your log system):
**Kubernetes**:
```bash
kubectl logs deployment/[service-name] -n [namespace] --tail=100 | grep ERROR
```
**Elasticsearch/Kibana**:
```json
GET /logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "service": "[service-name]" } },
        { "match": { "level": "error" } },
        { "range": { "@timestamp": { "gte": "now-30m" } } }
      ]
    }
  }
}
```
**Loki/LogQL**:
```logql
{job="[service-name]"} |= "error" | json | level="error"
```
**What to look for**:
- [ ] Repeated error messages
- [ ] Stack traces
- [ ] Connection errors
- [ ] Timeout errors
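If errors are noisy, tally the most frequent messages to find the dominant failure mode. A sketch for Kubernetes logs:
```bash
# Count repeated error lines over the last 30 minutes, most frequent first
kubectl logs deployment/[service-name] -n [namespace] --since=30m \
  | grep ERROR | sort | uniq -c | sort -rn | head -10
```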
### 4. Check Dependencies
**Database**:
```sql
-- Check active connections
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
-- Check slow queries (running longer than 5 seconds)
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 seconds';
```
**External APIs**:
- [ ] Check status pages: [Link to status pages]
- [ ] Check API error rates in dashboard
- [ ] Test API endpoints manually (see the sketch below)
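A quick manual probe (sketch; `[api-host]` and the `/health` path are placeholders for the actual dependency):
```bash
# Report status code and total time for an external dependency
curl -sS -o /dev/null -w 'status=%{http_code} time=%{time_total}s\n' \
  --max-time 10 https://[api-host]/health
```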
**Cache** (Redis/Memcached):
```bash
# Redis info
redis-cli -h [host] INFO stats
# Check memory usage
redis-cli -h [host] INFO memory
```
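To judge cache effectiveness from the same `INFO` output, compare hits to misses and watch evictions:
```bash
# A low hits-to-misses ratio or rising evicted_keys suggests cache pressure
redis-cli -h [host] INFO stats | grep -E 'keyspace_hits|keyspace_misses|evicted_keys'
```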
### 5. Check Resource Usage
**CPU and Memory**:
```bash
# Kubernetes
kubectl top pods -n [namespace] | grep [service-name]
# Node metrics
kubectl top nodes
```
**Prometheus queries**:
```promql
# CPU usage by pod
sum(rate(container_cpu_usage_seconds_total{pod=~"[service-name].*"}[5m])) by (pod)
# Memory usage by pod
sum(container_memory_usage_bytes{pod=~"[service-name].*"}) by (pod)
```
**What to look for**:
- [ ] CPU throttling
- [ ] Memory approaching limits
- [ ] Disk space issues
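To confirm memory-limit kills, check pod restart counts and the last terminated state (sketch; `[pod-name]` is a placeholder):
```bash
# RESTARTS column shows churn; "OOMKilled" under Last State confirms memory kills
kubectl get pods -n [namespace] | grep [service-name]
kubectl describe pod [pod-name] -n [namespace] | grep -A 5 'Last State'
```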
### 6. Check Traces (if available)
**Trace query**:
**Jaeger**: search the UI for traces slower than 1s in the last 30 minutes.
**Tempo/TraceQL**:
```traceql
{ duration > 1s && resource.service.name = "[service-name]" }
```
**What to look for**:
- [ ] Which operation is slow?
- [ ] Where is time spent? (DB, external API, service logic)
- [ ] Any N+1 query patterns?
---
## Common Scenarios and Solutions
### Scenario 1: Recent Deployment Caused Issue
**Symptoms**:
- Alert started immediately after deployment
- Error logs correlate with new code
**Solution**:
```bash
# Rollback deployment
kubectl rollout undo deployment/[service-name] -n [namespace]
# Verify rollback succeeded
kubectl rollout status deployment/[service-name] -n [namespace]
# Monitor for alert resolution
```
**Follow-up**:
- [ ] Create incident report
- [ ] Review deployment process
- [ ] Add pre-deployment checks
### Scenario 2: Database Performance Issue
**Symptoms**:
- Slow query logs show problematic queries
- Database CPU or connection pool exhausted
**Solution**:
```sql
-- Identify the slow query (see the pg_stat_activity query in Investigation step 4)
-- Cancel the long-running query (use with caution)
SELECT pg_cancel_backend([pid]);
-- Or terminate the backend if cancel doesn't work
SELECT pg_terminate_backend([pid]);
-- Add a missing index (CONCURRENTLY avoids long locks; do this in a maintenance window)
CREATE INDEX CONCURRENTLY idx_name ON table_name (column_name);
```
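If the `pg_stat_statements` extension is enabled (an assumption; it is not on by default), the heaviest queries can be listed directly. Column names below match PostgreSQL 13+; older versions use `mean_time`:
```bash
# Top 5 queries by mean execution time (requires pg_stat_statements)
psql -h [host] -U [user] -d [database] -c \
  "SELECT calls, round(mean_exec_time::numeric, 1) AS mean_ms,
          left(query, 80) AS query
   FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 5;"
```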
**Follow-up**:
- [ ] Add query performance test
- [ ] Review and optimize query
- [ ] Consider read replicas
### Scenario 3: Memory Leak
**Symptoms**:
- Memory usage gradually increasing
- Eventually OOMKilled
- Restarts temporarily fix issue
**Solution**:
```bash
# Immediate: Restart pods
kubectl rollout restart deployment/[service-name] -n [namespace]
# Increase memory limits (temporary)
kubectl set resources deployment/[service-name] -n [namespace] \
--limits=memory=2Gi
```
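To confirm a leak rather than a one-off spike, watch whether memory climbs steadily after the restart (sketch; `kubectl top` requires metrics-server):
```bash
# Sample pod memory every 60 seconds; steady growth points to a leak
watch -n 60 'kubectl top pods -n [namespace] | grep [service-name]'
```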
**Follow-up**:
- [ ] Profile application for memory leaks
- [ ] Add memory usage alerts
- [ ] Fix root cause
### Scenario 4: Traffic Spike / DDoS
**Symptoms**:
- Sudden traffic increase
- Traffic from unusual sources
- High CPU/memory across all instances
**Solution**:
```bash
# Scale up immediately
kubectl scale deployment/[service-name] -n [namespace] --replicas=10
# Enable rate limiting at load balancer level
# (Specific steps depend on LB)
# Block suspicious IPs if confirmed DDoS
# (Use WAF or network policies)
```
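As a stopgap until proper auto-scaling lands (see follow-ups), a one-off HorizontalPodAutoscaler can absorb the spike. A sketch, assuming metrics-server is installed:
```bash
# Temporary HPA: scale between 10 and 30 replicas targeting 70% CPU
kubectl autoscale deployment/[service-name] -n [namespace] \
  --min=10 --max=30 --cpu-percent=70
```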
**Follow-up**:
- [ ] Implement rate limiting
- [ ] Add DDoS protection (CloudFlare, WAF)
- [ ] Set up auto-scaling
### Scenario 5: Upstream Service Degradation
**Symptoms**:
- Errors calling external API
- Timeouts to upstream service
- Upstream status page shows issues
**Solution**:
- Enable the circuit breaker (if available)
- Adjust timeout configuration
- Switch to a backup service or cached data
- Monitor the external service; check its status page: [Link]
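To rule out local networking before blaming the upstream, probe it from inside the cluster. A sketch using a throwaway curl pod (`[upstream-host]` and the `/health` path are placeholders):
```bash
# One-shot curl pod; prints status code and total time, then cleans itself up
kubectl run tmp-curl --rm -it --restart=Never --image=curlimages/curl \
  -n [namespace] --command -- curl -sS -o /dev/null \
  -w 'status=%{http_code} time=%{time_total}s\n' --max-time 10 \
  https://[upstream-host]/health
```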
**Follow-up**:
- [ ] Implement circuit breaker pattern
- [ ] Add fallback mechanisms
- [ ] Set up external service monitoring
---
## Immediate Actions (< 5 minutes)
These should be done first to mitigate impact:
1. **[Action 1]**: [e.g., "Scale up service"]
   ```bash
   kubectl scale deployment/[service] --replicas=10
   ```
2. **[Action 2]**: [e.g., "Rollback deployment"]
   ```bash
   kubectl rollout undo deployment/[service]
   ```
3. **[Action 3]**: [e.g., "Enable circuit breaker"]
3. **[Action 3]**: [e.g., "Enable circuit breaker"]
---
## Short-term Actions (< 30 minutes)
After immediate mitigation:
1. **[Action 1]**: [e.g., "Investigate root cause"]
2. **[Action 2]**: [e.g., "Optimize slow query"]
3. **[Action 3]**: [e.g., "Clear cache if stale"]
---
## Long-term Actions (Post-Incident)
Preventive measures:
1. **[Action 1]**: [e.g., "Add circuit breaker"]
2. **[Action 2]**: [e.g., "Implement auto-scaling"]
3. **[Action 3]**: [e.g., "Add query performance tests"]
4. **[Action 4]**: [e.g., "Update alert thresholds"]
---
## Escalation
If the issue persists after 30 minutes:
**Escalation Path**:
1. **Primary oncall**: @[username] ([slack/email])
2. **Team lead**: @[username] ([slack/email])
3. **Engineering manager**: @[username] ([slack/email])
4. **Incident commander**: @[username] ([slack/email])
**Communication**:
- **Slack channel**: #[incidents-channel]
- **Status page**: [Link]
- **Incident tracking**: [Link to incident management tool]
---
## Related Runbooks
- [Related Runbook 1]
- [Related Runbook 2]
- [Related Runbook 3]
## Related Dashboards
- [Main Service Dashboard]
- [Resource Usage Dashboard]
- [Dependency Dashboard]
## Related Documentation
- [Architecture Diagram]
- [Service Documentation]
- [API Documentation]
---
## Recent Incidents
| Date | Duration | Root Cause | Resolution | Ticket |
|------|----------|------------|------------|--------|
| 2024-10-15 | 23 min | Database pool exhausted | Increased pool size | INC-123 |
| 2024-09-30 | 45 min | Memory leak | Fixed code, restarted | INC-120 |
---
## Runbook Metadata
**Last Updated**: [Date]
**Owner**: [Team name]
**Reviewers**: [Names]
**Next Review**: [Date]
---
## Notes
- This runbook should be reviewed quarterly
- Update after each incident to capture new learnings
- Keep investigation steps concise and actionable
- Include actual commands that can be copy-pasted