# Runbook: [Alert Name]

## Overview

- **Alert Name:** [e.g., HighLatency, ServiceDown, ErrorBudgetBurn]
- **Severity:** [Critical | Warning | Info]
- **Team:** [e.g., Backend, Platform, Database]
- **Component:** [e.g., API Gateway, User Service, PostgreSQL]
- **What it means:** [One-line description of what this alert indicates]
- **User impact:** [How does this affect users? High/Medium/Low]
- **Urgency:** [How quickly must this be addressed? Immediate/Hours/Days]
## Alert Details

### When This Alert Fires

This alert fires when:

- [Specific condition, e.g., "P95 latency exceeds 500ms for 10 minutes"]
- [Any additional conditions]
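To confirm what the alert expression currently evaluates to, you can query Prometheus directly before digging further. A minimal sketch, assuming the P95 latency example above and a placeholder `PROMETHEUS_URL`:

```bash
# Evaluate the example alert expression against the Prometheus HTTP API
curl -sG "http://PROMETHEUS_URL/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'

# List alerts Prometheus currently considers pending or firing
curl -s "http://PROMETHEUS_URL/api/v1/alerts"
```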
### Symptoms

Users may experience:

- Slow response times
- Errors or failures
- Service unavailability
- [Other symptoms]
### Probable Causes

Common causes include:

- **[Cause 1]:** [Description]
  - Example: Database overload due to slow queries
- **[Cause 2]:** [Description]
  - Example: Memory leak causing OOM errors
- **[Cause 3]:** [Description]
  - Example: Upstream service degradation
## Investigation Steps

### 1. Check Service Health

**Dashboard:** [Link to primary dashboard]

Key metrics to check:

```promql
# Request rate
sum(rate(http_requests_total[5m]))

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Latency (p95; use 0.99 for p99)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```

What to look for:

- Has traffic spiked recently?
- Is the error rate elevated?
- Are any endpoints particularly slow?
### 2. Check Recent Changes

**Deployments:**

```bash
# Kubernetes: review rollout history
kubectl rollout history deployment/[service-name] -n [namespace]

# Check pod age to see when the service was last deployed
kubectl get pods -n [namespace] -o wide | grep [service-name]
```

What to look for:

- Was there a recent deployment?
- Did the alert start right after the deployment?
- Any configuration changes? (See the commands below.)
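If the rollout history alone is inconclusive, recent cluster events can surface configuration changes, restarts, and probe failures. A sketch using standard kubectl commands (names and the revision number are placeholders):

```bash
# Recent events in the namespace, newest last
kubectl get events -n [namespace] --sort-by=.lastTimestamp | tail -n 30

# Inspect a specific rollout revision to see the pod template it used
kubectl rollout history deployment/[service-name] -n [namespace] --revision=[revision]
```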
### 3. Check Logs

Log queries (adjust for your log system):

```bash
# Kubernetes
kubectl logs deployment/[service-name] -n [namespace] --tail=100 | grep ERROR
```

```
# Elasticsearch/Kibana (Dev Tools console)
GET /logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "service": "[service-name]" } },
        { "match": { "level": "error" } },
        { "range": { "@timestamp": { "gte": "now-30m" } } }
      ]
    }
  }
}
```

```logql
# Loki/LogQL
{job="[service-name]"} |= "error" | json | level="error"
```

What to look for:

- Repeated error messages
- Stack traces
- Connection errors
- Timeout errors
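To gauge whether errors are still increasing rather than merely present, a rough count over two windows can help. A sketch; adjust the windows and pattern to your service's log format:

```bash
# Error volume over the last 30 minutes vs. the last 5 minutes
kubectl logs deployment/[service-name] -n [namespace] --since=30m | grep -c ERROR
kubectl logs deployment/[service-name] -n [namespace] --since=5m | grep -c ERROR
```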
### 4. Check Dependencies

**Database:**

```sql
-- Check active connections
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';

-- Check slow queries (running longer than 5 seconds)
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 seconds';
```

**External APIs:**

- Check status pages: [Link to status pages]
- Check API error rates in the dashboard
- Test API endpoints manually (see the example below)
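For a quick manual check, curl can report both the HTTP status and how long the request took. A sketch with a hypothetical health endpoint; substitute the real upstream URL:

```bash
# Report HTTP status code and total request time for an upstream endpoint
curl -s -o /dev/null \
  -w "status=%{http_code} time_total=%{time_total}s\n" \
  "https://upstream.example.com/health"
```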
**Cache (Redis/Memcached):**

```bash
# Redis info
redis-cli -h [host] INFO stats

# Check memory usage
redis-cli -h [host] INFO memory
```
### 5. Check Resource Usage

**CPU and memory:**

```bash
# Kubernetes pod usage
kubectl top pods -n [namespace] | grep [service-name]

# Node metrics
kubectl top nodes
```

**Prometheus queries:**

```promql
# CPU usage by pod
sum(rate(container_cpu_usage_seconds_total{pod=~"[service-name].*"}[5m])) by (pod)

# Memory usage by pod
sum(container_memory_usage_bytes{pod=~"[service-name].*"}) by (pod)
```

What to look for:

- CPU throttling
- Memory approaching limits
- Disk space issues
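To check whether containers are being restarted or OOMKilled, and to spot-check disk usage, a few standard kubectl commands help (the pod name is a placeholder; the disk check assumes the container image includes `df`):

```bash
# Restart counts and pod status at a glance
kubectl get pods -n [namespace] | grep [service-name]

# Last termination reason (e.g., OOMKilled) for a specific pod
kubectl describe pod [pod-name] -n [namespace] | grep -A 5 "Last State"

# Disk usage inside a pod
kubectl exec -n [namespace] [pod-name] -- df -h
```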
### 6. Check Traces (if available)

**Trace queries:**

```
# Jaeger: search for slow traces (> 1s) in the last 30 minutes via the UI

# Tempo/TraceQL
{ duration > 1s && resource.service.name = "[service-name]" }
```

What to look for:

- Which operation is slow?
- Where is time spent? (database, external API, service logic)
- Any N+1 query patterns?
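If you prefer the command line over the Jaeger UI, the query service exposes the HTTP API the UI itself uses; it is not a stable public API, so treat this as a sketch and expect parameters to vary by version:

```bash
# Fetch up to 20 traces slower than 1s for the service (placeholder URL and name)
curl -sG "http://JAEGER_QUERY_URL/api/traces" \
  --data-urlencode "service=[service-name]" \
  --data-urlencode "minDuration=1s" \
  --data-urlencode "limit=20"
```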
## Common Scenarios and Solutions

### Scenario 1: Recent Deployment Caused the Issue

**Symptoms:**

- Alert started immediately after a deployment
- Error logs correlate with the new code

**Solution:**

```bash
# Roll back the deployment
kubectl rollout undo deployment/[service-name] -n [namespace]

# Verify the rollback succeeded
kubectl rollout status deployment/[service-name] -n [namespace]

# Monitor for alert resolution
```
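To confirm the rollback actually switched the running version, check which image the deployment now references (a sketch; the jsonpath assumes a single-container pod spec):

```bash
# Show the image currently specified by the deployment
kubectl get deployment/[service-name] -n [namespace] \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```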
**Follow-up:**

- Create an incident report
- Review the deployment process
- Add pre-deployment checks
### Scenario 2: Database Performance Issue

**Symptoms:**

- Slow query logs show problematic queries
- Database CPU is high or the connection pool is exhausted

**Solution:**

```sql
-- Identify the slow query (see step 4), then cancel it (use with caution)
SELECT pg_cancel_backend([pid]);

-- Or terminate the backend if cancel doesn't work
SELECT pg_terminate_backend([pid]);

-- Add a missing index (in a maintenance window)
CREATE INDEX CONCURRENTLY idx_name ON table_name (column_name);
```
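Before creating an index, it is worth confirming that the planner would actually use one; EXPLAIN shows whether the query is falling back to a sequential scan. A sketch using psql, with placeholder connection details and query:

```bash
# Inspect the plan and actual timings for the suspect query
psql -h [db-host] -U [db-user] -d [database] \
  -c "EXPLAIN (ANALYZE, BUFFERS) [slow query here];"
```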
**Follow-up:**

- Add a query performance test
- Review and optimize the query
- Consider read replicas
### Scenario 3: Memory Leak

**Symptoms:**

- Memory usage gradually increasing
- Pods eventually OOMKilled
- Restarts temporarily fix the issue

**Solution:**

```bash
# Immediate mitigation: restart the pods
kubectl rollout restart deployment/[service-name] -n [namespace]

# Increase memory limits (temporary measure)
kubectl set resources deployment/[service-name] -n [namespace] \
  --limits=memory=2Gi
```

**Follow-up:**

- Profile the application for memory leaks (a quick confirmation check is sketched below)
- Add memory usage alerts
- Fix the root cause
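Before or alongside profiling, one quick way to confirm the leak pattern is to watch pod memory climb over time while traffic stays flat. A sketch; the interval and names are placeholders:

```bash
# Sample pod memory every 60 seconds; a steady climb under flat load suggests a leak
watch -n 60 'kubectl top pods -n [namespace] | grep [service-name]'
```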
### Scenario 4: Traffic Spike / DDoS

**Symptoms:**

- Sudden traffic increase
- Traffic from unusual sources
- High CPU/memory across all instances

**Solution:**

```bash
# Scale up immediately
kubectl scale deployment/[service-name] -n [namespace] --replicas=10

# Enable rate limiting at the load balancer level
# (specific steps depend on your load balancer)

# Block suspicious IPs if a DDoS is confirmed
# (use a WAF or network policies)
```

**Follow-up:**

- Implement rate limiting
- Add DDoS protection (Cloudflare, WAF)
- Set up auto-scaling (see the example below)
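For the auto-scaling follow-up, a basic CPU-based HorizontalPodAutoscaler can be created imperatively with kubectl; treat the thresholds as placeholders to tune per service:

```bash
# Scale between 3 and 20 replicas, targeting ~70% average CPU utilization
kubectl autoscale deployment/[service-name] -n [namespace] \
  --min=3 --max=20 --cpu-percent=70
```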
### Scenario 5: Upstream Service Degradation

**Symptoms:**

- Errors calling an external API
- Timeouts to the upstream service
- Upstream status page shows issues

**Solution:**

- Enable the circuit breaker, if available
- Adjust timeout configuration
- Switch to a backup service or cached data
- Keep monitoring the external service's status page: [Link]

**Follow-up:**

- Implement a circuit breaker pattern
- Add fallback mechanisms
- Set up external service monitoring
## Immediate Actions (< 5 minutes)

These should be done first to mitigate impact:

- **[Action 1]:** [e.g., "Scale up service"]

  ```bash
  kubectl scale deployment/[service] --replicas=10
  ```

- **[Action 2]:** [e.g., "Rollback deployment"]

  ```bash
  kubectl rollout undo deployment/[service]
  ```

- **[Action 3]:** [e.g., "Enable circuit breaker"]
## Short-term Actions (< 30 minutes)

After immediate mitigation:

- **[Action 1]:** [e.g., "Investigate root cause"]
- **[Action 2]:** [e.g., "Optimize slow query"]
- **[Action 3]:** [e.g., "Clear cache if stale"]
## Long-term Actions (Post-Incident)

Preventive measures:

- **[Action 1]:** [e.g., "Add circuit breaker"]
- **[Action 2]:** [e.g., "Implement auto-scaling"]
- **[Action 3]:** [e.g., "Add query performance tests"]
- **[Action 4]:** [e.g., "Update alert thresholds"]
## Escalation

If the issue persists after 30 minutes, escalate.

**Escalation path (in order):**

1. Primary on-call: @[username] ([slack/email])
2. Team lead: @[username] ([slack/email])
3. Engineering manager: @[username] ([slack/email])
4. Incident commander: @[username] ([slack/email])

**Communication:**

- Slack channel: #[incidents-channel]
- Status page: [Link]
- Incident tracking: [Link to incident management tool]
## Related Runbooks

- [Related Runbook 1]
- [Related Runbook 2]
- [Related Runbook 3]

## Related Dashboards

- [Main Service Dashboard]
- [Resource Usage Dashboard]
- [Dependency Dashboard]

## Related Documentation

- [Architecture Diagram]
- [Service Documentation]
- [API Documentation]
## Recent Incidents
| Date | Duration | Root Cause | Resolution | Ticket |
|---|---|---|---|---|
| 2024-10-15 | 23 min | Database pool exhausted | Increased pool size | INC-123 |
| 2024-09-30 | 45 min | Memory leak | Fixed code, restarted | INC-120 |
## Runbook Metadata

- **Last Updated:** [Date]
- **Owner:** [Team name]
- **Reviewers:** [Names]
- **Next Review:** [Date]
## Notes
- This runbook should be reviewed quarterly
- Update after each incident to capture new learnings
- Keep investigation steps concise and actionable
- Include actual commands that can be copy-pasted