Runbook: [Alert Name]

Overview

Alert Name: [e.g., HighLatency, ServiceDown, ErrorBudgetBurn]

Severity: [Critical | Warning | Info]

Team: [e.g., Backend, Platform, Database]

Component: [e.g., API Gateway, User Service, PostgreSQL]

What it means: [One-line description of what this alert indicates]

User impact: [How does this affect users? High/Medium/Low]

Urgency: [How quickly must this be addressed? Immediate/Hours/Days]


Alert Details

When This Alert Fires

This alert fires when:

  • [Specific condition, e.g., "P95 latency exceeds 500ms for 10 minutes"]
  • [Any additional conditions]
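
For reference, a minimal Prometheus alerting-rule sketch matching the latency example above; the metric name, threshold, and labels are placeholders to adapt to your own setup:

# Example alerting rule (sketch -- adjust expr, for, and labels to your alert)
groups:
  - name: [service-name]-alerts
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 500ms for 10 minutes"
          runbook_url: "[Link to this runbook]"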

Symptoms

Users will experience:

  • Slow response times
  • Errors or failures
  • Service unavailable
  • [Other symptoms]

Probable Causes

Common causes include:

  1. [Cause 1]: [Description]
    • Example: Database overload due to slow queries
  2. [Cause 2]: [Description]
    • Example: Memory leak causing OOM errors
  3. [Cause 3]: [Description]
    • Example: Upstream service degradation

Investigation Steps

1. Check Service Health

Dashboard: [Link to primary dashboard]

Key metrics to check:

# Request rate
sum(rate(http_requests_total[5m]))

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Latency (p95, p99)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
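
If it is faster than opening a dashboard, the same queries can be run from a terminal against the Prometheus HTTP API (host and port below are placeholders):

# Run a PromQL query via the Prometheus HTTP API
curl -s http://[prometheus-host]:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m]))'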

What to look for:

  • Has traffic spiked recently?
  • Is error rate elevated?
  • Are any endpoints particularly slow?

2. Check Recent Changes

Deployments:

# Kubernetes
kubectl rollout history deployment/[service-name] -n [namespace]

# Check pod age (a rough proxy for when the service was last deployed)
kubectl get pods -n [namespace] -o wide | grep [service-name]
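
To see exactly what a given revision contained, pass a revision number from the history output above:

# Show the pod template recorded for a specific revision
kubectl rollout history deployment/[service-name] -n [namespace] --revision=[revision-number]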

What to look for:

  • Was there a recent deployment?
  • Did alert start after deployment?
  • Any configuration changes?

3. Check Logs

Log query (adjust for your log system):

# Kubernetes
kubectl logs deployment/[service-name] -n [namespace] --tail=100 | grep ERROR

# Elasticsearch/Kibana
GET /logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "service": "[service-name]" } },
        { "match": { "level": "error" } },
        { "range": { "@timestamp": { "gte": "now-30m" } } }
      ]
    }
  }
}

# Loki/LogQL
{job="[service-name]"} |= "error" | json | level="error"

What to look for:

  • Repeated error messages
  • Stack traces
  • Connection errors
  • Timeout errors

4. Check Dependencies

Database:

-- Check active connections
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';

-- Check slow queries (running longer than 5 seconds)
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 seconds';

External APIs:

  • Check status pages: [Link to status pages]
  • Check API error rates in dashboard
  • Test API endpoints manually
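
A manual endpoint check might look like the following; the URL, path, and auth header are placeholders:

# Check status code and total request time for a dependency endpoint
curl -s -o /dev/null -w "status=%{http_code} time=%{time_total}s\n" \
  -H "Authorization: Bearer [token]" \
  https://[external-api-host]/[health-check-path]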

Cache (Redis/Memcached):

# Redis info
redis-cli -h [host] INFO stats

# Check memory usage
redis-cli -h [host] INFO memory

5. Check Resource Usage

CPU and Memory:

# Kubernetes
kubectl top pods -n [namespace] | grep [service-name]

# Node metrics
kubectl top nodes

Prometheus queries:

# CPU usage by pod
sum(rate(container_cpu_usage_seconds_total{pod=~"[service-name].*"}[5m])) by (pod)

# Memory usage by pod
sum(container_memory_usage_bytes{pod=~"[service-name].*"}) by (pod)

What to look for:

  • CPU throttling (see the query below)
  • Memory approaching limits
  • Disk space issues
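
To check for CPU throttling specifically, a PromQL sketch using the cAdvisor throttling counters (assumes cAdvisor/kubelet metrics are scraped):

# Fraction of CPU periods in which the container was throttled
sum(rate(container_cpu_cfs_throttled_periods_total{pod=~"[service-name].*"}[5m])) by (pod)
  /
sum(rate(container_cpu_cfs_periods_total{pod=~"[service-name].*"}[5m])) by (pod)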

6. Check Traces (if available)

Trace query:

# Jaeger
# Search for slow traces (> 1s) in the last 30 minutes

# Tempo/TraceQL
{ duration > 1s && resource.service.name = "[service-name]" }
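
If you use Jaeger, its query service also exposes the HTTP endpoint the UI calls; it is not a stable public API, so treat this as a sketch and adjust the parameters for your version:

# Fetch recent slow traces from the Jaeger query service (internal UI API)
curl -s "http://[jaeger-query-host]:16686/api/traces?service=[service-name]&minDuration=1s&lookback=30m&limit=20"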

What to look for:

  • Which operation is slow?
  • Where is time spent? (DB, external API, service logic)
  • Any N+1 query patterns?

Common Scenarios and Solutions

Scenario 1: Recent Deployment Caused Issue

Symptoms:

  • Alert started immediately after deployment
  • Error logs correlate with new code

Solution:

# Rollback deployment
kubectl rollout undo deployment/[service-name] -n [namespace]

# Verify rollback succeeded
kubectl rollout status deployment/[service-name] -n [namespace]

# Monitor for alert resolution

Follow-up:

  • Create incident report
  • Review deployment process
  • Add pre-deployment checks

Scenario 2: Database Performance Issue

Symptoms:

  • Slow query logs show problematic queries
  • Database CPU or connection pool exhausted

Solution:

-- Identify the slow query
-- Cancel the long-running query (use with caution)
SELECT pg_cancel_backend([pid]);

-- Or terminate the backend if cancel doesn't work
SELECT pg_terminate_backend([pid]);

-- Add a missing index (ideally in a maintenance window)
CREATE INDEX CONCURRENTLY idx_name ON table_name (column_name);
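
If the pg_stat_statements extension is enabled (an assumption; it is not on by default), it is often the fastest way to find the offending query. Column names vary slightly across PostgreSQL versions:

-- Top queries by total execution time (pg_stat_statements, PostgreSQL 13+ column names)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;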

Follow-up:

  • Add query performance test
  • Review and optimize query
  • Consider read replicas

Scenario 3: Memory Leak

Symptoms:

  • Memory usage gradually increasing
  • Eventually OOMKilled
  • Restarts temporarily fix issue
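
A quick way to confirm the pods are actually being OOM-killed before applying a fix:

# Look for restart counts and an OOMKilled last state on the affected pods
kubectl get pods -n [namespace] | grep [service-name]
kubectl describe pod [pod-name] -n [namespace] | grep -A 3 "Last State"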

Solution:

# Immediate: Restart pods
kubectl rollout restart deployment/[service-name] -n [namespace]

# Increase memory limits (temporary)
kubectl set resources deployment/[service-name] -n [namespace] \
  --limits=memory=2Gi

Follow-up:

  • Profile application for memory leaks
  • Add memory usage alerts
  • Fix root cause

Scenario 4: Traffic Spike / DDoS

Symptoms:

  • Sudden traffic increase
  • Traffic from unusual sources
  • High CPU/memory across all instances

Solution:

# Scale up immediately
kubectl scale deployment/[service-name] -n [namespace] --replicas=10

# Enable rate limiting at the load balancer level
# (specific steps depend on your LB; see the ingress example below)

# Block suspicious IPs if confirmed DDoS
# (Use WAF or network policies)
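
As one concrete example, if traffic enters through ingress-nginx (an assumption; other load balancers have their own mechanisms), a per-client request limit can be applied with an annotation:

# Limit each client IP to roughly 10 requests/second at the ingress (ingress-nginx annotation)
kubectl annotate ingress [ingress-name] -n [namespace] \
  nginx.ingress.kubernetes.io/limit-rps="10" --overwrite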

Follow-up:

  • Implement rate limiting
  • Add DDoS protection (Cloudflare, WAF)
  • Set up auto-scaling (see the sketch below)
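
A minimal auto-scaling sketch using a Horizontal Pod Autoscaler; the replica counts and CPU target are placeholders:

# Scale between 3 and 10 replicas based on average CPU utilization
kubectl autoscale deployment/[service-name] -n [namespace] \
  --min=3 --max=10 --cpu-percent=70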

Scenario 5: Upstream Service Degradation

Symptoms:

  • Errors calling external API
  • Timeouts to upstream service
  • Upstream status page shows issues

Solution:

# Enable circuit breaker (if available)
# Adjust timeout configuration
# Switch to backup service/cached data

# Monitor external service
# Check status page: [Link]
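
A minimal mitigation sketch, assuming the service reads its upstream timeout from an environment variable (the variable name here is hypothetical):

# Fail fast instead of letting requests pile up behind a degraded upstream
# UPSTREAM_TIMEOUT_MS is a hypothetical variable -- use whatever your service actually supports
kubectl set env deployment/[service-name] -n [namespace] UPSTREAM_TIMEOUT_MS=2000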

Follow-up:

  • Implement circuit breaker pattern
  • Add fallback mechanisms
  • Set up external service monitoring

Immediate Actions (< 5 minutes)

These should be done first to mitigate impact:

  1. [Action 1]: [e.g., "Scale up service"]

    kubectl scale deployment/[service] --replicas=10
    
  2. [Action 2]: [e.g., "Rollback deployment"]

    kubectl rollout undo deployment/[service]
    
  3. [Action 3]: [e.g., "Enable circuit breaker"]


Short-term Actions (< 30 minutes)

After immediate mitigation:

  1. [Action 1]: [e.g., "Investigate root cause"]
  2. [Action 2]: [e.g., "Optimize slow query"]
  3. [Action 3]: [e.g., "Clear cache if stale"]

Long-term Actions (Post-Incident)

Preventive measures:

  1. [Action 1]: [e.g., "Add circuit breaker"]
  2. [Action 2]: [e.g., "Implement auto-scaling"]
  3. [Action 3]: [e.g., "Add query performance tests"]
  4. [Action 4]: [e.g., "Update alert thresholds"]

Escalation

If the issue persists after 30 minutes:

Escalation Path:

  1. Primary oncall: @[username] ([slack/email])
  2. Team lead: @[username] ([slack/email])
  3. Engineering manager: @[username] ([slack/email])
  4. Incident commander: @[username] ([slack/email])

Communication:

  • Slack channel: #[incidents-channel]
  • Status page: [Link]
  • Incident tracking: [Link to incident management tool]

Related Resources

  • [Related Runbook 1]
  • [Related Runbook 2]
  • [Related Runbook 3]
  • [Main Service Dashboard]
  • [Resource Usage Dashboard]
  • [Dependency Dashboard]
  • [Architecture Diagram]
  • [Service Documentation]
  • [API Documentation]

Recent Incidents

Date       | Duration | Root Cause              | Resolution            | Ticket
-----------|----------|-------------------------|-----------------------|--------
2024-10-15 | 23 min   | Database pool exhausted | Increased pool size   | INC-123
2024-09-30 | 45 min   | Memory leak             | Fixed code, restarted | INC-120

Runbook Metadata

Last Updated: [Date]

Owner: [Team name]

Reviewers: [Names]

Next Review: [Date]


Notes

  • This runbook should be reviewed quarterly
  • Update after each incident to capture new learnings
  • Keep investigation steps concise and actionable
  • Include actual commands that can be copy-pasted