# Alerting Best Practices

## Core Principles

### 1. Every Alert Should Be Actionable

If you can't do something about it, don't alert on it.

❌ Bad: `Alert: CPU > 50%` (What action should be taken?)
✅ Good: `Alert: API latency p95 > 2s for 10m` (Investigate/scale up)

### 2. Alert on Symptoms, Not Causes

Alert on what users experience, not on underlying components.

❌ Bad: `Database connection pool 80% full`
✅ Good: `Request latency p95 > 1s` (which might be caused by the DB pool)

### 3. Alert on SLO Violations

Tie alerts to Service Level Objectives.

✅ `Error rate exceeds 0.1% (SLO: 99.9% availability)`

### 4. Reduce Noise

Alert fatigue is real. Only page for critical issues.

**Alert Severity Levels**:
- **Critical**: Page on-call immediately (user-facing issue)
- **Warning**: Create ticket, review during business hours
- **Info**: Log for awareness, no action needed

---

## Alert Design Patterns

### Pattern 1: Multi-Window Multi-Burn-Rate

Google's recommended SLO alerting approach.

**Concept**: Alert when the error budget is being burned fast enough to exhaust it well before the SLO window ends.

```yaml
# Fast burn (~2% of a 30-day error budget in 1 hour)
- alert: FastBurnRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[1h]))
    /
    sum(rate(http_requests_total[1h]))
    > (14.4 * 0.001)  # 14.4x burn rate for 99.9% SLO
  for: 2m
  labels:
    severity: critical

# Slow burn (~5% of a 30-day error budget in 6 hours)
- alert: SlowBurnRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[6h]))
    /
    sum(rate(http_requests_total[6h]))
    > (6 * 0.001)  # 6x burn rate for 99.9% SLO
  for: 30m
  labels:
    severity: warning
```

**Burn Rate Multipliers for 99.9% SLO (0.1% error budget)**:
- 1 hour window, 2m grace: 14.4x burn rate
- 6 hour window, 30m grace: 6x burn rate
- 3 day window, 6h grace: 1x burn rate (sketched below)
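The two rules above cover only the first two windows in that list. A minimal sketch of the three-day window, following the same structure; the alert name and severity here are illustrative, not taken from the source rules:

```yaml
# Slow burn (~10% of a 30-day error budget in 3 days)
- alert: SlowBurnRate3d
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[3d]))
    /
    sum(rate(http_requests_total[3d]))
    > (1 * 0.001)  # 1x burn rate for 99.9% SLO
  for: 6h
  labels:
    severity: warning
```

A 1x burn rate merely keeps pace with the budget, so this rule is typically routed as a ticket rather than a page.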
### Pattern 2: Rate of Change

Alert when metrics change rapidly.

```yaml
- alert: TrafficSpike
  expr: |
    sum(rate(http_requests_total[5m]))
    >
    1.5 * sum(rate(http_requests_total[5m] offset 1h))
  for: 10m
  annotations:
    summary: "Traffic increased by 50% compared to 1 hour ago"
```

### Pattern 3: Threshold with Hysteresis

Prevent flapping with different thresholds for firing and resolving.

```yaml
# Fire at 90%, resolve at 70%
- alert: HighCPU
  expr: cpu_usage > 90
  for: 5m

- alert: HighCPU_Resolved
  expr: cpu_usage < 70
  for: 5m
```

### Pattern 4: Absent Metrics

Alert when expected metrics stop being reported (service down).

```yaml
- alert: ServiceDown
  expr: absent(up{job="my-service"})
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Service {{ $labels.job }} is not reporting metrics"
```

### Pattern 5: Aggregate Alerts

Alert on aggregate performance across multiple instances.

```yaml
- alert: HighOverallErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))
    > 0.05
  for: 10m
  annotations:
    summary: "Overall error rate is {{ $value | humanizePercentage }}"
```

---

## Alert Annotation Best Practices

### Required Fields

**summary**: One-line description of the issue

```yaml
summary: "High error rate on {{ $labels.service }}: {{ $value | humanizePercentage }}"
```

**description**: Detailed explanation with context

```yaml
description: |
  Error rate on {{ $labels.service }} is {{ $value | humanizePercentage }},
  which exceeds the threshold of 1% for more than 10 minutes.

  Current value: {{ $value }}
  Runbook: https://runbooks.example.com/high-error-rate
```

**runbook_url**: Link to investigation steps

```yaml
runbook_url: "https://runbooks.example.com/alerts/{{ $labels.alertname }}"
```

### Optional but Recommended

**dashboard**: Link to the relevant dashboard

```yaml
dashboard: "https://grafana.example.com/d/service-dashboard?var-service={{ $labels.service }}"
```

**logs**: Link to logs

```yaml
logs: "https://kibana.example.com/app/discover#/?_a=(query:(query_string:(query:'service:{{ $labels.service }}')))"
```

---

## Alert Label Best Practices

### Required Labels

**severity**: Critical, warning, or info

```yaml
labels:
  severity: critical
```

**team**: Who should handle this alert

```yaml
labels:
  team: platform
  severity: critical
```

**component**: What part of the system

```yaml
labels:
  component: api-gateway
  severity: warning
```

### Example Complete Alert

```yaml
- alert: HighLatency
  expr: |
    histogram_quantile(0.95,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
    ) > 1
  for: 10m
  labels:
    severity: warning
    team: backend
    component: api
    environment: "{{ $labels.environment }}"
  annotations:
    summary: "High latency on {{ $labels.service }}"
    description: |
      P95 latency on {{ $labels.service }} is {{ $value }}s, exceeding the 1s threshold.
      This may impact user experience. Check recent deployments and database performance.

      Current p95: {{ $value }}s
      Threshold: 1s
      Duration: 10m+
    runbook_url: "https://runbooks.example.com/high-latency"
    dashboard: "https://grafana.example.com/d/api-dashboard"
    logs: "https://kibana.example.com/app/discover#/?_a=(query:(query_string:(query:'service:{{ $labels.service }} AND level:error')))"
```

---

## Alert Thresholds

### General Guidelines

**Response Time / Latency**:
- Warning: p95 > 500ms or p99 > 1s
- Critical: p95 > 2s or p99 > 5s

**Error Rate**:
- Warning: > 1%
- Critical: > 5%

**Availability**:
- Warning: < 99.9%
- Critical: < 99.5%

**CPU Utilization**:
- Warning: > 70% for 15m
- Critical: > 90% for 5m

**Memory Utilization**:
- Warning: > 80% for 15m
- Critical: > 95% for 5m

**Disk Space**:
- Warning: > 80% full
- Critical: > 90% full

**Queue Depth**:
- Warning: > 70% of max capacity
- Critical: > 90% of max capacity

### Application-Specific Thresholds

Set thresholds based on:

1. **Historical performance**: Use the p95 of the last 30 days + 20% (see the sketch below)
2. **SLO requirements**: If the SLO is 99.9%, alert at 99.5% measured over a shorter window
3. **Business impact**: What error rate causes user complaints?
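For the first approach, one way to encode "historical p95 + 20%" is to compare the live p95 against its own 30-day history. This is only a sketch: it assumes a recording rule named `job:http_request_duration_seconds:p95` already computes the live p95 per job, and the factor, duration, and severity are illustrative.

```yaml
# Sketch: alert when the live p95 exceeds its 30-day baseline by more than 20%.
# Assumes job:http_request_duration_seconds:p95 is an existing recording rule.
- alert: LatencyAboveHistoricalBaseline
  expr: |
    job:http_request_duration_seconds:p95
    >
    1.2 * quantile_over_time(0.95, job:http_request_duration_seconds:p95[30d])
  for: 15m
  labels:
    severity: warning
```

Long range selectors such as `[30d]` are expensive to evaluate on every rule cycle, so in practice the baseline itself is usually precomputed by another recording rule.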
---

## The "for" Clause

Prevent alert flapping by requiring the condition to be true for a duration.

### Guidelines

**Critical alerts**: Short duration (2-5m)

```yaml
- alert: ServiceDown
  expr: up == 0
  for: 2m  # Quick detection for critical issues
```

**Warning alerts**: Longer duration (10-30m)

```yaml
- alert: HighMemoryUsage
  expr: memory_usage > 80
  for: 15m  # Avoid noise from temporary spikes
```

**Resource saturation**: Medium duration (5-10m)

```yaml
- alert: HighCPU
  expr: cpu_usage > 90
  for: 5m
```

---

## Alert Routing

### Severity-Based Routing

```yaml
# alertmanager.yml
route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    # Critical alerts → PagerDuty
    - match:
        severity: critical
      receiver: pagerduty
      group_wait: 10s
      repeat_interval: 5m

    # Warning alerts → Slack
    - match:
        severity: warning
      receiver: slack
      group_wait: 30s
      repeat_interval: 12h

    # Info alerts → Email
    - match:
        severity: info
      receiver: email
      repeat_interval: 24h
```

### Team-Based Routing

```yaml
routes:
  # Platform team
  - match:
      team: platform
    receiver: platform-pagerduty

  # Backend team
  - match:
      team: backend
    receiver: backend-slack

  # Database team
  - match:
      component: database
    receiver: dba-pagerduty
```

### Time-Based Routing

```yaml
# Only page during business hours for non-critical alerts
routes:
  - match:
      severity: warning
    receiver: slack
    active_time_intervals:
      - business_hours

time_intervals:
  - name: business_hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'
        location: 'America/New_York'
```
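The routes above refer to receivers by name (`default`, `pagerduty`, `slack`, `email`) without defining them. A minimal sketch of matching receiver definitions; the keys, webhook URLs, and addresses are placeholders, and email delivery assumes SMTP settings in the `global` section:

```yaml
receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'          # placeholder address

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<pagerduty-integration-key>'

  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'  # placeholder webhook
        channel: '#alerts'

  - name: 'email'
    email_configs:
      - to: 'team@example.com'
```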
---

## Alert Grouping

### Intelligent Grouping

**Group by service and environment**:

```yaml
route:
  group_by: ['alertname', 'service', 'environment']
  group_wait: 30s
  group_interval: 5m
```

This prevents:
- 50 alerts for "HighCPU" on different pods → 1 grouped alert
- Mixing production and staging alerts

### Inhibition Rules

Suppress related alerts when a parent alert fires.

```yaml
inhibit_rules:
  # If the service is down, suppress latency alerts
  - source_match:
      alertname: ServiceDown
    target_match:
      alertname: HighLatency
    equal: ['service']

  # If a node is down, suppress all pod alerts on that node
  - source_match:
      alertname: NodeDown
    target_match_re:
      alertname: '(PodCrashLoop|HighCPU|HighMemory)'
    equal: ['node']
```

---

## Runbook Structure

Every alert should link to a runbook with:

### 1. Context

- What does this alert mean?
- What is the user impact?
- What is the urgency?

### 2. Investigation Steps

```markdown
## Investigation

1. Check the service health dashboard
   https://grafana.example.com/d/service-dashboard

2. Check recent deployments
   kubectl rollout history deployment/myapp -n production

3. Check error logs
   kubectl logs deployment/myapp -n production --tail=100 | grep ERROR

4. Check dependencies
   - Database: Check the slow query log
   - Redis: Check memory usage
   - External APIs: Check status pages
```

### 3. Common Causes

```markdown
## Common Causes

- **Recent deployment**: Check if the alert started after a deployment
- **Traffic spike**: Check request rate; might need to scale
- **Database issues**: Check query performance and the connection pool
- **External API degradation**: Check third-party status pages
```

### 4. Resolution Steps

```markdown
## Resolution

### Immediate Actions (< 5 minutes)
1. Scale up if traffic spike: `kubectl scale deployment myapp --replicas=10`
2. Rollback if recent deployment: `kubectl rollout undo deployment/myapp`

### Short-term Actions (< 30 minutes)
1. Restart pods if memory leak: `kubectl rollout restart deployment/myapp`
2. Clear cache if stale data: `redis-cli -h cache.example.com FLUSHDB`

### Long-term Actions (post-incident)
1. Review and optimize slow queries
2. Implement circuit breakers
3. Add more capacity
4. Update alert thresholds if false positive
```

### 5. Escalation

```markdown
## Escalation

If the issue persists after 30 minutes:
- Slack: #backend-oncall
- PagerDuty: Escalate to a senior engineer
- Incident Commander: Jane Doe (jane@example.com)
```

---

## Anti-Patterns to Avoid

### 1. Alert on Everything

❌ Don't: Alert on every warning log
✅ Do: Alert on error rate exceeding a threshold

### 2. Alert Without Context

❌ Don't: "Error rate high"
✅ Do: "Error rate 5.2% exceeds 1% threshold for 10m, impacting checkout flow"

### 3. Static Thresholds for Dynamic Systems

❌ Don't: `cpu_usage > 70` (fails during scale-up)
✅ Do: Alert on SLO violations or rate of change

### 4. No "for" Clause

❌ Don't: Alert immediately on threshold breach
✅ Do: Use `for: 5m` to avoid flapping

### 5. Too Many Recipients

❌ Don't: Page 10 people for every alert
✅ Do: Route to a specific on-call rotation

### 6. Duplicate Alerts

❌ Don't: Alert on both cause and symptom
✅ Do: Alert on the symptom, use inhibition for causes

### 7. No Runbook

❌ Don't: Alert without guidance
✅ Do: Include runbook_url in every alert

---

## Alert Testing

### Test Alert Firing

```bash
# Send a test alert to Alertmanager with amtool
amtool alert add alertname="TestAlert" \
  severity="warning" \
  summary="Test alert"

# Or use the Alertmanager API (v2)
curl -X POST http://alertmanager:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {"alertname": "TestAlert", "severity": "critical"},
    "annotations": {"summary": "Test critical alert"}
  }]'
```

### Verify Alert Rules

```bash
# Check syntax
promtool check rules alerts.yml

# Test an expression
promtool query instant http://prometheus:9090 \
  'sum(rate(http_requests_total{status=~"5.."}[5m]))'

# Unit test alerts
promtool test rules test.yml
```
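The `promtool test rules test.yml` command above needs a unit test file, which is not shown here. A minimal sketch for the FastBurnRate rule from earlier, assuming the burn-rate rules live in `alerts.yml`; the input series simulate a steady ~10% error rate:

```yaml
# test.yml — unit test sketch for the FastBurnRate alert (assumed to be in alerts.yml)
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # ~10% of requests fail: 6 errors and 54 successes per minute
      - series: 'http_requests_total{status="500"}'
        values: '0+6x120'
      - series: 'http_requests_total{status="200"}'
        values: '0+54x120'
    alert_rule_test:
      - eval_time: 1h
        alertname: FastBurnRate
        exp_alerts:
          - exp_labels:
              severity: critical
```

Run it with `promtool test rules test.yml`; an empty `exp_alerts` list would instead assert that the alert does not fire at that time.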
### Test Alertmanager Routing

```bash
# Test which receiver an alert would go to
amtool config routes test \
  --config.file=alertmanager.yml \
  alertname="HighLatency" \
  severity="critical" \
  team="backend"
```

---

## On-Call Best Practices

### Rotation Schedule

- **Primary on-call**: First responder
- **Secondary on-call**: Escalation backup
- **Rotation length**: 1 week (balances load against context)
- **Handoff**: Monday morning (not Friday evening)

### On-Call Checklist

```markdown
## Pre-shift
- [ ] Test pager/phone
- [ ] Review recent incidents
- [ ] Check upcoming deployments
- [ ] Update contact info

## During shift
- [ ] Respond to pages within 5 minutes
- [ ] Document all incidents
- [ ] Update runbooks if gaps are found
- [ ] Communicate in the #incidents channel

## Post-shift
- [ ] Hand off open incidents
- [ ] Complete incident reports
- [ ] Suggest improvements
- [ ] Update team documentation
```

### Escalation Policy

1. **Primary**: Responds within 5 minutes
2. **Secondary**: Auto-escalate after 15 minutes
3. **Manager**: Auto-escalate after 30 minutes
4. **Incident Commander**: Critical incidents only

---

## Metrics About Alerts

Monitor your monitoring system!

### Key Metrics

```promql
# Alert firing frequency
sum(ALERTS{alertstate="firing"}) by (alertname)

# How long each alert has been active (seconds since it entered pending/firing)
time() - ALERTS_FOR_STATE

# Alerts per severity
sum(ALERTS{alertstate="firing"}) by (severity)

# Time to acknowledge (from PagerDuty or similar, exported into Prometheus)
pagerduty_incident_ack_duration_seconds
```

### Alert Quality Metrics

- **Mean Time to Acknowledge (MTTA)**: < 5 minutes
- **Mean Time to Resolve (MTTR)**: < 30 minutes
- **False Positive Rate**: < 10% (see the query sketch below)
- **Alert Coverage**: > 80% of incidents preceded by an alert
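Prometheus cannot measure false positives directly, but the `ALERTS` series can surface the noisiest alerts to review for them. A rough sketch; `count_over_time` counts scrape samples, so this approximates time spent firing rather than distinct incidents:

```promql
# Top 10 alerts by time spent firing over the last 7 days (candidates for threshold tuning)
topk(10, sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[7d])))
```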