# Alerting Best Practices

## Core Principles

### 1. Every Alert Should Be Actionable
If you can't do something about it, don't alert on it.

❌ Bad: `Alert: CPU > 50%` (What action should be taken?)
✅ Good: `Alert: API latency p95 > 2s for 10m` (Investigate/scale up)

### 2. Alert on Symptoms, Not Causes
Alert on what users experience, not underlying components.

❌ Bad: `Database connection pool 80% full`
✅ Good: `Request latency p95 > 1s` (which might be caused by DB pool)

### 3. Alert on SLO Violations
Tie alerts to Service Level Objectives.

✅ `Error rate exceeds 0.1% (SLO: 99.9% availability)`
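
For illustration, a minimal rule tied to a 99.9% availability SLO might look like the following sketch (the `http_requests_total` metric matches the examples later in this document; the alert name and `for` duration are assumptions):

```yaml
# Sketch: page when the 5m error ratio exceeds the 0.1% error budget of a 99.9% SLO
- alert: SLOErrorRateExceeded
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      /
    sum(rate(http_requests_total[5m]))
    > 0.001
  for: 5m
  labels:
    severity: critical
```

Pattern 1 below shows the more robust burn-rate formulation of the same idea.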

### 4. Reduce Noise
Alert fatigue is real. Only page for critical issues.

**Alert Severity Levels**:
- **Critical**: Page on-call immediately (user-facing issue)
- **Warning**: Create ticket, review during business hours
- **Info**: Log for awareness, no action needed

---

## Alert Design Patterns

### Pattern 1: Multi-Window Multi-Burn-Rate

Google's recommended SLO alerting approach.

**Concept**: Alert when the error-budget burn rate is high enough to exhaust the budget too quickly.

```yaml
# Fast burn: at 14.4x, roughly 2% of a 30-day error budget is consumed per hour
- alert: FastBurnRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
    sum(rate(http_requests_total[1h]))
    > (14.4 * 0.001)  # 14.4x burn rate for a 99.9% SLO
  for: 2m
  labels:
    severity: critical

# Slow burn: at 6x, roughly 5% of a 30-day error budget is consumed in 6 hours
- alert: SlowBurnRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[6h]))
      /
    sum(rate(http_requests_total[6h]))
    > (6 * 0.001)  # 6x burn rate for a 99.9% SLO
  for: 30m
  labels:
    severity: warning
```

**Burn Rate Multipliers for a 99.9% SLO (0.1% error budget over 30 days)**:
- 1 hour window, 2m `for` duration: 14.4x burn rate (~2% of budget per hour)
- 6 hour window, 30m `for` duration: 6x burn rate (~5% of budget per 6 hours)
- 3 day window, 6h `for` duration: 1x burn rate (budget exhausted in exactly 30 days)

The multiplier is `(budget fraction consumed × 30 days) / window`; for example, 2% in 1 hour gives `0.02 × 720h / 1h = 14.4`.
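
The "multi-window" part of the name refers to pairing each long window with a short one, so the alert clears quickly once errors stop. A sketch of the combined form of the fast-burn rule above (the short 5m window and the alert name are illustrative; the `and` requires both windows to exceed the threshold):

```yaml
# Sketch: long AND short window must both exceed the 14.4x burn-rate threshold
- alert: FastBurnRateMultiWindow
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
        / sum(rate(http_requests_total[1h]))
      > (14.4 * 0.001)
    )
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
        / sum(rate(http_requests_total[5m]))
      > (14.4 * 0.001)
    )
  labels:
    severity: critical
```

Because the 5m window recovers quickly, the alert stops firing soon after the error rate drops instead of waiting for the 1h window to drain.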

### Pattern 2: Rate of Change
Alert when metrics change rapidly.

```yaml
- alert: TrafficSpike
  expr: |
    sum(rate(http_requests_total[5m]))
    >
    1.5 * sum(rate(http_requests_total[5m] offset 1h))
  for: 10m
  annotations:
    summary: "Traffic increased by 50% compared to 1 hour ago"
```

### Pattern 3: Threshold with Hysteresis
Prevent flapping by using different thresholds for firing and resolving. Prometheus has no built-in resolve threshold, so the simple approach uses two rules:

```yaml
# Fire at 90%, treat < 70% as recovered (two separate rules; see the single-rule sketch below)
- alert: HighCPU
  expr: cpu_usage > 90
  for: 5m

- alert: HighCPU_Resolved
  expr: cpu_usage < 70
  for: 5m
```
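
True hysteresis can also be expressed in a single rule by referencing Prometheus's built-in `ALERTS` series, so the alert keeps firing until usage drops below the lower threshold. A sketch, assuming the placeholder `cpu_usage` metric carries an `instance` label:

```yaml
# Single-rule hysteresis: fire above 90%, stay firing until usage drops below 70%
- alert: HighCPU
  expr: |
    cpu_usage > 90
    or
    # keep the alert active while it is already firing and usage is still above 70%
    (cpu_usage > 70 and on(instance) ALERTS{alertname="HighCPU", alertstate="firing"} == 1)
  for: 5m
  labels:
    severity: warning
```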

### Pattern 4: Absent Metrics
Alert when expected metrics stop being reported (service down).

```yaml
- alert: ServiceDown
  expr: absent(up{job="my-service"})
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Service {{ $labels.job }} is not reporting metrics"
```

### Pattern 5: Aggregate Alerts
Alert on aggregate performance across multiple instances.

```yaml
- alert: HighOverallErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      /
    sum(rate(http_requests_total[5m]))
    > 0.05
  for: 10m
  annotations:
    summary: "Overall error rate is {{ $value | humanizePercentage }}"
```
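
The same aggregate can be kept per service (still summed across each service's instances) by adding a `by` clause. A variant sketch, assuming the counter carries a `service` label:

```yaml
- alert: HighServiceErrorRate
  expr: |
    sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
      /
    sum by (service) (rate(http_requests_total[5m]))
    > 0.05
  for: 10m
  annotations:
    summary: "Error rate on {{ $labels.service }} is {{ $value | humanizePercentage }}"
```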

---

## Alert Annotation Best Practices

### Required Fields

**summary**: One-line description of the issue
```yaml
summary: "High error rate on {{ $labels.service }}: {{ $value | humanizePercentage }}"
```

**description**: Detailed explanation with context
```yaml
description: |
  Error rate on {{ $labels.service }} is {{ $value | humanizePercentage }},
  which exceeds the threshold of 1% for more than 10 minutes.

  Current value: {{ $value }}
  Runbook: https://runbooks.example.com/high-error-rate
```

**runbook_url**: Link to investigation steps
```yaml
runbook_url: "https://runbooks.example.com/alerts/{{ $labels.alertname }}"
```

### Optional but Recommended

**dashboard**: Link to relevant dashboard
```yaml
dashboard: "https://grafana.example.com/d/service-dashboard?var-service={{ $labels.service }}"
```

**logs**: Link to logs
```yaml
logs: "https://kibana.example.com/app/discover#/?_a=(query:(query_string:(query:'service:{{ $labels.service }}')))"
```

---

## Alert Label Best Practices

### Required Labels

**severity**: Critical, warning, or info
```yaml
labels:
  severity: critical
```

**team**: Who should handle this alert
```yaml
labels:
  team: platform
  severity: critical
```

**component**: What part of the system
```yaml
labels:
  component: api-gateway
  severity: warning
```

### Example Complete Alert
```yaml
- alert: HighLatency
  expr: |
    histogram_quantile(0.95,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
    ) > 1
  for: 10m
  labels:
    severity: warning
    team: backend
    component: api
    environment: "{{ $labels.environment }}"
  annotations:
    summary: "High latency on {{ $labels.service }}"
    description: |
      P95 latency on {{ $labels.service }} is {{ $value }}s, exceeding the 1s threshold.

      This may impact user experience. Check recent deployments and database performance.

      Current p95: {{ $value }}s
      Threshold: 1s
      Duration: 10m+
    runbook_url: "https://runbooks.example.com/high-latency"
    dashboard: "https://grafana.example.com/d/api-dashboard"
    logs: "https://kibana.example.com/app/discover#/?_a=(query:(query_string:(query:'service:{{ $labels.service }} AND level:error')))"
```

---

## Alert Thresholds

### General Guidelines

**Response Time / Latency**:
- Warning: p95 > 500ms or p99 > 1s
- Critical: p95 > 2s or p99 > 5s

**Error Rate**:
- Warning: > 1%
- Critical: > 5%

**Availability**:
- Warning: < 99.9%
- Critical: < 99.5%

**CPU Utilization**:
- Warning: > 70% for 15m
- Critical: > 90% for 5m

**Memory Utilization**:
- Warning: > 80% for 15m
- Critical: > 95% for 5m

**Disk Space** (see the rule sketch after this list):
- Warning: > 80% full
- Critical: > 90% full

**Queue Depth**:
- Warning: > 70% of max capacity
- Critical: > 90% of max capacity
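
As one concrete translation of these guidelines into rules, here is a sketch of the disk-space pair, assuming standard node_exporter filesystem metrics; the `for` durations and the fstype filter are assumptions:

```yaml
# Disk-space thresholds from the guidelines above (node_exporter metric names assumed)
- alert: DiskSpaceWarning
  expr: |
    (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
           / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 0.80
  for: 15m
  labels:
    severity: warning

- alert: DiskSpaceCritical
  expr: |
    (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
           / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 0.90
  for: 5m
  labels:
    severity: critical
```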

### Application-Specific Thresholds

Set thresholds based on:
1. **Historical performance**: Use the p95 of the last 30 days + 20% (see the sketch below)
2. **SLO requirements**: If the SLO is 99.9%, alert at 99.5%
3. **Business impact**: What error rate causes user complaints?
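
A sketch of the historical-baseline idea in rule form; it assumes a pre-existing recording rule `job:request_latency_seconds:p95` (the name is illustrative) so the 30-day lookback stays cheap:

```yaml
- alert: LatencyAboveHistoricalBaseline
  expr: |
    job:request_latency_seconds:p95
      > 1.2 * quantile_over_time(0.95, job:request_latency_seconds:p95[30d])
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "p95 latency is more than 20% above its 30-day baseline"
```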

---

## The "for" Clause

Prevent alert flapping by requiring the condition to be true for a duration.

### Guidelines

**Critical alerts**: Short duration (2-5m)
```yaml
- alert: ServiceDown
  expr: up == 0
  for: 2m  # Quick detection for critical issues
```

**Warning alerts**: Longer duration (10-30m)
```yaml
- alert: HighMemoryUsage
  expr: memory_usage > 80
  for: 15m  # Avoid noise from temporary spikes
```

**Resource saturation**: Medium duration (5-10m)
```yaml
- alert: HighCPU
  expr: cpu_usage > 90
  for: 5m
```

---

## Alert Routing

### Severity-Based Routing

```yaml
# alertmanager.yml
route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'

  routes:
    # Critical alerts → PagerDuty
    - match:
        severity: critical
      receiver: pagerduty
      group_wait: 10s
      repeat_interval: 5m

    # Warning alerts → Slack
    - match:
        severity: warning
      receiver: slack
      group_wait: 30s
      repeat_interval: 12h

    # Info alerts → Email
    - match:
        severity: info
      receiver: email
      repeat_interval: 24h
```
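
The routes above reference receivers by name but the `receivers:` section is not shown. A minimal sketch of what it could look like (channels, keys, and addresses are placeholders; global Slack/SMTP settings are assumed to be configured elsewhere):

```yaml
receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: '<pagerduty-events-v2-integration-key>'
  - name: 'slack'
    slack_configs:
      - channel: '#alerts-warnings'
  - name: 'email'
    email_configs:
      - to: 'team@example.com'
```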

### Team-Based Routing

```yaml
routes:
  # Platform team
  - match:
      team: platform
    receiver: platform-pagerduty

  # Backend team
  - match:
      team: backend
    receiver: backend-slack

  # Database team
  - match:
      component: database
    receiver: dba-pagerduty
```

### Time-Based Routing

```yaml
# Only page during business hours for non-critical
routes:
  - match:
      severity: warning
    receiver: slack
    active_time_intervals:
      - business_hours

time_intervals:
  - name: business_hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'
        location: 'America/New_York'
```

---

## Alert Grouping

### Intelligent Grouping

**Group by service and environment**:
```yaml
route:
  group_by: ['alertname', 'service', 'environment']
  group_wait: 30s
  group_interval: 5m
```

This prevents:
- 50 alerts for "HighCPU" on different pods → 1 grouped alert
- Mixing production and staging alerts

### Inhibition Rules

Suppress related alerts when a parent alert fires.

```yaml
inhibit_rules:
  # If service is down, suppress latency alerts
  - source_match:
      alertname: ServiceDown
    target_match:
      alertname: HighLatency
    equal: ['service']

  # If node is down, suppress all pod alerts on that node
  - source_match:
      alertname: NodeDown
    target_match_re:
      alertname: '(PodCrashLoop|HighCPU|HighMemory)'
    equal: ['node']
```

---

## Runbook Structure

Every alert should link to a runbook with:

### 1. Context
- What does this alert mean?
- What is the user impact?
- What is the urgency?

### 2. Investigation Steps
```markdown
## Investigation

1. Check service health dashboard
   https://grafana.example.com/d/service-dashboard

2. Check recent deployments
   kubectl rollout history deployment/myapp -n production

3. Check error logs
   kubectl logs deployment/myapp -n production --tail=100 | grep ERROR

4. Check dependencies
   - Database: Check slow query log
   - Redis: Check memory usage
   - External APIs: Check status pages
```

### 3. Common Causes
```markdown
## Common Causes

- **Recent deployment**: Check if alert started after deployment
- **Traffic spike**: Check request rate, might need to scale
- **Database issues**: Check query performance and connection pool
- **External API degradation**: Check third-party status pages
```

### 4. Resolution Steps
```markdown
## Resolution

### Immediate Actions (< 5 minutes)
1. Scale up if traffic spike: `kubectl scale deployment myapp --replicas=10`
2. Rollback if recent deployment: `kubectl rollout undo deployment/myapp`

### Short-term Actions (< 30 minutes)
1. Restart pods if memory leak: `kubectl rollout restart deployment/myapp`
2. Clear cache if stale data: `redis-cli -h cache.example.com FLUSHDB`

### Long-term Actions (post-incident)
1. Review and optimize slow queries
2. Implement circuit breakers
3. Add more capacity
4. Update alert thresholds if false positive
```

### 5. Escalation
```markdown
## Escalation

If issue persists after 30 minutes:
- Slack: #backend-oncall
- PagerDuty: Escalate to senior engineer
- Incident Commander: Jane Doe (jane@example.com)
```

---

## Anti-Patterns to Avoid

### 1. Alert on Everything
❌ Don't: Alert on every warning log
✅ Do: Alert on error rate exceeding threshold

### 2. Alert Without Context
❌ Don't: "Error rate high"
✅ Do: "Error rate 5.2% exceeds 1% threshold for 10m, impacting checkout flow"

### 3. Static Thresholds for Dynamic Systems
❌ Don't: `cpu_usage > 70` (fails during scale-up)
✅ Do: Alert on SLO violations or rate of change

### 4. No "for" Clause
❌ Don't: Alert immediately on threshold breach
✅ Do: Use `for: 5m` to avoid flapping

### 5. Too Many Recipients
❌ Don't: Page 10 people for every alert
✅ Do: Route to specific on-call rotation

### 6. Duplicate Alerts
❌ Don't: Alert on both cause and symptom
✅ Do: Alert on symptom, use inhibition for causes

### 7. No Runbook
❌ Don't: Alert without guidance
✅ Do: Include runbook_url in every alert

---

## Alert Testing

### Test Alert Firing
```bash
# Push a test alert to Alertmanager with amtool
# (assumes --alertmanager.url is set via flag or amtool config)
amtool alert add alertname="TestAlert" \
  severity="warning" \
  --annotation=summary="Test alert"

# Or POST directly to the Alertmanager API
curl -X POST http://alertmanager:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {"alertname": "TestAlert", "severity": "critical"},
    "annotations": {"summary": "Test critical alert"}
  }]'
```

### Verify Alert Rules
```bash
# Check syntax
promtool check rules alerts.yml

# Test an expression against a running Prometheus
promtool query instant http://prometheus:9090 \
  'sum(rate(http_requests_total{status=~"5.."}[5m]))'

# Unit test alerts (see the example test file below)
promtool test rules test.yml
```
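
For reference, `test.yml` for `promtool test rules` might look like the following sketch, exercising the `ServiceDown` rule from Pattern 4 (file names follow the commands above; the unrelated `up` series just keeps the test data non-empty):

```yaml
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # my-service reports nothing at all, so absent(up{job="my-service"}) is true
      - series: 'up{job="other-service"}'
        values: '1+0x20'
    alert_rule_test:
      - eval_time: 10m   # past the rule's 5m "for" duration, so the alert is firing
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              job: my-service
              severity: critical
            exp_annotations:
              summary: "Service my-service is not reporting metrics"
```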

### Test Alertmanager Routing
```bash
# Test which receiver an alert would go to
amtool config routes test \
  --config.file=alertmanager.yml \
  alertname="HighLatency" \
  severity="critical" \
  team="backend"
```

---

## On-Call Best Practices

### Rotation Schedule
- **Primary on-call**: First responder
- **Secondary on-call**: Escalation backup
- **Rotation length**: 1 week (balance load vs context)
- **Handoff**: Monday morning (not Friday evening)

### On-Call Checklist
```markdown
## Pre-shift
- [ ] Test pager/phone
- [ ] Review recent incidents
- [ ] Check upcoming deployments
- [ ] Update contact info

## During shift
- [ ] Respond to pages within 5 minutes
- [ ] Document all incidents
- [ ] Update runbooks if gaps found
- [ ] Communicate in #incidents channel

## Post-shift
- [ ] Hand off open incidents
- [ ] Complete incident reports
- [ ] Suggest improvements
- [ ] Update team documentation
```

### Escalation Policy
1. **Primary**: Responds within 5 minutes
2. **Secondary**: Auto-escalate after 15 minutes
3. **Manager**: Auto-escalate after 30 minutes
4. **Incident Commander**: Critical incidents only

---

## Metrics About Alerts

Monitor your monitoring system!

### Key Metrics
```promql
# Currently firing alerts, by alert name
sum(ALERTS{alertstate="firing"}) by (alertname)

# Seconds since each active alert's condition first became true
time() - ALERTS_FOR_STATE

# Currently firing alerts, by severity
sum(ALERTS{alertstate="firing"}) by (severity)

# Time to acknowledge (from PagerDuty/etc)
pagerduty_incident_ack_duration_seconds
```

### Alert Quality Metrics
- **Mean Time to Acknowledge (MTTA)**: < 5 minutes
- **Mean Time to Resolve (MTTR)**: < 30 minutes
- **False Positive Rate**: < 10%
- **Alert Coverage**: > 80% of incidents preceded by an alert