# Alerting Best Practices
## Core Principles
### 1. Every Alert Should Be Actionable
If you can't do something about it, don't alert on it.
❌ Bad: `Alert: CPU > 50%` (What action should be taken?)
✅ Good: `Alert: API latency p95 > 2s for 10m` (Investigate/scale up)
### 2. Alert on Symptoms, Not Causes
Alert on what users experience, not underlying components.
❌ Bad: `Database connection pool 80% full`
✅ Good: `Request latency p95 > 1s` (which might be caused by DB pool)
### 3. Alert on SLO Violations
Tie alerts to Service Level Objectives.
`Error rate exceeds 0.1% (SLO: 99.9% availability)`
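A minimal sketch of such a rule, reusing the `http_requests_total` metric from the examples later in this document (Pattern 1 below refines this idea into burn-rate alerts):
```yaml
# Page when the error rate exceeds the 0.1% error budget implied
# by a 99.9% availability SLO.
- alert: ErrorBudgetViolation
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))
    > 0.001
  for: 10m
  labels:
    severity: critical
```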
### 4. Reduce Noise
Alert fatigue is real. Only page for critical issues.
**Alert Severity Levels**:
- **Critical**: Page on-call immediately (user-facing issue)
- **Warning**: Create ticket, review during business hours
- **Info**: Log for awareness, no action needed
---
## Alert Design Patterns
### Pattern 1: Multi-Window Multi-Burn-Rate
Google's recommended SLO alerting approach.
**Concept**: Alert when error budget burn rate is high enough to exhaust the budget too quickly.
```yaml
# Fast burn (2% of budget in 1 hour)
- alert: FastBurnRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
> (14.4 * 0.001) # 14.4x burn rate for 99.9% SLO
for: 2m
labels:
severity: critical
# Slow burn (5% of budget in 6 hours)
- alert: SlowBurnRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[6h]))
/
sum(rate(http_requests_total[6h]))
> (6 * 0.001) # 6x burn rate for 99.9% SLO
for: 30m
labels:
severity: warning
```
**Burn Rate Multipliers for 99.9% SLO (0.1% error budget)**:
- 1 hour window, 2m grace: 14.4x burn rate (consumes 2% of the 30-day budget)
- 6 hour window, 30m grace: 6x burn rate (consumes 5% of the budget)
- 3 day window, 6h grace: 1x burn rate (consumes 10% of the budget)
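The rules above use a single long window each. The full multi-window form from Google's SRE Workbook additionally requires a short window (typically 1/12 of the long one) to be burning, so the alert resolves quickly once errors stop. A sketch for the fast-burn case:
```yaml
# Multi-window fast burn: the 1h AND 5m windows must both exceed
# the 14.4x burn rate, so the alert clears soon after recovery.
- alert: FastBurnRateMultiWindow
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
      > (14.4 * 0.001)
    )
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
      > (14.4 * 0.001)
    )
  for: 2m
  labels:
    severity: critical
```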
### Pattern 2: Rate of Change
Alert when metrics change rapidly.
```yaml
- alert: TrafficSpike
expr: |
sum(rate(http_requests_total[5m]))
>
1.5 * sum(rate(http_requests_total[5m] offset 1h))
for: 10m
annotations:
summary: "Traffic increased by 50% compared to 1 hour ago"
```
### Pattern 3: Threshold with Hysteresis
Prevent flapping with different thresholds for firing and resolving.
```yaml
# Fire at 90%, keep firing until usage drops below 70%.
# Prometheus has no native hysteresis, so the rule references its own
# state via the built-in ALERTS metric (assumes cpu_usage has an
# instance label).
- alert: HighCPU
  expr: |
    cpu_usage > 90
    or
    (cpu_usage > 70 and on(instance) ALERTS{alertname="HighCPU", alertstate="firing"})
  for: 5m
```
### Pattern 4: Absent Metrics
Alert when expected metrics stop being reported (service down).
```yaml
- alert: ServiceDown
expr: absent(up{job="my-service"})
for: 5m
labels:
severity: critical
annotations:
summary: "Service {{ $labels.job }} is not reporting metrics"
```
### Pattern 5: Aggregate Alerts
Alert on aggregate performance across multiple instances.
```yaml
- alert: HighOverallErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.05
for: 10m
annotations:
summary: "Overall error rate is {{ $value | humanizePercentage }}"
```
---
## Alert Annotation Best Practices
### Required Fields
**summary**: One-line description of the issue
```yaml
summary: "High error rate on {{ $labels.service }}: {{ $value | humanizePercentage }}"
```
**description**: Detailed explanation with context
```yaml
description: |
Error rate on {{ $labels.service }} is {{ $value | humanizePercentage }},
which exceeds the threshold of 1% for more than 10 minutes.
Current value: {{ $value }}
Runbook: https://runbooks.example.com/high-error-rate
```
**runbook_url**: Link to investigation steps
```yaml
runbook_url: "https://runbooks.example.com/alerts/{{ $labels.alertname }}"
```
### Optional but Recommended
**dashboard**: Link to relevant dashboard
```yaml
dashboard: "https://grafana.example.com/d/service-dashboard?var-service={{ $labels.service }}"
```
**logs**: Link to logs
```yaml
logs: "https://kibana.example.com/app/discover#/?_a=(query:(query_string:(query:'service:{{ $labels.service }}')))"
```
---
## Alert Label Best Practices
### Required Labels
**severity**: `critical`, `warning`, or `info`
```yaml
labels:
severity: critical
```
**team**: Who should handle this alert
```yaml
labels:
team: platform
severity: critical
```
**component**: What part of the system
```yaml
labels:
component: api-gateway
severity: warning
```
### Example Complete Alert
```yaml
- alert: HighLatency
expr: |
    histogram_quantile(0.95,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service, environment)
    ) > 1
for: 10m
labels:
severity: warning
team: backend
component: api
environment: "{{ $labels.environment }}"
annotations:
summary: "High latency on {{ $labels.service }}"
description: |
P95 latency on {{ $labels.service }} is {{ $value }}s, exceeding 1s threshold.
This may impact user experience. Check recent deployments and database performance.
Current p95: {{ $value }}s
Threshold: 1s
Duration: 10m+
runbook_url: "https://runbooks.example.com/high-latency"
dashboard: "https://grafana.example.com/d/api-dashboard"
logs: "https://kibana.example.com/app/discover#/?_a=(query:(query_string:(query:'service:{{ $labels.service }} AND level:error')))"
```
---
## Alert Thresholds
### General Guidelines
**Response Time / Latency**:
- Warning: p95 > 500ms or p99 > 1s
- Critical: p95 > 2s or p99 > 5s
**Error Rate**:
- Warning: > 1%
- Critical: > 5%
**Availability**:
- Warning: < 99.9%
- Critical: < 99.5%
**CPU Utilization**:
- Warning: > 70% for 15m
- Critical: > 90% for 5m
**Memory Utilization**:
- Warning: > 80% for 15m
- Critical: > 95% for 5m
**Disk Space** (see the example rules after this list):
- Warning: > 80% full
- Critical: > 90% full
**Queue Depth**:
- Warning: > 70% of max capacity
- Critical: > 90% of max capacity
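As a concrete instance of the disk-space thresholds above, a warning/critical pair using the standard node_exporter filesystem metrics (the `for` durations follow the CPU/memory guidance):
```yaml
- alert: DiskSpaceWarning
  expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) > 0.80
  for: 15m
  labels:
    severity: warning
- alert: DiskSpaceCritical
  expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) > 0.90
  for: 5m
  labels:
    severity: critical
```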
### Application-Specific Thresholds
Set thresholds based on:
1. **Historical performance**: Use the p95 of the last 30 days + 20% (see the sketch after this list)
2. **SLO requirements**: If the SLO is 99.9%, alert at a stricter internal threshold (e.g. 99.95%) so you act before the SLO is actually breached
3. **Business impact**: What error rate causes user complaints?
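A sketch of the historical-baseline approach. The recording rule name `latency:p95:5m` is illustrative; the alert compares the current p95 against the 30-day p95 of that recorded series plus 20%:
```yaml
groups:
  - name: latency-baseline
    rules:
      # Record the current p95 latency per service
      - record: latency:p95:5m
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          )
      # Fire when the current p95 exceeds its own 30-day p95 by 20%
      - alert: LatencyAboveBaseline
        expr: |
          latency:p95:5m
          > 1.2 * quantile_over_time(0.95, latency:p95:5m[30d])
        for: 15m
        labels:
          severity: warning
```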
---
## The "for" Clause
Prevent alert flapping by requiring the condition to be true for a duration.
### Guidelines
**Critical alerts**: Short duration (2-5m)
```yaml
- alert: ServiceDown
expr: up == 0
for: 2m # Quick detection for critical issues
```
**Warning alerts**: Longer duration (10-30m)
```yaml
- alert: HighMemoryUsage
expr: memory_usage > 80
for: 15m # Avoid noise from temporary spikes
```
**Resource saturation**: Medium duration (5-10m)
```yaml
- alert: HighCPU
expr: cpu_usage > 90
for: 5m
```
---
## Alert Routing
### Severity-Based Routing
```yaml
# alertmanager.yml
route:
group_by: ['alertname', 'cluster']
group_wait: 10s
group_interval: 5m
repeat_interval: 4h
receiver: 'default'
routes:
# Critical alerts → PagerDuty
- match:
severity: critical
receiver: pagerduty
group_wait: 10s
repeat_interval: 5m
# Warning alerts → Slack
- match:
severity: warning
receiver: slack
group_wait: 30s
repeat_interval: 12h
# Info alerts → Email
- match:
severity: info
receiver: email
repeat_interval: 24h
```
### Team-Based Routing
```yaml
routes:
# Platform team
- match:
team: platform
receiver: platform-pagerduty
# Backend team
- match:
team: backend
receiver: backend-slack
# Database team
- match:
component: database
receiver: dba-pagerduty
```
### Time-Based Routing
```yaml
# Only page during business hours for non-critical
routes:
- match:
severity: warning
receiver: slack
active_time_intervals:
- business_hours
time_intervals:
- name: business_hours
time_intervals:
- weekdays: ['monday:friday']
times:
- start_time: '09:00'
end_time: '17:00'
location: 'America/New_York'
```
---
## Alert Grouping
### Intelligent Grouping
**Group by service and environment**:
```yaml
route:
group_by: ['alertname', 'service', 'environment']
group_wait: 30s
group_interval: 5m
```
With this grouping:
- 50 "HighCPU" alerts on different pods collapse into 1 grouped notification
- Production and staging alerts are never mixed into the same notification
### Inhibition Rules
Suppress related alerts when a parent alert fires.
```yaml
inhibit_rules:
# If service is down, suppress latency alerts
- source_match:
alertname: ServiceDown
target_match:
alertname: HighLatency
equal: ['service']
# If node is down, suppress all pod alerts on that node
- source_match:
alertname: NodeDown
target_match_re:
alertname: '(PodCrashLoop|HighCPU|HighMemory)'
equal: ['node']
```
---
## Runbook Structure
Every alert should link to a runbook with:
### 1. Context
- What does this alert mean?
- What is the user impact?
- What is the urgency?
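For example (the values here are illustrative):
```markdown
## Context
- **What it means**: p95 API latency exceeds 1s
- **User impact**: Slow page loads; checkout requests may time out
- **Urgency**: High, investigate within 15 minutes
```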
### 2. Investigation Steps
```markdown
## Investigation
1. Check service health dashboard
https://grafana.example.com/d/service-dashboard
2. Check recent deployments
kubectl rollout history deployment/myapp -n production
3. Check error logs
kubectl logs deployment/myapp -n production --tail=100 | grep ERROR
4. Check dependencies
- Database: Check slow query log
- Redis: Check memory usage
- External APIs: Check status pages
```
### 3. Common Causes
```markdown
## Common Causes
- **Recent deployment**: Check if alert started after deployment
- **Traffic spike**: Check request rate, might need to scale
- **Database issues**: Check query performance and connection pool
- **External API degradation**: Check third-party status pages
```
### 4. Resolution Steps
```markdown
## Resolution
### Immediate Actions (< 5 minutes)
1. Scale up if traffic spike: `kubectl scale deployment myapp --replicas=10`
2. Rollback if recent deployment: `kubectl rollout undo deployment/myapp`
### Short-term Actions (< 30 minutes)
1. Restart pods if memory leak: `kubectl rollout restart deployment/myapp`
2. Clear cache if stale data: `redis-cli -h cache.example.com FLUSHDB`
### Long-term Actions (post-incident)
1. Review and optimize slow queries
2. Implement circuit breakers
3. Add more capacity
4. Update alert thresholds if false positive
```
### 5. Escalation
```markdown
## Escalation
If issue persists after 30 minutes:
- Slack: #backend-oncall
- PagerDuty: Escalate to senior engineer
- Incident Commander: Jane Doe (jane@example.com)
```
---
## Anti-Patterns to Avoid
### 1. Alert on Everything
❌ Don't: Alert on every warning log
✅ Do: Alert on error rate exceeding threshold
### 2. Alert Without Context
❌ Don't: "Error rate high"
✅ Do: "Error rate 5.2% exceeds 1% threshold for 10m, impacting checkout flow"
### 3. Static Thresholds for Dynamic Systems
❌ Don't: `cpu_usage > 70` (fails during scale-up)
✅ Do: Alert on SLO violations or rate of change
### 4. No "for" Clause
❌ Don't: Alert immediately on threshold breach
✅ Do: Use `for: 5m` to avoid flapping
### 5. Too Many Recipients
❌ Don't: Page 10 people for every alert
✅ Do: Route to specific on-call rotation
### 6. Duplicate Alerts
❌ Don't: Alert on both cause and symptom
✅ Do: Alert on symptom, use inhibition for causes
### 7. No Runbook
❌ Don't: Alert without guidance
✅ Do: Include runbook_url in every alert
---
## Alert Testing
### Test Alert Firing
```bash
# Send a test alert to Alertmanager
amtool alert add alertname="TestAlert" \
  severity="warning" \
  --annotation=summary="Test alert"
# Or use the Alertmanager API (v2; the v1 API has been removed)
curl -X POST http://alertmanager:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {"alertname": "TestAlert", "severity": "critical"},
    "annotations": {"summary": "Test critical alert"}
  }]'
```
### Verify Alert Rules
```bash
# Check syntax
promtool check rules alerts.yml
# Test expression
promtool query instant http://prometheus:9090 \
'sum(rate(http_requests_total{status=~"5.."}[5m]))'
# Unit test alerts
promtool test rules test.yml
```
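A minimal sketch of the unit-test file referenced above (`test.yml` is the assumed name; the rule under test is the `up == 0` variant of `ServiceDown` from the "for" clause section):
```yaml
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Service reports up == 0 for 10 minutes
      - series: 'up{job="my-service"}'
        values: '0x10'
    alert_rule_test:
      # At 5m the condition has held past the 2m "for" clause, so it fires
      - eval_time: 5m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              job: my-service
```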
### Test Alertmanager Routing
```bash
# Test which receiver an alert would go to
amtool config routes test \
--config.file=alertmanager.yml \
alertname="HighLatency" \
severity="critical" \
team="backend"
```
---
## On-Call Best Practices
### Rotation Schedule
- **Primary on-call**: First responder
- **Secondary on-call**: Escalation backup
- **Rotation length**: 1 week (balances pager load against retained context)
- **Handoff**: Monday morning (not Friday evening)
### On-Call Checklist
```markdown
## Pre-shift
- [ ] Test pager/phone
- [ ] Review recent incidents
- [ ] Check upcoming deployments
- [ ] Update contact info
## During shift
- [ ] Respond to pages within 5 minutes
- [ ] Document all incidents
- [ ] Update runbooks if gaps found
- [ ] Communicate in #incidents channel
## Post-shift
- [ ] Hand off open incidents
- [ ] Complete incident reports
- [ ] Suggest improvements
- [ ] Update team documentation
```
### Escalation Policy
1. **Primary**: Responds within 5 minutes
2. **Secondary**: Auto-escalate after 15 minutes
3. **Manager**: Auto-escalate after 30 minutes
4. **Incident Commander**: Critical incidents only
---
## Metrics About Alerts
Monitor your monitoring system!
### Key Metrics
```promql
# Currently firing alerts, by alert name
sum(ALERTS{alertstate="firing"}) by (alertname)
# How long each alert has been active (ALERTS_FOR_STATE holds the start timestamp)
time() - ALERTS_FOR_STATE
# Alerts per severity
sum(ALERTS{alertstate="firing"}) by (severity)
# Time to acknowledge (from PagerDuty/etc)
pagerduty_incident_ack_duration_seconds
```
### Alert Quality Metrics
- **Mean Time to Acknowledge (MTTA)**: < 5 minutes
- **Mean Time to Resolve (MTTR)**: < 30 minutes
- **False Positive Rate**: < 10%
- **Alert Coverage**: > 80% of incidents preceded by an alert