Alerting Best Practices

Core Principles

1. Every Alert Should Be Actionable

If you can't do something about it, don't alert on it.

Bad: Alert: CPU > 50% (what action should be taken?)
Good: Alert: API latency p95 > 2s for 10m (investigate or scale up)

2. Alert on Symptoms, Not Causes

Alert on what users experience, not underlying components.

Bad: Database connection pool 80% full
Good: Request latency p95 > 1s (which might be caused by the DB pool)

3. Alert on SLO Violations

Tie alerts to Service Level Objectives.

Error rate exceeds 0.1% (SLO: 99.9% availability)
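
For context, a 99.9% SLO over a 30-day window leaves roughly 43 minutes of error budget, which is worth keeping in mind when sizing these alerts:

30 days × 24 h × 60 min = 43,200 minutes in the window
Error budget = 43,200 × (1 − 0.999) ≈ 43.2 minutes of downtime (or the equivalent in failed requests)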

4. Reduce Noise

Alert fatigue is real. Only page for critical issues.

Alert Severity Levels:

  • Critical: Page on-call immediately (user-facing issue)
  • Warning: Create ticket, review during business hours
  • Info: Log for awareness, no action needed

Alert Design Patterns

Pattern 1: Multi-Window Multi-Burn-Rate

Google's recommended SLO alerting approach.

Concept: Alert when error budget burn rate is high enough to exhaust the budget too quickly.
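
The multipliers in the rules below follow from: burn rate = (fraction of budget consumed × SLO window) / alert window. For a 30-day (720-hour) SLO window:

2% of budget in 1 hour  → 0.02 × 720 / 1  = 14.4x burn rate
5% of budget in 6 hours → 0.05 × 720 / 6  = 6x burn rate
10% of budget in 3 days → 0.10 × 720 / 72 = 1x burn rate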

# Fast burn (2% of budget in 1 hour)
- alert: FastBurnRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
    sum(rate(http_requests_total[1h]))
    > (14.4 * 0.001)  # 14.4x burn rate for 99.9% SLO
  for: 2m
  labels:
    severity: critical

# Slow burn (5% of budget in 6 hours)
- alert: SlowBurnRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[6h]))
      /
    sum(rate(http_requests_total[6h]))
    > (6 * 0.001)  # 6x burn rate for 99.9% SLO
  for: 30m
  labels:
    severity: warning

Burn Rate Multipliers for 99.9% SLO (0.1% error budget):

  • 1 hour window, 2m grace: 14.4x burn rate
  • 6 hour window, 30m grace: 6x burn rate
  • 3 day window, 6h grace: 1x burn rate
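
The full multi-window form pairs each long window with a short window so the alert stops firing soon after the burn stops. A sketch for the fast-burn case, reusing the metric names from the examples above:

# Fast burn, multi-window: both the 1h and 5m windows must exceed the burn rate
- alert: FastBurnRateMultiWindow
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
        /
      sum(rate(http_requests_total[1h]))
      > (14.4 * 0.001)
    )
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
        /
      sum(rate(http_requests_total[5m]))
      > (14.4 * 0.001)
    )
  labels:
    severity: critical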

Pattern 2: Rate of Change

Alert when metrics change rapidly.

- alert: TrafficSpike
  expr: |
    sum(rate(http_requests_total[5m]))
      >
    1.5 * sum(rate(http_requests_total[5m] offset 1h))
  for: 10m
  annotations:
    summary: "Traffic increased by 50% compared to 1 hour ago"

Pattern 3: Threshold with Hysteresis

Prevent flapping with different thresholds for firing and resolving.

# Fire above 90%; keep firing until usage drops back below 70%
# (referencing the ALERTS series latches the alert while it is active)
- alert: HighCPU
  expr: |
    cpu_usage > 90
      or
    (
      cpu_usage > 70
        and ignoring(alertname, alertstate)
      ALERTS{alertname="HighCPU", alertstate="firing"}
    )
  for: 5m

Pattern 4: Absent Metrics

Alert when expected metrics stop being reported (service down).

- alert: ServiceDown
  expr: absent(up{job="my-service"})
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Service {{ $labels.job }} is not reporting metrics"

Pattern 5: Aggregate Alerts

Alert on aggregate performance across multiple instances.

- alert: HighOverallErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
      /
    sum(rate(http_requests_total[5m]))
    > 0.05
  for: 10m
  annotations:
    summary: "Overall error rate is {{ $value | humanizePercentage }}"

Alert Annotation Best Practices

Required Fields

summary: One-line description of the issue

summary: "High error rate on {{ $labels.service }}: {{ $value | humanizePercentage }}"

description: Detailed explanation with context

description: |
  Error rate on {{ $labels.service }} is {{ $value | humanizePercentage }},
  which exceeds the threshold of 1% for more than 10 minutes.

  Current value: {{ $value }}
  Runbook: https://runbooks.example.com/high-error-rate

runbook_url: Link to investigation steps

runbook_url: "https://runbooks.example.com/alerts/{{ $labels.alertname }}"

dashboard: Link to relevant dashboard

dashboard: "https://grafana.example.com/d/service-dashboard?var-service={{ $labels.service }}"

logs: Link to logs

logs: "https://kibana.example.com/app/discover#/?_a=(query:(query_string:(query:'service:{{ $labels.service }}')))"

Alert Label Best Practices

Required Labels

severity: Critical, warning, or info

labels:
  severity: critical

team: Who should handle this alert

labels:
  team: platform
  severity: critical

component: What part of the system

labels:
  component: api-gateway
  severity: warning

Example Complete Alert

- alert: HighLatency
  expr: |
    histogram_quantile(0.95,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service, environment)
    ) > 1
  for: 10m
  labels:
    severity: warning
    team: backend
    component: api
    environment: "{{ $labels.environment }}"
  annotations:
    summary: "High latency on {{ $labels.service }}"
    description: |
      P95 latency on {{ $labels.service }} is {{ $value }}s, exceeding 1s threshold.

      This may impact user experience. Check recent deployments and database performance.

      Current p95: {{ $value }}s
      Threshold: 1s
      Duration: 10m+
    runbook_url: "https://runbooks.example.com/high-latency"
    dashboard: "https://grafana.example.com/d/api-dashboard"
    logs: "https://kibana.example.com/app/discover#/?_a=(query:(query_string:(query:'service:{{ $labels.service }} AND level:error')))"

Alert Thresholds

General Guidelines

Response Time / Latency:

  • Warning: p95 > 500ms or p99 > 1s
  • Critical: p95 > 2s or p99 > 5s

Error Rate:

  • Warning: > 1%
  • Critical: > 5%

Availability:

  • Warning: < 99.9%
  • Critical: < 99.5%

CPU Utilization:

  • Warning: > 70% for 15m
  • Critical: > 90% for 5m

Memory Utilization:

  • Warning: > 80% for 15m
  • Critical: > 95% for 5m

Disk Space:

  • Warning: > 80% full
  • Critical: > 90% full

Queue Depth:

  • Warning: > 70% of max capacity
  • Critical: > 90% of max capacity

Application-Specific Thresholds

Set thresholds based on:

  1. Historical performance: Use the p95 of the last 30 days + 20% (see the sketch after this list)
  2. SLO requirements: If SLO is 99.9%, alert at 99.5%
  3. Business impact: What error rate causes user complaints?
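
One way to encode guideline 1 is to precompute the baseline with a recording rule and alert when the current value exceeds it by 20%. This is only a sketch: the rule names (job:latency_p95:5m, job:latency_p95_30d:baseline) are illustrative and assume a p95 latency series is already being recorded:

# Recording rule: 30-day baseline of the p95 latency series
- record: job:latency_p95_30d:baseline
  expr: quantile_over_time(0.95, job:latency_p95:5m[30d])

# Alert when the current p95 exceeds the 30-day baseline by more than 20%
- alert: LatencyAboveBaseline
  expr: job:latency_p95:5m > 1.2 * job:latency_p95_30d:baseline
  for: 15m
  labels:
    severity: warning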

The "for" Clause

Prevent alert flapping by requiring the condition to be true for a duration.

Guidelines

Critical alerts: Short duration (2-5m)

- alert: ServiceDown
  expr: up == 0
  for: 2m  # Quick detection for critical issues

Warning alerts: Longer duration (10-30m)

- alert: HighMemoryUsage
  expr: memory_usage > 80
  for: 15m  # Avoid noise from temporary spikes

Resource saturation: Medium duration (5-10m)

- alert: HighCPU
  expr: cpu_usage > 90
  for: 5m

Alert Routing

Severity-Based Routing

# alertmanager.yml
route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'

  routes:
  # Critical alerts → PagerDuty
  - match:
      severity: critical
    receiver: pagerduty
    group_wait: 10s
    repeat_interval: 5m

  # Warning alerts → Slack
  - match:
      severity: warning
    receiver: slack
    group_wait: 30s
    repeat_interval: 12h

  # Info alerts → Email
  - match:
      severity: info
    receiver: email
    repeat_interval: 24h

Team-Based Routing

routes:
  # Platform team
  - match:
      team: platform
    receiver: platform-pagerduty

  # Backend team
  - match:
      team: backend
    receiver: backend-slack

  # Database team
  - match:
      component: database
    receiver: dba-pagerduty

Time-Based Routing

# Only page during business hours for non-critical
routes:
  - match:
      severity: warning
    receiver: slack
    active_time_intervals:
      - business_hours

time_intervals:
  - name: business_hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'
        location: 'America/New_York'

Alert Grouping

Intelligent Grouping

Group by service and environment:

route:
  group_by: ['alertname', 'service', 'environment']
  group_wait: 30s
  group_interval: 5m

This prevents:

  • 50 alerts for "HighCPU" on different pods → 1 grouped alert
  • Mixing production and staging alerts

Inhibition Rules

Suppress related alerts when a parent alert fires.

inhibit_rules:
  # If service is down, suppress latency alerts
  - source_match:
      alertname: ServiceDown
    target_match:
      alertname: HighLatency
    equal: ['service']

  # If node is down, suppress all pod alerts on that node
  - source_match:
      alertname: NodeDown
    target_match_re:
      alertname: '(PodCrashLoop|HighCPU|HighMemory)'
    equal: ['node']

Runbook Structure

Every alert should link to a runbook with:

1. Context

  • What does this alert mean?
  • What is the user impact?
  • What is the urgency?

2. Investigation Steps

## Investigation

1. Check service health dashboard
   https://grafana.example.com/d/service-dashboard

2. Check recent deployments
   kubectl rollout history deployment/myapp -n production

3. Check error logs
   kubectl logs deployment/myapp -n production --tail=100 | grep ERROR

4. Check dependencies
   - Database: Check slow query log
   - Redis: Check memory usage
   - External APIs: Check status pages

3. Common Causes

## Common Causes

- **Recent deployment**: Check if alert started after deployment
- **Traffic spike**: Check request rate, might need to scale
- **Database issues**: Check query performance and connection pool
- **External API degradation**: Check third-party status pages

4. Resolution Steps

## Resolution

### Immediate Actions (< 5 minutes)
1. Scale up if traffic spike: `kubectl scale deployment myapp --replicas=10`
2. Rollback if recent deployment: `kubectl rollout undo deployment/myapp`

### Short-term Actions (< 30 minutes)
1. Restart pods if memory leak: `kubectl rollout restart deployment/myapp`
2. Clear cache if stale data: `redis-cli -h cache.example.com FLUSHDB`

### Long-term Actions (post-incident)
1. Review and optimize slow queries
2. Implement circuit breakers
3. Add more capacity
4. Update alert thresholds if false positive

5. Escalation

## Escalation

If issue persists after 30 minutes:
- Slack: #backend-oncall
- PagerDuty: Escalate to senior engineer
- Incident Commander: Jane Doe (jane@example.com)

Anti-Patterns to Avoid

1. Alert on Everything

Don't: Alert on every warning log
Do: Alert on error rate exceeding threshold

2. Alert Without Context

Don't: "Error rate high" Do: "Error rate 5.2% exceeds 1% threshold for 10m, impacting checkout flow"

3. Static Thresholds for Dynamic Systems

Don't: cpu_usage > 70 (fails during scale-up)
Do: Alert on SLO violations or rate of change

4. No "for" Clause

Don't: Alert immediately on threshold breach
Do: Use for: 5m to avoid flapping

5. Too Many Recipients

Don't: Page 10 people for every alert
Do: Route to specific on-call rotation

6. Duplicate Alerts

Don't: Alert on both cause and symptom
Do: Alert on symptom, use inhibition for causes

7. No Runbook

Don't: Alert without guidance
Do: Include runbook_url in every alert


Alert Testing

Test Alert Firing

# Send a test alert to Alertmanager with amtool
amtool alert add alertname="TestAlert" severity="warning" \
  --annotation=summary="Test alert" \
  --alertmanager.url=http://alertmanager:9093

# Or use the Alertmanager API
curl -X POST http://alertmanager:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {"alertname": "TestAlert", "severity": "critical"},
    "annotations": {"summary": "Test critical alert"}
  }]'

Verify Alert Rules

# Check syntax
promtool check rules alerts.yml

# Test expression
promtool query instant http://prometheus:9090 \
  'sum(rate(http_requests_total{status=~"5.."}[5m]))'

# Unit test alerts
promtool test rules test.yml
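
A minimal sketch of what test.yml could contain, unit-testing the ServiceDown rule from Pattern 4 (assumes the alert rules live in alerts.yml; the input series is illustrative):

# test.yml
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    # Only an unrelated job reports, so up{job="my-service"} is absent
    input_series:
      - series: 'up{job="other-service", instance="10.0.0.1:9100"}'
        values: '1+0x20'
    alert_rule_test:
      - eval_time: 10m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              job: my-service
              severity: critical
            exp_annotations:
              summary: "Service my-service is not reporting metrics"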

Test Alertmanager Routing

# Test which receiver an alert would go to
amtool config routes test \
  --config.file=alertmanager.yml \
  alertname="HighLatency" \
  severity="critical" \
  team="backend"

On-Call Best Practices

Rotation Schedule

  • Primary on-call: First responder
  • Secondary on-call: Escalation backup
  • Rotation length: 1 week (balance load vs context)
  • Handoff: Monday morning (not Friday evening)

On-Call Checklist

## Pre-shift
- [ ] Test pager/phone
- [ ] Review recent incidents
- [ ] Check upcoming deployments
- [ ] Update contact info

## During shift
- [ ] Respond to pages within 5 minutes
- [ ] Document all incidents
- [ ] Update runbooks if gaps found
- [ ] Communicate in #incidents channel

## Post-shift
- [ ] Hand off open incidents
- [ ] Complete incident reports
- [ ] Suggest improvements
- [ ] Update team documentation

Escalation Policy

  1. Primary: Responds within 5 minutes
  2. Secondary: Auto-escalate after 15 minutes
  3. Manager: Auto-escalate after 30 minutes
  4. Incident Commander: Critical incidents only

Metrics About Alerts

Monitor your monitoring system!

Key Metrics

# Alert firing frequency
sum(ALERTS{alertstate="firing"}) by (alertname)

# How long each active alert has been pending or firing (seconds)
time() - ALERTS_FOR_STATE

# Alerts per severity
sum(ALERTS{alertstate="firing"}) by (severity)

# Time to acknowledge (from PagerDuty/etc)
pagerduty_incident_ack_duration_seconds
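
To spot the noisiest rules (useful input for the quality metrics below), one option is to rank alerts by how long they spent firing over the past week; the sample count here is a rough proxy for firing time:

# Top 10 alert rules by total firing samples over the last 7 days
topk(10, sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[7d])))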

Alert Quality Metrics

  • Mean Time to Acknowledge (MTTA): < 5 minutes
  • Mean Time to Resolve (MTTR): < 30 minutes
  • False Positive Rate: < 10%
  • Alert Coverage: > 80% of incidents are preceded by an alert