references/alerting_best_practices.md
|
||||
# Alerting Best Practices
|
||||
|
||||
## Core Principles
|
||||
|
||||
### 1. Every Alert Should Be Actionable
|
||||
If you can't do something about it, don't alert on it.
|
||||
|
||||
❌ Bad: `Alert: CPU > 50%` (What action should be taken?)
|
||||
✅ Good: `Alert: API latency p95 > 2s for 10m` (Investigate/scale up)
|
||||
|
||||
### 2. Alert on Symptoms, Not Causes
|
||||
Alert on what users experience, not underlying components.
|
||||
|
||||
❌ Bad: `Database connection pool 80% full`
|
||||
✅ Good: `Request latency p95 > 1s` (which might be caused by DB pool)
|
||||
|
||||
### 3. Alert on SLO Violations
|
||||
Tie alerts to Service Level Objectives.
|
||||
|
||||
✅ `Error rate exceeds 0.1% (SLO: 99.9% availability)`
|
||||
|
||||
### 4. Reduce Noise
|
||||
Alert fatigue is real. Only page for critical issues.
|
||||
|
||||
**Alert Severity Levels**:
|
||||
- **Critical**: Page on-call immediately (user-facing issue)
|
||||
- **Warning**: Create ticket, review during business hours
|
||||
- **Info**: Log for awareness, no action needed
|
||||
|
||||
---
|
||||
|
||||
## Alert Design Patterns
|
||||
|
||||
### Pattern 1: Multi-Window Multi-Burn-Rate
|
||||
|
||||
Google's recommended SLO alerting approach.
|
||||
|
||||
**Concept**: Alert when error budget burn rate is high enough to exhaust the budget too quickly.
|
||||
|
||||
```yaml
|
||||
# Fast burn: consumes 2% of the 30-day error budget in 1 hour
|
||||
- alert: FastBurnRate
|
||||
expr: |
|
||||
sum(rate(http_requests_total{status=~"5.."}[1h]))
|
||||
/
|
||||
sum(rate(http_requests_total[1h]))
|
||||
> (14.4 * 0.001) # 14.4x burn rate for 99.9% SLO
|
||||
for: 2m
|
||||
labels:
|
||||
severity: critical
|
||||
|
||||
# Slow burn: consumes 5% of the 30-day error budget in 6 hours
|
||||
- alert: SlowBurnRate
|
||||
expr: |
|
||||
sum(rate(http_requests_total{status=~"5.."}[6h]))
|
||||
/
|
||||
sum(rate(http_requests_total[6h]))
|
||||
> (6 * 0.001) # 6x burn rate for 99.9% SLO
|
||||
for: 30m
|
||||
labels:
|
||||
severity: warning
|
||||
```
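Google's SRE Workbook refines this further by pairing each long window with a short window (typically one-twelfth of it), so the alert both fires quickly and stops firing soon after the error rate recovers. A minimal sketch, assuming `error_ratio_1h`, `error_ratio_5m`, `error_ratio_6h`, and `error_ratio_30m` are recording rules you have defined for the error-rate expressions above:

```yaml
- alert: FastBurnRateMultiWindow
  expr: |
    (error_ratio_1h > 14.4 * 0.001)
    and
    (error_ratio_5m > 14.4 * 0.001)
  labels:
    severity: critical

- alert: SlowBurnRateMultiWindow
  expr: |
    (error_ratio_6h > 6 * 0.001)
    and
    (error_ratio_30m > 6 * 0.001)
  labels:
    severity: warning
```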
|
||||
|
||||
**Burn Rate Multipliers for 99.9% SLO (0.1% error budget)**:
|
||||
- 1 hour window, 2m grace: 14.4x burn rate
|
||||
- 6 hour window, 30m grace: 6x burn rate
|
||||
- 3 day window, 6h grace: 1x burn rate
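These multipliers follow from simple budget arithmetic: burn rate = (fraction of budget consumed × SLO window) / alert window. With a 30-day (720-hour) SLO window, consuming 2% of the budget in 1 hour requires a burn rate of 0.02 × 720 / 1 = 14.4, and consuming 5% in 6 hours requires 0.05 × 720 / 6 = 6.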
|
||||
|
||||
### Pattern 2: Rate of Change
|
||||
Alert when metrics change rapidly.
|
||||
|
||||
```yaml
|
||||
- alert: TrafficSpike
|
||||
expr: |
|
||||
sum(rate(http_requests_total[5m]))
|
||||
>
|
||||
1.5 * sum(rate(http_requests_total[5m] offset 1h))
|
||||
for: 10m
|
||||
annotations:
|
||||
summary: "Traffic increased by 50% compared to 1 hour ago"
|
||||
```
|
||||
|
||||
### Pattern 3: Threshold with Hysteresis
|
||||
Prevent flapping with different thresholds for firing and resolving.
|
||||
|
||||
```yaml
|
||||
# Fire at 90%, resolve at 70%
|
||||
- alert: HighCPU
|
||||
expr: cpu_usage > 90
|
||||
for: 5m
|
||||
|
||||
- alert: HighCPU_Resolved
|
||||
expr: cpu_usage < 70
|
||||
for: 5m
|
||||
```
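Prometheus has no native hysteresis, so the second rule above is only a companion signal; `HighCPU` itself still resolves as soon as usage dips under 90%. A closer approximation keeps the alert firing until usage falls below the lower bound by referencing the built-in `ALERTS` series. This is a sketch, assuming `cpu_usage` carries an `instance` label:

```yaml
- alert: HighCPU
  expr: |
    cpu_usage > 90
    or
    (cpu_usage > 70 and on(instance) ALERTS{alertname="HighCPU", alertstate="firing"})
  for: 5m
```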
|
||||
|
||||
### Pattern 4: Absent Metrics
|
||||
Alert when expected metrics stop being reported (service down).
|
||||
|
||||
```yaml
|
||||
- alert: ServiceDown
|
||||
expr: absent(up{job="my-service"})
|
||||
for: 5m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "Service {{ $labels.job }} is not reporting metrics"
|
||||
```
|
||||
|
||||
### Pattern 5: Aggregate Alerts
|
||||
Alert on aggregate performance across multiple instances.
|
||||
|
||||
```yaml
|
||||
- alert: HighOverallErrorRate
|
||||
expr: |
|
||||
sum(rate(http_requests_total{status=~"5.."}[5m]))
|
||||
/
|
||||
sum(rate(http_requests_total[5m]))
|
||||
> 0.05
|
||||
for: 10m
|
||||
annotations:
|
||||
summary: "Overall error rate is {{ $value | humanizePercentage }}"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Alert Annotation Best Practices
|
||||
|
||||
### Required Fields
|
||||
|
||||
**summary**: One-line description of the issue
|
||||
```yaml
|
||||
summary: "High error rate on {{ $labels.service }}: {{ $value | humanizePercentage }}"
|
||||
```
|
||||
|
||||
**description**: Detailed explanation with context
|
||||
```yaml
|
||||
description: |
|
||||
Error rate on {{ $labels.service }} is {{ $value | humanizePercentage }},
|
||||
which exceeds the threshold of 1% for more than 10 minutes.
|
||||
|
||||
Current value: {{ $value }}
|
||||
Runbook: https://runbooks.example.com/high-error-rate
|
||||
```
|
||||
|
||||
**runbook_url**: Link to investigation steps
|
||||
```yaml
|
||||
runbook_url: "https://runbooks.example.com/alerts/{{ $labels.alertname }}"
|
||||
```
|
||||
|
||||
### Optional but Recommended
|
||||
|
||||
**dashboard**: Link to relevant dashboard
|
||||
```yaml
|
||||
dashboard: "https://grafana.example.com/d/service-dashboard?var-service={{ $labels.service }}"
|
||||
```
|
||||
|
||||
**logs**: Link to logs
|
||||
```yaml
|
||||
logs: "https://kibana.example.com/app/discover#/?_a=(query:(query_string:(query:'service:{{ $labels.service }}')))"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Alert Label Best Practices
|
||||
|
||||
### Required Labels
|
||||
|
||||
**severity**: Critical, warning, or info
|
||||
```yaml
|
||||
labels:
|
||||
severity: critical
|
||||
```
|
||||
|
||||
**team**: Who should handle this alert
|
||||
```yaml
|
||||
labels:
|
||||
team: platform
|
||||
severity: critical
|
||||
```
|
||||
|
||||
**component**: What part of the system
|
||||
```yaml
|
||||
labels:
|
||||
component: api-gateway
|
||||
severity: warning
|
||||
```
|
||||
|
||||
### Example Complete Alert
|
||||
```yaml
|
||||
- alert: HighLatency
|
||||
expr: |
|
||||
histogram_quantile(0.95,
|
||||
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
|
||||
) > 1
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
team: backend
|
||||
component: api
|
||||
environment: "{{ $labels.environment }}"
|
||||
annotations:
|
||||
summary: "High latency on {{ $labels.service }}"
|
||||
description: |
|
||||
P95 latency on {{ $labels.service }} is {{ $value }}s, exceeding 1s threshold.
|
||||
|
||||
This may impact user experience. Check recent deployments and database performance.
|
||||
|
||||
Current p95: {{ $value }}s
|
||||
Threshold: 1s
|
||||
Duration: 10m+
|
||||
runbook_url: "https://runbooks.example.com/high-latency"
|
||||
dashboard: "https://grafana.example.com/d/api-dashboard"
|
||||
logs: "https://kibana.example.com/app/discover#/?_a=(query:(query_string:(query:'service:{{ $labels.service }} AND level:error')))"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Alert Thresholds
|
||||
|
||||
### General Guidelines
|
||||
|
||||
**Response Time / Latency**:
|
||||
- Warning: p95 > 500ms or p99 > 1s
|
||||
- Critical: p95 > 2s or p99 > 5s
|
||||
|
||||
**Error Rate**:
|
||||
- Warning: > 1%
|
||||
- Critical: > 5%
|
||||
|
||||
**Availability**:
|
||||
- Warning: < 99.9%
|
||||
- Critical: < 99.5%
|
||||
|
||||
**CPU Utilization**:
|
||||
- Warning: > 70% for 15m
|
||||
- Critical: > 90% for 5m
|
||||
|
||||
**Memory Utilization**:
|
||||
- Warning: > 80% for 15m
|
||||
- Critical: > 95% for 5m
|
||||
|
||||
**Disk Space**:
|
||||
- Warning: > 80% full
|
||||
- Critical: > 90% full
|
||||
|
||||
**Queue Depth**:
|
||||
- Warning: > 70% of max capacity
|
||||
- Critical: > 90% of max capacity
|
||||
|
||||
### Application-Specific Thresholds
|
||||
|
||||
Set thresholds based on:
|
||||
1. **Historical performance**: Use the p95 of the last 30 days + 20% (see the sketch after this list)
|
||||
2. **SLO requirements**: If SLO is 99.9%, alert at 99.5%
|
||||
3. **Business impact**: What error rate causes user complaints?
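For the historical-baseline approach, a sketch in PromQL, assuming a recording rule `service:latency_seconds:p95` (illustrative name) already computes the per-service p95 and your retention covers 30 days:

```promql
# Alert when the current p95 exceeds the 30-day historical p95 by more than 20%
service:latency_seconds:p95
  > 1.2 * quantile_over_time(0.95, service:latency_seconds:p95[30d])
```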
|
||||
|
||||
---
|
||||
|
||||
## The "for" Clause
|
||||
|
||||
Prevent alert flapping by requiring the condition to be true for a duration.
|
||||
|
||||
### Guidelines
|
||||
|
||||
**Critical alerts**: Short duration (2-5m)
|
||||
```yaml
|
||||
- alert: ServiceDown
|
||||
expr: up == 0
|
||||
for: 2m # Quick detection for critical issues
|
||||
```
|
||||
|
||||
**Warning alerts**: Longer duration (10-30m)
|
||||
```yaml
|
||||
- alert: HighMemoryUsage
|
||||
expr: memory_usage > 80
|
||||
for: 15m # Avoid noise from temporary spikes
|
||||
```
|
||||
|
||||
**Resource saturation**: Medium duration (5-10m)
|
||||
```yaml
|
||||
- alert: HighCPU
|
||||
expr: cpu_usage > 90
|
||||
for: 5m
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Alert Routing
|
||||
|
||||
### Severity-Based Routing
|
||||
|
||||
```yaml
|
||||
# alertmanager.yml
|
||||
route:
|
||||
group_by: ['alertname', 'cluster']
|
||||
group_wait: 10s
|
||||
group_interval: 5m
|
||||
repeat_interval: 4h
|
||||
receiver: 'default'
|
||||
|
||||
routes:
|
||||
# Critical alerts → PagerDuty
|
||||
- match:
|
||||
severity: critical
|
||||
receiver: pagerduty
|
||||
group_wait: 10s
|
||||
repeat_interval: 5m
|
||||
|
||||
# Warning alerts → Slack
|
||||
- match:
|
||||
severity: warning
|
||||
receiver: slack
|
||||
group_wait: 30s
|
||||
repeat_interval: 12h
|
||||
|
||||
# Info alerts → Email
|
||||
- match:
|
||||
severity: info
|
||||
receiver: email
|
||||
repeat_interval: 24h
|
||||
```
|
||||
|
||||
### Team-Based Routing
|
||||
|
||||
```yaml
|
||||
routes:
|
||||
# Platform team
|
||||
- match:
|
||||
team: platform
|
||||
receiver: platform-pagerduty
|
||||
|
||||
# Backend team
|
||||
- match:
|
||||
team: backend
|
||||
receiver: backend-slack
|
||||
|
||||
# Database team
|
||||
- match:
|
||||
component: database
|
||||
receiver: dba-pagerduty
|
||||
```
|
||||
|
||||
### Time-Based Routing
|
||||
|
||||
```yaml
|
||||
# Only page during business hours for non-critical
|
||||
routes:
|
||||
- match:
|
||||
severity: warning
|
||||
receiver: slack
|
||||
active_time_intervals:
|
||||
- business_hours
|
||||
|
||||
time_intervals:
|
||||
- name: business_hours
|
||||
time_intervals:
|
||||
- weekdays: ['monday:friday']
|
||||
times:
|
||||
- start_time: '09:00'
|
||||
end_time: '17:00'
|
||||
location: 'America/New_York'
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Alert Grouping
|
||||
|
||||
### Intelligent Grouping
|
||||
|
||||
**Group by service and environment**:
|
||||
```yaml
|
||||
route:
|
||||
group_by: ['alertname', 'service', 'environment']
|
||||
group_wait: 30s
|
||||
group_interval: 5m
|
||||
```
|
||||
|
||||
This prevents:
|
||||
- 50 alerts for "HighCPU" on different pods → 1 grouped alert
|
||||
- Mixing production and staging alerts
|
||||
|
||||
### Inhibition Rules
|
||||
|
||||
Suppress related alerts when a parent alert fires.
|
||||
|
||||
```yaml
|
||||
inhibit_rules:
|
||||
# If service is down, suppress latency alerts
|
||||
- source_match:
|
||||
alertname: ServiceDown
|
||||
target_match:
|
||||
alertname: HighLatency
|
||||
equal: ['service']
|
||||
|
||||
# If node is down, suppress all pod alerts on that node
|
||||
- source_match:
|
||||
alertname: NodeDown
|
||||
target_match_re:
|
||||
alertname: '(PodCrashLoop|HighCPU|HighMemory)'
|
||||
equal: ['node']
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Runbook Structure
|
||||
|
||||
Every alert should link to a runbook with:
|
||||
|
||||
### 1. Context
|
||||
- What does this alert mean?
|
||||
- What is the user impact?
|
||||
- What is the urgency?
|
||||
|
||||
### 2. Investigation Steps
|
||||
```markdown
|
||||
## Investigation
|
||||
|
||||
1. Check service health dashboard
|
||||
https://grafana.example.com/d/service-dashboard
|
||||
|
||||
2. Check recent deployments
|
||||
kubectl rollout history deployment/myapp -n production
|
||||
|
||||
3. Check error logs
|
||||
kubectl logs deployment/myapp -n production --tail=100 | grep ERROR
|
||||
|
||||
4. Check dependencies
|
||||
- Database: Check slow query log
|
||||
- Redis: Check memory usage
|
||||
- External APIs: Check status pages
|
||||
```
|
||||
|
||||
### 3. Common Causes
|
||||
```markdown
|
||||
## Common Causes
|
||||
|
||||
- **Recent deployment**: Check if alert started after deployment
|
||||
- **Traffic spike**: Check request rate, might need to scale
|
||||
- **Database issues**: Check query performance and connection pool
|
||||
- **External API degradation**: Check third-party status pages
|
||||
```
|
||||
|
||||
### 4. Resolution Steps
|
||||
```markdown
|
||||
## Resolution
|
||||
|
||||
### Immediate Actions (< 5 minutes)
|
||||
1. Scale up if traffic spike: `kubectl scale deployment myapp --replicas=10`
|
||||
2. Rollback if recent deployment: `kubectl rollout undo deployment/myapp`
|
||||
|
||||
### Short-term Actions (< 30 minutes)
|
||||
1. Restart pods if memory leak: `kubectl rollout restart deployment/myapp`
|
||||
2. Clear cache if stale data: `redis-cli -h cache.example.com FLUSHDB`
|
||||
|
||||
### Long-term Actions (post-incident)
|
||||
1. Review and optimize slow queries
|
||||
2. Implement circuit breakers
|
||||
3. Add more capacity
|
||||
4. Update alert thresholds if false positive
|
||||
```
|
||||
|
||||
### 5. Escalation
|
||||
```markdown
|
||||
## Escalation
|
||||
|
||||
If issue persists after 30 minutes:
|
||||
- Slack: #backend-oncall
|
||||
- PagerDuty: Escalate to senior engineer
|
||||
- Incident Commander: Jane Doe (jane@example.com)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Anti-Patterns to Avoid
|
||||
|
||||
### 1. Alert on Everything
|
||||
❌ Don't: Alert on every warning log
|
||||
✅ Do: Alert on error rate exceeding threshold
|
||||
|
||||
### 2. Alert Without Context
|
||||
❌ Don't: "Error rate high"
|
||||
✅ Do: "Error rate 5.2% exceeds 1% threshold for 10m, impacting checkout flow"
|
||||
|
||||
### 3. Static Thresholds for Dynamic Systems
|
||||
❌ Don't: `cpu_usage > 70` (fails during scale-up)
|
||||
✅ Do: Alert on SLO violations or rate of change
|
||||
|
||||
### 4. No "for" Clause
|
||||
❌ Don't: Alert immediately on threshold breach
|
||||
✅ Do: Use `for: 5m` to avoid flapping
|
||||
|
||||
### 5. Too Many Recipients
|
||||
❌ Don't: Page 10 people for every alert
|
||||
✅ Do: Route to specific on-call rotation
|
||||
|
||||
### 6. Duplicate Alerts
|
||||
❌ Don't: Alert on both cause and symptom
|
||||
✅ Do: Alert on symptom, use inhibition for causes
|
||||
|
||||
### 7. No Runbook
|
||||
❌ Don't: Alert without guidance
|
||||
✅ Do: Include runbook_url in every alert
|
||||
|
||||
---
|
||||
|
||||
## Alert Testing
|
||||
|
||||
### Test Alert Firing
|
||||
```bash
|
||||
# Trigger test alert in Prometheus
|
||||
amtool alert add alertname=TestAlert severity=warning \
  --annotation=summary="Test alert" \
  --alertmanager.url=http://alertmanager:9093
|
||||
|
||||
# Or post directly to the Alertmanager v2 API
curl -X POST http://alertmanager:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
|
||||
-d '[{
|
||||
"labels": {"alertname": "TestAlert", "severity": "critical"},
|
||||
"annotations": {"summary": "Test critical alert"}
|
||||
}]'
|
||||
```
|
||||
|
||||
### Verify Alert Rules
|
||||
```bash
|
||||
# Check syntax
|
||||
promtool check rules alerts.yml
|
||||
|
||||
# Test expression
|
||||
promtool query instant http://prometheus:9090 \
|
||||
'sum(rate(http_requests_total{status=~"5.."}[5m]))'
|
||||
|
||||
# Unit test alerts
|
||||
promtool test rules test.yml
|
||||
```
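A minimal unit-test file for `promtool test rules`, assuming the `ServiceDown` rule from Pattern 4 is defined in `alerts.yml`:

```yaml
# test.yml
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    # No samples for up{job="my-service"}, so absent() should fire
    input_series:
      - series: 'up{job="other-service"}'
        values: '1x10'
    alert_rule_test:
      - eval_time: 10m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: my-service
            exp_annotations:
              summary: "Service my-service is not reporting metrics"
```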
|
||||
|
||||
### Test Alertmanager Routing
|
||||
```bash
|
||||
# Test which receiver an alert would go to
|
||||
amtool config routes test \
|
||||
--config.file=alertmanager.yml \
|
||||
alertname="HighLatency" \
|
||||
severity="critical" \
|
||||
team="backend"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## On-Call Best Practices
|
||||
|
||||
### Rotation Schedule
|
||||
- **Primary on-call**: First responder
|
||||
- **Secondary on-call**: Escalation backup
|
||||
- **Rotation length**: 1 week (balance load vs context)
|
||||
- **Handoff**: Monday morning (not Friday evening)
|
||||
|
||||
### On-Call Checklist
|
||||
```markdown
|
||||
## Pre-shift
|
||||
- [ ] Test pager/phone
|
||||
- [ ] Review recent incidents
|
||||
- [ ] Check upcoming deployments
|
||||
- [ ] Update contact info
|
||||
|
||||
## During shift
|
||||
- [ ] Respond to pages within 5 minutes
|
||||
- [ ] Document all incidents
|
||||
- [ ] Update runbooks if gaps found
|
||||
- [ ] Communicate in #incidents channel
|
||||
|
||||
## Post-shift
|
||||
- [ ] Hand off open incidents
|
||||
- [ ] Complete incident reports
|
||||
- [ ] Suggest improvements
|
||||
- [ ] Update team documentation
|
||||
```
|
||||
|
||||
### Escalation Policy
|
||||
1. **Primary**: Responds within 5 minutes
|
||||
2. **Secondary**: Auto-escalate after 15 minutes
|
||||
3. **Manager**: Auto-escalate after 30 minutes
|
||||
4. **Incident Commander**: Critical incidents only
|
||||
|
||||
---
|
||||
|
||||
## Metrics About Alerts
|
||||
|
||||
Monitor your monitoring system!
|
||||
|
||||
### Key Metrics
|
||||
```promql
|
||||
# Alert firing frequency
|
||||
sum(ALERTS{alertstate="firing"}) by (alertname)
|
||||
|
||||
# How long each alert has been active (ALERTS_FOR_STATE stores the activation timestamp)
time() - ALERTS_FOR_STATE
|
||||
|
||||
# Alerts per severity
|
||||
sum(ALERTS{alertstate="firing"}) by (severity)
|
||||
|
||||
# Time to acknowledge (exported from PagerDuty or your paging tool; metric name varies by exporter)
pagerduty_incident_ack_duration_seconds
|
||||
```
|
||||
|
||||
### Alert Quality Metrics
|
||||
- **Mean Time to Acknowledge (MTTA)**: < 5 minutes
|
||||
- **Mean Time to Resolve (MTTR)**: < 30 minutes
|
||||
- **False Positive Rate**: < 10%
|
||||
- **Alert Coverage**: % of incidents with preceding alert > 80%
|
||||
references/datadog_migration.md
|
||||
# Migrating from Datadog to Open-Source Stack
|
||||
|
||||
## Overview
|
||||
|
||||
This guide helps you migrate from Datadog to a cost-effective open-source observability stack:
|
||||
- **Metrics**: Datadog → Prometheus + Grafana
|
||||
- **Logs**: Datadog → Loki + Grafana
|
||||
- **Traces**: Datadog APM → Tempo/Jaeger + Grafana
|
||||
- **Dashboards**: Datadog → Grafana
|
||||
- **Alerts**: Datadog Monitors → Prometheus Alertmanager
|
||||
|
||||
**Estimated Cost Savings**: 60-80% for similar functionality
|
||||
|
||||
---
|
||||
|
||||
## Cost Comparison
|
||||
|
||||
### Example: 100-host infrastructure
|
||||
|
||||
**Datadog**:
|
||||
- Infrastructure Pro: $1,500/month (100 hosts × $15)
|
||||
- Custom Metrics: $50/month (5,000 extra metrics beyond included 10,000)
|
||||
- Logs: $2,000/month (20GB/day ingested, plus indexing and retention charges)
|
||||
- APM: $3,100/month (100 hosts × $31)
|
||||
- **Total**: ~$6,650/month ($79,800/year)
|
||||
|
||||
**Open-Source Stack** (self-hosted):
|
||||
- Infrastructure: $1,200/month (EC2/GKE for Prometheus, Grafana, Loki, Tempo)
|
||||
- Storage: $300/month (S3/GCS for long-term metrics and traces)
|
||||
- Operations time: Variable
|
||||
- **Total**: ~$1,500-2,500/month ($18,000-30,000/year)
|
||||
|
||||
**Savings**: $49,800-61,800/year
|
||||
|
||||
---
|
||||
|
||||
## Migration Strategy
|
||||
|
||||
### Phase 1: Run Parallel (Month 1-2)
|
||||
- Deploy open-source stack alongside Datadog
|
||||
- Migrate metrics first (lowest risk)
|
||||
- Validate data accuracy
|
||||
- Build confidence
|
||||
|
||||
### Phase 2: Migrate Dashboards & Alerts (Month 2-3)
|
||||
- Convert Datadog dashboards to Grafana
|
||||
- Translate alert rules
|
||||
- Train team on new tools
|
||||
|
||||
### Phase 3: Migrate Logs & Traces (Month 3-4)
|
||||
- Set up Loki for log aggregation
|
||||
- Deploy Tempo/Jaeger for tracing
|
||||
- Update application instrumentation
|
||||
|
||||
### Phase 4: Decommission Datadog (Month 4-5)
|
||||
- Confirm all functionality migrated
|
||||
- Cancel Datadog subscription
|
||||
- Archive Datadog dashboards/alerts for reference
|
||||
|
||||
---
|
||||
|
||||
## 1. Metrics Migration (Datadog → Prometheus)
|
||||
|
||||
### Step 1: Deploy Prometheus
|
||||
|
||||
**Kubernetes** (recommended):
|
||||
```yaml
|
||||
# prometheus-values.yaml
|
||||
prometheus:
|
||||
prometheusSpec:
|
||||
retention: 30d
|
||||
storageSpec:
|
||||
volumeClaimTemplate:
|
||||
spec:
|
||||
resources:
|
||||
requests:
|
||||
storage: 100Gi
|
||||
|
||||
# Scrape configs
|
||||
additionalScrapeConfigs:
|
||||
- job_name: 'kubernetes-pods'
|
||||
kubernetes_sd_configs:
|
||||
- role: pod
|
||||
```
|
||||
|
||||
**Install**:
|
||||
```bash
|
||||
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
|
||||
helm install prometheus prometheus-community/kube-prometheus-stack -f prometheus-values.yaml
|
||||
```
|
||||
|
||||
**Docker Compose**:
|
||||
```yaml
|
||||
version: '3'
|
||||
services:
|
||||
prometheus:
|
||||
image: prom/prometheus:latest
|
||||
ports:
|
||||
- "9090:9090"
|
||||
volumes:
|
||||
- ./prometheus.yml:/etc/prometheus/prometheus.yml
|
||||
- prometheus-data:/prometheus
|
||||
command:
|
||||
- '--config.file=/etc/prometheus/prometheus.yml'
|
||||
- '--storage.tsdb.retention.time=30d'
|
||||
|
||||
volumes:
|
||||
prometheus-data:
|
||||
```
|
||||
|
||||
### Step 2: Replace DogStatsD with Prometheus Exporters
|
||||
|
||||
**Before (DogStatsD)**:
|
||||
```python
|
||||
from datadog import statsd
|
||||
|
||||
statsd.increment('page.views')
|
||||
statsd.histogram('request.duration', 0.5)
|
||||
statsd.gauge('active_users', 100)
|
||||
```
|
||||
|
||||
**After (Prometheus Python client)**:
|
||||
```python
|
||||
from prometheus_client import Counter, Histogram, Gauge
|
||||
|
||||
page_views = Counter('page_views_total', 'Page views')
|
||||
request_duration = Histogram('request_duration_seconds', 'Request duration')
|
||||
active_users = Gauge('active_users', 'Active users')
|
||||
|
||||
# Usage
|
||||
page_views.inc()
|
||||
request_duration.observe(0.5)
|
||||
active_users.set(100)
|
||||
```
|
||||
|
||||
### Step 3: Metric Name Translation
|
||||
|
||||
| Datadog Metric | Prometheus Equivalent |
|
||||
|----------------|----------------------|
|
||||
| `system.cpu.idle` | `node_cpu_seconds_total{mode="idle"}` |
|
||||
| `system.mem.free` | `node_memory_MemFree_bytes` |
|
||||
| `system.disk.used` | `node_filesystem_size_bytes - node_filesystem_free_bytes` |
|
||||
| `docker.cpu.usage` | `container_cpu_usage_seconds_total` |
|
||||
| `kubernetes.pods.running` | `kube_pod_status_phase{phase="Running"}` |
|
||||
|
||||
### Step 4: Export Existing Datadog Metrics (Optional)
|
||||
|
||||
Use Datadog API to export historical data:
|
||||
|
||||
```python
|
||||
import time

from datadog import api, initialize
|
||||
|
||||
options = {
|
||||
'api_key': 'YOUR_API_KEY',
|
||||
'app_key': 'YOUR_APP_KEY'
|
||||
}
|
||||
initialize(**options)
|
||||
|
||||
# Query metric
|
||||
result = api.Metric.query(
|
||||
start=int(time.time() - 86400), # Last 24h
|
||||
end=int(time.time()),
|
||||
query='avg:system.cpu.user{*}'
|
||||
)
|
||||
|
||||
# Convert to Prometheus format and import
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 2. Dashboard Migration (Datadog → Grafana)
|
||||
|
||||
### Step 1: Export Datadog Dashboards
|
||||
|
||||
```python
|
||||
import requests
|
||||
import json
|
||||
|
||||
api_key = "YOUR_API_KEY"
|
||||
app_key = "YOUR_APP_KEY"
|
||||
|
||||
headers = {
|
||||
'DD-API-KEY': api_key,
|
||||
'DD-APPLICATION-KEY': app_key
|
||||
}
|
||||
|
||||
# Get all dashboards
|
||||
response = requests.get(
|
||||
'https://api.datadoghq.com/api/v1/dashboard',
|
||||
headers=headers
|
||||
)
|
||||
|
||||
dashboards = response.json()
|
||||
|
||||
# Export each dashboard
|
||||
for dashboard in dashboards['dashboards']:
|
||||
dash_id = dashboard['id']
|
||||
detail = requests.get(
|
||||
f'https://api.datadoghq.com/api/v1/dashboard/{dash_id}',
|
||||
headers=headers
|
||||
).json()
|
||||
|
||||
with open(f'datadog_{dash_id}.json', 'w') as f:
|
||||
json.dump(detail, f, indent=2)
|
||||
```
|
||||
|
||||
### Step 2: Convert to Grafana Format
|
||||
|
||||
**Manual Conversion Template**:
|
||||
|
||||
| Datadog Widget | Grafana Panel Type |
|
||||
|----------------|-------------------|
|
||||
| Timeseries | Graph / Time series |
|
||||
| Query Value | Stat |
|
||||
| Toplist | Table / Bar gauge |
|
||||
| Heatmap | Heatmap |
|
||||
| Distribution | Histogram |
|
||||
|
||||
**Automated Conversion** (basic example):
|
||||
```python
|
||||
def convert_datadog_to_grafana(datadog_dashboard):
    # map_widget_type() and convert_queries() are user-supplied helpers
    # (see the widget-to-panel table above for the type mapping)
|
||||
grafana_dashboard = {
|
||||
"title": datadog_dashboard['title'],
|
||||
"panels": []
|
||||
}
|
||||
|
||||
for widget in datadog_dashboard['widgets']:
|
||||
panel = {
|
||||
"title": widget['definition'].get('title', ''),
|
||||
"type": map_widget_type(widget['definition']['type']),
|
||||
"targets": convert_queries(widget['definition']['requests'])
|
||||
}
|
||||
grafana_dashboard['panels'].append(panel)
|
||||
|
||||
return grafana_dashboard
|
||||
```
|
||||
|
||||
### Step 3: Common Query Translations
|
||||
|
||||
See `dql_promql_translation.md` for comprehensive query mappings.
|
||||
|
||||
**Example conversions**:
|
||||
|
||||
```
|
||||
Datadog: avg:system.cpu.user{*}
|
||||
Prometheus: avg(rate(node_cpu_seconds_total{mode="user"}[5m])) * 100
|
||||
|
||||
Datadog: sum:requests.count{status:200}.as_rate()
|
||||
Prometheus: sum(rate(http_requests_total{status="200"}[5m]))
|
||||
|
||||
Datadog: p95:request.duration{*}
|
||||
Prometheus: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 3. Alert Migration (Datadog Monitors → Prometheus Alerts)
|
||||
|
||||
### Step 1: Export Datadog Monitors
|
||||
|
||||
```python
|
||||
import requests
|
||||
|
||||
api_key = "YOUR_API_KEY"
|
||||
app_key = "YOUR_APP_KEY"
|
||||
|
||||
headers = {
|
||||
'DD-API-KEY': api_key,
|
||||
'DD-APPLICATION-KEY': app_key
|
||||
}
|
||||
|
||||
response = requests.get(
|
||||
'https://api.datadoghq.com/api/v1/monitor',
|
||||
headers=headers
|
||||
)
|
||||
|
||||
monitors = response.json()
|
||||
|
||||
# Save each monitor
|
||||
for monitor in monitors:
|
||||
with open(f'monitor_{monitor["id"]}.json', 'w') as f:
|
||||
json.dump(monitor, f, indent=2)
|
||||
```
|
||||
|
||||
### Step 2: Convert to Prometheus Alert Rules
|
||||
|
||||
**Datadog Monitor**:
|
||||
```json
|
||||
{
|
||||
"name": "High CPU Usage",
|
||||
"type": "metric alert",
|
||||
"query": "avg(last_5m):avg:system.cpu.user{*} > 80",
|
||||
"message": "CPU usage is high on {{host.name}}"
|
||||
}
|
||||
```
|
||||
|
||||
**Prometheus Alert**:
|
||||
```yaml
|
||||
groups:
|
||||
- name: infrastructure
|
||||
rules:
|
||||
- alert: HighCPUUsage
|
||||
expr: |
|
||||
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "High CPU usage on {{ $labels.instance }}"
|
||||
description: "CPU usage is {{ $value }}%"
|
||||
```
|
||||
|
||||
### Step 3: Alert Routing (Datadog → Alertmanager)
|
||||
|
||||
**Datadog notification channels** → **Alertmanager receivers**
|
||||
|
||||
```yaml
|
||||
# alertmanager.yml
|
||||
route:
|
||||
group_by: ['alertname', 'severity']
|
||||
receiver: 'slack-notifications'
|
||||
|
||||
receivers:
|
||||
- name: 'slack-notifications'
|
||||
slack_configs:
|
||||
- api_url: 'YOUR_SLACK_WEBHOOK'
|
||||
channel: '#alerts'
|
||||
|
||||
- name: 'pagerduty-critical'
|
||||
pagerduty_configs:
|
||||
- service_key: 'YOUR_PAGERDUTY_KEY'
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4. Log Migration (Datadog → Loki)
|
||||
|
||||
### Step 1: Deploy Loki
|
||||
|
||||
**Kubernetes**:
|
||||
```bash
|
||||
helm repo add grafana https://grafana.github.io/helm-charts
|
||||
helm install loki grafana/loki-stack \
|
||||
--set loki.persistence.enabled=true \
|
||||
--set loki.persistence.size=100Gi \
|
||||
--set promtail.enabled=true
|
||||
```
|
||||
|
||||
**Docker Compose**:
|
||||
```yaml
|
||||
version: '3'
|
||||
services:
|
||||
loki:
|
||||
image: grafana/loki:latest
|
||||
ports:
|
||||
- "3100:3100"
|
||||
volumes:
|
||||
- ./loki-config.yaml:/etc/loki/local-config.yaml
|
||||
- loki-data:/loki
|
||||
|
||||
promtail:
|
||||
image: grafana/promtail:latest
|
||||
volumes:
|
||||
- /var/log:/var/log
|
||||
- ./promtail-config.yaml:/etc/promtail/config.yml
|
||||
|
||||
volumes:
|
||||
loki-data:
|
||||
```
|
||||
|
||||
### Step 2: Replace Datadog Log Forwarder
|
||||
|
||||
**Before (Datadog Agent)**:
|
||||
```yaml
|
||||
# datadog.yaml
|
||||
logs_enabled: true
|
||||
|
||||
logs_config:
|
||||
container_collect_all: true
|
||||
```
|
||||
|
||||
**After (Promtail)**:
|
||||
```yaml
|
||||
# promtail-config.yaml
|
||||
server:
|
||||
http_listen_port: 9080
|
||||
|
||||
positions:
|
||||
filename: /tmp/positions.yaml
|
||||
|
||||
clients:
|
||||
- url: http://loki:3100/loki/api/v1/push
|
||||
|
||||
scrape_configs:
|
||||
- job_name: system
|
||||
static_configs:
|
||||
- targets:
|
||||
- localhost
|
||||
labels:
|
||||
job: varlogs
|
||||
__path__: /var/log/*.log
|
||||
```
|
||||
|
||||
### Step 3: Query Translation
|
||||
|
||||
**Datadog Logs Query**:
|
||||
```
|
||||
service:my-app status:error
|
||||
```
|
||||
|
||||
**Loki LogQL**:
|
||||
```logql
|
||||
{job="my-app", level="error"}
|
||||
```
|
||||
|
||||
**More examples**:
|
||||
```
|
||||
Datadog: service:api-gateway status:error @http.status_code:>=500
|
||||
Loki: {service="api-gateway", level="error"} | json | http_status_code >= 500
|
||||
|
||||
Datadog: source:nginx "404"
|
||||
Loki: {source="nginx"} |= "404"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 5. APM Migration (Datadog APM → Tempo/Jaeger)
|
||||
|
||||
### Step 1: Choose Tracing Backend
|
||||
|
||||
- **Tempo**: Better for high volume, cheaper storage (object storage)
|
||||
- **Jaeger**: More mature, better UI, requires separate storage
|
||||
|
||||
### Step 2: Replace Datadog Tracer with OpenTelemetry
|
||||
|
||||
**Before (Datadog Python)**:
|
||||
```python
|
||||
from ddtrace import tracer
|
||||
|
||||
@tracer.wrap()
|
||||
def my_function():
|
||||
pass
|
||||
```
|
||||
|
||||
**After (OpenTelemetry)**:
|
||||
```python
|
||||
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup: register a provider and ship spans to Tempo via OTLP/gRPC
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="tempo:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("my_function")
def my_function():
    pass
|
||||
```
|
||||
|
||||
### Step 3: Deploy Tempo
|
||||
|
||||
```yaml
|
||||
# tempo.yaml
|
||||
server:
|
||||
http_listen_port: 3200
|
||||
|
||||
distributor:
|
||||
receivers:
|
||||
otlp:
|
||||
protocols:
|
||||
grpc:
|
||||
endpoint: 0.0.0.0:4317
|
||||
|
||||
storage:
|
||||
trace:
|
||||
backend: s3
|
||||
s3:
|
||||
bucket: tempo-traces
|
||||
endpoint: s3.amazonaws.com
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 6. Infrastructure Migration
|
||||
|
||||
### Recommended Architecture
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Grafana (Visualization) │
|
||||
│ - Dashboards │
|
||||
│ - Unified view │
|
||||
└─────────────────────────────────────────┘
|
||||
↓ ↓ ↓
|
||||
┌──────────────┐ ┌──────────┐ ┌──────────┐
|
||||
│ Prometheus │ │ Loki │ │ Tempo │
|
||||
│ (Metrics) │ │ (Logs) │ │ (Traces) │
|
||||
└──────────────┘ └──────────┘ └──────────┘
|
||||
↓ ↓ ↓
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Applications (OpenTelemetry) │
|
||||
└─────────────────────────────────────────┘
|
||||
```
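To give Grafana a unified view of the three backends, datasources can be provisioned declaratively. A sketch, assuming in-cluster service names `prometheus`, `loki`, and `tempo` (adjust URLs and ports to your deployment):

```yaml
# /etc/grafana/provisioning/datasources/observability.yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
```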
|
||||
|
||||
### Sizing Recommendations
|
||||
|
||||
**100-host environment**:
|
||||
|
||||
- **Prometheus**: 2-4 CPU, 8-16GB RAM, 100GB SSD
|
||||
- **Grafana**: 1 CPU, 2GB RAM
|
||||
- **Loki**: 2-4 CPU, 8GB RAM, 100GB SSD
|
||||
- **Tempo**: 2-4 CPU, 8GB RAM, S3 for storage
|
||||
- **Alertmanager**: 1 CPU, 1GB RAM
|
||||
|
||||
**Total**: ~8-16 CPU, ~30-40GB RAM, ~200GB SSD + object storage
|
||||
|
||||
---
|
||||
|
||||
## 7. Migration Checklist
|
||||
|
||||
### Pre-Migration
|
||||
- [ ] Calculate current Datadog costs
|
||||
- [ ] Identify all Datadog integrations
|
||||
- [ ] Export all dashboards
|
||||
- [ ] Export all monitors
|
||||
- [ ] Document custom metrics
|
||||
- [ ] Get stakeholder approval
|
||||
|
||||
### During Migration
|
||||
- [ ] Deploy Prometheus + Grafana
|
||||
- [ ] Deploy Loki + Promtail
|
||||
- [ ] Deploy Tempo/Jaeger (if using APM)
|
||||
- [ ] Migrate metrics instrumentation
|
||||
- [ ] Convert dashboards (top 10 critical first)
|
||||
- [ ] Convert alerts (critical alerts first)
|
||||
- [ ] Update application logging
|
||||
- [ ] Replace APM instrumentation
|
||||
- [ ] Run parallel for 2-4 weeks
|
||||
- [ ] Validate data accuracy
|
||||
- [ ] Train team on new tools
|
||||
|
||||
### Post-Migration
|
||||
- [ ] Decommission Datadog agent from all hosts
|
||||
- [ ] Cancel Datadog subscription
|
||||
- [ ] Archive Datadog configs
|
||||
- [ ] Document new workflows
|
||||
- [ ] Create runbooks for common tasks
|
||||
|
||||
---
|
||||
|
||||
## 8. Common Challenges & Solutions
|
||||
|
||||
### Challenge: Missing Datadog Features
|
||||
|
||||
**Datadog Synthetic Monitoring**:
|
||||
- Solution: Use **Blackbox Exporter** (Prometheus) or **Grafana Synthetic Monitoring**
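On the Prometheus side, a Blackbox Exporter HTTP check looks roughly like this (a sketch, assuming the exporter runs at `blackbox-exporter:9115` with an `http_2xx` module configured):

```yaml
scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://www.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```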
|
||||
|
||||
**Datadog Network Performance Monitoring**:
|
||||
- Solution: Use **Cilium Hubble** (Kubernetes) or **eBPF-based tools**
|
||||
|
||||
**Datadog RUM (Real User Monitoring)**:
|
||||
- Solution: Use **Grafana Faro** or **OpenTelemetry Browser SDK**
|
||||
|
||||
### Challenge: Team Learning Curve
|
||||
|
||||
**Solution**:
|
||||
- Provide training sessions (2-3 hours per tool)
|
||||
- Create internal documentation with examples
|
||||
- Set up sandbox environment for practice
|
||||
- Assign champions for each tool
|
||||
|
||||
### Challenge: Query Performance
|
||||
|
||||
**Prometheus too slow**:
|
||||
- Use **Thanos** or **Cortex** for scaling
|
||||
- Implement recording rules for expensive queries (see the sketch after this list)
|
||||
- Increase retention only where needed
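A recording-rule sketch for the error-rate expression used throughout this guide (the rule name is illustrative):

```yaml
groups:
  - name: recording_rules
    rules:
      - record: job:http_requests:error_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
```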
|
||||
|
||||
**Loki too slow**:
|
||||
- Add more labels for better filtering
|
||||
- Use chunk caching
|
||||
- Consider **parallel query execution**
|
||||
|
||||
---
|
||||
|
||||
## 9. Maintenance Comparison
|
||||
|
||||
### Datadog (Managed)
|
||||
- **Ops burden**: Low (fully managed)
|
||||
- **Upgrades**: Automatic
|
||||
- **Scaling**: Automatic
|
||||
- **Cost**: High ($6k-10k+/month)
|
||||
|
||||
### Open-Source Stack (Self-hosted)
|
||||
- **Ops burden**: Medium (requires ops team)
|
||||
- **Upgrades**: Manual (quarterly)
|
||||
- **Scaling**: Manual planning required
|
||||
- **Cost**: Low ($1.5k-3k/month infrastructure)
|
||||
|
||||
**Hybrid Option**: Use **Grafana Cloud** (managed Prometheus/Loki/Tempo)
|
||||
- Cost: ~$3k/month for 100 hosts
|
||||
- Ops burden: Low
|
||||
- Savings: ~50% vs Datadog
|
||||
|
||||
---
|
||||
|
||||
## 10. ROI Calculation
|
||||
|
||||
### Example Scenario
|
||||
|
||||
**Before (Datadog)**:
|
||||
- Monthly cost: $7,000
|
||||
- Annual cost: $84,000
|
||||
|
||||
**After (Self-hosted OSS)**:
|
||||
- Infrastructure: $1,800/month
|
||||
- Operations (0.5 FTE): $4,000/month
|
||||
- Annual cost: $69,600
|
||||
|
||||
**Savings**: $14,400/year
|
||||
|
||||
**After (Grafana Cloud)**:
|
||||
- Monthly cost: $3,500
|
||||
- Annual cost: $42,000
|
||||
|
||||
**Savings**: $42,000/year (50%)
|
||||
|
||||
**Break-even**: Immediate (no migration costs beyond engineering time)
|
||||
|
||||
---
|
||||
|
||||
## Resources
|
||||
|
||||
- **Prometheus**: https://prometheus.io/docs/
|
||||
- **Grafana**: https://grafana.com/docs/
|
||||
- **Loki**: https://grafana.com/docs/loki/
|
||||
- **Tempo**: https://grafana.com/docs/tempo/
|
||||
- **OpenTelemetry**: https://opentelemetry.io/
|
||||
- **Migration Tools**: https://github.com/grafana/dashboard-linter
|
||||
|
||||
---
|
||||
|
||||
## Support
|
||||
|
||||
If you need help with migration:
|
||||
- Grafana Labs offers migration consulting
|
||||
- Many SRE consulting firms specialize in this
|
||||
- Community support via Slack/Discord channels
|
||||
references/dql_promql_translation.md
|
||||
# DQL (Datadog Query Language) ↔ PromQL Translation Guide
|
||||
|
||||
## Quick Reference
|
||||
|
||||
| Concept | Datadog (DQL) | Prometheus (PromQL) |
|
||||
|---------|---------------|---------------------|
|
||||
| Aggregation | `avg:`, `sum:`, `min:`, `max:` | `avg()`, `sum()`, `min()`, `max()` |
|
||||
| Rate | `.as_rate()`, `.as_count()` | `rate()`, `increase()` |
|
||||
| Percentile | `p50:`, `p95:`, `p99:` | `histogram_quantile()` |
|
||||
| Filtering | `{tag:value}` | `{label="value"}` |
|
||||
| Time window | `last_5m`, `last_1h` | `[5m]`, `[1h]` |
|
||||
|
||||
---
|
||||
|
||||
## Basic Queries
|
||||
|
||||
### Simple Metric Query
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
system.cpu.user
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
node_cpu_seconds_total{mode="user"}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Metric with Filter
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
system.cpu.user{host:web-01}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
node_cpu_seconds_total{mode="user", instance="web-01"}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Multiple Filters (AND)
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
system.cpu.user{host:web-01,env:production}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
node_cpu_seconds_total{mode="user", instance="web-01", env="production"}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Wildcard Filters
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
system.cpu.user{host:web-*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
node_cpu_seconds_total{mode="user", instance=~"web-.*"}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### OR Filters
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
system.cpu.user{host:web-01 OR host:web-02}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
node_cpu_seconds_total{mode="user", instance=~"web-01|web-02"}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Aggregations
|
||||
|
||||
### Average
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg:system.cpu.user{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg(node_cpu_seconds_total{mode="user"})
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Sum
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:requests.count{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum(http_requests_total)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Min/Max
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
min:system.mem.free{*}
|
||||
max:system.mem.free{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
min(node_memory_MemFree_bytes)
|
||||
max(node_memory_MemFree_bytes)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Aggregation by Tag/Label
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg:system.cpu.user{*} by {host}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg by (instance) (node_cpu_seconds_total{mode="user"})
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Rates and Counts
|
||||
|
||||
### Rate (per second)
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:requests.count{*}.as_rate()
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum(rate(http_requests_total[5m]))
|
||||
```
|
||||
|
||||
Note: Prometheus requires explicit time window `[5m]`
|
||||
|
||||
---
|
||||
|
||||
### Count (total over time)
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:requests.count{*}.as_count()
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum(increase(http_requests_total[1h]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Derivative (change over time)
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
derivative(avg:system.disk.used{*})
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
deriv((node_filesystem_size_bytes - node_filesystem_free_bytes)[5m:])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Percentiles
|
||||
|
||||
### P50 (Median)
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
p50:request.duration{*}
|
||||
```
|
||||
|
||||
**Prometheus** (requires histogram):
|
||||
```promql
|
||||
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### P95
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
p95:request.duration{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### P99
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
p99:request.duration{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Time Windows
|
||||
|
||||
### Last 5 minutes
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg(last_5m):system.cpu.user{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg_over_time(node_cpu_seconds_total{mode="user"}[5m])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Last 1 hour
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg(last_1h):system.cpu.user{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg_over_time(node_cpu_seconds_total{mode="user"}[1h])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Math Operations
|
||||
|
||||
### Division
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg:system.mem.used{*} / avg:system.mem.total{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Multiplication
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg:system.cpu.user{*} * 100
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg(node_cpu_seconds_total{mode="user"}) * 100
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Percentage Calculation
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
(sum:requests.errors{*} / sum:requests.count{*}) * 100
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Common Use Cases
|
||||
|
||||
### CPU Usage Percentage
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
100 - avg:system.cpu.idle{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Memory Usage Percentage
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
(avg:system.mem.used{*} / avg:system.mem.total{*}) * 100
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Disk Usage Percentage
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
(avg:system.disk.used{*} / avg:system.disk.total{*}) * 100
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Request Rate (requests/sec)
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:requests.count{*}.as_rate()
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum(rate(http_requests_total[5m]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Error Rate Percentage
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
(sum:requests.errors{*}.as_rate() / sum:requests.count{*}.as_rate()) * 100
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Request Latency (P95)
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
p95:request.duration{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Top 5 Hosts by CPU
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
top(avg:system.cpu.user{*} by {host}, 5, 'mean', 'desc')
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
topk(5, avg by (instance) (rate(node_cpu_seconds_total{mode="user"}[5m])))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Functions
|
||||
|
||||
### Absolute Value
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
abs(diff(avg:system.cpu.user{*}))
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
abs(delta(node_cpu_seconds_total{mode="user"}[5m]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Ceiling/Floor
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
ceil(avg:system.cpu.user{*})
|
||||
floor(avg:system.cpu.user{*})
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
ceil(avg(node_cpu_seconds_total{mode="user"}))
|
||||
floor(avg(node_cpu_seconds_total{mode="user"}))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Clamp (Limit Range)
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
clamp_min(avg:system.cpu.user{*}, 0)
|
||||
clamp_max(avg:system.cpu.user{*}, 100)
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
clamp_min(avg(node_cpu_seconds_total{mode="user"}), 0)
|
||||
clamp_max(avg(node_cpu_seconds_total{mode="user"}), 100)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Moving Average
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
moving_rollup(avg:system.cpu.user{*}, 60, 'avg')
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg_over_time(node_cpu_seconds_total{mode="user"}[1h])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Advanced Patterns
|
||||
|
||||
### Compare to Previous Period
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:requests.count{*}.as_rate() / timeshift(sum:requests.count{*}.as_rate(), 3600)
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum(rate(http_requests_total[5m])) / sum(rate(http_requests_total[5m] offset 1h))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Forecast
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
forecast(avg:system.disk.used{*}, 'linear', 1)
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
predict_linear((node_filesystem_size_bytes - node_filesystem_free_bytes)[1h:], 3600)
|
||||
```
|
||||
|
||||
Note: Predicts value 1 hour in future based on last 1 hour trend
|
||||
|
||||
---
|
||||
|
||||
### Anomaly Detection
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
anomalies(avg:system.cpu.user{*}, 'basic', 2)
|
||||
```
|
||||
|
||||
**Prometheus**: No built-in function
|
||||
- Use recording rules with stddev
|
||||
- External tools like **Robust Perception's anomaly detector**
|
||||
- Or use **Grafana ML** plugin
|
||||
|
||||
---
|
||||
|
||||
### Outlier Detection
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
outliers(avg:system.cpu.user{*} by {host}, 'mad')
|
||||
```
|
||||
|
||||
**Prometheus**: No built-in function
|
||||
- Calculate manually with stddev:
|
||||
```promql
|
||||
abs(metric - scalar(avg(metric))) > 2 * scalar(stddev(metric))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Container & Kubernetes
|
||||
|
||||
### Container CPU Usage
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg:docker.cpu.usage{*} by {container_name}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg by (container) (rate(container_cpu_usage_seconds_total[5m]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Container Memory Usage
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg:docker.mem.rss{*} by {container_name}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg by (container) (container_memory_rss)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Pod Count by Status
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:kubernetes.pods.running{*} by {kube_namespace}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum by (namespace) (kube_pod_status_phase{phase="Running"})
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Database Queries
|
||||
|
||||
### MySQL Queries Per Second
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:mysql.performance.queries{*}.as_rate()
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum(rate(mysql_global_status_queries[5m]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### PostgreSQL Active Connections
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg:postgresql.connections{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg(pg_stat_database_numbackends)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Redis Memory Usage
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg:redis.mem.used{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg(redis_memory_used_bytes)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Network Metrics
|
||||
|
||||
### Network Bytes Sent
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:system.net.bytes_sent{*}.as_rate()
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum(rate(node_network_transmit_bytes_total[5m]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Network Bytes Received
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:system.net.bytes_rcvd{*}.as_rate()
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum(rate(node_network_receive_bytes_total[5m]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Key Differences
|
||||
|
||||
### 1. Time Windows
|
||||
- **Datadog**: Optional, defaults to query time range
|
||||
- **Prometheus**: Always required for rate/increase functions
|
||||
|
||||
### 2. Histograms
|
||||
- **Datadog**: Percentiles available directly
|
||||
- **Prometheus**: Requires histogram buckets + `histogram_quantile()`
|
||||
|
||||
### 3. Default Aggregation
|
||||
- **Datadog**: No default, must specify
|
||||
- **Prometheus**: Returns all time series unless aggregated
|
||||
|
||||
### 4. Metric Types
|
||||
- **Datadog**: All metrics treated similarly
|
||||
- **Prometheus**: Explicit types (counter, gauge, histogram, summary)
|
||||
|
||||
### 5. Tag vs Label
|
||||
- **Datadog**: Uses "tags" (key:value)
|
||||
- **Prometheus**: Uses "labels" (key="value")
|
||||
|
||||
---
|
||||
|
||||
## Migration Tips
|
||||
|
||||
1. **Start with dashboards**: Convert most-used dashboards first
|
||||
2. **Use recording rules**: Pre-calculate expensive PromQL queries
|
||||
3. **Test in parallel**: Run both systems during migration
|
||||
4. **Document mappings**: Create team-specific translation guide
|
||||
5. **Train team**: PromQL has learning curve, invest in training
|
||||
|
||||
---
|
||||
|
||||
## Tools
|
||||
|
||||
- **Datadog Dashboard Exporter**: Export JSON dashboards
|
||||
- **Grafana Dashboard Linter**: Validate converted dashboards
|
||||
- **PromQL Learning Resources**: https://prometheus.io/docs/prometheus/latest/querying/basics/
|
||||
|
||||
---
|
||||
|
||||
## Common Gotchas
|
||||
|
||||
### Rate without Time Window
|
||||
|
||||
❌ **Wrong**:
|
||||
```promql
|
||||
rate(http_requests_total)
|
||||
```
|
||||
|
||||
✅ **Correct**:
|
||||
```promql
|
||||
rate(http_requests_total[5m])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Aggregating Before Rate
|
||||
|
||||
❌ **Wrong**:
|
||||
```promql
|
||||
rate(sum(http_requests_total)[5m])
|
||||
```
|
||||
|
||||
✅ **Correct**:
|
||||
```promql
|
||||
sum(rate(http_requests_total[5m]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Histogram Quantile Without by (le)
|
||||
|
||||
❌ **Wrong**:
|
||||
```promql
|
||||
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
|
||||
```
|
||||
|
||||
✅ **Correct**:
|
||||
```promql
|
||||
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Quick Conversion Checklist
|
||||
|
||||
When converting a Datadog query to PromQL:
|
||||
|
||||
- [ ] Replace metric name (e.g., `system.cpu.user` → `node_cpu_seconds_total`)
|
||||
- [ ] Convert tags to labels (`{tag:value}` → `{label="value"}`)
|
||||
- [ ] Add time window for rate/increase (`[5m]`)
|
||||
- [ ] Change aggregation syntax (`avg:` → `avg()`)
|
||||
- [ ] Convert percentiles to histogram_quantile if needed
|
||||
- [ ] Test query in Prometheus before adding to dashboard
|
||||
- [ ] Add `by (label)` for grouped aggregations
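Applying the checklist end to end to a typical request-rate query (assuming the Prometheus counter is `http_requests_total` with `env` and `service` labels, as in the earlier examples):

```
Datadog:    sum:requests.count{env:production} by {service}.as_rate()
Prometheus: sum by (service) (rate(http_requests_total{env="production"}[5m]))
```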
|
||||
|
||||
---
|
||||
|
||||
## Need More Help?
|
||||
|
||||
- See `datadog_migration.md` for full migration guide
|
||||
- PromQL documentation: https://prometheus.io/docs/prometheus/latest/querying/
|
||||
- Practice at: https://demo.promlens.com/
|
||||
references/logging_guide.md
|
||||
# Logging Guide
|
||||
|
||||
## Structured Logging
|
||||
|
||||
### Why Structured Logs?
|
||||
|
||||
**Unstructured** (text):
|
||||
```
|
||||
2024-10-28 14:32:15 User john@example.com logged in from 192.168.1.1
|
||||
```
|
||||
|
||||
**Structured** (JSON):
|
||||
```json
|
||||
{
|
||||
"timestamp": "2024-10-28T14:32:15Z",
|
||||
"level": "info",
|
||||
"message": "User logged in",
|
||||
"user": "john@example.com",
|
||||
"ip": "192.168.1.1",
|
||||
"event_type": "user_login"
|
||||
}
|
||||
```
|
||||
|
||||
**Benefits**:
|
||||
- Easy to parse and query
|
||||
- Consistent format
|
||||
- Machine-readable
|
||||
- Efficient storage and indexing
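For example, once the fields are structured, querying by any of them is a one-liner. In Loki's LogQL (a sketch, assuming the log line is shipped as JSON and the log shipper attaches a `service` label):

```logql
{service="api-gateway"} | json | user="john@example.com"
```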
|
||||
|
||||
---
|
||||
|
||||
## Log Levels
|
||||
|
||||
Use appropriate log levels for better filtering and alerting.
|
||||
|
||||
### DEBUG
|
||||
**When**: Development, troubleshooting
|
||||
**Examples**:
|
||||
- Function entry/exit
|
||||
- Variable values
|
||||
- Internal state changes
|
||||
|
||||
```python
|
||||
logger.debug("Processing request", extra={
|
||||
"request_id": req_id,
|
||||
"params": params
|
||||
})
|
||||
```
|
||||
|
||||
### INFO
|
||||
**When**: Important business events
|
||||
**Examples**:
|
||||
- User actions (login, purchase)
|
||||
- System state changes (started, stopped)
|
||||
- Significant milestones
|
||||
|
||||
```python
|
||||
logger.info("Order placed", extra={
|
||||
"order_id": "12345",
|
||||
"user_id": "user123",
|
||||
"amount": 99.99
|
||||
})
|
||||
```
|
||||
|
||||
### WARN
|
||||
**When**: Potentially problematic situations
|
||||
**Examples**:
|
||||
- Deprecated API usage
|
||||
- Slow operations (but not failing)
|
||||
- Retry attempts
|
||||
- Resource usage approaching limits
|
||||
|
||||
```python
|
||||
logger.warning("API response slow", extra={
|
||||
"endpoint": "/api/users",
|
||||
"duration_ms": 2500,
|
||||
"threshold_ms": 1000
|
||||
})
|
||||
```
|
||||
|
||||
### ERROR
|
||||
**When**: Error conditions that need attention
|
||||
**Examples**:
|
||||
- Failed requests
|
||||
- Exceptions caught and handled
|
||||
- Integration failures
|
||||
- Data validation errors
|
||||
|
||||
```python
|
||||
logger.error("Payment processing failed", extra={
|
||||
"order_id": "12345",
|
||||
"error": str(e),
|
||||
"payment_gateway": "stripe"
|
||||
}, exc_info=True)
|
||||
```
|
||||
|
||||
### FATAL/CRITICAL
|
||||
**When**: Severe errors causing shutdown
|
||||
**Examples**:
|
||||
- Database connection lost
|
||||
- Out of memory
|
||||
- Configuration errors preventing startup
|
||||
|
||||
```python
|
||||
logger.critical("Database connection lost", extra={
|
||||
"database": "postgres",
|
||||
"host": "db.example.com",
|
||||
"attempt": 3
|
||||
})
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Required Fields
|
||||
|
||||
Every log entry should include:
|
||||
|
||||
### 1. Timestamp
|
||||
ISO 8601 format with timezone:
|
||||
```json
|
||||
{
|
||||
"timestamp": "2024-10-28T14:32:15.123Z"
|
||||
}
|
||||
```
|
||||
|
||||
### 2. Level
|
||||
Standard levels: debug, info, warn, error, critical
|
||||
```json
|
||||
{
|
||||
"level": "error"
|
||||
}
|
||||
```
|
||||
|
||||
### 3. Message
|
||||
Human-readable description:
|
||||
```json
|
||||
{
|
||||
"message": "User authentication failed"
|
||||
}
|
||||
```
|
||||
|
||||
### 4. Service/Application
|
||||
What component logged this:
|
||||
```json
|
||||
{
|
||||
"service": "api-gateway",
|
||||
"version": "1.2.3"
|
||||
}
|
||||
```
|
||||
|
||||
### 5. Environment
|
||||
```json
|
||||
{
|
||||
"environment": "production"
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Recommended Fields
|
||||
|
||||
### Request Context
|
||||
```json
|
||||
{
|
||||
"request_id": "550e8400-e29b-41d4-a716-446655440000",
|
||||
"user_id": "user123",
|
||||
"session_id": "sess_abc",
|
||||
"ip_address": "192.168.1.1",
|
||||
"user_agent": "Mozilla/5.0..."
|
||||
}
|
||||
```
|
||||
|
||||
### Performance Metrics
|
||||
```json
|
||||
{
|
||||
"duration_ms": 245,
|
||||
"response_size_bytes": 1024
|
||||
}
|
||||
```
|
||||
|
||||
### Error Details
|
||||
```json
|
||||
{
|
||||
"error_type": "ValidationError",
|
||||
"error_message": "Invalid email format",
|
||||
"stack_trace": "...",
|
||||
"error_code": "VAL_001"
|
||||
}
|
||||
```
|
||||
|
||||
### Business Context
|
||||
```json
|
||||
{
|
||||
"order_id": "ORD-12345",
|
||||
"customer_id": "CUST-789",
|
||||
"transaction_amount": 99.99,
|
||||
"payment_method": "credit_card"
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Implementation Examples
|
||||
|
||||
### Python (using structlog)
|
||||
```python
|
||||
import structlog
|
||||
|
||||
logger = structlog.get_logger()
|
||||
|
||||
# Configure structured logging
|
||||
structlog.configure(
|
||||
processors=[
|
||||
structlog.processors.TimeStamper(fmt="iso"),
|
||||
structlog.processors.add_log_level,
|
||||
structlog.processors.JSONRenderer()
|
||||
]
|
||||
)
|
||||
|
||||
# Usage
|
||||
logger.info(
|
||||
"user_logged_in",
|
||||
user_id="user123",
|
||||
ip_address="192.168.1.1",
|
||||
login_method="oauth"
|
||||
)
|
||||
```
|
||||
|
||||
### Node.js (using Winston)
|
||||
```javascript
|
||||
const winston = require('winston');
|
||||
|
||||
const logger = winston.createLogger({
|
||||
format: winston.format.json(),
|
||||
defaultMeta: { service: 'api-gateway' },
|
||||
transports: [
|
||||
new winston.transports.Console()
|
||||
]
|
||||
});
|
||||
|
||||
logger.info('User logged in', {
|
||||
userId: 'user123',
|
||||
ipAddress: '192.168.1.1',
|
||||
loginMethod: 'oauth'
|
||||
});
|
||||
```
|
||||
|
||||
### Go (using zap)
|
||||
```go
|
||||
import "go.uber.org/zap"
|
||||
|
||||
logger, _ := zap.NewProduction()
|
||||
defer logger.Sync()
|
||||
|
||||
logger.Info("User logged in",
|
||||
zap.String("userId", "user123"),
|
||||
zap.String("ipAddress", "192.168.1.1"),
|
||||
zap.String("loginMethod", "oauth"),
|
||||
)
|
||||
```
|
||||
|
||||
### Java (using Logback with JSON)
|
||||
```java
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
import net.logstash.logback.argument.StructuredArguments;
|
||||
|
||||
Logger logger = LoggerFactory.getLogger(MyClass.class);
|
||||
|
||||
logger.info("User logged in",
|
||||
StructuredArguments.kv("userId", "user123"),
|
||||
StructuredArguments.kv("ipAddress", "192.168.1.1"),
|
||||
StructuredArguments.kv("loginMethod", "oauth")
|
||||
);
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Log Aggregation Patterns
|
||||
|
||||
### Pattern 1: ELK Stack (Elasticsearch, Logstash, Kibana)
|
||||
|
||||
**Architecture**:
|
||||
```
|
||||
Application → Filebeat → Logstash → Elasticsearch → Kibana
|
||||
```
|
||||
|
||||
**filebeat.yml**:
|
||||
```yaml
|
||||
filebeat.inputs:
|
||||
- type: log
|
||||
enabled: true
|
||||
paths:
|
||||
- /var/log/app/*.log
|
||||
json.keys_under_root: true
|
||||
json.add_error_key: true
|
||||
|
||||
output.logstash:
|
||||
hosts: ["logstash:5044"]
|
||||
```
|
||||
|
||||
**logstash.conf**:
|
||||
```
|
||||
input {
|
||||
beats {
|
||||
port => 5044
|
||||
}
|
||||
}
|
||||
|
||||
filter {
|
||||
json {
|
||||
source => "message"
|
||||
}
|
||||
|
||||
date {
|
||||
match => ["timestamp", "ISO8601"]
|
||||
}
|
||||
|
||||
grok {
|
||||
match => { "message" => "%{COMBINEDAPACHELOG}" }
|
||||
}
|
||||
}
|
||||
|
||||
output {
|
||||
elasticsearch {
|
||||
hosts => ["elasticsearch:9200"]
|
||||
index => "app-logs-%{+YYYY.MM.dd}"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Pattern 2: Loki (Grafana Loki)
|
||||
|
||||
**Architecture**:
|
||||
```
|
||||
Application → Promtail → Loki → Grafana
|
||||
```
|
||||
|
||||
**promtail-config.yml**:
|
||||
```yaml
|
||||
server:
|
||||
http_listen_port: 9080
|
||||
|
||||
positions:
|
||||
filename: /tmp/positions.yaml
|
||||
|
||||
clients:
|
||||
- url: http://loki:3100/loki/api/v1/push
|
||||
|
||||
scrape_configs:
|
||||
- job_name: app
|
||||
static_configs:
|
||||
- targets:
|
||||
- localhost
|
||||
labels:
|
||||
job: app
|
||||
__path__: /var/log/app/*.log
|
||||
pipeline_stages:
|
||||
- json:
|
||||
expressions:
|
||||
level: level
|
||||
timestamp: timestamp
|
||||
- labels:
|
||||
level:
|
||||
service:
|
||||
- timestamp:
|
||||
source: timestamp
|
||||
format: RFC3339
|
||||
```
|
||||
|
||||
**Query in Grafana**:
|
||||
```logql
|
||||
{job="app"} |= "error" | json | level="error"
|
||||
```
|
||||
|
||||
### Pattern 3: CloudWatch Logs
|
||||
|
||||
**Install CloudWatch agent**:
|
||||
```json
|
||||
{
|
||||
"logs": {
|
||||
"logs_collected": {
|
||||
"files": {
|
||||
"collect_list": [
|
||||
{
|
||||
"file_path": "/var/log/app/*.log",
|
||||
"log_group_name": "/aws/app/production",
|
||||
"log_stream_name": "{instance_id}",
|
||||
"timezone": "UTC"
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Query with CloudWatch Insights**:
|
||||
```
|
||||
fields @timestamp, level, message, user_id
|
||||
| filter level = "error"
|
||||
| sort @timestamp desc
|
||||
| limit 100
|
||||
```
|
||||
|
||||
### Pattern 4: Fluentd/Fluent Bit
|
||||
|
||||
**fluent-bit.conf**:
|
||||
```
|
||||
[INPUT]
|
||||
Name tail
|
||||
Path /var/log/app/*.log
|
||||
Parser json
|
||||
Tag app.*
|
||||
|
||||
[FILTER]
|
||||
Name record_modifier
|
||||
Match *
|
||||
Record hostname ${HOSTNAME}
|
||||
Record cluster production
|
||||
|
||||
[OUTPUT]
|
||||
Name es
|
||||
Match *
|
||||
Host elasticsearch
|
||||
Port 9200
|
||||
Index app-logs
|
||||
Type _doc
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Query Patterns
|
||||
|
||||
### Find Errors in Time Range
|
||||
**Elasticsearch**:
|
||||
```json
|
||||
GET /app-logs-*/_search
|
||||
{
|
||||
"query": {
|
||||
"bool": {
|
||||
"must": [
|
||||
{ "match": { "level": "error" } },
|
||||
{ "range": { "@timestamp": {
|
||||
"gte": "now-1h",
|
||||
"lte": "now"
|
||||
}}}
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Loki (LogQL)**:
|
||||
```logql
|
||||
{job="app", level="error"} |= "error"
|
||||
```
|
||||
|
||||
**CloudWatch Insights**:
|
||||
```
|
||||
fields @timestamp, @message
|
||||
| filter level = "error"
|
||||
| filter @timestamp > ago(1h)
|
||||
```
|
||||
|
||||
### Count Errors by Type
|
||||
**Elasticsearch**:
|
||||
```json
|
||||
GET /app-logs-*/_search
|
||||
{
|
||||
"size": 0,
|
||||
"query": { "match": { "level": "error" } },
|
||||
"aggs": {
|
||||
"error_types": {
|
||||
"terms": { "field": "error_type.keyword" }
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Loki**:
|
||||
```logql
|
||||
sum by (error_type) (count_over_time({job="app"} | json | level="error" [1h]))
|
||||
```
|
||||
|
||||
### Find Slow Requests
|
||||
**Elasticsearch**:
|
||||
```json
|
||||
GET /app-logs-*/_search
|
||||
{
|
||||
"query": {
|
||||
"range": { "duration_ms": { "gte": 1000 } }
|
||||
},
|
||||
"sort": [ { "duration_ms": "desc" } ]
|
||||
}
|
||||
```
|
||||
|
||||
### Trace Request Through Services
|
||||
**Elasticsearch** (using request_id):
|
||||
```json
|
||||
GET /_search
|
||||
{
|
||||
"query": {
|
||||
"match": { "request_id": "550e8400-e29b-41d4-a716-446655440000" }
|
||||
},
|
||||
"sort": [ { "@timestamp": "asc" } ]
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Sampling and Rate Limiting
|
||||
|
||||
### When to Sample
|
||||
- **High volume services**: > 10,000 logs/second
|
||||
- **Debug logs in production**: Sample 1-10%
|
||||
- **Cost optimization**: Reduce storage costs
|
||||
|
||||
### Sampling Strategies
|
||||
|
||||
**1. Random Sampling**:
|
||||
```python
|
||||
import random
|
||||
|
||||
if random.random() < 0.1: # Sample 10%
|
||||
logger.debug("Debug message", ...)
|
||||
```
|
||||
|
||||
**2. Rate Limiting**:
|
||||
```python
|
||||
from rate_limiter import RateLimiter  # illustrative module; any token-bucket limiter works
|
||||
|
||||
limiter = RateLimiter(max_per_second=100)
|
||||
|
||||
if limiter.allow():
|
||||
logger.info("Rate limited log", ...)
|
||||
```
|
||||
|
||||
**3. Error-Biased Sampling**:
|
||||
```python
|
||||
# Always log errors, sample successful requests
|
||||
if level == "error" or random.random() < 0.01:
|
||||
logger.log(level, message, ...)
|
||||
```
|
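The same error-biased rule can be packaged as a `logging.Filter` so individual call sites stay unchanged. A sketch (the class name and the 1% sample rate are assumptions; it keeps warnings and errors, samples everything below):

```python
import logging
import random

class ErrorBiasedSampler(logging.Filter):
    """Keep every WARNING-and-above record, sample the rest."""
    def __init__(self, sample_rate: float = 0.01):
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True  # never drop warnings or errors
        return random.random() < self.sample_rate

logger = logging.getLogger("app")
logger.addFilter(ErrorBiasedSampler(sample_rate=0.01))
```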
||||
|
||||
**4. Head-Based Sampling** (trace-aware):
|
||||
```python
|
||||
# If trace is sampled, log all related logs
|
||||
if trace_context.is_sampled():
|
||||
logger.info("Traced log", trace_id=trace_context.trace_id)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Log Retention
|
||||
|
||||
### Retention Strategy
|
||||
|
||||
**Hot tier** (fast SSD): 7-30 days
|
||||
- Recent logs
|
||||
- Full query performance
|
||||
- High cost
|
||||
|
||||
**Warm tier** (regular disk): 30-90 days
|
||||
- Older logs
|
||||
- Slower queries acceptable
|
||||
- Medium cost
|
||||
|
||||
**Cold tier** (object storage): 90+ days
|
||||
- Archive logs
|
||||
- Query via restore
|
||||
- Low cost
|
||||
|
||||
### Example: Elasticsearch ILM Policy
|
||||
```json
|
||||
{
|
||||
"policy": {
|
||||
"phases": {
|
||||
"hot": {
|
||||
"actions": {
|
||||
"rollover": {
|
||||
"max_size": "50GB",
|
||||
"max_age": "1d"
|
||||
}
|
||||
}
|
||||
},
|
||||
"warm": {
|
||||
"min_age": "7d",
|
||||
"actions": {
|
||||
"allocate": { "number_of_replicas": 1 },
|
||||
"shrink": { "number_of_shards": 1 }
|
||||
}
|
||||
},
|
||||
"cold": {
|
||||
"min_age": "30d",
|
||||
"actions": {
|
||||
"allocate": { "require": { "box_type": "cold" } }
|
||||
}
|
||||
},
|
||||
"delete": {
|
||||
"min_age": "90d",
|
||||
"actions": {
|
||||
"delete": {}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Security and Compliance
|
||||
|
||||
### PII Redaction
|
||||
|
||||
**Before logging**:
|
||||
```python
|
||||
import re
|
||||
|
||||
def redact_pii(data):
|
||||
# Redact email
|
||||
data = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
|
||||
'[EMAIL]', data)
|
||||
# Redact credit card
|
||||
data = re.sub(r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',
|
||||
'[CARD]', data)
|
||||
# Redact SSN
|
||||
data = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', data)
|
||||
return data
|
||||
|
||||
logger.info("User data", user_input=redact_pii(user_input))
|
||||
```
|
||||
|
||||
**In Logstash**:
|
||||
```
|
||||
filter {
|
||||
mutate {
|
||||
gsub => [
|
||||
"message", "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b", "[EMAIL]",
|
||||
"message", "\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b", "[CARD]"
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Access Control
|
||||
|
||||
**Elasticsearch** (with Security):
|
||||
```yaml
|
||||
# Role for developers
|
||||
dev_logs:
|
||||
indices:
|
||||
- names: ['app-logs-*']
|
||||
privileges: ['read']
|
||||
query: '{"match": {"environment": "development"}}'
|
||||
```
|
||||
|
||||
**CloudWatch** (IAM Policy):
|
||||
```json
|
||||
{
|
||||
"Effect": "Allow",
|
||||
"Action": [
|
||||
"logs:DescribeLogGroups",
|
||||
"logs:GetLogEvents",
|
||||
"logs:FilterLogEvents"
|
||||
],
|
||||
"Resource": "arn:aws:logs:*:*:log-group:/aws/app/production:*"
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
### 1. Logging Sensitive Data
|
||||
❌ `logger.info("Login", password=password)`
|
||||
✅ `logger.info("Login", user_id=user_id)`
|
||||
|
||||
### 2. Excessive Logging
|
||||
❌ Logging every iteration of a loop
|
||||
✅ Log aggregate results or sample
|
||||
|
||||
### 3. Not Including Context
|
||||
❌ `logger.error("Failed")`
|
||||
✅ `logger.error("Payment failed", order_id=order_id, error=str(e))`
|
||||
|
||||
### 4. Inconsistent Formats
|
||||
❌ Mix of JSON and plain text
|
||||
✅ Pick one format and stick to it
|
||||
|
||||
### 5. No Request IDs
|
||||
❌ Can't trace request across services
|
||||
✅ Generate and propagate request_id
|
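One low-friction way to do this in Python is a context variable that a handler-level filter stamps onto every record. A sketch, assuming the `X-Request-ID` header convention (helper names are illustrative):

```python
import contextvars
import logging
import uuid

request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Stamp the current request_id onto every record passing the handler."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True

def start_request(headers: dict) -> None:
    # Reuse the upstream id if present, otherwise mint one, and forward it downstream.
    request_id_var.set(headers.get("X-Request-ID") or str(uuid.uuid4()))

handler = logging.StreamHandler()
handler.addFilter(RequestIdFilter())
logging.basicConfig(handlers=[handler],
                    format="%(asctime)s %(levelname)s request_id=%(request_id)s %(message)s")
```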
||||
|
||||
### 6. Logging to Multiple Places
|
||||
❌ Log to file AND stdout AND syslog
|
||||
✅ Log to stdout, let agent handle routing
|
||||
|
||||
### 7. Blocking on Log Writes
|
||||
❌ Synchronous writes to remote systems
|
||||
✅ Asynchronous buffered writes
|
||||
|
||||
---
|
||||
|
||||
## Performance Optimization
|
||||
|
||||
### 1. Async Logging
|
||||
```python
|
||||
import logging
|
||||
from logging.handlers import QueueHandler, QueueListener
|
||||
import queue
|
||||
|
||||
# Create queue
|
||||
log_queue = queue.Queue()
|
||||
|
||||
# Configure async handler
|
||||
logger = logging.getLogger(__name__)
queue_handler = QueueHandler(log_queue)
|
||||
logger.addHandler(queue_handler)
|
||||
|
||||
# Process logs in background thread
|
||||
listener = QueueListener(log_queue, *handlers)  # handlers = the real sinks (file, HTTP, ...) doing the slow I/O
|
||||
listener.start()
|
||||
```
|
||||
|
||||
### 2. Conditional Logging
|
||||
```python
|
||||
# Avoid expensive operations if not logging
|
||||
if logger.isEnabledFor(logging.DEBUG):
|
||||
logger.debug("Details", data=expensive_serialization(obj))
|
||||
```
|
||||
|
||||
### 3. Batching
|
||||
```python
|
||||
# Batch logs before sending
|
||||
batch = []
|
||||
for log in logs:
|
||||
batch.append(log)
|
||||
if len(batch) >= 100:
|
||||
send_to_aggregator(batch)
|
||||
batch = []
|
||||
```
|
||||
|
||||
### 4. Compression
|
||||
```yaml
|
||||
# Filebeat with compression
|
||||
output.logstash:
|
||||
hosts: ["logstash:5044"]
|
||||
compression_level: 3
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Monitoring Log Pipeline
|
||||
|
||||
Track pipeline health with metrics:
|
||||
|
||||
```promql
|
||||
# Log ingestion rate
|
||||
rate(logs_ingested_total[5m])
|
||||
|
||||
# Pipeline lag
|
||||
log_processing_lag_seconds
|
||||
|
||||
# Dropped logs
|
||||
rate(logs_dropped_total[5m])
|
||||
|
||||
# Error parsing rate
|
||||
rate(logs_parse_errors_total[5m])
|
||||
```
|
||||
|
||||
Alert on:
|
||||
- Sudden drop in log volume (service down?)
|
||||
- High parse error rate (format changed?)
|
||||
- Pipeline lag > 1 minute (capacity issue?)
|
||||
406
references/metrics_design.md
Normal file
@@ -0,0 +1,406 @@
|
||||
# Metrics Design Guide
|
||||
|
||||
## The Four Golden Signals
|
||||
|
||||
The Four Golden Signals from Google's SRE book provide a comprehensive view of system health:
|
||||
|
||||
### 1. Latency
|
||||
**What**: Time to service a request
|
||||
|
||||
**Why Monitor**: Directly impacts user experience
|
||||
|
||||
**Key Metrics**:
|
||||
- Request duration (p50, p95, p99, p99.9)
|
||||
- Time to first byte (TTFB)
|
||||
- Backend processing time
|
||||
- Database query latency
|
||||
|
||||
**PromQL Examples**:
|
||||
```promql
|
||||
# P95 latency
|
||||
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
|
||||
# Average latency by endpoint
|
||||
avg(rate(http_request_duration_seconds_sum[5m])) by (endpoint)
|
||||
/
|
||||
avg(rate(http_request_duration_seconds_count[5m])) by (endpoint)
|
||||
```
|
||||
|
||||
**Alert Thresholds**:
|
||||
- Warning: p95 > 500ms
|
||||
- Critical: p99 > 2s
|
||||
|
||||
### 2. Traffic
|
||||
**What**: Demand on your system
|
||||
|
||||
**Why Monitor**: Understand load patterns, capacity planning
|
||||
|
||||
**Key Metrics**:
|
||||
- Requests per second (RPS)
|
||||
- Transactions per second (TPS)
|
||||
- Concurrent connections
|
||||
- Network throughput
|
||||
|
||||
**PromQL Examples**:
|
||||
```promql
|
||||
# Requests per second
|
||||
sum(rate(http_requests_total[5m]))
|
||||
|
||||
# Requests per second by status code
|
||||
sum(rate(http_requests_total[5m])) by (status)
|
||||
|
||||
# Traffic growth rate (week over week)
|
||||
sum(rate(http_requests_total[5m]))
|
||||
/
|
||||
sum(rate(http_requests_total[5m] offset 7d))
|
||||
```
|
||||
|
||||
**Alert Thresholds**:
|
||||
- Warning: RPS > 80% of capacity
|
||||
- Critical: RPS > 95% of capacity
|
||||
|
||||
### 3. Errors
|
||||
**What**: Rate of requests that fail
|
||||
|
||||
**Why Monitor**: Direct indicator of user-facing problems
|
||||
|
||||
**Key Metrics**:
|
||||
- Error rate (%)
|
||||
- 5xx response codes
|
||||
- Failed transactions
|
||||
- Exception counts
|
||||
|
||||
**PromQL Examples**:
|
||||
```promql
|
||||
# Error rate percentage
|
||||
sum(rate(http_requests_total{status=~"5.."}[5m]))
|
||||
/
|
||||
sum(rate(http_requests_total[5m])) * 100
|
||||
|
||||
# Error count by type
|
||||
sum(rate(http_requests_total{status=~"5.."}[5m])) by (status)
|
||||
|
||||
# Application errors
|
||||
rate(application_errors_total[5m])
|
||||
```
|
||||
|
||||
**Alert Thresholds**:
|
||||
- Warning: Error rate > 1%
|
||||
- Critical: Error rate > 5%
|
||||
|
||||
### 4. Saturation
|
||||
**What**: How "full" your service is
|
||||
|
||||
**Why Monitor**: Predict capacity issues before they impact users
|
||||
|
||||
**Key Metrics**:
|
||||
- CPU utilization
|
||||
- Memory utilization
|
||||
- Disk I/O
|
||||
- Network bandwidth
|
||||
- Queue depth
|
||||
- Thread pool usage
|
||||
|
||||
**PromQL Examples**:
|
||||
```promql
|
||||
# CPU saturation
|
||||
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
|
||||
|
||||
# Memory saturation
|
||||
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
|
||||
|
||||
# Disk saturation
|
||||
rate(node_disk_io_time_seconds_total[5m]) * 100
|
||||
|
||||
# Queue depth
|
||||
queue_depth_current / queue_depth_max * 100
|
||||
```
|
||||
|
||||
**Alert Thresholds**:
|
||||
- Warning: > 70% utilization
|
||||
- Critical: > 90% utilization
|
||||
|
||||
---
|
||||
|
||||
## RED Method (for Services)
|
||||
|
||||
**R**ate, **E**rrors, **D**uration - a simplified approach for request-driven services
|
||||
|
||||
### Rate
|
||||
Number of requests per second:
|
||||
```promql
|
||||
sum(rate(http_requests_total[5m]))
|
||||
```
|
||||
|
||||
### Errors
|
||||
Number of failed requests per second:
|
||||
```promql
|
||||
sum(rate(http_requests_total{status=~"5.."}[5m]))
|
||||
```
|
||||
|
||||
### Duration
|
||||
Time taken to process requests:
|
||||
```promql
|
||||
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
```
|
||||
|
||||
**When to Use**: Microservices, APIs, web applications
|
||||
|
||||
---
|
||||
|
||||
## USE Method (for Resources)
|
||||
|
||||
**U**tilization, **S**aturation, **E**rrors - for infrastructure resources
|
||||
|
||||
### Utilization
|
||||
Percentage of time resource is busy:
|
||||
```promql
|
||||
# CPU utilization
|
||||
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
|
||||
|
||||
# Disk utilization
|
||||
(node_filesystem_size_bytes - node_filesystem_avail_bytes)
|
||||
/ node_filesystem_size_bytes * 100
|
||||
```
|
||||
|
||||
### Saturation
|
||||
Amount of work the resource cannot service (queued):
|
||||
```promql
|
||||
# Load average (saturation indicator)
|
||||
node_load15
|
||||
|
||||
# Disk I/O wait time
|
||||
rate(node_disk_io_time_weighted_seconds_total[5m])
|
||||
```
|
||||
|
||||
### Errors
|
||||
Count of error events:
|
||||
```promql
|
||||
# Network errors
|
||||
rate(node_network_receive_errs_total[5m])
|
||||
rate(node_network_transmit_errs_total[5m])
|
||||
|
||||
# Disk errors
|
||||
rate(node_disk_io_errors_total[5m])
|
||||
```
|
||||
|
||||
**When to Use**: Servers, databases, network devices
|
||||
|
||||
---
|
||||
|
||||
## Metric Types
|
||||
|
||||
### Counter
|
||||
Monotonically increasing value (never decreases)
|
||||
|
||||
**Examples**: Request count, error count, bytes sent
|
||||
|
||||
**Usage**:
|
||||
```promql
|
||||
# Always use rate() or increase() with counters
|
||||
rate(http_requests_total[5m]) # Requests per second
|
||||
increase(http_requests_total[1h]) # Total requests in 1 hour
|
||||
```
|
||||
|
||||
### Gauge
|
||||
Value that can go up and down
|
||||
|
||||
**Examples**: Memory usage, queue depth, concurrent connections
|
||||
|
||||
**Usage**:
|
||||
```promql
|
||||
# Use directly or with aggregations
|
||||
avg(memory_usage_bytes)
|
||||
max(queue_depth)
|
||||
```
|
||||
|
||||
### Histogram
|
||||
Samples observations and counts them in configurable buckets
|
||||
|
||||
**Examples**: Request duration, response size
|
||||
|
||||
**Usage**:
|
||||
```promql
|
||||
# Calculate percentiles
|
||||
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
|
||||
# Average from histogram
|
||||
rate(http_request_duration_seconds_sum[5m])
|
||||
/
|
||||
rate(http_request_duration_seconds_count[5m])
|
||||
```
|
||||
|
||||
### Summary
|
||||
Similar to histogram but calculates quantiles on client side
|
||||
|
||||
**Usage**: Less flexible than histograms, avoid for new metrics
|
||||
|
||||
---
|
||||
|
||||
## Cardinality Best Practices
|
||||
|
||||
**Cardinality**: Number of unique time series
|
||||
|
||||
### High Cardinality Labels (AVOID)
|
||||
❌ User ID
|
||||
❌ Email address
|
||||
❌ IP address
|
||||
❌ Timestamp
|
||||
❌ Random IDs
|
||||
|
||||
### Low Cardinality Labels (GOOD)
|
||||
✅ Environment (prod, staging)
|
||||
✅ Region (us-east-1, eu-west-1)
|
||||
✅ Service name
|
||||
✅ HTTP status code category (2xx, 4xx, 5xx)
|
||||
✅ Endpoint/route
|
||||
|
||||
### Calculating Cardinality Impact
|
||||
```
|
||||
Time series = unique combinations of labels
|
||||
|
||||
Example:
|
||||
service (5) × environment (3) × region (4) × status (5) = 300 time series ✅
|
||||
|
||||
service (5) × environment (3) × region (4) × user_id (1M) = 60M time series ❌
|
||||
```
|
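A quick way to sanity-check a label set before shipping it is to multiply the expected value counts, as in the example above. A tiny sketch:

```python
from math import prod

def series_count(label_values: dict[str, int]) -> int:
    """Worst-case number of time series for one metric name."""
    return prod(label_values.values())

print(series_count({"service": 5, "environment": 3, "region": 4, "status": 5}))           # 300
print(series_count({"service": 5, "environment": 3, "region": 4, "user_id": 1_000_000}))  # 60,000,000
```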
||||
|
||||
---
|
||||
|
||||
## Naming Conventions
|
||||
|
||||
### Prometheus Naming
|
||||
```
|
||||
<namespace>_<name>_<unit>   (append `_total` for counters)
|
||||
|
||||
Examples:
|
||||
http_requests_total
|
||||
http_request_duration_seconds
|
||||
process_cpu_seconds_total
|
||||
node_memory_MemAvailable_bytes
|
||||
```
|
||||
|
||||
**Rules**:
|
||||
- Use snake_case
|
||||
- Include unit in name (seconds, bytes, ratio)
|
||||
- Use `_total` suffix for counters
|
||||
- Namespace by application/component
|
||||
|
||||
### CloudWatch Naming
|
||||
```
|
||||
<Namespace>/<MetricName>
|
||||
|
||||
Examples:
|
||||
AWS/EC2/CPUUtilization
|
||||
MyApp/RequestCount
|
||||
```
|
||||
|
||||
**Rules**:
|
||||
- Use PascalCase
|
||||
- Group by namespace
|
||||
- No unit in name (specified separately)
|
||||
|
||||
---
|
||||
|
||||
## Dashboard Design
|
||||
|
||||
### Key Principles
|
||||
|
||||
1. **Top-Down Layout**: Most important metrics first
|
||||
2. **Color Coding**: Red (critical), yellow (warning), green (healthy)
|
||||
3. **Consistent Time Windows**: All panels use same time range
|
||||
4. **Limit Panels**: 8-12 panels per dashboard maximum
|
||||
5. **Include Context**: Show related metrics together
|
||||
|
||||
### Dashboard Structure
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────┐
|
||||
│ Overall Health (Single Stats) │
|
||||
│ [Requests/s] [Error%] [P95 Latency] │
|
||||
└─────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────┐
|
||||
│ Request Rate & Errors (Graphs) │
|
||||
└─────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────┐
|
||||
│ Latency Distribution (Graphs) │
|
||||
└─────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────┐
|
||||
│ Resource Usage (Graphs) │
|
||||
└─────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────┐
|
||||
│ Dependencies (Graphs) │
|
||||
└─────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
### Template Variables
|
||||
Use variables for filtering:
|
||||
- Environment: `$environment`
|
||||
- Service: `$service`
|
||||
- Region: `$region`
|
||||
- Pod: `$pod`
|
||||
|
||||
---
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
### 1. Monitoring What You Build, Not What Users Experience
|
||||
❌ `backend_processing_complete`
|
||||
✅ `user_request_completed`
|
||||
|
||||
### 2. Too Many Metrics
|
||||
- Start with Four Golden Signals
|
||||
- Add metrics only when needed for specific issues
|
||||
- Remove unused metrics
|
||||
|
||||
### 3. Incorrect Aggregations
|
||||
❌ `avg(rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]))` - averages per-instance averages, so lightly loaded instances are weighted the same as busy ones
✅ `sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))` - traffic-weighted average across instances
|
||||
|
||||
### 4. Wrong Time Windows
|
||||
- Too short (< 1m): Noisy data
|
||||
- Too long (> 15m): Miss short-lived issues
|
||||
- Sweet spot: 5m for most alerts
|
||||
|
||||
### 5. Missing Labels
|
||||
❌ `http_requests_total`
|
||||
✅ `http_requests_total{method="GET", status="200", endpoint="/api/users"}`
|
||||
|
||||
---
|
||||
|
||||
## Metric Collection Best Practices
|
||||
|
||||
### Application Instrumentation
|
||||
```python
|
||||
from prometheus_client import Counter, Histogram, Gauge
|
||||
|
||||
# Counter for requests
|
||||
requests_total = Counter('http_requests_total',
|
||||
'Total HTTP requests',
|
||||
['method', 'endpoint', 'status'])
|
||||
|
||||
# Histogram for latency
|
||||
request_duration = Histogram('http_request_duration_seconds',
|
||||
'HTTP request duration',
|
||||
['method', 'endpoint'])
|
||||
|
||||
# Gauge for in-progress requests
|
||||
requests_in_progress = Gauge('http_requests_in_progress',
|
||||
'HTTP requests currently being processed')
|
||||
```
|
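To make these instruments produce data, record observations in the request path. A sketch of a handler wrapper (the `handle` callable and its response object with `status_code` are assumptions):

```python
import time

def instrumented(method: str, endpoint: str, handle):
    """Wrap a request handler so it feeds the counter, histogram and gauge above."""
    def wrapper(request):
        start = time.perf_counter()
        with requests_in_progress.track_inprogress():  # gauge up on enter, down on exit
            response = handle(request)
        request_duration.labels(method=method, endpoint=endpoint).observe(time.perf_counter() - start)
        requests_total.labels(method=method, endpoint=endpoint,
                              status=str(response.status_code)).inc()
        return response
    return wrapper
```

If the application does not already expose a metrics endpoint, `prometheus_client.start_http_server(8000)` serves the default registry for Prometheus to scrape.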
||||
|
||||
### Collection Intervals
|
||||
- Application metrics: 15-30s
|
||||
- Infrastructure metrics: 30-60s
|
||||
- Billing/cost metrics: 5-15m
|
||||
- External API checks: 1-5m
|
||||
|
||||
### Retention
|
||||
- Raw metrics: 15-30 days
|
||||
- 5m aggregates: 90 days
|
||||
- 1h aggregates: 1 year
|
||||
- Daily aggregates: 2+ years
|
||||
652
references/slo_sla_guide.md
Normal file
@@ -0,0 +1,652 @@
|
||||
# SLI, SLO, and SLA Guide
|
||||
|
||||
## Definitions
|
||||
|
||||
### SLI (Service Level Indicator)
|
||||
**What**: A quantitative measure of service quality
|
||||
|
||||
**Examples**:
|
||||
- Request latency (ms)
|
||||
- Error rate (%)
|
||||
- Availability (%)
|
||||
- Throughput (requests/sec)
|
||||
|
||||
### SLO (Service Level Objective)
|
||||
**What**: Target value or range for an SLI
|
||||
|
||||
**Examples**:
|
||||
- "99.9% of requests return in < 500ms"
|
||||
- "99.95% availability"
|
||||
- "Error rate < 0.1%"
|
||||
|
||||
### SLA (Service Level Agreement)
|
||||
**What**: Business contract with consequences for SLO violations
|
||||
|
||||
**Examples**:
|
||||
- "99.9% uptime or 10% monthly credit"
|
||||
- "p95 latency < 1s or refund"
|
||||
|
||||
### Relationship
|
||||
```
|
||||
SLI = Measurement
|
||||
SLO = Target (internal goal)
|
||||
SLA = Promise (customer contract with penalties)
|
||||
|
||||
Example:
|
||||
SLI: Actual availability this month = 99.92%
|
||||
SLO: Target availability = 99.9%
|
||||
SLA: Guaranteed availability = 99.5% (with penalties)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Choosing SLIs
|
||||
|
||||
### The Four Golden Signals as SLIs
|
||||
|
||||
1. **Latency SLIs**
|
||||
- Request duration (p50, p95, p99)
|
||||
- Time to first byte
|
||||
- Page load time
|
||||
|
||||
2. **Availability/Success SLIs**
|
||||
- % of successful requests
|
||||
- % uptime
|
||||
- % of requests completing
|
||||
|
||||
3. **Throughput SLIs** (less common)
|
||||
- Requests per second
|
||||
- Transactions per second
|
||||
|
||||
4. **Saturation SLIs** (internal only)
|
||||
- Resource utilization
|
||||
- Queue depth
|
||||
|
||||
### SLI Selection Criteria
|
||||
|
||||
✅ **Good SLIs**:
|
||||
- Measured from user perspective
|
||||
- Directly impact user experience
|
||||
- Aggregatable across instances
|
||||
- Proportional to user happiness
|
||||
|
||||
❌ **Bad SLIs**:
|
||||
- Internal metrics only
|
||||
- Not user-facing
|
||||
- Hard to measure consistently
|
||||
|
||||
### Examples by Service Type
|
||||
|
||||
**Web Application**:
|
||||
```
|
||||
SLI 1: Request Success Rate
|
||||
= successful_requests / total_requests
|
||||
|
||||
SLI 2: Request Latency (p95)
|
||||
= 95th percentile of response times
|
||||
|
||||
SLI 3: Availability
|
||||
= time_service_responding / total_time
|
||||
```
|
||||
|
||||
**API Service**:
|
||||
```
|
||||
SLI 1: Error Rate
|
||||
= (4xx_errors + 5xx_errors) / total_requests
|
||||
|
||||
SLI 2: Response Time (p99)
|
||||
= 99th percentile latency
|
||||
|
||||
SLI 3: Throughput
|
||||
= requests_per_second
|
||||
```
|
||||
|
||||
**Batch Processing**:
|
||||
```
|
||||
SLI 1: Job Success Rate
|
||||
= successful_jobs / total_jobs
|
||||
|
||||
SLI 2: Processing Latency
|
||||
= time_from_submission_to_completion
|
||||
|
||||
SLI 3: Freshness
|
||||
= age_of_oldest_unprocessed_item
|
||||
```
|
||||
|
||||
**Storage Service**:
|
||||
```
|
||||
SLI 1: Durability
|
||||
= data_not_lost / total_data
|
||||
|
||||
SLI 2: Read Latency (p99)
|
||||
= 99th percentile read time
|
||||
|
||||
SLI 3: Write Success Rate
|
||||
= successful_writes / total_writes
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Setting SLO Targets
|
||||
|
||||
### Start with Current Performance
|
||||
|
||||
1. **Measure baseline**: Collect 30 days of data
|
||||
2. **Analyze distribution**: Look at p50, p95, p99, p99.9 (see the sketch after this list)
|
||||
3. **Set initial SLO**: Slightly better than worst performer
|
||||
4. **Iterate**: Tighten or loosen based on feasibility
|
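A sketch of steps 1-3: load the collected latency samples, read off the percentiles, and derive a first target. The file name and the ~10% headroom rule are assumptions, not a prescription:

```python
import statistics

def baseline(latencies_ms: list[float]) -> dict[str, float]:
    """p50/p95/p99/p99.9 from ~30 days of request latencies."""
    q = statistics.quantiles(latencies_ms, n=1000, method="inclusive")
    return {"p50": q[499], "p95": q[949], "p99": q[989], "p99.9": q[998]}

samples = [float(line) for line in open("latency_30d.txt")]  # one latency (ms) per line
pct = baseline(samples)
print(pct)
print(f"Initial latency SLO suggestion: p95 < {round(pct['p95'] * 1.1)}ms")  # ~10% headroom
```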
||||
|
||||
### Example Process
|
||||
|
||||
**Current Performance** (30 days):
|
||||
```
|
||||
p50 latency: 120ms
|
||||
p95 latency: 450ms
|
||||
p99 latency: 1200ms
|
||||
p99.9 latency: 3500ms
|
||||
|
||||
Error rate: 0.05%
|
||||
Availability: 99.95%
|
||||
```
|
||||
|
||||
**Initial SLOs**:
|
||||
```
|
||||
Latency: p95 < 500ms (slightly worse than current p95)
|
||||
Error rate: < 0.1% (double current rate)
|
||||
Availability: 99.9% (slightly worse than current)
|
||||
```
|
||||
|
||||
**Rationale**: Start loose, prevent false alarms, tighten over time
|
||||
|
||||
### Common SLO Targets
|
||||
|
||||
**Availability**:
|
||||
- **99%** (3.65 days downtime/year): Internal tools
|
||||
- **99.5%** (1.83 days/year): Non-critical services
|
||||
- **99.9%** (8.76 hours/year): Standard production
|
||||
- **99.95%** (4.38 hours/year): Critical services
|
||||
- **99.99%** (52 minutes/year): High availability
|
||||
- **99.999%** (5 minutes/year): Mission critical
|
||||
|
||||
**Latency**:
|
||||
- **p50 < 100ms**: Excellent responsiveness
|
||||
- **p95 < 500ms**: Standard web applications
|
||||
- **p99 < 1s**: Acceptable for most users
|
||||
- **p99.9 < 5s**: Acceptable for rare edge cases
|
||||
|
||||
**Error Rate**:
|
||||
- **< 0.01%** (99.99% success): Critical operations
|
||||
- **< 0.1%** (99.9% success): Standard production
|
||||
- **< 1%** (99% success): Non-critical services
|
||||
|
||||
---
|
||||
|
||||
## Error Budgets
|
||||
|
||||
### Concept
|
||||
|
||||
Error budget = (100% - SLO target)
|
||||
|
||||
If SLO is 99.9%, error budget is 0.1%
|
||||
|
||||
**Purpose**: Balance reliability with feature velocity
|
||||
|
||||
### Calculation
|
||||
|
||||
**For availability**:
|
||||
```
|
||||
Monthly error budget = (1 - SLO) × time_period
|
||||
|
||||
Example (99.9% SLO, 30 days):
|
||||
Error budget = 0.001 × 30 days = 0.03 days = 43.2 minutes
|
||||
```
|
||||
|
||||
**For request-based SLIs**:
|
||||
```
|
||||
Error budget = (1 - SLO) × total_requests
|
||||
|
||||
Example (99.9% SLO, 10M requests/month):
|
||||
Error budget = 0.001 × 10,000,000 = 10,000 failed requests
|
||||
```
|
||||
|
||||
### Error Budget Consumption
|
||||
|
||||
**Formula**:
|
||||
```
|
||||
Budget consumed = actual_errors / allowed_errors × 100%
|
||||
|
||||
Example:
|
||||
SLO: 99.9% (0.1% error budget)
|
||||
Total requests: 1,000,000
|
||||
Failed requests: 500
|
||||
Allowed failures: 1,000
|
||||
|
||||
Budget consumed = 500 / 1,000 × 100% = 50%
|
||||
Budget remaining = 50%
|
||||
```
|
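The same arithmetic as a small helper, covering both time-based and request-based budgets (function names are illustrative):

```python
def allowed_downtime_minutes(slo: float, period_days: int = 30) -> float:
    """Time-based budget, e.g. 99.9% over 30 days -> 43.2 minutes."""
    return (1 - slo) * period_days * 24 * 60

def budget_consumed(failed: int, total: int, slo: float) -> float:
    """Fraction of a request-based budget already spent."""
    allowed_failures = (1 - slo) * total
    return failed / allowed_failures

print(allowed_downtime_minutes(0.999))          # 43.2
print(budget_consumed(500, 1_000_000, 0.999))   # 0.5 -> 50% consumed
```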
||||
|
||||
### Error Budget Policy
|
||||
|
||||
**Example policy**:
|
||||
|
||||
```markdown
|
||||
## Error Budget Policy
|
||||
|
||||
### If error budget > 50%
|
||||
- Deploy frequently (multiple times per day)
|
||||
- Take calculated risks
|
||||
- Experiment with new features
|
||||
- Acceptable to have some incidents
|
||||
|
||||
### If error budget 20-50%
|
||||
- Deploy normally (once per day)
|
||||
- Increase testing
|
||||
- Review recent changes
|
||||
- Monitor closely
|
||||
|
||||
### If error budget < 20%
|
||||
- Freeze non-critical deploys
|
||||
- Focus on reliability improvements
|
||||
- Postmortem all incidents
|
||||
- Reduce change velocity
|
||||
|
||||
### If error budget exhausted (< 0%)
|
||||
- Complete deploy freeze except rollbacks
|
||||
- All hands on reliability
|
||||
- Mandatory postmortems
|
||||
- Executive escalation
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Error Budget Burn Rate
|
||||
|
||||
### Concept
|
||||
|
||||
Burn rate = rate of error budget consumption
|
||||
|
||||
**Example**:
|
||||
- Monthly budget: 43.2 minutes (99.9% SLO)
|
||||
- If consuming at 2x rate: Budget exhausted in 15 days
|
||||
- If consuming at 10x rate: Budget exhausted in 3 days
|
||||
|
||||
### Burn Rate Calculation
|
||||
|
||||
```
|
||||
Burn rate = (actual_error_rate / allowed_error_rate)
|
||||
|
||||
Example:
|
||||
SLO: 99.9% (0.1% allowed error rate)
|
||||
Current error rate: 0.5%
|
||||
|
||||
Burn rate = 0.5% / 0.1% = 5x
|
||||
Time to exhaust = 30 days / 5 = 6 days
|
||||
```
|
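Expressed as code, using the same numbers as the example above (names are illustrative):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_rate / (1 - slo)

def days_to_exhaust(rate: float, period_days: int = 30) -> float:
    return period_days / rate

r = burn_rate(error_rate=0.005, slo=0.999)  # 0.5% observed vs 0.1% allowed
print(r)                                     # 5.0
print(days_to_exhaust(r))                    # 6.0 days
```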
||||
|
||||
### Multi-Window Alerting
|
||||
|
||||
Alert on burn rate across multiple time windows:
|
||||
|
||||
**Fast burn** (1 hour window):
|
||||
```
|
||||
Burn rate > 14.4x → Exhausts budget in 2 days
|
||||
Alert after 2 minutes
|
||||
Severity: Critical (page immediately)
|
||||
```
|
||||
|
||||
**Moderate burn** (6 hour window):
|
||||
```
|
||||
Burn rate > 6x → Exhausts budget in 5 days
|
||||
Alert after 30 minutes
|
||||
Severity: Warning (create ticket)
|
||||
```
|
||||
|
||||
**Slow burn** (3 day window):
|
||||
```
|
||||
Burn rate > 1x → Exhausts budget by end of month
|
||||
Alert after 6 hours
|
||||
Severity: Info (monitor)
|
||||
```
|
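The window thresholds are not arbitrary: each multiplier is the burn rate that would spend a chosen fraction of the monthly budget inside the alert window. A sketch of that arithmetic, reproducing the 14.4x / 6x / 1x values under the common 2% / 5% / 10% split:

```python
def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        period_hours: float = 30 * 24) -> float:
    """Burn rate that spends `budget_fraction` of the period's budget within the window."""
    return budget_fraction * period_hours / window_hours

print(burn_rate_threshold(0.02, 1))        # 14.4 -> fast burn, 1h window
print(burn_rate_threshold(0.05, 6))        # 6.0  -> moderate burn, 6h window
print(burn_rate_threshold(0.10, 3 * 24))   # 1.0  -> slow burn, 3d window
```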
||||
|
||||
### Implementation
|
||||
|
||||
**Prometheus**:
|
||||
```yaml
|
||||
# Fast burn alert (1h window, 2m grace period)
|
||||
- alert: ErrorBudgetFastBurn
|
||||
expr: |
|
||||
(
|
||||
sum(rate(http_requests_total{status=~"5.."}[1h]))
|
||||
/
|
||||
sum(rate(http_requests_total[1h]))
|
||||
) > (14.4 * 0.001) # 14.4x burn rate for 99.9% SLO
|
||||
for: 2m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "Fast error budget burn detected"
|
||||
description: "Error budget will be exhausted in 2 days at current rate"
|
||||
|
||||
# Slow burn alert (6h window, 30m grace period)
|
||||
- alert: ErrorBudgetSlowBurn
|
||||
expr: |
|
||||
(
|
||||
sum(rate(http_requests_total{status=~"5.."}[6h]))
|
||||
/
|
||||
sum(rate(http_requests_total[6h]))
|
||||
) > (6 * 0.001) # 6x burn rate for 99.9% SLO
|
||||
for: 30m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "Elevated error budget burn detected"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## SLO Reporting
|
||||
|
||||
### Dashboard Structure
|
||||
|
||||
**Overall Health**:
|
||||
```
|
||||
┌─────────────────────────────────────────┐
|
||||
│ SLO Compliance: 99.92% ✅ │
|
||||
│ Error Budget Remaining: 73% 🟢 │
|
||||
│ Burn Rate: 0.8x 🟢 │
|
||||
└─────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
**SLI Performance**:
|
||||
```
|
||||
Latency p95: 420ms (Target: 500ms) ✅
|
||||
Error Rate: 0.08% (Target: < 0.1%) ✅
|
||||
Availability: 99.95% (Target: > 99.9%) ✅
|
||||
```
|
||||
|
||||
**Error Budget Trend**:
|
||||
```
|
||||
Graph showing:
|
||||
- Error budget consumption over time
|
||||
- Burn rate spikes
|
||||
- Incidents marked
|
||||
- Deploy events overlaid
|
||||
```
|
||||
|
||||
### Monthly SLO Report
|
||||
|
||||
**Template**:
|
||||
```markdown
|
||||
# SLO Report: October 2024
|
||||
|
||||
## Executive Summary
|
||||
- ✅ All SLOs met this month
|
||||
- 🟡 Latency SLO came close to violation (99.1% compliance)
|
||||
- 3 incidents consumed 47% of error budget
|
||||
- Error budget remaining: 53%
|
||||
|
||||
## SLO Performance
|
||||
|
||||
### Availability SLO: 99.9%
|
||||
- Actual: 99.92%
|
||||
- Status: ✅ Met
|
||||
- Error budget consumed: 33%
|
||||
- Downtime: 23 minutes (allowed: 43.2 minutes)
|
||||
|
||||
### Latency SLO: p95 < 500ms
|
||||
- Actual p95: 445ms
|
||||
- Status: ✅ Met
|
||||
- Compliance: 99.1% (target: 99%)
|
||||
- 0.9% of requests exceeded threshold
|
||||
|
||||
### Error Rate SLO: < 0.1%
|
||||
- Actual: 0.05%
|
||||
- Status: ✅ Met
|
||||
- Error budget consumed: 50%
|
||||
|
||||
## Incidents
|
||||
|
||||
### Incident #1: Database Overload (Oct 5)
|
||||
- Duration: 15 minutes
|
||||
- Error budget consumed: 35%
|
||||
- Root cause: Slow query after schema change
|
||||
- Prevention: Added query review to deploy checklist
|
||||
|
||||
### Incident #2: API Gateway Timeout (Oct 12)
|
||||
- Duration: 5 minutes
|
||||
- Error budget consumed: 10%
|
||||
- Root cause: Configuration error in load balancer
|
||||
- Prevention: Automated configuration validation
|
||||
|
||||
### Incident #3: Upstream Service Degradation (Oct 20)
|
||||
- Duration: 3 minutes
|
||||
- Error budget consumed: 2%
|
||||
- Root cause: Third-party API outage
|
||||
- Prevention: Implemented circuit breaker
|
||||
|
||||
## Recommendations
|
||||
1. Investigate latency near-miss (Oct 15-17)
|
||||
2. Add automated rollback for database changes
|
||||
3. Increase circuit breaker thresholds for third-party APIs
|
||||
4. Consider tightening availability SLO to 99.95%
|
||||
|
||||
## Next Month's Focus
|
||||
- Reduce p95 latency to 400ms
|
||||
- Implement automated canary deployments
|
||||
- Add synthetic monitoring for critical paths
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## SLA Structure
|
||||
|
||||
### Components
|
||||
|
||||
**Service Description**:
|
||||
```
|
||||
The API Service provides RESTful endpoints for user management,
|
||||
authentication, and data retrieval.
|
||||
```
|
||||
|
||||
**Covered Metrics**:
|
||||
```
|
||||
- Availability: Service is reachable and returns valid responses
|
||||
- Latency: Time from request to response
|
||||
- Error Rate: Percentage of requests returning errors
|
||||
```
|
||||
|
||||
**SLA Targets**:
|
||||
```
|
||||
Service commits to:
|
||||
1. 99.9% monthly uptime
|
||||
2. p95 API response time < 1 second
|
||||
3. Error rate < 0.5%
|
||||
```
|
||||
|
||||
**Measurement**:
|
||||
```
|
||||
Metrics calculated from server-side monitoring:
|
||||
- Uptime: Successful health check probes / total probes
|
||||
- Latency: Server-side request duration (p95)
|
||||
- Errors: HTTP 5xx responses / total responses
|
||||
|
||||
Calculated monthly (first of month for previous month).
|
||||
```
|
||||
|
||||
**Exclusions**:
|
||||
```
|
||||
SLA does not cover:
|
||||
- Scheduled maintenance (with 7 days notice)
|
||||
- Client-side network issues
|
||||
- DDoS attacks or force majeure
|
||||
- Beta/preview features
|
||||
- Issues caused by customer misuse
|
||||
```
|
||||
|
||||
**Service Credits**:
|
||||
```
|
||||
Monthly Uptime | Service Credit
|
||||
---------------- | --------------
|
||||
< 99.9% (SLA) | 10%
|
||||
< 99.0% | 25%
|
||||
< 95.0% | 50%
|
||||
```
|
||||
|
||||
**Claiming Credits**:
|
||||
```
|
||||
Customer must:
|
||||
1. Report violation within 30 days
|
||||
2. Provide ticket numbers for support requests
|
||||
3. Credits applied to next month's invoice
|
||||
4. Credits do not exceed monthly fee
|
||||
```
|
||||
|
||||
### Example SLAs by Industry
|
||||
|
||||
**E-commerce**:
|
||||
```
|
||||
- 99.95% availability
|
||||
- p95 page load < 2s
|
||||
- p99 checkout < 5s
|
||||
- Credits: 5% per 0.1% below target
|
||||
```
|
||||
|
||||
**Financial Services**:
|
||||
```
|
||||
- 99.99% availability
|
||||
- p99 transaction < 500ms
|
||||
- Zero data loss
|
||||
- Penalties: $10,000 per hour of downtime
|
||||
```
|
||||
|
||||
**Media/Content**:
|
||||
```
|
||||
- 99.9% availability
|
||||
- p95 video start < 3s
|
||||
- No credit system (best effort latency)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Best Practices
|
||||
|
||||
### 1. SLOs Should Be User-Centric
|
||||
❌ "Database queries < 100ms"
|
||||
✅ "API response time p95 < 500ms"
|
||||
|
||||
### 2. Start Loose, Tighten Over Time
|
||||
- Begin with achievable targets
|
||||
- Build reliability culture
|
||||
- Gradually raise bar
|
||||
|
||||
### 3. Fewer, Better SLOs
|
||||
- 1-3 SLOs per service
|
||||
- Focus on user impact
|
||||
- Avoid SLO sprawl
|
||||
|
||||
### 4. SLAs More Conservative Than SLOs
|
||||
```
|
||||
Internal SLO: 99.95%
|
||||
Customer SLA: 99.9%
|
||||
Margin: 0.05% buffer
|
||||
```
|
||||
|
||||
### 5. Make Error Budgets Actionable
|
||||
- Define policies at different thresholds
|
||||
- Empower teams to make tradeoffs
|
||||
- Review in planning meetings
|
||||
|
||||
### 6. Document Everything
|
||||
- How SLIs are measured
|
||||
- Why targets were chosen
|
||||
- Who owns each SLO
|
||||
- How to interpret metrics
|
||||
|
||||
### 7. Review Regularly
|
||||
- Monthly SLO reviews
|
||||
- Quarterly SLO adjustments
|
||||
- Annual SLA renegotiation
|
||||
|
||||
---
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
### 1. Too Many SLOs
|
||||
❌ 20 different SLOs per service
|
||||
✅ 2-3 critical SLOs
|
||||
|
||||
### 2. Unrealistic Targets
|
||||
❌ 99.999% for non-critical service
|
||||
✅ 99.9% with room to improve
|
||||
|
||||
### 3. SLOs Without Error Budgets
|
||||
❌ "Must always be 99.9%"
|
||||
✅ "Budget for 0.1% errors"
|
||||
|
||||
### 4. No Consequences
|
||||
❌ Missing SLO has no impact
|
||||
✅ Deploy freeze when budget exhausted
|
||||
|
||||
### 5. SLA Equals SLO
|
||||
❌ Promise exactly what you target
|
||||
✅ SLA more conservative than SLO
|
||||
|
||||
### 6. Ignoring User Experience
|
||||
❌ "Our servers are up 99.99%"
|
||||
✅ "Users can complete actions 99.9% of the time"
|
||||
|
||||
### 7. Static Targets
|
||||
❌ Set once, never revisit
|
||||
✅ Quarterly reviews and adjustments
|
||||
|
||||
---
|
||||
|
||||
## Tools and Automation
|
||||
|
||||
### SLO Tracking Tools
|
||||
|
||||
**Prometheus + Grafana**:
|
||||
- Use recording rules for SLIs
|
||||
- Alert on burn rates
|
||||
- Dashboard for compliance
|
||||
|
||||
**Google Cloud SLO Monitoring**:
|
||||
- Built-in SLO tracking
|
||||
- Automatic error budget calculation
|
||||
- Integration with alerting
|
||||
|
||||
**Datadog SLOs**:
|
||||
- UI for SLO definition
|
||||
- Automatic burn rate alerts
|
||||
- Status pages
|
||||
|
||||
**Custom Tools**:
|
||||
- sloth: Generate Prometheus rules from SLO definitions
|
||||
- slo-libsonnet: Jsonnet library for SLO monitoring
|
||||
|
||||
### Example: Prometheus Recording Rules
|
||||
|
||||
```yaml
|
||||
groups:
|
||||
- name: sli_recording
|
||||
interval: 30s
|
||||
rules:
|
||||
# SLI: Request success rate
|
||||
- record: sli:request_success:ratio
|
||||
expr: |
|
||||
sum(rate(http_requests_total{status!~"5.."}[5m]))
|
||||
/
|
||||
sum(rate(http_requests_total[5m]))
|
||||
|
||||
# SLI: Request latency (p95)
|
||||
- record: sli:request_latency:p95
|
||||
expr: |
|
||||
histogram_quantile(0.95,
|
||||
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
|
||||
)
|
||||
|
||||
# Error budget burn rate (1h window)
|
||||
- record: slo:error_budget_burn_rate:1h
|
||||
expr: |
|
||||
(1 - sli:request_success:ratio) / 0.001
|
||||
```
|
||||
697
references/tool_comparison.md
Normal file
@@ -0,0 +1,697 @@
|
||||
# Monitoring Tools Comparison
|
||||
|
||||
## Overview Matrix
|
||||
|
||||
| Tool | Type | Best For | Complexity | Cost | Cloud/Self-Hosted |
|
||||
|------|------|----------|------------|------|-------------------|
|
||||
| **Prometheus** | Metrics | Kubernetes, time-series | Medium | Free | Self-hosted |
|
||||
| **Grafana** | Visualization | Dashboards, multi-source | Low-Medium | Free | Both |
|
||||
| **Datadog** | Full-stack | Ease of use, APM | Low | High | Cloud |
|
||||
| **New Relic** | Full-stack | APM, traces | Low | High | Cloud |
|
||||
| **Elasticsearch (ELK)** | Logs | Log search, analysis | High | Medium | Both |
|
||||
| **Grafana Loki** | Logs | Cost-effective logs | Medium | Free | Both |
|
||||
| **CloudWatch** | AWS-native | AWS infrastructure | Low | Medium | Cloud |
|
||||
| **Jaeger** | Tracing | Distributed tracing | Medium | Free | Self-hosted |
|
||||
| **Grafana Tempo** | Tracing | Cost-effective tracing | Medium | Free | Self-hosted |
|
||||
|
||||
---
|
||||
|
||||
## Metrics Platforms
|
||||
|
||||
### Prometheus
|
||||
|
||||
**Type**: Open-source time-series database
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Industry standard for Kubernetes
|
||||
- ✅ Powerful query language (PromQL)
|
||||
- ✅ Pull-based model (no agent config)
|
||||
- ✅ Service discovery
|
||||
- ✅ Free and open source
|
||||
- ✅ Huge ecosystem (exporters for everything)
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ No built-in dashboards (need Grafana)
|
||||
- ❌ Single-node only (no HA without federation)
|
||||
- ❌ Limited long-term storage (need Thanos/Cortex)
|
||||
- ❌ Steep learning curve for PromQL
|
||||
|
||||
**Best For**:
|
||||
- Kubernetes monitoring
|
||||
- Infrastructure metrics
|
||||
- Custom application metrics
|
||||
- Organizations that need control
|
||||
|
||||
**Pricing**: Free (open source)
|
||||
|
||||
**Setup Complexity**: Medium
|
||||
|
||||
**Example**:
|
||||
```yaml
|
||||
# prometheus.yml
|
||||
scrape_configs:
|
||||
- job_name: 'app'
|
||||
static_configs:
|
||||
- targets: ['localhost:8080']
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Datadog
|
||||
|
||||
**Type**: SaaS monitoring platform
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Easy to set up (install agent, done)
|
||||
- ✅ Beautiful pre-built dashboards
|
||||
- ✅ APM, logs, metrics, traces in one platform
|
||||
- ✅ Great anomaly detection
|
||||
- ✅ Excellent integrations (500+)
|
||||
- ✅ Good mobile app
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Very expensive at scale
|
||||
- ❌ Vendor lock-in
|
||||
- ❌ Cost can be unpredictable (per-host pricing)
|
||||
- ❌ Limited PromQL support
|
||||
|
||||
**Best For**:
|
||||
- Teams that want quick setup
|
||||
- Companies prioritizing ease of use over cost
|
||||
- Organizations needing full observability
|
||||
|
||||
**Pricing**: $15-$31/host/month + custom metrics fees
|
||||
|
||||
**Setup Complexity**: Low
|
||||
|
||||
**Example**:
|
||||
```bash
|
||||
# Install agent
|
||||
DD_API_KEY=xxx bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### New Relic
|
||||
|
||||
**Type**: SaaS application performance monitoring
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Excellent APM capabilities
|
||||
- ✅ User-friendly interface
|
||||
- ✅ Good transaction tracing
|
||||
- ✅ Comprehensive alerting
|
||||
- ✅ Generous free tier
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Can get expensive at scale
|
||||
- ❌ Vendor lock-in
|
||||
- ❌ Query language less powerful than PromQL
|
||||
- ❌ Limited customization
|
||||
|
||||
**Best For**:
|
||||
- Application performance monitoring
|
||||
- Teams focused on APM over infrastructure
|
||||
- Startups (free tier is generous)
|
||||
|
||||
**Pricing**: Free up to 100GB/month, then $0.30/GB
|
||||
|
||||
**Setup Complexity**: Low
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
import newrelic.agent
|
||||
newrelic.agent.initialize('newrelic.ini')
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### CloudWatch
|
||||
|
||||
**Type**: AWS-native monitoring
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Zero setup for AWS services
|
||||
- ✅ Native integration with AWS
|
||||
- ✅ Automatic dashboards for AWS resources
|
||||
- ✅ Tightly integrated with other AWS services
|
||||
- ✅ Good for cost if already on AWS
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ AWS-only (not multi-cloud)
|
||||
- ❌ Limited query capabilities
|
||||
- ❌ High costs for custom metrics
|
||||
- ❌ Basic visualization
|
||||
- ❌ 1-minute minimum resolution
|
||||
|
||||
**Best For**:
|
||||
- AWS-centric infrastructure
|
||||
- Quick setup for AWS services
|
||||
- Organizations already invested in AWS
|
||||
|
||||
**Pricing**:
|
||||
- First 10 custom metrics: Free
|
||||
- Additional: $0.30/metric/month
|
||||
- API calls: $0.01/1000 requests
|
||||
|
||||
**Setup Complexity**: Low (for AWS), Medium (for custom metrics)
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
import boto3
|
||||
cloudwatch = boto3.client('cloudwatch')
|
||||
cloudwatch.put_metric_data(
|
||||
Namespace='MyApp',
|
||||
MetricData=[{'MetricName': 'RequestCount', 'Value': 1}]
|
||||
)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Grafana Cloud / Mimir
|
||||
|
||||
**Type**: Managed Prometheus-compatible
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Prometheus-compatible (PromQL)
|
||||
- ✅ Managed service (no ops burden)
|
||||
- ✅ Good cost model (pay for what you use)
|
||||
- ✅ Grafana dashboards included
|
||||
- ✅ Long-term storage
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Relatively new (less mature)
|
||||
- ❌ Some Prometheus features missing
|
||||
- ❌ Requires Grafana for visualization
|
||||
|
||||
**Best For**:
|
||||
- Teams wanting Prometheus without ops overhead
|
||||
- Multi-cloud environments
|
||||
- Organizations already using Grafana
|
||||
|
||||
**Pricing**: $8/month + $0.29/1M samples
|
||||
|
||||
**Setup Complexity**: Low-Medium
|
||||
|
||||
---
|
||||
|
||||
## Logging Platforms
|
||||
|
||||
### Elasticsearch (ELK Stack)
|
||||
|
||||
**Type**: Open-source log search and analytics
|
||||
|
||||
**Full Stack**: Elasticsearch + Logstash + Kibana
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Powerful search capabilities
|
||||
- ✅ Rich query language
|
||||
- ✅ Great for log analysis
|
||||
- ✅ Mature ecosystem
|
||||
- ✅ Can handle large volumes
|
||||
- ✅ Flexible data model
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Complex to operate
|
||||
- ❌ Resource intensive (RAM hungry)
|
||||
- ❌ Expensive at scale
|
||||
- ❌ Requires dedicated ops team
|
||||
- ❌ Slow for high-cardinality queries
|
||||
|
||||
**Best For**:
|
||||
- Large organizations with ops teams
|
||||
- Deep log analysis needs
|
||||
- Search-heavy use cases
|
||||
|
||||
**Pricing**: Free (open source) + infrastructure costs
|
||||
|
||||
**Infrastructure**: ~$500-2000/month for medium scale
|
||||
|
||||
**Setup Complexity**: High
|
||||
|
||||
**Example**:
|
||||
```json
|
||||
PUT /logs-2024.10/_doc/1
|
||||
{
|
||||
"timestamp": "2024-10-28T14:32:15Z",
|
||||
"level": "error",
|
||||
"message": "Payment failed"
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Grafana Loki
|
||||
|
||||
**Type**: Log aggregation system
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Cost-effective (labels only, not full-text indexing)
|
||||
- ✅ Easy to operate
|
||||
- ✅ Prometheus-like label model
|
||||
- ✅ Great Grafana integration
|
||||
- ✅ Low resource usage
|
||||
- ✅ Fast time-range queries
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Limited full-text search
|
||||
- ❌ Requires careful label design
|
||||
- ❌ Younger ecosystem than ELK
|
||||
- ❌ Not ideal for complex queries
|
||||
|
||||
**Best For**:
|
||||
- Cost-conscious organizations
|
||||
- Kubernetes environments
|
||||
- Teams already using Prometheus
|
||||
- Time-series log queries
|
||||
|
||||
**Pricing**: Free (open source) + infrastructure costs
|
||||
|
||||
**Infrastructure**: ~$100-500/month for medium scale
|
||||
|
||||
**Setup Complexity**: Medium
|
||||
|
||||
**Example**:
|
||||
```logql
|
||||
{job="api", environment="prod"} |= "error" | json | level="error"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Splunk
|
||||
|
||||
**Type**: Enterprise log management
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Extremely powerful search
|
||||
- ✅ Great for security/compliance
|
||||
- ✅ Mature platform
|
||||
- ✅ Enterprise support
|
||||
- ✅ Machine learning features
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Very expensive
|
||||
- ❌ Complex pricing (per GB ingested)
|
||||
- ❌ Steep learning curve
|
||||
- ❌ Heavy resource usage
|
||||
|
||||
**Best For**:
|
||||
- Large enterprises
|
||||
- Security operations centers (SOCs)
|
||||
- Compliance-heavy industries
|
||||
|
||||
**Pricing**: $150-$1800/GB/month (depending on tier)
|
||||
|
||||
**Setup Complexity**: Medium-High
|
||||
|
||||
---
|
||||
|
||||
### CloudWatch Logs
|
||||
|
||||
**Type**: AWS-native log management
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Zero setup for AWS services
|
||||
- ✅ Integrated with AWS ecosystem
|
||||
- ✅ CloudWatch Insights for queries
|
||||
- ✅ Reasonable cost for low volume
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ AWS-only
|
||||
- ❌ Limited query capabilities
|
||||
- ❌ Expensive at high volume
|
||||
- ❌ Basic visualization
|
||||
|
||||
**Best For**:
|
||||
- AWS-centric applications
|
||||
- Low-volume logging
|
||||
- Simple log aggregation
|
||||
|
||||
**Pricing**: Tiered (as of May 2025)
|
||||
- Vended Logs: $0.50/GB (first 10TB), $0.25/GB (next 20TB), then lower tiers
|
||||
- Standard logs: $0.50/GB flat
|
||||
- Storage: $0.03/GB
|
||||
|
||||
**Setup Complexity**: Low (AWS), Medium (custom)
|
||||
|
||||
---
|
||||
|
||||
### Sumo Logic
|
||||
|
||||
**Type**: SaaS log management
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Easy to use
|
||||
- ✅ Good for cloud-native apps
|
||||
- ✅ Real-time analytics
|
||||
- ✅ Good compliance features
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Expensive at scale
|
||||
- ❌ Vendor lock-in
|
||||
- ❌ Limited customization
|
||||
|
||||
**Best For**:
|
||||
- Cloud-native applications
|
||||
- Teams wanting managed solution
|
||||
- Security and compliance use cases
|
||||
|
||||
**Pricing**: $90-$180/GB/month
|
||||
|
||||
**Setup Complexity**: Low
|
||||
|
||||
---
|
||||
|
||||
## Tracing Platforms
|
||||
|
||||
### Jaeger
|
||||
|
||||
**Type**: Open-source distributed tracing
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Industry standard
|
||||
- ✅ CNCF graduated project
|
||||
- ✅ Supports OpenTelemetry
|
||||
- ✅ Good UI
|
||||
- ✅ Free and open source
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Requires separate storage backend
|
||||
- ❌ Limited query capabilities
|
||||
- ❌ No built-in analytics
|
||||
|
||||
**Best For**:
|
||||
- Microservices architectures
|
||||
- Kubernetes environments
|
||||
- OpenTelemetry users
|
||||
|
||||
**Pricing**: Free (open source) + storage costs
|
||||
|
||||
**Setup Complexity**: Medium
|
||||
|
||||
---
|
||||
|
||||
### Grafana Tempo
|
||||
|
||||
**Type**: Open-source distributed tracing
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Cost-effective (object storage)
|
||||
- ✅ Easy to operate
|
||||
- ✅ Great Grafana integration
|
||||
- ✅ TraceQL query language
|
||||
- ✅ Supports OpenTelemetry
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Younger than Jaeger
|
||||
- ❌ Limited third-party integrations
|
||||
- ❌ Requires Grafana for UI
|
||||
|
||||
**Best For**:
|
||||
- Cost-conscious organizations
|
||||
- Teams using Grafana stack
|
||||
- High trace volumes
|
||||
|
||||
**Pricing**: Free (open source) + storage costs
|
||||
|
||||
**Setup Complexity**: Medium
|
||||
|
||||
---
|
||||
|
||||
### Datadog APM
|
||||
|
||||
**Type**: SaaS application performance monitoring
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Easy to set up
|
||||
- ✅ Excellent trace visualization
|
||||
- ✅ Integrated with metrics/logs
|
||||
- ✅ Automatic service map
|
||||
- ✅ Good profiling features
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Expensive ($31/host/month)
|
||||
- ❌ Vendor lock-in
|
||||
- ❌ Limited sampling control
|
||||
|
||||
**Best For**:
|
||||
- Teams wanting ease of use
|
||||
- Organizations already using Datadog
|
||||
- Complex microservices
|
||||
|
||||
**Pricing**: $31/host/month + $1.70/million spans
|
||||
|
||||
**Setup Complexity**: Low
|
||||
|
||||
---
|
||||
|
||||
### AWS X-Ray
|
||||
|
||||
**Type**: AWS-native distributed tracing
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Native AWS integration
|
||||
- ✅ Automatic instrumentation for AWS services
|
||||
- ✅ Low cost
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ AWS-only
|
||||
- ❌ Basic UI
|
||||
- ❌ Limited query capabilities
|
||||
|
||||
**Best For**:
|
||||
- AWS-centric applications
|
||||
- Serverless architectures (Lambda)
|
||||
- Cost-sensitive projects
|
||||
|
||||
**Pricing**: $5/million traces, first 100k free/month
|
||||
|
||||
**Setup Complexity**: Low (AWS), Medium (custom)
|
||||
|
||||
---
|
||||
|
||||
## Full-Stack Observability

### Datadog (Full Platform)

**Components**: Metrics, logs, traces, RUM, synthetics

**Strengths**:
- ✅ Everything in one platform
- ✅ Excellent user experience
- ✅ Correlation across signals
- ✅ Great for teams

**Weaknesses**:
- ❌ Very expensive ($50-100+/host/month)
- ❌ Vendor lock-in
- ❌ Unpredictable costs

**Total Cost** (example, 100 hosts):
- Infrastructure: $3,100/month
- APM: $3,100/month
- Logs: ~$2,000/month
- **Total: ~$8,000/month**

---

### Grafana Stack (LGTM)

**Components**: Loki (logs), Grafana (viz), Tempo (traces), Mimir/Prometheus (metrics)

**Strengths**:
- ✅ Open source and cost-effective
- ✅ Unified visualization
- ✅ Prometheus-compatible
- ✅ Great for cloud-native

**Weaknesses**:
- ❌ Requires self-hosting or Grafana Cloud
- ❌ More ops burden
- ❌ Less polished than commercial tools

**Total Cost** (self-hosted, 100 hosts):
- Infrastructure: ~$1,500/month
- Ops time: Variable
- **Total: ~$1,500-3,000/month**

---

### Elastic Observability

**Components**: Elasticsearch (logs), Kibana (viz), APM, metrics

**Strengths**:
- ✅ Powerful search
- ✅ Mature platform
- ✅ Good for log-heavy use cases

**Weaknesses**:
- ❌ Complex to operate
- ❌ Expensive infrastructure
- ❌ Resource intensive

**Total Cost** (self-hosted, 100 hosts):
- Infrastructure: ~$3,000-5,000/month
- Ops time: High
- **Total: ~$4,000-7,000/month**

---

### New Relic One

**Components**: Metrics, logs, traces, synthetics

**Strengths**:
- ✅ Generous free tier (100GB)
- ✅ User-friendly
- ✅ Good for startups

**Weaknesses**:
- ❌ Costs increase quickly after free tier
- ❌ Vendor lock-in

**Total Cost**:
- Free: up to 100GB/month
- Paid: $0.30/GB beyond 100GB

---

## Cloud Provider Native

### AWS (CloudWatch + X-Ray)

**Use When**:
- Primarily on AWS
- Simple monitoring needs
- Want minimal setup

**Avoid When**:
- Multi-cloud environment
- Need advanced features
- High log volume (expensive)

**Cost** (example):
- 100 EC2 instances with basic metrics: ~$150/month
- 1TB logs: ~$500/month ingestion + storage
- X-Ray: ~$50/month

---

### GCP (Cloud Monitoring + Cloud Trace)

**Use When**:
- Primarily on GCP
- Using GKE
- Want tight GCP integration

**Avoid When**:
- Multi-cloud environment
- Need advanced querying

**Cost** (example):
- First 150MB/month per resource: Free
- Additional: $0.2508/MB

---

### Azure (Azure Monitor)

**Use When**:
- Primarily on Azure
- Using AKS
- Need Azure integration

**Avoid When**:
- Multi-cloud
- Need advanced features

**Cost** (example):
- First 5GB: Free
- Additional: $2.76/GB

---

## Decision Matrix

### Choose Prometheus + Grafana If:
- ✅ Using Kubernetes
- ✅ Want control and customization
- ✅ Have ops capacity
- ✅ Budget-conscious
- ✅ Need Prometheus ecosystem

### Choose Datadog If:
- ✅ Want ease of use
- ✅ Need full observability now
- ✅ Budget allows ($8k+/month for 100 hosts)
- ✅ Limited ops team
- ✅ Need excellent UX

### Choose ELK If:
- ✅ Heavy log analysis needs
- ✅ Need powerful search
- ✅ Have dedicated ops team
- ✅ Compliance requirements
- ✅ Willing to invest in infrastructure

### Choose Grafana Stack (LGTM) If:
- ✅ Want open source full stack
- ✅ Cost-effective solution
- ✅ Cloud-native architecture
- ✅ Already using Prometheus
- ✅ Have some ops capacity

### Choose New Relic If:
- ✅ Startup with free tier
- ✅ APM is priority
- ✅ Want easy setup
- ✅ Don't need heavy customization

### Choose Cloud Native (CloudWatch/etc) If:
- ✅ Single cloud provider
- ✅ Simple needs
- ✅ Want minimal setup
- ✅ Low to medium scale

---

## Cost Comparison

**Example: 100 hosts, 1TB logs/month, 1M spans/day**

| Solution | Monthly Cost | Setup | Ops Burden |
|----------|--------------|-------|------------|
| **Prometheus + Loki + Tempo** | $1,500 | Medium | Medium |
| **Grafana Cloud** | $3,000 | Low | Low |
| **Datadog** | $8,000 | Low | None |
| **New Relic** | $3,500 | Low | None |
| **ELK Stack** | $4,000 | High | High |
| **CloudWatch** | $2,000 | Low | Low |

---

## Recommendations by Company Size

### Startup (< 10 engineers)
**Recommendation**: New Relic or Grafana Cloud
- Minimal ops burden
- Good free tiers
- Easy to get started

### Small Company (10-50 engineers)
**Recommendation**: Prometheus + Grafana + Loki (self-hosted or cloud)
- Cost-effective
- Growing ops capacity
- Flexibility

### Medium Company (50-200 engineers)
**Recommendation**: Datadog or Grafana Stack
- Datadog if budget allows
- Grafana Stack if cost-conscious

### Large Enterprise (200+ engineers)
**Recommendation**: Build an observability platform
- Mix of tools based on needs
- Dedicated observability team
- Custom integrations

references/tracing_guide.md

# Distributed Tracing Guide

## What is Distributed Tracing?

Distributed tracing tracks a request as it flows through multiple services in a distributed system.

### Key Concepts

**Trace**: End-to-end journey of a request
**Span**: Single operation within a trace
**Context**: Metadata propagated between services (trace_id, span_id)

### Example Flow
```
User Request → API Gateway → Auth Service → User Service → Database
                    ↓              ↓              ↓
[Trace ID: abc123]
  Span 1: gateway (50ms)
    Span 2: auth (20ms)
      Span 3: user_service (100ms)
        Span 4: db_query (80ms)

Total: 250ms with waterfall view showing dependencies
```

---

## OpenTelemetry (OTel)

OpenTelemetry is the industry standard for instrumentation.

### Components

**API**: Instrument code (create spans, add attributes)
**SDK**: Implement the API, configure exporters
**Collector**: Receive, process, and export telemetry data
**Exporters**: Send data to backends (Jaeger, Tempo, Zipkin)

### Architecture
```
Application → OTel SDK → OTel Collector → Backend (Jaeger/Tempo)
                                               ↓
                                         Visualization
```
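
The Collector in this architecture is driven by a small YAML pipeline. The sketch below is illustrative rather than part of the original guide: it accepts OTLP from the SDKs, batches spans, and forwards them to a backend; the `tempo:4317` endpoint is an assumed placeholder.

```yaml
# Minimal OpenTelemetry Collector pipeline (sketch).
# Assumes a Tempo/Jaeger-style backend reachable at tempo:4317.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:

exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```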
---

## Instrumentation Examples

### Python (using OpenTelemetry)

**Setup**:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup tracer
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Configure exporter
otlp_exporter = OTLPSpanExporter(endpoint="localhost:4317")
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)
```

**Manual instrumentation**:
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("process_order")
def process_order(order_id):
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)
    span.set_attribute("order.amount", 99.99)

    try:
        result = payment_service.charge(order_id)
        span.set_attribute("payment.status", "success")
        return result
    except Exception as e:
        span.set_status(trace.Status(trace.StatusCode.ERROR))
        span.record_exception(e)
        raise
```

**Auto-instrumentation** (Flask example):
```python
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

# Auto-instrument Flask
FlaskInstrumentor().instrument_app(app)

# Auto-instrument requests library
RequestsInstrumentor().instrument()

# Auto-instrument SQLAlchemy
SQLAlchemyInstrumentor().instrument(engine=db.engine)
```

### Node.js (using OpenTelemetry)

**Setup**:
```javascript
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');

// Setup provider
const provider = new NodeTracerProvider();
const exporter = new OTLPTraceExporter({ url: 'localhost:4317' });
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();
```

**Manual instrumentation**:
```javascript
const { SpanStatusCode } = require('@opentelemetry/api');

const tracer = provider.getTracer('my-service');

async function processOrder(orderId) {
  const span = tracer.startSpan('process_order');
  span.setAttribute('order.id', orderId);

  try {
    const result = await paymentService.charge(orderId);
    span.setAttribute('payment.status', 'success');
    return result;
  } catch (error) {
    span.setStatus({ code: SpanStatusCode.ERROR });
    span.recordException(error);
    throw error;
  } finally {
    span.end();
  }
}
```

**Auto-instrumentation**:
```javascript
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const { MongoDBInstrumentation } = require('@opentelemetry/instrumentation-mongodb');

registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
    new MongoDBInstrumentation()
  ]
});
```

### Go (using OpenTelemetry)

**Setup**:
```go
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/trace"
)

func initTracer() {
    exporter, _ := otlptracegrpc.New(context.Background())
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
    )
    otel.SetTracerProvider(tp)
}
```

**Manual instrumentation**:
```go
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

func processOrder(ctx context.Context, orderID string) error {
    tracer := otel.Tracer("my-service")
    ctx, span := tracer.Start(ctx, "process_order")
    defer span.End()

    span.SetAttributes(
        attribute.String("order.id", orderID),
        attribute.Float64("order.amount", 99.99),
    )

    err := paymentService.Charge(ctx, orderID)
    if err != nil {
        span.RecordError(err)
        return err
    }

    span.SetAttributes(attribute.String("payment.status", "success"))
    return nil
}
```

---

## Span Attributes

### Semantic Conventions

Follow OpenTelemetry semantic conventions for consistency:

**HTTP**:
```python
span.set_attribute("http.method", "GET")
span.set_attribute("http.url", "https://api.example.com/users")
span.set_attribute("http.status_code", 200)
span.set_attribute("http.user_agent", "Mozilla/5.0...")
```

**Database**:
```python
span.set_attribute("db.system", "postgresql")
span.set_attribute("db.name", "users_db")
span.set_attribute("db.statement", "SELECT * FROM users WHERE id = ?")
span.set_attribute("db.operation", "SELECT")
```

**RPC/gRPC**:
```python
span.set_attribute("rpc.system", "grpc")
span.set_attribute("rpc.service", "UserService")
span.set_attribute("rpc.method", "GetUser")
span.set_attribute("rpc.grpc.status_code", 0)
```

**Messaging**:
```python
span.set_attribute("messaging.system", "kafka")
span.set_attribute("messaging.destination", "user-events")
span.set_attribute("messaging.operation", "publish")
span.set_attribute("messaging.message_id", "msg123")
```

### Custom Attributes

Add business context:
```python
span.set_attribute("user.id", "user123")
span.set_attribute("order.id", "ORD-456")
span.set_attribute("feature.flag.checkout_v2", True)
span.set_attribute("cache.hit", False)
```

---

## Context Propagation

### W3C Trace Context (Standard)

Headers propagated between services:
```
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: vendor1=value1,vendor2=value2
```

**Format**: `version-trace_id-parent_span_id-trace_flags`

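To make the field layout concrete, here is a small illustrative Python snippet (not part of the original guide) that splits the example `traceparent` header above into its four fields:

```python
# Illustrative only: split a W3C traceparent header into its fields.
header = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
version, trace_id, parent_span_id, trace_flags = header.split("-")

print(version)         # 00
print(trace_id)        # 0af7651916cd43dd8448eb211c80319c (32 hex chars)
print(parent_span_id)  # b7ad6b7169203331 (16 hex chars)
print(trace_flags)     # 01 (sampled flag set)
```
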
### Implementation

**Python**:
```python
from opentelemetry.propagate import inject, extract
import requests

# Inject context into outgoing request
headers = {}
inject(headers)
requests.get("https://api.example.com", headers=headers)

# Extract context from incoming request
from flask import request
ctx = extract(request.headers)
```

**Node.js**:
```javascript
const { context, propagation } = require('@opentelemetry/api');
const axios = require('axios');

// Inject
const headers = {};
propagation.inject(context.active(), headers);
axios.get('https://api.example.com', { headers });

// Extract (req is the incoming HTTP request, e.g. from Express)
const ctx = propagation.extract(context.active(), req.headers);
```

**HTTP Example**:
```bash
curl -H "traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01" \
  https://api.example.com/users
```

---

## Sampling Strategies

### 1. Always On/Off
```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ALWAYS_ON, ALWAYS_OFF

# Development: trace everything
provider = TracerProvider(sampler=ALWAYS_ON)

# Production: trace nothing (usually not desired)
provider = TracerProvider(sampler=ALWAYS_OFF)
```

### 2. Probability-Based
```python
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Sample 10% of traces
provider = TracerProvider(sampler=TraceIdRatioBased(0.1))
```

### 3. Rate Limiting
```python
from opentelemetry.sdk.trace.sampling import ParentBased, RateLimitingSampler

# Sample max 100 traces per second
sampler = ParentBased(root=RateLimitingSampler(100))
provider = TracerProvider(sampler=sampler)
```

### 4. Parent-Based (Default)
```python
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# If parent span is sampled, sample child spans
sampler = ParentBased(root=TraceIdRatioBased(0.1))
provider = TracerProvider(sampler=sampler)
```

### 5. Custom Sampling
```python
from opentelemetry.sdk.trace.sampling import Decision, Sampler, SamplingResult


class ErrorSampler(Sampler):
    """Always sample errors, sample ~1% of successes"""

    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        attributes = attributes or {}

        # Always sample if the span is flagged as an error
        if attributes.get("error", False):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)

        # Sample ~1% of successes based on the low byte of the trace ID
        if trace_id & 0xFF < 3:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)

        return SamplingResult(Decision.DROP)

    def get_description(self):
        return "ErrorSampler"


provider = TracerProvider(sampler=ErrorSampler())
```

---

## Backends

### Jaeger

**Docker Compose**:
```yaml
version: '3'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # UI
      - "4317:4317"    # OTLP gRPC
      - "4318:4318"    # OTLP HTTP
    environment:
      - COLLECTOR_OTLP_ENABLED=true
```

**Query traces**:
```bash
# UI: http://localhost:16686

# API: Get trace by ID
curl http://localhost:16686/api/traces/abc123

# Search traces
curl "http://localhost:16686/api/traces?service=my-service&limit=20"
```

### Grafana Tempo

**Docker Compose**:
```yaml
version: '3'
services:
  tempo:
    image: grafana/tempo:latest
    ports:
      - "3200:3200"  # Tempo
      - "4317:4317"  # OTLP gRPC
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
    command: ["-config.file=/etc/tempo.yaml"]
```

**tempo.yaml**:
```yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317

storage:
  trace:
    backend: local
    local:
      path: /tmp/tempo/traces
```

**Query in Grafana**:
- Install the Tempo data source
- Use TraceQL: `{ span.http.status_code = 500 }`

### AWS X-Ray

**Configuration**:
```python
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.ext.flask.middleware import XRayMiddleware

xray_recorder.configure(service='my-service')
XRayMiddleware(app, xray_recorder)
```

**Query**:
```bash
aws xray get-trace-summaries \
  --start-time 2024-10-28T00:00:00 \
  --end-time 2024-10-28T23:59:59 \
  --filter-expression 'error = true'
```

---

## Analysis Patterns

### Find Slow Traces
```
# Jaeger UI
- Filter by service
- Set min duration: 1000ms
- Sort by duration

# TraceQL (Tempo)
{ duration > 1s }
```

### Find Error Traces
```
# Jaeger UI
- Filter by tag: error=true
- Or by HTTP status: http.status_code=500

# TraceQL (Tempo)
{ span.http.status_code >= 500 }
```

### Find Traces by User
```
# Jaeger UI
- Filter by tag: user.id=user123

# TraceQL (Tempo)
{ span.user.id = "user123" }
```

### Find N+1 Query Problems
Look for:
- Many sequential database spans
- Same query repeated multiple times
- Pattern: API call → DB query → DB query → DB query... (see the sketch below)

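One way to spot this outside the UI is to count repeated database spans in a single trace. The following Python sketch (not from the original guide) pulls a trace from the Jaeger HTTP API shown earlier and flags operations that repeat many times; the JSON field names (`data`, `spans`, `operationName`) are assumptions based on Jaeger's trace API and may differ in your setup.

```python
# Sketch: flag possible N+1 patterns by counting repeated span operations
# in a single Jaeger trace. Field names are assumptions; adjust as needed.
from collections import Counter

import requests

TRACE_URL = "http://localhost:16686/api/traces/abc123"  # from the Jaeger example above

trace = requests.get(TRACE_URL).json()
spans = trace["data"][0]["spans"]

# Count how often each operation name appears in the trace
counts = Counter(span["operationName"] for span in spans)

for operation, count in counts.most_common():
    if count >= 10:  # arbitrary threshold for "suspiciously repetitive"
        print(f"Possible N+1: '{operation}' appears {count} times in one trace")
```
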
### Find Service Bottlenecks
- Identify spans with longest duration
- Check if time is spent in service logic or waiting for dependencies
- Look at span relationships (parallel vs sequential)

---

## Integration with Logs

### Trace ID in Logs

**Python**:
```python
from opentelemetry import trace

def add_trace_context():
    span = trace.get_current_span()
    trace_id = span.get_span_context().trace_id
    span_id = span.get_span_context().span_id

    return {
        "trace_id": format(trace_id, '032x'),
        "span_id": format(span_id, '016x')
    }

# Assumes a structured logger (e.g. structlog) that accepts keyword arguments
logger.info("Processing order", **add_trace_context(), order_id=order_id)
```

**Query logs for trace**:
```
# Elasticsearch
GET /logs/_search
{
  "query": {
    "match": { "trace_id": "0af7651916cd43dd8448eb211c80319c" }
  }
}

# Loki (LogQL)
{job="app"} |= "0af7651916cd43dd8448eb211c80319c"
```

### Trace from Log (Grafana)

Configure derived fields in Grafana:
```yaml
datasources:
  - name: Loki
    type: loki
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: "trace_id=([\\w]+)"
          url: "http://tempo:3200/trace/$${__value.raw}"
          datasourceUid: tempo_uid
```

---

## Best Practices

### 1. Span Naming
✅ Use operation names, not IDs
- Good: `GET /api/users`, `UserService.GetUser`, `db.query.users`
- Bad: `/api/users/123`, `span_abc`, `query_1`

### 2. Span Granularity
✅ One span per logical operation
- Too coarse: one span for the entire request
- Too fine: a span for every variable assignment
- Just right: a span per service call, database query, or external API call

### 3. Add Context
Always include:
- Operation name
- Service name
- Error status
- Business identifiers (user_id, order_id)

### 4. Handle Errors
```python
try:
    result = operation()
except Exception as e:
    span.set_status(trace.Status(trace.StatusCode.ERROR))
    span.record_exception(e)
    raise
```

### 5. Sampling Strategy
- Development: 100%
- Staging: 50-100%
- Production: 1-10% (or error-based)

### 6. Performance Impact
- Overhead: ~1-5% CPU
- Use async exporters
- Batch span exports (see the tuning sketch below)
- Sample appropriately

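As a rough illustration of the batching advice above, the Python SDK's `BatchSpanProcessor` exposes queue and batch-size knobs; the values below are the SDK defaults used as placeholders to tune for your traffic, not recommendations from this guide.

```python
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="localhost:4317"),
        max_queue_size=2048,          # spans buffered in memory before drops occur
        schedule_delay_millis=5000,   # how often the queued batch is flushed
        max_export_batch_size=512,    # spans sent per export request
    )
)
```
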
### 7. Cardinality
Avoid high-cardinality attributes:
- ❌ Email addresses
- ❌ Full URLs with unique IDs
- ❌ Timestamps
- ✅ User ID
- ✅ Endpoint pattern
- ✅ Status code

---

## Common Issues

### Missing Traces
**Cause**: Context not propagated
**Solution**: Verify headers are injected/extracted

### Incomplete Traces
**Cause**: Spans not closed properly
**Solution**: Always use `defer span.End()` or context managers (see the sketch below)

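For Python, the context-manager form is a minimal sketch of what "closed properly" looks like: the span is ended even if the block raises. `order_id` and `payment_service` stand in for the objects from the earlier instrumentation example.

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# The with-block guarantees the span is ended, even on exceptions
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    result = payment_service.charge(order_id)
```
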
### High Overhead
**Cause**: Too many spans or synchronous export
**Solution**: Reduce span count, use batch processor

### No Error Traces
**Cause**: Errors not recorded on spans
**Solution**: Call `span.record_exception()` and set error status

---

## Metrics from Traces

Generate RED metrics from trace data:

**Rate**: Traces per second
**Errors**: Traces with error status
**Duration**: Span duration percentiles

**Example** (using Tempo + Prometheus):
```yaml
# Generate metrics from spans
metrics_generator:
  processor:
    span_metrics:
      dimensions:
        - http.method
        - http.status_code
```

**Query**:
```promql
# Request rate
rate(traces_spanmetrics_calls_total[5m])

# Error rate
rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m])
/
rate(traces_spanmetrics_calls_total[5m])

# P95 latency
histogram_quantile(0.95, sum(rate(traces_spanmetrics_latency_bucket[5m])) by (le))
```