Initial commit

Zhongwei Li
2025-11-29 17:51:22 +08:00
commit 23753b435e
24 changed files with 9837 additions and 0 deletions

# Alerting Best Practices
## Core Principles
### 1. Every Alert Should Be Actionable
If you can't do something about it, don't alert on it.
❌ Bad: `Alert: CPU > 50%` (What action should be taken?)
✅ Good: `Alert: API latency p95 > 2s for 10m` (Investigate/scale up)
### 2. Alert on Symptoms, Not Causes
Alert on what users experience, not underlying components.
❌ Bad: `Database connection pool 80% full`
✅ Good: `Request latency p95 > 1s` (which might be caused by DB pool)
### 3. Alert on SLO Violations
Tie alerts to Service Level Objectives.
`Error rate exceeds 0.1% (SLO: 99.9% availability)`
### 4. Reduce Noise
Alert fatigue is real. Only page for critical issues.
**Alert Severity Levels**:
- **Critical**: Page on-call immediately (user-facing issue)
- **Warning**: Create ticket, review during business hours
- **Info**: Log for awareness, no action needed
---
## Alert Design Patterns
### Pattern 1: Multi-Window Multi-Burn-Rate
Google's recommended SLO alerting approach.
**Concept**: Alert when error budget burn rate is high enough to exhaust the budget too quickly.
```yaml
# Fast burn (consumes ~2% of the 30-day error budget in 1 hour)
- alert: FastBurnRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
> (14.4 * 0.001) # 14.4x burn rate for 99.9% SLO
for: 2m
labels:
severity: critical
# Slow burn (consumes ~5% of the 30-day error budget in 6 hours)
- alert: SlowBurnRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[6h]))
/
sum(rate(http_requests_total[6h]))
> (6 * 0.001) # 6x burn rate for 99.9% SLO
for: 30m
labels:
severity: warning
```
**Burn Rate Multipliers for 99.9% SLO (0.1% error budget)**:
- 1 hour window, 2m grace: 14.4x burn rate
- 6 hour window, 30m grace: 6x burn rate
- 3 day window, 6h grace: 1x burn rate
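These multipliers follow from simple arithmetic; a minimal sketch (assuming the standard 30-day SLO window) shows where the numbers come from:
```python
# Burn-rate multiplier = (fraction of error budget consumed) / (alert window / SLO window).
# Assumes a 30-day SLO window; the results match the table above.
SLO_WINDOW_HOURS = 30 * 24

def burn_rate(budget_fraction: float, window_hours: float) -> float:
    return budget_fraction / (window_hours / SLO_WINDOW_HOURS)

print(burn_rate(0.02, 1))   # ~14.4x (fast burn)
print(burn_rate(0.05, 6))   # 6x    (slow burn)
print(burn_rate(0.10, 72))  # 1x    (3-day window)
```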
### Pattern 2: Rate of Change
Alert when metrics change rapidly.
```yaml
- alert: TrafficSpike
expr: |
sum(rate(http_requests_total[5m]))
>
1.5 * sum(rate(http_requests_total[5m] offset 1h))
for: 10m
annotations:
summary: "Traffic increased by 50% compared to 1 hour ago"
```
### Pattern 3: Threshold with Hysteresis
Prevent flapping with different thresholds for firing and resolving.
```yaml
# Fire at 90%; keep firing until usage drops below 70%.
# Prometheus has no built-in hysteresis, so a common workaround is to
# reference the alert's own ALERTS series in the expression.
# (Adjust the on(...) labels to match your metric's labels.)
- alert: HighCPU
  expr: |
    cpu_usage > 90
    or
    (cpu_usage > 70 and on(instance) ALERTS{alertname="HighCPU", alertstate="firing"} == 1)
  for: 5m
```
### Pattern 4: Absent Metrics
Alert when expected metrics stop being reported (service down).
```yaml
- alert: ServiceDown
expr: absent(up{job="my-service"})
for: 5m
labels:
severity: critical
annotations:
summary: "Service {{ $labels.job }} is not reporting metrics"
```
### Pattern 5: Aggregate Alerts
Alert on aggregate performance across multiple instances.
```yaml
- alert: HighOverallErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.05
for: 10m
annotations:
summary: "Overall error rate is {{ $value | humanizePercentage }}"
```
---
## Alert Annotation Best Practices
### Required Fields
**summary**: One-line description of the issue
```yaml
summary: "High error rate on {{ $labels.service }}: {{ $value | humanizePercentage }}"
```
**description**: Detailed explanation with context
```yaml
description: |
Error rate on {{ $labels.service }} is {{ $value | humanizePercentage }},
which exceeds the threshold of 1% for more than 10 minutes.
Current value: {{ $value }}
Runbook: https://runbooks.example.com/high-error-rate
```
**runbook_url**: Link to investigation steps
```yaml
runbook_url: "https://runbooks.example.com/alerts/{{ $labels.alertname }}"
```
### Optional but Recommended
**dashboard**: Link to relevant dashboard
```yaml
dashboard: "https://grafana.example.com/d/service-dashboard?var-service={{ $labels.service }}"
```
**logs**: Link to logs
```yaml
logs: "https://kibana.example.com/app/discover#/?_a=(query:(query_string:(query:'service:{{ $labels.service }}')))"
```
---
## Alert Label Best Practices
### Required Labels
**severity**: Critical, warning, or info
```yaml
labels:
severity: critical
```
**team**: Who should handle this alert
```yaml
labels:
team: platform
severity: critical
```
**component**: What part of the system
```yaml
labels:
component: api-gateway
severity: warning
```
### Example Complete Alert
```yaml
- alert: HighLatency
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
) > 1
for: 10m
labels:
severity: warning
team: backend
component: api
environment: "{{ $labels.environment }}"
annotations:
summary: "High latency on {{ $labels.service }}"
description: |
P95 latency on {{ $labels.service }} is {{ $value }}s, exceeding 1s threshold.
This may impact user experience. Check recent deployments and database performance.
Current p95: {{ $value }}s
Threshold: 1s
Duration: 10m+
runbook_url: "https://runbooks.example.com/high-latency"
dashboard: "https://grafana.example.com/d/api-dashboard"
logs: "https://kibana.example.com/app/discover#/?_a=(query:(query_string:(query:'service:{{ $labels.service }} AND level:error')))"
```
---
## Alert Thresholds
### General Guidelines
**Response Time / Latency**:
- Warning: p95 > 500ms or p99 > 1s
- Critical: p95 > 2s or p99 > 5s
**Error Rate**:
- Warning: > 1%
- Critical: > 5%
**Availability**:
- Warning: < 99.9%
- Critical: < 99.5%
**CPU Utilization**:
- Warning: > 70% for 15m
- Critical: > 90% for 5m
**Memory Utilization**:
- Warning: > 80% for 15m
- Critical: > 95% for 5m
**Disk Space**:
- Warning: > 80% full
- Critical: > 90% full
**Queue Depth**:
- Warning: > 70% of max capacity
- Critical: > 90% of max capacity
### Application-Specific Thresholds
Set thresholds based on:
1. **Historical performance**: Use p95 of last 30 days + 20%
2. **SLO requirements**: If SLO is 99.9%, alert at 99.5%
3. **Business impact**: What error rate causes user complaints?
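For guideline 1 above, a minimal sketch of deriving a warning threshold from historical latency data (the sample values and percentile choice are hypothetical):
```python
import statistics

# Minimal sketch: p95 of the last 30 days of latency samples plus 20% headroom.
# `latency_samples_ms` is a hypothetical list pulled from your metrics backend.
latency_samples_ms = [120, 180, 95, 240, 310, 150, 205, 175, 260, 330]
p95 = statistics.quantiles(latency_samples_ms, n=100)[94]
warning_threshold_ms = p95 * 1.2
print(round(warning_threshold_ms, 1))
```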
---
## The "for" Clause
Prevent alert flapping by requiring the condition to be true for a duration.
### Guidelines
**Critical alerts**: Short duration (2-5m)
```yaml
- alert: ServiceDown
expr: up == 0
for: 2m # Quick detection for critical issues
```
**Warning alerts**: Longer duration (10-30m)
```yaml
- alert: HighMemoryUsage
expr: memory_usage > 80
for: 15m # Avoid noise from temporary spikes
```
**Resource saturation**: Medium duration (5-10m)
```yaml
- alert: HighCPU
expr: cpu_usage > 90
for: 5m
```
---
## Alert Routing
### Severity-Based Routing
```yaml
# alertmanager.yml
route:
group_by: ['alertname', 'cluster']
group_wait: 10s
group_interval: 5m
repeat_interval: 4h
receiver: 'default'
routes:
# Critical alerts → PagerDuty
- match:
severity: critical
receiver: pagerduty
group_wait: 10s
repeat_interval: 5m
# Warning alerts → Slack
- match:
severity: warning
receiver: slack
group_wait: 30s
repeat_interval: 12h
# Info alerts → Email
- match:
severity: info
receiver: email
repeat_interval: 24h
```
### Team-Based Routing
```yaml
routes:
# Platform team
- match:
team: platform
receiver: platform-pagerduty
# Backend team
- match:
team: backend
receiver: backend-slack
# Database team
- match:
component: database
receiver: dba-pagerduty
```
### Time-Based Routing
```yaml
# Only page during business hours for non-critical
routes:
- match:
severity: warning
receiver: slack
active_time_intervals:
- business_hours
time_intervals:
- name: business_hours
time_intervals:
- weekdays: ['monday:friday']
times:
- start_time: '09:00'
end_time: '17:00'
location: 'America/New_York'
```
---
## Alert Grouping
### Intelligent Grouping
**Group by service and environment**:
```yaml
route:
group_by: ['alertname', 'service', 'environment']
group_wait: 30s
group_interval: 5m
```
This grouping:
- Collapses 50 "HighCPU" alerts on different pods into one grouped notification
- Keeps production and staging alerts from being mixed together
### Inhibition Rules
Suppress related alerts when a parent alert fires.
```yaml
inhibit_rules:
# If service is down, suppress latency alerts
- source_match:
alertname: ServiceDown
target_match:
alertname: HighLatency
equal: ['service']
# If node is down, suppress all pod alerts on that node
- source_match:
alertname: NodeDown
target_match_re:
alertname: '(PodCrashLoop|HighCPU|HighMemory)'
equal: ['node']
```
---
## Runbook Structure
Every alert should link to a runbook with:
### 1. Context
- What does this alert mean?
- What is the user impact?
- What is the urgency?
### 2. Investigation Steps
```markdown
## Investigation
1. Check service health dashboard
https://grafana.example.com/d/service-dashboard
2. Check recent deployments
kubectl rollout history deployment/myapp -n production
3. Check error logs
kubectl logs deployment/myapp -n production --tail=100 | grep ERROR
4. Check dependencies
- Database: Check slow query log
- Redis: Check memory usage
- External APIs: Check status pages
```
### 3. Common Causes
```markdown
## Common Causes
- **Recent deployment**: Check if alert started after deployment
- **Traffic spike**: Check request rate, might need to scale
- **Database issues**: Check query performance and connection pool
- **External API degradation**: Check third-party status pages
```
### 4. Resolution Steps
```markdown
## Resolution
### Immediate Actions (< 5 minutes)
1. Scale up if traffic spike: `kubectl scale deployment myapp --replicas=10`
2. Rollback if recent deployment: `kubectl rollout undo deployment/myapp`
### Short-term Actions (< 30 minutes)
1. Restart pods if memory leak: `kubectl rollout restart deployment/myapp`
2. Clear cache if stale data: `redis-cli -h cache.example.com FLUSHDB`
### Long-term Actions (post-incident)
1. Review and optimize slow queries
2. Implement circuit breakers
3. Add more capacity
4. Update alert thresholds if false positive
```
### 5. Escalation
```markdown
## Escalation
If issue persists after 30 minutes:
- Slack: #backend-oncall
- PagerDuty: Escalate to senior engineer
- Incident Commander: Jane Doe (jane@example.com)
```
---
## Anti-Patterns to Avoid
### 1. Alert on Everything
❌ Don't: Alert on every warning log
✅ Do: Alert on error rate exceeding threshold
### 2. Alert Without Context
❌ Don't: "Error rate high"
✅ Do: "Error rate 5.2% exceeds 1% threshold for 10m, impacting checkout flow"
### 3. Static Thresholds for Dynamic Systems
❌ Don't: `cpu_usage > 70` (fails during scale-up)
✅ Do: Alert on SLO violations or rate of change
### 4. No "for" Clause
❌ Don't: Alert immediately on threshold breach
✅ Do: Use `for: 5m` to avoid flapping
### 5. Too Many Recipients
❌ Don't: Page 10 people for every alert
✅ Do: Route to specific on-call rotation
### 6. Duplicate Alerts
❌ Don't: Alert on both cause and symptom
✅ Do: Alert on symptom, use inhibition for causes
### 7. No Runbook
❌ Don't: Alert without guidance
✅ Do: Include runbook_url in every alert
---
## Alert Testing
### Test Alert Firing
```bash
# Send a test alert to Alertmanager with amtool
amtool alert add alertname="TestAlert" severity="warning" \
  --annotation=summary="Test alert" \
  --alertmanager.url=http://alertmanager:9093
# Or use the Alertmanager API directly
curl -X POST http://alertmanager:9093/api/v1/alerts \
-d '[{
"labels": {"alertname": "TestAlert", "severity": "critical"},
"annotations": {"summary": "Test critical alert"}
}]'
```
### Verify Alert Rules
```bash
# Check syntax
promtool check rules alerts.yml
# Test expression
promtool query instant http://prometheus:9090 \
'sum(rate(http_requests_total{status=~"5.."}[5m]))'
# Unit test alerts
promtool test rules test.yml
```
### Test Alertmanager Routing
```bash
# Test which receiver an alert would go to
amtool config routes test \
--config.file=alertmanager.yml \
alertname="HighLatency" \
severity="critical" \
team="backend"
```
---
## On-Call Best Practices
### Rotation Schedule
- **Primary on-call**: First responder
- **Secondary on-call**: Escalation backup
- **Rotation length**: 1 week (balance load vs context)
- **Handoff**: Monday morning (not Friday evening)
### On-Call Checklist
```markdown
## Pre-shift
- [ ] Test pager/phone
- [ ] Review recent incidents
- [ ] Check upcoming deployments
- [ ] Update contact info
## During shift
- [ ] Respond to pages within 5 minutes
- [ ] Document all incidents
- [ ] Update runbooks if gaps found
- [ ] Communicate in #incidents channel
## Post-shift
- [ ] Hand off open incidents
- [ ] Complete incident reports
- [ ] Suggest improvements
- [ ] Update team documentation
```
### Escalation Policy
1. **Primary**: Responds within 5 minutes
2. **Secondary**: Auto-escalate after 15 minutes
3. **Manager**: Auto-escalate after 30 minutes
4. **Incident Commander**: Critical incidents only
---
## Metrics About Alerts
Monitor your monitoring system!
### Key Metrics
```promql
# Alert firing frequency
sum(ALERTS{alertstate="firing"}) by (alertname)
# Alert duration
ALERTS_FOR_STATE{alertstate="firing"}
# Alerts per severity
sum(ALERTS{alertstate="firing"}) by (severity)
# Time to acknowledge (from PagerDuty/etc)
pagerduty_incident_ack_duration_seconds
```
### Alert Quality Metrics
- **Mean Time to Acknowledge (MTTA)**: < 5 minutes
- **Mean Time to Resolve (MTTR)**: < 30 minutes
- **False Positive Rate**: < 10%
- **Alert Coverage**: > 80% of incidents preceded by an alert
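These quality metrics can be computed from incident records; a minimal sketch, assuming hypothetical record fields exported from your paging tool:
```python
from datetime import datetime, timedelta

# Minimal sketch: compute MTTA, MTTR, and false-positive rate from incident records.
# The record fields (triggered/acknowledged/resolved/false_positive) are hypothetical.
incidents = [
    {"triggered": datetime(2024, 10, 1, 10, 0), "acknowledged": datetime(2024, 10, 1, 10, 3),
     "resolved": datetime(2024, 10, 1, 10, 25), "false_positive": False},
    {"triggered": datetime(2024, 10, 2, 14, 0), "acknowledged": datetime(2024, 10, 2, 14, 6),
     "resolved": datetime(2024, 10, 2, 14, 10), "false_positive": True},
]

mtta = sum((i["acknowledged"] - i["triggered"] for i in incidents), timedelta()) / len(incidents)
mttr = sum((i["resolved"] - i["triggered"] for i in incidents), timedelta()) / len(incidents)
false_positive_rate = sum(i["false_positive"] for i in incidents) / len(incidents)

print(mtta, mttr, f"{false_positive_rate:.0%}")
```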

# Migrating from Datadog to Open-Source Stack
## Overview
This guide helps you migrate from Datadog to a cost-effective open-source observability stack:
- **Metrics**: Datadog → Prometheus + Grafana
- **Logs**: Datadog → Loki + Grafana
- **Traces**: Datadog APM → Tempo/Jaeger + Grafana
- **Dashboards**: Datadog → Grafana
- **Alerts**: Datadog Monitors → Prometheus Alertmanager
**Estimated Cost Savings**: 60-80% for similar functionality
---
## Cost Comparison
### Example: 100-host infrastructure
**Datadog**:
- Infrastructure Pro: $1,500/month (100 hosts × $15)
- Custom Metrics: $50/month (5,000 extra metrics beyond included 10,000)
- Logs: $2,000/month (~20GB/day, ingestion plus indexing charges)
- APM: $3,100/month (100 hosts × $31)
- **Total**: ~$6,650/month ($79,800/year)
**Open-Source Stack** (self-hosted):
- Infrastructure: $1,200/month (EC2/GKE for Prometheus, Grafana, Loki, Tempo)
- Storage: $300/month (S3/GCS for long-term metrics and traces)
- Operations time: Variable
- **Total**: ~$1,500-2,500/month ($18,000-30,000/year)
**Savings**: $49,800-61,800/year
---
## Migration Strategy
### Phase 1: Run Parallel (Month 1-2)
- Deploy open-source stack alongside Datadog
- Migrate metrics first (lowest risk)
- Validate data accuracy
- Build confidence
### Phase 2: Migrate Dashboards & Alerts (Month 2-3)
- Convert Datadog dashboards to Grafana
- Translate alert rules
- Train team on new tools
### Phase 3: Migrate Logs & Traces (Month 3-4)
- Set up Loki for log aggregation
- Deploy Tempo/Jaeger for tracing
- Update application instrumentation
### Phase 4: Decommission Datadog (Month 4-5)
- Confirm all functionality migrated
- Cancel Datadog subscription
- Archive Datadog dashboards/alerts for reference
---
## 1. Metrics Migration (Datadog → Prometheus)
### Step 1: Deploy Prometheus
**Kubernetes** (recommended):
```yaml
# prometheus-values.yaml
prometheus:
prometheusSpec:
retention: 30d
storageSpec:
volumeClaimTemplate:
spec:
resources:
requests:
storage: 100Gi
# Scrape configs
additionalScrapeConfigs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
```
**Install**:
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack -f prometheus-values.yaml
```
**Docker Compose**:
```yaml
version: '3'
services:
prometheus:
image: prom/prometheus:latest
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus-data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.retention.time=30d'
volumes:
prometheus-data:
```
### Step 2: Replace DogStatsD with Prometheus Exporters
**Before (DogStatsD)**:
```python
from datadog import statsd
statsd.increment('page.views')
statsd.histogram('request.duration', 0.5)
statsd.gauge('active_users', 100)
```
**After (Prometheus Python client)**:
```python
from prometheus_client import Counter, Histogram, Gauge
page_views = Counter('page_views_total', 'Page views')
request_duration = Histogram('request_duration_seconds', 'Request duration')
active_users = Gauge('active_users', 'Active users')
# Usage
page_views.inc()
request_duration.observe(0.5)
active_users.set(100)
```
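Unlike DogStatsD, which pushes packets to an agent, Prometheus pulls metrics, so the process must expose an HTTP endpoint. A minimal sketch using `start_http_server` from `prometheus_client` (port 8000 is an arbitrary choice):
```python
import random
import time

from prometheus_client import Counter, start_http_server

page_views = Counter('page_views_total', 'Page views')

if __name__ == "__main__":
    # Prometheus scrapes this /metrics endpoint; add the port to your scrape configs
    start_http_server(8000)
    while True:
        page_views.inc()
        time.sleep(random.uniform(0.1, 1.0))
```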
### Step 3: Metric Name Translation
| Datadog Metric | Prometheus Equivalent |
|----------------|----------------------|
| `system.cpu.idle` | `node_cpu_seconds_total{mode="idle"}` |
| `system.mem.free` | `node_memory_MemFree_bytes` |
| `system.disk.used` | `node_filesystem_size_bytes - node_filesystem_free_bytes` |
| `docker.cpu.usage` | `container_cpu_usage_seconds_total` |
| `kubernetes.pods.running` | `kube_pod_status_phase{phase="Running"}` |
### Step 4: Export Existing Datadog Metrics (Optional)
Use Datadog API to export historical data:
```python
import time
from datadog import api, initialize
options = {
'api_key': 'YOUR_API_KEY',
'app_key': 'YOUR_APP_KEY'
}
initialize(**options)
# Query metric
result = api.Metric.query(
start=int(time.time() - 86400), # Last 24h
end=int(time.time()),
query='avg:system.cpu.user{*}'
)
# Convert to Prometheus format and import
```
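One hedged sketch of that last step, continuing from the `result` variable above: write the exported points as OpenMetrics text and backfill them with `promtool tsdb create-blocks-from openmetrics` (the metric name below is illustrative; the response shape follows the Datadog API result above):
```python
# Write Datadog points as OpenMetrics text, then backfill into Prometheus with:
#   promtool tsdb create-blocks-from openmetrics datadog_backfill.om <prometheus-data-dir>
with open("datadog_backfill.om", "w") as f:
    for series in result["series"]:
        for ts_ms, value in series["pointlist"]:
            if value is not None:
                # OpenMetrics backfill expects timestamps in seconds
                f.write(f"datadog_system_cpu_user {value} {ts_ms / 1000:.3f}\n")
    f.write("# EOF\n")
```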
---
## 2. Dashboard Migration (Datadog → Grafana)
### Step 1: Export Datadog Dashboards
```python
import requests
import json
api_key = "YOUR_API_KEY"
app_key = "YOUR_APP_KEY"
headers = {
'DD-API-KEY': api_key,
'DD-APPLICATION-KEY': app_key
}
# Get all dashboards
response = requests.get(
'https://api.datadoghq.com/api/v1/dashboard',
headers=headers
)
dashboards = response.json()
# Export each dashboard
for dashboard in dashboards['dashboards']:
dash_id = dashboard['id']
detail = requests.get(
f'https://api.datadoghq.com/api/v1/dashboard/{dash_id}',
headers=headers
).json()
with open(f'datadog_{dash_id}.json', 'w') as f:
json.dump(detail, f, indent=2)
```
### Step 2: Convert to Grafana Format
**Manual Conversion Template**:
| Datadog Widget | Grafana Panel Type |
|----------------|-------------------|
| Timeseries | Graph / Time series |
| Query Value | Stat |
| Toplist | Table / Bar gauge |
| Heatmap | Heatmap |
| Distribution | Histogram |
**Automated Conversion** (basic example):
```python
# Map Datadog widget types to Grafana panel types (see the table above)
WIDGET_TYPE_MAP = {"timeseries": "timeseries", "query_value": "stat",
                   "toplist": "table", "heatmap": "heatmap", "distribution": "histogram"}

def map_widget_type(widget_type):
    return WIDGET_TYPE_MAP.get(widget_type, "timeseries")

def convert_queries(requests):
    # Queries still need DQL -> PromQL translation (see dql_promql_translation.md)
    return [{"expr": req.get("q", "")} for req in requests]

def convert_datadog_to_grafana(datadog_dashboard):
grafana_dashboard = {
"title": datadog_dashboard['title'],
"panels": []
}
for widget in datadog_dashboard['widgets']:
panel = {
"title": widget['definition'].get('title', ''),
"type": map_widget_type(widget['definition']['type']),
"targets": convert_queries(widget['definition']['requests'])
}
grafana_dashboard['panels'].append(panel)
return grafana_dashboard
```
### Step 3: Common Query Translations
See `dql_promql_translation.md` for comprehensive query mappings.
**Example conversions**:
```
Datadog: avg:system.cpu.user{*}
Prometheus: avg(rate(node_cpu_seconds_total{mode="user"}[5m])) * 100
Datadog: sum:requests.count{status:200}.as_rate()
Prometheus: sum(rate(http_requests_total{status="200"}[5m]))
Datadog: p95:request.duration{*}
Prometheus: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
---
## 3. Alert Migration (Datadog Monitors → Prometheus Alerts)
### Step 1: Export Datadog Monitors
```python
import requests
import json
api_key = "YOUR_API_KEY"
app_key = "YOUR_APP_KEY"
headers = {
'DD-API-KEY': api_key,
'DD-APPLICATION-KEY': app_key
}
response = requests.get(
'https://api.datadoghq.com/api/v1/monitor',
headers=headers
)
monitors = response.json()
# Save each monitor
for monitor in monitors:
with open(f'monitor_{monitor["id"]}.json', 'w') as f:
json.dump(monitor, f, indent=2)
```
### Step 2: Convert to Prometheus Alert Rules
**Datadog Monitor**:
```json
{
"name": "High CPU Usage",
"type": "metric alert",
"query": "avg(last_5m):avg:system.cpu.user{*} > 80",
"message": "CPU usage is high on {{host.name}}"
}
```
**Prometheus Alert**:
```yaml
groups:
- name: infrastructure
rules:
- alert: HighCPUUsage
expr: |
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value }}%"
```
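A rough sketch of scripting this conversion for many exported monitors; it only produces a rule-file skeleton, and each `expr` still has to be translated by hand (file names follow the export script above, `pyyaml` is assumed):
```python
import glob
import json

import yaml  # pip install pyyaml

# Minimal sketch: turn exported Datadog monitors into a Prometheus rule-file skeleton.
rules = []
for path in glob.glob("monitor_*.json"):
    with open(path) as f:
        monitor = json.load(f)
    rules.append({
        "alert": monitor["name"].replace(" ", ""),
        "expr": f'# TODO translate: {monitor["query"]}',  # translate manually
        "for": "5m",
        "labels": {"severity": "warning"},
        "annotations": {"summary": monitor.get("message", "")},
    })

with open("converted_rules.yml", "w") as f:
    yaml.safe_dump({"groups": [{"name": "migrated-from-datadog", "rules": rules}]}, f)
```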
### Step 3: Alert Routing (Datadog → Alertmanager)
**Datadog notification channels** map to **Alertmanager receivers**:
```yaml
# alertmanager.yml
route:
group_by: ['alertname', 'severity']
receiver: 'slack-notifications'
receivers:
- name: 'slack-notifications'
slack_configs:
- api_url: 'YOUR_SLACK_WEBHOOK'
channel: '#alerts'
- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_KEY'
```
---
## 4. Log Migration (Datadog → Loki)
### Step 1: Deploy Loki
**Kubernetes**:
```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
--set loki.persistence.enabled=true \
--set loki.persistence.size=100Gi \
--set promtail.enabled=true
```
**Docker Compose**:
```yaml
version: '3'
services:
loki:
image: grafana/loki:latest
ports:
- "3100:3100"
volumes:
- ./loki-config.yaml:/etc/loki/local-config.yaml
- loki-data:/loki
promtail:
image: grafana/promtail:latest
volumes:
- /var/log:/var/log
- ./promtail-config.yaml:/etc/promtail/config.yml
volumes:
loki-data:
```
### Step 2: Replace Datadog Log Forwarder
**Before (Datadog Agent)**:
```yaml
# datadog.yaml
logs_enabled: true
logs_config:
container_collect_all: true
```
**After (Promtail)**:
```yaml
# promtail-config.yaml
server:
http_listen_port: 9080
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: system
static_configs:
- targets:
- localhost
labels:
job: varlogs
__path__: /var/log/*.log
```
### Step 3: Query Translation
**Datadog Logs Query**:
```
service:my-app status:error
```
**Loki LogQL**:
```logql
{job="my-app", level="error"}
```
**More examples**:
```
Datadog: service:api-gateway status:error @http.status_code:>=500
Loki: {service="api-gateway", level="error"} | json | http_status_code >= 500
Datadog: source:nginx "404"
Loki: {source="nginx"} |= "404"
```
---
## 5. APM Migration (Datadog APM → Tempo/Jaeger)
### Step 1: Choose Tracing Backend
- **Tempo**: Better for high volume, cheaper storage (object storage)
- **Jaeger**: More mature, better UI, requires separate storage
### Step 2: Replace Datadog Tracer with OpenTelemetry
**Before (Datadog Python)**:
```python
from ddtrace import tracer
@tracer.wrap()
def my_function():
pass
```
**After (OpenTelemetry)**:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup: register a provider that exports spans to the OTLP endpoint (Tempo)
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="tempo:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("my_function")
def my_function():
    pass
```
### Step 3: Deploy Tempo
```yaml
# tempo.yaml
server:
http_listen_port: 3200
distributor:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
storage:
trace:
backend: s3
s3:
bucket: tempo-traces
endpoint: s3.amazonaws.com
```
---
## 6. Infrastructure Migration
### Recommended Architecture
```
┌─────────────────────────────────────────┐
│ Grafana (Visualization) │
│ - Dashboards │
│ - Unified view │
└─────────────────────────────────────────┘
↓ ↓ ↓
┌──────────────┐ ┌──────────┐ ┌──────────┐
│ Prometheus │ │ Loki │ │ Tempo │
│ (Metrics) │ │ (Logs) │ │ (Traces) │
└──────────────┘ └──────────┘ └──────────┘
↓ ↓ ↓
┌─────────────────────────────────────────┐
│ Applications (OpenTelemetry) │
└─────────────────────────────────────────┘
```
### Sizing Recommendations
**100-host environment**:
- **Prometheus**: 2-4 CPU, 8-16GB RAM, 100GB SSD
- **Grafana**: 1 CPU, 2GB RAM
- **Loki**: 2-4 CPU, 8GB RAM, 100GB SSD
- **Tempo**: 2-4 CPU, 8GB RAM, S3 for storage
- **Alertmanager**: 1 CPU, 1GB RAM
**Total**: ~8-16 CPU, 32-64GB RAM, 200GB SSD + object storage
---
## 7. Migration Checklist
### Pre-Migration
- [ ] Calculate current Datadog costs
- [ ] Identify all Datadog integrations
- [ ] Export all dashboards
- [ ] Export all monitors
- [ ] Document custom metrics
- [ ] Get stakeholder approval
### During Migration
- [ ] Deploy Prometheus + Grafana
- [ ] Deploy Loki + Promtail
- [ ] Deploy Tempo/Jaeger (if using APM)
- [ ] Migrate metrics instrumentation
- [ ] Convert dashboards (top 10 critical first)
- [ ] Convert alerts (critical alerts first)
- [ ] Update application logging
- [ ] Replace APM instrumentation
- [ ] Run parallel for 2-4 weeks
- [ ] Validate data accuracy
- [ ] Train team on new tools
### Post-Migration
- [ ] Decommission Datadog agent from all hosts
- [ ] Cancel Datadog subscription
- [ ] Archive Datadog configs
- [ ] Document new workflows
- [ ] Create runbooks for common tasks
---
## 8. Common Challenges & Solutions
### Challenge: Missing Datadog Features
**Datadog Synthetic Monitoring**:
- Solution: Use **Blackbox Exporter** (Prometheus) or **Grafana Synthetic Monitoring**
**Datadog Network Performance Monitoring**:
- Solution: Use **Cilium Hubble** (Kubernetes) or **eBPF-based tools**
**Datadog RUM (Real User Monitoring)**:
- Solution: Use **Grafana Faro** or **OpenTelemetry Browser SDK**
### Challenge: Team Learning Curve
**Solution**:
- Provide training sessions (2-3 hours per tool)
- Create internal documentation with examples
- Set up sandbox environment for practice
- Assign champions for each tool
### Challenge: Query Performance
**Prometheus too slow**:
- Use **Thanos** or **Cortex** for scaling
- Implement recording rules for expensive queries
- Increase retention only where needed
**Loki too slow**:
- Add more labels for better filtering
- Use chunk caching
- Consider **parallel query execution**
---
## 9. Maintenance Comparison
### Datadog (Managed)
- **Ops burden**: Low (fully managed)
- **Upgrades**: Automatic
- **Scaling**: Automatic
- **Cost**: High ($6k-10k+/month)
### Open-Source Stack (Self-hosted)
- **Ops burden**: Medium (requires ops team)
- **Upgrades**: Manual (quarterly)
- **Scaling**: Manual planning required
- **Cost**: Low ($1.5k-3k/month infrastructure)
**Hybrid Option**: Use **Grafana Cloud** (managed Prometheus/Loki/Tempo)
- Cost: ~$3k/month for 100 hosts
- Ops burden: Low
- Savings: ~50% vs Datadog
---
## 10. ROI Calculation
### Example Scenario
**Before (Datadog)**:
- Monthly cost: $7,000
- Annual cost: $84,000
**After (Self-hosted OSS)**:
- Infrastructure: $1,800/month
- Operations (0.5 FTE): $4,000/month
- Annual cost: $69,600
**Savings**: $14,400/year
**After (Grafana Cloud)**:
- Monthly cost: $3,500
- Annual cost: $42,000
**Savings**: $42,000/year (50%)
**Break-even**: Immediate (no migration costs beyond engineering time)
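The same figures as a quick sanity-check calculation:
```python
# ROI numbers from this section (all figures as stated above)
datadog_annual = 7_000 * 12                  # $84,000
self_hosted_annual = (1_800 + 4_000) * 12    # infra + 0.5 FTE ops = $69,600
grafana_cloud_annual = 3_500 * 12            # $42,000

print(datadog_annual - self_hosted_annual)   # 14,400
print(datadog_annual - grafana_cloud_annual) # 42,000
```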
---
## Resources
- **Prometheus**: https://prometheus.io/docs/
- **Grafana**: https://grafana.com/docs/
- **Loki**: https://grafana.com/docs/loki/
- **Tempo**: https://grafana.com/docs/tempo/
- **OpenTelemetry**: https://opentelemetry.io/
- **Migration Tools**: https://github.com/grafana/dashboard-linter
---
## Support
If you need help with migration:
- Grafana Labs offers migration consulting
- Many SRE consulting firms specialize in this
- Community support via Slack/Discord channels

# DQL (Datadog Query Language) ↔ PromQL Translation Guide
## Quick Reference
| Concept | Datadog (DQL) | Prometheus (PromQL) |
|---------|---------------|---------------------|
| Aggregation | `avg:`, `sum:`, `min:`, `max:` | `avg()`, `sum()`, `min()`, `max()` |
| Rate | `.as_rate()`, `.as_count()` | `rate()`, `increase()` |
| Percentile | `p50:`, `p95:`, `p99:` | `histogram_quantile()` |
| Filtering | `{tag:value}` | `{label="value"}` |
| Time window | `last_5m`, `last_1h` | `[5m]`, `[1h]` |
---
## Basic Queries
### Simple Metric Query
**Datadog**:
```
system.cpu.user
```
**Prometheus**:
```promql
node_cpu_seconds_total{mode="user"}
```
---
### Metric with Filter
**Datadog**:
```
system.cpu.user{host:web-01}
```
**Prometheus**:
```promql
node_cpu_seconds_total{mode="user", instance="web-01"}
```
---
### Multiple Filters (AND)
**Datadog**:
```
system.cpu.user{host:web-01,env:production}
```
**Prometheus**:
```promql
node_cpu_seconds_total{mode="user", instance="web-01", env="production"}
```
---
### Wildcard Filters
**Datadog**:
```
system.cpu.user{host:web-*}
```
**Prometheus**:
```promql
node_cpu_seconds_total{mode="user", instance=~"web-.*"}
```
---
### OR Filters
**Datadog**:
```
system.cpu.user{host:web-01 OR host:web-02}
```
**Prometheus**:
```promql
node_cpu_seconds_total{mode="user", instance=~"web-01|web-02"}
```
---
## Aggregations
### Average
**Datadog**:
```
avg:system.cpu.user{*}
```
**Prometheus**:
```promql
avg(node_cpu_seconds_total{mode="user"})
```
---
### Sum
**Datadog**:
```
sum:requests.count{*}
```
**Prometheus**:
```promql
sum(http_requests_total)
```
---
### Min/Max
**Datadog**:
```
min:system.mem.free{*}
max:system.mem.free{*}
```
**Prometheus**:
```promql
min(node_memory_MemFree_bytes)
max(node_memory_MemFree_bytes)
```
---
### Aggregation by Tag/Label
**Datadog**:
```
avg:system.cpu.user{*} by {host}
```
**Prometheus**:
```promql
avg by (instance) (node_cpu_seconds_total{mode="user"})
```
---
## Rates and Counts
### Rate (per second)
**Datadog**:
```
sum:requests.count{*}.as_rate()
```
**Prometheus**:
```promql
sum(rate(http_requests_total[5m]))
```
Note: Prometheus requires explicit time window `[5m]`
---
### Count (total over time)
**Datadog**:
```
sum:requests.count{*}.as_count()
```
**Prometheus**:
```promql
sum(increase(http_requests_total[1h]))
```
---
### Derivative (change over time)
**Datadog**:
```
derivative(avg:system.disk.used{*})
```
**Prometheus**:
```promql
deriv(node_filesystem_size_bytes[5m])
```
---
## Percentiles
### P50 (Median)
**Datadog**:
```
p50:request.duration{*}
```
**Prometheus** (requires histogram):
```promql
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
---
### P95
**Datadog**:
```
p95:request.duration{*}
```
**Prometheus**:
```promql
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
---
### P99
**Datadog**:
```
p99:request.duration{*}
```
**Prometheus**:
```promql
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
---
## Time Windows
### Last 5 minutes
**Datadog**:
```
avg(last_5m):system.cpu.user{*}
```
**Prometheus**:
```promql
avg_over_time(node_cpu_seconds_total{mode="user"}[5m])
```
---
### Last 1 hour
**Datadog**:
```
avg(last_1h):system.cpu.user{*}
```
**Prometheus**:
```promql
avg_over_time(node_cpu_seconds_total{mode="user"}[1h])
```
---
## Math Operations
### Division
**Datadog**:
```
avg:system.mem.used{*} / avg:system.mem.total{*}
```
**Prometheus**:
```promql
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
```
---
### Multiplication
**Datadog**:
```
avg:system.cpu.user{*} * 100
```
**Prometheus**:
```promql
avg(node_cpu_seconds_total{mode="user"}) * 100
```
---
### Percentage Calculation
**Datadog**:
```
(sum:requests.errors{*} / sum:requests.count{*}) * 100
```
**Prometheus**:
```promql
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
```
---
## Common Use Cases
### CPU Usage Percentage
**Datadog**:
```
100 - avg:system.cpu.idle{*}
```
**Prometheus**:
```promql
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```
---
### Memory Usage Percentage
**Datadog**:
```
(avg:system.mem.used{*} / avg:system.mem.total{*}) * 100
```
**Prometheus**:
```promql
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
```
---
### Disk Usage Percentage
**Datadog**:
```
(avg:system.disk.used{*} / avg:system.disk.total{*}) * 100
```
**Prometheus**:
```promql
(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100
```
---
### Request Rate (requests/sec)
**Datadog**:
```
sum:requests.count{*}.as_rate()
```
**Prometheus**:
```promql
sum(rate(http_requests_total[5m]))
```
---
### Error Rate Percentage
**Datadog**:
```
(sum:requests.errors{*}.as_rate() / sum:requests.count{*}.as_rate()) * 100
```
**Prometheus**:
```promql
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
```
---
### Request Latency (P95)
**Datadog**:
```
p95:request.duration{*}
```
**Prometheus**:
```promql
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
---
### Top 5 Hosts by CPU
**Datadog**:
```
top(avg:system.cpu.user{*} by {host}, 5, 'mean', 'desc')
```
**Prometheus**:
```promql
topk(5, avg by (instance) (rate(node_cpu_seconds_total{mode="user"}[5m])))
```
---
## Functions
### Absolute Value
**Datadog**:
```
abs(diff(avg:system.cpu.user{*}))
```
**Prometheus**:
```promql
abs(delta(node_cpu_seconds_total{mode="user"}[5m]))
```
---
### Ceiling/Floor
**Datadog**:
```
ceil(avg:system.cpu.user{*})
floor(avg:system.cpu.user{*})
```
**Prometheus**:
```promql
ceil(avg(node_cpu_seconds_total{mode="user"}))
floor(avg(node_cpu_seconds_total{mode="user"}))
```
---
### Clamp (Limit Range)
**Datadog**:
```
clamp_min(avg:system.cpu.user{*}, 0)
clamp_max(avg:system.cpu.user{*}, 100)
```
**Prometheus**:
```promql
clamp_min(avg(node_cpu_seconds_total{mode="user"}), 0)
clamp_max(avg(node_cpu_seconds_total{mode="user"}), 100)
```
---
### Moving Average
**Datadog**:
```
moving_rollup(avg:system.cpu.user{*}, 60, 'avg')
```
**Prometheus**:
```promql
avg_over_time(node_cpu_seconds_total{mode="user"}[1m])
```
---
## Advanced Patterns
### Compare to Previous Period
**Datadog**:
```
sum:requests.count{*}.as_rate() / timeshift(sum:requests.count{*}.as_rate(), 3600)
```
**Prometheus**:
```promql
sum(rate(http_requests_total[5m])) / sum(rate(http_requests_total[5m] offset 1h))
```
---
### Forecast
**Datadog**:
```
forecast(avg:system.disk.used{*}, 'linear', 1)
```
**Prometheus**:
```promql
predict_linear(node_filesystem_size_bytes[1h], 3600)
```
Note: Predicts value 1 hour in future based on last 1 hour trend
---
### Anomaly Detection
**Datadog**:
```
anomalies(avg:system.cpu.user{*}, 'basic', 2)
```
**Prometheus**: No built-in function
- Use recording rules with stddev
- External tools like **Robust Perception's anomaly detector**
- Or use **Grafana ML** plugin
---
### Outlier Detection
**Datadog**:
```
outliers(avg:system.cpu.user{*} by {host}, 'mad')
```
**Prometheus**: No built-in function
- Calculate manually with stddev:
```promql
abs(metric - scalar(avg(metric))) > 2 * scalar(stddev(metric))
```
---
## Container & Kubernetes
### Container CPU Usage
**Datadog**:
```
avg:docker.cpu.usage{*} by {container_name}
```
**Prometheus**:
```promql
avg by (container) (rate(container_cpu_usage_seconds_total[5m]))
```
---
### Container Memory Usage
**Datadog**:
```
avg:docker.mem.rss{*} by {container_name}
```
**Prometheus**:
```promql
avg by (container) (container_memory_rss)
```
---
### Pod Count by Status
**Datadog**:
```
sum:kubernetes.pods.running{*} by {kube_namespace}
```
**Prometheus**:
```promql
sum by (namespace) (kube_pod_status_phase{phase="Running"})
```
---
## Database Queries
### MySQL Queries Per Second
**Datadog**:
```
sum:mysql.performance.queries{*}.as_rate()
```
**Prometheus**:
```promql
sum(rate(mysql_global_status_queries[5m]))
```
---
### PostgreSQL Active Connections
**Datadog**:
```
avg:postgresql.connections{*}
```
**Prometheus**:
```promql
avg(pg_stat_database_numbackends)
```
---
### Redis Memory Usage
**Datadog**:
```
avg:redis.mem.used{*}
```
**Prometheus**:
```promql
avg(redis_memory_used_bytes)
```
---
## Network Metrics
### Network Bytes Sent
**Datadog**:
```
sum:system.net.bytes_sent{*}.as_rate()
```
**Prometheus**:
```promql
sum(rate(node_network_transmit_bytes_total[5m]))
```
---
### Network Bytes Received
**Datadog**:
```
sum:system.net.bytes_rcvd{*}.as_rate()
```
**Prometheus**:
```promql
sum(rate(node_network_receive_bytes_total[5m]))
```
---
## Key Differences
### 1. Time Windows
- **Datadog**: Optional, defaults to query time range
- **Prometheus**: Always required for rate/increase functions
### 2. Histograms
- **Datadog**: Percentiles available directly
- **Prometheus**: Requires histogram buckets + `histogram_quantile()`
### 3. Default Aggregation
- **Datadog**: No default, must specify
- **Prometheus**: Returns all time series unless aggregated
### 4. Metric Types
- **Datadog**: All metrics treated similarly
- **Prometheus**: Explicit types (counter, gauge, histogram, summary)
### 5. Tag vs Label
- **Datadog**: Uses "tags" (key:value)
- **Prometheus**: Uses "labels" (key="value")
---
## Migration Tips
1. **Start with dashboards**: Convert most-used dashboards first
2. **Use recording rules**: Pre-calculate expensive PromQL queries
3. **Test in parallel**: Run both systems during migration
4. **Document mappings**: Create team-specific translation guide
5. **Train team**: PromQL has learning curve, invest in training
---
## Tools
- **Datadog Dashboard Exporter**: Export JSON dashboards
- **Grafana Dashboard Linter**: Validate converted dashboards
- **PromQL Learning Resources**: https://prometheus.io/docs/prometheus/latest/querying/basics/
---
## Common Gotchas
### Rate without Time Window
**Wrong**:
```promql
rate(http_requests_total)
```
**Correct**:
```promql
rate(http_requests_total[5m])
```
---
### Aggregating Before Rate
**Wrong**:
```promql
rate(sum(http_requests_total)[5m])
```
**Correct**:
```promql
sum(rate(http_requests_total[5m]))
```
---
### Histogram Quantile Without by (le)
**Wrong**:
```promql
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```
**Correct**:
```promql
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
---
## Quick Conversion Checklist
When converting a Datadog query to PromQL:
- [ ] Replace metric name (e.g., `system.cpu.user` → `node_cpu_seconds_total`)
- [ ] Convert tags to labels (`{tag:value}` → `{label="value"}`)
- [ ] Add time window for rate/increase (`[5m]`)
- [ ] Change aggregation syntax (`avg:` → `avg()`)
- [ ] Convert percentiles to histogram_quantile if needed
- [ ] Test query in Prometheus before adding to dashboard
- [ ] Add `by (label)` for grouped aggregations
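A very rough sketch of automating the mechanical parts of this checklist; it only handles simple `agg:metric{tag:value}` queries, and everything it produces still needs review against the mappings above (metric names in particular):
```python
import re

# Rough sketch: swap aggregator syntax, convert tag filters to label matchers,
# and wrap .as_rate() queries in rate() with an explicit window.
def naive_dql_to_promql(dql: str, window: str = "5m") -> str:
    m = re.match(r"(?P<agg>avg|sum|min|max):(?P<metric>[\w.]+)"
                 r"\{(?P<tags>[^}]*)\}(?P<rate>\.as_rate\(\))?$", dql)
    if not m:
        raise ValueError(f"unsupported query shape: {dql}")
    metric = m["metric"].replace(".", "_")      # still needs a real name mapping
    labels = ", ".join(
        f'{k}="{v}"' for k, v in
        (t.split(":", 1) for t in m["tags"].split(",") if t and t != "*")
    )
    selector = f"{metric}{{{labels}}}" if labels else metric
    if m["rate"]:
        selector = f"rate({selector}[{window}])"
    return f'{m["agg"]}({selector})'

print(naive_dql_to_promql("sum:requests.count{status:200}.as_rate()"))
# sum(rate(requests_count{status="200"}[5m]))
```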
---
## Need More Help?
- See `datadog_migration.md` for full migration guide
- PromQL documentation: https://prometheus.io/docs/prometheus/latest/querying/
- Practice at: https://demo.promlens.com/

# Logging Guide
## Structured Logging
### Why Structured Logs?
**Unstructured** (text):
```
2024-10-28 14:32:15 User john@example.com logged in from 192.168.1.1
```
**Structured** (JSON):
```json
{
"timestamp": "2024-10-28T14:32:15Z",
"level": "info",
"message": "User logged in",
"user": "john@example.com",
"ip": "192.168.1.1",
"event_type": "user_login"
}
```
**Benefits**:
- Easy to parse and query
- Consistent format
- Machine-readable
- Efficient storage and indexing
---
## Log Levels
Use appropriate log levels for better filtering and alerting.
### DEBUG
**When**: Development, troubleshooting
**Examples**:
- Function entry/exit
- Variable values
- Internal state changes
```python
logger.debug("Processing request", extra={
"request_id": req_id,
"params": params
})
```
### INFO
**When**: Important business events
**Examples**:
- User actions (login, purchase)
- System state changes (started, stopped)
- Significant milestones
```python
logger.info("Order placed", extra={
"order_id": "12345",
"user_id": "user123",
"amount": 99.99
})
```
### WARN
**When**: Potentially problematic situations
**Examples**:
- Deprecated API usage
- Slow operations (but not failing)
- Retry attempts
- Resource usage approaching limits
```python
logger.warning("API response slow", extra={
"endpoint": "/api/users",
"duration_ms": 2500,
"threshold_ms": 1000
})
```
### ERROR
**When**: Error conditions that need attention
**Examples**:
- Failed requests
- Exceptions caught and handled
- Integration failures
- Data validation errors
```python
logger.error("Payment processing failed", extra={
"order_id": "12345",
"error": str(e),
"payment_gateway": "stripe"
}, exc_info=True)
```
### FATAL/CRITICAL
**When**: Severe errors causing shutdown
**Examples**:
- Database connection lost
- Out of memory
- Configuration errors preventing startup
```python
logger.critical("Database connection lost", extra={
"database": "postgres",
"host": "db.example.com",
"attempt": 3
})
```
---
## Required Fields
Every log entry should include:
### 1. Timestamp
ISO 8601 format with timezone:
```json
{
"timestamp": "2024-10-28T14:32:15.123Z"
}
```
### 2. Level
Standard levels: debug, info, warn, error, critical
```json
{
"level": "error"
}
```
### 3. Message
Human-readable description:
```json
{
"message": "User authentication failed"
}
```
### 4. Service/Application
What component logged this:
```json
{
"service": "api-gateway",
"version": "1.2.3"
}
```
### 5. Environment
```json
{
"environment": "production"
}
```
---
## Recommended Fields
### Request Context
```json
{
"request_id": "550e8400-e29b-41d4-a716-446655440000",
"user_id": "user123",
"session_id": "sess_abc",
"ip_address": "192.168.1.1",
"user_agent": "Mozilla/5.0..."
}
```
### Performance Metrics
```json
{
"duration_ms": 245,
"response_size_bytes": 1024
}
```
### Error Details
```json
{
"error_type": "ValidationError",
"error_message": "Invalid email format",
"stack_trace": "...",
"error_code": "VAL_001"
}
```
### Business Context
```json
{
"order_id": "ORD-12345",
"customer_id": "CUST-789",
"transaction_amount": 99.99,
"payment_method": "credit_card"
}
```
---
## Implementation Examples
### Python (using structlog)
```python
import structlog
logger = structlog.get_logger()
# Configure structured logging
structlog.configure(
processors=[
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.add_log_level,
structlog.processors.JSONRenderer()
]
)
# Usage
logger.info(
"user_logged_in",
user_id="user123",
ip_address="192.168.1.1",
login_method="oauth"
)
```
### Node.js (using Winston)
```javascript
const winston = require('winston');
const logger = winston.createLogger({
format: winston.format.json(),
defaultMeta: { service: 'api-gateway' },
transports: [
new winston.transports.Console()
]
});
logger.info('User logged in', {
userId: 'user123',
ipAddress: '192.168.1.1',
loginMethod: 'oauth'
});
```
### Go (using zap)
```go
import "go.uber.org/zap"
logger, _ := zap.NewProduction()
defer logger.Sync()
logger.Info("User logged in",
zap.String("userId", "user123"),
zap.String("ipAddress", "192.168.1.1"),
zap.String("loginMethod", "oauth"),
)
```
### Java (using Logback with JSON)
```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import net.logstash.logback.argument.StructuredArguments;
Logger logger = LoggerFactory.getLogger(MyClass.class);
logger.info("User logged in",
StructuredArguments.kv("userId", "user123"),
StructuredArguments.kv("ipAddress", "192.168.1.1"),
StructuredArguments.kv("loginMethod", "oauth")
);
```
---
## Log Aggregation Patterns
### Pattern 1: ELK Stack (Elasticsearch, Logstash, Kibana)
**Architecture**:
```
Application → Filebeat → Logstash → Elasticsearch → Kibana
```
**filebeat.yml**:
```yaml
filebeat.inputs:
- type: log
enabled: true
paths:
- /var/log/app/*.log
json.keys_under_root: true
json.add_error_key: true
output.logstash:
hosts: ["logstash:5044"]
```
**logstash.conf**:
```
input {
beats {
port => 5044
}
}
filter {
json {
source => "message"
}
date {
match => ["timestamp", "ISO8601"]
}
grok {
match => { "message" => "%{COMBINEDAPACHELOG}" }
}
}
output {
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "app-logs-%{+YYYY.MM.dd}"
}
}
```
### Pattern 2: Loki (Grafana Loki)
**Architecture**:
```
Application → Promtail → Loki → Grafana
```
**promtail-config.yml**:
```yaml
server:
http_listen_port: 9080
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: app
static_configs:
- targets:
- localhost
labels:
job: app
__path__: /var/log/app/*.log
pipeline_stages:
- json:
expressions:
level: level
timestamp: timestamp
- labels:
level:
service:
- timestamp:
source: timestamp
format: RFC3339
```
**Query in Grafana**:
```logql
{job="app"} |= "error" | json | level="error"
```
### Pattern 3: CloudWatch Logs
**Install CloudWatch agent**:
```json
{
"logs": {
"logs_collected": {
"files": {
"collect_list": [
{
"file_path": "/var/log/app/*.log",
"log_group_name": "/aws/app/production",
"log_stream_name": "{instance_id}",
"timezone": "UTC"
}
]
}
}
}
}
```
**Query with CloudWatch Insights**:
```
fields @timestamp, level, message, user_id
| filter level = "error"
| sort @timestamp desc
| limit 100
```
### Pattern 4: Fluentd/Fluent Bit
**fluent-bit.conf**:
```
[INPUT]
Name tail
Path /var/log/app/*.log
Parser json
Tag app.*
[FILTER]
Name record_modifier
Match *
Record hostname ${HOSTNAME}
Record cluster production
[OUTPUT]
Name es
Match *
Host elasticsearch
Port 9200
Index app-logs
Type _doc
```
---
## Query Patterns
### Find Errors in Time Range
**Elasticsearch**:
```json
GET /app-logs-*/_search
{
"query": {
"bool": {
"must": [
{ "match": { "level": "error" } },
{ "range": { "@timestamp": {
"gte": "now-1h",
"lte": "now"
}}}
]
}
}
}
```
**Loki (LogQL)**:
```logql
{job="app", level="error"} |= "error"
```
**CloudWatch Insights**:
```
fields @timestamp, @message
| filter level = "error"
| filter @timestamp > ago(1h)
```
### Count Errors by Type
**Elasticsearch**:
```json
GET /app-logs-*/_search
{
"size": 0,
"query": { "match": { "level": "error" } },
"aggs": {
"error_types": {
"terms": { "field": "error_type.keyword" }
}
}
}
```
**Loki**:
```logql
sum by (error_type) (count_over_time({job="app", level="error"}[1h]))
```
### Find Slow Requests
**Elasticsearch**:
```json
GET /app-logs-*/_search
{
"query": {
"range": { "duration_ms": { "gte": 1000 } }
},
"sort": [ { "duration_ms": "desc" } ]
}
```
### Trace Request Through Services
**Elasticsearch** (using request_id):
```json
GET /_search
{
"query": {
"match": { "request_id": "550e8400-e29b-41d4-a716-446655440000" }
},
"sort": [ { "@timestamp": "asc" } ]
}
```
---
## Sampling and Rate Limiting
### When to Sample
- **High volume services**: > 10,000 logs/second
- **Debug logs in production**: Sample 1-10%
- **Cost optimization**: Reduce storage costs
### Sampling Strategies
**1. Random Sampling**:
```python
import random
if random.random() < 0.1: # Sample 10%
logger.debug("Debug message", ...)
```
**2. Rate Limiting**:
```python
from rate_limiter import RateLimiter
limiter = RateLimiter(max_per_second=100)
if limiter.allow():
logger.info("Rate limited log", ...)
```
**3. Error-Biased Sampling**:
```python
# Always log errors, sample successful requests
if level == "error" or random.random() < 0.01:
logger.log(level, message, ...)
```
**4. Head-Based Sampling** (trace-aware):
```python
# If trace is sampled, log all related logs
if trace_context.is_sampled():
logger.info("Traced log", trace_id=trace_context.trace_id)
```
---
## Log Retention
### Retention Strategy
**Hot tier** (fast SSD): 7-30 days
- Recent logs
- Full query performance
- High cost
**Warm tier** (regular disk): 30-90 days
- Older logs
- Slower queries acceptable
- Medium cost
**Cold tier** (object storage): 90+ days
- Archive logs
- Query via restore
- Low cost
### Example: Elasticsearch ILM Policy
```json
{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": {
"max_size": "50GB",
"max_age": "1d"
}
}
},
"warm": {
"min_age": "7d",
"actions": {
"allocate": { "number_of_replicas": 1 },
"shrink": { "number_of_shards": 1 }
}
},
"cold": {
"min_age": "30d",
"actions": {
"allocate": { "require": { "box_type": "cold" } }
}
},
"delete": {
"min_age": "90d",
"actions": {
"delete": {}
}
}
}
}
}
```
---
## Security and Compliance
### PII Redaction
**Before logging**:
```python
import re
def redact_pii(data):
# Redact email
data = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b',
'[EMAIL]', data)
# Redact credit card
data = re.sub(r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',
'[CARD]', data)
# Redact SSN
data = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', data)
return data
logger.info("User data", user_input=redact_pii(user_input))
```
**In Logstash**:
```
filter {
mutate {
gsub => [
"message", "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b", "[EMAIL]",
"message", "\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b", "[CARD]"
]
}
}
```
### Access Control
**Elasticsearch** (with Security):
```yaml
# Role for developers
dev_logs:
indices:
- names: ['app-logs-*']
privileges: ['read']
query: '{"match": {"environment": "development"}}'
```
**CloudWatch** (IAM Policy):
```json
{
"Effect": "Allow",
"Action": [
"logs:DescribeLogGroups",
"logs:GetLogEvents",
"logs:FilterLogEvents"
],
"Resource": "arn:aws:logs:*:*:log-group:/aws/app/production:*"
}
```
---
## Common Pitfalls
### 1. Logging Sensitive Data
`logger.info("Login", password=password)`
`logger.info("Login", user_id=user_id)`
### 2. Excessive Logging
❌ Logging every iteration of a loop
✅ Log aggregate results or sample
### 3. Not Including Context
`logger.error("Failed")`
`logger.error("Payment failed", order_id=order_id, error=str(e))`
### 4. Inconsistent Formats
❌ Mix of JSON and plain text
✅ Pick one format and stick to it
### 5. No Request IDs
❌ Can't trace request across services
✅ Generate and propagate a request_id (see the sketch after this list)
### 6. Logging to Multiple Places
❌ Log to file AND stdout AND syslog
✅ Log to stdout, let agent handle routing
### 7. Blocking on Log Writes
❌ Synchronous writes to remote systems
✅ Asynchronous buffered writes
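For pitfall 5, a minimal sketch of generating and propagating a request ID using only the standard library (contextvars plus a logging filter; the format string and handler are just examples):
```python
import contextvars
import logging
import uuid

# Generate a request_id at the edge and attach it to every log record via a filter,
# so code deeper in the call stack does not need to pass it around explicitly.
request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

logging.basicConfig(format="%(asctime)s %(levelname)s request_id=%(request_id)s %(message)s")
logger = logging.getLogger(__name__)
logger.addFilter(RequestIdFilter())

def handle_request():
    request_id_var.set(str(uuid.uuid4()))  # or reuse an incoming X-Request-ID header
    logger.warning("Payment failed")       # request_id is attached automatically

handle_request()
```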
---
## Performance Optimization
### 1. Async Logging
```python
import logging
from logging.handlers import QueueHandler, QueueListener
import queue

# Create queue and route log records through it (non-blocking for callers)
log_queue = queue.Queue()
logger = logging.getLogger()
logger.addHandler(QueueHandler(log_queue))

# A background thread drains the queue and writes via the real handler(s)
listener = QueueListener(log_queue, logging.StreamHandler())
listener.start()
```
### 2. Conditional Logging
```python
# Avoid expensive operations if not logging
if logger.isEnabledFor(logging.DEBUG):
logger.debug("Details", data=expensive_serialization(obj))
```
### 3. Batching
```python
# Batch logs before sending
batch = []
for log in logs:
    batch.append(log)
    if len(batch) >= 100:
        send_to_aggregator(batch)
        batch = []
# Send any remaining partial batch
if batch:
    send_to_aggregator(batch)
```
### 4. Compression
```yaml
# Filebeat with compression
output.logstash:
hosts: ["logstash:5044"]
compression_level: 3
```
---
## Monitoring Log Pipeline
Track pipeline health with metrics:
```promql
# Log ingestion rate
rate(logs_ingested_total[5m])
# Pipeline lag
log_processing_lag_seconds
# Dropped logs
rate(logs_dropped_total[5m])
# Error parsing rate
rate(logs_parse_errors_total[5m])
```
Alert on:
- Sudden drop in log volume (service down?)
- High parse error rate (format changed?)
- Pipeline lag > 1 minute (capacity issue?)

# Metrics Design Guide
## The Four Golden Signals
The Four Golden Signals from Google's SRE book provide a comprehensive view of system health:
### 1. Latency
**What**: Time to service a request
**Why Monitor**: Directly impacts user experience
**Key Metrics**:
- Request duration (p50, p95, p99, p99.9)
- Time to first byte (TTFB)
- Backend processing time
- Database query latency
**PromQL Examples**:
```promql
# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Average latency by endpoint
avg(rate(http_request_duration_seconds_sum[5m])) by (endpoint)
/
avg(rate(http_request_duration_seconds_count[5m])) by (endpoint)
```
**Alert Thresholds**:
- Warning: p95 > 500ms
- Critical: p99 > 2s
### 2. Traffic
**What**: Demand on your system
**Why Monitor**: Understand load patterns, capacity planning
**Key Metrics**:
- Requests per second (RPS)
- Transactions per second (TPS)
- Concurrent connections
- Network throughput
**PromQL Examples**:
```promql
# Requests per second
sum(rate(http_requests_total[5m]))
# Requests per second by status code
sum(rate(http_requests_total[5m])) by (status)
# Traffic growth rate (week over week)
sum(rate(http_requests_total[5m]))
/
sum(rate(http_requests_total[5m] offset 7d))
```
**Alert Thresholds**:
- Warning: RPS > 80% of capacity
- Critical: RPS > 95% of capacity
### 3. Errors
**What**: Rate of requests that fail
**Why Monitor**: Direct indicator of user-facing problems
**Key Metrics**:
- Error rate (%)
- 5xx response codes
- Failed transactions
- Exception counts
**PromQL Examples**:
```promql
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# Error count by type
sum(rate(http_requests_total{status=~"5.."}[5m])) by (status)
# Application errors
rate(application_errors_total[5m])
```
**Alert Thresholds**:
- Warning: Error rate > 1%
- Critical: Error rate > 5%
### 4. Saturation
**What**: How "full" your service is
**Why Monitor**: Predict capacity issues before they impact users
**Key Metrics**:
- CPU utilization
- Memory utilization
- Disk I/O
- Network bandwidth
- Queue depth
- Thread pool usage
**PromQL Examples**:
```promql
# CPU saturation
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory saturation
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk saturation
rate(node_disk_io_time_seconds_total[5m]) * 100
# Queue depth
queue_depth_current / queue_depth_max * 100
```
**Alert Thresholds**:
- Warning: > 70% utilization
- Critical: > 90% utilization
---
## RED Method (for Services)
**R**ate, **E**rrors, **D**uration - a simplified approach for request-driven services
### Rate
Number of requests per second:
```promql
sum(rate(http_requests_total[5m]))
```
### Errors
Number of failed requests per second:
```promql
sum(rate(http_requests_total{status=~"5.."}[5m]))
```
### Duration
Time taken to process requests:
```promql
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
**When to Use**: Microservices, APIs, web applications
---
## USE Method (for Resources)
**U**tilization, **S**aturation, **E**rrors - for infrastructure resources
### Utilization
Percentage of time resource is busy:
```promql
# CPU utilization
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Disk utilization
(node_filesystem_size_bytes - node_filesystem_avail_bytes)
/ node_filesystem_size_bytes * 100
```
### Saturation
Amount of work the resource cannot service (queued):
```promql
# Load average (saturation indicator)
node_load15
# Disk I/O wait time
rate(node_disk_io_time_weighted_seconds_total[5m])
```
### Errors
Count of error events:
```promql
# Network errors
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])
# Disk errors
rate(node_disk_io_errors_total[5m])
```
**When to Use**: Servers, databases, network devices
---
## Metric Types
### Counter
Monotonically increasing value (never decreases)
**Examples**: Request count, error count, bytes sent
**Usage**:
```promql
# Always use rate() or increase() with counters
rate(http_requests_total[5m]) # Requests per second
increase(http_requests_total[1h]) # Total requests in 1 hour
```
### Gauge
Value that can go up and down
**Examples**: Memory usage, queue depth, concurrent connections
**Usage**:
```promql
# Use directly or with aggregations
avg(memory_usage_bytes)
max(queue_depth)
```
### Histogram
Samples observations and counts them in configurable buckets
**Examples**: Request duration, response size
**Usage**:
```promql
# Calculate percentiles
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Average from histogram
rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])
```
### Summary
Similar to histogram but calculates quantiles on client side
**Usage**: Less flexible than histograms, avoid for new metrics
---
## Cardinality Best Practices
**Cardinality**: Number of unique time series
### High Cardinality Labels (AVOID)
❌ User ID
❌ Email address
❌ IP address
❌ Timestamp
❌ Random IDs
### Low Cardinality Labels (GOOD)
✅ Environment (prod, staging)
✅ Region (us-east-1, eu-west-1)
✅ Service name
✅ HTTP status code category (2xx, 4xx, 5xx)
✅ Endpoint/route
### Calculating Cardinality Impact
```
Time series = unique combinations of labels
Example:
service (5) × environment (3) × region (4) × status (5) = 300 time series ✅
service (5) × environment (3) × region (4) × user_id (1M) = 60M time series ❌
```
---
## Naming Conventions
### Prometheus Naming
```
<namespace>_<name>_<unit>_total
Examples:
http_requests_total
http_request_duration_seconds
process_cpu_seconds_total
node_memory_MemAvailable_bytes
```
**Rules**:
- Use snake_case
- Include unit in name (seconds, bytes, ratio)
- Use `_total` suffix for counters
- Namespace by application/component
### CloudWatch Naming
```
<Namespace>/<MetricName>
Examples:
AWS/EC2/CPUUtilization
MyApp/RequestCount
```
**Rules**:
- Use PascalCase
- Group by namespace
- No unit in name (specified separately)
---
## Dashboard Design
### Key Principles
1. **Top-Down Layout**: Most important metrics first
2. **Color Coding**: Red (critical), yellow (warning), green (healthy)
3. **Consistent Time Windows**: All panels use same time range
4. **Limit Panels**: 8-12 panels per dashboard maximum
5. **Include Context**: Show related metrics together
### Dashboard Structure
```
┌─────────────────────────────────────────────┐
│ Overall Health (Single Stats) │
│ [Requests/s] [Error%] [P95 Latency] │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Request Rate & Errors (Graphs) │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Latency Distribution (Graphs) │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Resource Usage (Graphs) │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Dependencies (Graphs) │
└─────────────────────────────────────────────┘
```
### Template Variables
Use variables for filtering:
- Environment: `$environment`
- Service: `$service`
- Region: `$region`
- Pod: `$pod`
---
## Common Pitfalls
### 1. Monitoring What You Build, Not What Users Experience
❌ `backend_processing_complete`
✅ `user_request_completed`
### 2. Too Many Metrics
- Start with Four Golden Signals
- Add metrics only when needed for specific issues
- Remove unused metrics
### 3. Incorrect Aggregations
❌ `avg(rate(errors_total[5m]) / rate(requests_total[5m]))` - averaging per-instance ratios weights idle and busy instances equally
✅ `sum(rate(errors_total[5m])) / sum(rate(requests_total[5m]))` - aggregate first, then divide
### 4. Wrong Time Windows
- Too short (< 1m): Noisy data
- Too long (> 15m): Miss short-lived issues
- Sweet spot: 5m for most alerts
### 5. Missing Labels
❌ `http_requests_total`
✅ `http_requests_total{method="GET", status="200", endpoint="/api/users"}`
---
## Metric Collection Best Practices
### Application Instrumentation
```python
from prometheus_client import Counter, Histogram, Gauge
# Counter for requests
requests_total = Counter('http_requests_total',
'Total HTTP requests',
['method', 'endpoint', 'status'])
# Histogram for latency
request_duration = Histogram('http_request_duration_seconds',
'HTTP request duration',
['method', 'endpoint'])
# Gauge for in-progress requests
requests_in_progress = Gauge('http_requests_in_progress',
'HTTP requests currently being processed')
```
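A minimal sketch of how these metrics might be updated inside a request handler (the handler itself is illustrative, not tied to any framework):
```python
import time
def handle_request(method, endpoint):
    requests_in_progress.inc()   # gauge goes up while the request is active
    start = time.time()
    status = "200"
    try:
        return "ok"              # actual request handling would go here
    except Exception:
        status = "500"
        raise
    finally:
        request_duration.labels(method=method, endpoint=endpoint) \
                        .observe(time.time() - start)
        requests_total.labels(method=method, endpoint=endpoint,
                              status=status).inc()
        requests_in_progress.dec()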
### Collection Intervals
- Application metrics: 15-30s
- Infrastructure metrics: 30-60s
- Billing/cost metrics: 5-15m
- External API checks: 1-5m
### Retention
- Raw metrics: 15-30 days
- 5m aggregates: 90 days
- 1h aggregates: 1 year
- Daily aggregates: 2+ years

---
references/slo_sla_guide.md
# SLI, SLO, and SLA Guide
## Definitions
### SLI (Service Level Indicator)
**What**: A quantitative measure of service quality
**Examples**:
- Request latency (ms)
- Error rate (%)
- Availability (%)
- Throughput (requests/sec)
### SLO (Service Level Objective)
**What**: Target value or range for an SLI
**Examples**:
- "99.9% of requests return in < 500ms"
- "99.95% availability"
- "Error rate < 0.1%"
### SLA (Service Level Agreement)
**What**: Business contract with consequences for SLO violations
**Examples**:
- "99.9% uptime or 10% monthly credit"
- "p95 latency < 1s or refund"
### Relationship
```
SLI = Measurement
SLO = Target (internal goal)
SLA = Promise (customer contract with penalties)
Example:
SLI: Actual availability this month = 99.92%
SLO: Target availability = 99.9%
SLA: Guaranteed availability = 99.5% (with penalties)
```
---
## Choosing SLIs
### The Four Golden Signals as SLIs
1. **Latency SLIs**
- Request duration (p50, p95, p99)
- Time to first byte
- Page load time
2. **Availability/Success SLIs**
- % of successful requests
- % uptime
- % of requests completing
3. **Throughput SLIs** (less common)
- Requests per second
- Transactions per second
4. **Saturation SLIs** (internal only)
- Resource utilization
- Queue depth
### SLI Selection Criteria
**Good SLIs**:
- Measured from user perspective
- Directly impact user experience
- Aggregatable across instances
- Proportional to user happiness
**Bad SLIs**:
- Internal metrics only
- Not user-facing
- Hard to measure consistently
### Examples by Service Type
**Web Application**:
```
SLI 1: Request Success Rate
= successful_requests / total_requests
SLI 2: Request Latency (p95)
= 95th percentile of response times
SLI 3: Availability
= time_service_responding / total_time
```
**API Service**:
```
SLI 1: Error Rate
= (4xx_errors + 5xx_errors) / total_requests
SLI 2: Response Time (p99)
= 99th percentile latency
SLI 3: Throughput
= requests_per_second
```
**Batch Processing**:
```
SLI 1: Job Success Rate
= successful_jobs / total_jobs
SLI 2: Processing Latency
= time_from_submission_to_completion
SLI 3: Freshness
= age_of_oldest_unprocessed_item
```
**Storage Service**:
```
SLI 1: Durability
= data_not_lost / total_data
SLI 2: Read Latency (p99)
= 99th percentile read time
SLI 3: Write Success Rate
= successful_writes / total_writes
```
---
## Setting SLO Targets
### Start with Current Performance
1. **Measure baseline**: Collect 30 days of data
2. **Analyze distribution**: Look at p50, p95, p99, p99.9 (see the sketch after this list)
3. **Set initial SLO**: Slightly better than worst performer
4. **Iterate**: Tighten or loosen based on feasibility
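A minimal sketch of the baseline analysis in step 2, assuming raw latency samples (in milliseconds) have been exported from the monitoring system to a local file (the filename is a placeholder):
```python
import numpy as np
latencies_ms = np.loadtxt("latencies_30d.txt")  # placeholder: one sample per line
for p in (50, 95, 99, 99.9):
    print(f"p{p}: {np.percentile(latencies_ms, p):.0f}ms")
```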
### Example Process
**Current Performance** (30 days):
```
p50 latency: 120ms
p95 latency: 450ms
p99 latency: 1200ms
p99.9 latency: 3500ms
Error rate: 0.05%
Availability: 99.95%
```
**Initial SLOs**:
```
Latency: p95 < 500ms (slightly worse than current p95)
Error rate: < 0.1% (double current rate)
Availability: 99.9% (slightly worse than current)
```
**Rationale**: Start loose, prevent false alarms, tighten over time
### Common SLO Targets
**Availability**:
- **99%** (3.65 days downtime/year): Internal tools
- **99.5%** (1.83 days/year): Non-critical services
- **99.9%** (8.76 hours/year): Standard production
- **99.95%** (4.38 hours/year): Critical services
- **99.99%** (52 minutes/year): High availability
- **99.999%** (5 minutes/year): Mission critical
**Latency**:
- **p50 < 100ms**: Excellent responsiveness
- **p95 < 500ms**: Standard web applications
- **p99 < 1s**: Acceptable for most users
- **p99.9 < 5s**: Acceptable for rare edge cases
**Error Rate**:
- **< 0.01%** (99.99% success): Critical operations
- **< 0.1%** (99.9% success): Standard production
- **< 1%** (99% success): Non-critical services
---
## Error Budgets
### Concept
Error budget = (100% - SLO target)
If SLO is 99.9%, error budget is 0.1%
**Purpose**: Balance reliability with feature velocity
### Calculation
**For availability**:
```
Monthly error budget = (1 - SLO) × time_period
Example (99.9% SLO, 30 days):
Error budget = 0.001 × 30 days = 0.03 days = 43.2 minutes
```
**For request-based SLIs**:
```
Error budget = (1 - SLO) × total_requests
Example (99.9% SLO, 10M requests/month):
Error budget = 0.001 × 10,000,000 = 10,000 failed requests
```
### Error Budget Consumption
**Formula**:
```
Budget consumed = actual_errors / allowed_errors × 100%
Example:
SLO: 99.9% (0.1% error budget)
Total requests: 1,000,000
Failed requests: 500
Allowed failures: 1,000
Budget consumed = 500 / 1,000 × 100% = 50%
Budget remaining = 50%
```
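The same arithmetic as a small helper, using the numbers from the example above:
```python
def error_budget(slo: float, total_requests: int, failed_requests: int):
    """Return allowed failures plus consumed/remaining budget fractions."""
    allowed = (1 - slo) * total_requests
    consumed = failed_requests / allowed
    return allowed, consumed, 1 - consumed
allowed, consumed, remaining = error_budget(0.999, 1_000_000, 500)
print(f"Allowed failures: {allowed:.0f}")   # 1000
print(f"Budget consumed:  {consumed:.0%}")  # 50%
print(f"Budget remaining: {remaining:.0%}") # 50%
```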
### Error Budget Policy
**Example policy**:
```markdown
## Error Budget Policy
### If error budget > 50%
- Deploy frequently (multiple times per day)
- Take calculated risks
- Experiment with new features
- Acceptable to have some incidents
### If error budget 20-50%
- Deploy normally (once per day)
- Increase testing
- Review recent changes
- Monitor closely
### If error budget < 20%
- Freeze non-critical deploys
- Focus on reliability improvements
- Postmortem all incidents
- Reduce change velocity
### If error budget exhausted (< 0%)
- Complete deploy freeze except rollbacks
- All hands on reliability
- Mandatory postmortems
- Executive escalation
```
---
## Error Budget Burn Rate
### Concept
Burn rate = rate of error budget consumption
**Example**:
- Monthly budget: 43.2 minutes (99.9% SLO)
- If consuming at 2x rate: Budget exhausted in 15 days
- If consuming at 10x rate: Budget exhausted in 3 days
### Burn Rate Calculation
```
Burn rate = (actual_error_rate / allowed_error_rate)
Example:
SLO: 99.9% (0.1% allowed error rate)
Current error rate: 0.5%
Burn rate = 0.5% / 0.1% = 5x
Time to exhaust = 30 days / 5 = 6 days
```
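The burn-rate arithmetic above as a small sketch:
```python
def burn_rate(current_error_rate: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    return current_error_rate / (1 - slo)
def days_to_exhaustion(rate: float, period_days: int = 30) -> float:
    return period_days / rate
rate = burn_rate(current_error_rate=0.005, slo=0.999)
print(f"Burn rate: {rate:.1f}x")                                   # 5.0x
print(f"Budget exhausted in {days_to_exhaustion(rate):.0f} days")  # 6 days
```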
### Multi-Window Alerting
Alert on burn rate across multiple time windows:
**Fast burn** (1 hour window):
```
Burn rate > 14.4x → Exhausts budget in 2 days
Alert after 2 minutes
Severity: Critical (page immediately)
```
**Moderate burn** (6 hour window):
```
Burn rate > 6x → Exhausts budget in 5 days
Alert after 30 minutes
Severity: Warning (create ticket)
```
**Slow burn** (3 day window):
```
Burn rate > 1x → Exhausts budget by end of month
Alert after 6 hours
Severity: Info (monitor)
```
### Implementation
**Prometheus**:
```yaml
# Fast burn alert (1h window, 2m grace period)
- alert: ErrorBudgetFastBurn
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
) > (14.4 * 0.001) # 14.4x burn rate for 99.9% SLO
for: 2m
labels:
severity: critical
annotations:
summary: "Fast error budget burn detected"
description: "Error budget will be exhausted in 2 days at current rate"
# Slow burn alert (6h window, 30m grace period)
- alert: ErrorBudgetSlowBurn
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[6h]))
/
sum(rate(http_requests_total[6h]))
) > (6 * 0.001) # 6x burn rate for 99.9% SLO
for: 30m
labels:
severity: warning
annotations:
summary: "Elevated error budget burn detected"
```
---
## SLO Reporting
### Dashboard Structure
**Overall Health**:
```
┌─────────────────────────────────────────┐
│ SLO Compliance: 99.92% ✅ │
│ Error Budget Remaining: 73% 🟢 │
│ Burn Rate: 0.8x 🟢 │
└─────────────────────────────────────────┘
```
**SLI Performance**:
```
Latency p95: 420ms (Target: 500ms) ✅
Error Rate: 0.08% (Target: < 0.1%) ✅
Availability: 99.95% (Target: > 99.9%) ✅
```
**Error Budget Trend**:
```
Graph showing:
- Error budget consumption over time
- Burn rate spikes
- Incidents marked
- Deploy events overlaid
```
### Monthly SLO Report
**Template**:
```markdown
# SLO Report: October 2024
## Executive Summary
- ✅ All SLOs met this month
- 🟡 Latency SLO came close to violation (99.1% compliance)
- 3 incidents consumed 47% of error budget
- Error budget remaining: 53%
## SLO Performance
### Availability SLO: 99.9%
- Actual: 99.92%
- Status: ✅ Met
- Error budget consumed: 33%
- Downtime: 23 minutes (allowed: 43.2 minutes)
### Latency SLO: p95 < 500ms
- Actual p95: 445ms
- Status: ✅ Met
- Compliance: 99.1% (target: 99%)
- 0.9% of requests exceeded threshold
### Error Rate SLO: < 0.1%
- Actual: 0.05%
- Status: ✅ Met
- Error budget consumed: 50%
## Incidents
### Incident #1: Database Overload (Oct 5)
- Duration: 15 minutes
- Error budget consumed: 35%
- Root cause: Slow query after schema change
- Prevention: Added query review to deploy checklist
### Incident #2: API Gateway Timeout (Oct 12)
- Duration: 5 minutes
- Error budget consumed: 10%
- Root cause: Configuration error in load balancer
- Prevention: Automated configuration validation
### Incident #3: Upstream Service Degradation (Oct 20)
- Duration: 3 minutes
- Error budget consumed: 2%
- Root cause: Third-party API outage
- Prevention: Implemented circuit breaker
## Recommendations
1. Investigate latency near-miss (Oct 15-17)
2. Add automated rollback for database changes
3. Increase circuit breaker thresholds for third-party APIs
4. Consider tightening availability SLO to 99.95%
## Next Month's Focus
- Reduce p95 latency to 400ms
- Implement automated canary deployments
- Add synthetic monitoring for critical paths
```
---
## SLA Structure
### Components
**Service Description**:
```
The API Service provides RESTful endpoints for user management,
authentication, and data retrieval.
```
**Covered Metrics**:
```
- Availability: Service is reachable and returns valid responses
- Latency: Time from request to response
- Error Rate: Percentage of requests returning errors
```
**SLA Targets**:
```
Service commits to:
1. 99.9% monthly uptime
2. p95 API response time < 1 second
3. Error rate < 0.5%
```
**Measurement**:
```
Metrics calculated from server-side monitoring:
- Uptime: Successful health check probes / total probes
- Latency: Server-side request duration (p95)
- Errors: HTTP 5xx responses / total responses
Calculated monthly (first of month for previous month).
```
**Exclusions**:
```
SLA does not cover:
- Scheduled maintenance (with 7 days notice)
- Client-side network issues
- DDoS attacks or force majeure
- Beta/preview features
- Issues caused by customer misuse
```
**Service Credits**:
```
Monthly Uptime | Service Credit
---------------- | --------------
< 99.9% (SLA) | 10%
< 99.0% | 25%
< 95.0% | 50%
```
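A small sketch applying the credit tiers above (the boundaries are the ones from this example SLA):
```python
def service_credit(monthly_uptime_pct: float) -> float:
    """Return the service credit percentage owed for a month."""
    if monthly_uptime_pct < 95.0:
        return 50.0
    if monthly_uptime_pct < 99.0:
        return 25.0
    if monthly_uptime_pct < 99.9:   # below the SLA commitment
        return 10.0
    return 0.0
print(service_credit(99.85))  # 10.0 -> 10% credit on next invoice
```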
**Claiming Credits**:
```
Customer must:
1. Report violation within 30 days
2. Provide ticket numbers for support requests
3. Credits applied to next month's invoice
4. Credits do not exceed monthly fee
```
### Example SLAs by Industry
**E-commerce**:
```
- 99.95% availability
- p95 page load < 2s
- p99 checkout < 5s
- Credits: 5% per 0.1% below target
```
**Financial Services**:
```
- 99.99% availability
- p99 transaction < 500ms
- Zero data loss
- Penalties: $10,000 per hour of downtime
```
**Media/Content**:
```
- 99.9% availability
- p95 video start < 3s
- No credit system (best effort latency)
```
---
## Best Practices
### 1. SLOs Should Be User-Centric
❌ "Database queries < 100ms"
✅ "API response time p95 < 500ms"
### 2. Start Loose, Tighten Over Time
- Begin with achievable targets
- Build reliability culture
- Gradually raise bar
### 3. Fewer, Better SLOs
- 1-3 SLOs per service
- Focus on user impact
- Avoid SLO sprawl
### 4. SLAs More Conservative Than SLOs
```
Internal SLO: 99.95%
Customer SLA: 99.9%
Margin: 0.05% buffer
```
### 5. Make Error Budgets Actionable
- Define policies at different thresholds
- Empower teams to make tradeoffs
- Review in planning meetings
### 6. Document Everything
- How SLIs are measured
- Why targets were chosen
- Who owns each SLO
- How to interpret metrics
### 7. Review Regularly
- Monthly SLO reviews
- Quarterly SLO adjustments
- Annual SLA renegotiation
---
## Common Pitfalls
### 1. Too Many SLOs
❌ 20 different SLOs per service
✅ 2-3 critical SLOs
### 2. Unrealistic Targets
❌ 99.999% for non-critical service
✅ 99.9% with room to improve
### 3. SLOs Without Error Budgets
❌ "Must always be 99.9%"
✅ "Budget for 0.1% errors"
### 4. No Consequences
❌ Missing SLO has no impact
✅ Deploy freeze when budget exhausted
### 5. SLA Equals SLO
❌ Promise exactly what you target
✅ SLA more conservative than SLO
### 6. Ignoring User Experience
❌ "Our servers are up 99.99%"
✅ "Users can complete actions 99.9% of the time"
### 7. Static Targets
❌ Set once, never revisit
✅ Quarterly reviews and adjustments
---
## Tools and Automation
### SLO Tracking Tools
**Prometheus + Grafana**:
- Use recording rules for SLIs
- Alert on burn rates
- Dashboard for compliance
**Google Cloud SLO Monitoring**:
- Built-in SLO tracking
- Automatic error budget calculation
- Integration with alerting
**Datadog SLOs**:
- UI for SLO definition
- Automatic burn rate alerts
- Status pages
**Custom Tools**:
- sloth: Generate Prometheus rules from SLO definitions
- slo-libsonnet: Jsonnet library for SLO monitoring
### Example: Prometheus Recording Rules
```yaml
groups:
- name: sli_recording
interval: 30s
rules:
# SLI: Request success rate
- record: sli:request_success:ratio
expr: |
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# SLI: Request latency (p95)
- record: sli:request_latency:p95
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# Error budget burn rate (1h window)
- record: slo:error_budget_burn_rate:1h
expr: |
(1 - sli:request_success:ratio) / 0.001
```

---
# Monitoring Tools Comparison
## Overview Matrix
| Tool | Type | Best For | Complexity | Cost | Cloud/Self-Hosted |
|------|------|----------|------------|------|-------------------|
| **Prometheus** | Metrics | Kubernetes, time-series | Medium | Free | Self-hosted |
| **Grafana** | Visualization | Dashboards, multi-source | Low-Medium | Free | Both |
| **Datadog** | Full-stack | Ease of use, APM | Low | High | Cloud |
| **New Relic** | Full-stack | APM, traces | Low | High | Cloud |
| **Elasticsearch (ELK)** | Logs | Log search, analysis | High | Medium | Both |
| **Grafana Loki** | Logs | Cost-effective logs | Medium | Free | Both |
| **CloudWatch** | AWS-native | AWS infrastructure | Low | Medium | Cloud |
| **Jaeger** | Tracing | Distributed tracing | Medium | Free | Self-hosted |
| **Grafana Tempo** | Tracing | Cost-effective tracing | Medium | Free | Self-hosted |
---
## Metrics Platforms
### Prometheus
**Type**: Open-source time-series database
**Strengths**:
- ✅ Industry standard for Kubernetes
- ✅ Powerful query language (PromQL)
- ✅ Pull-based model (no agent config)
- ✅ Service discovery
- ✅ Free and open source
- ✅ Huge ecosystem (exporters for everything)
**Weaknesses**:
- ❌ No built-in dashboards (need Grafana)
- ❌ Single-node only (no HA without federation)
- ❌ Limited long-term storage (need Thanos/Cortex)
- ❌ Steep learning curve for PromQL
**Best For**:
- Kubernetes monitoring
- Infrastructure metrics
- Custom application metrics
- Organizations that need control
**Pricing**: Free (open source)
**Setup Complexity**: Medium
**Example**:
```yaml
# prometheus.yml
scrape_configs:
- job_name: 'app'
static_configs:
- targets: ['localhost:8080']
```
---
### Datadog
**Type**: SaaS monitoring platform
**Strengths**:
- ✅ Easy to set up (install agent, done)
- ✅ Beautiful pre-built dashboards
- ✅ APM, logs, metrics, traces in one platform
- ✅ Great anomaly detection
- ✅ Excellent integrations (500+)
- ✅ Good mobile app
**Weaknesses**:
- ❌ Very expensive at scale
- ❌ Vendor lock-in
- ❌ Cost can be unpredictable (per-host pricing)
- ❌ Limited PromQL support
**Best For**:
- Teams that want quick setup
- Companies prioritizing ease of use over cost
- Organizations needing full observability
**Pricing**: $15-$31/host/month + custom metrics fees
**Setup Complexity**: Low
**Example**:
```bash
# Install agent
DD_API_KEY=xxx bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh)"
```
---
### New Relic
**Type**: SaaS application performance monitoring
**Strengths**:
- ✅ Excellent APM capabilities
- ✅ User-friendly interface
- ✅ Good transaction tracing
- ✅ Comprehensive alerting
- ✅ Generous free tier
**Weaknesses**:
- ❌ Can get expensive at scale
- ❌ Vendor lock-in
- ❌ Query language less powerful than PromQL
- ❌ Limited customization
**Best For**:
- Application performance monitoring
- Teams focused on APM over infrastructure
- Startups (free tier is generous)
**Pricing**: Free up to 100GB/month, then $0.30/GB
**Setup Complexity**: Low
**Example**:
```python
import newrelic.agent
newrelic.agent.initialize('newrelic.ini')
```
---
### CloudWatch
**Type**: AWS-native monitoring
**Strengths**:
- ✅ Zero setup for AWS services
- ✅ Native integration with AWS
- ✅ Automatic dashboards for AWS resources
- ✅ Tightly integrated with other AWS services
- ✅ Good for cost if already on AWS
**Weaknesses**:
- ❌ AWS-only (not multi-cloud)
- ❌ Limited query capabilities
- ❌ High costs for custom metrics
- ❌ Basic visualization
- ❌ 1-minute minimum resolution
**Best For**:
- AWS-centric infrastructure
- Quick setup for AWS services
- Organizations already invested in AWS
**Pricing**:
- First 10 custom metrics: Free
- Additional: $0.30/metric/month
- API calls: $0.01/1000 requests
**Setup Complexity**: Low (for AWS), Medium (for custom metrics)
**Example**:
```python
import boto3
cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_data(
Namespace='MyApp',
MetricData=[{'MetricName': 'RequestCount', 'Value': 1}]
)
```
---
### Grafana Cloud / Mimir
**Type**: Managed Prometheus-compatible
**Strengths**:
- ✅ Prometheus-compatible (PromQL)
- ✅ Managed service (no ops burden)
- ✅ Good cost model (pay for what you use)
- ✅ Grafana dashboards included
- ✅ Long-term storage
**Weaknesses**:
- ❌ Relatively new (less mature)
- ❌ Some Prometheus features missing
- ❌ Requires Grafana for visualization
**Best For**:
- Teams wanting Prometheus without ops overhead
- Multi-cloud environments
- Organizations already using Grafana
**Pricing**: $8/month + $0.29/1M samples
**Setup Complexity**: Low-Medium
---
## Logging Platforms
### Elasticsearch (ELK Stack)
**Type**: Open-source log search and analytics
**Full Stack**: Elasticsearch + Logstash + Kibana
**Strengths**:
- ✅ Powerful search capabilities
- ✅ Rich query language
- ✅ Great for log analysis
- ✅ Mature ecosystem
- ✅ Can handle large volumes
- ✅ Flexible data model
**Weaknesses**:
- ❌ Complex to operate
- ❌ Resource intensive (RAM hungry)
- ❌ Expensive at scale
- ❌ Requires dedicated ops team
- ❌ Slow for high-cardinality queries
**Best For**:
- Large organizations with ops teams
- Deep log analysis needs
- Search-heavy use cases
**Pricing**: Free (open source) + infrastructure costs
**Infrastructure**: ~$500-2000/month for medium scale
**Setup Complexity**: High
**Example**:
```json
PUT /logs-2024.10/_doc/1
{
"timestamp": "2024-10-28T14:32:15Z",
"level": "error",
"message": "Payment failed"
}
```
---
### Grafana Loki
**Type**: Log aggregation system
**Strengths**:
- ✅ Cost-effective (labels only, not full-text indexing)
- ✅ Easy to operate
- ✅ Prometheus-like label model
- ✅ Great Grafana integration
- ✅ Low resource usage
- ✅ Fast time-range queries
**Weaknesses**:
- ❌ Limited full-text search
- ❌ Requires careful label design
- ❌ Younger ecosystem than ELK
- ❌ Not ideal for complex queries
**Best For**:
- Cost-conscious organizations
- Kubernetes environments
- Teams already using Prometheus
- Time-series log queries
**Pricing**: Free (open source) + infrastructure costs
**Infrastructure**: ~$100-500/month for medium scale
**Setup Complexity**: Medium
**Example**:
```logql
{job="api", environment="prod"} |= "error" | json | level="error"
```
---
### Splunk
**Type**: Enterprise log management
**Strengths**:
- ✅ Extremely powerful search
- ✅ Great for security/compliance
- ✅ Mature platform
- ✅ Enterprise support
- ✅ Machine learning features
**Weaknesses**:
- ❌ Very expensive
- ❌ Complex pricing (per GB ingested)
- ❌ Steep learning curve
- ❌ Heavy resource usage
**Best For**:
- Large enterprises
- Security operations centers (SOCs)
- Compliance-heavy industries
**Pricing**: $150-$1800/GB/month (depending on tier)
**Setup Complexity**: Medium-High
---
### CloudWatch Logs
**Type**: AWS-native log management
**Strengths**:
- ✅ Zero setup for AWS services
- ✅ Integrated with AWS ecosystem
- ✅ CloudWatch Insights for queries
- ✅ Reasonable cost for low volume
**Weaknesses**:
- ❌ AWS-only
- ❌ Limited query capabilities
- ❌ Expensive at high volume
- ❌ Basic visualization
**Best For**:
- AWS-centric applications
- Low-volume logging
- Simple log aggregation
**Pricing**: Tiered (as of May 2025)
- Vended Logs: $0.50/GB (first 10TB), $0.25/GB (next 20TB), then lower tiers
- Standard logs: $0.50/GB flat
- Storage: $0.03/GB
**Setup Complexity**: Low (AWS), Medium (custom)
---
### Sumo Logic
**Type**: SaaS log management
**Strengths**:
- ✅ Easy to use
- ✅ Good for cloud-native apps
- ✅ Real-time analytics
- ✅ Good compliance features
**Weaknesses**:
- ❌ Expensive at scale
- ❌ Vendor lock-in
- ❌ Limited customization
**Best For**:
- Cloud-native applications
- Teams wanting managed solution
- Security and compliance use cases
**Pricing**: $90-$180/GB/month
**Setup Complexity**: Low
---
## Tracing Platforms
### Jaeger
**Type**: Open-source distributed tracing
**Strengths**:
- ✅ Industry standard
- ✅ CNCF graduated project
- ✅ Supports OpenTelemetry
- ✅ Good UI
- ✅ Free and open source
**Weaknesses**:
- ❌ Requires separate storage backend
- ❌ Limited query capabilities
- ❌ No built-in analytics
**Best For**:
- Microservices architectures
- Kubernetes environments
- OpenTelemetry users
**Pricing**: Free (open source) + storage costs
**Setup Complexity**: Medium
---
### Grafana Tempo
**Type**: Open-source distributed tracing
**Strengths**:
- ✅ Cost-effective (object storage)
- ✅ Easy to operate
- ✅ Great Grafana integration
- ✅ TraceQL query language
- ✅ Supports OpenTelemetry
**Weaknesses**:
- ❌ Younger than Jaeger
- ❌ Limited third-party integrations
- ❌ Requires Grafana for UI
**Best For**:
- Cost-conscious organizations
- Teams using Grafana stack
- High trace volumes
**Pricing**: Free (open source) + storage costs
**Setup Complexity**: Medium
---
### Datadog APM
**Type**: SaaS application performance monitoring
**Strengths**:
- ✅ Easy to set up
- ✅ Excellent trace visualization
- ✅ Integrated with metrics/logs
- ✅ Automatic service map
- ✅ Good profiling features
**Weaknesses**:
- ❌ Expensive ($31/host/month)
- ❌ Vendor lock-in
- ❌ Limited sampling control
**Best For**:
- Teams wanting ease of use
- Organizations already using Datadog
- Complex microservices
**Pricing**: $31/host/month + $1.70/million spans
**Setup Complexity**: Low
---
### AWS X-Ray
**Type**: AWS-native distributed tracing
**Strengths**:
- ✅ Native AWS integration
- ✅ Automatic instrumentation for AWS services
- ✅ Low cost
**Weaknesses**:
- ❌ AWS-only
- ❌ Basic UI
- ❌ Limited query capabilities
**Best For**:
- AWS-centric applications
- Serverless architectures (Lambda)
- Cost-sensitive projects
**Pricing**: $5/million traces, first 100k free/month
**Setup Complexity**: Low (AWS), Medium (custom)
---
## Full-Stack Observability
### Datadog (Full Platform)
**Components**: Metrics, logs, traces, RUM, synthetics
**Strengths**:
- ✅ Everything in one platform
- ✅ Excellent user experience
- ✅ Correlation across signals
- ✅ Great for teams
**Weaknesses**:
- ❌ Very expensive ($50-100+/host/month)
- ❌ Vendor lock-in
- ❌ Unpredictable costs
**Total Cost** (example 100 hosts):
- Infrastructure: $3,100/month
- APM: $3,100/month
- Logs: ~$2,000/month
- **Total: ~$8,000/month**
---
### Grafana Stack (LGTM)
**Components**: Loki (logs), Grafana (viz), Tempo (traces), Mimir/Prometheus (metrics)
**Strengths**:
- ✅ Open source and cost-effective
- ✅ Unified visualization
- ✅ Prometheus-compatible
- ✅ Great for cloud-native
**Weaknesses**:
- ❌ Requires self-hosting or Grafana Cloud
- ❌ More ops burden
- ❌ Less polished than commercial tools
**Total Cost** (self-hosted, 100 hosts):
- Infrastructure: ~$1,500/month
- Ops time: Variable
- **Total: ~$1,500-3,000/month**
---
### Elastic Observability
**Components**: Elasticsearch (logs), Kibana (viz), APM, metrics
**Strengths**:
- ✅ Powerful search
- ✅ Mature platform
- ✅ Good for log-heavy use cases
**Weaknesses**:
- ❌ Complex to operate
- ❌ Expensive infrastructure
- ❌ Resource intensive
**Total Cost** (self-hosted, 100 hosts):
- Infrastructure: ~$3,000-5,000/month
- Ops time: High
- **Total: ~$4,000-7,000/month**
---
### New Relic One
**Components**: Metrics, logs, traces, synthetics
**Strengths**:
- ✅ Generous free tier (100GB)
- ✅ User-friendly
- ✅ Good for startups
**Weaknesses**:
- ❌ Costs increase quickly after free tier
- ❌ Vendor lock-in
**Total Cost**:
- Free: up to 100GB/month
- Paid: $0.30/GB beyond 100GB
---
## Cloud Provider Native
### AWS (CloudWatch + X-Ray)
**Use When**:
- Primarily on AWS
- Simple monitoring needs
- Want minimal setup
**Avoid When**:
- Multi-cloud environment
- Need advanced features
- High log volume (expensive)
**Cost** (example):
- 100 EC2 instances with basic metrics: ~$150/month
- 1TB logs: ~$500/month ingestion + storage
- X-Ray: ~$50/month
---
### GCP (Cloud Monitoring + Cloud Trace)
**Use When**:
- Primarily on GCP
- Using GKE
- Want tight GCP integration
**Avoid When**:
- Multi-cloud environment
- Need advanced querying
**Cost** (example):
- First 150MB/month per resource: Free
- Additional: $0.2508/MB
---
### Azure (Azure Monitor)
**Use When**:
- Primarily on Azure
- Using AKS
- Need Azure integration
**Avoid When**:
- Multi-cloud
- Need advanced features
**Cost** (example):
- First 5GB: Free
- Additional: $2.76/GB
---
## Decision Matrix
### Choose Prometheus + Grafana If:
- ✅ Using Kubernetes
- ✅ Want control and customization
- ✅ Have ops capacity
- ✅ Budget-conscious
- ✅ Need Prometheus ecosystem
### Choose Datadog If:
- ✅ Want ease of use
- ✅ Need full observability now
- ✅ Budget allows ($8k+/month for 100 hosts)
- ✅ Limited ops team
- ✅ Need excellent UX
### Choose ELK If:
- ✅ Heavy log analysis needs
- ✅ Need powerful search
- ✅ Have dedicated ops team
- ✅ Compliance requirements
- ✅ Willing to invest in infrastructure
### Choose Grafana Stack (LGTM) If:
- ✅ Want open source full stack
- ✅ Cost-effective solution
- ✅ Cloud-native architecture
- ✅ Already using Prometheus
- ✅ Have some ops capacity
### Choose New Relic If:
- ✅ Startup with free tier
- ✅ APM is priority
- ✅ Want easy setup
- ✅ Don't need heavy customization
### Choose Cloud Native (CloudWatch/etc) If:
- ✅ Single cloud provider
- ✅ Simple needs
- ✅ Want minimal setup
- ✅ Low to medium scale
---
## Cost Comparison
**Example: 100 hosts, 1TB logs/month, 1M spans/day**
| Solution | Monthly Cost | Setup | Ops Burden |
|----------|-------------|--------|------------|
| **Prometheus + Loki + Tempo** | $1,500 | Medium | Medium |
| **Grafana Cloud** | $3,000 | Low | Low |
| **Datadog** | $8,000 | Low | None |
| **New Relic** | $3,500 | Low | None |
| **ELK Stack** | $4,000 | High | High |
| **CloudWatch** | $2,000 | Low | Low |
---
## Recommendations by Company Size
### Startup (< 10 engineers)
**Recommendation**: New Relic or Grafana Cloud
- Minimal ops burden
- Good free tiers
- Easy to get started
### Small Company (10-50 engineers)
**Recommendation**: Prometheus + Grafana + Loki (self-hosted or cloud)
- Cost-effective
- Growing ops capacity
- Flexibility
### Medium Company (50-200 engineers)
**Recommendation**: Datadog or Grafana Stack
- Datadog if budget allows
- Grafana Stack if cost-conscious
### Large Enterprise (200+ engineers)
**Recommendation**: Build observability platform
- Mix of tools based on needs
- Dedicated observability team
- Custom integrations

---
references/tracing_guide.md
# Distributed Tracing Guide
## What is Distributed Tracing?
Distributed tracing tracks a request as it flows through multiple services in a distributed system.
### Key Concepts
**Trace**: End-to-end journey of a request
**Span**: Single operation within a trace
**Context**: Metadata propagated between services (trace_id, span_id)
### Example Flow
```
User Request → API Gateway → Auth Service → User Service → Database
[Trace ID: abc123]
  Span 1: gateway       (50ms)
  Span 2: auth          (20ms)
  Span 3: user_service  (100ms)
  Span 4: db_query      (80ms)
Total: 250ms, with a waterfall view showing the dependencies
```
---
## OpenTelemetry (OTel)
OpenTelemetry is the industry standard for instrumentation.
### Components
**API**: Instrument code (create spans, add attributes)
**SDK**: Implement API, configure exporters
**Collector**: Receive, process, and export telemetry data
**Exporters**: Send data to backends (Jaeger, Tempo, Zipkin)
### Architecture
```
Application → OTel SDK → OTel Collector → Backend (Jaeger/Tempo) → Visualization (Grafana / Jaeger UI)
```
---
## Instrumentation Examples
### Python (using OpenTelemetry)
**Setup**:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
# Setup tracer
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
# Configure exporter
otlp_exporter = OTLPSpanExporter(endpoint="localhost:4317")
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)
```
**Manual instrumentation**:
```python
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
@tracer.start_as_current_span("process_order")
def process_order(order_id):
span = trace.get_current_span()
span.set_attribute("order.id", order_id)
span.set_attribute("order.amount", 99.99)
try:
result = payment_service.charge(order_id)
span.set_attribute("payment.status", "success")
return result
except Exception as e:
span.set_status(trace.Status(trace.StatusCode.ERROR))
span.record_exception(e)
raise
```
**Auto-instrumentation** (Flask example):
```python
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
# Auto-instrument Flask
FlaskInstrumentor().instrument_app(app)
# Auto-instrument requests library
RequestsInstrumentor().instrument()
# Auto-instrument SQLAlchemy
SQLAlchemyInstrumentor().instrument(engine=db.engine)
```
### Node.js (using OpenTelemetry)
**Setup**:
```javascript
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
// Setup provider
const provider = new NodeTracerProvider();
const exporter = new OTLPTraceExporter({ url: 'localhost:4317' });
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();
```
**Manual instrumentation**:
```javascript
const { SpanStatusCode } = require('@opentelemetry/api');
const tracer = provider.getTracer('my-service');
async function processOrder(orderId) {
const span = tracer.startSpan('process_order');
span.setAttribute('order.id', orderId);
try {
const result = await paymentService.charge(orderId);
span.setAttribute('payment.status', 'success');
return result;
} catch (error) {
span.setStatus({ code: SpanStatusCode.ERROR });
span.recordException(error);
throw error;
} finally {
span.end();
}
}
```
**Auto-instrumentation**:
```javascript
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const { MongoDBInstrumentation } = require('@opentelemetry/instrumentation-mongodb');
registerInstrumentations({
instrumentations: [
new HttpInstrumentation(),
new ExpressInstrumentation(),
new MongoDBInstrumentation()
]
});
```
### Go (using OpenTelemetry)
**Setup**:
```go
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/trace"
)
func initTracer() {
exporter, _ := otlptracegrpc.New(context.Background())
tp := trace.NewTracerProvider(
trace.WithBatcher(exporter),
)
otel.SetTracerProvider(tp)
}
```
**Manual instrumentation**:
```go
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)
func processOrder(ctx context.Context, orderID string) error {
tracer := otel.Tracer("my-service")
ctx, span := tracer.Start(ctx, "process_order")
defer span.End()
span.SetAttributes(
attribute.String("order.id", orderID),
attribute.Float64("order.amount", 99.99),
)
err := paymentService.Charge(ctx, orderID)
if err != nil {
span.RecordError(err)
return err
}
span.SetAttributes(attribute.String("payment.status", "success"))
return nil
}
```
---
## Span Attributes
### Semantic Conventions
Follow OpenTelemetry semantic conventions for consistency:
**HTTP**:
```python
span.set_attribute("http.method", "GET")
span.set_attribute("http.url", "https://api.example.com/users")
span.set_attribute("http.status_code", 200)
span.set_attribute("http.user_agent", "Mozilla/5.0...")
```
**Database**:
```python
span.set_attribute("db.system", "postgresql")
span.set_attribute("db.name", "users_db")
span.set_attribute("db.statement", "SELECT * FROM users WHERE id = ?")
span.set_attribute("db.operation", "SELECT")
```
**RPC/gRPC**:
```python
span.set_attribute("rpc.system", "grpc")
span.set_attribute("rpc.service", "UserService")
span.set_attribute("rpc.method", "GetUser")
span.set_attribute("rpc.grpc.status_code", 0)
```
**Messaging**:
```python
span.set_attribute("messaging.system", "kafka")
span.set_attribute("messaging.destination", "user-events")
span.set_attribute("messaging.operation", "publish")
span.set_attribute("messaging.message_id", "msg123")
```
### Custom Attributes
Add business context:
```python
span.set_attribute("user.id", "user123")
span.set_attribute("order.id", "ORD-456")
span.set_attribute("feature.flag.checkout_v2", True)
span.set_attribute("cache.hit", False)
```
---
## Context Propagation
### W3C Trace Context (Standard)
Headers propagated between services:
```
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: vendor1=value1,vendor2=value2
```
**Format**: `version-trace_id-parent_span_id-trace_flags`
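A minimal sketch of parsing a `traceparent` header by hand; in practice the OpenTelemetry propagators shown below do this for you:
```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields."""
    version, trace_id, parent_span_id, trace_flags = header.split("-")
    return {
        "version": version,                 # "00"
        "trace_id": trace_id,               # 32 hex characters
        "parent_span_id": parent_span_id,   # 16 hex characters
        "sampled": int(trace_flags, 16) & 0x01 == 1,
    }
print(parse_traceparent(
    "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"))
```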
### Implementation
**Python**:
```python
from opentelemetry.propagate import inject, extract
import requests
# Inject context into outgoing request
headers = {}
inject(headers)
requests.get("https://api.example.com", headers=headers)
# Extract context from incoming request
from flask import request
ctx = extract(request.headers)
```
**Node.js**:
```javascript
const { propagation } = require('@opentelemetry/api');
// Inject
const headers = {};
propagation.inject(context.active(), headers);
axios.get('https://api.example.com', { headers });
// Extract
const ctx = propagation.extract(context.active(), req.headers);
```
**HTTP Example**:
```bash
curl -H "traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01" \
https://api.example.com/users
```
---
## Sampling Strategies
### 1. Always On/Off
```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ALWAYS_ON, ALWAYS_OFF
# Development: trace everything
provider = TracerProvider(sampler=ALWAYS_ON)
# Production: trace nothing (usually not desired)
provider = TracerProvider(sampler=ALWAYS_OFF)
```
### 2. Probability-Based
```python
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
# Sample 10% of traces
provider = TracerProvider(sampler=TraceIdRatioBased(0.1))
```
### 3. Rate Limiting
```python
from opentelemetry.sdk.trace.sampling import ParentBased
# Sample at most 100 traces per second.
# Note: the core OpenTelemetry Python SDK does not ship a RateLimitingSampler;
# implement one as a custom sampler (see below) or rate-limit in the
# Collector's tail_sampling processor. Shown here conceptually:
sampler = ParentBased(root=RateLimitingSampler(100))
provider = TracerProvider(sampler=sampler)
```
### 4. Parent-Based (Default)
```python
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
# If parent span is sampled, sample child spans
sampler = ParentBased(root=TraceIdRatioBased(0.1))
provider = TracerProvider(sampler=sampler)
```
### 5. Custom Sampling
```python
from opentelemetry.sdk.trace.sampling import Sampler, SamplingResult, Decision

class ErrorSampler(Sampler):
    """Always sample errors, sample ~1% of successes"""
    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        attributes = attributes or {}
        # Always sample if the span is flagged as an error
        if attributes.get('error', False):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        # Sample ~1% of successes based on the low bits of the trace ID
        if trace_id & 0xFF < 3:  # ~1%
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        return SamplingResult(Decision.DROP)

    def get_description(self):
        return "ErrorSampler"

provider = TracerProvider(sampler=ErrorSampler())
```
---
## Backends
### Jaeger
**Docker Compose**:
```yaml
version: '3'
services:
jaeger:
image: jaegertracing/all-in-one:latest
ports:
- "16686:16686" # UI
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
environment:
- COLLECTOR_OTLP_ENABLED=true
```
**Query traces**:
```bash
# UI: http://localhost:16686
# API: Get trace by ID
curl http://localhost:16686/api/traces/abc123
# Search traces
curl "http://localhost:16686/api/traces?service=my-service&limit=20"
```
### Grafana Tempo
**Docker Compose**:
```yaml
version: '3'
services:
tempo:
image: grafana/tempo:latest
ports:
- "3200:3200" # Tempo
- "4317:4317" # OTLP gRPC
volumes:
- ./tempo.yaml:/etc/tempo.yaml
command: ["-config.file=/etc/tempo.yaml"]
```
**tempo.yaml**:
```yaml
server:
http_listen_port: 3200
distributor:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
storage:
trace:
backend: local
local:
path: /tmp/tempo/traces
```
**Query in Grafana**:
- Install Tempo data source
- Use TraceQL: `{ span.http.status_code = 500 }`
### AWS X-Ray
**Configuration**:
```python
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.ext.flask.middleware import XRayMiddleware
xray_recorder.configure(service='my-service')
XRayMiddleware(app, xray_recorder)
```
**Query**:
```bash
aws xray get-trace-summaries \
--start-time 2024-10-28T00:00:00 \
--end-time 2024-10-28T23:59:59 \
--filter-expression 'error = true'
```
---
## Analysis Patterns
### Find Slow Traces
```
# Jaeger UI
- Filter by service
- Set min duration: 1000ms
- Sort by duration
# TraceQL (Tempo)
{ duration > 1s }
```
### Find Error Traces
```
# Jaeger UI
- Filter by tag: error=true
- Or by HTTP status: http.status_code=500
# TraceQL (Tempo)
{ span.http.status_code >= 500 }
```
### Find Traces by User
```
# Jaeger UI
- Filter by tag: user.id=user123
# TraceQL (Tempo)
{ span.user.id = "user123" }
```
### Find N+1 Query Problems
Look for (a rough detection sketch follows this list):
- Many sequential database spans
- Same query repeated multiple times
- Pattern: API call → DB query → DB query → DB query...
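A rough sketch of flagging the pattern from exported span data, assuming spans are available as simple dicts with `parent_id`, `name`, and `kind` fields (this data shape is hypothetical, not a specific backend's API):
```python
from collections import Counter
def find_n_plus_one(spans, threshold=5):
    """Flag parent spans that issue the same child DB span many times."""
    repeated = Counter((s["parent_id"], s["name"])
                       for s in spans if s.get("kind") == "db")
    return [(parent, name, count)
            for (parent, name), count in repeated.items()
            if count >= threshold]
# Example: 20 identical user-lookup queries under one parent span
spans = [{"parent_id": "span1", "name": "SELECT users by id", "kind": "db"}] * 20
print(find_n_plus_one(spans))  # [('span1', 'SELECT users by id', 20)]
```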
### Find Service Bottlenecks
- Identify spans with longest duration
- Check if time is spent in service logic or waiting for dependencies
- Look at span relationships (parallel vs sequential)
---
## Integration with Logs
### Trace ID in Logs
**Python**:
```python
from opentelemetry import trace
def add_trace_context():
span = trace.get_current_span()
trace_id = span.get_span_context().trace_id
span_id = span.get_span_context().span_id
return {
"trace_id": format(trace_id, '032x'),
"span_id": format(span_id, '016x')
}
logger.info("Processing order", **add_trace_context(), order_id=order_id)
```
**Query logs for trace**:
```
# Elasticsearch
GET /logs/_search
{
"query": {
"match": { "trace_id": "0af7651916cd43dd8448eb211c80319c" }
}
}
# Loki (LogQL)
{job="app"} |= "0af7651916cd43dd8448eb211c80319c"
```
### Trace from Log (Grafana)
Configure derived fields in Grafana:
```yaml
datasources:
- name: Loki
type: loki
jsonData:
derivedFields:
- name: TraceID
matcherRegex: "trace_id=([\\w]+)"
url: "http://tempo:3200/trace/$${__value.raw}"
datasourceUid: tempo_uid
```
---
## Best Practices
### 1. Span Naming
✅ Use operation names, not IDs
- Good: `GET /api/users`, `UserService.GetUser`, `db.query.users`
- Bad: `/api/users/123`, `span_abc`, `query_1`
### 2. Span Granularity
✅ One span per logical operation
- Too coarse: One span for entire request
- Too fine: Span for every variable assignment
- Just right: Span per service call, database query, external API
### 3. Add Context
Always include:
- Operation name
- Service name
- Error status
- Business identifiers (user_id, order_id)
### 4. Handle Errors
```python
try:
result = operation()
except Exception as e:
span.set_status(trace.Status(trace.StatusCode.ERROR))
span.record_exception(e)
raise
```
### 5. Sampling Strategy
- Development: 100%
- Staging: 50-100%
- Production: 1-10% (or error-based)
### 6. Performance Impact
- Overhead: ~1-5% CPU
- Use async exporters
- Batch span exports
- Sample appropriately
### 7. Cardinality
Avoid high-cardinality attributes:
- ❌ Email addresses
- ❌ Full URLs with unique IDs
- ❌ Timestamps
- ✅ User ID
- ✅ Endpoint pattern
- ✅ Status code
---
## Common Issues
### Missing Traces
**Cause**: Context not propagated
**Solution**: Verify headers are injected/extracted
### Incomplete Traces
**Cause**: Spans not closed properly
**Solution**: Always use `defer span.End()` or context managers
### High Overhead
**Cause**: Too many spans or synchronous export
**Solution**: Reduce span count, use batch processor
### No Error Traces
**Cause**: Errors not recorded on spans
**Solution**: Call `span.record_exception()` and set error status
---
## Metrics from Traces
Generate RED metrics from trace data:
**Rate**: Traces per second
**Errors**: Traces with error status
**Duration**: Span duration percentiles
**Example** (using Tempo + Prometheus):
```yaml
# Generate metrics from spans
metrics_generator:
processor:
span_metrics:
dimensions:
- http.method
- http.status_code
```
**Query**:
```promql
# Request rate
rate(traces_spanmetrics_calls_total[5m])
# Error rate
rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m])
/
rate(traces_spanmetrics_calls_total[5m])
# P95 latency
histogram_quantile(0.95, traces_spanmetrics_latency_bucket)
```