Initial commit

Zhongwei Li
2025-11-29 17:51:22 +08:00
commit 23753b435e
24 changed files with 9837 additions and 0 deletions

# Alerting Best Practices
## Core Principles
### 1. Every Alert Should Be Actionable
If you can't do something about it, don't alert on it.
❌ Bad: `Alert: CPU > 50%` (What action should be taken?)
✅ Good: `Alert: API latency p95 > 2s for 10m` (Investigate/scale up)
### 2. Alert on Symptoms, Not Causes
Alert on what users experience, not underlying components.
❌ Bad: `Database connection pool 80% full`
✅ Good: `Request latency p95 > 1s` (which might be caused by DB pool)
### 3. Alert on SLO Violations
Tie alerts to Service Level Objectives.
`Error rate exceeds 0.1% (SLO: 99.9% availability)`
### 4. Reduce Noise
Alert fatigue is real. Only page for critical issues.
**Alert Severity Levels**:
- **Critical**: Page on-call immediately (user-facing issue)
- **Warning**: Create ticket, review during business hours
- **Info**: Log for awareness, no action needed
---
## Alert Design Patterns
### Pattern 1: Multi-Window Multi-Burn-Rate
Google's recommended SLO alerting approach.
**Concept**: Alert when error budget burn rate is high enough to exhaust the budget too quickly.
```yaml
# Fast burn (consumes ~2% of the 30-day error budget in 1 hour)
- alert: FastBurnRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
> (14.4 * 0.001) # 14.4x burn rate for 99.9% SLO
for: 2m
labels:
severity: critical
# Slow burn (consumes ~5% of the 30-day error budget in 6 hours)
- alert: SlowBurnRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[6h]))
/
sum(rate(http_requests_total[6h]))
> (6 * 0.001) # 6x burn rate for 99.9% SLO
for: 30m
labels:
severity: warning
```
**Burn Rate Multipliers for 99.9% SLO (0.1% error budget)**:
- 1 hour window, 2m grace: 14.4x burn rate
- 6 hour window, 30m grace: 6x burn rate
- 3 day window, 6h grace: 1x burn rate
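These multipliers follow from simple arithmetic; a minimal sketch (assuming the standard 30-day SLO window) shows where the numbers come from:
```python
# Burn-rate multiplier = (fraction of error budget consumed) / (alert window / SLO window).
# Assumes a 30-day SLO window; the results match the table above.
SLO_WINDOW_HOURS = 30 * 24

def burn_rate(budget_fraction: float, window_hours: float) -> float:
    return budget_fraction / (window_hours / SLO_WINDOW_HOURS)

print(burn_rate(0.02, 1))   # ~14.4x (fast burn)
print(burn_rate(0.05, 6))   # 6x    (slow burn)
print(burn_rate(0.10, 72))  # 1x    (3-day window)
```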
### Pattern 2: Rate of Change
Alert when metrics change rapidly.
```yaml
- alert: TrafficSpike
expr: |
sum(rate(http_requests_total[5m]))
>
1.5 * sum(rate(http_requests_total[5m] offset 1h))
for: 10m
annotations:
summary: "Traffic increased by 50% compared to 1 hour ago"
```
### Pattern 3: Threshold with Hysteresis
Prevent flapping with different thresholds for firing and resolving.
```yaml
# Fire at 90%; keep firing until usage drops below 70%.
# Prometheus has no built-in hysteresis, so a common workaround is to
# reference the alert's own ALERTS series in the expression.
# (Adjust the on(...) labels to match your metric's labels.)
- alert: HighCPU
  expr: |
    cpu_usage > 90
    or
    (cpu_usage > 70 and on(instance) ALERTS{alertname="HighCPU", alertstate="firing"} == 1)
  for: 5m
```
### Pattern 4: Absent Metrics
Alert when expected metrics stop being reported (service down).
```yaml
- alert: ServiceDown
expr: absent(up{job="my-service"})
for: 5m
labels:
severity: critical
annotations:
summary: "Service {{ $labels.job }} is not reporting metrics"
```
### Pattern 5: Aggregate Alerts
Alert on aggregate performance across multiple instances.
```yaml
- alert: HighOverallErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.05
for: 10m
annotations:
summary: "Overall error rate is {{ $value | humanizePercentage }}"
```
---
## Alert Annotation Best Practices
### Required Fields
**summary**: One-line description of the issue
```yaml
summary: "High error rate on {{ $labels.service }}: {{ $value | humanizePercentage }}"
```
**description**: Detailed explanation with context
```yaml
description: |
Error rate on {{ $labels.service }} is {{ $value | humanizePercentage }},
which exceeds the threshold of 1% for more than 10 minutes.
Current value: {{ $value }}
Runbook: https://runbooks.example.com/high-error-rate
```
**runbook_url**: Link to investigation steps
```yaml
runbook_url: "https://runbooks.example.com/alerts/{{ $labels.alertname }}"
```
### Optional but Recommended
**dashboard**: Link to relevant dashboard
```yaml
dashboard: "https://grafana.example.com/d/service-dashboard?var-service={{ $labels.service }}"
```
**logs**: Link to logs
```yaml
logs: "https://kibana.example.com/app/discover#/?_a=(query:(query_string:(query:'service:{{ $labels.service }}')))"
```
---
## Alert Label Best Practices
### Required Labels
**severity**: Critical, warning, or info
```yaml
labels:
severity: critical
```
**team**: Who should handle this alert
```yaml
labels:
team: platform
severity: critical
```
**component**: What part of the system
```yaml
labels:
component: api-gateway
severity: warning
```
### Example Complete Alert
```yaml
- alert: HighLatency
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
) > 1
for: 10m
labels:
severity: warning
team: backend
component: api
environment: "{{ $labels.environment }}"
annotations:
summary: "High latency on {{ $labels.service }}"
description: |
P95 latency on {{ $labels.service }} is {{ $value }}s, exceeding 1s threshold.
This may impact user experience. Check recent deployments and database performance.
Current p95: {{ $value }}s
Threshold: 1s
Duration: 10m+
runbook_url: "https://runbooks.example.com/high-latency"
dashboard: "https://grafana.example.com/d/api-dashboard"
logs: "https://kibana.example.com/app/discover#/?_a=(query:(query_string:(query:'service:{{ $labels.service }} AND level:error')))"
```
---
## Alert Thresholds
### General Guidelines
**Response Time / Latency**:
- Warning: p95 > 500ms or p99 > 1s
- Critical: p95 > 2s or p99 > 5s
**Error Rate**:
- Warning: > 1%
- Critical: > 5%
**Availability**:
- Warning: < 99.9%
- Critical: < 99.5%
**CPU Utilization**:
- Warning: > 70% for 15m
- Critical: > 90% for 5m
**Memory Utilization**:
- Warning: > 80% for 15m
- Critical: > 95% for 5m
**Disk Space**:
- Warning: > 80% full
- Critical: > 90% full
**Queue Depth**:
- Warning: > 70% of max capacity
- Critical: > 90% of max capacity
### Application-Specific Thresholds
Set thresholds based on:
1. **Historical performance**: Use p95 of last 30 days + 20%
2. **SLO requirements**: If SLO is 99.9%, alert at 99.5%
3. **Business impact**: What error rate causes user complaints?
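For guideline 1 above, a minimal sketch of deriving a warning threshold from historical latency data (the sample values and percentile choice are hypothetical):
```python
import statistics

# Minimal sketch: p95 of the last 30 days of latency samples plus 20% headroom.
# `latency_samples_ms` is a hypothetical list pulled from your metrics backend.
latency_samples_ms = [120, 180, 95, 240, 310, 150, 205, 175, 260, 330]
p95 = statistics.quantiles(latency_samples_ms, n=100)[94]
warning_threshold_ms = p95 * 1.2
print(round(warning_threshold_ms, 1))
```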
---
## The "for" Clause
Prevent alert flapping by requiring the condition to be true for a duration.
### Guidelines
**Critical alerts**: Short duration (2-5m)
```yaml
- alert: ServiceDown
expr: up == 0
for: 2m # Quick detection for critical issues
```
**Warning alerts**: Longer duration (10-30m)
```yaml
- alert: HighMemoryUsage
expr: memory_usage > 80
for: 15m # Avoid noise from temporary spikes
```
**Resource saturation**: Medium duration (5-10m)
```yaml
- alert: HighCPU
expr: cpu_usage > 90
for: 5m
```
---
## Alert Routing
### Severity-Based Routing
```yaml
# alertmanager.yml
route:
group_by: ['alertname', 'cluster']
group_wait: 10s
group_interval: 5m
repeat_interval: 4h
receiver: 'default'
routes:
# Critical alerts → PagerDuty
- match:
severity: critical
receiver: pagerduty
group_wait: 10s
repeat_interval: 5m
# Warning alerts → Slack
- match:
severity: warning
receiver: slack
group_wait: 30s
repeat_interval: 12h
# Info alerts → Email
- match:
severity: info
receiver: email
repeat_interval: 24h
```
### Team-Based Routing
```yaml
routes:
# Platform team
- match:
team: platform
receiver: platform-pagerduty
# Backend team
- match:
team: backend
receiver: backend-slack
# Database team
- match:
component: database
receiver: dba-pagerduty
```
### Time-Based Routing
```yaml
# Only page during business hours for non-critical
routes:
- match:
severity: warning
receiver: slack
active_time_intervals:
- business_hours
time_intervals:
- name: business_hours
time_intervals:
- weekdays: ['monday:friday']
times:
- start_time: '09:00'
end_time: '17:00'
location: 'America/New_York'
```
---
## Alert Grouping
### Intelligent Grouping
**Group by service and environment**:
```yaml
route:
group_by: ['alertname', 'service', 'environment']
group_wait: 30s
group_interval: 5m
```
This grouping:
- Collapses 50 "HighCPU" alerts on different pods into one grouped notification
- Keeps production and staging alerts from being mixed together
### Inhibition Rules
Suppress related alerts when a parent alert fires.
```yaml
inhibit_rules:
# If service is down, suppress latency alerts
- source_match:
alertname: ServiceDown
target_match:
alertname: HighLatency
equal: ['service']
# If node is down, suppress all pod alerts on that node
- source_match:
alertname: NodeDown
target_match_re:
alertname: '(PodCrashLoop|HighCPU|HighMemory)'
equal: ['node']
```
---
## Runbook Structure
Every alert should link to a runbook with:
### 1. Context
- What does this alert mean?
- What is the user impact?
- What is the urgency?
### 2. Investigation Steps
```markdown
## Investigation
1. Check service health dashboard
https://grafana.example.com/d/service-dashboard
2. Check recent deployments
kubectl rollout history deployment/myapp -n production
3. Check error logs
kubectl logs deployment/myapp -n production --tail=100 | grep ERROR
4. Check dependencies
- Database: Check slow query log
- Redis: Check memory usage
- External APIs: Check status pages
```
### 3. Common Causes
```markdown
## Common Causes
- **Recent deployment**: Check if alert started after deployment
- **Traffic spike**: Check request rate, might need to scale
- **Database issues**: Check query performance and connection pool
- **External API degradation**: Check third-party status pages
```
### 4. Resolution Steps
```markdown
## Resolution
### Immediate Actions (< 5 minutes)
1. Scale up if traffic spike: `kubectl scale deployment myapp --replicas=10`
2. Rollback if recent deployment: `kubectl rollout undo deployment/myapp`
### Short-term Actions (< 30 minutes)
1. Restart pods if memory leak: `kubectl rollout restart deployment/myapp`
2. Clear cache if stale data: `redis-cli -h cache.example.com FLUSHDB`
### Long-term Actions (post-incident)
1. Review and optimize slow queries
2. Implement circuit breakers
3. Add more capacity
4. Update alert thresholds if false positive
```
### 5. Escalation
```markdown
## Escalation
If issue persists after 30 minutes:
- Slack: #backend-oncall
- PagerDuty: Escalate to senior engineer
- Incident Commander: Jane Doe (jane@example.com)
```
---
## Anti-Patterns to Avoid
### 1. Alert on Everything
❌ Don't: Alert on every warning log
✅ Do: Alert on error rate exceeding threshold
### 2. Alert Without Context
❌ Don't: "Error rate high"
✅ Do: "Error rate 5.2% exceeds 1% threshold for 10m, impacting checkout flow"
### 3. Static Thresholds for Dynamic Systems
❌ Don't: `cpu_usage > 70` (fails during scale-up)
✅ Do: Alert on SLO violations or rate of change
### 4. No "for" Clause
❌ Don't: Alert immediately on threshold breach
✅ Do: Use `for: 5m` to avoid flapping
### 5. Too Many Recipients
❌ Don't: Page 10 people for every alert
✅ Do: Route to specific on-call rotation
### 6. Duplicate Alerts
❌ Don't: Alert on both cause and symptom
✅ Do: Alert on symptom, use inhibition for causes
### 7. No Runbook
❌ Don't: Alert without guidance
✅ Do: Include runbook_url in every alert
---
## Alert Testing
### Test Alert Firing
```bash
# Send a test alert to Alertmanager with amtool
amtool alert add alertname="TestAlert" severity="warning" \
  --annotation=summary="Test alert" \
  --alertmanager.url=http://alertmanager:9093
# Or use the Alertmanager API directly
curl -X POST http://alertmanager:9093/api/v1/alerts \
-d '[{
"labels": {"alertname": "TestAlert", "severity": "critical"},
"annotations": {"summary": "Test critical alert"}
}]'
```
### Verify Alert Rules
```bash
# Check syntax
promtool check rules alerts.yml
# Test expression
promtool query instant http://prometheus:9090 \
'sum(rate(http_requests_total{status=~"5.."}[5m]))'
# Unit test alerts
promtool test rules test.yml
```
### Test Alertmanager Routing
```bash
# Test which receiver an alert would go to
amtool config routes test \
--config.file=alertmanager.yml \
alertname="HighLatency" \
severity="critical" \
team="backend"
```
---
## On-Call Best Practices
### Rotation Schedule
- **Primary on-call**: First responder
- **Secondary on-call**: Escalation backup
- **Rotation length**: 1 week (balance load vs context)
- **Handoff**: Monday morning (not Friday evening)
### On-Call Checklist
```markdown
## Pre-shift
- [ ] Test pager/phone
- [ ] Review recent incidents
- [ ] Check upcoming deployments
- [ ] Update contact info
## During shift
- [ ] Respond to pages within 5 minutes
- [ ] Document all incidents
- [ ] Update runbooks if gaps found
- [ ] Communicate in #incidents channel
## Post-shift
- [ ] Hand off open incidents
- [ ] Complete incident reports
- [ ] Suggest improvements
- [ ] Update team documentation
```
### Escalation Policy
1. **Primary**: Responds within 5 minutes
2. **Secondary**: Auto-escalate after 15 minutes
3. **Manager**: Auto-escalate after 30 minutes
4. **Incident Commander**: Critical incidents only
---
## Metrics About Alerts
Monitor your monitoring system!
### Key Metrics
```promql
# Alert firing frequency
sum(ALERTS{alertstate="firing"}) by (alertname)
# Alert duration
ALERTS_FOR_STATE{alertstate="firing"}
# Alerts per severity
sum(ALERTS{alertstate="firing"}) by (severity)
# Time to acknowledge (from PagerDuty/etc)
pagerduty_incident_ack_duration_seconds
```
### Alert Quality Metrics
- **Mean Time to Acknowledge (MTTA)**: < 5 minutes
- **Mean Time to Resolve (MTTR)**: < 30 minutes
- **False Positive Rate**: < 10%
- **Alert Coverage**: > 80% of incidents preceded by an alert
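These quality metrics can be computed from incident records; a minimal sketch, assuming hypothetical record fields exported from your paging tool:
```python
from datetime import datetime, timedelta

# Minimal sketch: compute MTTA, MTTR, and false-positive rate from incident records.
# The record fields (triggered/acknowledged/resolved/false_positive) are hypothetical.
incidents = [
    {"triggered": datetime(2024, 10, 1, 10, 0), "acknowledged": datetime(2024, 10, 1, 10, 3),
     "resolved": datetime(2024, 10, 1, 10, 25), "false_positive": False},
    {"triggered": datetime(2024, 10, 2, 14, 0), "acknowledged": datetime(2024, 10, 2, 14, 6),
     "resolved": datetime(2024, 10, 2, 14, 10), "false_positive": True},
]

mtta = sum((i["acknowledged"] - i["triggered"] for i in incidents), timedelta()) / len(incidents)
mttr = sum((i["resolved"] - i["triggered"] for i in incidents), timedelta()) / len(incidents)
false_positive_rate = sum(i["false_positive"] for i in incidents) / len(incidents)

print(mtta, mttr, f"{false_positive_rate:.0%}")
```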

# Migrating from Datadog to Open-Source Stack
## Overview
This guide helps you migrate from Datadog to a cost-effective open-source observability stack:
- **Metrics**: Datadog → Prometheus + Grafana
- **Logs**: Datadog → Loki + Grafana
- **Traces**: Datadog APM → Tempo/Jaeger + Grafana
- **Dashboards**: Datadog → Grafana
- **Alerts**: Datadog Monitors → Prometheus Alertmanager
**Estimated Cost Savings**: 60-80% for similar functionality
---
## Cost Comparison
### Example: 100-host infrastructure
**Datadog**:
- Infrastructure Pro: $1,500/month (100 hosts × $15)
- Custom Metrics: $50/month (5,000 extra metrics beyond included 10,000)
- Logs: $2,000/month (~20GB/day, ingestion plus indexing charges)
- APM: $3,100/month (100 hosts × $31)
- **Total**: ~$6,650/month ($79,800/year)
**Open-Source Stack** (self-hosted):
- Infrastructure: $1,200/month (EC2/GKE for Prometheus, Grafana, Loki, Tempo)
- Storage: $300/month (S3/GCS for long-term metrics and traces)
- Operations time: Variable
- **Total**: ~$1,500-2,500/month ($18,000-30,000/year)
**Savings**: $49,800-61,800/year
---
## Migration Strategy
### Phase 1: Run Parallel (Month 1-2)
- Deploy open-source stack alongside Datadog
- Migrate metrics first (lowest risk)
- Validate data accuracy
- Build confidence
### Phase 2: Migrate Dashboards & Alerts (Month 2-3)
- Convert Datadog dashboards to Grafana
- Translate alert rules
- Train team on new tools
### Phase 3: Migrate Logs & Traces (Month 3-4)
- Set up Loki for log aggregation
- Deploy Tempo/Jaeger for tracing
- Update application instrumentation
### Phase 4: Decommission Datadog (Month 4-5)
- Confirm all functionality migrated
- Cancel Datadog subscription
- Archive Datadog dashboards/alerts for reference
---
## 1. Metrics Migration (Datadog → Prometheus)
### Step 1: Deploy Prometheus
**Kubernetes** (recommended):
```yaml
# prometheus-values.yaml
prometheus:
prometheusSpec:
retention: 30d
storageSpec:
volumeClaimTemplate:
spec:
resources:
requests:
storage: 100Gi
# Scrape configs
additionalScrapeConfigs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
```
**Install**:
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack -f prometheus-values.yaml
```
**Docker Compose**:
```yaml
version: '3'
services:
prometheus:
image: prom/prometheus:latest
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus-data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.retention.time=30d'
volumes:
prometheus-data:
```
### Step 2: Replace DogStatsD with Prometheus Exporters
**Before (DogStatsD)**:
```python
from datadog import statsd
statsd.increment('page.views')
statsd.histogram('request.duration', 0.5)
statsd.gauge('active_users', 100)
```
**After (Prometheus Python client)**:
```python
from prometheus_client import Counter, Histogram, Gauge
page_views = Counter('page_views_total', 'Page views')
request_duration = Histogram('request_duration_seconds', 'Request duration')
active_users = Gauge('active_users', 'Active users')
# Usage
page_views.inc()
request_duration.observe(0.5)
active_users.set(100)
```
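Unlike DogStatsD, which pushes packets to an agent, Prometheus pulls metrics, so the process must expose an HTTP endpoint. A minimal sketch using `start_http_server` from `prometheus_client` (port 8000 is an arbitrary choice):
```python
import random
import time

from prometheus_client import Counter, start_http_server

page_views = Counter('page_views_total', 'Page views')

if __name__ == "__main__":
    # Prometheus scrapes this /metrics endpoint; add the port to your scrape configs
    start_http_server(8000)
    while True:
        page_views.inc()
        time.sleep(random.uniform(0.1, 1.0))
```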
### Step 3: Metric Name Translation
| Datadog Metric | Prometheus Equivalent |
|----------------|----------------------|
| `system.cpu.idle` | `node_cpu_seconds_total{mode="idle"}` |
| `system.mem.free` | `node_memory_MemFree_bytes` |
| `system.disk.used` | `node_filesystem_size_bytes - node_filesystem_free_bytes` |
| `docker.cpu.usage` | `container_cpu_usage_seconds_total` |
| `kubernetes.pods.running` | `kube_pod_status_phase{phase="Running"}` |
### Step 4: Export Existing Datadog Metrics (Optional)
Use Datadog API to export historical data:
```python
import time
from datadog import api, initialize
options = {
'api_key': 'YOUR_API_KEY',
'app_key': 'YOUR_APP_KEY'
}
initialize(**options)
# Query metric
result = api.Metric.query(
start=int(time.time() - 86400), # Last 24h
end=int(time.time()),
query='avg:system.cpu.user{*}'
)
# Convert to Prometheus format and import
```
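One hedged sketch of that last step, continuing from the `result` variable above: write the exported points as OpenMetrics text and backfill them with `promtool tsdb create-blocks-from openmetrics` (the metric name below is illustrative; the response shape follows the Datadog API result above):
```python
# Write Datadog points as OpenMetrics text, then backfill into Prometheus with:
#   promtool tsdb create-blocks-from openmetrics datadog_backfill.om <prometheus-data-dir>
with open("datadog_backfill.om", "w") as f:
    for series in result["series"]:
        for ts_ms, value in series["pointlist"]:
            if value is not None:
                # OpenMetrics backfill expects timestamps in seconds
                f.write(f"datadog_system_cpu_user {value} {ts_ms / 1000:.3f}\n")
    f.write("# EOF\n")
```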
---
## 2. Dashboard Migration (Datadog → Grafana)
### Step 1: Export Datadog Dashboards
```python
import requests
import json
api_key = "YOUR_API_KEY"
app_key = "YOUR_APP_KEY"
headers = {
'DD-API-KEY': api_key,
'DD-APPLICATION-KEY': app_key
}
# Get all dashboards
response = requests.get(
'https://api.datadoghq.com/api/v1/dashboard',
headers=headers
)
dashboards = response.json()
# Export each dashboard
for dashboard in dashboards['dashboards']:
dash_id = dashboard['id']
detail = requests.get(
f'https://api.datadoghq.com/api/v1/dashboard/{dash_id}',
headers=headers
).json()
with open(f'datadog_{dash_id}.json', 'w') as f:
json.dump(detail, f, indent=2)
```
### Step 2: Convert to Grafana Format
**Manual Conversion Template**:
| Datadog Widget | Grafana Panel Type |
|----------------|-------------------|
| Timeseries | Graph / Time series |
| Query Value | Stat |
| Toplist | Table / Bar gauge |
| Heatmap | Heatmap |
| Distribution | Histogram |
**Automated Conversion** (basic example):
```python
# Map Datadog widget types to Grafana panel types (see the table above)
WIDGET_TYPE_MAP = {"timeseries": "timeseries", "query_value": "stat",
                   "toplist": "table", "heatmap": "heatmap", "distribution": "histogram"}

def map_widget_type(widget_type):
    return WIDGET_TYPE_MAP.get(widget_type, "timeseries")

def convert_queries(requests):
    # Queries still need DQL -> PromQL translation (see dql_promql_translation.md)
    return [{"expr": req.get("q", "")} for req in requests]

def convert_datadog_to_grafana(datadog_dashboard):
grafana_dashboard = {
"title": datadog_dashboard['title'],
"panels": []
}
for widget in datadog_dashboard['widgets']:
panel = {
"title": widget['definition'].get('title', ''),
"type": map_widget_type(widget['definition']['type']),
"targets": convert_queries(widget['definition']['requests'])
}
grafana_dashboard['panels'].append(panel)
return grafana_dashboard
```
### Step 3: Common Query Translations
See `dql_promql_translation.md` for comprehensive query mappings.
**Example conversions**:
```
Datadog: avg:system.cpu.user{*}
Prometheus: avg(rate(node_cpu_seconds_total{mode="user"}[5m])) * 100
Datadog: sum:requests.count{status:200}.as_rate()
Prometheus: sum(rate(http_requests_total{status="200"}[5m]))
Datadog: p95:request.duration{*}
Prometheus: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
---
## 3. Alert Migration (Datadog Monitors → Prometheus Alerts)
### Step 1: Export Datadog Monitors
```python
import requests
import json
api_key = "YOUR_API_KEY"
app_key = "YOUR_APP_KEY"
headers = {
'DD-API-KEY': api_key,
'DD-APPLICATION-KEY': app_key
}
response = requests.get(
'https://api.datadoghq.com/api/v1/monitor',
headers=headers
)
monitors = response.json()
# Save each monitor
for monitor in monitors:
with open(f'monitor_{monitor["id"]}.json', 'w') as f:
json.dump(monitor, f, indent=2)
```
### Step 2: Convert to Prometheus Alert Rules
**Datadog Monitor**:
```json
{
"name": "High CPU Usage",
"type": "metric alert",
"query": "avg(last_5m):avg:system.cpu.user{*} > 80",
"message": "CPU usage is high on {{host.name}}"
}
```
**Prometheus Alert**:
```yaml
groups:
- name: infrastructure
rules:
- alert: HighCPUUsage
expr: |
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value }}%"
```
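A rough sketch of scripting this conversion for many exported monitors; it only produces a rule-file skeleton, and each `expr` still has to be translated by hand (file names follow the export script above, `pyyaml` is assumed):
```python
import glob
import json

import yaml  # pip install pyyaml

# Minimal sketch: turn exported Datadog monitors into a Prometheus rule-file skeleton.
rules = []
for path in glob.glob("monitor_*.json"):
    with open(path) as f:
        monitor = json.load(f)
    rules.append({
        "alert": monitor["name"].replace(" ", ""),
        "expr": f'# TODO translate: {monitor["query"]}',  # translate manually
        "for": "5m",
        "labels": {"severity": "warning"},
        "annotations": {"summary": monitor.get("message", "")},
    })

with open("converted_rules.yml", "w") as f:
    yaml.safe_dump({"groups": [{"name": "migrated-from-datadog", "rules": rules}]}, f)
```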
### Step 3: Alert Routing (Datadog → Alertmanager)
**Datadog notification channels** map to **Alertmanager receivers**:
```yaml
# alertmanager.yml
route:
group_by: ['alertname', 'severity']
receiver: 'slack-notifications'
receivers:
- name: 'slack-notifications'
slack_configs:
- api_url: 'YOUR_SLACK_WEBHOOK'
channel: '#alerts'
- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_KEY'
```
---
## 4. Log Migration (Datadog → Loki)
### Step 1: Deploy Loki
**Kubernetes**:
```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
--set loki.persistence.enabled=true \
--set loki.persistence.size=100Gi \
--set promtail.enabled=true
```
**Docker Compose**:
```yaml
version: '3'
services:
loki:
image: grafana/loki:latest
ports:
- "3100:3100"
volumes:
- ./loki-config.yaml:/etc/loki/local-config.yaml
- loki-data:/loki
promtail:
image: grafana/promtail:latest
volumes:
- /var/log:/var/log
- ./promtail-config.yaml:/etc/promtail/config.yml
volumes:
loki-data:
```
### Step 2: Replace Datadog Log Forwarder
**Before (Datadog Agent)**:
```yaml
# datadog.yaml
logs_enabled: true
logs_config:
container_collect_all: true
```
**After (Promtail)**:
```yaml
# promtail-config.yaml
server:
http_listen_port: 9080
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: system
static_configs:
- targets:
- localhost
labels:
job: varlogs
__path__: /var/log/*.log
```
### Step 3: Query Translation
**Datadog Logs Query**:
```
service:my-app status:error
```
**Loki LogQL**:
```logql
{job="my-app", level="error"}
```
**More examples**:
```
Datadog: service:api-gateway status:error @http.status_code:>=500
Loki: {service="api-gateway", level="error"} | json | http_status_code >= 500
Datadog: source:nginx "404"
Loki: {source="nginx"} |= "404"
```
---
## 5. APM Migration (Datadog APM → Tempo/Jaeger)
### Step 1: Choose Tracing Backend
- **Tempo**: Better for high volume, cheaper storage (object storage)
- **Jaeger**: More mature, better UI, requires separate storage
### Step 2: Replace Datadog Tracer with OpenTelemetry
**Before (Datadog Python)**:
```python
from ddtrace import tracer
@tracer.wrap()
def my_function():
pass
```
**After (OpenTelemetry)**:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup: register a provider that exports spans to the OTLP endpoint (Tempo)
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="tempo:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("my_function")
def my_function():
    pass
```
### Step 3: Deploy Tempo
```yaml
# tempo.yaml
server:
http_listen_port: 3200
distributor:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
storage:
trace:
backend: s3
s3:
bucket: tempo-traces
endpoint: s3.amazonaws.com
```
---
## 6. Infrastructure Migration
### Recommended Architecture
```
┌─────────────────────────────────────────┐
│ Grafana (Visualization) │
│ - Dashboards │
│ - Unified view │
└─────────────────────────────────────────┘
↓ ↓ ↓
┌──────────────┐ ┌──────────┐ ┌──────────┐
│ Prometheus │ │ Loki │ │ Tempo │
│ (Metrics) │ │ (Logs) │ │ (Traces) │
└──────────────┘ └──────────┘ └──────────┘
↓ ↓ ↓
┌─────────────────────────────────────────┐
│ Applications (OpenTelemetry) │
└─────────────────────────────────────────┘
```
### Sizing Recommendations
**100-host environment**:
- **Prometheus**: 2-4 CPU, 8-16GB RAM, 100GB SSD
- **Grafana**: 1 CPU, 2GB RAM
- **Loki**: 2-4 CPU, 8GB RAM, 100GB SSD
- **Tempo**: 2-4 CPU, 8GB RAM, S3 for storage
- **Alertmanager**: 1 CPU, 1GB RAM
**Total**: ~8-16 CPU, 32-64GB RAM, 200GB SSD + object storage
---
## 7. Migration Checklist
### Pre-Migration
- [ ] Calculate current Datadog costs
- [ ] Identify all Datadog integrations
- [ ] Export all dashboards
- [ ] Export all monitors
- [ ] Document custom metrics
- [ ] Get stakeholder approval
### During Migration
- [ ] Deploy Prometheus + Grafana
- [ ] Deploy Loki + Promtail
- [ ] Deploy Tempo/Jaeger (if using APM)
- [ ] Migrate metrics instrumentation
- [ ] Convert dashboards (top 10 critical first)
- [ ] Convert alerts (critical alerts first)
- [ ] Update application logging
- [ ] Replace APM instrumentation
- [ ] Run parallel for 2-4 weeks
- [ ] Validate data accuracy
- [ ] Train team on new tools
### Post-Migration
- [ ] Decommission Datadog agent from all hosts
- [ ] Cancel Datadog subscription
- [ ] Archive Datadog configs
- [ ] Document new workflows
- [ ] Create runbooks for common tasks
---
## 8. Common Challenges & Solutions
### Challenge: Missing Datadog Features
**Datadog Synthetic Monitoring**:
- Solution: Use **Blackbox Exporter** (Prometheus) or **Grafana Synthetic Monitoring**
**Datadog Network Performance Monitoring**:
- Solution: Use **Cilium Hubble** (Kubernetes) or **eBPF-based tools**
**Datadog RUM (Real User Monitoring)**:
- Solution: Use **Grafana Faro** or **OpenTelemetry Browser SDK**
### Challenge: Team Learning Curve
**Solution**:
- Provide training sessions (2-3 hours per tool)
- Create internal documentation with examples
- Set up sandbox environment for practice
- Assign champions for each tool
### Challenge: Query Performance
**Prometheus too slow**:
- Use **Thanos** or **Cortex** for scaling
- Implement recording rules for expensive queries
- Increase retention only where needed
**Loki too slow**:
- Add more labels for better filtering
- Use chunk caching
- Consider **parallel query execution**
---
## 9. Maintenance Comparison
### Datadog (Managed)
- **Ops burden**: Low (fully managed)
- **Upgrades**: Automatic
- **Scaling**: Automatic
- **Cost**: High ($6k-10k+/month)
### Open-Source Stack (Self-hosted)
- **Ops burden**: Medium (requires ops team)
- **Upgrades**: Manual (quarterly)
- **Scaling**: Manual planning required
- **Cost**: Low ($1.5k-3k/month infrastructure)
**Hybrid Option**: Use **Grafana Cloud** (managed Prometheus/Loki/Tempo)
- Cost: ~$3k/month for 100 hosts
- Ops burden: Low
- Savings: ~50% vs Datadog
---
## 10. ROI Calculation
### Example Scenario
**Before (Datadog)**:
- Monthly cost: $7,000
- Annual cost: $84,000
**After (Self-hosted OSS)**:
- Infrastructure: $1,800/month
- Operations (0.5 FTE): $4,000/month
- Annual cost: $69,600
**Savings**: $14,400/year
**After (Grafana Cloud)**:
- Monthly cost: $3,500
- Annual cost: $42,000
**Savings**: $42,000/year (50%)
**Break-even**: Immediate (no migration costs beyond engineering time)
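The same figures as a quick sanity-check calculation:
```python
# ROI numbers from this section (all figures as stated above)
datadog_annual = 7_000 * 12                  # $84,000
self_hosted_annual = (1_800 + 4_000) * 12    # infra + 0.5 FTE ops = $69,600
grafana_cloud_annual = 3_500 * 12            # $42,000

print(datadog_annual - self_hosted_annual)   # 14,400
print(datadog_annual - grafana_cloud_annual) # 42,000
```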
---
## Resources
- **Prometheus**: https://prometheus.io/docs/
- **Grafana**: https://grafana.com/docs/
- **Loki**: https://grafana.com/docs/loki/
- **Tempo**: https://grafana.com/docs/tempo/
- **OpenTelemetry**: https://opentelemetry.io/
- **Migration Tools**: https://github.com/grafana/dashboard-linter
---
## Support
If you need help with migration:
- Grafana Labs offers migration consulting
- Many SRE consulting firms specialize in this
- Community support via Slack/Discord channels

# DQL (Datadog Query Language) ↔ PromQL Translation Guide
## Quick Reference
| Concept | Datadog (DQL) | Prometheus (PromQL) |
|---------|---------------|---------------------|
| Aggregation | `avg:`, `sum:`, `min:`, `max:` | `avg()`, `sum()`, `min()`, `max()` |
| Rate | `.as_rate()`, `.as_count()` | `rate()`, `increase()` |
| Percentile | `p50:`, `p95:`, `p99:` | `histogram_quantile()` |
| Filtering | `{tag:value}` | `{label="value"}` |
| Time window | `last_5m`, `last_1h` | `[5m]`, `[1h]` |
---
## Basic Queries
### Simple Metric Query
**Datadog**:
```
system.cpu.user
```
**Prometheus**:
```promql
node_cpu_seconds_total{mode="user"}
```
---
### Metric with Filter
**Datadog**:
```
system.cpu.user{host:web-01}
```
**Prometheus**:
```promql
node_cpu_seconds_total{mode="user", instance="web-01"}
```
---
### Multiple Filters (AND)
**Datadog**:
```
system.cpu.user{host:web-01,env:production}
```
**Prometheus**:
```promql
node_cpu_seconds_total{mode="user", instance="web-01", env="production"}
```
---
### Wildcard Filters
**Datadog**:
```
system.cpu.user{host:web-*}
```
**Prometheus**:
```promql
node_cpu_seconds_total{mode="user", instance=~"web-.*"}
```
---
### OR Filters
**Datadog**:
```
system.cpu.user{host:web-01 OR host:web-02}
```
**Prometheus**:
```promql
node_cpu_seconds_total{mode="user", instance=~"web-01|web-02"}
```
---
## Aggregations
### Average
**Datadog**:
```
avg:system.cpu.user{*}
```
**Prometheus**:
```promql
avg(node_cpu_seconds_total{mode="user"})
```
---
### Sum
**Datadog**:
```
sum:requests.count{*}
```
**Prometheus**:
```promql
sum(http_requests_total)
```
---
### Min/Max
**Datadog**:
```
min:system.mem.free{*}
max:system.mem.free{*}
```
**Prometheus**:
```promql
min(node_memory_MemFree_bytes)
max(node_memory_MemFree_bytes)
```
---
### Aggregation by Tag/Label
**Datadog**:
```
avg:system.cpu.user{*} by {host}
```
**Prometheus**:
```promql
avg by (instance) (node_cpu_seconds_total{mode="user"})
```
---
## Rates and Counts
### Rate (per second)
**Datadog**:
```
sum:requests.count{*}.as_rate()
```
**Prometheus**:
```promql
sum(rate(http_requests_total[5m]))
```
Note: Prometheus requires explicit time window `[5m]`
---
### Count (total over time)
**Datadog**:
```
sum:requests.count{*}.as_count()
```
**Prometheus**:
```promql
sum(increase(http_requests_total[1h]))
```
---
### Derivative (change over time)
**Datadog**:
```
derivative(avg:system.disk.used{*})
```
**Prometheus**:
```promql
deriv(node_filesystem_size_bytes[5m])
```
---
## Percentiles
### P50 (Median)
**Datadog**:
```
p50:request.duration{*}
```
**Prometheus** (requires histogram):
```promql
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
---
### P95
**Datadog**:
```
p95:request.duration{*}
```
**Prometheus**:
```promql
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
---
### P99
**Datadog**:
```
p99:request.duration{*}
```
**Prometheus**:
```promql
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
---
## Time Windows
### Last 5 minutes
**Datadog**:
```
avg(last_5m):system.cpu.user{*}
```
**Prometheus**:
```promql
avg_over_time(node_cpu_seconds_total{mode="user"}[5m])
```
---
### Last 1 hour
**Datadog**:
```
avg(last_1h):system.cpu.user{*}
```
**Prometheus**:
```promql
avg_over_time(node_cpu_seconds_total{mode="user"}[1h])
```
---
## Math Operations
### Division
**Datadog**:
```
avg:system.mem.used{*} / avg:system.mem.total{*}
```
**Prometheus**:
```promql
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
```
---
### Multiplication
**Datadog**:
```
avg:system.cpu.user{*} * 100
```
**Prometheus**:
```promql
avg(node_cpu_seconds_total{mode="user"}) * 100
```
---
### Percentage Calculation
**Datadog**:
```
(sum:requests.errors{*} / sum:requests.count{*}) * 100
```
**Prometheus**:
```promql
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
```
---
## Common Use Cases
### CPU Usage Percentage
**Datadog**:
```
100 - avg:system.cpu.idle{*}
```
**Prometheus**:
```promql
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```
---
### Memory Usage Percentage
**Datadog**:
```
(avg:system.mem.used{*} / avg:system.mem.total{*}) * 100
```
**Prometheus**:
```promql
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
```
---
### Disk Usage Percentage
**Datadog**:
```
(avg:system.disk.used{*} / avg:system.disk.total{*}) * 100
```
**Prometheus**:
```promql
(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100
```
---
### Request Rate (requests/sec)
**Datadog**:
```
sum:requests.count{*}.as_rate()
```
**Prometheus**:
```promql
sum(rate(http_requests_total[5m]))
```
---
### Error Rate Percentage
**Datadog**:
```
(sum:requests.errors{*}.as_rate() / sum:requests.count{*}.as_rate()) * 100
```
**Prometheus**:
```promql
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
```
---
### Request Latency (P95)
**Datadog**:
```
p95:request.duration{*}
```
**Prometheus**:
```promql
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
---
### Top 5 Hosts by CPU
**Datadog**:
```
top(avg:system.cpu.user{*} by {host}, 5, 'mean', 'desc')
```
**Prometheus**:
```promql
topk(5, avg by (instance) (rate(node_cpu_seconds_total{mode="user"}[5m])))
```
---
## Functions
### Absolute Value
**Datadog**:
```
abs(diff(avg:system.cpu.user{*}))
```
**Prometheus**:
```promql
abs(delta(node_cpu_seconds_total{mode="user"}[5m]))
```
---
### Ceiling/Floor
**Datadog**:
```
ceil(avg:system.cpu.user{*})
floor(avg:system.cpu.user{*})
```
**Prometheus**:
```promql
ceil(avg(node_cpu_seconds_total{mode="user"}))
floor(avg(node_cpu_seconds_total{mode="user"}))
```
---
### Clamp (Limit Range)
**Datadog**:
```
clamp_min(avg:system.cpu.user{*}, 0)
clamp_max(avg:system.cpu.user{*}, 100)
```
**Prometheus**:
```promql
clamp_min(avg(node_cpu_seconds_total{mode="user"}), 0)
clamp_max(avg(node_cpu_seconds_total{mode="user"}), 100)
```
---
### Moving Average
**Datadog**:
```
moving_rollup(avg:system.cpu.user{*}, 60, 'avg')
```
**Prometheus**:
```promql
avg_over_time(node_cpu_seconds_total{mode="user"}[1m])
```
---
## Advanced Patterns
### Compare to Previous Period
**Datadog**:
```
sum:requests.count{*}.as_rate() / timeshift(sum:requests.count{*}.as_rate(), 3600)
```
**Prometheus**:
```promql
sum(rate(http_requests_total[5m])) / sum(rate(http_requests_total[5m] offset 1h))
```
---
### Forecast
**Datadog**:
```
forecast(avg:system.disk.used{*}, 'linear', 1)
```
**Prometheus**:
```promql
predict_linear(node_filesystem_size_bytes[1h], 3600)
```
Note: Predicts value 1 hour in future based on last 1 hour trend
---
### Anomaly Detection
**Datadog**:
```
anomalies(avg:system.cpu.user{*}, 'basic', 2)
```
**Prometheus**: No built-in function
- Use recording rules with stddev
- External tools like **Robust Perception's anomaly detector**
- Or use **Grafana ML** plugin
---
### Outlier Detection
**Datadog**:
```
outliers(avg:system.cpu.user{*} by {host}, 'mad')
```
**Prometheus**: No built-in function
- Calculate manually with stddev:
```promql
abs(metric - scalar(avg(metric))) > 2 * scalar(stddev(metric))
```
---
## Container & Kubernetes
### Container CPU Usage
**Datadog**:
```
avg:docker.cpu.usage{*} by {container_name}
```
**Prometheus**:
```promql
avg by (container) (rate(container_cpu_usage_seconds_total[5m]))
```
---
### Container Memory Usage
**Datadog**:
```
avg:docker.mem.rss{*} by {container_name}
```
**Prometheus**:
```promql
avg by (container) (container_memory_rss)
```
---
### Pod Count by Status
**Datadog**:
```
sum:kubernetes.pods.running{*} by {kube_namespace}
```
**Prometheus**:
```promql
sum by (namespace) (kube_pod_status_phase{phase="Running"})
```
---
## Database Queries
### MySQL Queries Per Second
**Datadog**:
```
sum:mysql.performance.queries{*}.as_rate()
```
**Prometheus**:
```promql
sum(rate(mysql_global_status_queries[5m]))
```
---
### PostgreSQL Active Connections
**Datadog**:
```
avg:postgresql.connections{*}
```
**Prometheus**:
```promql
avg(pg_stat_database_numbackends)
```
---
### Redis Memory Usage
**Datadog**:
```
avg:redis.mem.used{*}
```
**Prometheus**:
```promql
avg(redis_memory_used_bytes)
```
---
## Network Metrics
### Network Bytes Sent
**Datadog**:
```
sum:system.net.bytes_sent{*}.as_rate()
```
**Prometheus**:
```promql
sum(rate(node_network_transmit_bytes_total[5m]))
```
---
### Network Bytes Received
**Datadog**:
```
sum:system.net.bytes_rcvd{*}.as_rate()
```
**Prometheus**:
```promql
sum(rate(node_network_receive_bytes_total[5m]))
```
---
## Key Differences
### 1. Time Windows
- **Datadog**: Optional, defaults to query time range
- **Prometheus**: Always required for rate/increase functions
### 2. Histograms
- **Datadog**: Percentiles available directly
- **Prometheus**: Requires histogram buckets + `histogram_quantile()`
### 3. Default Aggregation
- **Datadog**: No default, must specify
- **Prometheus**: Returns all time series unless aggregated
### 4. Metric Types
- **Datadog**: All metrics treated similarly
- **Prometheus**: Explicit types (counter, gauge, histogram, summary)
### 5. Tag vs Label
- **Datadog**: Uses "tags" (key:value)
- **Prometheus**: Uses "labels" (key="value")
---
## Migration Tips
1. **Start with dashboards**: Convert most-used dashboards first
2. **Use recording rules**: Pre-calculate expensive PromQL queries
3. **Test in parallel**: Run both systems during migration
4. **Document mappings**: Create team-specific translation guide
5. **Train team**: PromQL has learning curve, invest in training
---
## Tools
- **Datadog Dashboard Exporter**: Export JSON dashboards
- **Grafana Dashboard Linter**: Validate converted dashboards
- **PromQL Learning Resources**: https://prometheus.io/docs/prometheus/latest/querying/basics/
---
## Common Gotchas
### Rate without Time Window
**Wrong**:
```promql
rate(http_requests_total)
```
**Correct**:
```promql
rate(http_requests_total[5m])
```
---
### Aggregating Before Rate
**Wrong**:
```promql
rate(sum(http_requests_total)[5m])
```
**Correct**:
```promql
sum(rate(http_requests_total[5m]))
```
---
### Histogram Quantile Without by (le)
**Wrong**:
```promql
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```
**Correct**:
```promql
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
---
## Quick Conversion Checklist
When converting a Datadog query to PromQL:
- [ ] Replace metric name (e.g., `system.cpu.user` → `node_cpu_seconds_total`)
- [ ] Convert tags to labels (`{tag:value}` → `{label="value"}`)
- [ ] Add time window for rate/increase (`[5m]`)
- [ ] Change aggregation syntax (`avg:` → `avg()`)
- [ ] Convert percentiles to histogram_quantile if needed
- [ ] Test query in Prometheus before adding to dashboard
- [ ] Add `by (label)` for grouped aggregations
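A very rough sketch of automating the mechanical parts of this checklist; it only handles simple `agg:metric{tag:value}` queries, and everything it produces still needs review against the mappings above (metric names in particular):
```python
import re

# Rough sketch: swap aggregator syntax, convert tag filters to label matchers,
# and wrap .as_rate() queries in rate() with an explicit window.
def naive_dql_to_promql(dql: str, window: str = "5m") -> str:
    m = re.match(r"(?P<agg>avg|sum|min|max):(?P<metric>[\w.]+)"
                 r"\{(?P<tags>[^}]*)\}(?P<rate>\.as_rate\(\))?$", dql)
    if not m:
        raise ValueError(f"unsupported query shape: {dql}")
    metric = m["metric"].replace(".", "_")      # still needs a real name mapping
    labels = ", ".join(
        f'{k}="{v}"' for k, v in
        (t.split(":", 1) for t in m["tags"].split(",") if t and t != "*")
    )
    selector = f"{metric}{{{labels}}}" if labels else metric
    if m["rate"]:
        selector = f"rate({selector}[{window}])"
    return f'{m["agg"]}({selector})'

print(naive_dql_to_promql("sum:requests.count{status:200}.as_rate()"))
# sum(rate(requests_count{status="200"}[5m]))
```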
---
## Need More Help?
- See `datadog_migration.md` for full migration guide
- PromQL documentation: https://prometheus.io/docs/prometheus/latest/querying/
- Practice at: https://demo.promlens.com/

# Logging Guide
## Structured Logging
### Why Structured Logs?
**Unstructured** (text):
```
2024-10-28 14:32:15 User john@example.com logged in from 192.168.1.1
```
**Structured** (JSON):
```json
{
"timestamp": "2024-10-28T14:32:15Z",
"level": "info",
"message": "User logged in",
"user": "john@example.com",
"ip": "192.168.1.1",
"event_type": "user_login"
}
```
**Benefits**:
- Easy to parse and query
- Consistent format
- Machine-readable
- Efficient storage and indexing
---
## Log Levels
Use appropriate log levels for better filtering and alerting.
### DEBUG
**When**: Development, troubleshooting
**Examples**:
- Function entry/exit
- Variable values
- Internal state changes
```python
logger.debug("Processing request", extra={
"request_id": req_id,
"params": params
})
```
### INFO
**When**: Important business events
**Examples**:
- User actions (login, purchase)
- System state changes (started, stopped)
- Significant milestones
```python
logger.info("Order placed", extra={
"order_id": "12345",
"user_id": "user123",
"amount": 99.99
})
```
### WARN
**When**: Potentially problematic situations
**Examples**:
- Deprecated API usage
- Slow operations (but not failing)
- Retry attempts
- Resource usage approaching limits
```python
logger.warning("API response slow", extra={
"endpoint": "/api/users",
"duration_ms": 2500,
"threshold_ms": 1000
})
```
### ERROR
**When**: Error conditions that need attention
**Examples**:
- Failed requests
- Exceptions caught and handled
- Integration failures
- Data validation errors
```python
logger.error("Payment processing failed", extra={
"order_id": "12345",
"error": str(e),
"payment_gateway": "stripe"
}, exc_info=True)
```
### FATAL/CRITICAL
**When**: Severe errors causing shutdown
**Examples**:
- Database connection lost
- Out of memory
- Configuration errors preventing startup
```python
logger.critical("Database connection lost", extra={
"database": "postgres",
"host": "db.example.com",
"attempt": 3
})
```
---
## Required Fields
Every log entry should include:
### 1. Timestamp
ISO 8601 format with timezone:
```json
{
"timestamp": "2024-10-28T14:32:15.123Z"
}
```
### 2. Level
Standard levels: debug, info, warn, error, critical
```json
{
"level": "error"
}
```
### 3. Message
Human-readable description:
```json
{
"message": "User authentication failed"
}
```
### 4. Service/Application
What component logged this:
```json
{
"service": "api-gateway",
"version": "1.2.3"
}
```
### 5. Environment
```json
{
"environment": "production"
}
```
---
## Recommended Fields
### Request Context
```json
{
"request_id": "550e8400-e29b-41d4-a716-446655440000",
"user_id": "user123",
"session_id": "sess_abc",
"ip_address": "192.168.1.1",
"user_agent": "Mozilla/5.0..."
}
```
### Performance Metrics
```json
{
"duration_ms": 245,
"response_size_bytes": 1024
}
```
### Error Details
```json
{
"error_type": "ValidationError",
"error_message": "Invalid email format",
"stack_trace": "...",
"error_code": "VAL_001"
}
```
### Business Context
```json
{
"order_id": "ORD-12345",
"customer_id": "CUST-789",
"transaction_amount": 99.99,
"payment_method": "credit_card"
}
```
---
## Implementation Examples
### Python (using structlog)
```python
import structlog
logger = structlog.get_logger()
# Configure structured logging
structlog.configure(
processors=[
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.add_log_level,
structlog.processors.JSONRenderer()
]
)
# Usage
logger.info(
"user_logged_in",
user_id="user123",
ip_address="192.168.1.1",
login_method="oauth"
)
```
### Node.js (using Winston)
```javascript
const winston = require('winston');
const logger = winston.createLogger({
format: winston.format.json(),
defaultMeta: { service: 'api-gateway' },
transports: [
new winston.transports.Console()
]
});
logger.info('User logged in', {
userId: 'user123',
ipAddress: '192.168.1.1',
loginMethod: 'oauth'
});
```
### Go (using zap)
```go
import "go.uber.org/zap"
logger, _ := zap.NewProduction()
defer logger.Sync()
logger.Info("User logged in",
zap.String("userId", "user123"),
zap.String("ipAddress", "192.168.1.1"),
zap.String("loginMethod", "oauth"),
)
```
### Java (using Logback with JSON)
```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import net.logstash.logback.argument.StructuredArguments;
Logger logger = LoggerFactory.getLogger(MyClass.class);
logger.info("User logged in",
StructuredArguments.kv("userId", "user123"),
StructuredArguments.kv("ipAddress", "192.168.1.1"),
StructuredArguments.kv("loginMethod", "oauth")
);
```
---
## Log Aggregation Patterns
### Pattern 1: ELK Stack (Elasticsearch, Logstash, Kibana)
**Architecture**:
```
Application → Filebeat → Logstash → Elasticsearch → Kibana
```
**filebeat.yml**:
```yaml
filebeat.inputs:
- type: log
enabled: true
paths:
- /var/log/app/*.log
json.keys_under_root: true
json.add_error_key: true
output.logstash:
hosts: ["logstash:5044"]
```
**logstash.conf**:
```
input {
beats {
port => 5044
}
}
filter {
json {
source => "message"
}
date {
match => ["timestamp", "ISO8601"]
}
grok {
match => { "message" => "%{COMBINEDAPACHELOG}" }
}
}
output {
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "app-logs-%{+YYYY.MM.dd}"
}
}
```
### Pattern 2: Loki (Grafana Loki)
**Architecture**:
```
Application → Promtail → Loki → Grafana
```
**promtail-config.yml**:
```yaml
server:
http_listen_port: 9080
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: app
static_configs:
- targets:
- localhost
labels:
job: app
__path__: /var/log/app/*.log
pipeline_stages:
- json:
expressions:
level: level
timestamp: timestamp
- labels:
level:
service:
- timestamp:
source: timestamp
format: RFC3339
```
**Query in Grafana**:
```logql
{job="app"} |= "error" | json | level="error"
```
### Pattern 3: CloudWatch Logs
**Install CloudWatch agent**:
```json
{
"logs": {
"logs_collected": {
"files": {
"collect_list": [
{
"file_path": "/var/log/app/*.log",
"log_group_name": "/aws/app/production",
"log_stream_name": "{instance_id}",
"timezone": "UTC"
}
]
}
}
}
}
```
**Query with CloudWatch Insights**:
```
fields @timestamp, level, message, user_id
| filter level = "error"
| sort @timestamp desc
| limit 100
```
### Pattern 4: Fluentd/Fluent Bit
**fluent-bit.conf**:
```
[INPUT]
Name tail
Path /var/log/app/*.log
Parser json
Tag app.*
[FILTER]
Name record_modifier
Match *
Record hostname ${HOSTNAME}
Record cluster production
[OUTPUT]
Name es
Match *
Host elasticsearch
Port 9200
Index app-logs
Type _doc
```
---
## Query Patterns
### Find Errors in Time Range
**Elasticsearch**:
```json
GET /app-logs-*/_search
{
"query": {
"bool": {
"must": [
{ "match": { "level": "error" } },
{ "range": { "@timestamp": {
"gte": "now-1h",
"lte": "now"
}}}
]
}
}
}
```
**Loki (LogQL)**:
```logql
{job="app", level="error"} |= "error"
```
**CloudWatch Insights**:
```
fields @timestamp, @message
| filter level = "error"
| filter @timestamp > ago(1h)
```
### Count Errors by Type
**Elasticsearch**:
```json
GET /app-logs-*/_search
{
"size": 0,
"query": { "match": { "level": "error" } },
"aggs": {
"error_types": {
"terms": { "field": "error_type.keyword" }
}
}
}
```
**Loki**:
```logql
sum by (error_type) (count_over_time({job="app", level="error"}[1h]))
```
### Find Slow Requests
**Elasticsearch**:
```json
GET /app-logs-*/_search
{
"query": {
"range": { "duration_ms": { "gte": 1000 } }
},
"sort": [ { "duration_ms": "desc" } ]
}
```
### Trace Request Through Services
**Elasticsearch** (using request_id):
```json
GET /_search
{
"query": {
"match": { "request_id": "550e8400-e29b-41d4-a716-446655440000" }
},
"sort": [ { "@timestamp": "asc" } ]
}
```
---
## Sampling and Rate Limiting
### When to Sample
- **High volume services**: > 10,000 logs/second
- **Debug logs in production**: Sample 1-10%
- **Cost optimization**: Reduce storage costs
### Sampling Strategies
**1. Random Sampling**:
```python
import random
if random.random() < 0.1: # Sample 10%
logger.debug("Debug message", ...)
```
**2. Rate Limiting**:
```python
from rate_limiter import RateLimiter
limiter = RateLimiter(max_per_second=100)
if limiter.allow():
logger.info("Rate limited log", ...)
```
**3. Error-Biased Sampling**:
```python
# Always log errors, sample successful requests
if level == "error" or random.random() < 0.01:
logger.log(level, message, ...)
```
**4. Head-Based Sampling** (trace-aware):
```python
# If trace is sampled, log all related logs
if trace_context.is_sampled():
logger.info("Traced log", trace_id=trace_context.trace_id)
```
---
## Log Retention
### Retention Strategy
**Hot tier** (fast SSD): 7-30 days
- Recent logs
- Full query performance
- High cost
**Warm tier** (regular disk): 30-90 days
- Older logs
- Slower queries acceptable
- Medium cost
**Cold tier** (object storage): 90+ days
- Archive logs
- Query via restore
- Low cost
### Example: Elasticsearch ILM Policy
```json
{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": {
"max_size": "50GB",
"max_age": "1d"
}
}
},
"warm": {
"min_age": "7d",
"actions": {
"allocate": { "number_of_replicas": 1 },
"shrink": { "number_of_shards": 1 }
}
},
"cold": {
"min_age": "30d",
"actions": {
"allocate": { "require": { "box_type": "cold" } }
}
},
"delete": {
"min_age": "90d",
"actions": {
"delete": {}
}
}
}
}
}
```
---
## Security and Compliance
### PII Redaction
**Before logging**:
```python
import re
def redact_pii(data):
# Redact email
data = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b',
'[EMAIL]', data)
# Redact credit card
data = re.sub(r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',
'[CARD]', data)
# Redact SSN
data = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', data)
return data
logger.info("User data", user_input=redact_pii(user_input))
```
**In Logstash**:
```
filter {
mutate {
gsub => [
"message", "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b", "[EMAIL]",
"message", "\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b", "[CARD]"
]
}
}
```
### Access Control
**Elasticsearch** (with Security):
```yaml
# Role for developers
dev_logs:
indices:
- names: ['app-logs-*']
privileges: ['read']
query: '{"match": {"environment": "development"}}'
```
**CloudWatch** (IAM Policy):
```json
{
"Effect": "Allow",
"Action": [
"logs:DescribeLogGroups",
"logs:GetLogEvents",
"logs:FilterLogEvents"
],
"Resource": "arn:aws:logs:*:*:log-group:/aws/app/production:*"
}
```
---
## Common Pitfalls
### 1. Logging Sensitive Data
`logger.info("Login", password=password)`
`logger.info("Login", user_id=user_id)`
### 2. Excessive Logging
❌ Logging every iteration of a loop
✅ Log aggregate results or sample
### 3. Not Including Context
`logger.error("Failed")`
`logger.error("Payment failed", order_id=order_id, error=str(e))`
### 4. Inconsistent Formats
❌ Mix of JSON and plain text
✅ Pick one format and stick to it
### 5. No Request IDs
❌ Can't trace request across services
✅ Generate and propagate a request_id (see the sketch after this list)
### 6. Logging to Multiple Places
❌ Log to file AND stdout AND syslog
✅ Log to stdout, let agent handle routing
### 7. Blocking on Log Writes
❌ Synchronous writes to remote systems
✅ Asynchronous buffered writes
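For pitfall 5, a minimal sketch of generating and propagating a request ID using only the standard library (contextvars plus a logging filter; the format string and handler are just examples):
```python
import contextvars
import logging
import uuid

# Generate a request_id at the edge and attach it to every log record via a filter,
# so code deeper in the call stack does not need to pass it around explicitly.
request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

logging.basicConfig(format="%(asctime)s %(levelname)s request_id=%(request_id)s %(message)s")
logger = logging.getLogger(__name__)
logger.addFilter(RequestIdFilter())

def handle_request():
    request_id_var.set(str(uuid.uuid4()))  # or reuse an incoming X-Request-ID header
    logger.warning("Payment failed")       # request_id is attached automatically

handle_request()
```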
---
## Performance Optimization
### 1. Async Logging
```python
import logging
from logging.handlers import QueueHandler, QueueListener
import queue

# Create queue and route log records through it (non-blocking for callers)
log_queue = queue.Queue()
logger = logging.getLogger()
logger.addHandler(QueueHandler(log_queue))

# A background thread drains the queue and writes via the real handler(s)
listener = QueueListener(log_queue, logging.StreamHandler())
listener.start()
```
### 2. Conditional Logging
```python
# Avoid expensive operations if not logging
if logger.isEnabledFor(logging.DEBUG):
logger.debug("Details", data=expensive_serialization(obj))
```
### 3. Batching
```python
# Batch logs before sending
batch = []
for log in logs:
    batch.append(log)
    if len(batch) >= 100:
        send_to_aggregator(batch)
        batch = []
# Send any remaining partial batch
if batch:
    send_to_aggregator(batch)
```
### 4. Compression
```yaml
# Filebeat with compression
output.logstash:
hosts: ["logstash:5044"]
compression_level: 3
```
---
## Monitoring Log Pipeline
Track pipeline health with metrics:
```promql
# Log ingestion rate
rate(logs_ingested_total[5m])
# Pipeline lag
log_processing_lag_seconds
# Dropped logs
rate(logs_dropped_total[5m])
# Error parsing rate
rate(logs_parse_errors_total[5m])
```
Alert on:
- Sudden drop in log volume (service down?)
- High parse error rate (format changed?)
- Pipeline lag > 1 minute (capacity issue?)

# Metrics Design Guide
## The Four Golden Signals
The Four Golden Signals from Google's SRE book provide a comprehensive view of system health:
### 1. Latency
**What**: Time to service a request
**Why Monitor**: Directly impacts user experience
**Key Metrics**:
- Request duration (p50, p95, p99, p99.9)
- Time to first byte (TTFB)
- Backend processing time
- Database query latency
**PromQL Examples**:
```promql
# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Average latency by endpoint
avg(rate(http_request_duration_seconds_sum[5m])) by (endpoint)
/
avg(rate(http_request_duration_seconds_count[5m])) by (endpoint)
```
**Alert Thresholds**:
- Warning: p95 > 500ms
- Critical: p99 > 2s
### 2. Traffic
**What**: Demand on your system
**Why Monitor**: Understand load patterns, capacity planning
**Key Metrics**:
- Requests per second (RPS)
- Transactions per second (TPS)
- Concurrent connections
- Network throughput
**PromQL Examples**:
```promql
# Requests per second
sum(rate(http_requests_total[5m]))
# Requests per second by status code
sum(rate(http_requests_total[5m])) by (status)
# Traffic growth rate (week over week)
sum(rate(http_requests_total[5m]))
/
sum(rate(http_requests_total[5m] offset 7d))
```
**Alert Thresholds**:
- Warning: RPS > 80% of capacity
- Critical: RPS > 95% of capacity
### 3. Errors
**What**: Rate of requests that fail
**Why Monitor**: Direct indicator of user-facing problems
**Key Metrics**:
- Error rate (%)
- 5xx response codes
- Failed transactions
- Exception counts
**PromQL Examples**:
```promql
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# Error count by type
sum(rate(http_requests_total{status=~"5.."}[5m])) by (status)
# Application errors
rate(application_errors_total[5m])
```
**Alert Thresholds**:
- Warning: Error rate > 1%
- Critical: Error rate > 5%
### 4. Saturation
**What**: How "full" your service is
**Why Monitor**: Predict capacity issues before they impact users
**Key Metrics**:
- CPU utilization
- Memory utilization
- Disk I/O
- Network bandwidth
- Queue depth
- Thread pool usage
**PromQL Examples**:
```promql
# CPU saturation
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory saturation
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk saturation
rate(node_disk_io_time_seconds_total[5m]) * 100
# Queue depth
queue_depth_current / queue_depth_max * 100
```
**Alert Thresholds**:
- Warning: > 70% utilization
- Critical: > 90% utilization
---
## RED Method (for Services)
**R**ate, **E**rrors, **D**uration - a simplified approach for request-driven services
### Rate
Number of requests per second:
```promql
sum(rate(http_requests_total[5m]))
```
### Errors
Number of failed requests per second:
```promql
sum(rate(http_requests_total{status=~"5.."}[5m]))
```
### Duration
Time taken to process requests:
```promql
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
**When to Use**: Microservices, APIs, web applications
---
## USE Method (for Resources)
**U**tilization, **S**aturation, **E**rrors - for infrastructure resources
### Utilization
Percentage of time resource is busy:
```promql
# CPU utilization
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Disk utilization
(node_filesystem_size_bytes - node_filesystem_avail_bytes)
/ node_filesystem_size_bytes * 100
```
### Saturation
Amount of work the resource cannot service (queued):
```promql
# Load average (saturation indicator)
node_load15
# Disk I/O wait time
rate(node_disk_io_time_weighted_seconds_total[5m])
```
### Errors
Count of error events:
```promql
# Network errors
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])
# Disk errors
rate(node_disk_io_errors_total[5m])
```
**When to Use**: Servers, databases, network devices
---
## Metric Types
### Counter
Monotonically increasing value (never decreases)
**Examples**: Request count, error count, bytes sent
**Usage**:
```promql
# Always use rate() or increase() with counters
rate(http_requests_total[5m]) # Requests per second
increase(http_requests_total[1h]) # Total requests in 1 hour
```
### Gauge
Value that can go up and down
**Examples**: Memory usage, queue depth, concurrent connections
**Usage**:
```promql
# Use directly or with aggregations
avg(memory_usage_bytes)
max(queue_depth)
```
### Histogram
Samples observations and counts them in configurable buckets
**Examples**: Request duration, response size
**Usage**:
```promql
# Calculate percentiles
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Average from histogram
rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])
```
### Summary
Similar to histogram but calculates quantiles on client side
**Usage**: Less flexible than histograms, avoid for new metrics
---
## Cardinality Best Practices
**Cardinality**: Number of unique time series
### High Cardinality Labels (AVOID)
❌ User ID
❌ Email address
❌ IP address
❌ Timestamp
❌ Random IDs
### Low Cardinality Labels (GOOD)
✅ Environment (prod, staging)
✅ Region (us-east-1, eu-west-1)
✅ Service name
✅ HTTP status code category (2xx, 4xx, 5xx)
✅ Endpoint/route
### Calculating Cardinality Impact
```
Time series = unique combinations of labels
Example:
service (5) × environment (3) × region (4) × status (5) = 300 time series ✅
service (5) × environment (3) × region (4) × user_id (1M) = 60M time series ❌
```
---
## Naming Conventions
### Prometheus Naming
```
<namespace>_<name>_<unit>_total
Examples:
http_requests_total
http_request_duration_seconds
process_cpu_seconds_total
node_memory_MemAvailable_bytes
```
**Rules**:
- Use snake_case
- Include unit in name (seconds, bytes, ratio)
- Use `_total` suffix for counters
- Namespace by application/component
### CloudWatch Naming
```
<Namespace>/<MetricName>
Examples:
AWS/EC2/CPUUtilization
MyApp/RequestCount
```
**Rules**:
- Use PascalCase
- Group by namespace
- No unit in name (specified separately)
---
## Dashboard Design
### Key Principles
1. **Top-Down Layout**: Most important metrics first
2. **Color Coding**: Red (critical), yellow (warning), green (healthy)
3. **Consistent Time Windows**: All panels use same time range
4. **Limit Panels**: 8-12 panels per dashboard maximum
5. **Include Context**: Show related metrics together
### Dashboard Structure
```
┌─────────────────────────────────────────────┐
│ Overall Health (Single Stats) │
│ [Requests/s] [Error%] [P95 Latency] │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Request Rate & Errors (Graphs) │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Latency Distribution (Graphs) │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Resource Usage (Graphs) │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Dependencies (Graphs) │
└─────────────────────────────────────────────┘
```
### Template Variables
Use variables for filtering:
- Environment: `$environment`
- Service: `$service`
- Region: `$region`
- Pod: `$pod`
---
## Common Pitfalls
### 1. Monitoring What You Build, Not What Users Experience
❌ `backend_processing_complete`
✅ `user_request_completed`
### 2. Too Many Metrics
- Start with Four Golden Signals
- Add metrics only when needed for specific issues
- Remove unused metrics
### 3. Incorrect Aggregations
❌ `avg(rate(errors_total[5m]) / rate(requests_total[5m]))` - averaging per-instance ratios weights idle and busy instances equally
✅ `sum(rate(errors_total[5m])) / sum(rate(requests_total[5m]))` - aggregate first, then divide
### 4. Wrong Time Windows
- Too short (< 1m): Noisy data
- Too long (> 15m): Miss short-lived issues
- Sweet spot: 5m for most alerts
### 5. Missing Labels
❌ `http_requests_total`
✅ `http_requests_total{method="GET", status="200", endpoint="/api/users"}`
---
## Metric Collection Best Practices
### Application Instrumentation
```python
from prometheus_client import Counter, Histogram, Gauge
# Counter for requests
requests_total = Counter('http_requests_total',
'Total HTTP requests',
['method', 'endpoint', 'status'])
# Histogram for latency
request_duration = Histogram('http_request_duration_seconds',
'HTTP request duration',
['method', 'endpoint'])
# Gauge for in-progress requests
requests_in_progress = Gauge('http_requests_in_progress',
'HTTP requests currently being processed')
```
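A minimal sketch of how these metrics might be updated inside a request handler (the handler itself is illustrative, not tied to any framework):
```python
import time
def handle_request(method, endpoint):
    requests_in_progress.inc()   # gauge goes up while the request is active
    start = time.time()
    status = "200"
    try:
        return "ok"              # actual request handling would go here
    except Exception:
        status = "500"
        raise
    finally:
        request_duration.labels(method=method, endpoint=endpoint) \
                        .observe(time.time() - start)
        requests_total.labels(method=method, endpoint=endpoint,
                              status=status).inc()
        requests_in_progress.dec()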
### Collection Intervals
- Application metrics: 15-30s
- Infrastructure metrics: 30-60s
- Billing/cost metrics: 5-15m
- External API checks: 1-5m
### Retention
- Raw metrics: 15-30 days
- 5m aggregates: 90 days
- 1h aggregates: 1 year
- Daily aggregates: 2+ years

---
references/slo_sla_guide.md
# SLI, SLO, and SLA Guide
## Definitions
### SLI (Service Level Indicator)
**What**: A quantitative measure of service quality
**Examples**:
- Request latency (ms)
- Error rate (%)
- Availability (%)
- Throughput (requests/sec)
### SLO (Service Level Objective)
**What**: Target value or range for an SLI
**Examples**:
- "99.9% of requests return in < 500ms"
- "99.95% availability"
- "Error rate < 0.1%"
### SLA (Service Level Agreement)
**What**: Business contract with consequences for SLO violations
**Examples**:
- "99.9% uptime or 10% monthly credit"
- "p95 latency < 1s or refund"
### Relationship
```
SLI = Measurement
SLO = Target (internal goal)
SLA = Promise (customer contract with penalties)
Example:
SLI: Actual availability this month = 99.92%
SLO: Target availability = 99.9%
SLA: Guaranteed availability = 99.5% (with penalties)
```
---
## Choosing SLIs
### The Four Golden Signals as SLIs
1. **Latency SLIs**
- Request duration (p50, p95, p99)
- Time to first byte
- Page load time
2. **Availability/Success SLIs**
- % of successful requests
- % uptime
- % of requests completing
3. **Throughput SLIs** (less common)
- Requests per second
- Transactions per second
4. **Saturation SLIs** (internal only)
- Resource utilization
- Queue depth
### SLI Selection Criteria
**Good SLIs**:
- Measured from user perspective
- Directly impact user experience
- Aggregatable across instances
- Proportional to user happiness
**Bad SLIs**:
- Internal metrics only
- Not user-facing
- Hard to measure consistently
### Examples by Service Type
**Web Application**:
```
SLI 1: Request Success Rate
= successful_requests / total_requests
SLI 2: Request Latency (p95)
= 95th percentile of response times
SLI 3: Availability
= time_service_responding / total_time
```
**API Service**:
```
SLI 1: Error Rate
= (4xx_errors + 5xx_errors) / total_requests
SLI 2: Response Time (p99)
= 99th percentile latency
SLI 3: Throughput
= requests_per_second
```
**Batch Processing**:
```
SLI 1: Job Success Rate
= successful_jobs / total_jobs
SLI 2: Processing Latency
= time_from_submission_to_completion
SLI 3: Freshness
= age_of_oldest_unprocessed_item
```
**Storage Service**:
```
SLI 1: Durability
= data_not_lost / total_data
SLI 2: Read Latency (p99)
= 99th percentile read time
SLI 3: Write Success Rate
= successful_writes / total_writes
```
---
## Setting SLO Targets
### Start with Current Performance
1. **Measure baseline**: Collect 30 days of data
2. **Analyze distribution**: Look at p50, p95, p99, p99.9 (see the sketch after this list)
3. **Set initial SLO**: Slightly better than worst performer
4. **Iterate**: Tighten or loosen based on feasibility
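A minimal sketch of the baseline analysis in step 2, assuming raw latency samples (in milliseconds) have been exported from the monitoring system to a local file (the filename is a placeholder):
```python
import numpy as np
latencies_ms = np.loadtxt("latencies_30d.txt")  # placeholder: one sample per line
for p in (50, 95, 99, 99.9):
    print(f"p{p}: {np.percentile(latencies_ms, p):.0f}ms")
```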
### Example Process
**Current Performance** (30 days):
```
p50 latency: 120ms
p95 latency: 450ms
p99 latency: 1200ms
p99.9 latency: 3500ms
Error rate: 0.05%
Availability: 99.95%
```
**Initial SLOs**:
```
Latency: p95 < 500ms (slightly worse than current p95)
Error rate: < 0.1% (double current rate)
Availability: 99.9% (slightly worse than current)
```
**Rationale**: Start loose, prevent false alarms, tighten over time
### Common SLO Targets
**Availability**:
- **99%** (3.65 days downtime/year): Internal tools
- **99.5%** (1.83 days/year): Non-critical services
- **99.9%** (8.76 hours/year): Standard production
- **99.95%** (4.38 hours/year): Critical services
- **99.99%** (52 minutes/year): High availability
- **99.999%** (5 minutes/year): Mission critical
**Latency**:
- **p50 < 100ms**: Excellent responsiveness
- **p95 < 500ms**: Standard web applications
- **p99 < 1s**: Acceptable for most users
- **p99.9 < 5s**: Acceptable for rare edge cases
**Error Rate**:
- **< 0.01%** (99.99% success): Critical operations
- **< 0.1%** (99.9% success): Standard production
- **< 1%** (99% success): Non-critical services
---
## Error Budgets
### Concept
Error budget = (100% - SLO target)
If SLO is 99.9%, error budget is 0.1%
**Purpose**: Balance reliability with feature velocity
### Calculation
**For availability**:
```
Monthly error budget = (1 - SLO) × time_period
Example (99.9% SLO, 30 days):
Error budget = 0.001 × 30 days = 0.03 days = 43.2 minutes
```
**For request-based SLIs**:
```
Error budget = (1 - SLO) × total_requests
Example (99.9% SLO, 10M requests/month):
Error budget = 0.001 × 10,000,000 = 10,000 failed requests
```
### Error Budget Consumption
**Formula**:
```
Budget consumed = actual_errors / allowed_errors × 100%
Example:
SLO: 99.9% (0.1% error budget)
Total requests: 1,000,000
Failed requests: 500
Allowed failures: 1,000
Budget consumed = 500 / 1,000 × 100% = 50%
Budget remaining = 50%
```
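The same arithmetic as a small helper, using the numbers from the example above:
```python
def error_budget(slo: float, total_requests: int, failed_requests: int):
    """Return allowed failures plus consumed/remaining budget fractions."""
    allowed = (1 - slo) * total_requests
    consumed = failed_requests / allowed
    return allowed, consumed, 1 - consumed
allowed, consumed, remaining = error_budget(0.999, 1_000_000, 500)
print(f"Allowed failures: {allowed:.0f}")   # 1000
print(f"Budget consumed:  {consumed:.0%}")  # 50%
print(f"Budget remaining: {remaining:.0%}") # 50%
```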
### Error Budget Policy
**Example policy**:
```markdown
## Error Budget Policy
### If error budget > 50%
- Deploy frequently (multiple times per day)
- Take calculated risks
- Experiment with new features
- Acceptable to have some incidents
### If error budget 20-50%
- Deploy normally (once per day)
- Increase testing
- Review recent changes
- Monitor closely
### If error budget < 20%
- Freeze non-critical deploys
- Focus on reliability improvements
- Postmortem all incidents
- Reduce change velocity
### If error budget exhausted (< 0%)
- Complete deploy freeze except rollbacks
- All hands on reliability
- Mandatory postmortems
- Executive escalation
```
---
## Error Budget Burn Rate
### Concept
Burn rate = rate of error budget consumption
**Example**:
- Monthly budget: 43.2 minutes (99.9% SLO)
- If consuming at 2x rate: Budget exhausted in 15 days
- If consuming at 10x rate: Budget exhausted in 3 days
### Burn Rate Calculation
```
Burn rate = (actual_error_rate / allowed_error_rate)
Example:
SLO: 99.9% (0.1% allowed error rate)
Current error rate: 0.5%
Burn rate = 0.5% / 0.1% = 5x
Time to exhaust = 30 days / 5 = 6 days
```
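The burn-rate arithmetic above as a small sketch:
```python
def burn_rate(current_error_rate: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    return current_error_rate / (1 - slo)
def days_to_exhaustion(rate: float, period_days: int = 30) -> float:
    return period_days / rate
rate = burn_rate(current_error_rate=0.005, slo=0.999)
print(f"Burn rate: {rate:.1f}x")                                   # 5.0x
print(f"Budget exhausted in {days_to_exhaustion(rate):.0f} days")  # 6 days
```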
### Multi-Window Alerting
Alert on burn rate across multiple time windows:
**Fast burn** (1 hour window):
```
Burn rate > 14.4x → Exhausts budget in 2 days
Alert after 2 minutes
Severity: Critical (page immediately)
```
**Moderate burn** (6 hour window):
```
Burn rate > 6x → Exhausts budget in 5 days
Alert after 30 minutes
Severity: Warning (create ticket)
```
**Slow burn** (3 day window):
```
Burn rate > 1x → Exhausts budget by end of month
Alert after 6 hours
Severity: Info (monitor)
```
### Implementation
**Prometheus**:
```yaml
# Fast burn alert (1h window, 2m grace period)
- alert: ErrorBudgetFastBurn
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
) > (14.4 * 0.001) # 14.4x burn rate for 99.9% SLO
for: 2m
labels:
severity: critical
annotations:
summary: "Fast error budget burn detected"
description: "Error budget will be exhausted in 2 days at current rate"
# Slow burn alert (6h window, 30m grace period)
- alert: ErrorBudgetSlowBurn
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[6h]))
/
sum(rate(http_requests_total[6h]))
) > (6 * 0.001) # 6x burn rate for 99.9% SLO
for: 30m
labels:
severity: warning
annotations:
summary: "Elevated error budget burn detected"
```
---
## SLO Reporting
### Dashboard Structure
**Overall Health**:
```
┌─────────────────────────────────────────┐
│ SLO Compliance: 99.92% ✅ │
│ Error Budget Remaining: 73% 🟢 │
│ Burn Rate: 0.8x 🟢 │
└─────────────────────────────────────────┘
```
**SLI Performance**:
```
Latency p95: 420ms (Target: 500ms) ✅
Error Rate: 0.08% (Target: < 0.1%) ✅
Availability: 99.95% (Target: > 99.9%) ✅
```
**Error Budget Trend**:
```
Graph showing:
- Error budget consumption over time
- Burn rate spikes
- Incidents marked
- Deploy events overlaid
```
### Monthly SLO Report
**Template**:
```markdown
# SLO Report: October 2024
## Executive Summary
- ✅ All SLOs met this month
- 🟡 Latency SLO came close to violation (99.1% compliance)
- 3 incidents consumed 47% of error budget
- Error budget remaining: 53%
## SLO Performance
### Availability SLO: 99.9%
- Actual: 99.92%
- Status: ✅ Met
- Error budget consumed: 33%
- Downtime: 23 minutes (allowed: 43.2 minutes)
### Latency SLO: p95 < 500ms
- Actual p95: 445ms
- Status: ✅ Met
- Compliance: 99.1% (target: 99%)
- 0.9% of requests exceeded threshold
### Error Rate SLO: < 0.1%
- Actual: 0.05%
- Status: ✅ Met
- Error budget consumed: 50%
## Incidents
### Incident #1: Database Overload (Oct 5)
- Duration: 15 minutes
- Error budget consumed: 35%
- Root cause: Slow query after schema change
- Prevention: Added query review to deploy checklist
### Incident #2: API Gateway Timeout (Oct 12)
- Duration: 5 minutes
- Error budget consumed: 10%
- Root cause: Configuration error in load balancer
- Prevention: Automated configuration validation
### Incident #3: Upstream Service Degradation (Oct 20)
- Duration: 3 minutes
- Error budget consumed: 2%
- Root cause: Third-party API outage
- Prevention: Implemented circuit breaker
## Recommendations
1. Investigate latency near-miss (Oct 15-17)
2. Add automated rollback for database changes
3. Increase circuit breaker thresholds for third-party APIs
4. Consider tightening availability SLO to 99.95%
## Next Month's Focus
- Reduce p95 latency to 400ms
- Implement automated canary deployments
- Add synthetic monitoring for critical paths
```
---
## SLA Structure
### Components
**Service Description**:
```
The API Service provides RESTful endpoints for user management,
authentication, and data retrieval.
```
**Covered Metrics**:
```
- Availability: Service is reachable and returns valid responses
- Latency: Time from request to response
- Error Rate: Percentage of requests returning errors
```
**SLA Targets**:
```
Service commits to:
1. 99.9% monthly uptime
2. p95 API response time < 1 second
3. Error rate < 0.5%
```
**Measurement**:
```
Metrics calculated from server-side monitoring:
- Uptime: Successful health check probes / total probes
- Latency: Server-side request duration (p95)
- Errors: HTTP 5xx responses / total responses
Calculated monthly (first of month for previous month).
```
**Exclusions**:
```
SLA does not cover:
- Scheduled maintenance (with 7 days notice)
- Client-side network issues
- DDoS attacks or force majeure
- Beta/preview features
- Issues caused by customer misuse
```
**Service Credits**:
```
Monthly Uptime | Service Credit
---------------- | --------------
< 99.9% (SLA) | 10%
< 99.0% | 25%
< 95.0% | 50%
```
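A small sketch applying the credit tiers above (the boundaries are the ones from this example SLA):
```python
def service_credit(monthly_uptime_pct: float) -> float:
    """Return the service credit percentage owed for a month."""
    if monthly_uptime_pct < 95.0:
        return 50.0
    if monthly_uptime_pct < 99.0:
        return 25.0
    if monthly_uptime_pct < 99.9:   # below the SLA commitment
        return 10.0
    return 0.0
print(service_credit(99.85))  # 10.0 -> 10% credit on next invoice
```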
**Claiming Credits**:
```
Customer must:
1. Report violation within 30 days
2. Provide ticket numbers for support requests
3. Credits applied to next month's invoice
4. Credits do not exceed monthly fee
```
### Example SLAs by Industry
**E-commerce**:
```
- 99.95% availability
- p95 page load < 2s
- p99 checkout < 5s
- Credits: 5% per 0.1% below target
```
**Financial Services**:
```
- 99.99% availability
- p99 transaction < 500ms
- Zero data loss
- Penalties: $10,000 per hour of downtime
```
**Media/Content**:
```
- 99.9% availability
- p95 video start < 3s
- No credit system (best effort latency)
```
---
## Best Practices
### 1. SLOs Should Be User-Centric
❌ "Database queries < 100ms"
✅ "API response time p95 < 500ms"
### 2. Start Loose, Tighten Over Time
- Begin with achievable targets
- Build reliability culture
- Gradually raise bar
### 3. Fewer, Better SLOs
- 1-3 SLOs per service
- Focus on user impact
- Avoid SLO sprawl
### 4. SLAs More Conservative Than SLOs
```
Internal SLO: 99.95%
Customer SLA: 99.9%
Margin: 0.05% buffer
```
### 5. Make Error Budgets Actionable
- Define policies at different thresholds
- Empower teams to make tradeoffs
- Review in planning meetings
### 6. Document Everything
- How SLIs are measured
- Why targets were chosen
- Who owns each SLO
- How to interpret metrics
### 7. Review Regularly
- Monthly SLO reviews
- Quarterly SLO adjustments
- Annual SLA renegotiation
---
## Common Pitfalls
### 1. Too Many SLOs
❌ 20 different SLOs per service
✅ 2-3 critical SLOs
### 2. Unrealistic Targets
❌ 99.999% for non-critical service
✅ 99.9% with room to improve
### 3. SLOs Without Error Budgets
❌ "Must always be 99.9%"
✅ "Budget for 0.1% errors"
### 4. No Consequences
❌ Missing SLO has no impact
✅ Deploy freeze when budget exhausted
### 5. SLA Equals SLO
❌ Promise exactly what you target
✅ SLA more conservative than SLO
### 6. Ignoring User Experience
❌ "Our servers are up 99.99%"
✅ "Users can complete actions 99.9% of the time"
### 7. Static Targets
❌ Set once, never revisit
✅ Quarterly reviews and adjustments
---
## Tools and Automation
### SLO Tracking Tools
**Prometheus + Grafana**:
- Use recording rules for SLIs
- Alert on burn rates
- Dashboard for compliance
**Google Cloud SLO Monitoring**:
- Built-in SLO tracking
- Automatic error budget calculation
- Integration with alerting
**Datadog SLOs**:
- UI for SLO definition
- Automatic burn rate alerts
- Status pages
**Custom Tools**:
- sloth: Generate Prometheus rules from SLO definitions
- slo-libsonnet: Jsonnet library for SLO monitoring
### Example: Prometheus Recording Rules
```yaml
groups:
- name: sli_recording
interval: 30s
rules:
# SLI: Request success rate
- record: sli:request_success:ratio
expr: |
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# SLI: Request latency (p95)
- record: sli:request_latency:p95
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# Error budget burn rate (1h window)
- record: slo:error_budget_burn_rate:1h
expr: |
(1 - sli:request_success:ratio) / 0.001
```

---
# Monitoring Tools Comparison
## Overview Matrix
| Tool | Type | Best For | Complexity | Cost | Cloud/Self-Hosted |
|------|------|----------|------------|------|-------------------|
| **Prometheus** | Metrics | Kubernetes, time-series | Medium | Free | Self-hosted |
| **Grafana** | Visualization | Dashboards, multi-source | Low-Medium | Free | Both |
| **Datadog** | Full-stack | Ease of use, APM | Low | High | Cloud |
| **New Relic** | Full-stack | APM, traces | Low | High | Cloud |
| **Elasticsearch (ELK)** | Logs | Log search, analysis | High | Medium | Both |
| **Grafana Loki** | Logs | Cost-effective logs | Medium | Free | Both |
| **CloudWatch** | AWS-native | AWS infrastructure | Low | Medium | Cloud |
| **Jaeger** | Tracing | Distributed tracing | Medium | Free | Self-hosted |
| **Grafana Tempo** | Tracing | Cost-effective tracing | Medium | Free | Self-hosted |
---
## Metrics Platforms
### Prometheus
**Type**: Open-source time-series database
**Strengths**:
- ✅ Industry standard for Kubernetes
- ✅ Powerful query language (PromQL)
- ✅ Pull-based model (no agent config)
- ✅ Service discovery
- ✅ Free and open source
- ✅ Huge ecosystem (exporters for everything)
**Weaknesses**:
- ❌ No built-in dashboards (need Grafana)
- ❌ Single-node only (no HA without federation)
- ❌ Limited long-term storage (need Thanos/Cortex)
- ❌ Steep learning curve for PromQL
**Best For**:
- Kubernetes monitoring
- Infrastructure metrics
- Custom application metrics
- Organizations that need control
**Pricing**: Free (open source)
**Setup Complexity**: Medium
**Example**:
```yaml
# prometheus.yml
scrape_configs:
- job_name: 'app'
static_configs:
- targets: ['localhost:8080']
```
---
### Datadog
**Type**: SaaS monitoring platform
**Strengths**:
- ✅ Easy to set up (install agent, done)
- ✅ Beautiful pre-built dashboards
- ✅ APM, logs, metrics, traces in one platform
- ✅ Great anomaly detection
- ✅ Excellent integrations (500+)
- ✅ Good mobile app
**Weaknesses**:
- ❌ Very expensive at scale
- ❌ Vendor lock-in
- ❌ Cost can be unpredictable (per-host pricing)
- ❌ Limited PromQL support
**Best For**:
- Teams that want quick setup
- Companies prioritizing ease of use over cost
- Organizations needing full observability
**Pricing**: $15-$31/host/month + custom metrics fees
**Setup Complexity**: Low
**Example**:
```bash
# Install agent
DD_API_KEY=xxx bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh)"
```
---
### New Relic
**Type**: SaaS application performance monitoring
**Strengths**:
- ✅ Excellent APM capabilities
- ✅ User-friendly interface
- ✅ Good transaction tracing
- ✅ Comprehensive alerting
- ✅ Generous free tier
**Weaknesses**:
- ❌ Can get expensive at scale
- ❌ Vendor lock-in
- ❌ Query language less powerful than PromQL
- ❌ Limited customization
**Best For**:
- Application performance monitoring
- Teams focused on APM over infrastructure
- Startups (free tier is generous)
**Pricing**: Free up to 100GB/month, then $0.30/GB
**Setup Complexity**: Low
**Example**:
```python
import newrelic.agent
newrelic.agent.initialize('newrelic.ini')
```
---
### CloudWatch
**Type**: AWS-native monitoring
**Strengths**:
- ✅ Zero setup for AWS services
- ✅ Native integration with AWS
- ✅ Automatic dashboards for AWS resources
- ✅ Tightly integrated with other AWS services
- ✅ Good for cost if already on AWS
**Weaknesses**:
- ❌ AWS-only (not multi-cloud)
- ❌ Limited query capabilities
- ❌ High costs for custom metrics
- ❌ Basic visualization
- ❌ 1-minute minimum resolution
**Best For**:
- AWS-centric infrastructure
- Quick setup for AWS services
- Organizations already invested in AWS
**Pricing**:
- First 10 custom metrics: Free
- Additional: $0.30/metric/month
- API calls: $0.01/1000 requests
**Setup Complexity**: Low (for AWS), Medium (for custom metrics)
**Example**:
```python
import boto3
cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_data(
Namespace='MyApp',
MetricData=[{'MetricName': 'RequestCount', 'Value': 1}]
)
```
---
### Grafana Cloud / Mimir
**Type**: Managed Prometheus-compatible
**Strengths**:
- ✅ Prometheus-compatible (PromQL)
- ✅ Managed service (no ops burden)
- ✅ Good cost model (pay for what you use)
- ✅ Grafana dashboards included
- ✅ Long-term storage
**Weaknesses**:
- ❌ Relatively new (less mature)
- ❌ Some Prometheus features missing
- ❌ Requires Grafana for visualization
**Best For**:
- Teams wanting Prometheus without ops overhead
- Multi-cloud environments
- Organizations already using Grafana
**Pricing**: $8/month + $0.29/1M samples
**Setup Complexity**: Low-Medium
---
## Logging Platforms
### Elasticsearch (ELK Stack)
**Type**: Open-source log search and analytics
**Full Stack**: Elasticsearch + Logstash + Kibana
**Strengths**:
- ✅ Powerful search capabilities
- ✅ Rich query language
- ✅ Great for log analysis
- ✅ Mature ecosystem
- ✅ Can handle large volumes
- ✅ Flexible data model
**Weaknesses**:
- ❌ Complex to operate
- ❌ Resource intensive (RAM hungry)
- ❌ Expensive at scale
- ❌ Requires dedicated ops team
- ❌ Slow for high-cardinality queries
**Best For**:
- Large organizations with ops teams
- Deep log analysis needs
- Search-heavy use cases
**Pricing**: Free (open source) + infrastructure costs
**Infrastructure**: ~$500-2000/month for medium scale
**Setup Complexity**: High
**Example**:
```json
PUT /logs-2024.10/_doc/1
{
"timestamp": "2024-10-28T14:32:15Z",
"level": "error",
"message": "Payment failed"
}
```
---
### Grafana Loki
**Type**: Log aggregation system
**Strengths**:
- ✅ Cost-effective (labels only, not full-text indexing)
- ✅ Easy to operate
- ✅ Prometheus-like label model
- ✅ Great Grafana integration
- ✅ Low resource usage
- ✅ Fast time-range queries
**Weaknesses**:
- ❌ Limited full-text search
- ❌ Requires careful label design
- ❌ Younger ecosystem than ELK
- ❌ Not ideal for complex queries
**Best For**:
- Cost-conscious organizations
- Kubernetes environments
- Teams already using Prometheus
- Time-series log queries
**Pricing**: Free (open source) + infrastructure costs
**Infrastructure**: ~$100-500/month for medium scale
**Setup Complexity**: Medium
**Example**:
```logql
{job="api", environment="prod"} |= "error" | json | level="error"
```
---
### Splunk
**Type**: Enterprise log management
**Strengths**:
- ✅ Extremely powerful search
- ✅ Great for security/compliance
- ✅ Mature platform
- ✅ Enterprise support
- ✅ Machine learning features
**Weaknesses**:
- ❌ Very expensive
- ❌ Complex pricing (per GB ingested)
- ❌ Steep learning curve
- ❌ Heavy resource usage
**Best For**:
- Large enterprises
- Security operations centers (SOCs)
- Compliance-heavy industries
**Pricing**: $150-$1800/GB/month (depending on tier)
**Setup Complexity**: Medium-High
---
### CloudWatch Logs
**Type**: AWS-native log management
**Strengths**:
- ✅ Zero setup for AWS services
- ✅ Integrated with AWS ecosystem
- ✅ CloudWatch Insights for queries
- ✅ Reasonable cost for low volume
**Weaknesses**:
- ❌ AWS-only
- ❌ Limited query capabilities
- ❌ Expensive at high volume
- ❌ Basic visualization
**Best For**:
- AWS-centric applications
- Low-volume logging
- Simple log aggregation
**Pricing**: Tiered (as of May 2025)
- Vended Logs: $0.50/GB (first 10TB), $0.25/GB (next 20TB), then lower tiers
- Standard logs: $0.50/GB flat
- Storage: $0.03/GB
**Setup Complexity**: Low (AWS), Medium (custom)
---
### Sumo Logic
**Type**: SaaS log management
**Strengths**:
- ✅ Easy to use
- ✅ Good for cloud-native apps
- ✅ Real-time analytics
- ✅ Good compliance features
**Weaknesses**:
- ❌ Expensive at scale
- ❌ Vendor lock-in
- ❌ Limited customization
**Best For**:
- Cloud-native applications
- Teams wanting managed solution
- Security and compliance use cases
**Pricing**: $90-$180/GB/month
**Setup Complexity**: Low
---
## Tracing Platforms
### Jaeger
**Type**: Open-source distributed tracing
**Strengths**:
- ✅ Industry standard
- ✅ CNCF graduated project
- ✅ Supports OpenTelemetry
- ✅ Good UI
- ✅ Free and open source
**Weaknesses**:
- ❌ Requires separate storage backend
- ❌ Limited query capabilities
- ❌ No built-in analytics
**Best For**:
- Microservices architectures
- Kubernetes environments
- OpenTelemetry users
**Pricing**: Free (open source) + storage costs
**Setup Complexity**: Medium
---
### Grafana Tempo
**Type**: Open-source distributed tracing
**Strengths**:
- ✅ Cost-effective (object storage)
- ✅ Easy to operate
- ✅ Great Grafana integration
- ✅ TraceQL query language
- ✅ Supports OpenTelemetry
**Weaknesses**:
- ❌ Younger than Jaeger
- ❌ Limited third-party integrations
- ❌ Requires Grafana for UI
**Best For**:
- Cost-conscious organizations
- Teams using Grafana stack
- High trace volumes
**Pricing**: Free (open source) + storage costs
**Setup Complexity**: Medium
---
### Datadog APM
**Type**: SaaS application performance monitoring
**Strengths**:
- ✅ Easy to set up
- ✅ Excellent trace visualization
- ✅ Integrated with metrics/logs
- ✅ Automatic service map
- ✅ Good profiling features
**Weaknesses**:
- ❌ Expensive ($31/host/month)
- ❌ Vendor lock-in
- ❌ Limited sampling control
**Best For**:
- Teams wanting ease of use
- Organizations already using Datadog
- Complex microservices
**Pricing**: $31/host/month + $1.70/million spans
**Setup Complexity**: Low
---
### AWS X-Ray
**Type**: AWS-native distributed tracing
**Strengths**:
- ✅ Native AWS integration
- ✅ Automatic instrumentation for AWS services
- ✅ Low cost
**Weaknesses**:
- ❌ AWS-only
- ❌ Basic UI
- ❌ Limited query capabilities
**Best For**:
- AWS-centric applications
- Serverless architectures (Lambda)
- Cost-sensitive projects
**Pricing**: $5/million traces, first 100k free/month
**Setup Complexity**: Low (AWS), Medium (custom)
---
## Full-Stack Observability
### Datadog (Full Platform)
**Components**: Metrics, logs, traces, RUM, synthetics
**Strengths**:
- ✅ Everything in one platform
- ✅ Excellent user experience
- ✅ Correlation across signals
- ✅ Great for teams
**Weaknesses**:
- ❌ Very expensive ($50-100+/host/month)
- ❌ Vendor lock-in
- ❌ Unpredictable costs
**Total Cost** (example 100 hosts):
- Infrastructure: $3,100/month
- APM: $3,100/month
- Logs: ~$2,000/month
- **Total: ~$8,000/month**
---
### Grafana Stack (LGTM)
**Components**: Loki (logs), Grafana (viz), Tempo (traces), Mimir/Prometheus (metrics)
**Strengths**:
- ✅ Open source and cost-effective
- ✅ Unified visualization
- ✅ Prometheus-compatible
- ✅ Great for cloud-native
**Weaknesses**:
- ❌ Requires self-hosting or Grafana Cloud
- ❌ More ops burden
- ❌ Less polished than commercial tools
**Total Cost** (self-hosted, 100 hosts):
- Infrastructure: ~$1,500/month
- Ops time: Variable
- **Total: ~$1,500-3,000/month**
---
### Elastic Observability
**Components**: Elasticsearch (logs), Kibana (viz), APM, metrics
**Strengths**:
- ✅ Powerful search
- ✅ Mature platform
- ✅ Good for log-heavy use cases
**Weaknesses**:
- ❌ Complex to operate
- ❌ Expensive infrastructure
- ❌ Resource intensive
**Total Cost** (self-hosted, 100 hosts):
- Infrastructure: ~$3,000-5,000/month
- Ops time: High
- **Total: ~$4,000-7,000/month**
---
### New Relic One
**Components**: Metrics, logs, traces, synthetics
**Strengths**:
- ✅ Generous free tier (100GB)
- ✅ User-friendly
- ✅ Good for startups
**Weaknesses**:
- ❌ Costs increase quickly after free tier
- ❌ Vendor lock-in
**Total Cost**:
- Free: up to 100GB/month
- Paid: $0.30/GB beyond 100GB
---
## Cloud Provider Native
### AWS (CloudWatch + X-Ray)
**Use When**:
- Primarily on AWS
- Simple monitoring needs
- Want minimal setup
**Avoid When**:
- Multi-cloud environment
- Need advanced features
- High log volume (expensive)
**Cost** (example):
- 100 EC2 instances with basic metrics: ~$150/month
- 1TB logs: ~$500/month ingestion + storage
- X-Ray: ~$50/month
---
### GCP (Cloud Monitoring + Cloud Trace)
**Use When**:
- Primarily on GCP
- Using GKE
- Want tight GCP integration
**Avoid When**:
- Multi-cloud environment
- Need advanced querying
**Cost** (example):
- First 150MB/month per resource: Free
- Additional: $0.2508/MB
---
### Azure (Azure Monitor)
**Use When**:
- Primarily on Azure
- Using AKS
- Need Azure integration
**Avoid When**:
- Multi-cloud
- Need advanced features
**Cost** (example):
- First 5GB: Free
- Additional: $2.76/GB
---
## Decision Matrix
### Choose Prometheus + Grafana If:
- ✅ Using Kubernetes
- ✅ Want control and customization
- ✅ Have ops capacity
- ✅ Budget-conscious
- ✅ Need Prometheus ecosystem
### Choose Datadog If:
- ✅ Want ease of use
- ✅ Need full observability now
- ✅ Budget allows ($8k+/month for 100 hosts)
- ✅ Limited ops team
- ✅ Need excellent UX
### Choose ELK If:
- ✅ Heavy log analysis needs
- ✅ Need powerful search
- ✅ Have dedicated ops team
- ✅ Compliance requirements
- ✅ Willing to invest in infrastructure
### Choose Grafana Stack (LGTM) If:
- ✅ Want open source full stack
- ✅ Cost-effective solution
- ✅ Cloud-native architecture
- ✅ Already using Prometheus
- ✅ Have some ops capacity
### Choose New Relic If:
- ✅ Startup with free tier
- ✅ APM is priority
- ✅ Want easy setup
- ✅ Don't need heavy customization
### Choose Cloud Native (CloudWatch/etc) If:
- ✅ Single cloud provider
- ✅ Simple needs
- ✅ Want minimal setup
- ✅ Low to medium scale
---
## Cost Comparison
**Example: 100 hosts, 1TB logs/month, 1M spans/day**
| Solution | Monthly Cost | Setup | Ops Burden |
|----------|-------------|--------|------------|
| **Prometheus + Loki + Tempo** | $1,500 | Medium | Medium |
| **Grafana Cloud** | $3,000 | Low | Low |
| **Datadog** | $8,000 | Low | None |
| **New Relic** | $3,500 | Low | None |
| **ELK Stack** | $4,000 | High | High |
| **CloudWatch** | $2,000 | Low | Low |
---
## Recommendations by Company Size
### Startup (< 10 engineers)
**Recommendation**: New Relic or Grafana Cloud
- Minimal ops burden
- Good free tiers
- Easy to get started
### Small Company (10-50 engineers)
**Recommendation**: Prometheus + Grafana + Loki (self-hosted or cloud)
- Cost-effective
- Growing ops capacity
- Flexibility
### Medium Company (50-200 engineers)
**Recommendation**: Datadog or Grafana Stack
- Datadog if budget allows
- Grafana Stack if cost-conscious
### Large Enterprise (200+ engineers)
**Recommendation**: Build observability platform
- Mix of tools based on needs
- Dedicated observability team
- Custom integrations

---
references/tracing_guide.md
# Distributed Tracing Guide
## What is Distributed Tracing?
Distributed tracing tracks a request as it flows through multiple services in a distributed system.
### Key Concepts
**Trace**: End-to-end journey of a request
**Span**: Single operation within a trace
**Context**: Metadata propagated between services (trace_id, span_id)
### Example Flow
```
User Request → API Gateway → Auth Service → User Service → Database
[Trace ID: abc123]
  Span 1: gateway       (50ms)
  Span 2: auth          (20ms)
  Span 3: user_service  (100ms)
  Span 4: db_query      (80ms)
Total: 250ms, with a waterfall view showing the dependencies
```
---
## OpenTelemetry (OTel)
OpenTelemetry is the industry standard for instrumentation.
### Components
**API**: Instrument code (create spans, add attributes)
**SDK**: Implement API, configure exporters
**Collector**: Receive, process, and export telemetry data
**Exporters**: Send data to backends (Jaeger, Tempo, Zipkin)
### Architecture
```
Application → OTel SDK → OTel Collector → Backend (Jaeger/Tempo) → Visualization (Grafana / Jaeger UI)
```
---
## Instrumentation Examples
### Python (using OpenTelemetry)
**Setup**:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
# Setup tracer
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
# Configure exporter
otlp_exporter = OTLPSpanExporter(endpoint="localhost:4317")
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)
```
**Manual instrumentation**:
```python
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
@tracer.start_as_current_span("process_order")
def process_order(order_id):
span = trace.get_current_span()
span.set_attribute("order.id", order_id)
span.set_attribute("order.amount", 99.99)
try:
result = payment_service.charge(order_id)
span.set_attribute("payment.status", "success")
return result
except Exception as e:
span.set_status(trace.Status(trace.StatusCode.ERROR))
span.record_exception(e)
raise
```
**Auto-instrumentation** (Flask example):
```python
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
# Auto-instrument Flask
FlaskInstrumentor().instrument_app(app)
# Auto-instrument requests library
RequestsInstrumentor().instrument()
# Auto-instrument SQLAlchemy
SQLAlchemyInstrumentor().instrument(engine=db.engine)
```
### Node.js (using OpenTelemetry)
**Setup**:
```javascript
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
// Setup provider
const provider = new NodeTracerProvider();
const exporter = new OTLPTraceExporter({ url: 'localhost:4317' });
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();
```
**Manual instrumentation**:
```javascript
const { SpanStatusCode } = require('@opentelemetry/api');
const tracer = provider.getTracer('my-service');
async function processOrder(orderId) {
const span = tracer.startSpan('process_order');
span.setAttribute('order.id', orderId);
try {
const result = await paymentService.charge(orderId);
span.setAttribute('payment.status', 'success');
return result;
} catch (error) {
span.setStatus({ code: SpanStatusCode.ERROR });
span.recordException(error);
throw error;
} finally {
span.end();
}
}
```
**Auto-instrumentation**:
```javascript
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const { MongoDBInstrumentation } = require('@opentelemetry/instrumentation-mongodb');
registerInstrumentations({
instrumentations: [
new HttpInstrumentation(),
new ExpressInstrumentation(),
new MongoDBInstrumentation()
]
});
```
### Go (using OpenTelemetry)
**Setup**:
```go
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/trace"
)
func initTracer() {
exporter, _ := otlptracegrpc.New(context.Background())
tp := trace.NewTracerProvider(
trace.WithBatcher(exporter),
)
otel.SetTracerProvider(tp)
}
```
**Manual instrumentation**:
```go
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)
func processOrder(ctx context.Context, orderID string) error {
tracer := otel.Tracer("my-service")
ctx, span := tracer.Start(ctx, "process_order")
defer span.End()
span.SetAttributes(
attribute.String("order.id", orderID),
attribute.Float64("order.amount", 99.99),
)
err := paymentService.Charge(ctx, orderID)
if err != nil {
span.RecordError(err)
return err
}
span.SetAttributes(attribute.String("payment.status", "success"))
return nil
}
```
---
## Span Attributes
### Semantic Conventions
Follow OpenTelemetry semantic conventions for consistency:
**HTTP**:
```python
span.set_attribute("http.method", "GET")
span.set_attribute("http.url", "https://api.example.com/users")
span.set_attribute("http.status_code", 200)
span.set_attribute("http.user_agent", "Mozilla/5.0...")
```
**Database**:
```python
span.set_attribute("db.system", "postgresql")
span.set_attribute("db.name", "users_db")
span.set_attribute("db.statement", "SELECT * FROM users WHERE id = ?")
span.set_attribute("db.operation", "SELECT")
```
**RPC/gRPC**:
```python
span.set_attribute("rpc.system", "grpc")
span.set_attribute("rpc.service", "UserService")
span.set_attribute("rpc.method", "GetUser")
span.set_attribute("rpc.grpc.status_code", 0)
```
**Messaging**:
```python
span.set_attribute("messaging.system", "kafka")
span.set_attribute("messaging.destination", "user-events")
span.set_attribute("messaging.operation", "publish")
span.set_attribute("messaging.message_id", "msg123")
```
### Custom Attributes
Add business context:
```python
span.set_attribute("user.id", "user123")
span.set_attribute("order.id", "ORD-456")
span.set_attribute("feature.flag.checkout_v2", True)
span.set_attribute("cache.hit", False)
```
---
## Context Propagation
### W3C Trace Context (Standard)
Headers propagated between services:
```
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: vendor1=value1,vendor2=value2
```
**Format**: `version-trace_id-parent_span_id-trace_flags`
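A minimal sketch of parsing a `traceparent` header by hand; in practice the OpenTelemetry propagators shown below do this for you:
```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields."""
    version, trace_id, parent_span_id, trace_flags = header.split("-")
    return {
        "version": version,                 # "00"
        "trace_id": trace_id,               # 32 hex characters
        "parent_span_id": parent_span_id,   # 16 hex characters
        "sampled": int(trace_flags, 16) & 0x01 == 1,
    }
print(parse_traceparent(
    "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"))
```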
### Implementation
**Python**:
```python
from opentelemetry.propagate import inject, extract
import requests
# Inject context into outgoing request
headers = {}
inject(headers)
requests.get("https://api.example.com", headers=headers)
# Extract context from incoming request
from flask import request
ctx = extract(request.headers)
```
**Node.js**:
```javascript
const { propagation } = require('@opentelemetry/api');
// Inject
const headers = {};
propagation.inject(context.active(), headers);
axios.get('https://api.example.com', { headers });
// Extract
const ctx = propagation.extract(context.active(), req.headers);
```
**HTTP Example**:
```bash
curl -H "traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01" \
https://api.example.com/users
```
---
## Sampling Strategies
### 1. Always On/Off
```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ALWAYS_ON, ALWAYS_OFF
# Development: trace everything
provider = TracerProvider(sampler=ALWAYS_ON)
# Production: trace nothing (usually not desired)
provider = TracerProvider(sampler=ALWAYS_OFF)
```
### 2. Probability-Based
```python
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
# Sample 10% of traces
provider = TracerProvider(sampler=TraceIdRatioBased(0.1))
```
### 3. Rate Limiting
```python
from opentelemetry.sdk.trace.sampling import ParentBased
# Sample at most 100 traces per second.
# Note: the core OpenTelemetry Python SDK does not ship a RateLimitingSampler;
# implement one as a custom sampler (see below) or rate-limit in the
# Collector's tail_sampling processor. Shown here conceptually:
sampler = ParentBased(root=RateLimitingSampler(100))
provider = TracerProvider(sampler=sampler)
```
### 4. Parent-Based (Default)
```python
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
# If parent span is sampled, sample child spans
sampler = ParentBased(root=TraceIdRatioBased(0.1))
provider = TracerProvider(sampler=sampler)
```
### 5. Custom Sampling
```python
from opentelemetry.sdk.trace.sampling import Sampler, SamplingResult, Decision

class ErrorSampler(Sampler):
    """Always sample errors, sample ~1% of successes"""
    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        attributes = attributes or {}
        # Always sample if the span is flagged as an error
        if attributes.get('error', False):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        # Sample ~1% of successes based on the low bits of the trace ID
        if trace_id & 0xFF < 3:  # ~1%
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        return SamplingResult(Decision.DROP)

    def get_description(self):
        return "ErrorSampler"

provider = TracerProvider(sampler=ErrorSampler())
```
---
## Backends
### Jaeger
**Docker Compose**:
```yaml
version: '3'
services:
jaeger:
image: jaegertracing/all-in-one:latest
ports:
- "16686:16686" # UI
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
environment:
- COLLECTOR_OTLP_ENABLED=true
```
**Query traces**:
```bash
# UI: http://localhost:16686
# API: Get trace by ID
curl http://localhost:16686/api/traces/abc123
# Search traces
curl "http://localhost:16686/api/traces?service=my-service&limit=20"
```
### Grafana Tempo
**Docker Compose**:
```yaml
version: '3'
services:
tempo:
image: grafana/tempo:latest
ports:
- "3200:3200" # Tempo
- "4317:4317" # OTLP gRPC
volumes:
- ./tempo.yaml:/etc/tempo.yaml
command: ["-config.file=/etc/tempo.yaml"]
```
**tempo.yaml**:
```yaml
server:
http_listen_port: 3200
distributor:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
storage:
trace:
backend: local
local:
path: /tmp/tempo/traces
```
**Query in Grafana**:
- Install Tempo data source
- Use TraceQL: `{ span.http.status_code = 500 }`
### AWS X-Ray
**Configuration**:
```python
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.ext.flask.middleware import XRayMiddleware
xray_recorder.configure(service='my-service')
XRayMiddleware(app, xray_recorder)
```
**Query**:
```bash
aws xray get-trace-summaries \
--start-time 2024-10-28T00:00:00 \
--end-time 2024-10-28T23:59:59 \
--filter-expression 'error = true'
```
---
## Analysis Patterns
### Find Slow Traces
```
# Jaeger UI
- Filter by service
- Set min duration: 1000ms
- Sort by duration
# TraceQL (Tempo)
{ duration > 1s }
```
### Find Error Traces
```
# Jaeger UI
- Filter by tag: error=true
- Or by HTTP status: http.status_code=500
# TraceQL (Tempo)
{ span.http.status_code >= 500 }
```
### Find Traces by User
```
# Jaeger UI
- Filter by tag: user.id=user123
# TraceQL (Tempo)
{ span.user.id = "user123" }
```
### Find N+1 Query Problems
Look for (a rough detection sketch follows this list):
- Many sequential database spans
- Same query repeated multiple times
- Pattern: API call → DB query → DB query → DB query...
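A rough sketch of flagging the pattern from exported span data, assuming spans are available as simple dicts with `parent_id`, `name`, and `kind` fields (this data shape is hypothetical, not a specific backend's API):
```python
from collections import Counter
def find_n_plus_one(spans, threshold=5):
    """Flag parent spans that issue the same child DB span many times."""
    repeated = Counter((s["parent_id"], s["name"])
                       for s in spans if s.get("kind") == "db")
    return [(parent, name, count)
            for (parent, name), count in repeated.items()
            if count >= threshold]
# Example: 20 identical user-lookup queries under one parent span
spans = [{"parent_id": "span1", "name": "SELECT users by id", "kind": "db"}] * 20
print(find_n_plus_one(spans))  # [('span1', 'SELECT users by id', 20)]
```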
### Find Service Bottlenecks
- Identify spans with longest duration
- Check if time is spent in service logic or waiting for dependencies
- Look at span relationships (parallel vs sequential)
---
## Integration with Logs
### Trace ID in Logs
**Python**:
```python
from opentelemetry import trace
def add_trace_context():
span = trace.get_current_span()
trace_id = span.get_span_context().trace_id
span_id = span.get_span_context().span_id
return {
"trace_id": format(trace_id, '032x'),
"span_id": format(span_id, '016x')
}
logger.info("Processing order", **add_trace_context(), order_id=order_id)
```
**Query logs for trace**:
```
# Elasticsearch
GET /logs/_search
{
"query": {
"match": { "trace_id": "0af7651916cd43dd8448eb211c80319c" }
}
}
# Loki (LogQL)
{job="app"} |= "0af7651916cd43dd8448eb211c80319c"
```
### Trace from Log (Grafana)
Configure derived fields in Grafana:
```yaml
datasources:
- name: Loki
type: loki
jsonData:
derivedFields:
- name: TraceID
matcherRegex: "trace_id=([\\w]+)"
url: "http://tempo:3200/trace/$${__value.raw}"
datasourceUid: tempo_uid
```
---
## Best Practices
### 1. Span Naming
✅ Use operation names, not IDs
- Good: `GET /api/users`, `UserService.GetUser`, `db.query.users`
- Bad: `/api/users/123`, `span_abc`, `query_1`
### 2. Span Granularity
✅ One span per logical operation
- Too coarse: One span for entire request
- Too fine: Span for every variable assignment
- Just right: Span per service call, database query, external API
### 3. Add Context
Always include:
- Operation name
- Service name
- Error status
- Business identifiers (user_id, order_id)
### 4. Handle Errors
```python
try:
result = operation()
except Exception as e:
span.set_status(trace.Status(trace.StatusCode.ERROR))
span.record_exception(e)
raise
```
### 5. Sampling Strategy
- Development: 100%
- Staging: 50-100%
- Production: 1-10% (or error-based)
### 6. Performance Impact
- Overhead: ~1-5% CPU
- Use async exporters
- Batch span exports
- Sample appropriately
### 7. Cardinality
Avoid high-cardinality attributes:
- ❌ Email addresses
- ❌ Full URLs with unique IDs
- ❌ Timestamps
- ✅ User ID
- ✅ Endpoint pattern
- ✅ Status code
---
## Common Issues
### Missing Traces
**Cause**: Context not propagated
**Solution**: Verify headers are injected/extracted
### Incomplete Traces
**Cause**: Spans not closed properly
**Solution**: Always use `defer span.End()` or context managers
### High Overhead
**Cause**: Too many spans or synchronous export
**Solution**: Reduce span count, use batch processor
### No Error Traces
**Cause**: Errors not recorded on spans
**Solution**: Call `span.record_exception()` and set error status
---
## Metrics from Traces
Generate RED metrics from trace data:
**Rate**: Traces per second
**Errors**: Traces with error status
**Duration**: Span duration percentiles
**Example** (using Tempo + Prometheus):
```yaml
# Generate metrics from spans
metrics_generator:
processor:
span_metrics:
dimensions:
- http.method
- http.status_code
```
**Query**:
```promql
# Request rate
rate(traces_spanmetrics_calls_total[5m])
# Error rate
rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m])
/
rate(traces_spanmetrics_calls_total[5m])
# P95 latency
histogram_quantile(0.95, traces_spanmetrics_latency_bucket)
```