# Migrating from Datadog to Open-Source Stack

## Overview

This guide helps you migrate from Datadog to a cost-effective open-source observability stack:

- **Metrics**: Datadog → Prometheus + Grafana
- **Logs**: Datadog → Loki + Grafana
- **Traces**: Datadog APM → Tempo/Jaeger + Grafana
- **Dashboards**: Datadog → Grafana
- **Alerts**: Datadog Monitors → Prometheus Alertmanager

**Estimated Cost Savings**: 60-80% for similar functionality

---

## Cost Comparison

### Example: 100-host infrastructure

**Datadog**:
- Infrastructure Pro: $1,500/month (100 hosts × $15)
- Custom Metrics: $50/month (5,000 metrics beyond the included allotment)
- Logs: $2,000/month (20 GB/day, ingestion plus indexing)
- APM: $3,100/month (100 hosts × $31)
- **Total**: ~$6,650/month ($79,800/year)

**Open-Source Stack** (self-hosted):
- Infrastructure: $1,200/month (EC2/GKE for Prometheus, Grafana, Loki, Tempo)
- Storage: $300/month (S3/GCS for long-term metrics and traces)
- Operations time: variable
- **Total**: ~$1,500-2,500/month ($18,000-30,000/year)

**Savings**: $49,800-61,800/year

---

## Migration Strategy

### Phase 1: Run in Parallel (Months 1-2)
- Deploy the open-source stack alongside Datadog
- Migrate metrics first (lowest risk)
- Validate data accuracy
- Build confidence

### Phase 2: Migrate Dashboards & Alerts (Months 2-3)
- Convert Datadog dashboards to Grafana
- Translate alert rules
- Train the team on the new tools

### Phase 3: Migrate Logs & Traces (Months 3-4)
- Set up Loki for log aggregation
- Deploy Tempo/Jaeger for tracing
- Update application instrumentation

### Phase 4: Decommission Datadog (Months 4-5)
- Confirm all functionality has migrated
- Cancel the Datadog subscription
- Archive Datadog dashboards/alerts for reference

---

## 1. Metrics Migration (Datadog → Prometheus)

### Step 1: Deploy Prometheus

**Kubernetes** (recommended):

```yaml
# prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 100Gi
    # Scrape configs
    additionalScrapeConfigs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
```

**Install**:

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack -f prometheus-values.yaml
```

**Docker Compose**:

```yaml
version: '3'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

volumes:
  prometheus-data:
```

### Step 2: Replace DogStatsD with Prometheus Exporters

**Before (DogStatsD)**:

```python
from datadog import statsd

statsd.increment('page.views')
statsd.histogram('request.duration', 0.5)
statsd.gauge('active_users', 100)
```

**After (Prometheus Python client)**:

```python
from prometheus_client import Counter, Histogram, Gauge, start_http_server

page_views = Counter('page_views_total', 'Page views')
request_duration = Histogram('request_duration_seconds', 'Request duration')
active_users = Gauge('active_users', 'Active users')

# Expose /metrics so Prometheus can scrape this process
start_http_server(8000)

# Usage
page_views.inc()
request_duration.observe(0.5)
active_users.set(100)
```

### Step 3: Metric Name Translation

| Datadog Metric | Prometheus Equivalent |
|----------------|----------------------|
| `system.cpu.idle` | `node_cpu_seconds_total{mode="idle"}` |
| `system.mem.free` | `node_memory_MemFree_bytes` |
| `system.disk.used` | `node_filesystem_size_bytes - node_filesystem_free_bytes` |
| `docker.cpu.usage` | `container_cpu_usage_seconds_total` |
| `kubernetes.pods.running` | `kube_pod_status_phase{phase="Running"}` |
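As you convert dashboards and alerts in the sections that follow, it pays to keep these name mappings in one place in code rather than translating by hand each time. A minimal sketch, seeded from the table above (`DATADOG_TO_PROMQL` and `translate_metric` are hypothetical names, not part of any library):

```python
# Hypothetical lookup table mirroring the translation table above;
# extend it with your own custom metrics as you discover them.
DATADOG_TO_PROMQL = {
    "system.cpu.idle": 'node_cpu_seconds_total{mode="idle"}',
    "system.mem.free": "node_memory_MemFree_bytes",
    "system.disk.used": "node_filesystem_size_bytes - node_filesystem_free_bytes",
    "docker.cpu.usage": "container_cpu_usage_seconds_total",
    "kubernetes.pods.running": 'kube_pod_status_phase{phase="Running"}',
}

def translate_metric(datadog_name: str) -> str:
    """Return the Prometheus equivalent, failing loudly on unmapped metrics."""
    try:
        return DATADOG_TO_PROMQL[datadog_name]
    except KeyError:
        raise KeyError(f"no Prometheus mapping defined for {datadog_name!r}") from None
```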
### Step 4: Export Existing Datadog Metrics (Optional)

Use the Datadog API to export historical data:

```python
import time

from datadog import api, initialize

options = {
    'api_key': 'YOUR_API_KEY',
    'app_key': 'YOUR_APP_KEY'
}
initialize(**options)

# Query metric
result = api.Metric.query(
    start=int(time.time() - 86400),  # Last 24h
    end=int(time.time()),
    query='avg:system.cpu.user{*}'
)

# Convert to Prometheus format and import
```

---

## 2. Dashboard Migration (Datadog → Grafana)

### Step 1: Export Datadog Dashboards

```python
import json

import requests

api_key = "YOUR_API_KEY"
app_key = "YOUR_APP_KEY"

headers = {
    'DD-API-KEY': api_key,
    'DD-APPLICATION-KEY': app_key
}

# Get all dashboards
response = requests.get(
    'https://api.datadoghq.com/api/v1/dashboard',
    headers=headers
)
dashboards = response.json()

# Export each dashboard
for dashboard in dashboards['dashboards']:
    dash_id = dashboard['id']
    detail = requests.get(
        f'https://api.datadoghq.com/api/v1/dashboard/{dash_id}',
        headers=headers
    ).json()

    with open(f'datadog_{dash_id}.json', 'w') as f:
        json.dump(detail, f, indent=2)
```

### Step 2: Convert to Grafana Format

**Manual Conversion Template**:

| Datadog Widget | Grafana Panel Type |
|----------------|-------------------|
| Timeseries | Graph / Time series |
| Query Value | Stat |
| Toplist | Table / Bar gauge |
| Heatmap | Heatmap |
| Distribution | Histogram |

**Automated Conversion** (basic example):

```python
def convert_datadog_to_grafana(datadog_dashboard):
    grafana_dashboard = {
        "title": datadog_dashboard['title'],
        "panels": []
    }

    for widget in datadog_dashboard['widgets']:
        panel = {
            "title": widget['definition'].get('title', ''),
            # map_widget_type and convert_queries are your own helpers that
            # implement the widget table and the query translations below
            "type": map_widget_type(widget['definition']['type']),
            "targets": convert_queries(widget['definition']['requests'])
        }
        grafana_dashboard['panels'].append(panel)

    return grafana_dashboard
```

### Step 3: Common Query Translations

See `dql_promql_translation.md` for comprehensive query mappings.

**Example conversions**:

```
Datadog:    avg:system.cpu.user{*}
Prometheus: avg(rate(node_cpu_seconds_total{mode="user"}[5m])) * 100

Datadog:    sum:requests.count{status:200}.as_rate()
Prometheus: sum(rate(http_requests_total{status="200"}[5m]))

Datadog:    p95:request.duration{*}
Prometheus: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
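For bulk conversion you can automate the simplest and most common query shape, `agg:metric{*}`, and fall back to manual translation for everything else. A minimal sketch along those lines; it reuses the hypothetical `translate_metric` helper from section 1 and deliberately refuses anything it cannot translate safely:

```python
import re

# Matches only the simple "agg:metric{*}" shape from the examples above.
SIMPLE_QUERY = re.compile(r"^(?P<agg>avg|sum|min|max):(?P<metric>[\w.]+)\{\*\}$")

def convert_simple_query(datadog_query: str, window: str = "5m") -> str:
    """Best-effort DQL -> PromQL for counters; gauges need avg_over_time, not rate."""
    match = SIMPLE_QUERY.match(datadog_query)
    if not match:
        raise ValueError(f"needs manual translation: {datadog_query!r}")
    promql_metric = translate_metric(match["metric"])  # hypothetical helper from section 1
    return f'{match["agg"]}(rate({promql_metric}[{window}]))'

# convert_simple_query("avg:system.cpu.user{*}") would need a mapping for
# system.cpu.user first; unit scaling (the "* 100" above) still has to be added by hand.
```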
---

## 3. Alert Migration (Datadog Monitors → Prometheus Alerts)

### Step 1: Export Datadog Monitors

```python
import json

import requests

api_key = "YOUR_API_KEY"
app_key = "YOUR_APP_KEY"

headers = {
    'DD-API-KEY': api_key,
    'DD-APPLICATION-KEY': app_key
}

response = requests.get(
    'https://api.datadoghq.com/api/v1/monitor',
    headers=headers
)
monitors = response.json()

# Save each monitor
for monitor in monitors:
    with open(f'monitor_{monitor["id"]}.json', 'w') as f:
        json.dump(monitor, f, indent=2)
```

### Step 2: Convert to Prometheus Alert Rules

**Datadog Monitor**:

```json
{
  "name": "High CPU Usage",
  "type": "metric alert",
  "query": "avg(last_5m):avg:system.cpu.user{*} > 80",
  "message": "CPU usage is high on {{host.name}}"
}
```

**Prometheus Alert**:

```yaml
groups:
  - name: infrastructure
    rules:
      - alert: HighCPUUsage
        expr: |
          100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"
```

### Step 3: Alert Routing (Datadog → Alertmanager)

**Datadog notification channels** → **Alertmanager receivers**:

```yaml
# alertmanager.yml
route:
  group_by: ['alertname', 'severity']
  receiver: 'slack-notifications'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK'
        channel: '#alerts'
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
```

---

## 4. Log Migration (Datadog → Loki)

### Step 1: Deploy Loki

**Kubernetes**:

```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=100Gi \
  --set promtail.enabled=true
```

**Docker Compose**:

```yaml
version: '3'
services:
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
      - loki-data:/loki

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log
      - ./promtail-config.yaml:/etc/promtail/config.yml

volumes:
  loki-data:
```

### Step 2: Replace the Datadog Log Forwarder

**Before (Datadog Agent)**:

```yaml
# datadog.yaml
logs_enabled: true
logs_config:
  container_collect_all: true
```

**After (Promtail)**:

```yaml
# promtail-config.yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*.log
```

### Step 3: Query Translation

Label names on the Loki side depend on how Promtail labels your streams; the examples below assume `service` and `level` labels.

**Datadog Logs Query**:

```
service:my-app status:error
```

**Loki LogQL**:

```logql
{service="my-app", level="error"}
```

**More examples**:

```
Datadog: service:api-gateway status:error @http.status_code:>=500
Loki:    {service="api-gateway", level="error"} | json | http_status_code >= 500

Datadog: source:nginx "404"
Loki:    {source="nginx"} |= "404"
```
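During the parallel-run phase it's worth scripting a quick check that a translated LogQL query actually matches log lines, rather than eyeballing Grafana. A minimal sketch against Loki's HTTP query API, assuming Loki is reachable at `loki:3100` (adjust to your deployment):

```python
import time

import requests

def loki_stream_count(logql: str, lookback_seconds: int = 3600) -> int:
    """Run a LogQL query over the recent past and return how many streams matched."""
    now = time.time()
    resp = requests.get(
        "http://loki:3100/loki/api/v1/query_range",
        params={
            "query": logql,
            # Loki's HTTP API takes start/end as Unix epoch in nanoseconds
            "start": int((now - lookback_seconds) * 1e9),
            "end": int(now * 1e9),
            "limit": 100,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return len(resp.json()["data"]["result"])

# Zero here usually means a wrong label name, not an absence of errors
print(loki_stream_count('{service="my-app", level="error"}'))
```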
---

## 5. APM Migration (Datadog APM → Tempo/Jaeger)

### Step 1: Choose a Tracing Backend

- **Tempo**: Better for high volume; cheaper storage (object storage)
- **Jaeger**: More mature, better UI; requires separate storage

### Step 2: Replace the Datadog Tracer with OpenTelemetry

**Before (Datadog Python)**:

```python
from ddtrace import tracer

@tracer.wrap()
def my_function():
    pass
```

**After (OpenTelemetry)**:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup: register the OTLP exporter with the provider so spans actually ship
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="tempo:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("my_function")
def my_function():
    pass
```

### Step 3: Deploy Tempo

```yaml
# tempo.yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317

storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.amazonaws.com
```

---

## 6. Infrastructure Migration

### Recommended Architecture

Applications ship telemetry to the three backends; Grafana queries all of them for a unified view.

```
┌─────────────────────────────────────────┐
│        Grafana (Visualization)          │
│  - Dashboards                           │
│  - Unified view                         │
└─────────────────────────────────────────┘
       ↑              ↑            ↑
┌──────────────┐ ┌──────────┐ ┌──────────┐
│  Prometheus  │ │   Loki   │ │  Tempo   │
│  (Metrics)   │ │  (Logs)  │ │ (Traces) │
└──────────────┘ └──────────┘ └──────────┘
       ↑              ↑            ↑
┌─────────────────────────────────────────┐
│     Applications (OpenTelemetry)        │
└─────────────────────────────────────────┘
```

### Sizing Recommendations

**100-host environment**:

- **Prometheus**: 2-4 CPU, 8-16GB RAM, 100GB SSD
- **Grafana**: 1 CPU, 2GB RAM
- **Loki**: 2-4 CPU, 8GB RAM, 100GB SSD
- **Tempo**: 2-4 CPU, 8GB RAM, S3 for storage
- **Alertmanager**: 1 CPU, 1GB RAM

**Total**: ~8-14 CPU, ~27-35GB RAM, 200GB SSD + object storage

---

## 7. Migration Checklist

### Pre-Migration

- [ ] Calculate current Datadog costs
- [ ] Identify all Datadog integrations
- [ ] Export all dashboards
- [ ] Export all monitors
- [ ] Document custom metrics
- [ ] Get stakeholder approval

### During Migration

- [ ] Deploy Prometheus + Grafana
- [ ] Deploy Loki + Promtail
- [ ] Deploy Tempo/Jaeger (if using APM)
- [ ] Migrate metrics instrumentation
- [ ] Convert dashboards (top 10 critical first)
- [ ] Convert alerts (critical alerts first)
- [ ] Update application logging
- [ ] Replace APM instrumentation
- [ ] Run in parallel for 2-4 weeks
- [ ] Validate data accuracy (see the parity-check sketch after this checklist)
- [ ] Train team on new tools

### Post-Migration

- [ ] Decommission the Datadog agent on all hosts
- [ ] Cancel the Datadog subscription
- [ ] Archive Datadog configs
- [ ] Document new workflows
- [ ] Create runbooks for common tasks
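For the "validate data accuracy" item, one lightweight approach is to query the same signal from both systems while they run in parallel and compare. A minimal sketch using the Datadog API client from earlier and Prometheus's HTTP query API, assuming Prometheus at `prometheus:9090`; treat modest drift as normal, since sampling windows differ:

```python
import time

import requests
from datadog import api, initialize

initialize(api_key="YOUR_API_KEY", app_key="YOUR_APP_KEY")

def datadog_latest(query: str) -> float:
    """Most recent datapoint for a Datadog metric query over the last hour."""
    now = int(time.time())
    result = api.Metric.query(start=now - 3600, end=now, query=query)
    return result["series"][0]["pointlist"][-1][1]

def prometheus_instant(promql: str) -> float:
    """Current value of a PromQL expression via the HTTP API."""
    resp = requests.get(
        "http://prometheus:9090/api/v1/query",
        params={"query": promql},
        timeout=30,
    )
    resp.raise_for_status()
    return float(resp.json()["data"]["result"][0]["value"][1])

dd = datadog_latest("avg:system.cpu.user{*}")
prom = prometheus_instant('avg(rate(node_cpu_seconds_total{mode="user"}[5m])) * 100')
print(f"Datadog: {dd:.1f}%  Prometheus: {prom:.1f}%  drift: {abs(dd - prom):.1f}pp")
```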
---

## 8. Common Challenges & Solutions

### Challenge: Missing Datadog Features

**Datadog Synthetic Monitoring**:
- Solution: Use the **Blackbox Exporter** (Prometheus) or **Grafana Synthetic Monitoring**

**Datadog Network Performance Monitoring**:
- Solution: Use **Cilium Hubble** (Kubernetes) or **eBPF-based tools**

**Datadog RUM (Real User Monitoring)**:
- Solution: Use **Grafana Faro** or the **OpenTelemetry Browser SDK**

### Challenge: Team Learning Curve

**Solution**:
- Provide training sessions (2-3 hours per tool)
- Create internal documentation with examples
- Set up a sandbox environment for practice
- Assign champions for each tool

### Challenge: Query Performance

**Prometheus too slow**:
- Use **Thanos** or **Cortex** for scaling
- Implement recording rules for expensive queries
- Increase retention only where needed

**Loki too slow**:
- Keep labels selective but low-cardinality, so queries match fewer streams
- Use chunk caching
- Consider **parallel query execution**

---

## 9. Maintenance Comparison

### Datadog (Managed)

- **Ops burden**: Low (fully managed)
- **Upgrades**: Automatic
- **Scaling**: Automatic
- **Cost**: High ($6k-10k+/month)

### Open-Source Stack (Self-hosted)

- **Ops burden**: Medium (requires an ops team)
- **Upgrades**: Manual (quarterly)
- **Scaling**: Manual planning required
- **Cost**: Low ($1.5k-3k/month infrastructure)

**Hybrid Option**: Use **Grafana Cloud** (managed Prometheus/Loki/Tempo)
- Cost: ~$3k/month for 100 hosts
- Ops burden: Low
- Savings: ~50% vs Datadog

---

## 10. ROI Calculation

### Example Scenario

**Before (Datadog)**:
- Monthly cost: $7,000
- Annual cost: $84,000

**After (self-hosted OSS)**:
- Infrastructure: $1,800/month
- Operations (0.5 FTE): $4,000/month
- Annual cost: $69,600

**Savings**: $14,400/year (~17%)

**After (Grafana Cloud)**:
- Monthly cost: $3,500
- Annual cost: $42,000

**Savings**: $42,000/year (50%)

**Break-even**: Immediate (no migration costs beyond engineering time)

---

## Resources

- **Prometheus**: https://prometheus.io/docs/
- **Grafana**: https://grafana.com/docs/
- **Loki**: https://grafana.com/docs/loki/
- **Tempo**: https://grafana.com/docs/tempo/
- **OpenTelemetry**: https://opentelemetry.io/
- **Migration Tools**: https://github.com/grafana/dashboard-linter

---

## Support

If you need help with the migration:

- Grafana Labs offers migration consulting
- Many SRE consulting firms specialize in this
- Community support is available via Slack/Discord channels
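---

To rerun the section 10 arithmetic with your own numbers, here is a minimal sketch; the figures below are the example scenario's inputs, not benchmarks.

```python
def annual_savings(datadog_monthly: float, replacement_monthly: float) -> float:
    """Annual savings of a replacement stack versus Datadog (inputs in USD/month)."""
    return (datadog_monthly - replacement_monthly) * 12

# The two scenarios from section 10:
print(annual_savings(7000, 1800 + 4000))  # self-hosted OSS (infra + 0.5 FTE ops) -> 14400.0
print(annual_savings(7000, 3500))         # Grafana Cloud -> 42000.0
```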