Migrating from Datadog to Open-Source Stack
Overview
This guide helps you migrate from Datadog to a cost-effective open-source observability stack:
- Metrics: Datadog → Prometheus + Grafana
- Logs: Datadog → Loki + Grafana
- Traces: Datadog APM → Tempo/Jaeger + Grafana
- Dashboards: Datadog → Grafana
- Alerts: Datadog Monitors → Prometheus Alertmanager
Estimated Cost Savings: 60-80% for similar functionality
Cost Comparison
Example: 100-host infrastructure
Datadog:
- Infrastructure Pro: $1,500/month (100 hosts × $15)
- Custom Metrics: $50/month (5,000 extra metrics beyond included 10,000)
- Logs: $2,000/month (~20GB/day ingested, indexed, and retained)
- APM: $3,100/month (100 hosts × $31)
- Total: ~$6,650/month ($79,800/year)
Open-Source Stack (self-hosted):
- Infrastructure: $1,200/month (EC2/GKE for Prometheus, Grafana, Loki, Tempo)
- Storage: $300/month (S3/GCS for long-term metrics and traces)
- Operations time: Variable
- Total: ~$1,500-2,500/month ($18,000-30,000/year)
Savings: $49,800-61,800/year
Migration Strategy
Phase 1: Run Parallel (Month 1-2)
- Deploy open-source stack alongside Datadog
- Migrate metrics first (lowest risk)
- Validate data accuracy
- Build confidence
Phase 2: Migrate Dashboards & Alerts (Month 2-3)
- Convert Datadog dashboards to Grafana
- Translate alert rules
- Train team on new tools
Phase 3: Migrate Logs & Traces (Month 3-4)
- Set up Loki for log aggregation
- Deploy Tempo/Jaeger for tracing
- Update application instrumentation
Phase 4: Decommission Datadog (Month 4-5)
- Confirm all functionality migrated
- Cancel Datadog subscription
- Archive Datadog dashboards/alerts for reference
1. Metrics Migration (Datadog → Prometheus)
Step 1: Deploy Prometheus
Kubernetes (recommended):
# prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 100Gi
    # Scrape configs
    additionalScrapeConfigs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
Install:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack -f prometheus-values.yaml
Docker Compose:
version: '3'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
volumes:
  prometheus-data:
Step 2: Replace DogStatsD with Prometheus Exporters
Before (DogStatsD):
from datadog import statsd
statsd.increment('page.views')
statsd.histogram('request.duration', 0.5)
statsd.gauge('active_users', 100)
After (Prometheus Python client):
from prometheus_client import Counter, Histogram, Gauge
page_views = Counter('page_views_total', 'Page views')
request_duration = Histogram('request_duration_seconds', 'Request duration')
active_users = Gauge('active_users', 'Active users')
# Usage
page_views.inc()
request_duration.observe(0.5)
active_users.set(100)
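Unlike DogStatsD's push model, Prometheus scrapes an HTTP endpoint, so each instrumented process must expose one. A minimal, self-contained sketch using the client library's built-in server (port 8000 is an arbitrary choice; in practice you would expose metrics from your existing web framework):

import random
import time

from prometheus_client import Counter, start_http_server

page_views = Counter('page_views_total', 'Page views')

if __name__ == '__main__':
    # Expose /metrics on port 8000 so Prometheus can scrape this process
    start_http_server(8000)
    while True:
        page_views.inc()
        time.sleep(random.uniform(0.1, 1.0))

Point Prometheus at this endpoint via a scrape config (or pod annotations on Kubernetes) and the counter appears as page_views_total.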
Step 3: Metric Name Translation
| Datadog Metric | Prometheus Equivalent |
|---|---|
| system.cpu.idle | node_cpu_seconds_total{mode="idle"} |
| system.mem.free | node_memory_MemFree_bytes |
| system.disk.used | node_filesystem_size_bytes - node_filesystem_free_bytes |
| docker.cpu.usage | container_cpu_usage_seconds_total |
| kubernetes.pods.running | kube_pod_status_phase{phase="Running"} |
Step 4: Export Existing Datadog Metrics (Optional)
Use Datadog API to export historical data:
import time

from datadog import api, initialize

options = {
    'api_key': 'YOUR_API_KEY',
    'app_key': 'YOUR_APP_KEY'
}
initialize(**options)

# Query a metric over the last 24 hours
result = api.Metric.query(
    start=int(time.time() - 86400),
    end=int(time.time()),
    query='avg:system.cpu.user{*}'
)
# Convert to Prometheus format and import
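One way to do that conversion is to write the result as OpenMetrics text and backfill it with `promtool tsdb create-blocks-from openmetrics`. A rough sketch, assuming the usual query response shape (a `series` list whose `pointlist` holds `[timestamp_ms, value]` pairs); verify against your actual API response before relying on it:

def to_openmetrics(result, metric_name, out_path):
    # Assumed response shape: {'series': [{'pointlist': [[ts_ms, value], ...]}]}
    with open(out_path, 'w') as f:
        for series in result.get('series', []):
            for ts_ms, value in series.get('pointlist', []):
                if value is None:
                    continue  # gaps in Datadog data come back as None
                # OpenMetrics line: name value timestamp_seconds
                f.write(f"{metric_name} {value} {ts_ms / 1000:.3f}\n")
        f.write("# EOF\n")

to_openmetrics(result, 'datadog_system_cpu_user', 'backfill.om')

Historical backfill is optional; most teams simply let Prometheus accumulate fresh data during the parallel-run phase.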
2. Dashboard Migration (Datadog → Grafana)
Step 1: Export Datadog Dashboards
import requests
import json

api_key = "YOUR_API_KEY"
app_key = "YOUR_APP_KEY"

headers = {
    'DD-API-KEY': api_key,
    'DD-APPLICATION-KEY': app_key
}

# Get all dashboards
response = requests.get(
    'https://api.datadoghq.com/api/v1/dashboard',
    headers=headers
)
dashboards = response.json()

# Export each dashboard
for dashboard in dashboards['dashboards']:
    dash_id = dashboard['id']
    detail = requests.get(
        f'https://api.datadoghq.com/api/v1/dashboard/{dash_id}',
        headers=headers
    ).json()
    with open(f'datadog_{dash_id}.json', 'w') as f:
        json.dump(detail, f, indent=2)
Step 2: Convert to Grafana Format
Manual Conversion Template:
| Datadog Widget | Grafana Panel Type |
|---|---|
| Timeseries | Graph / Time series |
| Query Value | Stat |
| Toplist | Table / Bar gauge |
| Heatmap | Heatmap |
| Distribution | Histogram |
Automated Conversion (basic example):
def convert_datadog_to_grafana(datadog_dashboard):
    grafana_dashboard = {
        "title": datadog_dashboard['title'],
        "panels": []
    }
    for widget in datadog_dashboard['widgets']:
        panel = {
            "title": widget['definition'].get('title', ''),
            "type": map_widget_type(widget['definition']['type']),
            "targets": convert_queries(widget['definition']['requests'])
        }
        grafana_dashboard['panels'].append(panel)
    return grafana_dashboard
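The helpers map_widget_type and convert_queries are left to you. A minimal map_widget_type sketch, following the widget-to-panel table above (the type strings are illustrative defaults; convert_queries still requires the per-query translation shown in Step 3 below):

def map_widget_type(datadog_type):
    # Follows the Datadog widget -> Grafana panel table above;
    # unknown widget types fall back to a time series panel.
    mapping = {
        'timeseries': 'timeseries',
        'query_value': 'stat',
        'toplist': 'table',
        'heatmap': 'heatmap',
        'distribution': 'histogram',
    }
    return mapping.get(datadog_type, 'timeseries')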
Step 3: Common Query Translations
See dql_promql_translation.md for comprehensive query mappings.
Example conversions:
Datadog: avg:system.cpu.user{*}
Prometheus: avg(rate(node_cpu_seconds_total{mode="user"}[5m])) * 100
Datadog: sum:requests.count{status:200}.as_rate()
Prometheus: sum(rate(http_requests_total{status="200"}[5m]))
Datadog: p95:request.duration{*}
Prometheus: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
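During the parallel-run phase it helps to spot-check each translated query against live Prometheus data and compare the numbers with Datadog. A small sketch using the Prometheus HTTP API (the prometheus:9090 address matches the Docker Compose example above; adjust to your deployment):

import requests

def instant_query(promql, base_url="http://prometheus:9090"):
    # Evaluate a PromQL expression at the current instant
    resp = requests.get(f"{base_url}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

samples = instant_query(
    'avg(rate(node_cpu_seconds_total{mode="user"}[5m])) * 100'
)
for sample in samples:
    print(sample["metric"], sample["value"])  # value = [timestamp, "number"]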
3. Alert Migration (Datadog Monitors → Prometheus Alerts)
Step 1: Export Datadog Monitors
import requests
import json

api_key = "YOUR_API_KEY"
app_key = "YOUR_APP_KEY"

headers = {
    'DD-API-KEY': api_key,
    'DD-APPLICATION-KEY': app_key
}

response = requests.get(
    'https://api.datadoghq.com/api/v1/monitor',
    headers=headers
)
monitors = response.json()

# Save each monitor
for monitor in monitors:
    with open(f'monitor_{monitor["id"]}.json', 'w') as f:
        json.dump(monitor, f, indent=2)
Step 2: Convert to Prometheus Alert Rules
Datadog Monitor:
{
  "name": "High CPU Usage",
  "type": "metric alert",
  "query": "avg(last_5m):avg:system.cpu.user{*} > 80",
  "message": "CPU usage is high on {{host.name}}"
}
Prometheus Alert:
groups:
  - name: infrastructure
    rules:
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"
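Query translation itself remains manual (see the DQL-to-PromQL examples above), but the rule boilerplate can be generated from the exported monitors. A rough sketch that turns the monitor_*.json files from Step 1 into an alert-rule skeleton; the group name, severity, and output file name are arbitrary choices, and each expr is left as a TODO:

import glob
import json
import re

import yaml  # PyYAML

rules = []
for path in glob.glob('monitor_*.json'):
    with open(path) as f:
        monitor = json.load(f)
    # Derive an alert name like "HighCpuUsage" from the monitor name
    alert_name = re.sub(r'\W+', '', monitor['name'].title())
    rules.append({
        'alert': alert_name,
        # Placeholder expression; translate the Datadog query by hand
        'expr': f"vector(0)  # TODO translate: {monitor['query']}",
        'for': '5m',
        'labels': {'severity': 'warning'},
        'annotations': {'summary': monitor.get('message', '')},
    })

with open('datadog_rules.yml', 'w') as f:
    yaml.safe_dump(
        {'groups': [{'name': 'migrated-from-datadog', 'rules': rules}]}, f
    )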
Step 3: Alert Routing (Datadog → Alertmanager)
Datadog notification channels → Alertmanager receivers
# alertmanager.yml
route:
  group_by: ['alertname', 'severity']
  receiver: 'slack-notifications'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK'
        channel: '#alerts'
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
4. Log Migration (Datadog → Loki)
Step 1: Deploy Loki
Kubernetes:
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
--set loki.persistence.enabled=true \
--set loki.persistence.size=100Gi \
--set promtail.enabled=true
Docker Compose:
version: '3'
services:
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
      - loki-data:/loki
  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log
      - ./promtail-config.yaml:/etc/promtail/config.yml
volumes:
  loki-data:
Step 2: Replace Datadog Log Forwarder
Before (Datadog Agent):
# datadog.yaml
logs_enabled: true
logs_config:
  container_collect_all: true
After (Promtail):
# promtail-config.yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*.log
Step 3: Query Translation
Datadog Logs Query:
service:my-app status:error
Loki LogQL:
{job="my-app", level="error"}
More examples:
Datadog: service:api-gateway status:error @http.status_code:>=500
Loki: {service="api-gateway", level="error"} | json | http_status_code >= 500
Datadog: source:nginx "404"
Loki: {source="nginx"} |= "404"
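Promtail (or another agent) normally handles shipping, but for smoke tests it can be useful to push a line directly to Loki's HTTP push API. A minimal sketch, assuming the loki:3100 address from the Docker Compose example above:

import json
import time

import requests

def push_to_loki(message, labels):
    # Loki push payload: one stream with a label set and [timestamp_ns, line] pairs
    payload = {
        "streams": [{
            "stream": labels,
            "values": [[str(time.time_ns()), message]],
        }]
    }
    resp = requests.post(
        "http://loki:3100/loki/api/v1/push",
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()

push_to_loki("user login failed", {"service": "my-app", "level": "error"})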
5. APM Migration (Datadog APM → Tempo/Jaeger)
Step 1: Choose Tracing Backend
- Tempo: Better for high volume, cheaper storage (object storage)
- Jaeger: More mature, better UI, requires separate storage
Step 2: Replace Datadog Tracer with OpenTelemetry
Before (Datadog Python):
from ddtrace import tracer

@tracer.wrap()
def my_function():
    pass
After (OpenTelemetry):
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Setup: send spans to Tempo's OTLP gRPC endpoint
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="tempo:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("my_function")
def my_function():
    pass
Step 3: Deploy Tempo
# tempo.yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317

storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.amazonaws.com
6. Infrastructure Migration
Recommended Architecture
┌─────────────────────────────────────────┐
│ Grafana (Visualization) │
│ - Dashboards │
│ - Unified view │
└─────────────────────────────────────────┘
↓ ↓ ↓
┌──────────────┐ ┌──────────┐ ┌──────────┐
│ Prometheus │ │ Loki │ │ Tempo │
│ (Metrics) │ │ (Logs) │ │ (Traces) │
└──────────────┘ └──────────┘ └──────────┘
↓ ↓ ↓
┌─────────────────────────────────────────┐
│ Applications (OpenTelemetry) │
└─────────────────────────────────────────┘
Sizing Recommendations
100-host environment:
- Prometheus: 2-4 CPU, 8-16GB RAM, 100GB SSD
- Grafana: 1 CPU, 2GB RAM
- Loki: 2-4 CPU, 8GB RAM, 100GB SSD
- Tempo: 2-4 CPU, 8GB RAM, S3 for storage
- Alertmanager: 1 CPU, 1GB RAM
Total: ~8-14 CPU, ~28-36GB RAM, 200GB SSD + object storage (plan extra headroom for growth)
7. Migration Checklist
Pre-Migration
- Calculate current Datadog costs
- Identify all Datadog integrations
- Export all dashboards
- Export all monitors
- Document custom metrics
- Get stakeholder approval
During Migration
- Deploy Prometheus + Grafana
- Deploy Loki + Promtail
- Deploy Tempo/Jaeger (if using APM)
- Migrate metrics instrumentation
- Convert dashboards (top 10 critical first)
- Convert alerts (critical alerts first)
- Update application logging
- Replace APM instrumentation
- Run parallel for 2-4 weeks
- Validate data accuracy
- Train team on new tools
Post-Migration
- Decommission Datadog agent from all hosts
- Cancel Datadog subscription
- Archive Datadog configs
- Document new workflows
- Create runbooks for common tasks
8. Common Challenges & Solutions
Challenge: Missing Datadog Features
Datadog Synthetic Monitoring:
- Solution: Use Blackbox Exporter (Prometheus) or Grafana Synthetic Monitoring
Datadog Network Performance Monitoring:
- Solution: Use Cilium Hubble (Kubernetes) or eBPF-based tools
Datadog RUM (Real User Monitoring):
- Solution: Use Grafana Faro or OpenTelemetry Browser SDK
Challenge: Team Learning Curve
Solution:
- Provide training sessions (2-3 hours per tool)
- Create internal documentation with examples
- Set up sandbox environment for practice
- Assign champions for each tool
Challenge: Query Performance
Prometheus too slow:
- Use Thanos or Cortex for scaling
- Implement recording rules for expensive queries
- Increase retention only where needed
Loki too slow:
- Filter by labels first, then line filters (keep label cardinality low)
- Use chunk caching
- Consider parallel query execution
9. Maintenance Comparison
Datadog (Managed)
- Ops burden: Low (fully managed)
- Upgrades: Automatic
- Scaling: Automatic
- Cost: High ($6k-10k+/month)
Open-Source Stack (Self-hosted)
- Ops burden: Medium (requires ops team)
- Upgrades: Manual (quarterly)
- Scaling: Manual planning required
- Cost: Low ($1.5k-3k/month infrastructure)
Hybrid Option: Use Grafana Cloud (managed Prometheus/Loki/Tempo)
- Cost: ~$3k/month for 100 hosts
- Ops burden: Low
- Savings: ~50% vs Datadog
10. ROI Calculation
Example Scenario
Before (Datadog):
- Monthly cost: $7,000
- Annual cost: $84,000
After (Self-hosted OSS):
- Infrastructure: $1,800/month
- Operations (0.5 FTE): $4,000/month
- Annual cost: $69,600
Savings: $14,400/year
After (Grafana Cloud):
- Monthly cost: $3,500
- Annual cost: $42,000
Savings: $42,000/year (50%)
Break-even: Immediate (no migration costs beyond engineering time)
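The arithmetic above is easy to redo with your own figures; a small sketch using the example numbers:

# ROI sketch using the example figures above; substitute your own numbers.
datadog_monthly = 7000
oss_infra_monthly = 1800
oss_ops_monthly = 4000      # 0.5 FTE
grafana_cloud_monthly = 3500

datadog_annual = 12 * datadog_monthly
for name, monthly in [
    ("Self-hosted OSS", oss_infra_monthly + oss_ops_monthly),
    ("Grafana Cloud", grafana_cloud_monthly),
]:
    annual = 12 * monthly
    savings = datadog_annual - annual
    print(f"{name}: ${annual:,}/year, saves ${savings:,} ({savings / datadog_annual:.0%})")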
Resources
- Prometheus: https://prometheus.io/docs/
- Grafana: https://grafana.com/docs/
- Loki: https://grafana.com/docs/loki/
- Tempo: https://grafana.com/docs/tempo/
- OpenTelemetry: https://opentelemetry.io/
- Migration Tools: https://github.com/grafana/dashboard-linter
Support
If you need help with migration:
- Grafana Labs offers migration consulting
- Many SRE consulting firms specialize in this
- Community support via Slack/Discord channels