Migrating from Datadog to Open-Source Stack

Overview

This guide helps you migrate from Datadog to a cost-effective open-source observability stack:

  • Metrics: Datadog → Prometheus + Grafana
  • Logs: Datadog → Loki + Grafana
  • Traces: Datadog APM → Tempo/Jaeger + Grafana
  • Dashboards: Datadog → Grafana
  • Alerts: Datadog Monitors → Prometheus Alertmanager

Estimated Cost Savings: 60-80% for similar functionality


Cost Comparison

Example: 100-host infrastructure

Datadog:

  • Infrastructure Pro: $1,500/month (100 hosts × $15)
  • Custom Metrics: $50/month (5,000 extra metrics beyond included 10,000)
  • Logs: $2,000/month (20GB/day, ingestion plus indexing)
  • APM: $3,100/month (100 hosts × $31)
  • Total: ~$6,650/month ($79,800/year)

Open-Source Stack (self-hosted):

  • Infrastructure: $1,200/month (EC2/GKE for Prometheus, Grafana, Loki, Tempo)
  • Storage: $300/month (S3/GCS for long-term metrics and traces)
  • Operations time: Variable
  • Total: ~$1,500-2,500/month ($18,000-30,000/year)

Savings: $49,800-61,800/year
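
The totals above can be reproduced with a small cost model; a minimal sketch, assuming the example rates used in this section (illustrative figures, not quoted list pricing):

def datadog_monthly(hosts, extra_custom_metrics, log_gb_per_day, apm_hosts):
    infra = hosts * 15                            # Infrastructure Pro, per host
    metrics = extra_custom_metrics / 5000 * 50    # $50 per 5,000 extra metrics (example rate)
    logs = 2000 * (log_gb_per_day / 20)           # scaled from the $2,000 / 20GB-day example
    apm = apm_hosts * 31                          # APM, per host
    return infra + metrics + logs + apm

print(datadog_monthly(100, 5000, 20, 100))  # 6650.0, matching the total above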


Migration Strategy

Phase 1: Run Parallel (Month 1-2)

  • Deploy open-source stack alongside Datadog
  • Migrate metrics first (lowest risk)
  • Validate data accuracy
  • Build confidence

Phase 2: Migrate Dashboards & Alerts (Month 2-3)

  • Convert Datadog dashboards to Grafana
  • Translate alert rules
  • Train team on new tools

Phase 3: Migrate Logs & Traces (Month 3-4)

  • Set up Loki for log aggregation
  • Deploy Tempo/Jaeger for tracing
  • Update application instrumentation

Phase 4: Decommission Datadog (Month 4-5)

  • Confirm all functionality migrated
  • Cancel Datadog subscription
  • Archive Datadog dashboards/alerts for reference

1. Metrics Migration (Datadog → Prometheus)

Step 1: Deploy Prometheus

Kubernetes (recommended):

# prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 100Gi

    # Scrape configs
    additionalScrapeConfigs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod

Install:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack -f prometheus-values.yaml

Docker Compose:

version: '3'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

volumes:
  prometheus-data:
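
Once Prometheus is up, confirm that targets are actually being scraped before trusting any graphs. A quick check against the Prometheus HTTP API (assuming it is reachable at localhost:9090):

import requests

# List active scrape targets and their health ("up" means scraping works)
resp = requests.get("http://localhost:9090/api/v1/targets")
resp.raise_for_status()
for target in resp.json()["data"]["activeTargets"]:
    print(target["labels"].get("job"), target["health"])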

Step 2: Replace DogStatsD with Prometheus Exporters

Before (DogStatsD):

from datadog import statsd

statsd.increment('page.views')
statsd.histogram('request.duration', 0.5)
statsd.gauge('active_users', 100)

After (Prometheus Python client):

from prometheus_client import Counter, Histogram, Gauge, start_http_server

page_views = Counter('page_views_total', 'Page views')
request_duration = Histogram('request_duration_seconds', 'Request duration')
active_users = Gauge('active_users', 'Active users')

# Expose /metrics for scraping -- DogStatsD pushed, Prometheus pulls
start_http_server(8000)

# Usage
page_views.inc()
request_duration.observe(0.5)
active_users.set(100)

Step 3: Metric Name Translation

Datadog Metric             Prometheus Equivalent
system.cpu.idle            node_cpu_seconds_total{mode="idle"}
system.mem.free            node_memory_MemFree_bytes
system.disk.used           node_filesystem_size_bytes - node_filesystem_free_bytes
docker.cpu.usage           container_cpu_usage_seconds_total
kubernetes.pods.running    kube_pod_status_phase{phase="Running"}

Step 4: Export Existing Datadog Metrics (Optional)

Use Datadog API to export historical data:

import time

from datadog import api, initialize

options = {
    'api_key': 'YOUR_API_KEY',
    'app_key': 'YOUR_APP_KEY'
}
initialize(**options)

# Query metric
result = api.Metric.query(
    start=int(time.time() - 86400),  # Last 24h
    end=int(time.time()),
    query='avg:system.cpu.user{*}'
)

# Convert to Prometheus format and import
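
One way to finish that last step is Prometheus's backfill tooling: write the Datadog series out as OpenMetrics text, then import it with promtool. A minimal sketch, reusing the `result` from the query above, assuming the standard series/pointlist response shape and treating everything as a gauge:

def to_openmetrics(result, metric_name):
    # Datadog series carry pointlist entries of [timestamp_ms, value]
    lines = [f"# TYPE {metric_name} gauge"]
    for series in result.get("series", []):
        for ts_ms, value in series.get("pointlist", []):
            if value is None:
                continue
            # OpenMetrics timestamps are in seconds
            lines.append(f"{metric_name} {value} {ts_ms / 1000:.3f}")
    lines.append("# EOF")
    return "\n".join(lines) + "\n"

with open("backfill.om", "w") as f:
    f.write(to_openmetrics(result, "system_cpu_user"))

# Then: promtool tsdb create-blocks-from openmetrics backfill.om <prometheus-data-dir>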

2. Dashboard Migration (Datadog → Grafana)

Step 1: Export Datadog Dashboards

import requests
import json

api_key = "YOUR_API_KEY"
app_key = "YOUR_APP_KEY"

headers = {
    'DD-API-KEY': api_key,
    'DD-APPLICATION-KEY': app_key
}

# Get all dashboards
response = requests.get(
    'https://api.datadoghq.com/api/v1/dashboard',
    headers=headers
)

dashboards = response.json()

# Export each dashboard
for dashboard in dashboards['dashboards']:
    dash_id = dashboard['id']
    detail = requests.get(
        f'https://api.datadoghq.com/api/v1/dashboard/{dash_id}',
        headers=headers
    ).json()

    with open(f'datadog_{dash_id}.json', 'w') as f:
        json.dump(detail, f, indent=2)

Step 2: Convert to Grafana Format

Manual Conversion Template:

Datadog Widget    Grafana Panel Type
Timeseries        Graph / Time series
Query Value       Stat
Toplist           Table / Bar gauge
Heatmap           Heatmap
Distribution      Histogram

Automated Conversion (basic example):

def convert_datadog_to_grafana(datadog_dashboard):
    grafana_dashboard = {
        "title": datadog_dashboard['title'],
        "panels": []
    }

    for widget in datadog_dashboard['widgets']:
        panel = {
            "title": widget['definition'].get('title', ''),
            "type": map_widget_type(widget['definition']['type']),
            "targets": convert_queries(widget['definition']['requests'])
        }
        grafana_dashboard['panels'].append(panel)

    return grafana_dashboard
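
The helpers above are left abstract. A minimal sketch of map_widget_type based on the widget table, plus an upload through Grafana's dashboard API (the URL and token are placeholders, and convert_queries still needs the query translations from Step 3 below):

import requests

def map_widget_type(datadog_type):
    # Fall back to a time series panel for anything unmapped
    mapping = {
        "timeseries": "timeseries",
        "query_value": "stat",
        "toplist": "table",
        "heatmap": "heatmap",
        "distribution": "histogram",
    }
    return mapping.get(datadog_type, "timeseries")

def upload_to_grafana(dashboard, grafana_url, token):
    resp = requests.post(
        f"{grafana_url}/api/dashboards/db",
        headers={"Authorization": f"Bearer {token}"},
        json={"dashboard": dashboard, "overwrite": True},
    )
    resp.raise_for_status()

# upload_to_grafana(convert_datadog_to_grafana(dd_dash), "http://localhost:3000", "YOUR_GRAFANA_TOKEN")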

Step 3: Common Query Translations

See dql_promql_translation.md for comprehensive query mappings.

Example conversions:

Datadog: avg:system.cpu.user{*}
Prometheus: avg(rate(node_cpu_seconds_total{mode="user"}[5m])) * 100

Datadog: sum:requests.count{status:200}.as_rate()
Prometheus: sum(rate(http_requests_total{status="200"}[5m]))

Datadog: p95:request.duration{*}
Prometheus: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

3. Alert Migration (Datadog Monitors → Prometheus Alerts)

Step 1: Export Datadog Monitors

import json
import requests

api_key = "YOUR_API_KEY"
app_key = "YOUR_APP_KEY"

headers = {
    'DD-API-KEY': api_key,
    'DD-APPLICATION-KEY': app_key
}

response = requests.get(
    'https://api.datadoghq.com/api/v1/monitor',
    headers=headers
)

monitors = response.json()

# Save each monitor
for monitor in monitors:
    with open(f'monitor_{monitor["id"]}.json', 'w') as f:
        json.dump(monitor, f, indent=2)

Step 2: Convert to Prometheus Alert Rules

Datadog Monitor:

{
  "name": "High CPU Usage",
  "type": "metric alert",
  "query": "avg(last_5m):avg:system.cpu.user{*} > 80",
  "message": "CPU usage is high on {{host.name}}"
}

Prometheus Alert:

groups:
  - name: infrastructure
    rules:
      - alert: HighCPUUsage
        expr: |
          100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

Step 3: Alert Routing (Datadog → Alertmanager)

Datadog notification channels → Alertmanager receivers:

# alertmanager.yml
route:
  group_by: ['alertname', 'severity']
  receiver: 'slack-notifications'
  routes:
    # Route critical alerts to PagerDuty; everything else falls through to Slack
    - matchers:
        - severity = "critical"
      receiver: 'pagerduty-critical'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK'
        channel: '#alerts'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
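
To verify routing before cutover, fire a synthetic alert at Alertmanager's v2 API (assuming it listens on localhost:9093) and watch where it lands:

import requests

test_alert = [{
    "labels": {"alertname": "MigrationRoutingTest", "severity": "critical"},
    "annotations": {"summary": "Synthetic alert to exercise the pagerduty-critical route"},
}]
requests.post("http://localhost:9093/api/v2/alerts", json=test_alert).raise_for_status()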

4. Log Migration (Datadog → Loki)

Step 1: Deploy Loki

Kubernetes:

helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=100Gi \
  --set promtail.enabled=true

Docker Compose:

version: '3'
services:
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
      - loki-data:/loki

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log
      - ./promtail-config.yaml:/etc/promtail/config.yml

volumes:
  loki-data:

Step 2: Replace Datadog Log Forwarder

Before (Datadog Agent):

# datadog.yaml
logs_enabled: true

logs_config:
  container_collect_all: true

After (Promtail):

# promtail-config.yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*.log
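
Before pointing real workloads at Loki, a synthetic round trip through the push and query APIs confirms the pipeline works (assuming Loki on localhost:3100):

import time
import requests

# Push one test line (values are [unix-nanoseconds, line] pairs)
payload = {
    "streams": [{
        "stream": {"job": "migration-test", "level": "info"},
        "values": [[str(time.time_ns()), "hello from the Datadog migration"]],
    }]
}
requests.post("http://localhost:3100/loki/api/v1/push", json=payload).raise_for_status()

# Read it back with LogQL (query_range defaults to the last hour)
resp = requests.get(
    "http://localhost:3100/loki/api/v1/query_range",
    params={"query": '{job="migration-test"}'},
)
print(resp.json()["data"]["result"])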

Step 3: Query Translation

Datadog Logs Query:

service:my-app status:error

Loki LogQL:

{job="my-app", level="error"}

More examples:

Datadog: service:api-gateway status:error @http.status_code:>=500
Loki: {service="api-gateway", level="error"} | json | http_status_code >= 500

Datadog: source:nginx "404"
Loki: {source="nginx"} |= "404"

5. APM Migration (Datadog APM → Tempo/Jaeger)

Step 1: Choose Tracing Backend

  • Tempo: Better for high volume, cheaper storage (object storage)
  • Jaeger: More mature, better UI, requires separate storage

Step 2: Replace Datadog Tracer with OpenTelemetry

Before (Datadog Python):

from ddtrace import tracer

@tracer.wrap()
def my_function():
    pass

After (OpenTelemetry):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup: the exporter must be attached via a span processor,
# otherwise spans are created but never shipped to Tempo
exporter = OTLPSpanExporter(endpoint="tempo:4317", insecure=True)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("my_function")
def my_function():
    pass
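
ddtrace also patched libraries automatically; the OpenTelemetry equivalent is per-library instrumentation packages. For example, assuming opentelemetry-instrumentation-requests is installed:

from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Outgoing HTTP calls made with `requests` now produce client spans
RequestsInstrumentor().instrument()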

Step 3: Deploy Tempo

# tempo.yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317

storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.amazonaws.com
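
A quick sanity check that Tempo has started and is serving its HTTP port (assuming the readiness endpoint Grafana's backends conventionally expose):

import requests

resp = requests.get("http://localhost:3200/ready")
print(resp.status_code, resp.text)  # expect 200 once Tempo is up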

6. Infrastructure Migration

┌─────────────────────────────────────────┐
│  Grafana (Visualization)                │
│  - Dashboards                           │
│  - Unified view                         │
└─────────────────────────────────────────┘
         ↓           ↓           ↓
┌──────────────┐ ┌──────────┐ ┌──────────┐
│  Prometheus  │ │   Loki   │ │  Tempo   │
│  (Metrics)   │ │  (Logs)  │ │ (Traces) │
└──────────────┘ └──────────┘ └──────────┘
         ↓           ↓           ↓
┌─────────────────────────────────────────┐
│  Applications (OpenTelemetry)           │
└─────────────────────────────────────────┘

Sizing Recommendations

100-host environment:

  • Prometheus: 2-4 CPU, 8-16GB RAM, 100GB SSD
  • Grafana: 1 CPU, 2GB RAM
  • Loki: 2-4 CPU, 8GB RAM, 100GB SSD
  • Tempo: 2-4 CPU, 8GB RAM, S3 for storage
  • Alertmanager: 1 CPU, 1GB RAM

Total: ~8-16 CPU, 32-64GB RAM, 200GB SSD + object storage


7. Migration Checklist

Pre-Migration

  • Calculate current Datadog costs
  • Identify all Datadog integrations
  • Export all dashboards
  • Export all monitors
  • Document custom metrics
  • Get stakeholder approval

During Migration

  • Deploy Prometheus + Grafana
  • Deploy Loki + Promtail
  • Deploy Tempo/Jaeger (if using APM)
  • Migrate metrics instrumentation
  • Convert dashboards (top 10 critical first)
  • Convert alerts (critical alerts first)
  • Update application logging
  • Replace APM instrumentation
  • Run parallel for 2-4 weeks
  • Validate data accuracy
  • Train team on new tools

Post-Migration

  • Decommission Datadog agent from all hosts
  • Cancel Datadog subscription
  • Archive Datadog configs
  • Document new workflows
  • Create runbooks for common tasks

8. Common Challenges & Solutions

Challenge: Missing Datadog Features

Datadog Synthetic Monitoring:

  • Solution: Use Blackbox Exporter (Prometheus) or Grafana Synthetic Monitoring

Datadog Network Performance Monitoring:

  • Solution: Use Cilium Hubble (Kubernetes) or eBPF-based tools

Datadog RUM (Real User Monitoring):

  • Solution: Use Grafana Faro or OpenTelemetry Browser SDK

Challenge: Team Learning Curve

Solution:

  • Provide training sessions (2-3 hours per tool)
  • Create internal documentation with examples
  • Set up sandbox environment for practice
  • Assign champions for each tool

Challenge: Query Performance

Prometheus too slow:

  • Use Thanos or Cortex for scaling
  • Implement recording rules for expensive queries
  • Increase retention only where needed

Loki too slow:

  • Keep labels few and low-cardinality; filter with line/label filter expressions instead
  • Use chunk caching
  • Consider parallel query execution

9. Maintenance Comparison

Datadog (Managed)

  • Ops burden: Low (fully managed)
  • Upgrades: Automatic
  • Scaling: Automatic
  • Cost: High ($6k-10k+/month)

Open-Source Stack (Self-hosted)

  • Ops burden: Medium (requires ops team)
  • Upgrades: Manual (quarterly)
  • Scaling: Manual planning required
  • Cost: Low ($1.5k-3k/month infrastructure)

Hybrid Option: Use Grafana Cloud (managed Prometheus/Loki/Tempo)

  • Cost: ~$3k/month for 100 hosts
  • Ops burden: Low
  • Savings: ~50% vs Datadog

10. ROI Calculation

Example Scenario

Before (Datadog):

  • Monthly cost: $7,000
  • Annual cost: $84,000

After (Self-hosted OSS):

  • Infrastructure: $1,800/month
  • Operations (0.5 FTE): $4,000/month
  • Annual cost: $69,600

Savings: $14,400/year

After (Grafana Cloud):

  • Monthly cost: $3,500
  • Annual cost: $42,000

Savings: $42,000/year (50%)

Break-even: Immediate (no migration costs beyond engineering time)
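
"Immediate" assumes the engineering time is already budgeted; to account for it explicitly, the payback period is a one-liner (the $20,000 migration cost below is a hypothetical, not a figure from this guide):

def payback_months(datadog_monthly, new_monthly, migration_cost):
    # Months until cumulative savings cover the one-off migration effort
    return migration_cost / (datadog_monthly - new_monthly)

print(payback_months(7000, 3500, 20000))  # ~5.7 months on the Grafana Cloud numbers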


Support

If you need help with migration:

  • Grafana Labs offers migration consulting
  • Many SRE consulting firms specialize in this
  • Community support via Slack/Discord channels