2025-11-29 17:51:22 +08:00

# DQL (Datadog Query Language) ↔ PromQL Translation Guide
## Quick Reference
| Concept | Datadog (DQL) | Prometheus (PromQL) |
|---------|---------------|---------------------|
| Aggregation | `avg:`, `sum:`, `min:`, `max:` | `avg()`, `sum()`, `min()`, `max()` |
| Rate | `.as_rate()`, `.as_count()` | `rate()`, `increase()` |
| Percentile | `p50:`, `p95:`, `p99:` | `histogram_quantile()` |
| Filtering | `{tag:value}` | `{label="value"}` |
| Time window | `last_5m`, `last_1h` | `[5m]`, `[1h]` |
---
## Basic Queries
### Simple Metric Query
**Datadog**:
```
system.cpu.user
```
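**Prometheus**:
```promql
node_cpu_seconds_total{mode="user"}
```
Note: `node_cpu_seconds_total` is a counter of CPU seconds, while `system.cpu.user` is a percentage gauge. To get comparable values, wrap the selector in `rate(...[5m]) * 100`.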
---
### Metric with Filter
**Datadog**:
```
system.cpu.user{host:web-01}
```
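**Prometheus**:
```promql
node_cpu_seconds_total{mode="user", instance="web-01"}
```
Note: The `instance` label usually includes the exporter port (for example `web-01:9100`), so an exact match may need the port or a regex matcher.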
---
### Multiple Filters (AND)
**Datadog**:
```
system.cpu.user{host:web-01,env:production}
```
**Prometheus**:
```promql
node_cpu_seconds_total{mode="user", instance="web-01", env="production"}
```
---
### Wildcard Filters
**Datadog**:
```
system.cpu.user{host:web-*}
```
**Prometheus**:
```promql
node_cpu_seconds_total{mode="user", instance=~"web-.*"}
```
---
### OR Filters
**Datadog**:
```
system.cpu.user{host:web-01 OR host:web-02}
```
**Prometheus**:
```promql
node_cpu_seconds_total{mode="user", instance=~"web-01|web-02"}
```
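The filter rules above (comma means AND, `*` becomes a regex, tags become labels) can be sketched as a small Python helper. This is only an illustration, not an official converter: the function name `tags_to_matchers` and the `rename` mapping are hypothetical, it assumes tag values contain only letters, digits, `-`, `_`, `.` and `*`, and it does not handle `OR` scopes.

```python
def tags_to_matchers(scope, rename=None):
    """Turn a Datadog scope like 'host:web-*,env:production' into PromQL matchers."""
    rename = rename or {}  # hypothetical tag -> label renames, e.g. host -> instance
    scope = scope.strip()
    if scope in ("", "*"):
        return ""  # '{*}' means "no filter", so no matcher block is needed
    matchers = []
    for part in scope.split(","):
        key, _, value = part.strip().partition(":")
        label = rename.get(key, key)
        if "*" in value:
            # Datadog wildcard -> regex matcher; '[.]' keeps literal dots literal
            pattern = value.replace(".", "[.]").replace("*", ".*")
            matchers.append(f'{label}=~"{pattern}"')
        else:
            matchers.append(f'{label}="{value}"')
    return "{" + ", ".join(matchers) + "}"


print("node_cpu_seconds_total" + tags_to_matchers("host:web-*,env:production", {"host": "instance"}))
# node_cpu_seconds_total{instance=~"web-.*", env="production"}
```

The character-class form `[.]` is used instead of a backslash escape so the result does not need extra escaping inside a double-quoted PromQL string.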
---
## Aggregations
### Average
**Datadog**:
```
avg:system.cpu.user{*}
```
**Prometheus**:
```promql
avg(node_cpu_seconds_total{mode="user"})
```
---
### Sum
**Datadog**:
```
sum:requests.count{*}
```
**Prometheus**:
```promql
sum(http_requests_total)
```
---
### Min/Max
**Datadog**:
```
min:system.mem.free{*}
max:system.mem.free{*}
```
**Prometheus**:
```promql
min(node_memory_MemFree_bytes)
max(node_memory_MemFree_bytes)
```
---
### Aggregation by Tag/Label
**Datadog**:
```
avg:system.cpu.user{*} by {host}
```
**Prometheus**:
```promql
avg by (instance) (node_cpu_seconds_total{mode="user"})
```
---
## Rates and Counts
### Rate (per second)
**Datadog**:
```
sum:requests.count{*}.as_rate()
```
**Prometheus**:
```promql
sum(rate(http_requests_total[5m]))
```
Note: Prometheus requires explicit time window `[5m]`
---
### Count (total over time)
**Datadog**:
```
sum:requests.count{*}.as_count()
```
**Prometheus**:
```promql
sum(increase(http_requests_total[1h]))
```
---
### Derivative (change over time)
**Datadog**:
```
derivative(avg:system.disk.used{*})
```
**Prometheus**:
```promql
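avg(deriv((node_filesystem_size_bytes - node_filesystem_free_bytes)[5m:]))
```
Note: node_exporter has no "used" metric, so used space is size minus free. The `[5m:]` subquery is needed because `deriv()` takes a range vector.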
---
## Percentiles
### P50 (Median)
**Datadog**:
```
p50:request.duration{*}
```
**Prometheus** (requires histogram):
```promql
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
---
### P95
**Datadog**:
```
p95:request.duration{*}
```
**Prometheus**:
```promql
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
---
### P99
**Datadog**:
```
p99:request.duration{*}
```
**Prometheus**:
```promql
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
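To see why the Prometheus side needs `_bucket` series and the `le` label, it can help to compute a quantile from cumulative buckets by hand. The sketch below is a simplified take on the interpolation idea, not Prometheus's actual implementation: `estimate_quantile` is a hypothetical name, and it assumes `0 < q <= 1`, positive bucket bounds, at least one finite bucket, and a `+Inf` bucket holding the total count.

```python
import math


def estimate_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), including (math.inf, total)."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total  # how many observations lie at or below the quantile
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(estimate_quantile(0.95, latency_buckets))
```

Because the result is interpolated within a bucket, its accuracy depends on how the bucket boundaries are chosen, which is one reason values may differ from Datadog's percentiles.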
---
## Time Windows
### Last 5 minutes
**Datadog**:
```
avg(last_5m):system.cpu.user{*}
```
**Prometheus**:
```promql
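avg_over_time(node_cpu_seconds_total{mode="user"}[5m])
```
Note: `avg()` aggregates across series and takes an instant vector; averaging over a time window uses `avg_over_time()`.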
---
### Last 1 hour
**Datadog**:
```
avg(last_1h):system.cpu.user{*}
```
**Prometheus**:
```promql
avg_over_time(node_cpu_seconds_total{mode="user"}[1h])
```
---
## Math Operations
### Division
**Datadog**:
```
avg:system.mem.used{*} / avg:system.mem.total{*}
```
**Prometheus**:
```promql
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
```
---
### Multiplication
**Datadog**:
```
avg:system.cpu.user{*} * 100
```
**Prometheus**:
```promql
avg(node_cpu_seconds_total{mode="user"}) * 100
```
---
### Percentage Calculation
**Datadog**:
```
(sum:requests.errors{*} / sum:requests.count{*}) * 100
```
**Prometheus**:
```promql
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
```
---
## Common Use Cases
### CPU Usage Percentage
**Datadog**:
```
100 - avg:system.cpu.idle{*}
```
**Prometheus**:
```promql
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```
---
### Memory Usage Percentage
**Datadog**:
```
(avg:system.mem.used{*} / avg:system.mem.total{*}) * 100
```
**Prometheus**:
```promql
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
```
---
### Disk Usage Percentage
**Datadog**:
```
(avg:system.disk.used{*} / avg:system.disk.total{*}) * 100
```
**Prometheus**:
```promql
(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100
```
---
### Request Rate (requests/sec)
**Datadog**:
```
sum:requests.count{*}.as_rate()
```
**Prometheus**:
```promql
sum(rate(http_requests_total[5m]))
```
---
### Error Rate Percentage
**Datadog**:
```
(sum:requests.errors{*}.as_rate() / sum:requests.count{*}.as_rate()) * 100
```
**Prometheus**:
```promql
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
```
---
### Request Latency (P95)
**Datadog**:
```
p95:request.duration{*}
```
**Prometheus**:
```promql
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
---
### Top 5 Hosts by CPU
**Datadog**:
```
top(avg:system.cpu.user{*} by {host}, 5, 'mean', 'desc')
```
**Prometheus**:
```promql
topk(5, avg by (instance) (rate(node_cpu_seconds_total{mode="user"}[5m])))
```
---
## Functions
### Absolute Value
**Datadog**:
```
abs(diff(avg:system.cpu.user{*}))
```
**Prometheus**:
```promql
abs(delta(node_cpu_seconds_total{mode="user"}[5m]))
```
---
### Ceiling/Floor
**Datadog**:
```
ceil(avg:system.cpu.user{*})
floor(avg:system.cpu.user{*})
```
**Prometheus**:
```promql
ceil(avg(node_cpu_seconds_total{mode="user"}))
floor(avg(node_cpu_seconds_total{mode="user"}))
```
---
### Clamp (Limit Range)
**Datadog**:
```
clamp_min(avg:system.cpu.user{*}, 0)
clamp_max(avg:system.cpu.user{*}, 100)
```
**Prometheus**:
```promql
clamp_min(avg(node_cpu_seconds_total{mode="user"}), 0)
clamp_max(avg(node_cpu_seconds_total{mode="user"}), 100)
```
---
### Moving Average
**Datadog**:
```
moving_rollup(avg:system.cpu.user{*}, 60, 'avg')
```
**Prometheus**:
```promql
avg_over_time(node_cpu_seconds_total{mode="user"}[1m])
```
---
## Advanced Patterns
### Compare to Previous Period
**Datadog**:
```
sum:requests.count{*}.as_rate() / timeshift(sum:requests.count{*}.as_rate(), -3600)
```
**Prometheus**:
```promql
sum(rate(http_requests_total[5m])) / sum(rate(http_requests_total[5m] offset 1h))
```
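Datadog expresses the shift in seconds, while PromQL's `offset` takes a duration such as `1h`. A minimal sketch of that conversion follows; the helper names `seconds_to_duration` and `with_offset` are hypothetical, and the sign of the shift is simply dropped because `offset` with a positive duration already looks into the past.

```python
# PromQL duration suffixes, largest first
UNITS = [("w", 604800), ("d", 86400), ("h", 3600), ("m", 60), ("s", 1)]


def seconds_to_duration(seconds):
    """Render a shift in seconds as the largest PromQL unit that divides it evenly."""
    seconds = abs(int(seconds))
    if seconds == 0:
        raise ValueError("a zero shift needs no offset")
    for suffix, size in UNITS:
        if seconds % size == 0:
            return f"{seconds // size}{suffix}"


def with_offset(range_selector, shift_seconds):
    """Append an offset modifier directly after a selector, where PromQL expects it."""
    return f"{range_selector} offset {seconds_to_duration(shift_seconds)}"


print(with_offset("http_requests_total[5m]", 3600))
# http_requests_total[5m] offset 1h
```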
---
### Forecast
**Datadog**:
```
forecast(avg:system.disk.used{*}, 'linear', 1)
```
**Prometheus**:
```promql
predict_linear((node_filesystem_size_bytes - node_filesystem_free_bytes)[1h:], 3600)
```
Note: Predicts value 1 hour in future based on last 1 hour trend
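The idea behind a linear forecast can be sketched with an ordinary least-squares fit. This is a simplified illustration rather than the exact Prometheus or Datadog algorithm: the function name `linear_forecast` is hypothetical, it predicts relative to the last sample instead of the query evaluation time, and it assumes at least two samples with distinct timestamps.

```python
def linear_forecast(samples, seconds_ahead):
    """samples: list of (timestamp_seconds, value). Extrapolate a least-squares line."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    # Slope = covariance(t, v) / variance(t)
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    target_time = samples[-1][0] + seconds_ahead
    return intercept + slope * target_time


# Disk usage in GiB sampled every 30 minutes, growing 50 GiB per half hour
usage = [(0, 100.0), (1800, 150.0), (3600, 200.0)]
print(linear_forecast(usage, 3600))
```

A linear fit only makes sense when the recent trend is roughly steady; bursty or seasonal metrics need a longer window or a different model.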
---
### Anomaly Detection
**Datadog**:
```
anomalies(avg:system.cpu.user{*}, 'basic', 2)
```
**Prometheus**: No built-in function
- Use recording rules with stddev
- External tools like **Robust Perception's anomaly detector**
- Or use **Grafana ML** plugin
---
### Outlier Detection
**Datadog**:
```
outliers(avg:system.cpu.user{*} by {host}, 'mad')
```
**Prometheus**: No built-in function
- Calculate manually with stddev:
```promql
abs(metric - scalar(avg(metric))) > 2 * scalar(stddev(metric))
```
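The stddev rule can be tried offline on a handful of values before turning it into a query or alert. The sketch below is an assumption-laden illustration: `stddev_outliers` is a hypothetical name, the sample data is made up, and it uses the population standard deviation, which is what PromQL's `stddev` computes. It is not equivalent to Datadog's `'mad'` algorithm.

```python
from statistics import mean, pstdev


def stddev_outliers(values, k=2.0):
    """values: mapping of series name -> latest sample.

    Returns the names whose value is more than k standard deviations from the mean.
    """
    avg = mean(values.values())
    spread = pstdev(values.values())  # population stddev
    return sorted(name for name, v in values.items() if abs(v - avg) > k * spread)


# Made-up CPU percentages per host
cpu = {"web-01": 10, "web-02": 11, "web-03": 9, "web-04": 10, "web-05": 10, "web-06": 50}
print(stddev_outliers(cpu))
# ['web-06']
```

With very few series a single extreme value inflates the standard deviation, so the threshold `k` may need tuning per metric.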
---
## Container & Kubernetes
### Container CPU Usage
**Datadog**:
```
avg:docker.cpu.usage{*} by {container_name}
```
**Prometheus**:
```promql
avg by (container) (rate(container_cpu_usage_seconds_total[5m]))
```
---
### Container Memory Usage
**Datadog**:
```
avg:docker.mem.rss{*} by {container_name}
```
**Prometheus**:
```promql
avg by (container) (container_memory_rss)
```
---
### Pod Count by Status
**Datadog**:
```
sum:kubernetes.pods.running{*} by {kube_namespace}
```
**Prometheus**:
```promql
sum by (namespace) (kube_pod_status_phase{phase="Running"})
```
---
## Database Queries
### MySQL Queries Per Second
**Datadog**:
```
sum:mysql.performance.queries{*}.as_rate()
```
**Prometheus**:
```promql
sum(rate(mysql_global_status_queries[5m]))
```
---
### PostgreSQL Active Connections
**Datadog**:
```
avg:postgresql.connections{*}
```
**Prometheus**:
```promql
avg(pg_stat_database_numbackends)
```
---
### Redis Memory Usage
**Datadog**:
```
avg:redis.mem.used{*}
```
**Prometheus**:
```promql
avg(redis_memory_used_bytes)
```
---
## Network Metrics
### Network Bytes Sent
**Datadog**:
```
sum:system.net.bytes_sent{*}.as_rate()
```
**Prometheus**:
```promql
sum(rate(node_network_transmit_bytes_total[5m]))
```
---
### Network Bytes Received
**Datadog**:
```
sum:system.net.bytes_rcvd{*}.as_rate()
```
**Prometheus**:
```promql
sum(rate(node_network_receive_bytes_total[5m]))
```
---
## Key Differences
### 1. Time Windows
- **Datadog**: Optional, defaults to query time range
- **Prometheus**: Always required for rate/increase functions
### 2. Histograms
- **Datadog**: Percentiles available directly
- **Prometheus**: Requires histogram buckets + `histogram_quantile()`
### 3. Default Aggregation
- **Datadog**: No default, must specify
- **Prometheus**: Returns all time series unless aggregated
### 4. Metric Types
- **Datadog**: Has metric types (count, rate, gauge, distribution), but queries use largely the same syntax for all of them
- **Prometheus**: Explicit types (counter, gauge, histogram, summary)
### 5. Tag vs Label
- **Datadog**: Uses "tags" (key:value)
- **Prometheus**: Uses "labels" (key="value")
---
## Migration Tips
1. **Start with dashboards**: Convert most-used dashboards first
2. **Use recording rules**: Pre-calculate expensive PromQL queries
3. **Test in parallel**: Run both systems during migration
4. **Document mappings**: Create team-specific translation guide
5. **Train team**: PromQL has learning curve, invest in training
---
## Tools
- **Datadog Dashboard Exporter**: Export JSON dashboards
- **Grafana Dashboard Linter**: Validate converted dashboards
- **PromQL Learning Resources**: https://prometheus.io/docs/prometheus/latest/querying/basics/
---
## Common Gotchas
### Rate without Time Window
**Wrong**:
```promql
rate(http_requests_total)
```
**Correct**:
```promql
rate(http_requests_total[5m])
```
---
### Aggregating Before Rate
**Wrong**:
```promql
rate(sum(http_requests_total)[5m])
```
**Correct**:
```promql
sum(rate(http_requests_total[5m]))
```
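One reason the order matters is counter resets. The sketch below uses made-up samples and a hypothetical `increase` helper (reset-aware, without the extrapolation that Prometheus applies) to show how summing first turns one restarted process into a large phantom increase.

```python
def increase(values):
    """Reset-aware growth of a single counter series (no extrapolation)."""
    total, prev = 0.0, values[0]
    for v in values[1:]:
        # A drop means the counter restarted from zero, so count the new value as growth
        total += v if v < prev else v - prev
        prev = v
    return total


steady = [100, 110, 120]   # counter on one instance
restarted = [500, 510, 5]  # second instance restarts before the third sample

# Correct order: handle resets per series, then aggregate
per_series_then_sum = increase(steady) + increase(restarted)

# Wrong order: aggregate first, then look for growth
summed = [a + b for a, b in zip(steady, restarted)]  # [600, 620, 125]
sum_then_increase = increase(summed)

print(per_series_then_sum, sum_then_increase)
# 35.0 145.0
```

After summing, the drop from 620 to 125 looks like a reset of the whole series, so the entire 125 is counted as new growth even though only 35 requests actually arrived.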
---
### Histogram Quantile Without by (le)
**Wrong**:
```promql
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])))
```
**Correct**:
```promql
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
---
## Quick Conversion Checklist
When converting a Datadog query to PromQL:
- [ ] Replace metric name (e.g., `system.cpu.user` → `node_cpu_seconds_total`)
- [ ] Convert tags to labels (`{tag:value}` → `{label="value"}`)
- [ ] Add time window for rate/increase (`[5m]`)
- [ ] Change aggregation syntax (`avg:` → `avg()`)
- [ ] Convert percentiles to histogram_quantile if needed
- [ ] Test query in Prometheus before adding to dashboard
- [ ] Add `by (label)` for grouped aggregations
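Three of the checklist steps (metric rename, aggregation syntax, and `by` grouping) are mechanical enough to sketch in code. The helper below is a hypothetical illustration that only understands unfiltered gauge queries of the form `agg:metric{*} by {tag}`; the name `translate_simple` and both mapping dictionaries are assumptions, and anything involving rates, percentiles, or filters still needs manual review.

```python
import re

# Matches e.g. 'avg:system.mem.free{*} by {host}' (scope must be '{*}')
DD_QUERY = re.compile(r"^(avg|sum|min|max):([\w.]+)\{\*\}(?:\s+by\s+\{([\w,\s]+)\})?$")


def translate_simple(query, metric_map, label_map=None):
    """Translate a simple Datadog gauge query into PromQL using caller-supplied mappings."""
    label_map = label_map or {}
    match = DD_QUERY.match(query.strip())
    if match is None:
        raise ValueError(f"unsupported query shape: {query!r}")
    agg, metric, group = match.groups()
    prom_metric = metric_map[metric]  # KeyError flags a metric that still needs mapping
    if group:
        labels = ", ".join(label_map.get(tag.strip(), tag.strip()) for tag in group.split(","))
        return f"{agg} by ({labels}) ({prom_metric})"
    return f"{agg}({prom_metric})"


metric_map = {"system.mem.free": "node_memory_MemFree_bytes"}
print(translate_simple("avg:system.mem.free{*} by {host}", metric_map, {"host": "instance"}))
# avg by (instance) (node_memory_MemFree_bytes)
```

Letting unmapped metrics raise `KeyError` is deliberate in this sketch: a missing mapping is surfaced immediately instead of producing a query that silently returns no data.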
---
## Need More Help?
- See `datadog_migration.md` for full migration guide
- PromQL documentation: https://prometheus.io/docs/prometheus/latest/querying/
- Practice at: https://demo.promlens.com/