Metrics Design Guide
The Four Golden Signals
The Four Golden Signals from Google's SRE book provide a comprehensive view of system health:
1. Latency
What: Time to service a request
Why Monitor: Directly impacts user experience
Key Metrics:
- Request duration (p50, p95, p99, p99.9)
- Time to first byte (TTFB)
- Backend processing time
- Database query latency
PromQL Examples:
# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Average latency by endpoint (sum of rates, not average of rates -- see Pitfall 3)
sum(rate(http_request_duration_seconds_sum[5m])) by (endpoint)
/
sum(rate(http_request_duration_seconds_count[5m])) by (endpoint)
Alert Thresholds:
- Warning: p95 > 500ms
- Critical: p99 > 2s
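These thresholds map directly onto a Prometheus alerting rule. A minimal sketch, assuming the http_request_duration_seconds histogram above; the group name, for duration, and annotation text are illustrative:
groups:
  - name: latency
    rules:
      - alert: HighP99Latency
        # Fire only when p99 stays above the critical threshold for 5 minutes
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: p99 request latency above 2s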
2. Traffic
What: Demand on your system
Why Monitor: Understand load patterns, capacity planning
Key Metrics:
- Requests per second (RPS)
- Transactions per second (TPS)
- Concurrent connections
- Network throughput
PromQL Examples:
# Requests per second
sum(rate(http_requests_total[5m]))
# Requests per second by status code
sum(rate(http_requests_total[5m])) by (status)
# Traffic ratio vs. the same window one week ago (> 1 means growth)
sum(rate(http_requests_total[5m]))
/
sum(rate(http_requests_total[5m] offset 7d))
Alert Thresholds:
- Warning: RPS > 80% of capacity
- Critical: RPS > 95% of capacity
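Prometheus has no built-in notion of "capacity", so these thresholds assume you publish it yourself. A sketch, where capacity_max_rps is a hypothetical gauge exported from load-test results:
# Fraction of known capacity in use; alert at > 0.8 (warning) or > 0.95 (critical)
sum(rate(http_requests_total[5m])) / scalar(capacity_max_rps)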
3. Errors
What: Rate of requests that fail
Why Monitor: Direct indicator of user-facing problems
Key Metrics:
- Error rate (%)
- 5xx response codes
- Failed transactions
- Exception counts
PromQL Examples:
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# Error count by type
sum(rate(http_requests_total{status=~"5.."}[5m])) by (status)
# Application errors
rate(application_errors_total[5m])
Alert Thresholds:
- Warning: Error rate > 1%
- Critical: Error rate > 5%
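At low traffic a handful of failures can spike the error percentage, so it helps to gate the alert on a minimum request rate. A sketch of the critical condition; the 1 req/s floor is illustrative:
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) > 0.05
and
sum(rate(http_requests_total[5m])) > 1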
4. Saturation
What: How "full" your service is
Why Monitor: Predict capacity issues before they impact users
Key Metrics:
- CPU utilization
- Memory utilization
- Disk I/O
- Network bandwidth
- Queue depth
- Thread pool usage
PromQL Examples:
# CPU saturation
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory saturation
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk saturation
rate(node_disk_io_time_seconds_total[5m]) * 100
# Queue depth
queue_depth_current / queue_depth_max * 100
Alert Thresholds:
- Warning: > 70% utilization
- Critical: > 90% utilization
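Because saturation is about predicting problems before impact, PromQL's predict_linear() is a natural companion to these thresholds. A sketch forecasting disk exhaustion; the 6h sample window and 4-hour horizon are illustrative:
# Alert if the filesystem is on track to fill within 4 hours (14400s)
predict_linear(node_filesystem_avail_bytes[6h], 14400) < 0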
RED Method (for Services)
Rate, Errors, Duration - a simplified approach for request-driven services
Rate
Number of requests per second:
sum(rate(http_requests_total[5m]))
Errors
Number of failed requests per second:
sum(rate(http_requests_total{status=~"5.."}[5m]))
Duration
Time taken to process requests:
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
When to Use: Microservices, APIs, web applications
USE Method (for Resources)
Utilization, Saturation, Errors - for infrastructure resources
Utilization
Percentage of time resource is busy:
# CPU utilization
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Disk utilization
(node_filesystem_size_bytes - node_filesystem_avail_bytes)
/ node_filesystem_size_bytes * 100
Saturation
Amount of work the resource cannot service (queued):
# Load average (saturation indicator)
node_load15
# Disk I/O wait time
rate(node_disk_io_time_weighted_seconds_total[5m])
Errors
Count of error events:
# Network errors
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])
# Disk errors (whether this metric is exposed depends on your exporter)
rate(node_disk_io_errors_total[5m])
When to Use: Servers, databases, network devices
Metric Types
Counter
Monotonically increasing value (never decreases, except a reset to zero when the process restarts)
Examples: Request count, error count, bytes sent
Usage:
# Always use rate() or increase() with counters
rate(http_requests_total[5m]) # Requests per second
increase(http_requests_total[1h]) # Total requests in 1 hour
Gauge
Value that can go up and down
Examples: Memory usage, queue depth, concurrent connections
Usage:
# Use directly or with aggregations
avg(memory_usage_bytes)
max(queue_depth)
Histogram
Samples observations and counts them in configurable buckets
Examples: Request duration, response size
Usage:
# Calculate percentiles
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Average from histogram
rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])
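Percentile accuracy depends on the bucket boundaries you configure, so place them around the thresholds you alert on. A minimal prometheus_client sketch; the boundaries are illustrative:
from prometheus_client import Histogram

# Buckets bracket the 500ms warning and 2s critical latency thresholds above
request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    buckets=[0.05, 0.1, 0.25, 0.5, 1, 2, 5],
)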
Summary
Similar to a histogram, but quantiles are computed on the client side
Usage: Client-side quantiles cannot be aggregated across instances, making summaries less flexible than histograms; avoid them for new metrics
Cardinality Best Practices
Cardinality: Number of unique time series
High Cardinality Labels (AVOID)
❌ User ID
❌ Email address
❌ IP address
❌ Timestamp
❌ Random IDs
Low Cardinality Labels (GOOD)
✅ Environment (prod, staging)
✅ Region (us-east-1, eu-west-1)
✅ Service name
✅ HTTP status code category (2xx, 4xx, 5xx)
✅ Endpoint/route
Calculating Cardinality Impact
Time series = unique combinations of labels
Example:
service (5) × environment (3) × region (4) × status (5) = 300 time series ✅
service (5) × environment (3) × region (4) × user_id (1M) = 60M time series ❌
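You can measure actual cardinality in PromQL before it becomes a problem; both queries below use standard Prometheus metrics and functions:
# Number of series behind one metric name
count(http_requests_total)
# Total series currently in Prometheus's head block
prometheus_tsdb_head_series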
Naming Conventions
Prometheus Naming
<namespace>_<name>_<unit>   (counters append a _total suffix)
Examples:
http_requests_total
http_request_duration_seconds
process_cpu_seconds_total
node_memory_MemAvailable_bytes
Rules:
- Use snake_case
- Include unit in name (seconds, bytes, ratio)
- Use _total suffix for counters
- Namespace by application/component
CloudWatch Naming
<Namespace>/<MetricName>
Examples:
AWS/EC2/CPUUtilization
MyApp/RequestCount
Rules:
- Use PascalCase
- Group by namespace
- No unit in name (specified separately)
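The "unit specified separately" rule is visible in the publishing API itself. A minimal boto3 sketch, assuming AWS credentials are configured; the namespace and metric follow the MyApp/RequestCount example above:
import boto3

cloudwatch = boto3.client('cloudwatch')
# The unit travels as a separate field, never in the metric name
cloudwatch.put_metric_data(
    Namespace='MyApp',
    MetricData=[{
        'MetricName': 'RequestCount',
        'Value': 1,
        'Unit': 'Count',
    }],
)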
Dashboard Design
Key Principles
- Top-Down Layout: Most important metrics first
- Color Coding: Red (critical), yellow (warning), green (healthy)
- Consistent Time Windows: All panels use same time range
- Limit Panels: 8-12 panels per dashboard maximum
- Include Context: Show related metrics together
Dashboard Structure
┌─────────────────────────────────────────────┐
│ Overall Health (Single Stats) │
│ [Requests/s] [Error%] [P95 Latency] │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Request Rate & Errors (Graphs) │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Latency Distribution (Graphs) │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Resource Usage (Graphs) │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Dependencies (Graphs) │
└─────────────────────────────────────────────┘
Template Variables
Use variables for filtering:
- Environment: $environment
- Service: $service
- Region: $region
- Pod: $pod
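In Grafana these variables interpolate directly into PromQL, so one dashboard serves every environment and service. A sketch using the variables above:
sum(rate(http_requests_total{environment="$environment", service="$service"}[5m]))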
Common Pitfalls
1. Monitoring What You Build, Not What Users Experience
❌ backend_processing_complete
✅ user_request_completed
2. Too Many Metrics
- Start with Four Golden Signals
- Add metrics only when needed for specific issues
- Remove unused metrics
3. Incorrect Aggregations
❌ avg(rate(...)) - gives every instance equal weight regardless of traffic
✅ sum(rate(..._sum[5m])) / sum(rate(..._count[5m])) - traffic-weighted average from a histogram
4. Wrong Time Windows
- Too short (< 1m): Noisy data
- Too long (> 15m): Miss short-lived issues
- Sweet spot: 5m for most alerts
5. Missing Labels
❌ http_requests_total
✅ http_requests_total{method="GET", status="200", endpoint="/api/users"}
Metric Collection Best Practices
Application Instrumentation
from prometheus_client import Counter, Histogram, Gauge

# Counter for requests
requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status'],
)

# Histogram for latency
request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint'],
)

# Gauge for in-progress requests
requests_in_progress = Gauge(
    'http_requests_in_progress',
    'HTTP requests currently being processed',
)
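Together these three objects cover a request's full lifecycle. A minimal usage sketch with prometheus_client; handle_get_users and port 8000 are illustrative:
from prometheus_client import start_http_server

@requests_in_progress.track_inprogress()  # gauge up on entry, down on exit
@request_duration.labels(method='GET', endpoint='/api/users').time()  # observe duration
def handle_get_users():
    # ... handle the request, then record the outcome
    requests_total.labels(method='GET', endpoint='/api/users', status='200').inc()

start_http_server(8000)  # expose /metrics for Prometheus to scrape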
Collection Intervals
- Application metrics: 15-30s
- Infrastructure metrics: 30-60s
- Billing/cost metrics: 5-15m
- External API checks: 1-5m
Retention
- Raw metrics: 15-30 days
- 5m aggregates: 90 days
- 1h aggregates: 1 year
- Daily aggregates: 2+ years