# Metrics Design Guide

## The Four Golden Signals

The Four Golden Signals from Google's SRE book provide a comprehensive view of system health:

### 1. Latency
**What**: Time to service a request

**Why Monitor**: Directly impacts user experience

**Key Metrics**:
- Request duration (p50, p95, p99, p99.9)
- Time to first byte (TTFB)
- Backend processing time
- Database query latency

**PromQL Examples**:
```promql
# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Average latency by endpoint (sum of sums over sum of counts,
# not an average of per-instance rates)
sum(rate(http_request_duration_seconds_sum[5m])) by (endpoint)
/
sum(rate(http_request_duration_seconds_count[5m])) by (endpoint)
```

**Alert Thresholds**:
- Warning: p95 > 500ms
- Critical: p99 > 2s

### 2. Traffic
**What**: Demand on your system

**Why Monitor**: Understand load patterns, capacity planning

**Key Metrics**:
- Requests per second (RPS)
- Transactions per second (TPS)
- Concurrent connections
- Network throughput

**PromQL Examples**:
```promql
# Requests per second
sum(rate(http_requests_total[5m]))

# Requests per second by status code
sum(rate(http_requests_total[5m])) by (status)

# Traffic growth rate (week over week)
sum(rate(http_requests_total[5m]))
/
sum(rate(http_requests_total[5m] offset 7d))
```

**Alert Thresholds**:
- Warning: RPS > 80% of capacity
- Critical: RPS > 95% of capacity

### 3. Errors
**What**: Rate of requests that fail

**Why Monitor**: Direct indicator of user-facing problems

**Key Metrics**:
- Error rate (%)
- 5xx response codes
- Failed transactions
- Exception counts

**PromQL Examples**:
```promql
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100

# Error count by type
sum(rate(http_requests_total{status=~"5.."}[5m])) by (status)

# Application errors
rate(application_errors_total[5m])
```

**Alert Thresholds**:
- Warning: Error rate > 1%
- Critical: Error rate > 5%

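The error-rate ratio above can be mirrored in plain Python over the per-status request counts observed in a window. This is an illustrative sketch (the status codes and counts below are made up), not how Prometheus evaluates the query:

```python
def error_rate_percent(counts_by_status: dict) -> float:
    """Percentage of requests in the window that returned a 5xx status."""
    total = sum(counts_by_status.values())
    if total == 0:
        return 0.0  # no traffic in the window: report 0 rather than divide by zero
    errors = sum(n for status, n in counts_by_status.items()
                 if status.startswith("5"))
    return errors / total * 100

# 40 failures out of 8,000 requests in the window
window = {"200": 7600, "404": 360, "500": 30, "503": 10}
print(error_rate_percent(window))  # 0.5
```

The zero-traffic guard matters in practice: the PromQL form returns no data when the denominator is empty, and an alert on the ratio should account for that case too.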
### 4. Saturation
**What**: How "full" your service is

**Why Monitor**: Predict capacity issues before they impact users

**Key Metrics**:
- CPU utilization
- Memory utilization
- Disk I/O
- Network bandwidth
- Queue depth
- Thread pool usage

**PromQL Examples**:
```promql
# CPU saturation
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory saturation
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk saturation
rate(node_disk_io_time_seconds_total[5m]) * 100

# Queue depth
queue_depth_current / queue_depth_max * 100
```

**Alert Thresholds**:
- Warning: > 70% utilization
- Critical: > 90% utilization

---

## RED Method (for Services)

**R**ate, **E**rrors, **D**uration - a simplified approach for request-driven services

### Rate
Number of requests per second:
```promql
sum(rate(http_requests_total[5m]))
```

### Errors
Number of failed requests per second:
```promql
sum(rate(http_requests_total{status=~"5.."}[5m]))
```

### Duration
Time taken to process requests:
```promql
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```

**When to Use**: Microservices, APIs, web applications

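As a sanity check on what the three queries report, the RED numbers can be computed directly from a window of raw requests. This sketch is illustrative, not how Prometheus evaluates them: the p99 here is nearest-rank over raw durations, whereas `histogram_quantile` interpolates within buckets.

```python
def red_summary(requests: list, window_seconds: float) -> dict:
    """RED numbers for a window of (status_code, duration_seconds) requests."""
    rate = len(requests) / window_seconds                                        # Rate
    errors = sum(1 for status, _ in requests if status >= 500) / window_seconds  # Errors
    durations = sorted(d for _, d in requests)
    # Duration: nearest-rank p99 over the raw samples
    p99 = durations[max(0, int(len(durations) * 0.99) - 1)] if durations else 0.0
    return {"rate": rate, "errors": errors, "p99": p99}

# 100 requests over 60s: 98 fast successes, one 5xx, one slow outlier
window = [(200, 0.05)] * 98 + [(500, 0.20), (200, 1.50)]
summary = red_summary(window, 60)
print(summary["p99"])  # 0.2 -- the outlier is above p99, so it doesn't dominate
```

Note how the single 1.5s outlier sits above the 99th percentile: this is why the RED method prefers percentiles over maximums for duration.
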
---

## USE Method (for Resources)

**U**tilization, **S**aturation, **E**rrors - for infrastructure resources

### Utilization
Percentage of time the resource is busy:
```promql
# CPU utilization
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Disk utilization
(node_filesystem_size_bytes - node_filesystem_avail_bytes)
/ node_filesystem_size_bytes * 100
```

### Saturation
Amount of work the resource cannot service (queued):
```promql
# Load average (saturation indicator)
node_load15

# Disk I/O wait time
rate(node_disk_io_time_weighted_seconds_total[5m])
```

### Errors
Count of error events:
```promql
# Network errors
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])

# Disk errors
rate(node_disk_io_errors_total[5m])
```

**When to Use**: Servers, databases, network devices

---

## Metric Types

### Counter
Monotonically increasing value (never decreases, though it resets to zero when the process restarts)

**Examples**: Request count, error count, bytes sent

**Usage**:
```promql
# Always use rate() or increase() with counters
rate(http_requests_total[5m])      # Requests per second
increase(http_requests_total[1h])  # Total requests in 1 hour
```

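The reason raw counter values are rarely useful directly is what `rate()`/`increase()` compute under the hood. A plain-Python sketch of the increase calculation, including counter-reset handling (it ignores the window-boundary extrapolation the real PromQL functions perform; sample values are illustrative):

```python
def counter_increase(samples: list) -> float:
    """Total increase across ordered counter samples, compensating for resets."""
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # A counter never decreases, so a drop means the process restarted and
        # the counter reset to zero; the whole new value counts as increase.
        increase += cur if cur < prev else cur - prev
    return increase

def counter_rate(samples: list, window_seconds: float) -> float:
    """Per-second rate over the window, like rate() without extrapolation."""
    return counter_increase(samples) / window_seconds

# Five samples over 60s; the drop from 250 to 50 is a reset, not a decrease
samples = [100, 200, 250, 50, 150]
print(counter_rate(samples, 60))  # 5.0 requests/s
```

Without the reset handling, a restart would produce a large negative "rate", which is why subtracting raw counter values by hand is a common mistake.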
### Gauge
Value that can go up and down

**Examples**: Memory usage, queue depth, concurrent connections

**Usage**:
```promql
# Use directly or with aggregations
avg(memory_usage_bytes)
max(queue_depth)
```

### Histogram
Samples observations and counts them in configurable buckets

**Examples**: Request duration, response size

**Usage**:
```promql
# Calculate percentiles
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Average from histogram
rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])
```

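The percentile query can be demystified with a sketch of the estimate `histogram_quantile()` makes: find the cumulative bucket the target rank falls into, then interpolate linearly inside it. Bucket bounds and counts below are illustrative, and the real function has extra edge-case handling (e.g. the `+Inf` bucket):

```python
def histogram_quantile(q: float, buckets: list) -> float:
    """buckets: (upper_bound, cumulative_count) pairs, sorted by bound."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if rank <= count:
            # Observations are assumed uniformly spread within the bucket
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]

# 60 requests took <= 0.1s, 90 took <= 0.5s, all 100 took <= 1.0s
buckets = [(0.1, 60), (0.5, 90), (1.0, 100)]
print(histogram_quantile(0.95, buckets))  # 0.75
```

The uniform-spread assumption is why bucket layout matters: the estimate can only be as precise as the bucket that contains the target rank.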
### Summary
Similar to a histogram, but quantiles are precomputed on the client side and cannot be aggregated across instances

**Usage**: Less flexible than histograms; avoid for new metrics

---

## Cardinality Best Practices

**Cardinality**: Number of unique time series

### High Cardinality Labels (AVOID)
❌ User ID
❌ Email address
❌ IP address
❌ Timestamp
❌ Random IDs

### Low Cardinality Labels (GOOD)
✅ Environment (prod, staging)
✅ Region (us-east-1, eu-west-1)
✅ Service name
✅ HTTP status code category (2xx, 4xx, 5xx)
✅ Endpoint/route

### Calculating Cardinality Impact
```
Time series = unique combinations of labels

Example:
service (5) × environment (3) × region (4) × status (5) = 300 time series ✅

service (5) × environment (3) × region (4) × user_id (1M) = 60M time series ❌
```

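A sketch of the arithmetic above: the product of per-label cardinalities is a worst-case series count for one metric (actual cardinality is the number of label combinations that really occur, which is usually lower):

```python
from math import prod

def estimated_series(label_cardinality: dict) -> int:
    """Worst-case time-series count: product of per-label cardinalities."""
    return prod(label_cardinality.values())

safe = {"service": 5, "environment": 3, "region": 4, "status": 5}
risky = {"service": 5, "environment": 3, "region": 4, "user_id": 1_000_000}

print(estimated_series(safe))   # 300
print(estimated_series(risky))  # 60000000
```

Running this check before adding a label to an existing metric is cheap; cleaning up 60 million series after the fact is not.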
---

## Naming Conventions

### Prometheus Naming
```
<namespace>_<name>_<unit>[_total]

Examples:
http_requests_total
http_request_duration_seconds
process_cpu_seconds_total
node_memory_MemAvailable_bytes
```

**Rules**:
- Use snake_case
- Include unit in name (seconds, bytes, ratio)
- Use `_total` suffix for counters
- Namespace by application/component

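These rules can be checked mechanically. A minimal linter sketch, assuming the Prometheus metric-name pattern `[a-zA-Z_:][a-zA-Z0-9_:]*` plus this guide's conventions (note that exporter names like `node_memory_MemAvailable_bytes` deliberately keep the source system's capitalization, so the snake_case check is a convention, not a hard requirement):

```python
import re

# Prometheus metric-name pattern
VALID_NAME = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")

def lint_metric_name(name: str, is_counter: bool = False) -> list:
    """Return a list of problems with a proposed metric name (empty = OK)."""
    problems = []
    if not VALID_NAME.match(name):
        problems.append("not a valid Prometheus metric name")
    if name != name.lower():
        problems.append("prefer snake_case (all lowercase)")
    if is_counter and not name.endswith("_total"):
        problems.append("counters should end in _total")
    return problems

print(lint_metric_name("http_requests_total", is_counter=True))  # []
print(lint_metric_name("HttpRequestCount", is_counter=True))
# ['prefer snake_case (all lowercase)', 'counters should end in _total']
```

A check like this fits naturally in code review or a CI step, before a badly named metric accumulates dashboards and alerts that depend on it.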
### CloudWatch Naming
```
<Namespace>/<MetricName>

Examples:
AWS/EC2/CPUUtilization
MyApp/RequestCount
```

**Rules**:
- Use PascalCase
- Group by namespace
- No unit in name (specified separately)

---

## Dashboard Design

### Key Principles

1. **Top-Down Layout**: Most important metrics first
2. **Color Coding**: Red (critical), yellow (warning), green (healthy)
3. **Consistent Time Windows**: All panels use same time range
4. **Limit Panels**: 8-12 panels per dashboard maximum
5. **Include Context**: Show related metrics together

### Dashboard Structure

```
┌─────────────────────────────────────────────┐
│ Overall Health (Single Stats)               │
│ [Requests/s] [Error%] [P95 Latency]         │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ Request Rate & Errors (Graphs)              │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ Latency Distribution (Graphs)               │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ Resource Usage (Graphs)                     │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ Dependencies (Graphs)                       │
└─────────────────────────────────────────────┘
```

### Template Variables
Use variables for filtering:
- Environment: `$environment`
- Service: `$service`
- Region: `$region`
- Pod: `$pod`

---

## Common Pitfalls

### 1. Monitoring What You Build, Not What Users Experience
❌ `backend_processing_complete`
✅ `user_request_completed`

### 2. Too Many Metrics
- Start with the Four Golden Signals
- Add metrics only when needed for specific issues
- Remove unused metrics

### 3. Incorrect Aggregations
❌ `avg(rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]))` - averages per-instance latencies without weighting by traffic
✅ `sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))` - traffic-weighted average

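Why the unweighted average misleads is easiest to see with two instances receiving very different traffic; the numbers below are illustrative:

```python
instances = [
    {"requests": 9900, "total_seconds": 990.0},  # busy instance: 100ms average
    {"requests": 100,  "total_seconds": 50.0},   # idle instance: 500ms average
]

# ❌ average of per-instance averages: the idle instance counts
#    as much as the busy one
naive = sum(i["total_seconds"] / i["requests"] for i in instances) / len(instances)

# ✅ sum of sums over sum of counts, as in sum(rate(_sum)) / sum(rate(_count))
correct = (sum(i["total_seconds"] for i in instances)
           / sum(i["requests"] for i in instances))

print(naive)    # 0.3   -- overstates typical latency roughly 3x
print(correct)  # 0.104 -- the latency the average request actually saw
```

The busy instance served 99% of the traffic, so the true average sits near its 100ms figure; the naive version lets the nearly idle instance drag the number to 300ms.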
### 4. Wrong Time Windows
- Too short (< 1m): Noisy data
- Too long (> 15m): Miss short-lived issues
- Sweet spot: 5m for most alerts

### 5. Missing Labels
❌ `http_requests_total`
✅ `http_requests_total{method="GET", status="200", endpoint="/api/users"}`

---

## Metric Collection Best Practices

### Application Instrumentation
```python
from prometheus_client import Counter, Histogram, Gauge

# Counter for requests
requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status'])

# Histogram for latency
request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint'])

# Gauge for in-progress requests
requests_in_progress = Gauge(
    'http_requests_in_progress',
    'HTTP requests currently being processed')
```

### Collection Intervals
- Application metrics: 15-30s
- Infrastructure metrics: 30-60s
- Billing/cost metrics: 5-15m
- External API checks: 1-5m

### Retention
- Raw metrics: 15-30 days
- 5m aggregates: 90 days
- 1h aggregates: 1 year
- Daily aggregates: 2+ years
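For capacity planning against these retention tiers, a rough local-storage estimate can be sketched as samples kept times bytes per sample. The `bytes_per_sample=2.0` default is an assumption (Prometheus's documentation suggests roughly 1-2 bytes per sample after compression); adjust it for your storage backend:

```python
def storage_gb(series: int, scrape_interval_s: int, retention_days: int,
               bytes_per_sample: float = 2.0) -> float:
    """Rough on-disk size in GB: samples retained x bytes per sample."""
    samples = series * retention_days * 86400 / scrape_interval_s
    return samples * bytes_per_sample / 1e9

# 100k series scraped every 30s and kept for 30 days
print(round(storage_gb(100_000, 30, 30), 2))  # 17.28
```

The same function shows why downsampling pays off: re-running it with a 5m effective interval for the 90-day tier shrinks the per-day cost by a factor of ten.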