# Metrics Design Guide
## The Four Golden Signals
The Four Golden Signals from Google's SRE book provide a comprehensive view of system health:
### 1. Latency
**What**: Time to service a request
**Why Monitor**: Directly impacts user experience
**Key Metrics**:
- Request duration (p50, p95, p99, p99.9)
- Time to first byte (TTFB)
- Backend processing time
- Database query latency
**PromQL Examples**:
```promql
# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Average latency by endpoint
avg(rate(http_request_duration_seconds_sum[5m])) by (endpoint)
/
avg(rate(http_request_duration_seconds_count[5m])) by (endpoint)
```
**Alert Thresholds**:
- Warning: p95 > 500ms
- Critical: p99 > 2s
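As a sketch, these thresholds translate directly into PromQL alert expressions (same histogram metric as above; tune the numbers to your SLOs):
```promql
# Warning: p95 latency above 500 ms
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
# Critical: p99 latency above 2 s
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
```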
### 2. Traffic
**What**: Demand on your system
**Why Monitor**: Understand load patterns, capacity planning
**Key Metrics**:
- Requests per second (RPS)
- Transactions per second (TPS)
- Concurrent connections
- Network throughput
**PromQL Examples**:
```promql
# Requests per second
sum(rate(http_requests_total[5m]))
# Requests per second by status code
sum(rate(http_requests_total[5m])) by (status)
# Traffic growth rate (week over week)
sum(rate(http_requests_total[5m]))
/
sum(rate(http_requests_total[5m] offset 7d))
```
**Alert Thresholds**:
- Warning: RPS > 80% of capacity
- Critical: RPS > 95% of capacity
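Capacity is not something Prometheus knows on its own, so these thresholds need a reference value. A minimal sketch, assuming a measured capacity of 1000 RPS (a placeholder, not a recommendation):
```promql
# Warning: traffic above 80% of an assumed 1000 RPS capacity
sum(rate(http_requests_total[5m])) > 0.8 * 1000
```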
### 3. Errors
**What**: Rate of requests that fail
**Why Monitor**: Direct indicator of user-facing problems
**Key Metrics**:
- Error rate (%)
- 5xx response codes
- Failed transactions
- Exception counts
**PromQL Examples**:
```promql
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# Error count by type
sum(rate(http_requests_total{status=~"5.."}[5m])) by (status)
# Application errors
rate(application_errors_total[5m])
```
**Alert Thresholds**:
- Warning: Error rate > 1%
- Critical: Error rate > 5%
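Combining the rate expressions above gives a ready-made alert condition (a sketch; the `status` label follows the convention used throughout this guide):
```promql
# Critical: more than 5% of requests are failing
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100 > 5
```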
### 4. Saturation
**What**: How "full" your service is
**Why Monitor**: Predict capacity issues before they impact users
**Key Metrics**:
- CPU utilization
- Memory utilization
- Disk I/O
- Network bandwidth
- Queue depth
- Thread pool usage
**PromQL Examples**:
```promql
# CPU saturation
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory saturation
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk saturation
rate(node_disk_io_time_seconds_total[5m]) * 100
# Queue depth
queue_depth_current / queue_depth_max * 100
```
**Alert Thresholds**:
- Warning: > 70% utilization
- Critical: > 90% utilization
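The same pattern applies to saturation. For example, the memory expression above becomes an alert by adding a comparison (a sketch using node_exporter metrics):
```promql
# Critical: memory utilization above 90% on any instance
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
```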
---
## RED Method (for Services)
**R**ate, **E**rrors, **D**uration - a simplified approach for request-driven services
### Rate
Number of requests per second:
```promql
sum(rate(http_requests_total[5m]))
```
### Errors
Number of failed requests per second:
```promql
sum(rate(http_requests_total{status=~"5.."}[5m]))
```
### Duration
Time taken to process requests:
```promql
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
**When to Use**: Microservices, APIs, web applications
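For a fleet of services, the three RED queries are usually broken down per service so one dashboard covers everything. A sketch, assuming your metrics carry a `service` label:
```promql
# Rate, per service
sum by (service) (rate(http_requests_total[5m]))
# Errors, per service
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
# Duration (p99), per service
histogram_quantile(0.99, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
```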
---
## USE Method (for Resources)
**U**tilization, **S**aturation, **E**rrors - for infrastructure resources
### Utilization
Percentage of time resource is busy:
```promql
# CPU utilization
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Disk utilization
(node_filesystem_size_bytes - node_filesystem_avail_bytes)
/ node_filesystem_size_bytes * 100
```
### Saturation
Amount of work the resource cannot service (queued):
```promql
# Load average (saturation indicator)
node_load15
# Disk I/O wait time
rate(node_disk_io_time_weighted_seconds_total[5m])
```
### Errors
Count of error events:
```promql
# Network errors
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])
# Disk errors (availability and metric name depend on your exporter)
rate(node_disk_io_errors_total[5m])
```
**When to Use**: Servers, databases, network devices
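Applying USE to a single resource makes the method concrete. A sketch for memory, using node_exporter metrics (the saturation signal here is a proxy; adjust to what your exporter exposes):
```promql
# Utilization: fraction of memory in use
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Saturation: major page faults indicate the system is short on memory
rate(node_vmstat_pgmajfault[5m])
# Errors: node_exporter has no generic memory-error metric; OOM kill counts are a common proxy
```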
---
## Metric Types
### Counter
Monotonically increasing value (never decreases)
**Examples**: Request count, error count, bytes sent
**Usage**:
```promql
# Always use rate() or increase() with counters
rate(http_requests_total[5m]) # Requests per second
increase(http_requests_total[1h]) # Total requests in 1 hour
```
### Gauge
Value that can go up and down
**Examples**: Memory usage, queue depth, concurrent connections
**Usage**:
```promql
# Use directly or with aggregations
avg(memory_usage_bytes)
max(queue_depth)
```
### Histogram
Samples observations and counts them in configurable buckets
**Examples**: Request duration, response size
**Usage**:
```promql
# Calculate percentiles
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Average from histogram
rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])
```
### Summary
Similar to a histogram, but quantiles are calculated on the client side
**Usage**: Client-side quantiles cannot be aggregated across instances, making summaries less flexible than histograms; avoid them for new metrics
---
## Cardinality Best Practices
**Cardinality**: Number of unique time series
### High Cardinality Labels (AVOID)
❌ User ID
❌ Email address
❌ IP address
❌ Timestamp
❌ Random IDs
### Low Cardinality Labels (GOOD)
✅ Environment (prod, staging)
✅ Region (us-east-1, eu-west-1)
✅ Service name
✅ HTTP status code category (2xx, 4xx, 5xx)
✅ Endpoint/route
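The status-code category label above can also be derived at query time with `label_replace` (a sketch; exporting the class as its own label at instrumentation time is cheaper):
```promql
# Map status="404" -> status_class="4xx" at query time
sum by (status_class) (
  label_replace(rate(http_requests_total[5m]), "status_class", "${1}xx", "status", "(\\d)\\d\\d")
)
```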
### Calculating Cardinality Impact
```
Time series = unique combinations of labels
Example:
service (5) × environment (3) × region (4) × status (5) = 300 time series ✅
service (5) × environment (3) × region (4) × user_id (1M) = 60M time series ❌
```
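To find cardinality offenders in an existing setup, Prometheus can be queried about its own series. A common audit query (expensive; run it ad hoc, not on a dashboard):
```promql
# Top 10 metric names by number of time series
topk(10, count by (__name__)({__name__=~".+"}))
```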
---
## Naming Conventions
### Prometheus Naming
```
<namespace>_<name>_<unit>[_total]
Examples:
http_requests_total
http_request_duration_seconds
process_cpu_seconds_total
node_memory_MemAvailable_bytes
```
**Rules**:
- Use snake_case
- Include unit in name (seconds, bytes, ratio)
- Use `_total` suffix for counters
- Namespace by application/component
### CloudWatch Naming
```
<Namespace>/<MetricName>
Examples:
AWS/EC2/CPUUtilization
MyApp/RequestCount
```
**Rules**:
- Use PascalCase
- Group by namespace
- No unit in name (specified separately)
---
## Dashboard Design
### Key Principles
1. **Top-Down Layout**: Most important metrics first
2. **Color Coding**: Red (critical), yellow (warning), green (healthy)
3. **Consistent Time Windows**: All panels use same time range
4. **Limit Panels**: 8-12 panels per dashboard maximum
5. **Include Context**: Show related metrics together
### Dashboard Structure
```
┌─────────────────────────────────────────────┐
│ Overall Health (Single Stats)               │
│ [Requests/s]  [Error%]  [P95 Latency]       │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Request Rate & Errors (Graphs)              │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Latency Distribution (Graphs)               │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Resource Usage (Graphs)                     │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Dependencies (Graphs)                       │
└─────────────────────────────────────────────┘
```
### Template Variables
Use variables for filtering:
- Environment: `$environment`
- Service: `$service`
- Region: `$region`
- Pod: `$pod`
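In Grafana, these variables interpolate straight into queries, so one dashboard serves every environment. A sketch (assumes the corresponding labels exist on your metrics):
```promql
# Request rate filtered by dashboard variables
sum(rate(http_requests_total{environment="$environment", service="$service", region="$region"}[5m]))
```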
---
## Common Pitfalls
### 1. Monitoring What You Build, Not What Users Experience
❌ `backend_processing_complete` - an internal milestone
✅ `user_request_completed` - an outcome the user actually sees
### 2. Too Many Metrics
- Start with Four Golden Signals
- Add metrics only when needed for specific issues
- Remove unused metrics
### 3. Incorrect Aggregations
❌ `avg(rate(...))` - averages per-series rates, weighting idle and busy series equally
✅ `sum(rate(...)) / sum(rate(...))` - aggregate first, then divide (see the sketch below)
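Spelled out for latency, using the histogram metric from earlier: the unweighted mean treats an idle instance the same as a busy one, while the sum/sum form weights by traffic:
```promql
# ❌ Wrong: unweighted mean of per-series latencies
avg(rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]))
# ✅ Right: traffic-weighted average latency
sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))
```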
### 4. Wrong Time Windows
- Too short (< 1m): Noisy data
- Too long (> 15m): Miss short-lived issues
- Sweet spot: 5m for most alerts
### 5. Missing Labels
❌ `http_requests_total` - no way to slice by method, endpoint, or status
✅ `http_requests_total{method="GET", status="200", endpoint="/api/users"}`
---
## Metric Collection Best Practices
### Application Instrumentation
```python
from prometheus_client import Counter, Histogram, Gauge
# Counter for requests
requests_total = Counter('http_requests_total',
                         'Total HTTP requests',
                         ['method', 'endpoint', 'status'])
# Histogram for latency
request_duration = Histogram('http_request_duration_seconds',
                             'HTTP request duration',
                             ['method', 'endpoint'])
# Gauge for in-progress requests
requests_in_progress = Gauge('http_requests_in_progress',
                             'HTTP requests currently being processed')
```
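A minimal sketch of how these metrics are updated inside a request handler (the handler and its arguments are hypothetical; only the metric objects come from the block above):
```python
import time

def handle_request(method: str, endpoint: str) -> None:
    # Gauge: track requests currently in flight
    with requests_in_progress.track_inprogress():
        start = time.time()
        status = "200"  # ... actual request processing would go here ...
        # Histogram: observe how long the request took
        request_duration.labels(method=method, endpoint=endpoint).observe(time.time() - start)
        # Counter: count the request with its final status
        requests_total.labels(method=method, endpoint=endpoint, status=status).inc()
```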
### Collection Intervals
- Application metrics: 15-30s
- Infrastructure metrics: 30-60s
- Billing/cost metrics: 5-15m
- External API checks: 1-5m
### Retention
- Raw metrics: 15-30 days
- 5m aggregates: 90 days
- 1h aggregates: 1 year
- Daily aggregates: 2+ years