Metrics Design Guide

The Four Golden Signals

The Four Golden Signals from Google's SRE book provide a comprehensive view of system health:

1. Latency

What: Time to service a request

Why Monitor: Directly impacts user experience

Key Metrics:

  • Request duration (p50, p95, p99, p99.9)
  • Time to first byte (TTFB)
  • Backend processing time
  • Database query latency

PromQL Examples:

# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Average latency by endpoint
sum(rate(http_request_duration_seconds_sum[5m])) by (endpoint)
  /
sum(rate(http_request_duration_seconds_count[5m])) by (endpoint)

Alert Thresholds:

  • Warning: p95 > 500ms
  • Critical: p99 > 2s
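
Expressed as PromQL alert conditions on the percentile queries above (a sketch; the thresholds are in seconds):

# Warning: p95 above 500ms
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5

# Critical: p99 above 2s
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2

In a Prometheus alerting rule these expressions would normally be paired with a for: duration to avoid flapping on brief spikes.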

2. Traffic

What: Demand on your system

Why Monitor: Understand load patterns, capacity planning

Key Metrics:

  • Requests per second (RPS)
  • Transactions per second (TPS)
  • Concurrent connections
  • Network throughput

PromQL Examples:

# Requests per second
sum(rate(http_requests_total[5m]))

# Requests per second by status code
sum(rate(http_requests_total[5m])) by (status)

# Traffic growth rate (week over week)
sum(rate(http_requests_total[5m]))
  /
sum(rate(http_requests_total[5m] offset 7d))

Alert Thresholds:

  • Warning: RPS > 80% of capacity
  • Critical: RPS > 95% of capacity

3. Errors

What: Rate of requests that fail

Why Monitor: Direct indicator of user-facing problems

Key Metrics:

  • Error rate (%)
  • 5xx response codes
  • Failed transactions
  • Exception counts

PromQL Examples:

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m])) * 100

# Error count by type
sum(rate(http_requests_total{status=~"5.."}[5m])) by (status)

# Application errors
rate(application_errors_total[5m])

Alert Thresholds:

  • Warning: Error rate > 1%
  • Critical: Error rate > 5%
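
As with latency, these thresholds can be written as a single alert expression on the error ratio (a sketch; 0.01 and 0.05 correspond to 1% and 5%):

# Error ratio above the warning threshold
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m])) > 0.01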

4. Saturation

What: How "full" your service is

Why Monitor: Predict capacity issues before they impact users

Key Metrics:

  • CPU utilization
  • Memory utilization
  • Disk I/O
  • Network bandwidth
  • Queue depth
  • Thread pool usage

PromQL Examples:

# CPU utilization (a common saturation proxy)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory saturation
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk saturation
rate(node_disk_io_time_seconds_total[5m]) * 100

# Queue depth
queue_depth_current / queue_depth_max * 100

Alert Thresholds:

  • Warning: > 70% utilization
  • Critical: > 90% utilization
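
Because the point of tracking saturation is to predict exhaustion before users notice, a forecast query is often used alongside the utilization queries above; a sketch using predict_linear (the 6h lookback and 4h horizon are arbitrary choices):

# Alert if the filesystem is projected to fill within 4 hours, based on the last 6 hours of growth
predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[6h], 4 * 3600) < 0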

RED Method (for Services)

Rate, Errors, Duration - a simplified approach for request-driven services

Rate

Number of requests per second:

sum(rate(http_requests_total[5m]))

Errors

Number of failed requests per second:

sum(rate(http_requests_total{status=~"5.."}[5m]))

Duration

Time taken to process requests:

histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

When to Use: Microservices, APIs, web applications
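
The three RED queries are often broken down per service on a single dashboard; a sketch assuming the metrics carry a service label:

# Rate, errors, and p99 duration by service
sum(rate(http_requests_total[5m])) by (service)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))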


USE Method (for Resources)

Utilization, Saturation, Errors - for infrastructure resources

Utilization

Percentage of time the resource is busy (or, for capacity resources such as filesystems, the fraction of capacity in use):

# CPU utilization
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Disk utilization
(node_filesystem_size_bytes - node_filesystem_avail_bytes)
  / node_filesystem_size_bytes * 100

Saturation

Amount of work the resource cannot service (queued):

# Load average (saturation indicator)
node_load15

# Disk I/O wait time
rate(node_disk_io_time_weighted_seconds_total[5m])

Errors

Count of error events:

# Network errors
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])

# Disk errors (metric availability depends on the exporter and platform)
rate(node_disk_io_errors_total[5m])

When to Use: Servers, databases, network devices


Metric Types

Counter

Monotonically increasing value (never decreases, except when it resets to zero on process restart)

Examples: Request count, error count, bytes sent

Usage:

# Always use rate() or increase() with counters
rate(http_requests_total[5m])  # Requests per second
increase(http_requests_total[1h])  # Total requests in 1 hour

Gauge

Value that can go up and down

Examples: Memory usage, queue depth, concurrent connections

Usage:

# Use directly or with aggregations
avg(memory_usage_bytes)
max(queue_depth)

Histogram

Samples observations and counts them in configurable buckets

Examples: Request duration, response size

Usage:

# Calculate percentiles
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Average from histogram
rate(http_request_duration_seconds_sum[5m])
  /
rate(http_request_duration_seconds_count[5m])

Summary

Similar to a histogram, but quantiles are precomputed on the client side

Usage: Summary quantiles cannot be aggregated across instances, making summaries less flexible than histograms; prefer histograms for new metrics
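
For illustration, a summary exposes precomputed quantile series that are queried directly; assuming a metric exported as a summary named rpc_duration_seconds, the per-instance quantile looks like this and cannot be meaningfully averaged or summed across instances:

# Per-instance p95 from a summary (no histogram_quantile needed, but no cross-instance aggregation either)
rpc_duration_seconds{quantile="0.95"}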


Cardinality Best Practices

Cardinality: Number of unique time series

High Cardinality Labels (AVOID)

  • User ID
  • Email address
  • IP address
  • Timestamp
  • Random IDs

Low Cardinality Labels (GOOD)

  • Environment (prod, staging)
  • Region (us-east-1, eu-west-1)
  • Service name
  • HTTP status code category (2xx, 4xx, 5xx)
  • Endpoint/route

Calculating Cardinality Impact

Time series = unique label combinations; in the worst case this is the product of the number of values of each label

Example:
service (5) × environment (3) × region (4) × status (5) = 300 time series ✅

service (5) × environment (3) × region (4) × user_id (1M) = 60M time series ❌
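
Prometheus itself can report series counts, which is useful for spotting cardinality problems; two common (and potentially expensive) queries, with illustrative label names:

# Ten metric names with the most time series
topk(10, count by (__name__)({__name__=~".+"}))

# Number of distinct values of one label on one metric
count(count by (status) (http_requests_total))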

Naming Conventions

Prometheus Naming

<namespace>_<name>_<unit> (plus a _total suffix for counters)

Examples:
http_requests_total
http_request_duration_seconds
process_cpu_seconds_total
node_memory_MemAvailable_bytes

Rules:

  • Use snake_case
  • Include unit in name (seconds, bytes, ratio)
  • Use _total suffix for counters
  • Namespace by application/component

CloudWatch Naming

<Namespace>/<MetricName>

Examples:
AWS/EC2/CPUUtilization
MyApp/RequestCount

Rules:

  • Use PascalCase
  • Group by namespace
  • No unit in name (specified separately)

Dashboard Design

Key Principles

  1. Top-Down Layout: Most important metrics first
  2. Color Coding: Red (critical), yellow (warning), green (healthy)
  3. Consistent Time Windows: All panels use same time range
  4. Limit Panels: 8-12 panels per dashboard maximum
  5. Include Context: Show related metrics together

Dashboard Structure

┌─────────────────────────────────────────────┐
│  Overall Health (Single Stats)              │
│  [Requests/s] [Error%] [P95 Latency]        │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│  Request Rate & Errors (Graphs)             │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│  Latency Distribution (Graphs)              │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│  Resource Usage (Graphs)                    │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│  Dependencies (Graphs)                      │
└─────────────────────────────────────────────┘

Template Variables

Use variables for filtering:

  • Environment: $environment
  • Service: $service
  • Region: $region
  • Pod: $pod
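
Grafana interpolates these variables into panel queries; a sketch assuming the metrics carry matching labels:

sum(rate(http_requests_total{environment="$environment", service="$service", region="$region"}[5m])) by (endpoint)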

Common Pitfalls

1. Monitoring What You Build, Not What Users Experience

backend_processing_complete (what you build) ❌
user_request_completed (what users experience) ✅

2. Too Many Metrics

  • Start with Four Golden Signals
  • Add metrics only when needed for specific issues
  • Remove unused metrics

3. Incorrect Aggregations

avg(rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])) - averages per-series latencies, giving idle instances the same weight as busy ones ❌
sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m])) - traffic-weighted average ✅

4. Wrong Time Windows

  • Too short (< 1m): Noisy data
  • Too long (> 15m): Miss short-lived issues
  • Sweet spot: 5m for most alerts

5. Missing Labels

http_requests_total (no labels to slice by) ❌
http_requests_total{method="GET", status="200", endpoint="/api/users"} ✅


Metric Collection Best Practices

Application Instrumentation

from prometheus_client import Counter, Histogram, Gauge

# Counter for requests
requests_total = Counter('http_requests_total',
                        'Total HTTP requests',
                        ['method', 'endpoint', 'status'])

# Histogram for latency
request_duration = Histogram('http_request_duration_seconds',
                            'HTTP request duration',
                            ['method', 'endpoint'])

# Gauge for in-progress requests
requests_in_progress = Gauge('http_requests_in_progress',
                            'HTTP requests currently being processed')
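
Once these metrics are scraped, they can be queried with the same patterns as in the earlier sections; for example, the gauge and counter defined above:

# In-flight requests across all instances
sum(http_requests_in_progress)

# Request rate by endpoint from the instrumented counter
sum(rate(http_requests_total[5m])) by (endpoint)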

Collection Intervals

  • Application metrics: 15-30s
  • Infrastructure metrics: 30-60s
  • Billing/cost metrics: 5-15m
  • External API checks: 1-5m

Retention

  • Raw metrics: 15-30 days
  • 5m aggregates: 90 days
  • 1h aggregates: 1 year
  • Daily aggregates: 2+ years