# Metrics Design Guide
## The Four Golden Signals
The Four Golden Signals from Google's SRE book provide a comprehensive view of system health:
### 1. Latency
**What**: Time to service a request
**Why Monitor**: Directly impacts user experience
**Key Metrics**:
- Request duration (p50, p95, p99, p99.9)
- Time to first byte (TTFB)
- Backend processing time
- Database query latency
**PromQL Examples**:
```promql
# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Average latency by endpoint
sum(rate(http_request_duration_seconds_sum[5m])) by (endpoint)
/
sum(rate(http_request_duration_seconds_count[5m])) by (endpoint)
```
**Alert Thresholds**:
- Warning: p95 > 500ms
- Critical: p99 > 2s
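As a starting point, these thresholds translate into alert expressions built from the latency queries above (in practice you would wrap them in alerting rules with a `for:` duration to avoid flapping):
```promql
# Warning: p95 latency above 500ms
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
# Critical: p99 latency above 2s
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
```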
### 2. Traffic
**What**: Demand on your system
**Why Monitor**: Understand load patterns, capacity planning
**Key Metrics**:
- Requests per second (RPS)
- Transactions per second (TPS)
- Concurrent connections
- Network throughput
**PromQL Examples**:
```promql
# Requests per second
sum(rate(http_requests_total[5m]))
# Requests per second by status code
sum(rate(http_requests_total[5m])) by (status)
# Traffic growth rate (week over week)
sum(rate(http_requests_total[5m]))
/
sum(rate(http_requests_total[5m] offset 7d))
```
**Alert Thresholds**:
- Warning: RPS > 80% of capacity
- Critical: RPS > 95% of capacity
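Capacity is not a built-in metric, so expressing these thresholds requires a known capacity figure. A minimal sketch, assuming a measured capacity of 1000 RPS (better kept as a recording rule than a hard-coded constant):
```promql
# Warning: sustained traffic above 80% of an assumed 1000 RPS capacity
sum(rate(http_requests_total[5m])) > 0.8 * 1000
```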
### 3. Errors
**What**: Rate of requests that fail
**Why Monitor**: Direct indicator of user-facing problems
**Key Metrics**:
- Error rate (%)
- 5xx response codes
- Failed transactions
- Exception counts
**PromQL Examples**:
```promql
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# Error count by type
sum(rate(http_requests_total{status=~"5.."}[5m])) by (status)
# Application errors
rate(application_errors_total[5m])
```
**Alert Thresholds**:
- Warning: Error rate > 1%
- Critical: Error rate > 5%
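One way to encode the critical threshold, with a guard clause so that near-zero traffic cannot produce a spurious 100% error rate (the 1 RPS floor is an assumption to tune per service):
```promql
# Critical: error rate above 5%, only when there is meaningful traffic
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100 > 5
and
sum(rate(http_requests_total[5m])) > 1
```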
### 4. Saturation
**What**: How "full" your service is
**Why Monitor**: Predict capacity issues before they impact users
**Key Metrics**:
- CPU utilization
- Memory utilization
- Disk I/O
- Network bandwidth
- Queue depth
- Thread pool usage
**PromQL Examples**:
```promql
# CPU saturation
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory saturation
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk saturation
rate(node_disk_io_time_seconds_total[5m]) * 100
# Queue depth
queue_depth_current / queue_depth_max * 100
```
**Alert Thresholds**:
- Warning: > 70% utilization
- Critical: > 90% utilization
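Saturation is most useful as a leading indicator; `predict_linear` extrapolates a gauge's trend so you can alert before a resource is exhausted. A sketch for disk space (the 6h window and 4h horizon are illustrative):
```promql
# Alert if disk is on track to fill within 4 hours (14400s), based on 6h of history
predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[6h], 14400) < 0
```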
---
## RED Method (for Services)
**R**ate, **E**rrors, **D**uration - a simplified approach for request-driven services
### Rate
Number of requests per second:
```promql
sum(rate(http_requests_total[5m]))
```
### Errors
Number of failed requests per second:
```promql
sum(rate(http_requests_total{status=~"5.."}[5m]))
```
### Duration
Time taken to process requests:
```promql
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
**When to Use**: Microservices, APIs, web applications
---
## USE Method (for Resources)
**U**tilization, **S**aturation, **E**rrors - for infrastructure resources
### Utilization
Percentage of time resource is busy:
```promql
# CPU utilization
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Disk utilization
(node_filesystem_size_bytes - node_filesystem_avail_bytes)
/ node_filesystem_size_bytes * 100
```
### Saturation
Amount of work the resource cannot service (queued):
```promql
# Load average (saturation indicator)
node_load15
# Disk I/O wait time
rate(node_disk_io_time_weighted_seconds_total[5m])
```
### Errors
Count of error events:
```promql
# Network errors
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])
# Disk errors (metric availability varies by exporter and hardware)
rate(node_disk_io_errors_total[5m])
```
**When to Use**: Servers, databases, network devices
---
## Metric Types
### Counter
Monotonically increasing value (never decreases; it only resets to zero when the process restarts)
**Examples**: Request count, error count, bytes sent
**Usage**:
```promql
# Always use rate() or increase() with counters
rate(http_requests_total[5m]) # Requests per second
increase(http_requests_total[1h]) # Total requests in 1 hour
```
### Gauge
Value that can go up and down
**Examples**: Memory usage, queue depth, concurrent connections
**Usage**:
```promql
# Use directly or with aggregations
avg(memory_usage_bytes)
max(queue_depth)
```
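Because gauges are instantaneous values, PromQL's `*_over_time` and derivative functions are the natural tools for trends (metric names here follow the section's illustrative examples):
```promql
# Net change in queue depth over the last hour
delta(queue_depth[1h])
# Smoothed view of a noisy gauge
avg_over_time(memory_usage_bytes[1h])
# Per-second trend; a persistently positive slope can indicate a leak
deriv(memory_usage_bytes[10m])
```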
### Histogram
Samples observations and counts them in configurable buckets
**Examples**: Request duration, response size
**Usage**:
```promql
# Calculate percentiles
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Average from histogram
rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])
```
### Summary
Similar to a histogram, but quantiles are precomputed on the client side
**Usage**: Less flexible than histograms because client-side quantiles cannot be aggregated across instances; avoid for new metrics
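The inflexibility comes from how a summary is exposed: quantiles arrive as precomputed per-instance series, so they can be read but not re-aggregated (illustrative metric name):
```promql
# A client-side quantile is read directly, per instance only
http_request_duration_seconds{quantile="0.99"}
# _sum and _count still support fleet-wide averages, as with histograms
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
```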
---
## Cardinality Best Practices
**Cardinality**: Number of unique time series
### High Cardinality Labels (AVOID)
❌ User ID
❌ Email address
❌ IP address
❌ Timestamp
❌ Random IDs
### Low Cardinality Labels (GOOD)
✅ Environment (prod, staging)
✅ Region (us-east-1, eu-west-1)
✅ Service name
✅ HTTP status code category (2xx, 4xx, 5xx)
✅ Endpoint/route
### Calculating Cardinality Impact
```
Time series = unique combinations of labels
Example:
service (5) × environment (3) × region (4) × status (5) = 300 time series ✅
service (5) × environment (3) × region (4) × user_id (1M) = 60M time series ❌
```
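To see where cardinality is actually going in a running Prometheus, you can count series per metric name (note that the first query touches every series and can be expensive):
```promql
# Top 10 metric names by number of time series
topk(10, count by (__name__)({__name__=~".+"}))
# Total series currently in the TSDB head block
prometheus_tsdb_head_series
```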
---
## Naming Conventions
### Prometheus Naming
```
<namespace>_<name>_<unit>            # counters append _total
Examples:
http_requests_total
http_request_duration_seconds
process_cpu_seconds_total
node_memory_MemAvailable_bytes
```
**Rules**:
- Use snake_case
- Include unit in name (seconds, bytes, ratio)
- Use `_total` suffix for counters
- Namespace by application/component
### CloudWatch Naming
```
<Namespace>/<MetricName>
Examples:
AWS/EC2/CPUUtilization
MyApp/RequestCount
```
**Rules**:
- Use PascalCase
- Group by namespace
- No unit in name (specified separately)
---
## Dashboard Design
### Key Principles
1. **Top-Down Layout**: Most important metrics first
2. **Color Coding**: Red (critical), yellow (warning), green (healthy)
3. **Consistent Time Windows**: All panels use same time range
4. **Limit Panels**: 8-12 panels per dashboard maximum
5. **Include Context**: Show related metrics together
### Dashboard Structure
```
┌─────────────────────────────────────────────┐
│ Overall Health (Single Stats) │
│ [Requests/s] [Error%] [P95 Latency] │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Request Rate & Errors (Graphs) │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Latency Distribution (Graphs) │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Resource Usage (Graphs) │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Dependencies (Graphs) │
└─────────────────────────────────────────────┘
```
### Template Variables
Use variables for filtering:
- Environment: `$environment`
- Service: `$service`
- Region: `$region`
- Pod: `$pod`
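In Grafana, these variables are interpolated into panel queries before they reach Prometheus; the label names below are assumptions that mirror the variable list:
```promql
# Panel query filtered by dashboard variables
sum(rate(http_requests_total{environment="$environment", service="$service", region="$region"}[5m])) by (endpoint)
```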
---
## Common Pitfalls
### 1. Monitoring What You Build, Not What Users Experience
❌ `backend_processing_complete` - an internal milestone users never see
✅ `user_request_completed` - reflects what the user actually experienced
### 2. Too Many Metrics
- Start with Four Golden Signals
- Add metrics only when needed for specific issues
- Remove unused metrics
### 3. Incorrect Aggregations
❌ `avg(rate(http_request_duration_seconds_sum[5m]))` - averages per-series rates, weighting idle and busy instances equally
✅ `sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))` - traffic-weighted average
### 4. Wrong Time Windows
- Too short (< 1m): Noisy data
- Too long (> 15m): Miss short-lived issues
- Sweet spot: 5m for most alerts
### 5. Missing Labels
❌ `http_requests_total` - no labels, so it cannot be sliced by method, status, or endpoint
✅ `http_requests_total{method="GET", status="200", endpoint="/api/users"}`
---
## Metric Collection Best Practices
### Application Instrumentation
```python
from prometheus_client import Counter, Histogram, Gauge

# Counter for requests
requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status'])

# Histogram for latency
request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint'])

# Gauge for in-progress requests
requests_in_progress = Gauge(
    'http_requests_in_progress',
    'HTTP requests currently being processed')
```
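Once exposed (for example via `prometheus_client.start_http_server`) and scraped, these three metrics plug directly into the signal queries from earlier sections:
```promql
# Rate, from the Counter
sum(rate(http_requests_total[5m])) by (endpoint)
# Duration, from the Histogram's buckets
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))
# Saturation of request-handling capacity, from the Gauge
max(http_requests_in_progress)
```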
### Collection Intervals
- Application metrics: 15-30s
- Infrastructure metrics: 30-60s
- Billing/cost metrics: 5-15m
- External API checks: 1-5m
### Retention
- Raw metrics: 15-30 days
- 5m aggregates: 90 days
- 1h aggregates: 1 year
- Daily aggregates: 2+ years