---
name: observability-and-monitoring
description: Use when implementing metrics/logs/traces, defining SLIs/SLOs, designing alerts, choosing observability tools, debugging alert fatigue, or optimizing observability costs - provides SRE frameworks, anti-patterns, and implementation patterns
---
# Observability and Monitoring

## Overview

**Core principle:** Measure what users care about, alert on symptoms not causes, make alerts actionable.

**Rule:** Observability without actionability is just expensive logging.

**Already have observability tools (CloudWatch, Datadog, etc.)?** Optimize what you have first. Most observability problems are usage/process issues, not tooling. Implement SLIs/SLOs, clean up alerts, add runbooks with existing tools. Migrate only if you hit concrete tool limitations (cost, features, multi-cloud). Tool migration is expensive - make sure it solves a real problem.
## Getting Started Decision Tree

| Team Size | Scale | Starting Point | Tools |
|-----------|-------|----------------|-------|
| 1-5 engineers | <10 services | Metrics + logs | Prometheus + Grafana + Loki |
| 5-20 engineers | 10-50 services | Metrics + logs + basic traces | Add Jaeger, OpenTelemetry |
| 20+ engineers | 50+ services | Full observability + SLOs | Managed platform (Datadog, Grafana Cloud) |

**First step:** Implement metrics with OpenTelemetry + Prometheus

**Why this order:** Metrics give the fastest time-to-value (detecting issues), logs help you debug (understanding what happened), and traces solve complex distributed problems (debugging cross-service issues)
## Three Pillars Quick Reference

### Metrics (Quantitative, aggregated)

**When to use:** Alerting, dashboards, trend analysis

**What to collect:**
- **RED method** (services): Rate, Errors, Duration
- **USE method** (resources): Utilization, Saturation, Errors
- **Four Golden Signals**: Latency, traffic, errors, saturation

**Implementation:**
```python
# OpenTelemetry metrics
import time

from opentelemetry import metrics

meter = metrics.get_meter(__name__)
request_counter = meter.create_counter(
    "http_requests_total",
    description="Total HTTP requests"
)
request_duration = meter.create_histogram(
    "http_request_duration_seconds",
    description="HTTP request duration"
)

# Instrument a request: time the handler, then record both metrics
start = time.monotonic()
# ... handle the request ...
duration = time.monotonic() - start

request_counter.add(1, {"method": "GET", "endpoint": "/api/users"})
request_duration.record(duration, {"method": "GET", "endpoint": "/api/users"})
```
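Once these metrics are scraped, the RED numbers fall out of two PromQL queries. A sketch, assuming the metric names from the example above and Prometheus's `_bucket` suffix for histogram series:

```promql
# Request rate per endpoint (RED: Rate)
sum(rate(http_requests_total[5m])) by (endpoint)

# p95 latency per endpoint (RED: Duration)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))
```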
### Logs (Discrete events)

**When to use:** Debugging, audit trails, error investigation

**Best practices:**
- Structured logging (JSON)
- Include correlation IDs
- Don't log sensitive data (PII, secrets)

**Implementation:**
```python
import structlog

log = structlog.get_logger()
log.info(
    "user_login",
    user_id=user_id,
    correlation_id=correlation_id,
    ip_address=ip,
    duration_ms=duration
)
```
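To actually get JSON output (the first best practice above), structlog needs a processor chain. A minimal configuration sketch:

```python
import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,   # pull in bound context such as correlation_id
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),       # emit one JSON object per event
    ]
)
```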
### Traces (Request flows)

**When to use:** Debugging distributed systems, latency investigation

**Implementation:**
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    span.set_attribute("user.id", user_id)
    # Process order logic
```
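Traces only span services if the trace context travels with outbound requests. A sketch using the OpenTelemetry propagation API (the downstream URL and HTTP client are illustrative, not part of the original example):

```python
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer(__name__)

def reserve_inventory(order_id, http_client):
    # Child span for the outbound call; the parent trace continues downstream
    with tracer.start_as_current_span("inventory.reserve") as span:
        span.set_attribute("order.id", order_id)
        headers = {}
        inject(headers)  # adds W3C traceparent/tracestate headers to the carrier
        return http_client.post(
            "http://inventory/reserve",  # hypothetical downstream service
            json={"order_id": order_id},
            headers=headers,
        )
```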
## Anti-Patterns Catalog

### ❌ Vanity Metrics

**Symptom:** Tracking metrics that look impressive but don't inform decisions

**Why bad:** Wastes resources, distracts from actionable metrics

**Fix:** Only collect metrics that answer "should I page someone?" or inform business decisions

```python
# ❌ Bad - vanity metric
total_requests_all_time_counter.inc()

# ✅ Good - actionable metric (error rate is derived from this counter in queries and alerts)
request_errors_total.labels(service="api", endpoint="/users").inc()
```

---
### ❌ Alert on Everything

**Symptom:** Hundreds of alerts per day, team ignores most of them

**Why bad:** Alert fatigue, real issues get missed, on-call burnout

**Fix:** Alert only on user-impacting symptoms that require immediate action

**Test:** "If this alert fires at 2am, should someone wake up to fix it?" If not, it's not an alert. A sketch of an alert that passes this test follows.
---

### ❌ No Runbooks

**Symptom:** Alerts fire with no guidance on how to respond

**Why bad:** Increased MTTR, inconsistent responses, on-call stress

**Fix:** Every alert must link to a runbook with investigation steps

```yaml
# ✅ Good alert with runbook
alert: HighErrorRate
annotations:
  summary: "Error rate >5% on {{$labels.service}}"
  description: "Current: {{$value}}%"
  runbook: "https://wiki.company.com/runbooks/high-error-rate"
```

---
### ❌ Cardinality Explosion

**Symptom:** Metrics with unbounded labels (user IDs, timestamps, UUIDs) cause storage/performance issues

**Why bad:** Expensive storage, slow queries, potential system failure

**Fix:** Use fixed-cardinality labels, aggregate high-cardinality dimensions

```python
# ❌ Bad - unbounded cardinality
request_counter.labels(user_id=user_id).inc()  # Millions of unique series

# ✅ Good - bounded cardinality
request_counter.labels(user_type="premium", region="us-east").inc()
```

---
### ❌ Missing Correlation IDs

**Symptom:** Can't trace requests across services, debugging takes hours

**Why bad:** High MTTR, frustrated engineers, customer impact

**Fix:** Generate a correlation ID at the entry point, propagate it through all services

```python
# ✅ Good - correlation ID propagation
import uuid
from contextvars import ContextVar

correlation_id_var = ContextVar("correlation_id", default=None)

def handle_request(request):
    correlation_id = request.headers.get("X-Correlation-ID") or str(uuid.uuid4())
    correlation_id_var.set(correlation_id)

    # Every log line (and span attribute) emitted while handling the request carries the ID
    log.info("processing_request", correlation_id=correlation_id)
```
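Continuing the sketch above, the same ID should be forwarded on outbound calls so downstream logs share it (the HTTP client and URL here are illustrative):

```python
import requests  # assuming a synchronous HTTP client; any client works

def call_downstream(payload):
    # Reuse the inbound ID if set; otherwise mint one so the chain is never broken
    headers = {"X-Correlation-ID": correlation_id_var.get() or str(uuid.uuid4())}
    return requests.post("http://inventory.internal/reserve", json=payload, headers=headers)
```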
## SLI Selection Framework

**Principle:** Measure user experience, not system internals

### Four Golden Signals

| Signal | Definition | Example SLI |
|--------|------------|-------------|
| **Latency** | Request response time | p99 latency < 200ms |
| **Traffic** | Demand on system | Requests per second |
| **Errors** | Failed requests | Error rate < 0.1% |
| **Saturation** | Resource fullness | CPU < 80%, queue depth < 100 |

### RED Method (for services)

- **Rate**: Requests per second
- **Errors**: Error rate (%)
- **Duration**: Response time (p50, p95, p99)

### USE Method (for resources)

- **Utilization**: % time resource busy (CPU %, disk I/O %)
- **Saturation**: Queue depth, wait time
- **Errors**: Error count

**Decision framework:**

| Service Type | Recommended SLIs |
|--------------|------------------|
| **User-facing API** | Availability (%), p95 latency, error rate |
| **Background jobs** | Freshness (time since last run), success rate, processing time |
| **Data pipeline** | Data freshness, completeness (%), processing latency |
| **Storage** | Availability, durability, latency percentiles |
## SLO Definition Guide

**SLO = Target value for SLI**

**Formula:** `SLO = (good events / total events) >= target`

**Example:**
```
SLI: Request success rate
SLO: 99.9% of requests succeed (measured over 30 days)
Error budget: 0.1% = ~43 minutes downtime/month
```
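A sketch of what the formula looks like in PromQL, assuming a request counter with a `status` label as in the alert examples in this document:

```promql
# 30-day success-rate SLI; compare the result against 0.999
sum(rate(http_requests_total{status!~"5.."}[30d]))
  / sum(rate(http_requests_total[30d]))
```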
### Error Budget

**Definition:** Amount of unreliability you can tolerate

**Calculation:**
```
Error budget = 1 - SLO target
If SLO = 99.9%, error budget = 0.1%
For 1M requests/month: 1,000 requests can fail
```
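The same budget expressed as downtime, which is where the "~43 minutes" figure above comes from:

```python
# Error budget as allowed full downtime over a 30-day window
slo_target = 0.999
window_minutes = 30 * 24 * 60                      # 43,200 minutes
budget_minutes = (1 - slo_target) * window_minutes
print(f"{budget_minutes:.1f} minutes/month")       # 43.2
```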
**Usage:** Balance reliability vs feature velocity

### Multi-Window Multi-Burn-Rate Alerting

**Problem:** Simple threshold alerts are either too noisy or too slow

**Solution:** Alert based on how fast you're burning error budget

```yaml
# Alert when burning budget 14.4x faster than sustainable
# (at that rate, ~2% of a 30-day budget is consumed in one hour)
alert: ErrorBudgetBurnRateHigh
expr: |
  (
    rate(errors[1h]) / rate(requests[1h])
  ) > (14.4 * (1 - 0.999))
annotations:
  summary: "Burning error budget at 14.4x rate"
  runbook: "https://wiki/runbooks/error-budget-burn"
```
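The rule above uses a single long window; the full multi-window form pairs it with a short window so the alert also clears quickly once the burn stops. A sketch of the common 1h/5m pairing, keeping the metric names from the example above:

```yaml
alert: ErrorBudgetBurnRateHigh
expr: |
  (
    rate(errors[1h]) / rate(requests[1h]) > (14.4 * (1 - 0.999))
  )
  and
  (
    rate(errors[5m]) / rate(requests[5m]) > (14.4 * (1 - 0.999))
  )
annotations:
  summary: "Fast burn: 1h and 5m error rates both exceed 14.4x the budget rate"
  runbook: "https://wiki/runbooks/error-budget-burn"
```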
## Alert Design Patterns

**Principle:** Alert on symptoms (user impact), not causes (CPU high)

### Symptom-Based Alerting

```yaml
# ❌ Bad - alert on cause
alert: HighCPU
expr: cpu_usage > 0.80

# ✅ Good - alert on symptom
alert: HighLatency
expr: http_request_duration_p99 > 1.0
```
### Alert Severity Levels

| Level | When | Response Time | Example |
|-------|------|---------------|---------|
| **Critical** | User-impacting | Immediate (page) | Error rate >5%, service down |
| **Warning** | Will become critical | Next business day | Error rate >1%, disk 85% full |
| **Info** | Informational | No action needed | Deploy completed, scaling event |

**Rule:** Only page for critical. Everything else goes to dashboard/Slack.
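One way to enforce that split is routing on a `severity` label in Alertmanager. A rough sketch; the receiver settings are placeholders and the `matchers` syntax assumes a reasonably recent Alertmanager:

```yaml
route:
  receiver: slack-default          # warnings and info land in Slack/dashboards
  routes:
    - matchers:
        - severity = critical
      receiver: pagerduty-oncall   # only critical pages a human

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "REPLACE_ME"  # placeholder
  - name: slack-default
    slack_configs:
      - channel: "#alerts"
```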
## Cost Optimization Quick Reference

**Observability can cost 5-15% of infrastructure spend. The main levers:**

### Sampling Strategies

```python
# Trace sampling - collect 10% of traces
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

sampler = TraceIdRatioBased(0.1)  # 10% sampling
```
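In practice the ratio sampler is usually wrapped in a parent-based sampler so a service honors whatever sampling decision arrived with the incoming request; a minimal sketch:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 10% of new (root) traces; follow the parent's decision for everything else
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))
trace.set_tracer_provider(provider)
```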
**When to sample:**
- Traces: 1-10% for high-traffic services
- Logs: Sample debug/info logs, keep all errors
- Metrics: Don't sample (they're already aggregated)

### Retention Policies

| Data Type | Recommended Retention | Rationale |
|-----------|----------------------|-----------|
| **Metrics** | 15 days (raw), 13 months (aggregated) | Trend analysis |
| **Logs** | 7-30 days | Debugging, compliance |
| **Traces** | 7 days | Debugging recent issues |
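For raw metrics, the retention window is just a Prometheus startup flag (sketched below as docker-compose command args; long-term aggregated storage needs something like Thanos or Mimir, which is out of scope here):

```yaml
# docker-compose service snippet
prometheus:
  image: prom/prometheus
  command:
    - "--config.file=/etc/prometheus/prometheus.yml"
    - "--storage.tsdb.retention.time=15d"   # keep raw samples for 15 days
```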
### Cardinality Control

```python
# ❌ Bad - high cardinality
http_requests.labels(
    method=method,
    url=full_url,      # Unbounded!
    user_id=user_id    # Unbounded!
)

# ✅ Good - controlled cardinality
http_requests.labels(
    method=method,
    endpoint=route_pattern,  # /users/:id not /users/12345
    status_code=status
)
```
## Tool Ecosystem Quick Reference

| Category | Open Source | Managed/Commercial |
|----------|-------------|-------------------|
| **Metrics** | Prometheus, VictoriaMetrics | Datadog, New Relic, Grafana Cloud |
| **Logs** | Loki, ELK Stack | Datadog, Splunk, Sumo Logic |
| **Traces** | Jaeger, Zipkin | Datadog, Honeycomb, Lightstep |
| **All-in-One** | Grafana + Loki + Tempo + Mimir | Datadog, New Relic, Dynatrace |
| **Instrumentation** | OpenTelemetry | (vendor SDKs) |

**Recommendation:**
- **Starting out**: Prometheus + Grafana + OpenTelemetry
- **Growing (10-50 services)**: Add Loki (logs) + Jaeger (traces)
- **Scale (50+ services)**: Consider managed (Datadog, Grafana Cloud)

**Why OpenTelemetry:** Vendor-neutral, future-proof, single instrumentation for all signals
## Your First Observability Setup

**Goal:** Metrics + alerting in one week

**Day 1-2: Instrument application**

```python
# Add OpenTelemetry with a Prometheus exporter
from flask import Flask
from prometheus_client import start_http_server

from opentelemetry import metrics
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.sdk.metrics import MeterProvider

# Expose /metrics for Prometheus to scrape, then wire the reader into the SDK
start_http_server(8000)
metrics.set_meter_provider(
    MeterProvider(metric_readers=[PrometheusMetricReader()])
)

# Instrument the HTTP framework (auto-instrumentation, Flask shown here)
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)
```
**Day 3-4: Deploy Prometheus + Grafana**

```yaml
# docker-compose.yml
version: '3'
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
```
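A matching `prometheus.yml` could look like the sketch below; the `app:8000` target assumes the instrumented service from Day 1-2 is reachable under that name and port on the same network, and the rule file is the one created on Day 6:

```yaml
# prometheus.yml
global:
  scrape_interval: 15s

rule_files:
  - /etc/prometheus/prometheus-alerts.yml   # mount this alongside the config

scrape_configs:
  - job_name: "app"
    static_configs:
      - targets: ["app:8000"]               # hypothetical service exposing /metrics
```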
**Day 5: Define SLIs and SLOs**

```
SLI: HTTP request success rate
SLO: 99.9% of requests succeed (30-day window)
Error budget: 0.1% = ~43 minutes downtime/month
```

**Day 6: Create alerts**

```yaml
# prometheus-alerts.yml
groups:
  - name: slo_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate >5% on {{$labels.service}}"
          runbook: "https://wiki/runbooks/high-error-rate"
```
**Day 7: Build dashboard**

**Panels to include:**
- Error rate (%)
- Request rate (req/s)
- p50/p95/p99 latency
- CPU/memory utilization

## Common Mistakes

### ❌ Logging in Production == Debugging in Production

**Fix:** Use structured logging with correlation IDs, not print statements

---

### ❌ Alerting on Predictions, Not Reality

**Fix:** Alert on actual user impact (errors, latency) not predicted issues (disk 70% full)

---

### ❌ Dashboard Sprawl

**Fix:** One main dashboard per service showing SLIs. Delete dashboards unused for 3 months.

---

### ❌ Ignoring Alert Feedback Loop

**Fix:** Track alert precision (% that led to action). Delete alerts with <50% precision.
## Quick Reference

**Getting Started:**
- Start with metrics (Prometheus + OpenTelemetry)
- Add logs when debugging is hard (Loki)
- Add traces when issues span services (Jaeger)

**SLI Selection:**
- User-facing: Availability, latency, error rate
- Background: Freshness, success rate, processing time

**SLO Targets:**
- Start with 99% (achievable)
- Increase to 99.9% only if the business requires it
- 99.99% is very expensive (4 nines = ~52 min/year of downtime)

**Alerting:**
- Critical only = page
- Warning = next business day
- Info = dashboard only

**Cost Control:**
- Sample traces (1-10%)
- Control metric cardinality (no unbounded labels)
- Set retention policies (7-30 days logs, 15 days raw metrics)

**Tools:**
- Small: Prometheus + Grafana + Loki
- Medium: Add Jaeger
- Large: Consider Datadog, Grafana Cloud

## Bottom Line

**Start with metrics using OpenTelemetry + Prometheus. Define 3-5 SLIs based on user experience. Alert only on symptoms that require immediate action. Add logs and traces when metrics aren't enough.**

Measure what users care about, not what's easy to measure.