---
name: observability-and-monitoring
description: Use when implementing metrics/logs/traces, defining SLIs/SLOs, designing alerts, choosing observability tools, debugging alert fatigue, or optimizing observability costs - provides SRE frameworks, anti-patterns, and implementation patterns
---
# Observability and Monitoring

## Overview

**Core principle:** Measure what users care about, alert on symptoms not causes, make alerts actionable.

**Rule:** Observability without actionability is just expensive logging.

**Already have observability tools (CloudWatch, Datadog, etc.)?** Optimize what you have first. Most observability problems are usage/process issues, not tooling. Implement SLIs/SLOs, clean up alerts, add runbooks with existing tools. Migrate only if you hit concrete tool limitations (cost, features, multi-cloud). Tool migration is expensive - make sure it solves a real problem.
## Getting Started Decision Tree

| Team Size | Scale | Starting Point | Tools |
|-----------|-------|----------------|-------|
| 1-5 engineers | <10 services | Metrics + logs | Prometheus + Grafana + Loki |
| 5-20 engineers | 10-50 services | Metrics + logs + basic traces | Add Jaeger, OpenTelemetry |
| 20+ engineers | 50+ services | Full observability + SLOs | Managed platform (Datadog, Grafana Cloud) |

**First step:** Implement metrics with OpenTelemetry + Prometheus

**Why this order:** Metrics give the fastest time-to-value (detecting issues), logs help you debug (understanding what happened), and traces solve complex distributed problems (debugging cross-service issues)
## Three Pillars Quick Reference

### Metrics (Quantitative, aggregated)

**When to use:** Alerting, dashboards, trend analysis

**What to collect:**
- **RED method** (services): Rate, Errors, Duration
- **USE method** (resources): Utilization, Saturation, Errors
- **Four Golden Signals**: Latency, traffic, errors, saturation

**Implementation:**
```python
# OpenTelemetry metrics
import time

from opentelemetry import metrics

meter = metrics.get_meter(__name__)
request_counter = meter.create_counter(
    "http_requests_total",
    description="Total HTTP requests"
)
request_duration = meter.create_histogram(
    "http_request_duration_seconds",
    description="HTTP request duration"
)

# Instrument a request: time the handler, then record both metrics
start = time.monotonic()
# ... handle the request ...
duration = time.monotonic() - start

request_counter.add(1, {"method": "GET", "endpoint": "/api/users"})
request_duration.record(duration, {"method": "GET", "endpoint": "/api/users"})
```
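Once these metrics are scraped, the RED numbers fall out of two PromQL queries. A sketch, assuming the metric names from the example above and Prometheus's `_bucket` suffix for histogram series:

```promql
# Request rate per endpoint (RED: Rate)
sum(rate(http_requests_total[5m])) by (endpoint)

# p95 latency per endpoint (RED: Duration)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))
```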
### Logs (Discrete events)

**When to use:** Debugging, audit trails, error investigation

**Best practices:**
- Structured logging (JSON)
- Include correlation IDs
- Don't log sensitive data (PII, secrets)

**Implementation:**
```python
import structlog

log = structlog.get_logger()
log.info(
    "user_login",
    user_id=user_id,
    correlation_id=correlation_id,
    ip_address=ip,
    duration_ms=duration
)
```
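To actually get JSON output (the first best practice above), structlog needs a processor chain. A minimal configuration sketch:

```python
import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,   # pull in bound context such as correlation_id
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),       # emit one JSON object per event
    ]
)
```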
### Traces (Request flows)

**When to use:** Debugging distributed systems, latency investigation

**Implementation:**
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    span.set_attribute("user.id", user_id)
    # Process order logic
```
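Traces only span services if the trace context travels with outbound requests. A sketch using the OpenTelemetry propagation API (the downstream URL and HTTP client are illustrative, not part of the original example):

```python
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer(__name__)

def reserve_inventory(order_id, http_client):
    # Child span for the outbound call; the parent trace continues downstream
    with tracer.start_as_current_span("inventory.reserve") as span:
        span.set_attribute("order.id", order_id)
        headers = {}
        inject(headers)  # adds W3C traceparent/tracestate headers to the carrier
        return http_client.post(
            "http://inventory/reserve",  # hypothetical downstream service
            json={"order_id": order_id},
            headers=headers,
        )
```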
## Anti-Patterns Catalog

### ❌ Vanity Metrics

**Symptom:** Tracking metrics that look impressive but don't inform decisions

**Why bad:** Wastes resources, distracts from actionable metrics

**Fix:** Only collect metrics that answer "should I page someone?" or inform business decisions

```python
# ❌ Bad - vanity metric
total_requests_all_time_counter.inc()

# ✅ Good - actionable metric (error rate is derived from this counter in queries and alerts)
request_errors_total.labels(service="api", endpoint="/users").inc()
```

---
### ❌ Alert on Everything

**Symptom:** Hundreds of alerts per day, team ignores most of them

**Why bad:** Alert fatigue, real issues get missed, on-call burnout

**Fix:** Alert only on user-impacting symptoms that require immediate action

**Test:** "If this alert fires at 2am, should someone wake up to fix it?" If not, it's not an alert. A sketch of an alert that passes this test follows.
---

### ❌ No Runbooks

**Symptom:** Alerts fire with no guidance on how to respond

**Why bad:** Increased MTTR, inconsistent responses, on-call stress

**Fix:** Every alert must link to a runbook with investigation steps

```yaml
# ✅ Good alert with runbook
alert: HighErrorRate
annotations:
  summary: "Error rate >5% on {{$labels.service}}"
  description: "Current: {{$value}}%"
  runbook: "https://wiki.company.com/runbooks/high-error-rate"
```

---
### ❌ Cardinality Explosion

**Symptom:** Metrics with unbounded labels (user IDs, timestamps, UUIDs) cause storage/performance issues

**Why bad:** Expensive storage, slow queries, potential system failure

**Fix:** Use fixed-cardinality labels, aggregate high-cardinality dimensions

```python
# ❌ Bad - unbounded cardinality
request_counter.labels(user_id=user_id).inc()  # Millions of unique series

# ✅ Good - bounded cardinality
request_counter.labels(user_type="premium", region="us-east").inc()
```

---
### ❌ Missing Correlation IDs

**Symptom:** Can't trace requests across services, debugging takes hours

**Why bad:** High MTTR, frustrated engineers, customer impact

**Fix:** Generate a correlation ID at the entry point, propagate it through all services

```python
# ✅ Good - correlation ID propagation
import uuid
from contextvars import ContextVar

correlation_id_var = ContextVar("correlation_id", default=None)

def handle_request(request):
    correlation_id = request.headers.get("X-Correlation-ID") or str(uuid.uuid4())
    correlation_id_var.set(correlation_id)

    # Every log line (and span attribute) emitted while handling the request carries the ID
    log.info("processing_request", correlation_id=correlation_id)
```
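Continuing the sketch above, the same ID should be forwarded on outbound calls so downstream logs share it (the HTTP client and URL here are illustrative):

```python
import requests  # assuming a synchronous HTTP client; any client works

def call_downstream(payload):
    # Reuse the inbound ID if set; otherwise mint one so the chain is never broken
    headers = {"X-Correlation-ID": correlation_id_var.get() or str(uuid.uuid4())}
    return requests.post("http://inventory.internal/reserve", json=payload, headers=headers)
```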
## SLI Selection Framework

**Principle:** Measure user experience, not system internals

### Four Golden Signals

| Signal | Definition | Example SLI |
|--------|------------|-------------|
| **Latency** | Request response time | p99 latency < 200ms |
| **Traffic** | Demand on system | Requests per second |
| **Errors** | Failed requests | Error rate < 0.1% |
| **Saturation** | Resource fullness | CPU < 80%, queue depth < 100 |

### RED Method (for services)

- **Rate**: Requests per second
- **Errors**: Error rate (%)
- **Duration**: Response time (p50, p95, p99)

### USE Method (for resources)

- **Utilization**: % time resource busy (CPU %, disk I/O %)
- **Saturation**: Queue depth, wait time
- **Errors**: Error count

**Decision framework:**

| Service Type | Recommended SLIs |
|--------------|------------------|
| **User-facing API** | Availability (%), p95 latency, error rate |
| **Background jobs** | Freshness (time since last run), success rate, processing time |
| **Data pipeline** | Data freshness, completeness (%), processing latency |
| **Storage** | Availability, durability, latency percentiles |
## SLO Definition Guide

**SLO = Target value for SLI**

**Formula:** `SLO = (good events / total events) >= target`

**Example:**
```
SLI: Request success rate
SLO: 99.9% of requests succeed (measured over 30 days)
Error budget: 0.1% = ~43 minutes downtime/month
```
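A sketch of what the formula looks like in PromQL, assuming a request counter with a `status` label as in the alert examples in this document:

```promql
# 30-day success-rate SLI; compare the result against 0.999
sum(rate(http_requests_total{status!~"5.."}[30d]))
  / sum(rate(http_requests_total[30d]))
```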
### Error Budget

**Definition:** Amount of unreliability you can tolerate

**Calculation:**
```
Error budget = 1 - SLO target
If SLO = 99.9%, error budget = 0.1%
For 1M requests/month: 1,000 requests can fail
```
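The same budget expressed as downtime, which is where the "~43 minutes" figure above comes from:

```python
# Error budget as allowed full downtime over a 30-day window
slo_target = 0.999
window_minutes = 30 * 24 * 60                      # 43,200 minutes
budget_minutes = (1 - slo_target) * window_minutes
print(f"{budget_minutes:.1f} minutes/month")       # 43.2
```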
**Usage:** Balance reliability vs feature velocity

### Multi-Window Multi-Burn-Rate Alerting

**Problem:** Simple threshold alerts are either too noisy or too slow

**Solution:** Alert based on how fast you're burning error budget

```yaml
# Alert when burning budget 14.4x faster than sustainable
# (at that rate, ~2% of a 30-day budget is consumed in one hour)
alert: ErrorBudgetBurnRateHigh
expr: |
  (
    rate(errors[1h]) / rate(requests[1h])
  ) > (14.4 * (1 - 0.999))
annotations:
  summary: "Burning error budget at 14.4x rate"
  runbook: "https://wiki/runbooks/error-budget-burn"
```
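The rule above uses a single long window; the full multi-window form pairs it with a short window so the alert also clears quickly once the burn stops. A sketch of the common 1h/5m pairing, keeping the metric names from the example above:

```yaml
alert: ErrorBudgetBurnRateHigh
expr: |
  (
    rate(errors[1h]) / rate(requests[1h]) > (14.4 * (1 - 0.999))
  )
  and
  (
    rate(errors[5m]) / rate(requests[5m]) > (14.4 * (1 - 0.999))
  )
annotations:
  summary: "Fast burn: 1h and 5m error rates both exceed 14.4x the budget rate"
  runbook: "https://wiki/runbooks/error-budget-burn"
```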
## Alert Design Patterns

**Principle:** Alert on symptoms (user impact), not causes (CPU high)

### Symptom-Based Alerting

```yaml
# ❌ Bad - alert on cause
alert: HighCPU
expr: cpu_usage > 0.80

# ✅ Good - alert on symptom
alert: HighLatency
expr: http_request_duration_p99 > 1.0
```
### Alert Severity Levels

| Level | When | Response Time | Example |
|-------|------|---------------|---------|
| **Critical** | User-impacting | Immediate (page) | Error rate >5%, service down |
| **Warning** | Will become critical | Next business day | Error rate >1%, disk 85% full |
| **Info** | Informational | No action needed | Deploy completed, scaling event |

**Rule:** Only page for critical. Everything else goes to dashboard/Slack.
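One way to enforce that split is routing on a `severity` label in Alertmanager. A rough sketch; the receiver settings are placeholders and the `matchers` syntax assumes a reasonably recent Alertmanager:

```yaml
route:
  receiver: slack-default          # warnings and info land in Slack/dashboards
  routes:
    - matchers:
        - severity = critical
      receiver: pagerduty-oncall   # only critical pages a human

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "REPLACE_ME"  # placeholder
  - name: slack-default
    slack_configs:
      - channel: "#alerts"
```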
## Cost Optimization Quick Reference

**Observability can cost 5-15% of infrastructure spend. The main levers:**

### Sampling Strategies

```python
# Trace sampling - collect 10% of traces
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

sampler = TraceIdRatioBased(0.1)  # 10% sampling
```
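In practice the ratio sampler is usually wrapped in a parent-based sampler so a service honors whatever sampling decision arrived with the incoming request; a minimal sketch:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 10% of new (root) traces; follow the parent's decision for everything else
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))
trace.set_tracer_provider(provider)
```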
**When to sample:**
- Traces: 1-10% for high-traffic services
- Logs: Sample debug/info logs, keep all errors
- Metrics: Don't sample (they're already aggregated)

### Retention Policies

| Data Type | Recommended Retention | Rationale |
|-----------|----------------------|-----------|
| **Metrics** | 15 days (raw), 13 months (aggregated) | Trend analysis |
| **Logs** | 7-30 days | Debugging, compliance |
| **Traces** | 7 days | Debugging recent issues |
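For raw metrics, the retention window is just a Prometheus startup flag (sketched below as docker-compose command args; long-term aggregated storage needs something like Thanos or Mimir, which is out of scope here):

```yaml
# docker-compose service snippet
prometheus:
  image: prom/prometheus
  command:
    - "--config.file=/etc/prometheus/prometheus.yml"
    - "--storage.tsdb.retention.time=15d"   # keep raw samples for 15 days
```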
### Cardinality Control

```python
# ❌ Bad - high cardinality
http_requests.labels(
    method=method,
    url=full_url,      # Unbounded!
    user_id=user_id    # Unbounded!
)

# ✅ Good - controlled cardinality
http_requests.labels(
    method=method,
    endpoint=route_pattern,  # /users/:id not /users/12345
    status_code=status
)
```
## Tool Ecosystem Quick Reference

| Category | Open Source | Managed/Commercial |
|----------|-------------|-------------------|
| **Metrics** | Prometheus, VictoriaMetrics | Datadog, New Relic, Grafana Cloud |
| **Logs** | Loki, ELK Stack | Datadog, Splunk, Sumo Logic |
| **Traces** | Jaeger, Zipkin | Datadog, Honeycomb, Lightstep |
| **All-in-One** | Grafana + Loki + Tempo + Mimir | Datadog, New Relic, Dynatrace |
| **Instrumentation** | OpenTelemetry | (vendor SDKs) |

**Recommendation:**
- **Starting out**: Prometheus + Grafana + OpenTelemetry
- **Growing (10-50 services)**: Add Loki (logs) + Jaeger (traces)
- **Scale (50+ services)**: Consider managed (Datadog, Grafana Cloud)

**Why OpenTelemetry:** Vendor-neutral, future-proof, single instrumentation for all signals
## Your First Observability Setup

**Goal:** Metrics + alerting in one week

**Day 1-2: Instrument application**

```python
# Add OpenTelemetry with a Prometheus exporter
from flask import Flask
from prometheus_client import start_http_server

from opentelemetry import metrics
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.sdk.metrics import MeterProvider

# Expose /metrics for Prometheus to scrape, then wire the reader into the SDK
start_http_server(8000)
metrics.set_meter_provider(
    MeterProvider(metric_readers=[PrometheusMetricReader()])
)

# Instrument the HTTP framework (auto-instrumentation, Flask shown here)
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)
```
**Day 3-4: Deploy Prometheus + Grafana**

```yaml
# docker-compose.yml
version: '3'
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
```
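A matching `prometheus.yml` could look like the sketch below; the `app:8000` target assumes the instrumented service from Day 1-2 is reachable under that name and port on the same network, and the rule file is the one created on Day 6:

```yaml
# prometheus.yml
global:
  scrape_interval: 15s

rule_files:
  - /etc/prometheus/prometheus-alerts.yml   # mount this alongside the config

scrape_configs:
  - job_name: "app"
    static_configs:
      - targets: ["app:8000"]               # hypothetical service exposing /metrics
```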
**Day 5: Define SLIs and SLOs**

```
SLI: HTTP request success rate
SLO: 99.9% of requests succeed (30-day window)
Error budget: 0.1% = ~43 minutes downtime/month
```

**Day 6: Create alerts**

```yaml
# prometheus-alerts.yml
groups:
  - name: slo_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate >5% on {{$labels.service}}"
          runbook: "https://wiki/runbooks/high-error-rate"
```
**Day 7: Build dashboard**

**Panels to include:**
- Error rate (%)
- Request rate (req/s)
- p50/p95/p99 latency
- CPU/memory utilization

## Common Mistakes

### ❌ Logging in Production == Debugging in Production

**Fix:** Use structured logging with correlation IDs, not print statements

---

### ❌ Alerting on Predictions, Not Reality

**Fix:** Alert on actual user impact (errors, latency) not predicted issues (disk 70% full)

---

### ❌ Dashboard Sprawl

**Fix:** One main dashboard per service showing SLIs. Delete dashboards unused for 3 months.

---

### ❌ Ignoring Alert Feedback Loop

**Fix:** Track alert precision (% that led to action). Delete alerts with <50% precision.
## Quick Reference

**Getting Started:**
- Start with metrics (Prometheus + OpenTelemetry)
- Add logs when debugging is hard (Loki)
- Add traces when issues span services (Jaeger)

**SLI Selection:**
- User-facing: Availability, latency, error rate
- Background: Freshness, success rate, processing time

**SLO Targets:**
- Start with 99% (achievable)
- Increase to 99.9% only if the business requires it
- 99.99% is very expensive (4 nines = ~52 min/year of downtime)

**Alerting:**
- Critical only = page
- Warning = next business day
- Info = dashboard only

**Cost Control:**
- Sample traces (1-10%)
- Control metric cardinality (no unbounded labels)
- Set retention policies (7-30 days logs, 15 days raw metrics)

**Tools:**
- Small: Prometheus + Grafana + Loki
- Medium: Add Jaeger
- Large: Consider Datadog, Grafana Cloud

## Bottom Line

**Start with metrics using OpenTelemetry + Prometheus. Define 3-5 SLIs based on user experience. Alert only on symptoms that require immediate action. Add logs and traces when metrics aren't enough.**

Measure what users care about, not what's easy to measure.