# Observability Examples

Production-ready observability implementations for the Grey Haven stack (Cloudflare Workers, TanStack Start, FastAPI, PostgreSQL).

## Examples Overview

### Prometheus + Grafana Setup

**File**: [prometheus-grafana-setup.md](prometheus-grafana-setup.md)

Complete monitoring stack for Kubernetes with Golden Signals:

- **Prometheus Deployment** - Helm charts, service discovery, scrape configs
- **Grafana Setup** - Dashboard-as-code, templating, alerting
- **Node Exporter** - System metrics collection (CPU, memory, disk)
- **kube-state-metrics** - Kubernetes resource metrics
- **Golden Signals** - Request rate, error rate, latency (p50/p95/p99), saturation
- **Recording Rules** - Pre-aggregated metrics for fast queries
- **Alertmanager** - PagerDuty integration, escalation policies
- **Before/After Metrics** - Response time improved 40%, MTTR reduced 60%

**Use when**: Setting up production monitoring, implementing SRE practices, cloud-native deployments

---

### OpenTelemetry Distributed Tracing

**File**: [opentelemetry-tracing.md](opentelemetry-tracing.md)

Distributed tracing for microservices with Jaeger:

- **OTel Collector** - Receiver/processor/exporter pipelines
- **Auto-Instrumentation** - Zero-code tracing for Node.js, Python, FastAPI (see the sketch at the end of this section)
- **Context Propagation** - W3C Trace Context across services
- **Sampling Strategies** - Head-based (10%), tail-based (errors only)
- **Span Attributes** - HTTP method, status code, user ID, tenant ID
- **Trace Visualization** - Jaeger UI, dependency graphs, critical path
- **Performance Impact** - <5ms overhead, 2% CPU increase
- **Before/After** - MTTR 45min → 8min (82% reduction)

**Use when**: Debugging microservices, understanding latency, optimizing critical paths
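The list above mentions zero-code FastAPI instrumentation and 10% head-based sampling; here is a minimal sketch of that setup, assuming the `opentelemetry-sdk`, `opentelemetry-exporter-otlp`, and `opentelemetry-instrumentation-fastapi` packages are installed. The service name and collector endpoint are illustrative placeholders, not values taken from the linked example.

```python
# Sketch only: FastAPI auto-instrumentation with OTLP export and 10% head-based sampling.
# The endpoint and service.name below are assumptions, not Grey Haven defaults.
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

provider = TracerProvider(
    resource=Resource.create({"service.name": "api"}),
    sampler=TraceIdRatioBased(0.1),  # head-based sampling: keep ~10% of traces
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)  # zero-code spans for every request/response

@app.get("/healthz")
def healthz() -> dict:
    # Attributes such as user ID or tenant ID can be attached to the active span.
    trace.get_current_span().set_attribute("tenant.id", "example")
    return {"status": "ok"}
```

Batching spans before export, rather than exporting per request, is what keeps per-request overhead in the low-millisecond range cited above.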
---

### SLO & Error Budget Framework

**File**: [slo-error-budgets.md](slo-error-budgets.md)

Complete SLI/SLO/Error Budget implementation:

- **SLI Definition** - Availability (99.9%), latency (p95 < 200ms), error rate (< 0.5%)
- **SLO Targets** - Critical (99.95%), Essential (99.9%), Standard (99.5%)
- **Error Budget Calculation** - Monthly budget, burn rate monitoring (1h/6h/24h windows)
- **Prometheus Recording Rules** - Multi-window SLI calculations
- **Grafana SLO Dashboard** - Real-time status, budget remaining, burn rate graphs
- **Budget Policies** - Feature freeze at 25% remaining, postmortem required at depletion
- **Burn Rate Alerts** - PagerDuty escalation when the budget burns too fast
- **Impact** - 99.95% availability achieved, 3 feature freezes prevented overspend

**Use when**: Implementing SRE practices, balancing reliability with velocity, production deployments

---

### DataDog APM Integration

**File**: [datadog-apm.md](datadog-apm.md)

Application Performance Monitoring for the Grey Haven stack:

- **DataDog Agent** - Cloudflare Workers instrumentation, FastAPI tracing
- **Custom Metrics** - Business KPIs (checkout success rate, revenue per minute)
- **Real User Monitoring (RUM)** - Frontend performance, user sessions, error tracking
- **APM Traces** - Distributed tracing with Cloudflare Workers, database queries
- **Log Correlation** - Trace ID in logs, unified troubleshooting
- **Synthetic Monitoring** - API health checks every 1 minute from 10 locations
- **Anomaly Detection** - ML-powered alerts for unusual patterns
- **Cost** - $31/host/month, $40/million spans
- **Before/After** - 99.5% → 99.95% availability (10x fewer incidents)

**Use when**: Commercial APM needed, executive dashboards required, startup budget allows

---

### Centralized Logging with Fluentd + Elasticsearch

**File**: [centralized-logging.md](centralized-logging.md)

Production log aggregation for multi-region deployments:

- **Fluentd DaemonSet** - Kubernetes log collection from all pods
- **Structured Logging** - JSON format with trace ID, user ID, tenant ID (see the sketch at the end of this page)
- **Elasticsearch Indexing** - Daily indices with rollover, ILM policies
- **Kibana Dashboards** - Error tracking, request patterns, audit logs
- **Log Parsing** - Grok patterns for FastAPI, TanStack Start, PostgreSQL
- **Retention** - Hot (7 days), Warm (30 days), Cold (90 days), Archive (1 year)
- **PII Redaction** - Automatic SSN, credit card, email masking
- **Volume** - 500GB/day ingested, 90% compression, $800/month cost
- **Before/After** - Log search 5min → 10sec, disk usage 10TB → 1TB

**Use when**: Debugging production issues, compliance requirements (SOC2/PCI), audit trails

---

### Chaos Engineering with Gremlin

**File**: [chaos-engineering.md](chaos-engineering.md)

Reliability testing and circuit breaker validation:

- **Gremlin Setup** - Agent deployment, blast radius configuration
- **Chaos Experiments** - Pod termination, network latency (100ms), CPU stress (80%)
- **Circuit Breaker** - Automatic fallback when error rate > 50%
- **Hypothesis** - "API handles 50% pod failures without user impact"
- **Validation** - Prometheus metrics, distributed traces, user session monitoring
- **Results** - Circuit breaker engaged within 2 seconds, fallback success rate 99.8%
- **Runbook** - Automatic rollback triggers, escalation procedures
- **Impact** - Found 3 critical bugs before production, increased confidence in resilience

**Use when**: Pre-production validation, testing disaster recovery, practicing chaos engineering

---

## Quick Navigation

| Topic | File | Lines | Focus |
|-------|------|-------|-------|
| **Prometheus + Grafana** | [prometheus-grafana-setup.md](prometheus-grafana-setup.md) | ~480 | Golden Signals monitoring |
| **OpenTelemetry** | [opentelemetry-tracing.md](opentelemetry-tracing.md) | ~450 | Distributed tracing |
| **SLO Framework** | [slo-error-budgets.md](slo-error-budgets.md) | ~420 | Error budget management |
| **DataDog APM** | [datadog-apm.md](datadog-apm.md) | ~400 | Commercial APM |
| **Centralized Logging** | [centralized-logging.md](centralized-logging.md) | ~440 | Log aggregation |
| **Chaos Engineering** | [chaos-engineering.md](chaos-engineering.md) | ~350 | Reliability testing |

## Related Documentation

- **Reference**: [Reference Index](../reference/INDEX.md) - PromQL, Golden Signals, SLO best practices
- **Templates**: [Templates Index](../templates/INDEX.md) - Grafana dashboards, SLO definitions
- **Main Agent**: [observability-engineer.md](../observability-engineer.md) - Observability agent

---

Return to [main agent](../observability-engineer.md)
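---

As a companion to the structured-logging and log-correlation items above, here is a minimal sketch of JSON log lines that carry the active OpenTelemetry trace ID so they can be joined with traces in Kibana. It assumes the OpenTelemetry API from the tracing sketch; the field names (`trace_id`, `user_id`, `tenant_id`) are illustrative, not a confirmed log schema.

```python
# Sketch only: stdlib logging with a JSON formatter that injects the current trace context.
import json
import logging

from opentelemetry import trace  # assumes opentelemetry-api is installed

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        ctx = trace.get_current_span().get_span_context()
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Valid trace/span IDs are rendered as hex; outside a trace they stay null.
            "trace_id": format(ctx.trace_id, "032x") if ctx.is_valid else None,
            "span_id": format(ctx.span_id, "016x") if ctx.is_valid else None,
            # Optional business fields supplied via logging's `extra=` argument.
            **{k: v for k, v in record.__dict__.items() if k in ("user_id", "tenant_id")},
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("checkout").info("payment accepted", extra={"tenant_id": "t-42"})
```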