# Observability Examples

Production-ready observability implementations for the Grey Haven stack (Cloudflare Workers, TanStack Start, FastAPI, PostgreSQL).

## Examples Overview
### Prometheus + Grafana Setup

File: `prometheus-grafana-setup.md`

A complete Kubernetes monitoring stack built around the Golden Signals:
- Prometheus Deployment - Helm charts, service discovery, scrape configs
- Grafana Setup - Dashboard-as-code, templating, alerting
- Node Exporter - System metrics collection (CPU, memory, disk)
- kube-state-metrics - Kubernetes resource metrics
- Golden Signals - Request rate, error rate, latency (p50/p95/p99), saturation (see the instrumentation sketch below)
- Recording Rules - Pre-aggregated metrics for fast queries
- Alert Manager - PagerDuty integration, escalation policies
- Before/After Metrics - Response time improved by 40%, MTTR reduced by 60%
Use when: Setting up production monitoring, implementing SRE practices, cloud-native deployments
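For a sense of what feeds these dashboards, here is a minimal golden-signal instrumentation sketch for a FastAPI service using `prometheus_client`. Metric names, label sets, and bucket boundaries are illustrative assumptions, not taken from the example file; request rate and error rate are then derived in PromQL from the counter.

```python
import time

from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram, make_asgi_app

# Hypothetical metric names; rate() and error-ratio queries are done in PromQL.
REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "path", "status"],
)
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds",
    ["method", "path"],
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5),  # supports p50/p95/p99
)

app = FastAPI()
app.mount("/metrics", make_asgi_app())  # Prometheus scrape target

@app.middleware("http")
async def record_golden_signals(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    REQUESTS.labels(request.method, request.url.path, str(response.status_code)).inc()
    LATENCY.labels(request.method, request.url.path).observe(time.perf_counter() - start)
    return response
```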
### OpenTelemetry Distributed Tracing

File: `opentelemetry-tracing.md`

Distributed tracing for microservices with Jaeger:
- OTel Collector - Receiver/processor/exporter pipelines
- Auto-Instrumentation - Zero-code tracing for Node.js and Python (including FastAPI)
- Context Propagation - W3C Trace Context across services
- Sampling Strategies - Head-based (10%), tail-based (errors only)
- Span Attributes - HTTP method, status code, user ID, tenant ID (see the SDK sketch below)
- Trace Visualization - Jaeger UI, dependency graphs, critical path
- Performance Impact - <5 ms per-request overhead, ~2% CPU increase
- Before/After - MTTR 45min → 8min (82% reduction)
Use when: Debugging microservices, understanding latency, optimizing critical paths
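The example file leans on auto-instrumentation; as a rough illustration of the equivalent manual SDK setup, here is a Python sketch that exports over OTLP to a collector with 10% head-based sampling and sets the span attributes listed above. The service name, collector endpoint, and attribute keys are assumptions for this sketch.

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Collector endpoint and service name are assumed values.
provider = TracerProvider(
    resource=Resource.create({"service.name": "checkout-api"}),
    sampler=ParentBased(TraceIdRatioBased(0.10)),  # 10% head-based sampling
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def charge_card(user_id: str, tenant_id: str) -> None:
    # Attribute keys mirror the list above; exact names are illustrative.
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("user.id", user_id)
        span.set_attribute("tenant.id", tenant_id)
        ...  # business logic
```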
### SLO & Error Budget Framework

File: `slo-error-budgets.md`

A complete SLI/SLO/error-budget implementation:
- SLI Definition - Availability (99.9%), latency (p95 < 200ms), error rate (< 0.5%)
- SLO Targets - Critical (99.95%), Essential (99.9%), Standard (99.5%)
- Error Budget Calculation - Monthly budget, burn-rate monitoring over 1h/6h/24h windows (worked example below)
- Prometheus Recording Rules - Multi-window SLI calculations
- Grafana SLO Dashboard - Real-time status, budget remaining, burn rate graphs
- Budget Policies - Feature freeze at 25% remaining, postmortem required at depletion
- Burn Rate Alerts - PagerDuty escalation when burning too fast
- Impact - 99.95% availability achieved; three feature freezes prevented budget overspend
Use when: Implementing SRE practices, balancing reliability with velocity, production deployments
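The budget arithmetic is worth sanity-checking by hand. Here is a worked sketch for the 99.9% tier, with the burn-rate idea reduced to one function; in production the multi-window evaluation lives in the Prometheus recording rules, not application code.

```python
SLO_TARGET = 0.999                       # "Essential" tier from the list above
MINUTES_PER_MONTH = 30 * 24 * 60         # 43,200 minutes
budget_minutes = (1 - SLO_TARGET) * MINUTES_PER_MONTH
print(f"Monthly error budget: {budget_minutes:.1f} min")   # 43.2 min

def burn_rate(error_ratio: float, slo_target: float = SLO_TARGET) -> float:
    """Multiples of the sustainable burn: 1.0 exhausts the budget exactly
    at month end; alerts fire when short windows (1h/6h/24h) run hot."""
    return error_ratio / (1 - slo_target)

# 0.5% errors in the last hour against a 0.1% budget burns 5x too fast:
print(burn_rate(0.005))                  # 5.0 -> budget gone in ~6 days
```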
### DataDog APM Integration

File: `datadog-apm.md`

Application performance monitoring for the Grey Haven stack:
- DataDog Agent - Cloudflare Workers instrumentation, FastAPI tracing
- Custom Metrics - Business KPIs such as checkout success rate and revenue per minute (emission sketch below)
- Real User Monitoring (RUM) - Frontend performance, user sessions, error tracking
- APM Traces - Distributed tracing with Cloudflare Workers, database queries
- Log Correlation - Trace ID in logs, unified troubleshooting
- Synthetic Monitoring - API health checks every minute from 10 locations
- Anomaly Detection - ML-powered alerts for unusual patterns
- Cost - $31/host/month, $40/million spans
- Before/After - 99.5% → 99.95% availability (10x fewer incidents)
Use when: Commercial APM needed, executive dashboards required, startup budget allows
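A minimal sketch of emitting the custom business KPIs above through DogStatsD with the `datadog` Python package; the metric names, tags, and agent address are illustrative assumptions, not the example file's exact setup.

```python
from datadog import initialize, statsd

# The DogStatsD agent address is an assumption (default local agent).
initialize(statsd_host="127.0.0.1", statsd_port=8125)

def record_checkout(succeeded: bool, amount_usd: float, tenant_id: str) -> None:
    # Hypothetical metric and tag names for the KPIs listed above.
    tags = [f"tenant:{tenant_id}", f"outcome:{'ok' if succeeded else 'fail'}"]
    statsd.increment("checkout.attempts", tags=tags)
    if succeeded:
        # Graphed as a per-minute rate to yield "revenue per minute".
        statsd.increment("checkout.revenue_usd", value=amount_usd, tags=tags)

record_checkout(succeeded=True, amount_usd=49.99, tenant_id="acme")
```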
### Centralized Logging with Fluentd + Elasticsearch

File: `centralized-logging.md`

Production log aggregation for multi-region deployments:
- Fluentd DaemonSet - Kubernetes log collection from all pods
- Structured Logging - JSON format with trace ID, user ID, and tenant ID (formatter sketch below)
- Elasticsearch Indexing - Daily indices with rollover, ILM policies
- Kibana Dashboards - Error tracking, request patterns, audit logs
- Log Parsing - Grok patterns for FastAPI, TanStack Start, PostgreSQL
- Retention - Hot (7 days), Warm (30 days), Cold (90 days), Archive (1 year)
- PII Redaction - Automatic SSN, credit card, email masking
- Volume - 500GB/day ingested, 90% compression, $800/month cost
- Before/After - Log search 5min → 10sec, disk usage 10TB → 1TB
Use when: Debugging production issues, compliance requirements (SOC2/PCI), audit trails
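On the application side, structured logging with trace correlation fields and basic PII masking can look like the sketch below. Field names and regexes are illustrative assumptions; redaction can equally be enforced in the Fluentd pipeline.

```python
import json
import logging
import re

# Illustrative patterns; the example file also covers credit card masking.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        msg = SSN.sub("***-**-****", EMAIL.sub("<email>", record.getMessage()))
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": msg,
            # Correlation fields, passed per-call via logging's `extra` kwarg.
            "trace_id": getattr(record, "trace_id", None),
            "tenant_id": getattr(record, "tenant_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("signup from jane@example.com",
         extra={"trace_id": "abc123", "tenant_id": "acme"})
```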
### Chaos Engineering with Gremlin

File: `chaos-engineering.md`

Reliability testing and circuit-breaker validation:
- Gremlin Setup - Agent deployment, blast radius configuration
- Chaos Experiments - Pod termination, network latency (100ms), CPU stress (80%)
- Circuit Breaker - Automatic fallback when the error rate exceeds 50% (pattern sketched below)
- Hypothesis - "API handles 50% pod failures without user impact"
- Validation - Prometheus metrics, distributed traces, user session monitoring
- Results - Circuit breaker engaged within 2 seconds; fallback success rate 99.8%
- Runbook - Automatic rollback triggers, escalation procedures
- Impact - Found 3 critical bugs before production; increased confidence in system resilience
Use when: Pre-production validation, testing disaster recovery, chaos engineering practice
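The circuit-breaker behavior those experiments validate can be sketched independently of Gremlin. Here is a minimal Python version using the 50% threshold from above; the window size and cooldown are assumed values.

```python
import time
from collections import deque

class CircuitBreaker:
    """Opens when the rolling error rate exceeds `threshold` (50% above);
    window size and cooldown are assumptions for this sketch."""

    def __init__(self, threshold: float = 0.5, window: int = 20, cooldown_s: float = 30.0):
        self.results = deque(maxlen=window)   # rolling success/failure record
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None and time.monotonic() - self.opened_at < self.cooldown_s:
            return fallback()                 # open: fail fast to the fallback
        try:
            result = fn()                     # closed, or half-open probe
            self.results.append(True)
            self.opened_at = None             # probe succeeded: close circuit
            return result
        except Exception:
            self.results.append(False)
            if self.results.count(False) / len(self.results) > self.threshold:
                self.opened_at = time.monotonic()
            return fallback()
```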
## Quick Navigation
| Topic | File | Lines | Focus |
|---|---|---|---|
| Prometheus + Grafana | `prometheus-grafana-setup.md` | ~480 | Golden Signals monitoring |
| OpenTelemetry | `opentelemetry-tracing.md` | ~450 | Distributed tracing |
| SLO Framework | `slo-error-budgets.md` | ~420 | Error budget management |
| DataDog APM | `datadog-apm.md` | ~400 | Commercial APM |
| Centralized Logging | `centralized-logging.md` | ~440 | Log aggregation |
| Chaos Engineering | `chaos-engineering.md` | ~350 | Reliability testing |
## Related Documentation
- Reference: Reference Index - PromQL, Golden Signals, SLO best practices
- Templates: Templates Index - Grafana dashboards, SLO definitions
- Main Agent: `observability-engineer.md` - Observability agent