# Observability Examples

Production-ready observability implementations for the Grey Haven stack (Cloudflare Workers, TanStack Start, FastAPI, PostgreSQL).

## Examples Overview

### Prometheus + Grafana Setup

**File**: [prometheus-grafana-setup.md](prometheus-grafana-setup.md)

Complete monitoring stack for Kubernetes with Golden Signals:
- **Prometheus Deployment** - Helm charts, service discovery, scrape configs
- **Grafana Setup** - Dashboard-as-code, templating, alerting
- **Node Exporter** - System metrics collection (CPU, memory, disk)
- **kube-state-metrics** - Kubernetes resource metrics
- **Golden Signals** - Request rate, error rate, latency (p50/p95/p99), saturation
- **Recording Rules** - Pre-aggregated metrics for fast queries
- **Alertmanager** - PagerDuty integration, escalation policies
- **Before/After Metrics** - Response time improved 40%, MTTR reduced 60%

**Use when**: Setting up production monitoring, implementing SRE practices, cloud-native deployments
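To make the Golden Signals bullet concrete, here is a minimal Python sketch of what those numbers mean for one time window. It is not taken from the linked example and is not how Prometheus computes them (Prometheus works from counters and histogram buckets). `RequestSample`, `percentile`, and `golden_signals` are hypothetical names, the percentile uses the simple nearest-rank method, and saturation is left out because it needs resource data rather than request data.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    """One observed request (hypothetical record shape for this sketch)."""
    duration_ms: float
    status: int


def percentile(sorted_values: list[float], pct: float) -> float:
    """Nearest-rank percentile of an ascending list; pct is in (0, 100]."""
    if not sorted_values:
        raise ValueError("percentile() needs at least one sample")
    rank = math.ceil(pct * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def golden_signals(samples: list[RequestSample], window_seconds: float) -> dict[str, float]:
    """Summarize traffic, errors, and latency for one time window."""
    if not samples:
        raise ValueError("golden_signals() needs at least one sample")
    durations = sorted(s.duration_ms for s in samples)
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for s in samples if s.status >= 500)
    return {
        "request_rate_per_s": len(samples) / window_seconds,
        "error_rate": errors / len(samples),
        "latency_p50_ms": percentile(durations, 50),
        "latency_p95_ms": percentile(durations, 95),
        "latency_p99_ms": percentile(durations, 99),
    }
```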

---

### OpenTelemetry Distributed Tracing

**File**: [opentelemetry-tracing.md](opentelemetry-tracing.md)

Distributed tracing for microservices with Jaeger:
- **OTel Collector** - Receiver/processor/exporter pipelines
- **Auto-Instrumentation** - Zero-code tracing for Node.js, Python, FastAPI
- **Context Propagation** - W3C Trace Context across services
- **Sampling Strategies** - Head-based (10%), tail-based (errors only)
- **Span Attributes** - HTTP method, status code, user ID, tenant ID
- **Trace Visualization** - Jaeger UI, dependency graphs, critical path
- **Performance Impact** - <5ms overhead, 2% CPU increase
- **Before/After** - MTTR 45min → 8min (82% reduction)

**Use when**: Debugging microservices, understanding latency, optimizing critical paths
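The context propagation and head-based sampling bullets can be illustrated with a small standard-library sketch. It parses a W3C `traceparent` header (version `00`), makes a deterministic 10% sampling decision from the trace ID, and builds the header for a child span. In practice the OpenTelemetry SDK does this for you; the function names below are hypothetical, and the sampling rule only approximates ratio-based sampling rather than reproducing any SDK's exact algorithm.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version "00" - 32 hex trace-id - 16 hex parent-id - 2 hex flags
TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid per the W3C Trace Context specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def head_sample(trace_id: str, ratio: float = 0.10) -> bool:
    """Decide from the trace ID alone, so every service reaches the same verdict."""
    return int(trace_id[-16:], 16) < ratio * (1 << 64)


def child_traceparent(ctx: TraceContext) -> str:
    """Build the header to send downstream: same trace ID, new span ID."""
    span_id = secrets.token_hex(8)
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{span_id}-{flags}"
```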

---

### SLO & Error Budget Framework

**File**: [slo-error-budgets.md](slo-error-budgets.md)

Complete SLI/SLO/Error Budget implementation:
- **SLI Definition** - Availability (99.9%), latency (p95 < 200ms), error rate (< 0.5%)
- **SLO Targets** - Critical (99.95%), Essential (99.9%), Standard (99.5%)
- **Error Budget Calculation** - Monthly budget, burn rate monitoring (1h/6h/24h windows)
- **Prometheus Recording Rules** - Multi-window SLI calculations
- **Grafana SLO Dashboard** - Real-time status, budget remaining, burn rate graphs
- **Budget Policies** - Feature freeze at 25% remaining, postmortem required at depletion
- **Burn Rate Alerts** - PagerDuty escalation when burning too fast
- **Impact** - 99.95% availability achieved, 3 feature freezes prevented overspend

**Use when**: Implementing SRE practices, balancing reliability with velocity, production deployments
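The arithmetic behind the error budget bullets is short enough to sketch directly. The example below assumes a 30-day month and request-based SLIs; the function names are hypothetical, and the thresholds in `policy_action` simply mirror the budget policy listed above. With these assumptions a 99.9% target allows about 43.2 minutes of downtime per month, and a 1% error ratio against that target burns budget at roughly 10x the sustainable rate.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime for the window, e.g. slo=0.999 -> about 43.2 minutes."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return error_ratio / (1.0 - slo)


def budget_remaining(slo: float, bad_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed_bad = (1.0 - slo) * total_events
    return 1.0 - bad_events / allowed_bad


def policy_action(remaining: float) -> str:
    """Map remaining budget to the policy steps listed above."""
    if remaining <= 0.0:
        return "postmortem"
    if remaining <= 0.25:
        return "feature-freeze"
    return "ok"
```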

---

### Datadog APM Integration

**File**: [datadog-apm.md](datadog-apm.md)

Application Performance Monitoring for the Grey Haven stack:
- **Datadog Agent** - Cloudflare Workers instrumentation, FastAPI tracing
- **Custom Metrics** - Business KPIs (checkout success rate, revenue per minute)
- **Real User Monitoring (RUM)** - Frontend performance, user sessions, error tracking
- **APM Traces** - Distributed tracing with Cloudflare Workers, database queries
- **Log Correlation** - Trace ID in logs, unified troubleshooting
- **Synthetic Monitoring** - API health checks every 1 minute from 10 locations
- **Anomaly Detection** - ML-powered alerts for unusual patterns
- **Cost** - $31/host/month, $40/million spans
- **Before/After** - 99.5% → 99.95% availability (10x fewer incidents)

**Use when**: Commercial APM needed, executive dashboards required, startup budget allows
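Log correlation comes down to stamping every log line with the active trace ID. The sketch below shows the generic pattern using only the Python standard library: a `contextvars` variable holds the ID and a `logging.Filter` copies it onto each record. It does not use the Datadog tracer, which provides its own log injection; the `trace_id` field name, the `checkout` logger name, and the helper names are assumptions made for illustration.

```python
import contextvars
import json
import logging

# Set by request middleware in a real service (assumed here); "-" means no active trace.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every record passing through the handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so the log backend can index trace_id."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", "-"),
        })


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes trace-correlated JSON lines to `stream`."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger
```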

---

### Centralized Logging with Fluentd + Elasticsearch

**File**: [centralized-logging.md](centralized-logging.md)

Production log aggregation for multi-region deployments:
- **Fluentd DaemonSet** - Kubernetes log collection from all pods
- **Structured Logging** - JSON format with trace ID, user ID, tenant ID
- **Elasticsearch Indexing** - Daily indices with rollover, ILM policies
- **Kibana Dashboards** - Error tracking, request patterns, audit logs
- **Log Parsing** - Grok patterns for FastAPI, TanStack Start, PostgreSQL
- **Retention** - Hot (7 days), Warm (30 days), Cold (90 days), Archive (1 year)
- **PII Redaction** - Automatic SSN, credit card, email masking
- **Volume** - 500GB/day ingested, 90% compression, $800/month cost
- **Before/After** - Log search 5min → 10sec, disk usage 10TB → 1TB

**Use when**: Debugging production issues, compliance requirements (SOC2/PCI), audit trails
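As a rough illustration of the PII redaction bullet, the sketch below masks US-style SSNs, card numbers, and email addresses in a structured log record before it is shipped. It is an application-side Python example rather than the Fluentd configuration from the linked file. The regular expressions are deliberately simple, the placeholder tokens (`[SSN]`, `[CARD]`, `[EMAIL]`) and function names are assumptions, and a Luhn check is used so that arbitrary long digit runs are not mistaken for card numbers.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


def _luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def _mask_card(match: re.Match) -> str:
    """Mask only candidates that pass the Luhn check; leave other numbers alone."""
    digits = re.sub(r"\D", "", match.group())
    return "[CARD]" if _luhn_valid(digits) else match.group()


def redact_text(text: str) -> str:
    text = SSN_RE.sub("[SSN]", text)
    text = CARD_RE.sub(_mask_card, text)
    return EMAIL_RE.sub("[EMAIL]", text)


def redact_record(value):
    """Recursively redact every string inside a JSON-like log record."""
    if isinstance(value, str):
        return redact_text(value)
    if isinstance(value, dict):
        return {key: redact_record(item) for key, item in value.items()}
    if isinstance(value, list):
        return [redact_record(item) for item in value]
    return value
```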

---

### Chaos Engineering with Gremlin

**File**: [chaos-engineering.md](chaos-engineering.md)

Reliability testing and circuit breaker validation:
- **Gremlin Setup** - Agent deployment, blast radius configuration
- **Chaos Experiments** - Pod termination, network latency (100ms), CPU stress (80%)
- **Circuit Breaker** - Automatic fallback when error rate > 50%
- **Hypothesis** - "API handles 50% pod failures without user impact"
- **Validation** - Prometheus metrics, distributed traces, user session monitoring
- **Results** - Circuit breaker engaged in 2sec, fallback success rate 99.8%
- **Runbook** - Automatic rollback triggers, escalation procedures
- **Impact** - Found 3 critical bugs before production, confidence in resilience

**Use when**: Pre-production validation, testing disaster recovery, chaos engineering practice
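The circuit breaker bullet describes falling back once the error rate passes 50%. A minimal sketch of that behavior, under stated assumptions, might look like the following. The class, its parameters (`window`, `min_calls`, `cooldown_s`), and the simplified recovery step (close and start a fresh window after the cooldown, instead of a full half-open probe) are illustrative choices, not the implementation validated in the linked experiment.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Open the circuit when the recent error rate exceeds `threshold`."""

    def __init__(self, threshold: float = 0.5, window: int = 20, min_calls: int = 10,
                 cooldown_s: float = 30.0, clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.min_calls = min_calls
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._results: Deque[bool] = deque(maxlen=window)  # True means the call failed
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        return self._opened_at is not None

    def error_rate(self) -> float:
        return sum(self._results) / len(self._results) if self._results else 0.0

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.cooldown_s:
                return fallback()  # circuit open: skip the primary entirely
            # Cooldown elapsed: close the circuit and start a fresh window.
            self._opened_at = None
            self._results.clear()
        try:
            result = primary()
        except Exception:
            self._record(failed=True)
            return fallback()
        self._record(failed=False)
        return result

    def _record(self, failed: bool) -> None:
        self._results.append(failed)
        if len(self._results) >= self.min_calls and self.error_rate() > self.threshold:
            self._opened_at = self._clock()
```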

---

## Quick Navigation

| Topic | File | Lines | Focus |
|-------|------|-------|-------|
| **Prometheus + Grafana** | [prometheus-grafana-setup.md](prometheus-grafana-setup.md) | ~480 | Golden Signals monitoring |
| **OpenTelemetry** | [opentelemetry-tracing.md](opentelemetry-tracing.md) | ~450 | Distributed tracing |
| **SLO Framework** | [slo-error-budgets.md](slo-error-budgets.md) | ~420 | Error budget management |
| **Datadog APM** | [datadog-apm.md](datadog-apm.md) | ~400 | Commercial APM |
| **Centralized Logging** | [centralized-logging.md](centralized-logging.md) | ~440 | Log aggregation |
| **Chaos Engineering** | [chaos-engineering.md](chaos-engineering.md) | ~350 | Reliability testing |

## Related Documentation

- **Reference**: [Reference Index](../reference/INDEX.md) - PromQL, Golden Signals, SLO best practices
- **Templates**: [Templates Index](../templates/INDEX.md) - Grafana dashboards, SLO definitions
- **Main Agent**: [observability-engineer.md](../observability-engineer.md) - Observability agent

---

Return to [main agent](../observability-engineer.md)