Initial commit
`skills/observability-engineering/SKILL.md` (new file, 26 lines)

# Observability Engineering Skill

Production-ready monitoring, logging, and tracing using Prometheus, Grafana, OpenTelemetry, DataDog, and Sentry.

## Description

Comprehensive observability setup including SLO implementation, distributed tracing, dashboards, and incident prevention.

## What's Included

- **Examples**: Prometheus configs, Grafana dashboards, SLO definitions
- **Reference**: Observability best practices, monitoring strategies
- **Templates**: Dashboard templates, alert rules

## Use When

- Setting up production monitoring
- Implementing SLOs
- Distributed tracing
- Performance tracking

## Related Agents

- `observability-engineer`

**Skill Version**: 1.0

---

# Observability Engineering Setup Checklist

Comprehensive checklist for implementing production-grade observability with logs, metrics, traces, and alerts.

## Pre-Implementation Planning

- [ ] **Define observability goals** (debug production issues, monitor SLAs, detect anomalies)
- [ ] **Choose observability stack**:
  - [ ] Logging: Pino (Node.js), structlog (Python), CloudWatch, Datadog
  - [ ] Metrics: Prometheus, Datadog, CloudWatch
  - [ ] Tracing: OpenTelemetry, Datadog APM, Jaeger
  - [ ] Visualization: Grafana, Datadog, Honeycomb
- [ ] **Set up observability infrastructure** (collectors, storage, dashboards)
- [ ] **Define data retention** policies (logs: 30 days, metrics: 1 year, traces: 7 days)
- [ ] **Plan for scale** (log volume, metric cardinality, trace sampling)

## Structured Logging

### Logger Configuration

- [ ] **Structured logger installed**:
  - Node.js: `pino` with `pino-pretty` for dev
  - Python: `structlog` with JSON formatter
  - Browser: Custom JSON logger or service integration
- [ ] **Log levels defined**:
  - [ ] TRACE: Very detailed debugging
  - [ ] DEBUG: Detailed debugging info
  - [ ] INFO: General informational messages
  - [ ] WARN: Warning messages (recoverable issues)
  - [ ] ERROR: Error messages (failures)
  - [ ] FATAL: Critical failures (application crash)
- [ ] **Environment-based configuration**:
  - [ ] Development: Pretty-printed logs, DEBUG level
  - [ ] Production: JSON logs, INFO level
  - [ ] Test: Silent or minimal logs

### Log Structure

- [ ] **Standard log format** across services:

```json
{
  "level": "info",
  "timestamp": "2025-11-10T10:30:00.000Z",
  "service": "api-server",
  "environment": "production",
  "tenant_id": "uuid",
  "user_id": "uuid",
  "request_id": "uuid",
  "message": "User logged in",
  "duration_ms": 150,
  "http": {
    "method": "POST",
    "path": "/api/login",
    "status": 200,
    "user_agent": "Mozilla/5.0..."
  }
}
```

- [ ] **Correlation IDs** added:
  - [ ] `request_id`: Unique per request
  - [ ] `session_id`: Unique per session
  - [ ] `trace_id`: Unique per distributed trace
  - [ ] `tenant_id`: Multi-tenant context
- [ ] **Context propagation** through the request lifecycle
- [ ] **Sensitive data redacted** (passwords, tokens, credit cards)

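Redaction can be sketched as a recursive scrub over log fields before serialization. The key list here is illustrative, not exhaustive:

```typescript
// Recursively replace sensitive fields before a log entry is serialized.
// Extend SENSITIVE_KEYS for your own payloads.
const SENSITIVE_KEYS = new Set(["password", "token", "authorization", "credit_card"]);

function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [k, v] of Object.entries(value as Record<string, unknown>)) {
      out[k] = SENSITIVE_KEYS.has(k.toLowerCase()) ? "[REDACTED]" : redact(v);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}

const entry = redact({
  message: "User logged in",
  user: { id: "u-1", password: "hunter2" },
});
// entry.user.password is now "[REDACTED]"; other fields are untouched
```

Production loggers offer this as configuration (for example, redaction paths in Pino), which is usually preferable to hand-rolling.
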
### What to Log

- [ ] **Request/response logging**:
  - [ ] HTTP method, path, status code
  - [ ] Request duration
  - [ ] User agent, IP address (hashed or anonymized)
  - [ ] Query parameters (non-sensitive)
- [ ] **Authentication events**:
  - [ ] Login success/failure
  - [ ] Logout
  - [ ] Token refresh
  - [ ] Permission checks
- [ ] **Business events**:
  - [ ] User registration
  - [ ] Payment processing
  - [ ] Data exports
  - [ ] Admin actions
- [ ] **Errors and exceptions**:
  - [ ] Error message
  - [ ] Stack trace
  - [ ] Error context (what the user was doing)
  - [ ] Affected resources (user_id, tenant_id, entity_id)
- [ ] **Performance metrics**:
  - [ ] Database query times
  - [ ] External API call times
  - [ ] Cache hit/miss rates
  - [ ] Background job durations

### Log Aggregation

- [ ] **Logs shipped** to a central location:
  - [ ] CloudWatch Logs
  - [ ] Datadog Logs
  - [ ] Elasticsearch (ELK stack)
  - [ ] Splunk
- [ ] **Log retention** configured (30-90 days typical)
- [ ] **Log volume** monitored (cost management)
- [ ] **Log sampling** for high-volume services (if needed)

## Application Metrics

### Metric Types

- [ ] **Counters** for values that only increase:
  - [ ] Total requests
  - [ ] Total errors
  - [ ] Total registrations
- [ ] **Gauges** for values that go up and down:
  - [ ] Active connections
  - [ ] Memory usage
  - [ ] Queue depth
- [ ] **Histograms** for distributions:
  - [ ] Request duration
  - [ ] Response size
  - [ ] Database query time
- [ ] **Summaries** for quantiles (p50, p95, p99)

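The main types above can be sketched in a few lines. This is a stand-in for what `prom-client` maintains internally, not its API; the bucket bounds are the ones suggested later in this checklist:

```typescript
// Minimal counter / gauge / histogram -- a sketch of the semantics,
// not a replacement for the real client library.
class Counter {
  value = 0;
  inc(by = 1) { this.value += by; } // counters only go up
}

class Gauge {
  value = 0;
  set(v: number) { this.value = v; } // gauges move in both directions
}

class Histogram {
  // Cumulative bucket counts: one per upper bound, plus a final +Inf bucket.
  counts: number[];
  constructor(readonly bounds: number[] = [0.1, 0.5, 1, 2, 5, 10]) {
    this.counts = new Array(bounds.length + 1).fill(0);
  }
  observe(v: number) {
    // Every bucket whose upper bound >= v is incremented (buckets are cumulative).
    this.bounds.forEach((b, i) => { if (v <= b) this.counts[i]++; });
    this.counts[this.bounds.length]++; // +Inf bucket counts every observation
  }
}

const requests = new Counter();
requests.inc();                // e.g. http_requests_total += 1
const latency = new Histogram();
latency.observe(0.3);          // lands in the 0.5s bucket and all larger ones
```
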
### Standard Metrics

#### HTTP Metrics

- [ ] **http_requests_total** (counter):
  - Labels: method, path, status, tenant_id
  - Track total requests per endpoint
- [ ] **http_request_duration_seconds** (histogram):
  - Labels: method, path, status
  - Buckets: 0.1, 0.5, 1, 2, 5, 10 seconds
- [ ] **http_request_size_bytes** (histogram)
- [ ] **http_response_size_bytes** (histogram)

#### Database Metrics

- [ ] **db_queries_total** (counter):
  - Labels: operation (SELECT, INSERT, UPDATE, DELETE), table
- [ ] **db_query_duration_seconds** (histogram):
  - Labels: operation, table
  - Track slow queries (p95, p99)
- [ ] **db_connection_pool_size** (gauge)
- [ ] **db_connection_pool_available** (gauge)

#### Application Metrics

- [ ] **active_sessions** (gauge)
- [ ] **background_jobs_total** (counter):
  - Labels: job_name, status (success, failure)
- [ ] **background_job_duration_seconds** (histogram):
  - Labels: job_name
- [ ] **cache_operations_total** (counter):
  - Labels: operation (hit, miss, set, delete)
- [ ] **external_api_calls_total** (counter):
  - Labels: service, status
- [ ] **external_api_duration_seconds** (histogram):
  - Labels: service

#### System Metrics

- [ ] **process_cpu_usage_percent** (gauge)
- [ ] **process_memory_usage_bytes** (gauge)
- [ ] **process_heap_usage_bytes** (gauge) - JavaScript specific
- [ ] **process_open_file_descriptors** (gauge)

### Metric Collection

- [ ] **Prometheus client library** installed:
  - Node.js: `prom-client`
  - Python: `prometheus-client`
  - Custom: OpenTelemetry SDK
- [ ] **Metrics endpoint** exposed (`/metrics`)
- [ ] **Prometheus scrapes** the endpoint (or push to a gateway)
- [ ] **Metric naming** follows conventions:
  - Lowercase with underscores
  - Unit suffixes (`_seconds`, `_bytes`, `_total`)
  - Namespace prefix (`myapp_http_requests_total`)

### Multi-Tenant Metrics

- [ ] **tenant_id label** on all relevant metrics
- [ ] **Per-tenant dashboards** (filter by tenant_id)
- [ ] **Tenant resource usage** tracked:
  - [ ] API calls per tenant
  - [ ] Database storage per tenant
  - [ ] Data transfer per tenant
- [ ] **Tenant quotas** monitored (alert when approaching the limit)

## Distributed Tracing

### Tracing Setup

- [ ] **OpenTelemetry SDK** installed:
  - Node.js: `@opentelemetry/sdk-node`
  - Python: `opentelemetry-sdk`
- [ ] **Tracing backend** configured:
  - [ ] Jaeger (self-hosted)
  - [ ] Datadog APM
  - [ ] Honeycomb
  - [ ] AWS X-Ray
- [ ] **Auto-instrumentation** enabled:
  - [ ] HTTP client/server
  - [ ] Database queries
  - [ ] Redis operations
  - [ ] Message queues

### Span Creation

- [ ] **Custom spans** for business logic:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payments');

async function handlePayment(tenantId: string, amount: number) {
  const span = tracer.startSpan('process-payment');
  span.setAttribute('tenant_id', tenantId);
  span.setAttribute('amount', amount);
  try {
    await processPayment();
    span.setStatus({ code: SpanStatusCode.OK });
  } catch (error) {
    span.recordException(error as Error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  } finally {
    span.end(); // always end the span, even on failure
  }
}
```

- [ ] **Span attributes** include context:
  - [ ] tenant_id, user_id, request_id
  - [ ] Input parameters (non-sensitive)
  - [ ] Result status
- [ ] **Span events** for key moments:
  - [ ] "Payment started"
  - [ ] "Database query executed"
  - [ ] "External API called"

### Trace Context Propagation

- [ ] **W3C Trace Context** headers propagated:
  - `traceparent`: trace ID, parent span ID, flags
  - `tracestate`: vendor-specific data
- [ ] **Context propagated** across:
  - [ ] HTTP requests (frontend ↔ backend)
  - [ ] Background jobs
  - [ ] Message queues
  - [ ] Microservices
- [ ] **Trace ID** included in logs (to correlate logs and traces)

### Sampling

- [ ] **Sampling strategy** defined:
  - [ ] Head-based: sample at trace start (1%, 10%, 100%)
  - [ ] Tail-based: sample after the trace completes (error traces, slow traces)
  - [ ] Adaptive: sample based on load
- [ ] **Always sample** errors and slow requests
- [ ] **Sample rate** appropriate for volume (start high, reduce if needed)

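Head-based sampling should be deterministic in the trace ID so every service in a trace makes the same keep/drop decision. A sketch, with the 10% rate and the error/latency overrides taken from this checklist's suggestions (the hash is illustrative, not OpenTelemetry's algorithm):

```typescript
// Head-based sampler: hash the trace ID into [0, 100) and compare to the rate.
// Deterministic, so all services agree on which traces are kept.
function hashToPercent(traceId: string): number {
  let h = 0;
  for (const ch of traceId) h = (h * 31 + ch.charCodeAt(0)) >>> 0; // unsigned 32-bit
  return h % 100;
}

function shouldSample(traceId: string, ratePercent: number): boolean {
  return hashToPercent(traceId) < ratePercent;
}

// Tail-style override: always keep errors and slow requests.
function keepTrace(traceId: string, opts: { error: boolean; durationMs: number }): boolean {
  if (opts.error || opts.durationMs > 1000) return true;
  return shouldSample(traceId, 10); // 10% of ordinary traffic
}
```
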
## Alerting

### Alert Definitions

- [ ] **Error rate alerts**:
  - [ ] Condition: error rate > 5% for 5 minutes
  - [ ] Severity: Critical
  - [ ] Action: page the on-call engineer
- [ ] **Latency alerts**:
  - [ ] Condition: p95 latency > 1s for 10 minutes
  - [ ] Severity: Warning
  - [ ] Action: Slack notification
- [ ] **Availability alerts**:
  - [ ] Condition: health check fails 3 consecutive times
  - [ ] Severity: Critical
  - [ ] Action: page on-call + auto-restart
- [ ] **Resource alerts**:
  - [ ] Memory usage > 80%
  - [ ] CPU usage > 80%
  - [ ] Disk usage > 85%
  - [ ] Database connections > 90% of pool
- [ ] **Business metric alerts**:
  - [ ] Registration rate drops > 50%
  - [ ] Payment failures increase > 10%
  - [ ] Active users drop significantly

### Alert Channels

- [ ] **PagerDuty** (or equivalent) for critical alerts
- [ ] **Slack** for warnings and notifications
- [ ] **Email** for non-urgent alerts
- [ ] **SMS** for the highest priority (use sparingly)

### Alert Management

- [ ] **Alert fatigue** prevented:
  - [ ] Appropriate thresholds (not too sensitive)
  - [ ] Proper severity levels (not everything is critical)
  - [ ] Alert aggregation (deduplicate similar alerts)
- [ ] **Runbooks** for each alert:
  - [ ] What the alert means
  - [ ] How to investigate
  - [ ] How to resolve
  - [ ] Escalation path
- [ ] **Alert suppression** during deployments (planned downtime)
- [ ] **Alert escalation** if not acknowledged

## Dashboards & Visualization

### Standard Dashboards

- [ ] **Service Overview** dashboard:
  - [ ] Request rate (requests/sec)
  - [ ] Error rate (errors/sec, %)
  - [ ] Latency (p50, p95, p99)
  - [ ] Availability (uptime %)
- [ ] **Database** dashboard:
  - [ ] Query rate
  - [ ] Slow queries (p95, p99)
  - [ ] Connection pool usage
  - [ ] Table sizes
- [ ] **System Resources** dashboard:
  - [ ] CPU usage
  - [ ] Memory usage
  - [ ] Disk I/O
  - [ ] Network I/O
- [ ] **Business Metrics** dashboard:
  - [ ] Active users
  - [ ] Registrations
  - [ ] Revenue
  - [ ] Feature usage

### Dashboard Best Practices

- [ ] **Auto-refresh** enabled (every 30-60 seconds)
- [ ] **Time range** configurable (last hour, 24h, 7 days)
- [ ] **Drill-down** to detailed views
- [ ] **Annotations** for deployments and incidents
- [ ] **Shared dashboards** accessible to the team

### Per-Tenant Dashboards

- [ ] **Tenant filter** on all relevant dashboards
- [ ] **Tenant resource usage** visualized
- [ ] **Tenant-specific alerts** (for large customers)
- [ ] **Tenant comparison** view (compare usage across tenants)

## Health Checks

### Endpoint Implementation

- [ ] **Health check endpoint** (`/health` or `/healthz`):
  - [ ] Returns 200 OK when healthy
  - [ ] Returns 503 Service Unavailable when unhealthy
  - [ ] Includes subsystem status

```json
{
  "status": "healthy",
  "version": "1.2.3",
  "uptime_seconds": 86400,
  "checks": {
    "database": "healthy",
    "redis": "healthy",
    "external_api": "degraded"
  }
}
```

- [ ] **Liveness probe** (`/health/live`):
  - [ ] Checks whether the application is running
  - [ ] Fails → restart the container
- [ ] **Readiness probe** (`/health/ready`):
  - [ ] Checks whether the application is ready to serve traffic
  - [ ] Fails → remove from the load balancer

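Rolling subsystem results up into an overall status and HTTP code can be sketched like this; the three-state `healthy`/`degraded`/`unhealthy` model matches the JSON above, while treating `degraded` as still-serving (200) is a policy choice, not a standard:

```typescript
// Combine subsystem checks into an overall status and an HTTP status code,
// matching the /health payload shown above.
type CheckStatus = "healthy" | "degraded" | "unhealthy";

function overallHealth(
  checks: Record<string, CheckStatus>,
): { status: CheckStatus; httpStatus: number } {
  const states = Object.values(checks);
  if (states.includes("unhealthy")) {
    return { status: "unhealthy", httpStatus: 503 }; // take the pod out of rotation
  }
  if (states.includes("degraded")) {
    return { status: "degraded", httpStatus: 200 }; // keep serving, but flag it
  }
  return { status: "healthy", httpStatus: 200 };
}

overallHealth({ database: "healthy", redis: "healthy", external_api: "degraded" });
// → { status: "degraded", httpStatus: 200 }
```
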
### Health Check Coverage

- [ ] **Database connectivity** checked
- [ ] **Cache connectivity** checked (Redis, Memcached)
- [ ] **External APIs** checked (optional; can cause false positives)
- [ ] **Disk space** checked
- [ ] **Critical dependencies** checked

### Monitoring Health Checks

- [ ] **Uptime monitoring** service (Pingdom, UptimeRobot, Datadog Synthetics)
- [ ] **Check frequency** appropriate (every 1-5 minutes)
- [ ] **Alerting** on failed health checks
- [ ] **Geographic monitoring** (check from multiple regions)

## Error Tracking

### Error Capture

- [ ] **Error tracking service** integrated:
  - [ ] Sentry
  - [ ] Datadog Error Tracking
  - [ ] Rollbar
  - [ ] Custom solution
- [ ] **Unhandled exceptions** captured automatically
- [ ] **Handled errors** reported when appropriate
- [ ] **Error context** included:
  - [ ] User ID, tenant ID
  - [ ] Request ID, trace ID
  - [ ] User actions (breadcrumbs)
  - [ ] Environment variables (non-sensitive)

### Error Grouping

- [ ] **Errors grouped** by fingerprint (same error, different occurrences)
- [ ] **Error rate** tracked per group
- [ ] **Alerting** on new error types or spikes in existing ones
- [ ] **Error assignment** to team members
- [ ] **Resolution tracking** (mark errors as resolved)

### Privacy & Security

- [ ] **PII redacted** from error reports:
  - [ ] Passwords, tokens, API keys
  - [ ] Credit card numbers
  - [ ] Email addresses (unless necessary)
  - [ ] SSNs, tax IDs
- [ ] **Source maps** uploaded for the frontend (to de-minify stack traces)
- [ ] **Release tagging** (associate errors with deployments)

## Performance Monitoring

### Real User Monitoring (RUM)

- [ ] **RUM tool** integrated (Datadog RUM, New Relic Browser, Google Analytics):
  - [ ] Page load times
  - [ ] Core Web Vitals (LCP, FID, CLS)
  - [ ] JavaScript errors
  - [ ] User sessions
- [ ] **Performance budgets** defined:
  - [ ] First Contentful Paint < 1.8s
  - [ ] Largest Contentful Paint < 2.5s
  - [ ] Time to Interactive < 3.8s
  - [ ] Cumulative Layout Shift < 0.1
- [ ] **Alerting** on performance regressions

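A budget check in CI or monitoring just compares measured vitals against these ceilings. The thresholds below are the ones listed above; the field names (`fcpMs`, `lcpMs`, ...) are illustrative, not a standard schema:

```typescript
// Compare measured Web Vitals against the budgets above and report breaches.
interface Vitals { fcpMs: number; lcpMs: number; ttiMs: number; cls: number; }

const BUDGET: Vitals = { fcpMs: 1800, lcpMs: 2500, ttiMs: 3800, cls: 0.1 };

function budgetBreaches(measured: Vitals, budget: Vitals = BUDGET): string[] {
  return (Object.keys(budget) as (keyof Vitals)[])
    .filter((k) => measured[k] > budget[k]); // breach = measured over the ceiling
}

budgetBreaches({ fcpMs: 1200, lcpMs: 3100, ttiMs: 3500, cls: 0.05 });
// → ["lcpMs"]: only LCP is over budget
```
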
### Application Performance Monitoring (APM)

- [ ] **APM tool** integrated (Datadog APM, New Relic APM):
  - [ ] Trace every request
  - [ ] Identify slow endpoints
  - [ ] Database query analysis
  - [ ] External API profiling
- [ ] **Performance profiling** for critical paths:
  - [ ] Authentication flow
  - [ ] Payment processing
  - [ ] Data exports
  - [ ] Complex queries

## Cost Management

- [ ] **Observability costs** tracked:
  - [ ] Log ingestion costs
  - [ ] Metric cardinality costs
  - [ ] Trace sampling costs
  - [ ] Dashboard/seat costs
- [ ] **Cost optimization**:
  - [ ] Log sampling for high-volume services
  - [ ] Metric aggregation (reduce cardinality)
  - [ ] Trace sampling (not 100% in production)
  - [ ] Data retention policies
- [ ] **Budget alerts** configured

## Security & Compliance

- [ ] **Access control** on observability tools (role-based)
- [ ] **Audit logging** for observability access
- [ ] **Data retention** complies with regulations (GDPR, HIPAA)
- [ ] **Data encryption** in transit and at rest
- [ ] **PII handling** compliant (redaction, anonymization)

## Testing Observability

- [ ] **Log output** tested in unit tests:

```typescript
test('logs user login', async () => {
  const logs = captureLogs(); // test helper that collects emitted log entries
  await loginUser();
  expect(logs).toContainEqual(
    expect.objectContaining({
      level: 'info',
      message: 'User logged in',
      user_id: expect.any(String),
    })
  );
});
```

- [ ] **Metrics** incremented in tests
- [ ] **Traces** created in integration tests
- [ ] **Health checks** tested
- [ ] **Alert thresholds** tested (inject failures, verify the alert fires)

## Documentation

- [ ] **Observability runbook** created:
  - [ ] How to access logs, metrics, traces
  - [ ] How to create dashboards
  - [ ] How to set up alerts
  - [ ] Common troubleshooting queries
- [ ] **Alert runbooks** for each alert
- [ ] **Dashboard documentation** (what each panel shows)
- [ ] **Metric dictionary** (what each metric means)
- [ ] **On-call procedures** documented

## Scoring

- **85+ items checked**: Excellent - production-grade observability ✅
- **65-84 items**: Good - most observability covered ⚠️
- **45-64 items**: Fair - significant gaps exist 🔴
- **<45 items**: Poor - not ready for production ❌

## Priority Items

Address these first:

1. **Structured logging** - foundation for debugging
2. **Error tracking** - catch and fix bugs quickly
3. **Health checks** - know when the service is down
4. **Alerting** - get notified of issues
5. **Key metrics** - request rate, error rate, latency

## Common Pitfalls

❌ **Don't:**

- Log sensitive data (passwords, tokens, PII)
- Create high-cardinality metrics (e.g. user_id as a label)
- Trace 100% of requests in production (sample instead)
- Alert on every anomaly (alert fatigue)
- Ignore observability until there's a problem

✅ **Do:**

- Log at appropriate levels (use DEBUG for verbose output)
- Use correlation IDs throughout the request lifecycle
- Set up alerts with clear runbooks
- Review dashboards regularly (detect issues early)
- Iterate on observability (improve over time)

## Related Resources

- [OpenTelemetry Documentation](https://opentelemetry.io/docs/)
- [Pino Logger](https://getpino.io)
- [Prometheus Best Practices](https://prometheus.io/docs/practices/)
- [observability-engineering skill](../SKILL.md)

---

**Total Items**: 140+ observability checks
**Critical Items**: Logging, Metrics, Alerting, Health checks
**Coverage**: Logs, Metrics, Traces, Alerts, Dashboards
**Last Updated**: 2025-11-10

---

`skills/observability-engineering/examples/INDEX.md` (new file, 136 lines)

# Observability Examples

Production-ready observability implementations for the Grey Haven stack (Cloudflare Workers, TanStack Start, FastAPI, PostgreSQL).

## Examples Overview

### Prometheus + Grafana Setup

**File**: [prometheus-grafana-setup.md](prometheus-grafana-setup.md)

Complete monitoring stack for Kubernetes with Golden Signals:

- **Prometheus Deployment** - Helm charts, service discovery, scrape configs
- **Grafana Setup** - dashboards-as-code, templating, alerting
- **Node Exporter** - system metrics collection (CPU, memory, disk)
- **kube-state-metrics** - Kubernetes resource metrics
- **Golden Signals** - request rate, error rate, latency (p50/p95/p99), saturation
- **Recording Rules** - pre-aggregated metrics for fast queries
- **Alertmanager** - PagerDuty integration, escalation policies
- **Before/After Metrics** - response time improved 40%, MTTR reduced 60%

**Use when**: setting up production monitoring, implementing SRE practices, cloud-native deployments

---

### OpenTelemetry Distributed Tracing

**File**: [opentelemetry-tracing.md](opentelemetry-tracing.md)

Distributed tracing for microservices with Jaeger:

- **OTel Collector** - receiver/processor/exporter pipelines
- **Auto-Instrumentation** - zero-code tracing for Node.js, Python, FastAPI
- **Context Propagation** - W3C Trace Context across services
- **Sampling Strategies** - head-based (10%), tail-based (errors only)
- **Span Attributes** - HTTP method, status code, user ID, tenant ID
- **Trace Visualization** - Jaeger UI, dependency graphs, critical path
- **Performance Impact** - <5ms overhead, 2% CPU increase
- **Before/After** - MTTR 45min → 8min (82% reduction)

**Use when**: debugging microservices, understanding latency, optimizing critical paths

---

### SLO & Error Budget Framework

**File**: [slo-error-budgets.md](slo-error-budgets.md)

Complete SLI/SLO/error budget implementation:

- **SLI Definition** - availability (99.9%), latency (p95 < 200ms), error rate (< 0.5%)
- **SLO Targets** - Critical (99.95%), Essential (99.9%), Standard (99.5%)
- **Error Budget Calculation** - monthly budget, burn rate monitoring (1h/6h/24h windows)
- **Prometheus Recording Rules** - multi-window SLI calculations
- **Grafana SLO Dashboard** - real-time status, budget remaining, burn rate graphs
- **Budget Policies** - feature freeze at 25% remaining, postmortem required at depletion
- **Burn Rate Alerts** - PagerDuty escalation when burning too fast
- **Impact** - 99.95% availability achieved; 3 feature freezes prevented overspend

**Use when**: implementing SRE practices, balancing reliability with velocity, production deployments

---

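The arithmetic behind this framework is small: a 99.9% monthly SLO leaves 0.1% of the month as error budget, and burn rate is how fast that budget is being consumed relative to the allowance. A sketch over a 30-day month:

```typescript
// Error budget math for an availability SLO over a 30-day month.
const MONTH_MINUTES = 30 * 24 * 60; // 43,200 minutes

function errorBudgetMinutes(sloPercent: number): number {
  return MONTH_MINUTES * (1 - sloPercent / 100);
}

// Burn rate 1 = spending the budget exactly over the SLO window;
// burn rate 14.4 on a 99.9% SLO exhausts a month's budget in about 2 days.
function burnRate(observedErrorRate: number, sloPercent: number): number {
  const allowedErrorRate = 1 - sloPercent / 100;
  return observedErrorRate / allowedErrorRate;
}

errorBudgetMinutes(99.9); // ≈ 43.2 minutes of allowed downtime per month
burnRate(0.0144, 99.9);   // ≈ 14.4: burning far too fast, page immediately
```
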
### DataDog APM Integration

**File**: [datadog-apm.md](datadog-apm.md)

Application Performance Monitoring for the Grey Haven stack:

- **DataDog Agent** - Cloudflare Workers instrumentation, FastAPI tracing
- **Custom Metrics** - business KPIs (checkout success rate, revenue per minute)
- **Real User Monitoring (RUM)** - frontend performance, user sessions, error tracking
- **APM Traces** - distributed tracing with Cloudflare Workers, database queries
- **Log Correlation** - trace ID in logs, unified troubleshooting
- **Synthetic Monitoring** - API health checks every minute from 10 locations
- **Anomaly Detection** - ML-powered alerts for unusual patterns
- **Cost** - $31/host/month, $40/million spans
- **Before/After** - 99.5% → 99.95% availability (10x fewer incidents)

**Use when**: a commercial APM is needed, executive dashboards are required, startup budget allows

---

### Centralized Logging with Fluentd + Elasticsearch

**File**: [centralized-logging.md](centralized-logging.md)

Production log aggregation for multi-region deployments:

- **Fluentd DaemonSet** - Kubernetes log collection from all pods
- **Structured Logging** - JSON format with trace ID, user ID, tenant ID
- **Elasticsearch Indexing** - daily indices with rollover, ILM policies
- **Kibana Dashboards** - error tracking, request patterns, audit logs
- **Log Parsing** - Grok patterns for FastAPI, TanStack Start, PostgreSQL
- **Retention** - hot (7 days), warm (30 days), cold (90 days), archive (1 year)
- **PII Redaction** - automatic SSN, credit card, and email masking
- **Volume** - 500GB/day ingested, 90% compression, $800/month cost
- **Before/After** - log search 5min → 10sec, disk usage 10TB → 1TB

**Use when**: debugging production issues, compliance requirements (SOC 2/PCI), audit trails

---

### Chaos Engineering with Gremlin

**File**: [chaos-engineering.md](chaos-engineering.md)

Reliability testing and circuit breaker validation:

- **Gremlin Setup** - agent deployment, blast radius configuration
- **Chaos Experiments** - pod termination, network latency (100ms), CPU stress (80%)
- **Circuit Breaker** - automatic fallback when error rate > 50%
- **Hypothesis** - "API handles 50% pod failures without user impact"
- **Validation** - Prometheus metrics, distributed traces, user session monitoring
- **Results** - circuit breaker engaged in 2s, fallback success rate 99.8%
- **Runbook** - automatic rollback triggers, escalation procedures
- **Impact** - found 3 critical bugs before production; confidence in resilience

**Use when**: pre-production validation, testing disaster recovery, practicing chaos engineering

---

## Quick Navigation

| Topic | File | Lines | Focus |
|-------|------|-------|-------|
| **Prometheus + Grafana** | [prometheus-grafana-setup.md](prometheus-grafana-setup.md) | ~480 | Golden Signals monitoring |
| **OpenTelemetry** | [opentelemetry-tracing.md](opentelemetry-tracing.md) | ~450 | Distributed tracing |
| **SLO Framework** | [slo-error-budgets.md](slo-error-budgets.md) | ~420 | Error budget management |
| **DataDog APM** | [datadog-apm.md](datadog-apm.md) | ~400 | Commercial APM |
| **Centralized Logging** | [centralized-logging.md](centralized-logging.md) | ~440 | Log aggregation |
| **Chaos Engineering** | [chaos-engineering.md](chaos-engineering.md) | ~350 | Reliability testing |

## Related Documentation

- **Reference**: [Reference Index](../reference/INDEX.md) - PromQL, Golden Signals, SLO best practices
- **Templates**: [Templates Index](../templates/INDEX.md) - Grafana dashboards, SLO definitions
- **Main Agent**: [observability-engineer.md](../observability-engineer.md) - observability agent

---

Return to the [main agent](../observability-engineer.md).

---

`skills/observability-engineering/reference/INDEX.md` (new file, 67 lines)

# Observability Reference Documentation

Comprehensive reference guides for production observability patterns, PromQL queries, and SRE best practices.

## Reference Overview

### PromQL Query Language Guide

**File**: [promql-guide.md](promql-guide.md)

Complete PromQL reference for Prometheus queries:

- **Metric types**: Counter, Gauge, Histogram, Summary
- **PromQL functions**: rate(), irate(), increase(), sum(), avg(), histogram_quantile()
- **Recording rules**: pre-aggregated metrics for performance
- **Alerting queries**: burn rate calculations, threshold alerts
- **Performance tips**: query optimization, avoiding cardinality explosions

**Use when**: writing Prometheus queries, creating recording rules, debugging slow queries

---

### Golden Signals Reference

**File**: [golden-signals.md](golden-signals.md)

Google SRE Golden Signals implementation guide:

- **Request Rate (Traffic)**: RPS calculations, per-service breakdowns
- **Error Rate**: 5xx errors, client vs. server errors, error budget impact
- **Latency (Duration)**: p50/p95/p99 percentiles, latency SLOs
- **Saturation**: CPU, memory, disk, connection pools

**Use when**: designing monitoring dashboards, implementing SLIs, understanding system health

---

### SLO Best Practices

**File**: [slo-best-practices.md](slo-best-practices.md)

Google SRE SLO/SLI/error budget framework:

- **SLI selection**: choosing meaningful indicators (availability, latency, throughput)
- **SLO targets**: Critical (99.95%), Essential (99.9%), Standard (99.5%)
- **Error budget policies**: feature freeze thresholds, postmortem requirements
- **Multi-window burn rate alerts**: 1h, 6h, 24h windows
- **SLO review cadence**: weekly reviews, quarterly adjustments

**Use when**: implementing an SLO framework, setting reliability targets, balancing velocity with reliability

---

## Quick Navigation

| Topic | File | Lines | Focus |
|-------|------|-------|-------|
| **PromQL** | [promql-guide.md](promql-guide.md) | ~450 | Query language reference |
| **Golden Signals** | [golden-signals.md](golden-signals.md) | ~380 | Four signals implementation |
| **SLO Practices** | [slo-best-practices.md](slo-best-practices.md) | ~420 | Google SRE framework |

## Related Documentation

- **Examples**: [Examples Index](../examples/INDEX.md) - production implementations
- **Templates**: [Templates Index](../templates/INDEX.md) - copy-paste configurations
- **Main Agent**: [observability-engineer.md](../observability-engineer.md) - observability agent

---

Return to the [main agent](../observability-engineer.md).

72
skills/observability-engineering/templates/INDEX.md
Normal file
@@ -0,0 +1,72 @@
# Observability Templates

Copy-paste ready configuration templates for Prometheus, Grafana, and OpenTelemetry.

## Templates Overview

### Grafana Dashboard Template

**File**: [grafana-dashboard.json](grafana-dashboard.json)

Production-ready Golden Signals dashboard:

- **Request Rate**: Total RPS with 5-minute averages
- **Error Rate**: Percentage of 5xx errors with alert thresholds
- **Latency**: p50/p95/p99 percentiles in milliseconds
- **Saturation**: CPU and memory usage percentages

**Use when**: Creating new service dashboards, standardizing monitoring

---

### SLO Definition Template

**File**: [slo-definition.yaml](slo-definition.yaml)

Service Level Objective configuration:

- **SLO tiers**: Critical (99.95%), Essential (99.9%), Standard (99.5%)
- **SLI definitions**: Availability, latency, error rate
- **Error budget policy**: Feature freeze thresholds
- **Multi-window burn rate alerts**: 1h, 6h, 24h windows

**Use when**: Implementing SLO framework for new services
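
The multi-window burn-rate logic behind the template can be sketched as follows (the 14.4x threshold matches the template; function names are illustrative):

```python
WINDOW_DAYS = 30  # SLO window the burn rate is measured against

def hours_to_exhaustion(burn_rate: float) -> float:
    """A burn rate of B consumes the entire error budget in window/B."""
    return WINDOW_DAYS * 24 / burn_rate

def should_alert(short_burn: float, long_burn: float, threshold: float) -> bool:
    # Both windows must exceed the threshold: the long window filters out
    # brief spikes, the short window lets the alert reset quickly.
    return short_burn >= threshold and long_burn >= threshold

print(f"14.4x burn exhausts the budget in {hours_to_exhaustion(14.4):.0f}h")
print(f"6x burn exhausts it in {hours_to_exhaustion(6) / 24:.0f} days")
```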
---

### Prometheus Recording Rules

**File**: [prometheus-recording-rules.yaml](prometheus-recording-rules.yaml)

Pre-aggregated metrics for fast dashboards:

- **Request rates**: Per-service, per-endpoint RPS
- **Error rates**: Percentage calculations (5xx / total)
- **Latency percentiles**: p50/p95/p99 pre-computed
- **Error budget**: Remaining budget and burn rate

**Use when**: Optimizing slow dashboard queries, implementing SLOs
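
The error-budget rules boil down to one ratio. A minimal sketch of what the pre-computed metric represents (the 99.9% default mirrors the template):

```python
def error_budget_remaining(availability: float, slo: float = 0.999) -> float:
    """Fraction of the window's error budget still unspent."""
    return 1 - (1 - availability) / (1 - slo)

# 99.95% measured availability against a 99.9% SLO: half the budget is left.
print(error_budget_remaining(0.9995))
```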
---

## Quick Usage

```bash
# Copy template to your monitoring directory
cp templates/grafana-dashboard.json ../monitoring/dashboards/

# Edit service name and thresholds
vim ../monitoring/dashboards/grafana-dashboard.json

# Import to Grafana
curl -X POST http://admin:password@localhost:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @../monitoring/dashboards/grafana-dashboard.json
```

## Related Documentation

- **Examples**: [Examples Index](../examples/INDEX.md) - Full implementations
- **Reference**: [Reference Index](../reference/INDEX.md) - PromQL, SLO guides
- **Main Agent**: [observability-engineer.md](../observability-engineer.md) - Observability agent

---

Return to [main agent](../observability-engineer.md)
@@ -0,0 +1,210 @@
{
  "dashboard": {
    "title": "Golden Signals - [Service Name]",
    "tags": ["golden-signals", "production", "slo"],
    "timezone": "UTC",
    "refresh": "30s",
    "time": {
      "from": "now-6h",
      "to": "now"
    },
    "panels": [
      {
        "id": 1,
        "title": "Request Rate (RPS)",
        "type": "graph",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=\"YOUR_SERVICE\"}[5m]))",
            "legendFormat": "Total RPS",
            "refId": "A"
          },
          {
            "expr": "sum(rate(http_requests_total{service=\"YOUR_SERVICE\"}[5m])) by (method)",
            "legendFormat": "{{method}}",
            "refId": "B"
          }
        ],
        "yaxes": [
          {"format": "reqps", "label": "Requests/sec"},
          {"format": "short"}
        ],
        "legend": {"show": true, "alignAsTable": true, "avg": true, "max": true, "current": true}
      },
      {
        "id": 2,
        "title": "Error Rate (%)",
        "type": "graph",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{service=\"YOUR_SERVICE\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"YOUR_SERVICE\"}[5m]))) * 100",
            "legendFormat": "Error Rate %",
            "refId": "A"
          }
        ],
        "yaxes": [
          {"format": "percent", "label": "Error %", "max": 5},
          {"format": "short"}
        ],
        "alert": {
          "name": "High Error Rate",
          "conditions": [
            {
              "evaluator": {"params": [1], "type": "gt"},
              "operator": {"type": "and"},
              "query": {"params": ["A", "5m", "now"]},
              "type": "query"
            }
          ],
          "frequency": "1m",
          "for": "5m",
          "message": "Error rate > 1% for 5 minutes",
          "noDataState": "no_data",
          "notifications": []
        },
        "thresholds": [
          {"value": 1, "colorMode": "critical", "op": "gt", "fill": true, "line": true}
        ]
      },
      {
        "id": 3,
        "title": "Request Latency (p50/p95/p99)",
        "type": "graph",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=\"YOUR_SERVICE\"}[5m])) by (le)) * 1000",
            "legendFormat": "p50",
            "refId": "A"
          },
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=\"YOUR_SERVICE\"}[5m])) by (le)) * 1000",
            "legendFormat": "p95",
            "refId": "B"
          },
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=\"YOUR_SERVICE\"}[5m])) by (le)) * 1000",
            "legendFormat": "p99",
            "refId": "C"
          }
        ],
        "yaxes": [
          {"format": "ms", "label": "Latency (ms)"},
          {"format": "short"}
        ],
        "thresholds": [
          {"value": 200, "colorMode": "warning", "op": "gt"},
          {"value": 500, "colorMode": "critical", "op": "gt"}
        ]
      },
      {
        "id": 4,
        "title": "Resource Saturation (CPU/Memory %)",
        "type": "graph",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
        "targets": [
          {
            "expr": "100 - (avg(irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "CPU %",
            "refId": "A"
          },
          {
            "expr": "100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))",
            "legendFormat": "Memory %",
            "refId": "B"
          }
        ],
        "yaxes": [
          {"format": "percent", "label": "Usage %", "max": 100},
          {"format": "short"}
        ],
        "thresholds": [
          {"value": 80, "colorMode": "warning", "op": "gt"},
          {"value": 90, "colorMode": "critical", "op": "gt"}
        ]
      },
      {
        "id": 5,
        "title": "Top 10 Slowest Endpoints",
        "type": "table",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
        "targets": [
          {
            "expr": "topk(10, histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=\"YOUR_SERVICE\"}[5m])) by (le, path))) * 1000",
            "legendFormat": "",
            "format": "table",
            "instant": true,
            "refId": "A"
          }
        ],
        "transformations": [
          {"id": "organize", "options": {"excludeByName": {}, "indexByName": {}, "renameByName": {"path": "Endpoint", "Value": "p95 Latency (ms)"}}}
        ]
      },
      {
        "id": 6,
        "title": "SLO Status (30-day)",
        "type": "stat",
        "gridPos": {"h": 8, "w": 6, "x": 12, "y": 16},
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=\"YOUR_SERVICE\",status=~\"2..|3..\"}[30d])) / sum(rate(http_requests_total{service=\"YOUR_SERVICE\"}[30d])) * 100",
            "refId": "A"
          }
        ],
        "options": {
          "graphMode": "none",
          "textMode": "value_and_name",
          "colorMode": "background"
        },
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "decimals": 3,
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"value": 0, "color": "red"},
                {"value": 99.5, "color": "yellow"},
                {"value": 99.9, "color": "green"}
              ]
            }
          }
        }
      },
      {
        "id": 7,
        "title": "Error Budget Remaining",
        "type": "gauge",
        "gridPos": {"h": 8, "w": 6, "x": 18, "y": 16},
        "targets": [
          {
            "expr": "(1 - ((1 - (sum(rate(http_requests_total{service=\"YOUR_SERVICE\",status=~\"2..|3..\"}[30d])) / sum(rate(http_requests_total{service=\"YOUR_SERVICE\"}[30d])))) / (1 - 0.999))) * 100",
            "refId": "A"
          }
        ],
        "options": {
          "showThresholdLabels": false,
          "showThresholdMarkers": true
        },
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100,
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"value": 0, "color": "red"},
                {"value": 25, "color": "yellow"},
                {"value": 50, "color": "green"}
              ]
            }
          }
        }
      }
    ]
  }
}
@@ -0,0 +1,188 @@
# Prometheus Recording Rules Template
# Pre-aggregated metrics for fast dashboard queries and SLO tracking
# Replace YOUR_SERVICE with actual service name

groups:
  # HTTP Request Rates
  - name: http_request_rates
    interval: 15s
    rules:
      # Total request rate (per-second)
      - record: greyhaven:http_requests:rate5m
        expr: sum(rate(http_requests_total{service="YOUR_SERVICE"}[5m]))

      # Request rate by service
      - record: greyhaven:http_requests:rate5m:by_service
        expr: sum(rate(http_requests_total[5m])) by (service)

      # Request rate by endpoint
      - record: greyhaven:http_requests:rate5m:by_endpoint
        expr: sum(rate(http_requests_total{service="YOUR_SERVICE"}[5m])) by (endpoint)

      # Request rate by method
      - record: greyhaven:http_requests:rate5m:by_method
        expr: sum(rate(http_requests_total{service="YOUR_SERVICE"}[5m])) by (method)

      # Request rate by status code
      - record: greyhaven:http_requests:rate5m:by_status
        expr: sum(rate(http_requests_total{service="YOUR_SERVICE"}[5m])) by (status)

  # HTTP Error Rates
  - name: http_error_rates
    interval: 15s
    rules:
      # Error rate (ratio of 5xx to total; multiply by 100 for a percentage)
      - record: greyhaven:http_errors:rate5m
        expr: |
          sum(rate(http_requests_total{service="YOUR_SERVICE",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="YOUR_SERVICE"}[5m]))

      # Error rate by service
      - record: greyhaven:http_errors:rate5m:by_service
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)

      # Error rate by endpoint
      - record: greyhaven:http_errors:rate5m:by_endpoint
        expr: |
          sum(rate(http_requests_total{service="YOUR_SERVICE",status=~"5.."}[5m])) by (endpoint)
          /
          sum(rate(http_requests_total{service="YOUR_SERVICE"}[5m])) by (endpoint)

  # HTTP Latency (Duration)
  - name: http_latency
    interval: 15s
    rules:
      # p50 latency (median)
      - record: greyhaven:http_latency:p50
        expr: histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service="YOUR_SERVICE"}[5m])) by (le))

      # p95 latency
      - record: greyhaven:http_latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="YOUR_SERVICE"}[5m])) by (le))

      # p99 latency
      - record: greyhaven:http_latency:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="YOUR_SERVICE"}[5m])) by (le))

      # Average latency
      - record: greyhaven:http_latency:avg
        expr: |
          sum(rate(http_request_duration_seconds_sum{service="YOUR_SERVICE"}[5m]))
          /
          sum(rate(http_request_duration_seconds_count{service="YOUR_SERVICE"}[5m]))

      # p95 latency by endpoint
      - record: greyhaven:http_latency:p95:by_endpoint
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="YOUR_SERVICE"}[5m])) by (le, endpoint))

  # Resource Saturation
  - name: resource_saturation
    interval: 15s
    rules:
      # CPU usage percentage
      - record: greyhaven:cpu_usage:percent
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      # Memory usage percentage
      - record: greyhaven:memory_usage:percent
        expr: 100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))

      # Disk usage percentage
      - record: greyhaven:disk_usage:percent
        expr: 100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)

      # Database connection pool saturation
      - record: greyhaven:db_pool:saturation
        expr: |
          db_pool_connections_active{service="YOUR_SERVICE"}
          /
          db_pool_connections_max{service="YOUR_SERVICE"}

  # SLI Calculations (Multi-Window)
  # Note: sum() drops all labels, so each rule re-attaches the service label
  # via `labels:` -- the error budget rules below select on it.
  - name: sli_calculations
    interval: 30s
    rules:
      # Availability SLI - 1 hour window
      - record: greyhaven:sli:availability:1h
        labels:
          service: YOUR_SERVICE
        expr: |
          sum(rate(http_requests_total{service="YOUR_SERVICE",status=~"2..|3.."}[1h]))
          /
          sum(rate(http_requests_total{service="YOUR_SERVICE"}[1h]))

      # Availability SLI - 6 hour window
      - record: greyhaven:sli:availability:6h
        labels:
          service: YOUR_SERVICE
        expr: |
          sum(rate(http_requests_total{service="YOUR_SERVICE",status=~"2..|3.."}[6h]))
          /
          sum(rate(http_requests_total{service="YOUR_SERVICE"}[6h]))

      # Availability SLI - 24 hour window
      - record: greyhaven:sli:availability:24h
        labels:
          service: YOUR_SERVICE
        expr: |
          sum(rate(http_requests_total{service="YOUR_SERVICE",status=~"2..|3.."}[24h]))
          /
          sum(rate(http_requests_total{service="YOUR_SERVICE"}[24h]))

      # Availability SLI - 30 day window
      - record: greyhaven:sli:availability:30d
        labels:
          service: YOUR_SERVICE
        expr: |
          sum(rate(http_requests_total{service="YOUR_SERVICE",status=~"2..|3.."}[30d]))
          /
          sum(rate(http_requests_total{service="YOUR_SERVICE"}[30d]))

      # Latency SLI - 1 hour window (% requests < 200ms)
      - record: greyhaven:sli:latency:1h
        expr: |
          sum(rate(http_request_duration_seconds_bucket{service="YOUR_SERVICE",le="0.2"}[1h]))
          /
          sum(rate(http_request_duration_seconds_count{service="YOUR_SERVICE"}[1h]))

      # Latency SLI - 30 day window
      - record: greyhaven:sli:latency:30d
        expr: |
          sum(rate(http_request_duration_seconds_bucket{service="YOUR_SERVICE",le="0.2"}[30d]))
          /
          sum(rate(http_request_duration_seconds_count{service="YOUR_SERVICE"}[30d]))

  # Error Budget Tracking
  - name: error_budget
    interval: 30s
    rules:
      # Error budget remaining (for 99.9% SLO)
      - record: greyhaven:error_budget:remaining:30d
        expr: |
          1 - (
            (1 - greyhaven:sli:availability:30d{service="YOUR_SERVICE"})
            /
            (1 - 0.999)
          )

      # Error budget burn rate - 1 hour window
      - record: greyhaven:error_budget:burn_rate:1h
        expr: |
          (1 - greyhaven:sli:availability:1h{service="YOUR_SERVICE"})
          /
          (1 - 0.999)

      # Error budget burn rate - 6 hour window
      - record: greyhaven:error_budget:burn_rate:6h
        expr: |
          (1 - greyhaven:sli:availability:6h{service="YOUR_SERVICE"})
          /
          (1 - 0.999)

      # Error budget burn rate - 24 hour window
      - record: greyhaven:error_budget:burn_rate:24h
        expr: |
          (1 - greyhaven:sli:availability:24h{service="YOUR_SERVICE"})
          /
          (1 - 0.999)

      # Error budget consumed (minutes of downtime; 43200 = minutes in 30 days)
      - record: greyhaven:error_budget:consumed:30d
        expr: |
          (1 - greyhaven:sli:availability:30d{service="YOUR_SERVICE"}) * 43200
173
skills/observability-engineering/templates/slo-definition.yaml
Normal file
@@ -0,0 +1,173 @@
# SLO Definition Template
# Replace YOUR_SERVICE with actual service name
# Replace 99.9 with your target SLO (99.5, 99.9, or 99.95)

apiVersion: monitoring.greyhaven.io/v1
kind: ServiceLevelObjective
metadata:
  name: YOUR_SERVICE-slo
  namespace: production
spec:
  # Service identification
  service: YOUR_SERVICE
  environment: production

  # SLO tier (critical, essential, standard)
  tier: essential

  # Time window (30 days recommended)
  window: 30d

  # SLO targets
  objectives:
    - name: availability
      target: 99.9  # 99.9% = 43.2 min downtime/month
      indicator:
        type: ratio
        success_query: |
          sum(rate(http_requests_total{service="YOUR_SERVICE",status=~"2..|3.."}[{{.window}}]))
        total_query: |
          sum(rate(http_requests_total{service="YOUR_SERVICE"}[{{.window}}]))

    - name: latency
      target: 95  # 95% of requests < 200ms
      indicator:
        type: ratio
        success_query: |
          sum(rate(http_request_duration_seconds_bucket{service="YOUR_SERVICE",le="0.2"}[{{.window}}]))
        total_query: |
          sum(rate(http_request_duration_seconds_count{service="YOUR_SERVICE"}[{{.window}}]))

    - name: error_rate
      target: 99.5  # <0.5% error rate
      indicator:
        type: ratio
        success_query: |
          sum(rate(http_requests_total{service="YOUR_SERVICE",status!~"5.."}[{{.window}}]))
        total_query: |
          sum(rate(http_requests_total{service="YOUR_SERVICE"}[{{.window}}]))

  # Error budget policy
  errorBudget:
    policy:
      - budget_range: [75%, 100%]
        action: "Normal feature development"
        approval: "Engineering team"

      - budget_range: [50%, 75%]
        action: "Monitor closely, increase testing"
        approval: "Engineering team"

      - budget_range: [25%, 50%]
        action: "Prioritize reliability work, reduce risky changes"
        approval: "Engineering manager"

      - budget_range: [0%, 25%]
        action: "Feature freeze, all hands on reliability"
        approval: "VP Engineering"
        requirements:
          - "Daily reliability standup"
          - "Postmortem for all incidents"
          - "No new features until budget >50%"

      - budget_range: [0%, 0%]
        action: "SLO violation - mandatory postmortem"
        approval: "VP Engineering + CTO"
        requirements:
          - "Complete postmortem within 48 hours"
          - "Action items with owners and deadlines"
          - "Present to exec team"

  # Multi-window burn rate alerts
  alerts:
    - name: error-budget-burn-rate-critical
      severity: critical
      windows:
        short: 1h
        long: 6h
      burn_rate_threshold: 14.4  # Budget exhausted in ~2 days (30d / 14.4 = 50h)
      for: 2m
      annotations:
        summary: "Critical burn rate - budget exhausted in ~2 days"
        description: "Service {{ $labels.service }} is burning error budget 14.4x faster than expected"
        runbook: "https://runbooks.greyhaven.io/slo-burn-rate"
      notifications:
        - type: pagerduty
          severity: critical

    - name: error-budget-burn-rate-high
      severity: warning
      windows:
        short: 6h
        long: 24h
      burn_rate_threshold: 6  # Budget exhausted in 5 days
      for: 15m
      annotations:
        summary: "High burn rate - budget exhausted in 5 days"
        description: "Service {{ $labels.service }} is burning error budget 6x faster than expected"
      notifications:
        - type: slack
          channel: "#alerts-reliability"

    - name: error-budget-burn-rate-medium
      severity: warning
      windows:
        short: 24h
        long: 24h
      burn_rate_threshold: 3  # Budget exhausted in 10 days
      for: 1h
      annotations:
        summary: "Medium burn rate - budget exhausted in 10 days"
      notifications:
        - type: slack
          channel: "#alerts-reliability"

    - name: error-budget-low
      severity: warning
      threshold: 0.25  # 25% remaining
      for: 5m
      annotations:
        summary: "Error budget low ({{ $value | humanizePercentage }} remaining)"
        description: "Consider feature freeze per error budget policy"
      notifications:
        - type: slack
          channel: "#engineering-managers"

    - name: error-budget-depleted
      severity: critical
      threshold: 0  # 0% remaining
      for: 5m
      annotations:
        summary: "Error budget depleted - feature freeze required"
        description: "SLO violated. Postmortem required within 48 hours."
      notifications:
        - type: pagerduty
          severity: critical
        - type: slack
          channel: "#exec-alerts"

  # Review cadence
  review:
    frequency: weekly
    participants:
      - team: engineering
      - team: product
      - team: sre
    agenda:
      - "Current error budget status"
      - "Burn rate trends"
      - "Recent incidents and impact"
      - "Upcoming risky changes"

  # Reporting
  reporting:
    dashboard:
      grafana_uid: YOUR_SERVICE_slo_dashboard
      panels:
        - slo_status
        - error_budget_remaining
        - burn_rate_multiwindow
        - incident_timeline
    export:
      format: prometheus
      recording_rules: true