Initial commit
`skills/observability-engineering/SKILL.md` (new file, 26 lines)

# Observability Engineering Skill

Production-ready monitoring, logging, and tracing using Prometheus, Grafana, OpenTelemetry, DataDog, and Sentry.

## Description

Comprehensive observability setup including SLO implementation, distributed tracing, dashboards, and incident prevention.

## What's Included

- **Examples**: Prometheus configs, Grafana dashboards, SLO definitions
- **Reference**: Observability best practices, monitoring strategies
- **Templates**: Dashboard templates, alert rules

## Use When

- Setting up production monitoring
- Implementing SLOs
- Distributed tracing
- Performance tracking

## Related Agents

- `observability-engineer`

**Skill Version**: 1.0

---

# Observability Engineering Setup Checklist

Comprehensive checklist for implementing production-grade observability with logs, metrics, traces, and alerts.

## Pre-Implementation Planning

- [ ] **Define observability goals** (debug production issues, monitor SLAs, detect anomalies)
- [ ] **Choose observability stack**:
  - [ ] Logging: Pino (Node.js), structlog (Python), CloudWatch, Datadog
  - [ ] Metrics: Prometheus, Datadog, CloudWatch
  - [ ] Tracing: OpenTelemetry, Datadog APM, Jaeger
  - [ ] Visualization: Grafana, Datadog, Honeycomb
- [ ] **Set up observability infrastructure** (collectors, storage, dashboards)
- [ ] **Define data retention** policies (logs: 30 days, metrics: 1 year, traces: 7 days)
- [ ] **Plan for scale** (log volume, metric cardinality, trace sampling)

## Structured Logging

### Logger Configuration

- [ ] **Structured logger installed**:
  - Node.js: `pino` with `pino-pretty` for dev
  - Python: `structlog` with JSON formatter
  - Browser: Custom JSON logger or service integration
- [ ] **Log levels defined**:
  - [ ] TRACE: Very detailed debugging
  - [ ] DEBUG: Detailed debugging info
  - [ ] INFO: General informational messages
  - [ ] WARN: Warning messages (recoverable issues)
  - [ ] ERROR: Error messages (failures)
  - [ ] FATAL: Critical failures (application crash)
- [ ] **Environment-based configuration**:
  - [ ] Development: Pretty-printed logs, DEBUG level
  - [ ] Production: JSON logs, INFO level
  - [ ] Test: Silent or minimal logs

### Log Structure

- [ ] **Standard log format** across services:

```json
{
  "level": "info",
  "timestamp": "2025-11-10T10:30:00.000Z",
  "service": "api-server",
  "environment": "production",
  "tenant_id": "uuid",
  "user_id": "uuid",
  "request_id": "uuid",
  "message": "User logged in",
  "duration_ms": 150,
  "http": {
    "method": "POST",
    "path": "/api/login",
    "status": 200,
    "user_agent": "Mozilla/5.0..."
  }
}
```

- [ ] **Correlation IDs** added:
  - [ ] `request_id`: Unique per request
  - [ ] `session_id`: Unique per session
  - [ ] `trace_id`: Unique per distributed trace
  - [ ] `tenant_id`: Multi-tenant context
- [ ] **Context propagation** through the request lifecycle
- [ ] **Sensitive data redacted** (passwords, tokens, credit cards)

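Redaction can be sketched as a recursive scrub over log fields before serialization. The key list here is illustrative, not exhaustive:

```typescript
// Recursively replace sensitive fields before a log entry is serialized.
// Extend SENSITIVE_KEYS for your own payloads.
const SENSITIVE_KEYS = new Set(["password", "token", "authorization", "credit_card"]);

function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [k, v] of Object.entries(value as Record<string, unknown>)) {
      out[k] = SENSITIVE_KEYS.has(k.toLowerCase()) ? "[REDACTED]" : redact(v);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}

const entry = redact({
  message: "User logged in",
  user: { id: "u-1", password: "hunter2" },
});
// entry.user.password is now "[REDACTED]"; other fields are untouched
```

Production loggers offer this as configuration (for example, redaction paths in Pino), which is usually preferable to hand-rolling.
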
### What to Log

- [ ] **Request/response logging**:
  - [ ] HTTP method, path, status code
  - [ ] Request duration
  - [ ] User agent, IP address (hashed or anonymized)
  - [ ] Query parameters (non-sensitive)
- [ ] **Authentication events**:
  - [ ] Login success/failure
  - [ ] Logout
  - [ ] Token refresh
  - [ ] Permission checks
- [ ] **Business events**:
  - [ ] User registration
  - [ ] Payment processing
  - [ ] Data exports
  - [ ] Admin actions
- [ ] **Errors and exceptions**:
  - [ ] Error message
  - [ ] Stack trace
  - [ ] Error context (what the user was doing)
  - [ ] Affected resources (user_id, tenant_id, entity_id)
- [ ] **Performance metrics**:
  - [ ] Database query times
  - [ ] External API call times
  - [ ] Cache hit/miss rates
  - [ ] Background job durations

### Log Aggregation

- [ ] **Logs shipped** to a central location:
  - [ ] CloudWatch Logs
  - [ ] Datadog Logs
  - [ ] Elasticsearch (ELK stack)
  - [ ] Splunk
- [ ] **Log retention** configured (30-90 days typical)
- [ ] **Log volume** monitored (cost management)
- [ ] **Log sampling** for high-volume services (if needed)

## Application Metrics

### Metric Types

- [ ] **Counters** for values that only increase:
  - [ ] Total requests
  - [ ] Total errors
  - [ ] Total registrations
- [ ] **Gauges** for values that go up and down:
  - [ ] Active connections
  - [ ] Memory usage
  - [ ] Queue depth
- [ ] **Histograms** for distributions:
  - [ ] Request duration
  - [ ] Response size
  - [ ] Database query time
- [ ] **Summaries** for quantiles (p50, p95, p99)

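The main types above can be sketched in a few lines. This is a stand-in for what `prom-client` maintains internally, not its API; the bucket bounds are the ones suggested later in this checklist:

```typescript
// Minimal counter / gauge / histogram -- a sketch of the semantics,
// not a replacement for the real client library.
class Counter {
  value = 0;
  inc(by = 1) { this.value += by; } // counters only go up
}

class Gauge {
  value = 0;
  set(v: number) { this.value = v; } // gauges move in both directions
}

class Histogram {
  // Cumulative bucket counts: one per upper bound, plus a final +Inf bucket.
  counts: number[];
  constructor(readonly bounds: number[] = [0.1, 0.5, 1, 2, 5, 10]) {
    this.counts = new Array(bounds.length + 1).fill(0);
  }
  observe(v: number) {
    // Every bucket whose upper bound >= v is incremented (buckets are cumulative).
    this.bounds.forEach((b, i) => { if (v <= b) this.counts[i]++; });
    this.counts[this.bounds.length]++; // +Inf bucket counts every observation
  }
}

const requests = new Counter();
requests.inc();                // e.g. http_requests_total += 1
const latency = new Histogram();
latency.observe(0.3);          // lands in the 0.5s bucket and all larger ones
```
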
### Standard Metrics

#### HTTP Metrics

- [ ] **http_requests_total** (counter):
  - Labels: method, path, status, tenant_id
  - Track total requests per endpoint
- [ ] **http_request_duration_seconds** (histogram):
  - Labels: method, path, status
  - Buckets: 0.1, 0.5, 1, 2, 5, 10 seconds
- [ ] **http_request_size_bytes** (histogram)
- [ ] **http_response_size_bytes** (histogram)

#### Database Metrics

- [ ] **db_queries_total** (counter):
  - Labels: operation (SELECT, INSERT, UPDATE, DELETE), table
- [ ] **db_query_duration_seconds** (histogram):
  - Labels: operation, table
  - Track slow queries (p95, p99)
- [ ] **db_connection_pool_size** (gauge)
- [ ] **db_connection_pool_available** (gauge)

#### Application Metrics

- [ ] **active_sessions** (gauge)
- [ ] **background_jobs_total** (counter):
  - Labels: job_name, status (success, failure)
- [ ] **background_job_duration_seconds** (histogram):
  - Labels: job_name
- [ ] **cache_operations_total** (counter):
  - Labels: operation (hit, miss, set, delete)
- [ ] **external_api_calls_total** (counter):
  - Labels: service, status
- [ ] **external_api_duration_seconds** (histogram):
  - Labels: service

#### System Metrics

- [ ] **process_cpu_usage_percent** (gauge)
- [ ] **process_memory_usage_bytes** (gauge)
- [ ] **process_heap_usage_bytes** (gauge) - JavaScript specific
- [ ] **process_open_file_descriptors** (gauge)

### Metric Collection

- [ ] **Prometheus client library** installed:
  - Node.js: `prom-client`
  - Python: `prometheus-client`
  - Custom: OpenTelemetry SDK
- [ ] **Metrics endpoint** exposed (`/metrics`)
- [ ] **Prometheus scrapes** the endpoint (or push to a gateway)
- [ ] **Metric naming** follows conventions:
  - Lowercase with underscores
  - Unit suffixes (`_seconds`, `_bytes`, `_total`)
  - Namespace prefix (`myapp_http_requests_total`)

### Multi-Tenant Metrics

- [ ] **tenant_id label** on all relevant metrics
- [ ] **Per-tenant dashboards** (filter by tenant_id)
- [ ] **Tenant resource usage** tracked:
  - [ ] API calls per tenant
  - [ ] Database storage per tenant
  - [ ] Data transfer per tenant
- [ ] **Tenant quotas** monitored (alert when approaching the limit)

## Distributed Tracing

### Tracing Setup

- [ ] **OpenTelemetry SDK** installed:
  - Node.js: `@opentelemetry/sdk-node`
  - Python: `opentelemetry-sdk`
- [ ] **Tracing backend** configured:
  - [ ] Jaeger (self-hosted)
  - [ ] Datadog APM
  - [ ] Honeycomb
  - [ ] AWS X-Ray
- [ ] **Auto-instrumentation** enabled:
  - [ ] HTTP client/server
  - [ ] Database queries
  - [ ] Redis operations
  - [ ] Message queues

### Span Creation

- [ ] **Custom spans** for business logic:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payments');

async function handlePayment(tenantId: string, amount: number) {
  const span = tracer.startSpan('process-payment');
  span.setAttribute('tenant_id', tenantId);
  span.setAttribute('amount', amount);
  try {
    await processPayment();
    span.setStatus({ code: SpanStatusCode.OK });
  } catch (error) {
    span.recordException(error as Error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  } finally {
    span.end(); // always end the span, even on failure
  }
}
```

- [ ] **Span attributes** include context:
  - [ ] tenant_id, user_id, request_id
  - [ ] Input parameters (non-sensitive)
  - [ ] Result status
- [ ] **Span events** for key moments:
  - [ ] "Payment started"
  - [ ] "Database query executed"
  - [ ] "External API called"

### Trace Context Propagation

- [ ] **W3C Trace Context** headers propagated:
  - `traceparent`: trace ID, parent span ID, flags
  - `tracestate`: vendor-specific data
- [ ] **Context propagated** across:
  - [ ] HTTP requests (frontend ↔ backend)
  - [ ] Background jobs
  - [ ] Message queues
  - [ ] Microservices
- [ ] **Trace ID** included in logs (to correlate logs and traces)

### Sampling

- [ ] **Sampling strategy** defined:
  - [ ] Head-based: sample at trace start (1%, 10%, 100%)
  - [ ] Tail-based: sample after the trace completes (error traces, slow traces)
  - [ ] Adaptive: sample based on load
- [ ] **Always sample** errors and slow requests
- [ ] **Sample rate** appropriate for volume (start high, reduce if needed)

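Head-based sampling should be deterministic in the trace ID so every service in a trace makes the same keep/drop decision. A sketch, with the 10% rate and the error/latency overrides taken from this checklist's suggestions (the hash is illustrative, not OpenTelemetry's algorithm):

```typescript
// Head-based sampler: hash the trace ID into [0, 100) and compare to the rate.
// Deterministic, so all services agree on which traces are kept.
function hashToPercent(traceId: string): number {
  let h = 0;
  for (const ch of traceId) h = (h * 31 + ch.charCodeAt(0)) >>> 0; // unsigned 32-bit
  return h % 100;
}

function shouldSample(traceId: string, ratePercent: number): boolean {
  return hashToPercent(traceId) < ratePercent;
}

// Tail-style override: always keep errors and slow requests.
function keepTrace(traceId: string, opts: { error: boolean; durationMs: number }): boolean {
  if (opts.error || opts.durationMs > 1000) return true;
  return shouldSample(traceId, 10); // 10% of ordinary traffic
}
```
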
## Alerting

### Alert Definitions

- [ ] **Error rate alerts**:
  - [ ] Condition: error rate > 5% for 5 minutes
  - [ ] Severity: Critical
  - [ ] Action: page the on-call engineer
- [ ] **Latency alerts**:
  - [ ] Condition: p95 latency > 1s for 10 minutes
  - [ ] Severity: Warning
  - [ ] Action: Slack notification
- [ ] **Availability alerts**:
  - [ ] Condition: health check fails 3 consecutive times
  - [ ] Severity: Critical
  - [ ] Action: page on-call + auto-restart
- [ ] **Resource alerts**:
  - [ ] Memory usage > 80%
  - [ ] CPU usage > 80%
  - [ ] Disk usage > 85%
  - [ ] Database connections > 90% of pool
- [ ] **Business metric alerts**:
  - [ ] Registration rate drops > 50%
  - [ ] Payment failures increase > 10%
  - [ ] Active users drop significantly

### Alert Channels

- [ ] **PagerDuty** (or equivalent) for critical alerts
- [ ] **Slack** for warnings and notifications
- [ ] **Email** for non-urgent alerts
- [ ] **SMS** for the highest priority (use sparingly)

### Alert Management

- [ ] **Alert fatigue** prevented:
  - [ ] Appropriate thresholds (not too sensitive)
  - [ ] Proper severity levels (not everything is critical)
  - [ ] Alert aggregation (deduplicate similar alerts)
- [ ] **Runbooks** for each alert:
  - [ ] What the alert means
  - [ ] How to investigate
  - [ ] How to resolve
  - [ ] Escalation path
- [ ] **Alert suppression** during deployments (planned downtime)
- [ ] **Alert escalation** if not acknowledged

## Dashboards & Visualization

### Standard Dashboards

- [ ] **Service Overview** dashboard:
  - [ ] Request rate (requests/sec)
  - [ ] Error rate (errors/sec, %)
  - [ ] Latency (p50, p95, p99)
  - [ ] Availability (uptime %)
- [ ] **Database** dashboard:
  - [ ] Query rate
  - [ ] Slow queries (p95, p99)
  - [ ] Connection pool usage
  - [ ] Table sizes
- [ ] **System Resources** dashboard:
  - [ ] CPU usage
  - [ ] Memory usage
  - [ ] Disk I/O
  - [ ] Network I/O
- [ ] **Business Metrics** dashboard:
  - [ ] Active users
  - [ ] Registrations
  - [ ] Revenue
  - [ ] Feature usage

### Dashboard Best Practices

- [ ] **Auto-refresh** enabled (every 30-60 seconds)
- [ ] **Time range** configurable (last hour, 24h, 7 days)
- [ ] **Drill-down** to detailed views
- [ ] **Annotations** for deployments and incidents
- [ ] **Shared dashboards** accessible to the team

### Per-Tenant Dashboards

- [ ] **Tenant filter** on all relevant dashboards
- [ ] **Tenant resource usage** visualized
- [ ] **Tenant-specific alerts** (for large customers)
- [ ] **Tenant comparison** view (compare usage across tenants)

## Health Checks

### Endpoint Implementation

- [ ] **Health check endpoint** (`/health` or `/healthz`):
  - [ ] Returns 200 OK when healthy
  - [ ] Returns 503 Service Unavailable when unhealthy
  - [ ] Includes subsystem status

```json
{
  "status": "healthy",
  "version": "1.2.3",
  "uptime_seconds": 86400,
  "checks": {
    "database": "healthy",
    "redis": "healthy",
    "external_api": "degraded"
  }
}
```

- [ ] **Liveness probe** (`/health/live`):
  - [ ] Checks whether the application is running
  - [ ] Fails → restart the container
- [ ] **Readiness probe** (`/health/ready`):
  - [ ] Checks whether the application is ready to serve traffic
  - [ ] Fails → remove from the load balancer

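Rolling subsystem results up into an overall status and HTTP code can be sketched like this; the three-state `healthy`/`degraded`/`unhealthy` model matches the JSON above, while treating `degraded` as still-serving (200) is a policy choice, not a standard:

```typescript
// Combine subsystem checks into an overall status and an HTTP status code,
// matching the /health payload shown above.
type CheckStatus = "healthy" | "degraded" | "unhealthy";

function overallHealth(
  checks: Record<string, CheckStatus>,
): { status: CheckStatus; httpStatus: number } {
  const states = Object.values(checks);
  if (states.includes("unhealthy")) {
    return { status: "unhealthy", httpStatus: 503 }; // take the pod out of rotation
  }
  if (states.includes("degraded")) {
    return { status: "degraded", httpStatus: 200 }; // keep serving, but flag it
  }
  return { status: "healthy", httpStatus: 200 };
}

overallHealth({ database: "healthy", redis: "healthy", external_api: "degraded" });
// → { status: "degraded", httpStatus: 200 }
```
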
### Health Check Coverage

- [ ] **Database connectivity** checked
- [ ] **Cache connectivity** checked (Redis, Memcached)
- [ ] **External APIs** checked (optional; can cause false positives)
- [ ] **Disk space** checked
- [ ] **Critical dependencies** checked

### Monitoring Health Checks

- [ ] **Uptime monitoring** service (Pingdom, UptimeRobot, Datadog Synthetics)
- [ ] **Check frequency** appropriate (every 1-5 minutes)
- [ ] **Alerting** on failed health checks
- [ ] **Geographic monitoring** (check from multiple regions)

## Error Tracking

### Error Capture

- [ ] **Error tracking service** integrated:
  - [ ] Sentry
  - [ ] Datadog Error Tracking
  - [ ] Rollbar
  - [ ] Custom solution
- [ ] **Unhandled exceptions** captured automatically
- [ ] **Handled errors** reported when appropriate
- [ ] **Error context** included:
  - [ ] User ID, tenant ID
  - [ ] Request ID, trace ID
  - [ ] User actions (breadcrumbs)
  - [ ] Environment variables (non-sensitive)

### Error Grouping

- [ ] **Errors grouped** by fingerprint (same error, different occurrences)
- [ ] **Error rate** tracked per group
- [ ] **Alerting** on new error types or spikes in existing ones
- [ ] **Error assignment** to team members
- [ ] **Resolution tracking** (mark errors as resolved)

### Privacy & Security

- [ ] **PII redacted** from error reports:
  - [ ] Passwords, tokens, API keys
  - [ ] Credit card numbers
  - [ ] Email addresses (unless necessary)
  - [ ] SSNs, tax IDs
- [ ] **Source maps** uploaded for the frontend (to de-minify stack traces)
- [ ] **Release tagging** (associate errors with deployments)

## Performance Monitoring

### Real User Monitoring (RUM)

- [ ] **RUM tool** integrated (Datadog RUM, New Relic Browser, Google Analytics):
  - [ ] Page load times
  - [ ] Core Web Vitals (LCP, FID, CLS)
  - [ ] JavaScript errors
  - [ ] User sessions
- [ ] **Performance budgets** defined:
  - [ ] First Contentful Paint < 1.8s
  - [ ] Largest Contentful Paint < 2.5s
  - [ ] Time to Interactive < 3.8s
  - [ ] Cumulative Layout Shift < 0.1
- [ ] **Alerting** on performance regressions

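A budget check in CI or monitoring just compares measured vitals against these ceilings. The thresholds below are the ones listed above; the field names (`fcpMs`, `lcpMs`, ...) are illustrative, not a standard schema:

```typescript
// Compare measured Web Vitals against the budgets above and report breaches.
interface Vitals { fcpMs: number; lcpMs: number; ttiMs: number; cls: number; }

const BUDGET: Vitals = { fcpMs: 1800, lcpMs: 2500, ttiMs: 3800, cls: 0.1 };

function budgetBreaches(measured: Vitals, budget: Vitals = BUDGET): string[] {
  return (Object.keys(budget) as (keyof Vitals)[])
    .filter((k) => measured[k] > budget[k]); // breach = measured over the ceiling
}

budgetBreaches({ fcpMs: 1200, lcpMs: 3100, ttiMs: 3500, cls: 0.05 });
// → ["lcpMs"]: only LCP is over budget
```
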
### Application Performance Monitoring (APM)

- [ ] **APM tool** integrated (Datadog APM, New Relic APM):
  - [ ] Trace every request
  - [ ] Identify slow endpoints
  - [ ] Database query analysis
  - [ ] External API profiling
- [ ] **Performance profiling** for critical paths:
  - [ ] Authentication flow
  - [ ] Payment processing
  - [ ] Data exports
  - [ ] Complex queries

## Cost Management

- [ ] **Observability costs** tracked:
  - [ ] Log ingestion costs
  - [ ] Metric cardinality costs
  - [ ] Trace sampling costs
  - [ ] Dashboard/seat costs
- [ ] **Cost optimization**:
  - [ ] Log sampling for high-volume services
  - [ ] Metric aggregation (reduce cardinality)
  - [ ] Trace sampling (not 100% in production)
  - [ ] Data retention policies
- [ ] **Budget alerts** configured

## Security & Compliance

- [ ] **Access control** on observability tools (role-based)
- [ ] **Audit logging** for observability access
- [ ] **Data retention** complies with regulations (GDPR, HIPAA)
- [ ] **Data encryption** in transit and at rest
- [ ] **PII handling** compliant (redaction, anonymization)

## Testing Observability

- [ ] **Log output** tested in unit tests:

```typescript
test('logs user login', async () => {
  const logs = captureLogs(); // test helper that collects emitted log entries
  await loginUser();
  expect(logs).toContainEqual(
    expect.objectContaining({
      level: 'info',
      message: 'User logged in',
      user_id: expect.any(String),
    })
  );
});
```

- [ ] **Metrics** incremented in tests
- [ ] **Traces** created in integration tests
- [ ] **Health checks** tested
- [ ] **Alert thresholds** tested (inject failures, verify the alert fires)

## Documentation

- [ ] **Observability runbook** created:
  - [ ] How to access logs, metrics, traces
  - [ ] How to create dashboards
  - [ ] How to set up alerts
  - [ ] Common troubleshooting queries
- [ ] **Alert runbooks** for each alert
- [ ] **Dashboard documentation** (what each panel shows)
- [ ] **Metric dictionary** (what each metric means)
- [ ] **On-call procedures** documented

## Scoring

- **85+ items checked**: Excellent - production-grade observability ✅
- **65-84 items**: Good - most observability covered ⚠️
- **45-64 items**: Fair - significant gaps exist 🔴
- **<45 items**: Poor - not ready for production ❌

## Priority Items

Address these first:

1. **Structured logging** - foundation for debugging
2. **Error tracking** - catch and fix bugs quickly
3. **Health checks** - know when the service is down
4. **Alerting** - get notified of issues
5. **Key metrics** - request rate, error rate, latency

## Common Pitfalls

❌ **Don't:**

- Log sensitive data (passwords, tokens, PII)
- Create high-cardinality metrics (e.g. user_id as a label)
- Trace 100% of requests in production (sample instead)
- Alert on every anomaly (alert fatigue)
- Ignore observability until there's a problem

✅ **Do:**

- Log at appropriate levels (use DEBUG for verbose output)
- Use correlation IDs throughout the request lifecycle
- Set up alerts with clear runbooks
- Review dashboards regularly (detect issues early)
- Iterate on observability (improve over time)

## Related Resources

- [OpenTelemetry Documentation](https://opentelemetry.io/docs/)
- [Pino Logger](https://getpino.io)
- [Prometheus Best Practices](https://prometheus.io/docs/practices/)
- [observability-engineering skill](../SKILL.md)

---

**Total Items**: 140+ observability checks
**Critical Items**: Logging, Metrics, Alerting, Health checks
**Coverage**: Logs, Metrics, Traces, Alerts, Dashboards
**Last Updated**: 2025-11-10

---

`skills/observability-engineering/examples/INDEX.md` (new file, 136 lines)

# Observability Examples

Production-ready observability implementations for the Grey Haven stack (Cloudflare Workers, TanStack Start, FastAPI, PostgreSQL).

## Examples Overview

### Prometheus + Grafana Setup

**File**: [prometheus-grafana-setup.md](prometheus-grafana-setup.md)

Complete monitoring stack for Kubernetes with Golden Signals:

- **Prometheus Deployment** - Helm charts, service discovery, scrape configs
- **Grafana Setup** - dashboards-as-code, templating, alerting
- **Node Exporter** - system metrics collection (CPU, memory, disk)
- **kube-state-metrics** - Kubernetes resource metrics
- **Golden Signals** - request rate, error rate, latency (p50/p95/p99), saturation
- **Recording Rules** - pre-aggregated metrics for fast queries
- **Alertmanager** - PagerDuty integration, escalation policies
- **Before/After Metrics** - response time improved 40%, MTTR reduced 60%

**Use when**: setting up production monitoring, implementing SRE practices, cloud-native deployments

---

### OpenTelemetry Distributed Tracing

**File**: [opentelemetry-tracing.md](opentelemetry-tracing.md)

Distributed tracing for microservices with Jaeger:

- **OTel Collector** - receiver/processor/exporter pipelines
- **Auto-Instrumentation** - zero-code tracing for Node.js, Python, FastAPI
- **Context Propagation** - W3C Trace Context across services
- **Sampling Strategies** - head-based (10%), tail-based (errors only)
- **Span Attributes** - HTTP method, status code, user ID, tenant ID
- **Trace Visualization** - Jaeger UI, dependency graphs, critical path
- **Performance Impact** - <5ms overhead, 2% CPU increase
- **Before/After** - MTTR 45min → 8min (82% reduction)

**Use when**: debugging microservices, understanding latency, optimizing critical paths

---

### SLO & Error Budget Framework

**File**: [slo-error-budgets.md](slo-error-budgets.md)

Complete SLI/SLO/error budget implementation:

- **SLI Definition** - availability (99.9%), latency (p95 < 200ms), error rate (< 0.5%)
- **SLO Targets** - Critical (99.95%), Essential (99.9%), Standard (99.5%)
- **Error Budget Calculation** - monthly budget, burn rate monitoring (1h/6h/24h windows)
- **Prometheus Recording Rules** - multi-window SLI calculations
- **Grafana SLO Dashboard** - real-time status, budget remaining, burn rate graphs
- **Budget Policies** - feature freeze at 25% remaining, postmortem required at depletion
- **Burn Rate Alerts** - PagerDuty escalation when burning too fast
- **Impact** - 99.95% availability achieved; 3 feature freezes prevented overspend

**Use when**: implementing SRE practices, balancing reliability with velocity, production deployments

---

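The arithmetic behind this framework is small: a 99.9% monthly SLO leaves 0.1% of the month as error budget, and burn rate is how fast that budget is being consumed relative to the allowance. A sketch over a 30-day month:

```typescript
// Error budget math for an availability SLO over a 30-day month.
const MONTH_MINUTES = 30 * 24 * 60; // 43,200 minutes

function errorBudgetMinutes(sloPercent: number): number {
  return MONTH_MINUTES * (1 - sloPercent / 100);
}

// Burn rate 1 = spending the budget exactly over the SLO window;
// burn rate 14.4 on a 99.9% SLO exhausts a month's budget in about 2 days.
function burnRate(observedErrorRate: number, sloPercent: number): number {
  const allowedErrorRate = 1 - sloPercent / 100;
  return observedErrorRate / allowedErrorRate;
}

errorBudgetMinutes(99.9); // ≈ 43.2 minutes of allowed downtime per month
burnRate(0.0144, 99.9);   // ≈ 14.4: burning far too fast, page immediately
```
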
### DataDog APM Integration

**File**: [datadog-apm.md](datadog-apm.md)

Application Performance Monitoring for the Grey Haven stack:

- **DataDog Agent** - Cloudflare Workers instrumentation, FastAPI tracing
- **Custom Metrics** - business KPIs (checkout success rate, revenue per minute)
- **Real User Monitoring (RUM)** - frontend performance, user sessions, error tracking
- **APM Traces** - distributed tracing with Cloudflare Workers, database queries
- **Log Correlation** - trace ID in logs, unified troubleshooting
- **Synthetic Monitoring** - API health checks every minute from 10 locations
- **Anomaly Detection** - ML-powered alerts for unusual patterns
- **Cost** - $31/host/month, $40/million spans
- **Before/After** - 99.5% → 99.95% availability (10x fewer incidents)

**Use when**: a commercial APM is needed, executive dashboards are required, startup budget allows

---

### Centralized Logging with Fluentd + Elasticsearch

**File**: [centralized-logging.md](centralized-logging.md)

Production log aggregation for multi-region deployments:

- **Fluentd DaemonSet** - Kubernetes log collection from all pods
- **Structured Logging** - JSON format with trace ID, user ID, tenant ID
- **Elasticsearch Indexing** - daily indices with rollover, ILM policies
- **Kibana Dashboards** - error tracking, request patterns, audit logs
- **Log Parsing** - Grok patterns for FastAPI, TanStack Start, PostgreSQL
- **Retention** - hot (7 days), warm (30 days), cold (90 days), archive (1 year)
- **PII Redaction** - automatic SSN, credit card, and email masking
- **Volume** - 500GB/day ingested, 90% compression, $800/month cost
- **Before/After** - log search 5min → 10sec, disk usage 10TB → 1TB

**Use when**: debugging production issues, compliance requirements (SOC 2/PCI), audit trails

---

### Chaos Engineering with Gremlin

**File**: [chaos-engineering.md](chaos-engineering.md)

Reliability testing and circuit breaker validation:

- **Gremlin Setup** - agent deployment, blast radius configuration
- **Chaos Experiments** - pod termination, network latency (100ms), CPU stress (80%)
- **Circuit Breaker** - automatic fallback when error rate > 50%
- **Hypothesis** - "API handles 50% pod failures without user impact"
- **Validation** - Prometheus metrics, distributed traces, user session monitoring
- **Results** - circuit breaker engaged in 2s, fallback success rate 99.8%
- **Runbook** - automatic rollback triggers, escalation procedures
- **Impact** - found 3 critical bugs before production; confidence in resilience

**Use when**: pre-production validation, testing disaster recovery, practicing chaos engineering

---

## Quick Navigation

| Topic | File | Lines | Focus |
|-------|------|-------|-------|
| **Prometheus + Grafana** | [prometheus-grafana-setup.md](prometheus-grafana-setup.md) | ~480 | Golden Signals monitoring |
| **OpenTelemetry** | [opentelemetry-tracing.md](opentelemetry-tracing.md) | ~450 | Distributed tracing |
| **SLO Framework** | [slo-error-budgets.md](slo-error-budgets.md) | ~420 | Error budget management |
| **DataDog APM** | [datadog-apm.md](datadog-apm.md) | ~400 | Commercial APM |
| **Centralized Logging** | [centralized-logging.md](centralized-logging.md) | ~440 | Log aggregation |
| **Chaos Engineering** | [chaos-engineering.md](chaos-engineering.md) | ~350 | Reliability testing |

## Related Documentation

- **Reference**: [Reference Index](../reference/INDEX.md) - PromQL, Golden Signals, SLO best practices
- **Templates**: [Templates Index](../templates/INDEX.md) - Grafana dashboards, SLO definitions
- **Main Agent**: [observability-engineer.md](../observability-engineer.md) - observability agent

---

Return to the [main agent](../observability-engineer.md).

---

`skills/observability-engineering/reference/INDEX.md` (new file, 67 lines)

# Observability Reference Documentation

Comprehensive reference guides for production observability patterns, PromQL queries, and SRE best practices.

## Reference Overview

### PromQL Query Language Guide

**File**: [promql-guide.md](promql-guide.md)

Complete PromQL reference for Prometheus queries:

- **Metric types**: Counter, Gauge, Histogram, Summary
- **PromQL functions**: rate(), irate(), increase(), sum(), avg(), histogram_quantile()
- **Recording rules**: pre-aggregated metrics for performance
- **Alerting queries**: burn rate calculations, threshold alerts
- **Performance tips**: query optimization, avoiding cardinality explosions

**Use when**: writing Prometheus queries, creating recording rules, debugging slow queries

---

### Golden Signals Reference

**File**: [golden-signals.md](golden-signals.md)

Google SRE Golden Signals implementation guide:

- **Request Rate (Traffic)**: RPS calculations, per-service breakdowns
- **Error Rate**: 5xx errors, client vs. server errors, error budget impact
- **Latency (Duration)**: p50/p95/p99 percentiles, latency SLOs
- **Saturation**: CPU, memory, disk, connection pools

**Use when**: designing monitoring dashboards, implementing SLIs, understanding system health

---

### SLO Best Practices

**File**: [slo-best-practices.md](slo-best-practices.md)

Google SRE SLO/SLI/error budget framework:

- **SLI selection**: choosing meaningful indicators (availability, latency, throughput)
- **SLO targets**: Critical (99.95%), Essential (99.9%), Standard (99.5%)
- **Error budget policies**: feature freeze thresholds, postmortem requirements
- **Multi-window burn rate alerts**: 1h, 6h, 24h windows
- **SLO review cadence**: weekly reviews, quarterly adjustments

**Use when**: implementing an SLO framework, setting reliability targets, balancing velocity with reliability

---

## Quick Navigation

| Topic | File | Lines | Focus |
|-------|------|-------|-------|
| **PromQL** | [promql-guide.md](promql-guide.md) | ~450 | Query language reference |
| **Golden Signals** | [golden-signals.md](golden-signals.md) | ~380 | Four signals implementation |
| **SLO Practices** | [slo-best-practices.md](slo-best-practices.md) | ~420 | Google SRE framework |

## Related Documentation

- **Examples**: [Examples Index](../examples/INDEX.md) - production implementations
- **Templates**: [Templates Index](../templates/INDEX.md) - copy-paste configurations
- **Main Agent**: [observability-engineer.md](../observability-engineer.md) - observability agent

---

Return to the [main agent](../observability-engineer.md).

72
skills/observability-engineering/templates/INDEX.md
Normal file
@@ -0,0 +1,72 @@
# Observability Templates

Copy-paste ready configuration templates for Prometheus, Grafana, and OpenTelemetry.

## Templates Overview

### Grafana Dashboard Template

**File**: [grafana-dashboard.json](grafana-dashboard.json)

Production-ready Golden Signals dashboard:

- **Request Rate**: Total RPS with 5-minute averages
- **Error Rate**: Percentage of 5xx errors with alert thresholds
- **Latency**: p50/p95/p99 percentiles in milliseconds
- **Saturation**: CPU and memory usage percentages

**Use when**: Creating new service dashboards, standardizing monitoring

---

### SLO Definition Template

**File**: [slo-definition.yaml](slo-definition.yaml)

Service Level Objective configuration:

- **SLO tiers**: Critical (99.95%), Essential (99.9%), Standard (99.5%)
- **SLI definitions**: Availability, latency, error rate
- **Error budget policy**: Feature freeze thresholds
- **Multi-window burn rate alerts**: 1h, 6h, 24h windows

**Use when**: Implementing SLO framework for new services
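
The multi-window burn-rate logic behind the template can be sketched as follows (the 14.4x threshold matches the template; function names are illustrative):

```python
WINDOW_DAYS = 30  # SLO window the burn rate is measured against

def hours_to_exhaustion(burn_rate: float) -> float:
    """A burn rate of B consumes the entire error budget in window/B."""
    return WINDOW_DAYS * 24 / burn_rate

def should_alert(short_burn: float, long_burn: float, threshold: float) -> bool:
    # Both windows must exceed the threshold: the long window filters out
    # brief spikes, the short window lets the alert reset quickly.
    return short_burn >= threshold and long_burn >= threshold

print(f"14.4x burn exhausts the budget in {hours_to_exhaustion(14.4):.0f}h")
print(f"6x burn exhausts it in {hours_to_exhaustion(6) / 24:.0f} days")
```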
---

### Prometheus Recording Rules

**File**: [prometheus-recording-rules.yaml](prometheus-recording-rules.yaml)

Pre-aggregated metrics for fast dashboards:

- **Request rates**: Per-service, per-endpoint RPS
- **Error rates**: Percentage calculations (5xx / total)
- **Latency percentiles**: p50/p95/p99 pre-computed
- **Error budget**: Remaining budget and burn rate

**Use when**: Optimizing slow dashboard queries, implementing SLOs
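
The error-budget rules boil down to one ratio. A minimal sketch of what the pre-computed metric represents (the 99.9% default mirrors the template):

```python
def error_budget_remaining(availability: float, slo: float = 0.999) -> float:
    """Fraction of the window's error budget still unspent."""
    return 1 - (1 - availability) / (1 - slo)

# 99.95% measured availability against a 99.9% SLO: half the budget is left.
print(error_budget_remaining(0.9995))
```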
---

## Quick Usage

```bash
# Copy template to your monitoring directory
cp templates/grafana-dashboard.json ../monitoring/dashboards/

# Edit service name and thresholds
vim ../monitoring/dashboards/grafana-dashboard.json

# Import to Grafana
curl -X POST http://admin:password@localhost:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @../monitoring/dashboards/grafana-dashboard.json
```

## Related Documentation

- **Examples**: [Examples Index](../examples/INDEX.md) - Full implementations
- **Reference**: [Reference Index](../reference/INDEX.md) - PromQL, SLO guides
- **Main Agent**: [observability-engineer.md](../observability-engineer.md) - Observability agent

---

Return to [main agent](../observability-engineer.md)
@@ -0,0 +1,210 @@
{
  "dashboard": {
    "title": "Golden Signals - [Service Name]",
    "tags": ["golden-signals", "production", "slo"],
    "timezone": "UTC",
    "refresh": "30s",
    "time": {
      "from": "now-6h",
      "to": "now"
    },
    "panels": [
      {
        "id": 1,
        "title": "Request Rate (RPS)",
        "type": "graph",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=\"YOUR_SERVICE\"}[5m]))",
            "legendFormat": "Total RPS",
            "refId": "A"
          },
          {
            "expr": "sum(rate(http_requests_total{service=\"YOUR_SERVICE\"}[5m])) by (method)",
            "legendFormat": "{{method}}",
            "refId": "B"
          }
        ],
        "yaxes": [
          {"format": "reqps", "label": "Requests/sec"},
          {"format": "short"}
        ],
        "legend": {"show": true, "alignAsTable": true, "avg": true, "max": true, "current": true}
      },
      {
        "id": 2,
        "title": "Error Rate (%)",
        "type": "graph",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{service=\"YOUR_SERVICE\",status=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"YOUR_SERVICE\"}[5m]))) * 100",
            "legendFormat": "Error Rate %",
            "refId": "A"
          }
        ],
        "yaxes": [
          {"format": "percent", "label": "Error %", "max": 5},
          {"format": "short"}
        ],
        "alert": {
          "name": "High Error Rate",
          "conditions": [
            {
              "evaluator": {"params": [1], "type": "gt"},
              "operator": {"type": "and"},
              "query": {"params": ["A", "5m", "now"]},
              "type": "query"
            }
          ],
          "frequency": "1m",
          "for": "5m",
          "message": "Error rate > 1% for 5 minutes",
          "noDataState": "no_data",
          "notifications": []
        },
        "thresholds": [
          {"value": 1, "colorMode": "critical", "op": "gt", "fill": true, "line": true}
        ]
      },
      {
        "id": 3,
        "title": "Request Latency (p50/p95/p99)",
        "type": "graph",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=\"YOUR_SERVICE\"}[5m])) by (le)) * 1000",
            "legendFormat": "p50",
            "refId": "A"
          },
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=\"YOUR_SERVICE\"}[5m])) by (le)) * 1000",
            "legendFormat": "p95",
            "refId": "B"
          },
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=\"YOUR_SERVICE\"}[5m])) by (le)) * 1000",
            "legendFormat": "p99",
            "refId": "C"
          }
        ],
        "yaxes": [
          {"format": "ms", "label": "Latency (ms)"},
          {"format": "short"}
        ],
        "thresholds": [
          {"value": 200, "colorMode": "warning", "op": "gt"},
          {"value": 500, "colorMode": "critical", "op": "gt"}
        ]
      },
      {
        "id": 4,
        "title": "Resource Saturation (CPU/Memory %)",
        "type": "graph",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
        "targets": [
          {
            "expr": "100 - (avg(irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "CPU %",
            "refId": "A"
          },
          {
            "expr": "100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))",
            "legendFormat": "Memory %",
            "refId": "B"
          }
        ],
        "yaxes": [
          {"format": "percent", "label": "Usage %", "max": 100},
          {"format": "short"}
        ],
        "thresholds": [
          {"value": 80, "colorMode": "warning", "op": "gt"},
          {"value": 90, "colorMode": "critical", "op": "gt"}
        ]
      },
      {
        "id": 5,
        "title": "Top 10 Slowest Endpoints",
        "type": "table",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
        "targets": [
          {
            "expr": "topk(10, histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=\"YOUR_SERVICE\"}[5m])) by (le, path))) * 1000",
            "legendFormat": "",
            "format": "table",
            "instant": true,
            "refId": "A"
          }
        ],
        "transformations": [
          {"id": "organize", "options": {"excludeByName": {}, "indexByName": {}, "renameByName": {"path": "Endpoint", "Value": "p95 Latency (ms)"}}}
        ]
      },
      {
        "id": 6,
        "title": "SLO Status (30-day)",
        "type": "stat",
        "gridPos": {"h": 8, "w": 6, "x": 12, "y": 16},
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{service=\"YOUR_SERVICE\",status=~\"2..|3..\"}[30d])) / sum(rate(http_requests_total{service=\"YOUR_SERVICE\"}[30d])) * 100",
            "refId": "A"
          }
        ],
        "options": {
          "graphMode": "none",
          "textMode": "value_and_name",
          "colorMode": "background"
        },
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "decimals": 3,
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"value": 0, "color": "red"},
                {"value": 99.5, "color": "yellow"},
                {"value": 99.9, "color": "green"}
              ]
            }
          }
        }
      },
      {
        "id": 7,
        "title": "Error Budget Remaining",
        "type": "gauge",
        "gridPos": {"h": 8, "w": 6, "x": 18, "y": 16},
        "targets": [
          {
            "expr": "(1 - ((1 - (sum(rate(http_requests_total{service=\"YOUR_SERVICE\",status=~\"2..|3..\"}[30d])) / sum(rate(http_requests_total{service=\"YOUR_SERVICE\"}[30d])))) / (1 - 0.999))) * 100",
            "refId": "A"
          }
        ],
        "options": {
          "showThresholdLabels": false,
          "showThresholdMarkers": true
        },
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100,
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"value": 0, "color": "red"},
                {"value": 25, "color": "yellow"},
                {"value": 50, "color": "green"}
              ]
            }
          }
        }
      }
    ]
  }
}
@@ -0,0 +1,188 @@
# Prometheus Recording Rules Template
# Pre-aggregated metrics for fast dashboard queries and SLO tracking
# Replace YOUR_SERVICE with actual service name

groups:
  # HTTP Request Rates
  - name: http_request_rates
    interval: 15s
    rules:
      # Total request rate (per-second)
      - record: greyhaven:http_requests:rate5m
        expr: sum(rate(http_requests_total{service="YOUR_SERVICE"}[5m]))

      # Request rate by service
      - record: greyhaven:http_requests:rate5m:by_service
        expr: sum(rate(http_requests_total[5m])) by (service)

      # Request rate by endpoint
      - record: greyhaven:http_requests:rate5m:by_endpoint
        expr: sum(rate(http_requests_total{service="YOUR_SERVICE"}[5m])) by (endpoint)

      # Request rate by method
      - record: greyhaven:http_requests:rate5m:by_method
        expr: sum(rate(http_requests_total{service="YOUR_SERVICE"}[5m])) by (method)

      # Request rate by status code
      - record: greyhaven:http_requests:rate5m:by_status
        expr: sum(rate(http_requests_total{service="YOUR_SERVICE"}[5m])) by (status)

  # HTTP Error Rates
  - name: http_error_rates
    interval: 15s
    rules:
      # Error rate (ratio of 5xx to total; multiply by 100 for a percentage)
      - record: greyhaven:http_errors:rate5m
        expr: |
          sum(rate(http_requests_total{service="YOUR_SERVICE",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="YOUR_SERVICE"}[5m]))

      # Error rate by service
      - record: greyhaven:http_errors:rate5m:by_service
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)

      # Error rate by endpoint
      - record: greyhaven:http_errors:rate5m:by_endpoint
        expr: |
          sum(rate(http_requests_total{service="YOUR_SERVICE",status=~"5.."}[5m])) by (endpoint)
          /
          sum(rate(http_requests_total{service="YOUR_SERVICE"}[5m])) by (endpoint)

  # HTTP Latency (Duration)
  - name: http_latency
    interval: 15s
    rules:
      # p50 latency (median)
      - record: greyhaven:http_latency:p50
        expr: histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service="YOUR_SERVICE"}[5m])) by (le))

      # p95 latency
      - record: greyhaven:http_latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="YOUR_SERVICE"}[5m])) by (le))

      # p99 latency
      - record: greyhaven:http_latency:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="YOUR_SERVICE"}[5m])) by (le))

      # Average latency
      - record: greyhaven:http_latency:avg
        expr: |
          sum(rate(http_request_duration_seconds_sum{service="YOUR_SERVICE"}[5m]))
          /
          sum(rate(http_request_duration_seconds_count{service="YOUR_SERVICE"}[5m]))

      # p95 latency by endpoint
      - record: greyhaven:http_latency:p95:by_endpoint
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="YOUR_SERVICE"}[5m])) by (le, endpoint))

  # Resource Saturation
  - name: resource_saturation
    interval: 15s
    rules:
      # CPU usage percentage
      - record: greyhaven:cpu_usage:percent
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      # Memory usage percentage
      - record: greyhaven:memory_usage:percent
        expr: 100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))

      # Disk usage percentage
      - record: greyhaven:disk_usage:percent
        expr: 100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)

      # Database connection pool saturation
      - record: greyhaven:db_pool:saturation
        expr: |
          db_pool_connections_active{service="YOUR_SERVICE"}
          /
          db_pool_connections_max{service="YOUR_SERVICE"}

  # SLI Calculations (Multi-Window)
  # Note: sum() drops all labels, so each rule re-attaches the service label
  # via `labels:` -- the error budget rules below select on it.
  - name: sli_calculations
    interval: 30s
    rules:
      # Availability SLI - 1 hour window
      - record: greyhaven:sli:availability:1h
        labels:
          service: YOUR_SERVICE
        expr: |
          sum(rate(http_requests_total{service="YOUR_SERVICE",status=~"2..|3.."}[1h]))
          /
          sum(rate(http_requests_total{service="YOUR_SERVICE"}[1h]))

      # Availability SLI - 6 hour window
      - record: greyhaven:sli:availability:6h
        labels:
          service: YOUR_SERVICE
        expr: |
          sum(rate(http_requests_total{service="YOUR_SERVICE",status=~"2..|3.."}[6h]))
          /
          sum(rate(http_requests_total{service="YOUR_SERVICE"}[6h]))

      # Availability SLI - 24 hour window
      - record: greyhaven:sli:availability:24h
        labels:
          service: YOUR_SERVICE
        expr: |
          sum(rate(http_requests_total{service="YOUR_SERVICE",status=~"2..|3.."}[24h]))
          /
          sum(rate(http_requests_total{service="YOUR_SERVICE"}[24h]))

      # Availability SLI - 30 day window
      - record: greyhaven:sli:availability:30d
        labels:
          service: YOUR_SERVICE
        expr: |
          sum(rate(http_requests_total{service="YOUR_SERVICE",status=~"2..|3.."}[30d]))
          /
          sum(rate(http_requests_total{service="YOUR_SERVICE"}[30d]))

      # Latency SLI - 1 hour window (% requests < 200ms)
      - record: greyhaven:sli:latency:1h
        expr: |
          sum(rate(http_request_duration_seconds_bucket{service="YOUR_SERVICE",le="0.2"}[1h]))
          /
          sum(rate(http_request_duration_seconds_count{service="YOUR_SERVICE"}[1h]))

      # Latency SLI - 30 day window
      - record: greyhaven:sli:latency:30d
        expr: |
          sum(rate(http_request_duration_seconds_bucket{service="YOUR_SERVICE",le="0.2"}[30d]))
          /
          sum(rate(http_request_duration_seconds_count{service="YOUR_SERVICE"}[30d]))

  # Error Budget Tracking
  - name: error_budget
    interval: 30s
    rules:
      # Error budget remaining (for 99.9% SLO)
      - record: greyhaven:error_budget:remaining:30d
        expr: |
          1 - (
            (1 - greyhaven:sli:availability:30d{service="YOUR_SERVICE"})
            /
            (1 - 0.999)
          )

      # Error budget burn rate - 1 hour window
      - record: greyhaven:error_budget:burn_rate:1h
        expr: |
          (1 - greyhaven:sli:availability:1h{service="YOUR_SERVICE"})
          /
          (1 - 0.999)

      # Error budget burn rate - 6 hour window
      - record: greyhaven:error_budget:burn_rate:6h
        expr: |
          (1 - greyhaven:sli:availability:6h{service="YOUR_SERVICE"})
          /
          (1 - 0.999)

      # Error budget burn rate - 24 hour window
      - record: greyhaven:error_budget:burn_rate:24h
        expr: |
          (1 - greyhaven:sli:availability:24h{service="YOUR_SERVICE"})
          /
          (1 - 0.999)

      # Error budget consumed (minutes of downtime; 43200 = minutes in 30 days)
      - record: greyhaven:error_budget:consumed:30d
        expr: |
          (1 - greyhaven:sli:availability:30d{service="YOUR_SERVICE"}) * 43200
173
skills/observability-engineering/templates/slo-definition.yaml
Normal file
@@ -0,0 +1,173 @@
# SLO Definition Template
# Replace YOUR_SERVICE with actual service name
# Replace 99.9 with your target SLO (99.5, 99.9, or 99.95)

apiVersion: monitoring.greyhaven.io/v1
kind: ServiceLevelObjective
metadata:
  name: YOUR_SERVICE-slo
  namespace: production
spec:
  # Service identification
  service: YOUR_SERVICE
  environment: production

  # SLO tier (critical, essential, standard)
  tier: essential

  # Time window (30 days recommended)
  window: 30d

  # SLO targets
  objectives:
    - name: availability
      target: 99.9  # 99.9% = 43.2 min downtime/month
      indicator:
        type: ratio
        success_query: |
          sum(rate(http_requests_total{service="YOUR_SERVICE",status=~"2..|3.."}[{{.window}}]))
        total_query: |
          sum(rate(http_requests_total{service="YOUR_SERVICE"}[{{.window}}]))

    - name: latency
      target: 95  # 95% of requests < 200ms
      indicator:
        type: ratio
        success_query: |
          sum(rate(http_request_duration_seconds_bucket{service="YOUR_SERVICE",le="0.2"}[{{.window}}]))
        total_query: |
          sum(rate(http_request_duration_seconds_count{service="YOUR_SERVICE"}[{{.window}}]))

    - name: error_rate
      target: 99.5  # <0.5% error rate
      indicator:
        type: ratio
        success_query: |
          sum(rate(http_requests_total{service="YOUR_SERVICE",status!~"5.."}[{{.window}}]))
        total_query: |
          sum(rate(http_requests_total{service="YOUR_SERVICE"}[{{.window}}]))

  # Error budget policy
  errorBudget:
    policy:
      - budget_range: [75%, 100%]
        action: "Normal feature development"
        approval: "Engineering team"

      - budget_range: [50%, 75%]
        action: "Monitor closely, increase testing"
        approval: "Engineering team"

      - budget_range: [25%, 50%]
        action: "Prioritize reliability work, reduce risky changes"
        approval: "Engineering manager"

      - budget_range: [0%, 25%]
        action: "Feature freeze, all hands on reliability"
        approval: "VP Engineering"
        requirements:
          - "Daily reliability standup"
          - "Postmortem for all incidents"
          - "No new features until budget >50%"

      - budget_range: [0%, 0%]
        action: "SLO violation - mandatory postmortem"
        approval: "VP Engineering + CTO"
        requirements:
          - "Complete postmortem within 48 hours"
          - "Action items with owners and deadlines"
          - "Present to exec team"

  # Multi-window burn rate alerts
  alerts:
    - name: error-budget-burn-rate-critical
      severity: critical
      windows:
        short: 1h
        long: 6h
      burn_rate_threshold: 14.4  # Budget exhausted in ~2 days (30d / 14.4 = 50h)
      for: 2m
      annotations:
        summary: "Critical burn rate - budget exhausted in ~2 days"
        description: "Service {{ $labels.service }} is burning error budget 14.4x faster than expected"
        runbook: "https://runbooks.greyhaven.io/slo-burn-rate"
      notifications:
        - type: pagerduty
          severity: critical

    - name: error-budget-burn-rate-high
      severity: warning
      windows:
        short: 6h
        long: 24h
      burn_rate_threshold: 6  # Budget exhausted in 5 days
      for: 15m
      annotations:
        summary: "High burn rate - budget exhausted in 5 days"
        description: "Service {{ $labels.service }} is burning error budget 6x faster than expected"
      notifications:
        - type: slack
          channel: "#alerts-reliability"

    - name: error-budget-burn-rate-medium
      severity: warning
      windows:
        short: 24h
        long: 24h
      burn_rate_threshold: 3  # Budget exhausted in 10 days
      for: 1h
      annotations:
        summary: "Medium burn rate - budget exhausted in 10 days"
      notifications:
        - type: slack
          channel: "#alerts-reliability"

    - name: error-budget-low
      severity: warning
      threshold: 0.25  # 25% remaining
      for: 5m
      annotations:
        summary: "Error budget low ({{ $value | humanizePercentage }} remaining)"
        description: "Consider feature freeze per error budget policy"
      notifications:
        - type: slack
          channel: "#engineering-managers"

    - name: error-budget-depleted
      severity: critical
      threshold: 0  # 0% remaining
      for: 5m
      annotations:
        summary: "Error budget depleted - feature freeze required"
        description: "SLO violated. Postmortem required within 48 hours."
      notifications:
        - type: pagerduty
          severity: critical
        - type: slack
          channel: "#exec-alerts"

  # Review cadence
  review:
    frequency: weekly
    participants:
      - team: engineering
      - team: product
      - team: sre
    agenda:
      - "Current error budget status"
      - "Burn rate trends"
      - "Recent incidents and impact"
      - "Upcoming risky changes"

  # Reporting
  reporting:
    dashboard:
      grafana_uid: YOUR_SERVICE_slo_dashboard
      panels:
        - slo_status
        - error_budget_remaining
        - burn_rate_multiwindow
        - incident_timeline
    export:
      format: prometheus
      recording_rules: true