---
name: monitoring-observability
description: Monitoring and observability strategy, implementation, and troubleshooting. Use for designing metrics/logs/traces systems, setting up Prometheus/Grafana/Loki, creating alerts and dashboards, calculating SLOs and error budgets, analyzing performance issues, and comparing monitoring tools (Datadog, ELK, CloudWatch). Covers the Four Golden Signals, RED/USE methods, OpenTelemetry instrumentation, log aggregation patterns, and distributed tracing.
---

# Monitoring & Observability

## Overview

This skill provides comprehensive guidance for monitoring and observability workflows, including metrics design, log aggregation, distributed tracing, alerting strategies, SLO/SLA management, and tool selection.

**When to use this skill**:
- Setting up monitoring for new services
- Designing alerts and dashboards
- Troubleshooting performance issues
- Implementing SLO tracking and error budgets
- Choosing between monitoring tools
- Integrating OpenTelemetry instrumentation
- Analyzing metrics, logs, and traces
- Optimizing Datadog costs and finding waste
- Migrating from Datadog to an open-source stack

---

## Core Workflow: Observability Implementation

Use this decision tree to determine your starting point:

```
Are you setting up monitoring from scratch?
├─ YES → Start with "1. Design Metrics Strategy"
└─ NO → Do you have an existing issue?
   ├─ YES → Go to "9. Troubleshooting & Analysis"
   └─ NO → Are you improving existing monitoring?
      ├─ Alerts → Go to "3. Alert Design"
      ├─ Dashboards → Go to "4. Dashboard & Visualization"
      ├─ SLOs → Go to "5. SLO & Error Budgets"
      ├─ Tool selection → Read references/tool_comparison.md
      └─ Using Datadog? High costs? → Go to "7. Datadog Cost Optimization & Migration"
```

---

## 1. Design Metrics Strategy

### Start with the Four Golden Signals

Every service should monitor:

1. **Latency**: Response time (p50, p95, p99)
2. **Traffic**: Requests per second
3. **Errors**: Failure rate
4. **Saturation**: Resource utilization

**For request-driven services**, use the **RED Method**:
- **R**ate: Requests/sec
- **E**rrors: Error rate
- **D**uration: Response time

**For infrastructure resources**, use the **USE Method**:
- **U**tilization: % time busy
- **S**aturation: Queue depth
- **E**rrors: Error count

**Quick Start - Web Application Example**:

```promql
# Rate (requests/sec)
sum(rate(http_requests_total[5m]))

# Errors (error rate %)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# Duration (p95 latency)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
```
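These queries assume the application already exposes `http_requests_total` and `http_request_duration_seconds`. A minimal instrumentation sketch using the Python `prometheus_client` library; the metric names match the queries above, while the `do_work` handler and port are illustrative assumptions:

```python
from prometheus_client import Counter, Histogram, start_http_server

# Metric names chosen to match the example PromQL queries above.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["method", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

def do_work(method: str) -> str:
    """Placeholder for real request handling; returns an HTTP status label."""
    return "200"

def handle_request(method: str) -> None:
    with LATENCY.time():                              # observes request duration
        status = do_work(method)
    REQUESTS.labels(method=method, status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)                           # exposes /metrics for Prometheus to scrape
```

Keep label values low-cardinality (HTTP method and status code, not user or request IDs).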
### Deep Dive: Metric Design

For comprehensive metric design guidance including:
- Metric types (counter, gauge, histogram, summary)
- Cardinality best practices
- Naming conventions
- Dashboard design principles

**→ Read**: [references/metrics_design.md](references/metrics_design.md)

### Automated Metric Analysis

Detect anomalies and trends in your metrics:

```bash
# Analyze Prometheus metrics for anomalies
python3 scripts/analyze_metrics.py prometheus \
  --endpoint http://localhost:9090 \
  --query 'rate(http_requests_total[5m])' \
  --hours 24

# Analyze CloudWatch metrics
python3 scripts/analyze_metrics.py cloudwatch \
  --namespace AWS/EC2 \
  --metric CPUUtilization \
  --dimensions InstanceId=i-1234567890abcdef0 \
  --hours 48
```

**→ Script**: [scripts/analyze_metrics.py](scripts/analyze_metrics.py)

---

## 2. Log Aggregation & Analysis

### Structured Logging Checklist

Every log entry should include:
- ✅ Timestamp (ISO 8601 format)
- ✅ Log level (DEBUG, INFO, WARN, ERROR, FATAL)
- ✅ Message (human-readable)
- ✅ Service name
- ✅ Request ID (for tracing)

**Example structured log (JSON)**:

```json
{
  "timestamp": "2024-10-28T14:32:15Z",
  "level": "error",
  "message": "Payment processing failed",
  "service": "payment-service",
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "user_id": "user123",
  "order_id": "ORD-456",
  "error_type": "GatewayTimeout",
  "duration_ms": 5000
}
```
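**Python example (stdlib only)**: a minimal sketch that emits the checklist fields above as single-line JSON using the standard `logging` module. The service name and the way `request_id` is passed are illustrative assumptions, not a prescribed implementation:

```python
import json
import logging
import sys
import uuid
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line containing the checklist fields."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": "payment-service",                        # assumed service name
            "request_id": getattr(record, "request_id", None),   # supplied via extra=
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment processing failed", extra={"request_id": str(uuid.uuid4())})
```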
### Log Aggregation Patterns

**ELK Stack** (Elasticsearch, Logstash, Kibana):
- Best for: Deep log analysis, complex queries
- Cost: High (infrastructure + operations)
- Complexity: High

**Grafana Loki**:
- Best for: Cost-effective logging, Kubernetes
- Cost: Low
- Complexity: Medium

**CloudWatch Logs**:
- Best for: AWS-centric applications
- Cost: Medium
- Complexity: Low

### Log Analysis

Analyze logs for errors, patterns, and anomalies:

```bash
# Analyze log file for patterns
python3 scripts/log_analyzer.py application.log

# Show error lines with context
python3 scripts/log_analyzer.py application.log --show-errors

# Extract stack traces
python3 scripts/log_analyzer.py application.log --show-traces
```

**→ Script**: [scripts/log_analyzer.py](scripts/log_analyzer.py)

### Deep Dive: Logging

For comprehensive logging guidance including:
- Structured logging implementation examples (Python, Node.js, Go, Java)
- Log aggregation patterns (ELK, Loki, CloudWatch, Fluentd)
- Query patterns and best practices
- PII redaction and security
- Sampling and rate limiting

**→ Read**: [references/logging_guide.md](references/logging_guide.md)

---

## 3. Alert Design

### Alert Design Principles

1. **Every alert must be actionable** - If you can't do something, don't alert
2. **Alert on symptoms, not causes** - Alert on user experience, not components
3. **Tie alerts to SLOs** - Connect to business impact
4. **Reduce noise** - Only page for critical issues

### Alert Severity Levels

| Severity | Response Time | Example |
|----------|---------------|---------|
| **Critical** | Page immediately | Service down, SLO violation |
| **Warning** | Ticket, review in hours | Elevated error rate, resource warning |
| **Info** | Log for awareness | Deployment completed, scaling event |

### Multi-Window Burn Rate Alerting

Alert when the error budget is consumed too quickly:

```yaml
# Fast burn (1h window) - Critical
- alert: ErrorBudgetFastBurn
  expr: |
    (error_rate / 0.001) > 14.4  # 99.9% SLO
  for: 2m
  labels:
    severity: critical

# Slow burn (6h window) - Warning
- alert: ErrorBudgetSlowBurn
  expr: |
    (error_rate / 0.001) > 6  # 99.9% SLO
  for: 30m
  labels:
    severity: warning
```
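Where do the `14.4` and `6` thresholds come from? They follow the common multi-window convention of paging when roughly 2% of a 30-day error budget is burned within 1 hour and warning when roughly 5% is burned within 6 hours. A quick sketch of the arithmetic (the window sizes and budget fractions are tunable conventions, not requirements):

```python
def burn_rate_threshold(budget_fraction: float, window_hours: float, period_days: int = 30) -> float:
    """Burn rate that consumes `budget_fraction` of the error budget within `window_hours`."""
    return budget_fraction * (period_days * 24) / window_hours

print(burn_rate_threshold(0.02, 1))   # 14.4 -> fast burn, critical page
print(burn_rate_threshold(0.05, 6))   # 6.0  -> slow burn, warning ticket
```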
### Alert Quality Checker

Audit your alert rules against best practices:

```bash
# Check single file
python3 scripts/alert_quality_checker.py alerts.yml

# Check all rules in directory
python3 scripts/alert_quality_checker.py /path/to/prometheus/rules/
```

**Checks for**:
- Alert naming conventions
- Required labels (severity, team)
- Required annotations (summary, description, runbook_url)
- PromQL expression quality
- 'for' clause to prevent flapping

**→ Script**: [scripts/alert_quality_checker.py](scripts/alert_quality_checker.py)

### Alert Templates

Production-ready alert rule templates:

**→ Templates**:
- [assets/templates/prometheus-alerts/webapp-alerts.yml](assets/templates/prometheus-alerts/webapp-alerts.yml) - Web application alerts
- [assets/templates/prometheus-alerts/kubernetes-alerts.yml](assets/templates/prometheus-alerts/kubernetes-alerts.yml) - Kubernetes alerts

### Deep Dive: Alerting

For comprehensive alerting guidance including:
- Alert design patterns (multi-window, rate of change, threshold with hysteresis)
- Alert annotation best practices
- Alert routing (severity-based, team-based, time-based)
- Inhibition rules
- Runbook structure
- On-call best practices

**→ Read**: [references/alerting_best_practices.md](references/alerting_best_practices.md)

### Runbook Template

Create comprehensive runbooks for your alerts:

**→ Template**: [assets/templates/runbooks/incident-runbook-template.md](assets/templates/runbooks/incident-runbook-template.md)

---

## 4. Dashboard & Visualization

### Dashboard Design Principles

1. **Top-down layout**: Most important metrics first
2. **Color coding**: Red (critical), yellow (warning), green (healthy)
3. **Consistent time windows**: All panels use same time range
4. **Limit panels**: 8-12 panels per dashboard maximum
5. **Include context**: Show related metrics together

### Recommended Dashboard Structure

```
┌─────────────────────────────────────┐
│ Overall Health (Single Stats)       │
│ [Requests/s] [Error%] [P95 Latency] │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Request Rate & Errors (Graphs)      │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Latency Distribution (Graphs)       │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Resource Usage (Graphs)             │
└─────────────────────────────────────┘
```

### Generate Grafana Dashboards

Automatically generate dashboards from templates:

```bash
# Web application dashboard
python3 scripts/dashboard_generator.py webapp \
  --title "My API Dashboard" \
  --service my_api \
  --output dashboard.json

# Kubernetes dashboard
python3 scripts/dashboard_generator.py kubernetes \
  --title "K8s Production" \
  --namespace production \
  --output k8s-dashboard.json

# Database dashboard
python3 scripts/dashboard_generator.py database \
  --title "PostgreSQL" \
  --db-type postgres \
  --instance db.example.com:5432 \
  --output db-dashboard.json
```

**Supports**:
- Web applications (requests, errors, latency, resources)
- Kubernetes (pods, nodes, resources, network)
- Databases (PostgreSQL, MySQL)

**→ Script**: [scripts/dashboard_generator.py](scripts/dashboard_generator.py)

---

## 5. SLO & Error Budgets

### SLO Fundamentals

**SLI** (Service Level Indicator): Measurement of service quality
- Example: Request latency, error rate, availability

**SLO** (Service Level Objective): Target value for an SLI
- Example: "99.9% of requests return in < 500ms"

**Error Budget**: Allowed failure amount = (100% - SLO)
- Example: 99.9% SLO = 0.1% error budget = 43.2 minutes/month

### Common SLO Targets

| Availability | Downtime/Month | Use Case |
|--------------|----------------|----------|
| **99%** | 7.2 hours | Internal tools |
| **99.9%** | 43.2 minutes | Standard production |
| **99.95%** | 21.6 minutes | Critical services |
| **99.99%** | 4.3 minutes | High availability |
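**Worked example**: a small sketch of the error-budget arithmetic behind this table (the request counts mirror the example flags used with `slo_calculator.py` below and are not real traffic data):

```python
def error_budget_minutes(slo_percent: float, period_days: int = 30) -> float:
    """Downtime allowed per period for a given availability SLO."""
    return period_days * 24 * 60 * (1 - slo_percent / 100)

def budget_consumed(failed: int, total: int, slo_percent: float) -> float:
    """Fraction of the error budget spent so far (1.0 = fully spent)."""
    allowed_failures = total * (1 - slo_percent / 100)
    return failed / allowed_failures

print(error_budget_minutes(99.9))               # 43.2 minutes/month, as in the table
print(budget_consumed(1500, 1_000_000, 99.9))   # 1.5 -> budget overspent by 50%
```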
### SLO Calculator

Calculate compliance, error budgets, and burn rates:

```bash
# Show SLO reference table
python3 scripts/slo_calculator.py --table

# Calculate availability SLO
python3 scripts/slo_calculator.py availability \
  --slo 99.9 \
  --total-requests 1000000 \
  --failed-requests 1500 \
  --period-days 30

# Calculate burn rate
python3 scripts/slo_calculator.py burn-rate \
  --slo 99.9 \
  --errors 50 \
  --requests 10000 \
  --window-hours 1
```

**→ Script**: [scripts/slo_calculator.py](scripts/slo_calculator.py)

### Deep Dive: SLO/SLA

For comprehensive SLO/SLA guidance including:
- Choosing appropriate SLIs
- Setting realistic SLO targets
- Error budget policies
- Burn rate alerting
- SLA structure and contracts
- Monthly reporting templates

**→ Read**: [references/slo_sla_guide.md](references/slo_sla_guide.md)

---

## 6. Distributed Tracing

### When to Use Tracing

Use distributed tracing when you need to:
- Debug performance issues across services
- Understand request flow through microservices
- Identify bottlenecks in distributed systems
- Find N+1 query problems

### OpenTelemetry Implementation

**Python example**:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("process_order")
def process_order(order_id):
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)
    try:
        result = payment_service.charge(order_id)
        span.set_attribute("payment.status", "success")
        return result
    except Exception as e:
        span.set_status(trace.Status(trace.StatusCode.ERROR))
        span.record_exception(e)
        raise
```

### Sampling Strategies

- **Development**: 100% (ALWAYS_ON)
- **Staging**: 50-100%
- **Production**: 1-10% (or error-based sampling)

**Error-based sampling** (always sample errors, ~1% of successes):

```python
from opentelemetry.sdk.trace.sampling import Decision, Sampler, SamplingResult

class ErrorSampler(Sampler):
    """Always sample spans flagged as errors; sample ~1% of everything else."""

    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        attributes = attributes or {}
        if attributes.get("error", False):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        if trace_id & 0xFF < 3:  # ~1% of trace IDs (3/256)
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        return SamplingResult(Decision.DROP)

    def get_description(self) -> str:
        return "ErrorSampler"
```
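**Wiring the sampler into the SDK**: a sketch, assuming the `opentelemetry-sdk` and `opentelemetry-exporter-otlp` packages are installed and an OTLP-capable collector (such as the one configured below) is listening on `localhost:4317`; the service name is illustrative:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(
    sampler=ErrorSampler(),  # the error-based sampler defined above
    resource=Resource.create({"service.name": "payment-service"}),
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
```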
### OTel Collector Configuration

Production-ready OpenTelemetry Collector configuration:

**→ Template**: [assets/templates/otel-config/collector-config.yaml](assets/templates/otel-config/collector-config.yaml)

**Features**:
- Receives OTLP, Prometheus, and host metrics
- Batching and memory limiting
- Tail sampling (error-based, latency-based, probabilistic)
- Multiple exporters (Tempo, Jaeger, Loki, Prometheus, CloudWatch, Datadog)

### Deep Dive: Tracing

For comprehensive tracing guidance including:
- OpenTelemetry instrumentation (Python, Node.js, Go, Java)
- Span attributes and semantic conventions
- Context propagation (W3C Trace Context)
- Backend comparison (Jaeger, Tempo, X-Ray, Datadog APM)
- Analysis patterns (finding slow traces, N+1 queries)
- Integration with logs

**→ Read**: [references/tracing_guide.md](references/tracing_guide.md)

---

## 7. Datadog Cost Optimization & Migration

### Scenario 1: I'm Using Datadog and Costs Are Too High

If your Datadog bill is growing out of control, start by identifying waste:

#### Cost Analysis Script

Automatically analyze your Datadog usage and find cost optimization opportunities:

```bash
# Analyze Datadog usage (requires API key and APP key)
python3 scripts/datadog_cost_analyzer.py \
  --api-key $DD_API_KEY \
  --app-key $DD_APP_KEY

# Show detailed breakdown by category
python3 scripts/datadog_cost_analyzer.py \
  --api-key $DD_API_KEY \
  --app-key $DD_APP_KEY \
  --show-details
```

**What it checks**:
- Infrastructure host count and cost
- Custom metrics usage and high-cardinality metrics
- Log ingestion volume and trends
- APM host usage
- Unused or noisy monitors
- Container vs VM optimization opportunities

**→ Script**: [scripts/datadog_cost_analyzer.py](scripts/datadog_cost_analyzer.py)

#### Common Cost Optimization Strategies

**1. Custom Metrics Optimization** (typical savings: 20-40%):
- Remove high-cardinality tags (user IDs, request IDs)
- Delete unused custom metrics
- Aggregate metrics before sending
- Use metric prefixes to identify teams/services

**2. Log Management** (typical savings: 30-50%):
- Implement log sampling for high-volume services
- Use exclusion filters for debug/trace logs in production
- Archive cold logs to S3/GCS after 7 days
- Set log retention policies (15 days instead of 30)

**3. APM Optimization** (typical savings: 15-25%):
- Reduce trace sampling rates (10% → 5% in prod)
- Use head-based sampling instead of ingesting every trace
- Remove APM from non-critical services
- Use trace search with lower retention

**4. Infrastructure Monitoring** (typical savings: 10-20%):
- Switch from VM-based to container-based pricing where possible
- Remove agents from ephemeral instances
- Use Datadog's host reduction strategies
- Consolidate staging environments
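The log sampling strategy above can often be applied in the application itself, before logs are shipped. A minimal sketch using Python's standard `logging` filters; the 10% rate and logger name are placeholders to tune per service:

```python
import logging
import random

class SamplingFilter(logging.Filter):
    """Keep all WARNING-and-above records; keep only a fraction of the rest."""

    def __init__(self, sample_rate: float = 0.1):
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True                         # never drop warnings or errors
        return random.random() < self.sample_rate

logger = logging.getLogger("high-volume-service")
logger.addFilter(SamplingFilter(sample_rate=0.1))   # ship ~10% of INFO/DEBUG lines
```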
### Scenario 2: Migrating Away from Datadog

If you're considering migrating to a more cost-effective open-source stack:

#### Migration Overview

**From Datadog** → **To Open Source Stack**:
- Metrics: Datadog → **Prometheus + Grafana**
- Logs: Datadog Logs → **Grafana Loki**
- Traces: Datadog APM → **Tempo or Jaeger**
- Dashboards: Datadog → **Grafana**
- Alerts: Datadog Monitors → **Prometheus Alertmanager**

**Estimated Cost Savings**: 60-77% ($49.8k-61.8k/year for a 100-host environment)

#### Migration Strategy

**Phase 1: Run Parallel** (Month 1-2):
- Deploy open-source stack alongside Datadog
- Migrate metrics first (lowest risk)
- Validate data accuracy

**Phase 2: Migrate Dashboards & Alerts** (Month 2-3):
- Convert Datadog dashboards to Grafana
- Translate alert rules (use the DQL → PromQL guide below)
- Train team on new tools

**Phase 3: Migrate Logs & Traces** (Month 3-4):
- Set up Loki for log aggregation
- Deploy Tempo/Jaeger for tracing
- Update application instrumentation

**Phase 4: Decommission Datadog** (Month 4-5):
- Confirm all functionality migrated
- Cancel Datadog subscription

#### Query Translation: DQL → PromQL

When migrating dashboards and alerts, you'll need to translate Datadog queries to PromQL.

**Quick examples**:

```
# Average CPU (user mode)
Datadog:    avg:system.cpu.user{*}
Prometheus: avg(rate(node_cpu_seconds_total{mode="user"}[5m])) * 100

# Request rate
Datadog:    sum:requests.count{*}.as_rate()
Prometheus: sum(rate(http_requests_total[5m]))

# P95 latency
Datadog:    p95:request.duration{*}
Prometheus: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate percentage
Datadog:    (sum:requests.errors{*}.as_rate() / sum:requests.count{*}.as_rate()) * 100
Prometheus: (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
```

**→ Full Translation Guide**: [references/dql_promql_translation.md](references/dql_promql_translation.md)

#### Cost Comparison

**Example: 100-host infrastructure**

| Component | Datadog (Annual) | Open Source (Annual) | Savings |
|-----------|------------------|----------------------|---------|
| Infrastructure | $18,000 | $10,000 (self-hosted infra) | $8,000 |
| Custom Metrics | $600 | Included | $600 |
| Logs | $24,000 | $3,000 (storage) | $21,000 |
| APM/Traces | $37,200 | $5,000 (storage) | $32,200 |
| **Total** | **$79,800** | **$18,000** | **$61,800 (77%)** |

### Deep Dive: Datadog Migration

For comprehensive migration guidance including:
- Detailed cost comparison and ROI calculations
- Step-by-step migration instructions
- Infrastructure sizing recommendations (CPU, RAM, storage)
- Dashboard conversion tools and examples
- Alert rule translation patterns
- Application instrumentation changes (DogStatsD → Prometheus client)
- Python scripts for exporting Datadog dashboards and monitors
- Common challenges and solutions

**→ Read**: [references/datadog_migration.md](references/datadog_migration.md)

---

## 8. Tool Selection & Comparison

### Decision Matrix

**Choose Prometheus + Grafana if**:
- ✅ Using Kubernetes
- ✅ Want control and customization
- ✅ Have ops capacity
- ✅ Budget-conscious

**Choose Datadog if**:
- ✅ Want ease of use
- ✅ Need full observability now
- ✅ Budget allows ($8k+/month for 100 hosts)

**Choose Grafana Stack (LGTM) if**:
- ✅ Want open source full stack
- ✅ Cost-effective solution
- ✅ Cloud-native architecture

**Choose ELK Stack if**:
- ✅ Heavy log analysis needs
- ✅ Need powerful search
- ✅ Have dedicated ops team

**Choose Cloud Native (CloudWatch, etc.) if**:
- ✅ Single cloud provider
- ✅ Simple needs
- ✅ Want minimal setup

### Cost Comparison (100 hosts, 1TB logs/month)

| Solution | Monthly Cost | Setup | Ops Burden |
|----------|--------------|-------|------------|
| Prometheus + Loki + Tempo | $1,500 | Medium | Medium |
| Grafana Cloud | $3,000 | Low | Low |
| Datadog | $8,000 | Low | None |
| ELK Stack | $4,000 | High | High |
| CloudWatch | $2,000 | Low | Low |

### Deep Dive: Tool Comparison

For comprehensive tool comparison including:
- Metrics platforms (Prometheus, Datadog, New Relic, CloudWatch, Grafana Cloud)
- Logging platforms (ELK, Loki, Splunk, CloudWatch Logs, Sumo Logic)
- Tracing platforms (Jaeger, Tempo, Datadog APM, X-Ray)
- Full-stack observability comparison
- Recommendations by company size

**→ Read**: [references/tool_comparison.md](references/tool_comparison.md)

---

## 9. Troubleshooting & Analysis

### Health Check Validation

Validate health check endpoints against best practices:

```bash
# Check single endpoint
python3 scripts/health_check_validator.py https://api.example.com/health

# Check multiple endpoints
python3 scripts/health_check_validator.py \
  https://api.example.com/health \
  https://api.example.com/readiness \
  --verbose
```

**Checks for**:
- ✓ Returns 200 status code
- ✓ Response time < 1 second
- ✓ Returns JSON format
- ✓ Contains 'status' field
- ✓ Includes version/build info
- ✓ Checks dependencies
- ✓ Disables caching

**→ Script**: [scripts/health_check_validator.py](scripts/health_check_validator.py)
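**Example endpoint**: a minimal sketch of a health endpoint that would pass the checks above, assuming Flask and hypothetical `check_database`/`check_cache` dependency probes:

```python
from flask import Flask, jsonify

app = Flask(__name__)
SERVICE_VERSION = "1.4.2"  # hypothetical build/version string

def check_database() -> bool:  # hypothetical dependency probe
    return True

def check_cache() -> bool:     # hypothetical dependency probe
    return True

@app.route("/health")
def health():
    deps = {"database": check_database(), "cache": check_cache()}
    status = "ok" if all(deps.values()) else "degraded"
    response = jsonify({"status": status, "version": SERVICE_VERSION, "dependencies": deps})
    response.headers["Cache-Control"] = "no-store"  # keep health responses uncached
    return response, 200 if status == "ok" else 503
```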
### Common Troubleshooting Workflows

**High Latency Investigation**:
1. Check dashboards for latency spike
2. Query traces for slow operations
3. Check database slow query log
4. Check external API response times
5. Review recent deployments
6. Check resource utilization (CPU, memory)

**High Error Rate Investigation**:
1. Check error logs for patterns
2. Identify affected endpoints
3. Check dependency health
4. Review recent deployments
5. Check resource limits
6. Verify configuration

**Service Down Investigation**:
1. Check if pods/instances are running
2. Check health check endpoint
3. Review recent deployments
4. Check resource availability
5. Check network connectivity
6. Review logs for startup errors

---

## Quick Reference Commands

### Prometheus Queries

```promql
# Request rate
sum(rate(http_requests_total[5m]))

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# P95 latency
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# CPU usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
```

### Kubernetes Commands

```bash
# Check pod status
kubectl get pods -n <namespace>

# View pod logs
kubectl logs -f <pod-name> -n <namespace>

# Check pod resources
kubectl top pods -n <namespace>

# Describe pod for events
kubectl describe pod <pod-name> -n <namespace>

# Check recent deployments
kubectl rollout history deployment/<deployment-name> -n <namespace>
```

### Log Queries

**Elasticsearch**:

```json
GET /logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "error" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}
```

**Loki (LogQL)**:

```logql
{job="app", level="error"} |= "error" | json
```

**CloudWatch Insights**:

```
fields @timestamp, level, message
| filter level = "error"
| filter @timestamp > ago(1h)
```

---

## Resources Summary

### Scripts (automation and analysis)

- `analyze_metrics.py` - Detect anomalies in Prometheus/CloudWatch metrics
- `alert_quality_checker.py` - Audit alert rules against best practices
- `slo_calculator.py` - Calculate SLO compliance and error budgets
- `log_analyzer.py` - Parse logs for errors and patterns
- `dashboard_generator.py` - Generate Grafana dashboards from templates
- `health_check_validator.py` - Validate health check endpoints
- `datadog_cost_analyzer.py` - Analyze Datadog usage and find cost waste

### References (deep-dive documentation)

- `metrics_design.md` - Four Golden Signals, RED/USE methods, metric types
- `alerting_best_practices.md` - Alert design, runbooks, on-call practices
- `logging_guide.md` - Structured logging, aggregation patterns
- `tracing_guide.md` - OpenTelemetry, distributed tracing
- `slo_sla_guide.md` - SLI/SLO/SLA definitions, error budgets
- `tool_comparison.md` - Comprehensive comparison of monitoring tools
- `datadog_migration.md` - Complete guide for migrating from Datadog to an OSS stack
- `dql_promql_translation.md` - Datadog Query Language to PromQL translation reference

### Templates (ready-to-use configurations)

- `prometheus-alerts/webapp-alerts.yml` - Production-ready web app alerts
- `prometheus-alerts/kubernetes-alerts.yml` - Kubernetes monitoring alerts
- `otel-config/collector-config.yaml` - OpenTelemetry Collector configuration
- `runbooks/incident-runbook-template.md` - Incident response template

---

## Best Practices

### Metrics
- Start with the Four Golden Signals
- Use appropriate metric types (counter, gauge, histogram)
- Keep cardinality low (avoid high-cardinality labels)
- Follow naming conventions

### Logging
- Use structured logging (JSON)
- Include request IDs for tracing
- Set appropriate log levels
- Redact PII before logging

### Alerting
- Make every alert actionable
- Alert on symptoms, not causes
- Use multi-window burn rate alerts
- Include runbook links

### Tracing
- Sample appropriately (1-10% in production)
- Always record errors
- Use semantic conventions
- Propagate context between services

### SLOs
- Start with current performance
- Set realistic targets
- Define error budget policies
- Review and adjust quarterly