Initial commit

Zhongwei Li
2025-11-29 17:51:22 +08:00
commit 23753b435e
24 changed files with 9837 additions and 0 deletions


@@ -0,0 +1,12 @@
{
"name": "monitoring-observability",
"description": "Monitoring and observability strategy, metrics/logs/traces systems, SLOs/error budgets, Prometheus/Grafana/Loki, OpenTelemetry, and tool comparison",
"version": "0.0.0-2025.11.28",
"author": {
"name": "Ahmad Asmar",
"email": "zhongweili@tubi.tv"
},
"skills": [
"./"
]
}

3
README.md Normal file

@@ -0,0 +1,3 @@
# monitoring-observability
Monitoring and observability strategy, metrics/logs/traces systems, SLOs/error budgets, Prometheus/Grafana/Loki, OpenTelemetry, and tool comparison

869
SKILL.md Normal file

@@ -0,0 +1,869 @@
---
name: monitoring-observability
description: Monitoring and observability strategy, implementation, and troubleshooting. Use for designing metrics/logs/traces systems, setting up Prometheus/Grafana/Loki, creating alerts and dashboards, calculating SLOs and error budgets, analyzing performance issues, and comparing monitoring tools (Datadog, ELK, CloudWatch). Covers the Four Golden Signals, RED/USE methods, OpenTelemetry instrumentation, log aggregation patterns, and distributed tracing.
---
# Monitoring & Observability
## Overview
This skill provides comprehensive guidance for monitoring and observability workflows including metrics design, log aggregation, distributed tracing, alerting strategies, SLO/SLA management, and tool selection.
**When to use this skill**:
- Setting up monitoring for new services
- Designing alerts and dashboards
- Troubleshooting performance issues
- Implementing SLO tracking and error budgets
- Choosing between monitoring tools
- Integrating OpenTelemetry instrumentation
- Analyzing metrics, logs, and traces
- Optimizing Datadog costs and finding waste
- Migrating from Datadog to open-source stack
---
## Core Workflow: Observability Implementation
Use this decision tree to determine your starting point:
```
Are you setting up monitoring from scratch?
├─ YES → Start with "1. Design Metrics Strategy"
└─ NO → Do you have an existing issue?
├─ YES → Go to "9. Troubleshooting & Analysis"
└─ NO → Are you improving existing monitoring?
├─ Alerts → Go to "3. Alert Design"
├─ Dashboards → Go to "4. Dashboard & Visualization"
├─ SLOs → Go to "5. SLO & Error Budgets"
├─ Tool selection → Read references/tool_comparison.md
└─ Using Datadog? High costs? → Go to "7. Datadog Cost Optimization & Migration"
```
---
## 1. Design Metrics Strategy
### Start with The Four Golden Signals
Every service should monitor:
1. **Latency**: Response time (p50, p95, p99)
2. **Traffic**: Requests per second
3. **Errors**: Failure rate
4. **Saturation**: Resource utilization
**For request-driven services**, use the **RED Method**:
- **R**ate: Requests/sec
- **E**rrors: Error rate
- **D**uration: Response time
**For infrastructure resources**, use the **USE Method**:
- **U**tilization: % time busy
- **S**aturation: Queue depth
- **E**rrors: Error count
**Quick Start - Web Application Example**:
```promql
# Rate (requests/sec)
sum(rate(http_requests_total[5m]))
# Errors (error rate %)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# Duration (p95 latency)
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
```
### Deep Dive: Metric Design
For comprehensive metric design guidance including:
- Metric types (counter, gauge, histogram, summary; see the sketch below)
- Cardinality best practices
- Naming conventions
- Dashboard design principles
**→ Read**: [references/metrics_design.md](references/metrics_design.md)
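To give a quick feel for the four metric types listed above, a minimal sketch using the Python `prometheus_client` library (metric names are illustrative):
```python
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

# Counter: only ever increases (requests, errors)
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])
# Gauge: can go up and down (in-flight requests, queue depth)
IN_FLIGHT = Gauge("http_requests_in_flight", "Requests currently being served")
# Histogram: bucketed observations; enables histogram_quantile() for p95/p99
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")
# Summary: client-side quantiles (cannot be aggregated across instances)
PAYLOAD = Summary("http_request_size_bytes", "Request payload size in bytes")

start_http_server(8000)  # exposes /metrics for Prometheus to scrape

REQUESTS.labels(status="200").inc()
PAYLOAD.observe(512)
with IN_FLIGHT.track_inprogress(), LATENCY.time():
    pass  # handle the request here
```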
### Automated Metric Analysis
Detect anomalies and trends in your metrics:
```bash
# Analyze Prometheus metrics for anomalies
python3 scripts/analyze_metrics.py prometheus \
--endpoint http://localhost:9090 \
--query 'rate(http_requests_total[5m])' \
--hours 24
# Analyze CloudWatch metrics
python3 scripts/analyze_metrics.py cloudwatch \
--namespace AWS/EC2 \
--metric CPUUtilization \
--dimensions InstanceId=i-1234567890abcdef0 \
--hours 48
```
**→ Script**: [scripts/analyze_metrics.py](scripts/analyze_metrics.py)
---
## 2. Log Aggregation & Analysis
### Structured Logging Checklist
Every log entry should include:
- ✅ Timestamp (ISO 8601 format)
- ✅ Log level (DEBUG, INFO, WARN, ERROR, FATAL)
- ✅ Message (human-readable)
- ✅ Service name
- ✅ Request ID (for tracing)
**Example structured log (JSON)**:
```json
{
"timestamp": "2024-10-28T14:32:15Z",
"level": "error",
"message": "Payment processing failed",
"service": "payment-service",
"request_id": "550e8400-e29b-41d4-a716-446655440000",
"user_id": "user123",
"order_id": "ORD-456",
"error_type": "GatewayTimeout",
"duration_ms": 5000
}
```
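A minimal Python sketch that emits logs in this shape using only the standard library (the service name and field set are illustrative; the logging guide below covers fuller implementations):
```python
import json
import logging
import sys
import uuid
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line with the checklist fields."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": "payment-service",  # assumed service name
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment processing failed", extra={"request_id": str(uuid.uuid4())})
```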
### Log Aggregation Patterns
**ELK Stack** (Elasticsearch, Logstash, Kibana):
- Best for: Deep log analysis, complex queries
- Cost: High (infrastructure + operations)
- Complexity: High
**Grafana Loki**:
- Best for: Cost-effective logging, Kubernetes
- Cost: Low
- Complexity: Medium
**CloudWatch Logs**:
- Best for: AWS-centric applications
- Cost: Medium
- Complexity: Low
### Log Analysis
Analyze logs for errors, patterns, and anomalies:
```bash
# Analyze log file for patterns
python3 scripts/log_analyzer.py application.log
# Show error lines with context
python3 scripts/log_analyzer.py application.log --show-errors
# Extract stack traces
python3 scripts/log_analyzer.py application.log --show-traces
```
**→ Script**: [scripts/log_analyzer.py](scripts/log_analyzer.py)
### Deep Dive: Logging
For comprehensive logging guidance including:
- Structured logging implementation examples (Python, Node.js, Go, Java)
- Log aggregation patterns (ELK, Loki, CloudWatch, Fluentd)
- Query patterns and best practices
- PII redaction and security
- Sampling and rate limiting
**→ Read**: [references/logging_guide.md](references/logging_guide.md)
---
## 3. Alert Design
### Alert Design Principles
1. **Every alert must be actionable** - If you can't do something, don't alert
2. **Alert on symptoms, not causes** - Alert on user experience, not components
3. **Tie alerts to SLOs** - Connect to business impact
4. **Reduce noise** - Only page for critical issues
### Alert Severity Levels
| Severity | Response Time | Example |
|----------|--------------|---------|
| **Critical** | Page immediately | Service down, SLO violation |
| **Warning** | Ticket, review in hours | Elevated error rate, resource warning |
| **Info** | Log for awareness | Deployment completed, scaling event |
### Multi-Window Burn Rate Alerting
Alert when error budget is consumed too quickly:
```yaml
# Fast burn (1h window) - Critical
- alert: ErrorBudgetFastBurn
expr: |
(error_rate / 0.001) > 14.4 # 99.9% SLO
for: 2m
labels:
severity: critical
# Slow burn (6h window) - Warning
- alert: ErrorBudgetSlowBurn
expr: |
(error_rate / 0.001) > 6 # 99.9% SLO
for: 30m
labels:
severity: warning
```
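To see where multipliers like 14.4x come from, a small sketch of the arithmetic (assuming a 99.9% SLO and a 30-day budget period):
```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is burning."""
    return error_rate / (1.0 - slo)  # error budget fraction, e.g. 0.001 for 99.9%

def hours_to_exhaustion(rate: float, period_hours: float = 30 * 24) -> float:
    """Hours until the whole period's error budget is gone at this burn rate."""
    return period_hours / rate

r = burn_rate(error_rate=0.01, slo=0.999)  # 1% errors against a 99.9% SLO
print(r)                                   # 10.0x burn rate
print(hours_to_exhaustion(r))              # 72.0 hours until the budget is gone
print(hours_to_exhaustion(14.4))           # 50.0 hours: why a fast burn pages immediately
```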
### Alert Quality Checker
Audit your alert rules against best practices:
```bash
# Check single file
python3 scripts/alert_quality_checker.py alerts.yml
# Check all rules in directory
python3 scripts/alert_quality_checker.py /path/to/prometheus/rules/
```
**Checks for**:
- Alert naming conventions
- Required labels (severity, team)
- Required annotations (summary, description, runbook_url)
- PromQL expression quality
- 'for' clause to prevent flapping
**→ Script**: [scripts/alert_quality_checker.py](scripts/alert_quality_checker.py)
### Alert Templates
Production-ready alert rule templates:
**→ Templates**:
- [assets/templates/prometheus-alerts/webapp-alerts.yml](assets/templates/prometheus-alerts/webapp-alerts.yml) - Web application alerts
- [assets/templates/prometheus-alerts/kubernetes-alerts.yml](assets/templates/prometheus-alerts/kubernetes-alerts.yml) - Kubernetes alerts
### Deep Dive: Alerting
For comprehensive alerting guidance including:
- Alert design patterns (multi-window, rate of change, threshold with hysteresis)
- Alert annotation best practices
- Alert routing (severity-based, team-based, time-based)
- Inhibition rules
- Runbook structure
- On-call best practices
**→ Read**: [references/alerting_best_practices.md](references/alerting_best_practices.md)
### Runbook Template
Create comprehensive runbooks for your alerts:
**→ Template**: [assets/templates/runbooks/incident-runbook-template.md](assets/templates/runbooks/incident-runbook-template.md)
---
## 4. Dashboard & Visualization
### Dashboard Design Principles
1. **Top-down layout**: Most important metrics first
2. **Color coding**: Red (critical), yellow (warning), green (healthy)
3. **Consistent time windows**: All panels use same time range
4. **Limit panels**: 8-12 panels per dashboard maximum
5. **Include context**: Show related metrics together
### Recommended Dashboard Structure
```
┌─────────────────────────────────────┐
│ Overall Health (Single Stats) │
│ [Requests/s] [Error%] [P95 Latency]│
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Request Rate & Errors (Graphs) │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Latency Distribution (Graphs) │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Resource Usage (Graphs) │
└─────────────────────────────────────┘
```
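To make the layout concrete, a rough sketch of the JSON such a dashboard reduces to (field names follow Grafana's dashboard JSON model, but titles, queries, and `schemaVersion` are illustrative; use `dashboard_generator.py` below for real dashboards):
```python
import json

dashboard = {
    "title": "My API Dashboard",
    "schemaVersion": 39,  # assumed; Grafana rewrites this on import
    "time": {"from": "now-6h", "to": "now"},
    "panels": [
        {
            "type": "stat",
            "title": "Requests/s",
            "gridPos": {"h": 4, "w": 8, "x": 0, "y": 0},
            "targets": [{"expr": "sum(rate(http_requests_total[5m]))", "refId": "A"}],
        },
        {
            "type": "timeseries",
            "title": "Request Rate & Errors",
            "gridPos": {"h": 8, "w": 24, "x": 0, "y": 4},
            "targets": [
                {"expr": "sum(rate(http_requests_total[5m]))", "refId": "A"},
                {"expr": 'sum(rate(http_requests_total{status=~"5.."}[5m]))', "refId": "B"},
            ],
        },
    ],
}

with open("dashboard.json", "w") as f:
    json.dump({"dashboard": dashboard, "overwrite": True}, f, indent=2)
```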
### Generate Grafana Dashboards
Automatically generate dashboards from templates:
```bash
# Web application dashboard
python3 scripts/dashboard_generator.py webapp \
--title "My API Dashboard" \
--service my_api \
--output dashboard.json
# Kubernetes dashboard
python3 scripts/dashboard_generator.py kubernetes \
--title "K8s Production" \
--namespace production \
--output k8s-dashboard.json
# Database dashboard
python3 scripts/dashboard_generator.py database \
--title "PostgreSQL" \
--db-type postgres \
--instance db.example.com:5432 \
--output db-dashboard.json
```
**Supports**:
- Web applications (requests, errors, latency, resources)
- Kubernetes (pods, nodes, resources, network)
- Databases (PostgreSQL, MySQL)
**→ Script**: [scripts/dashboard_generator.py](scripts/dashboard_generator.py)
---
## 5. SLO & Error Budgets
### SLO Fundamentals
**SLI** (Service Level Indicator): Measurement of service quality
- Example: Request latency, error rate, availability
**SLO** (Service Level Objective): Target value for an SLI
- Example: "99.9% of requests return in < 500ms"
**Error Budget**: Allowed failure amount = (100% - SLO)
- Example: 99.9% SLO = 0.1% error budget = 43.2 minutes/month
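The arithmetic behind these numbers (and the table that follows), as a small sketch:
```python
def error_budget_minutes(slo_percent: float, period_days: int = 30) -> float:
    """Allowed downtime (or failed-request time) per period for an availability SLO."""
    budget_fraction = 1.0 - slo_percent / 100.0
    return budget_fraction * period_days * 24 * 60

for slo in (99.0, 99.9, 99.95, 99.99):
    print(f"{slo}% SLO -> {error_budget_minutes(slo):.1f} minutes / 30 days")
# 99.0% -> 432.0 (7.2 h), 99.9% -> 43.2, 99.95% -> 21.6, 99.99% -> 4.3
```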
### Common SLO Targets
| Availability | Downtime/Month | Use Case |
|--------------|----------------|----------|
| **99%** | 7.2 hours | Internal tools |
| **99.9%** | 43.2 minutes | Standard production |
| **99.95%** | 21.6 minutes | Critical services |
| **99.99%** | 4.3 minutes | High availability |
### SLO Calculator
Calculate compliance, error budgets, and burn rates:
```bash
# Show SLO reference table
python3 scripts/slo_calculator.py --table
# Calculate availability SLO
python3 scripts/slo_calculator.py availability \
--slo 99.9 \
--total-requests 1000000 \
--failed-requests 1500 \
--period-days 30
# Calculate burn rate
python3 scripts/slo_calculator.py burn-rate \
--slo 99.9 \
--errors 50 \
--requests 10000 \
--window-hours 1
```
**→ Script**: [scripts/slo_calculator.py](scripts/slo_calculator.py)
### Deep Dive: SLO/SLA
For comprehensive SLO/SLA guidance including:
- Choosing appropriate SLIs
- Setting realistic SLO targets
- Error budget policies
- Burn rate alerting
- SLA structure and contracts
- Monthly reporting templates
**→ Read**: [references/slo_sla_guide.md](references/slo_sla_guide.md)
---
## 6. Distributed Tracing
### When to Use Tracing
Use distributed tracing when you need to:
- Debug performance issues across services
- Understand request flow through microservices
- Identify bottlenecks in distributed systems
- Find N+1 query problems
### OpenTelemetry Implementation
**Python example**:
```python
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
@tracer.start_as_current_span("process_order")
def process_order(order_id):
span = trace.get_current_span()
span.set_attribute("order.id", order_id)
try:
result = payment_service.charge(order_id)
span.set_attribute("payment.status", "success")
return result
except Exception as e:
span.set_status(trace.Status(trace.StatusCode.ERROR))
span.record_exception(e)
raise
```
### Sampling Strategies
- **Development**: 100% (ALWAYS_ON)
- **Staging**: 50-100%
- **Production**: 1-10% (or error-based sampling)
**Error-based sampling** (always sample errors, 1% of successes):
```python
from opentelemetry.sdk.trace.sampling import Decision, Sampler, SamplingResult

class ErrorSampler(Sampler):
    """Always sample error spans; sample roughly 1% of everything else."""
    def should_sample(self, parent_context, trace_id, name, kind=None,
                      attributes=None, links=None, trace_state=None):
        attributes = attributes or {}
        if attributes.get("error", False):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        if trace_id & 0xFF < 3:  # 3/256 ≈ 1%
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        return SamplingResult(Decision.DROP)

    def get_description(self):
        return "ErrorSampler"
```
### OTel Collector Configuration
Production-ready OpenTelemetry Collector configuration:
**→ Template**: [assets/templates/otel-config/collector-config.yaml](assets/templates/otel-config/collector-config.yaml)
**Features**:
- Receives OTLP, Prometheus, and host metrics
- Batching and memory limiting
- Tail sampling (error-based, latency-based, probabilistic)
- Multiple exporters (Tempo, Jaeger, Loki, Prometheus, CloudWatch, Datadog)
### Deep Dive: Tracing
For comprehensive tracing guidance including:
- OpenTelemetry instrumentation (Python, Node.js, Go, Java)
- Span attributes and semantic conventions
- Context propagation (W3C Trace Context)
- Backend comparison (Jaeger, Tempo, X-Ray, Datadog APM)
- Analysis patterns (finding slow traces, N+1 queries)
- Integration with logs
**→ Read**: [references/tracing_guide.md](references/tracing_guide.md)
---
## 7. Datadog Cost Optimization & Migration
### Scenario 1: I'm Using Datadog and Costs Are Too High
If your Datadog bill is growing out of control, start by identifying waste:
#### Cost Analysis Script
Automatically analyze your Datadog usage and find cost optimization opportunities:
```bash
# Analyze Datadog usage (requires API key and APP key)
python3 scripts/datadog_cost_analyzer.py \
--api-key $DD_API_KEY \
--app-key $DD_APP_KEY
# Show detailed breakdown by category
python3 scripts/datadog_cost_analyzer.py \
--api-key $DD_API_KEY \
--app-key $DD_APP_KEY \
--show-details
```
**What it checks**:
- Infrastructure host count and cost
- Custom metrics usage and high-cardinality metrics
- Log ingestion volume and trends
- APM host usage
- Unused or noisy monitors
- Container vs VM optimization opportunities
**→ Script**: [scripts/datadog_cost_analyzer.py](scripts/datadog_cost_analyzer.py)
#### Common Cost Optimization Strategies
**1. Custom Metrics Optimization** (typical savings: 20-40%):
- Remove high-cardinality tags (user IDs, request IDs)
- Delete unused custom metrics
- Aggregate metrics before sending
- Use metric prefixes to identify teams/services
**2. Log Management** (typical savings: 30-50%):
- Implement log sampling for high-volume services
- Use exclusion filters for debug/trace logs in production
- Archive cold logs to S3/GCS after 7 days
- Set log retention policies (15 days instead of 30)
**3. APM Optimization** (typical savings: 15-25%):
- Reduce trace sampling rates (10% → 5% in prod)
- Use head-based sampling instead of ingesting 100% of traces
- Remove APM from non-critical services
- Use trace search with lower retention
**4. Infrastructure Monitoring** (typical savings: 10-20%):
- Switch from VM-based to container-based pricing where possible
- Remove agents from ephemeral instances
- Use Datadog's host reduction strategies
- Consolidate staging environments
### Scenario 2: Migrating Away from Datadog
If you're considering migrating to a more cost-effective open-source stack:
#### Migration Overview
**From Datadog** → **To Open Source Stack**:
- Metrics: Datadog → **Prometheus + Grafana**
- Logs: Datadog Logs → **Grafana Loki**
- Traces: Datadog APM → **Tempo or Jaeger**
- Dashboards: Datadog → **Grafana**
- Alerts: Datadog Monitors → **Prometheus Alertmanager**
**Estimated Cost Savings**: 60-77% ($49.8k-61.8k/year for 100-host environment)
#### Migration Strategy
**Phase 1: Run Parallel** (Month 1-2):
- Deploy open-source stack alongside Datadog
- Migrate metrics first (lowest risk)
- Validate data accuracy
**Phase 2: Migrate Dashboards & Alerts** (Month 2-3):
- Convert Datadog dashboards to Grafana
- Translate alert rules (use DQL → PromQL guide below)
- Train team on new tools
**Phase 3: Migrate Logs & Traces** (Month 3-4):
- Set up Loki for log aggregation
- Deploy Tempo/Jaeger for tracing
- Update application instrumentation
**Phase 4: Decommission Datadog** (Month 4-5):
- Confirm all functionality migrated
- Cancel Datadog subscription
#### Query Translation: DQL → PromQL
When migrating dashboards and alerts, you'll need to translate Datadog queries to PromQL:
**Quick examples**:
```
# Average CPU
Datadog: avg:system.cpu.user{*}
Prometheus: avg(node_cpu_seconds_total{mode="user"})
# Request rate
Datadog: sum:requests.count{*}.as_rate()
Prometheus: sum(rate(http_requests_total[5m]))
# P95 latency
Datadog: p95:request.duration{*}
Prometheus: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Error rate percentage
Datadog: (sum:requests.errors{*}.as_rate() / sum:requests.count{*}.as_rate()) * 100
Prometheus: (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
```
**→ Full Translation Guide**: [references/dql_promql_translation.md](references/dql_promql_translation.md)
#### Cost Comparison
**Example: 100-host infrastructure**
| Component | Datadog (Annual) | Open Source (Annual) | Savings |
|-----------|-----------------|---------------------|---------|
| Infrastructure | $18,000 | $10,000 (self-hosted infra) | $8,000 |
| Custom Metrics | $600 | Included | $600 |
| Logs | $24,000 | $3,000 (storage) | $21,000 |
| APM/Traces | $37,200 | $5,000 (storage) | $32,200 |
| **Total** | **$79,800** | **$18,000** | **$61,800 (77%)** |
### Deep Dive: Datadog Migration
For comprehensive migration guidance including:
- Detailed cost comparison and ROI calculations
- Step-by-step migration instructions
- Infrastructure sizing recommendations (CPU, RAM, storage)
- Dashboard conversion tools and examples
- Alert rule translation patterns
- Application instrumentation changes (DogStatsD → Prometheus client; see the sketch below)
- Python scripts for exporting Datadog dashboards and monitors
- Common challenges and solutions
**→ Read**: [references/datadog_migration.md](references/datadog_migration.md)
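To give a flavor of the instrumentation change mentioned above, an illustrative before/after sketch (metric names, tags, and the port are assumptions, not prescriptions):
```python
# Before: DogStatsD (pushed to the Datadog agent on the host)
from datadog import statsd
statsd.increment("payment.requests", tags=["status:success"])
statsd.histogram("payment.duration", 0.142, tags=["status:success"])

# After: Prometheus client library (pulled by Prometheus from /metrics)
from prometheus_client import Counter, Histogram, start_http_server
PAYMENT_REQUESTS = Counter("payment_requests_total", "Payment requests", ["status"])
PAYMENT_DURATION = Histogram("payment_duration_seconds", "Payment duration", ["status"])
start_http_server(8000)  # Prometheus scrapes this endpoint

PAYMENT_REQUESTS.labels(status="success").inc()
PAYMENT_DURATION.labels(status="success").observe(0.142)
```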
---
## 8. Tool Selection & Comparison
### Decision Matrix
**Choose Prometheus + Grafana if**:
- ✅ Using Kubernetes
- ✅ Want control and customization
- ✅ Have ops capacity
- ✅ Budget-conscious
**Choose Datadog if**:
- ✅ Want ease of use
- ✅ Need full observability now
- ✅ Budget allows ($8k+/month for 100 hosts)
**Choose Grafana Stack (LGTM) if**:
- ✅ Want open source full stack
- ✅ Cost-effective solution
- ✅ Cloud-native architecture
**Choose ELK Stack if**:
- ✅ Heavy log analysis needs
- ✅ Need powerful search
- ✅ Have dedicated ops team
**Choose Cloud Native (CloudWatch/etc) if**:
- ✅ Single cloud provider
- ✅ Simple needs
- ✅ Want minimal setup
### Cost Comparison (100 hosts, 1TB logs/month)
| Solution | Monthly Cost | Setup | Ops Burden |
|----------|-------------|--------|------------|
| Prometheus + Loki + Tempo | $1,500 | Medium | Medium |
| Grafana Cloud | $3,000 | Low | Low |
| Datadog | $8,000 | Low | None |
| ELK Stack | $4,000 | High | High |
| CloudWatch | $2,000 | Low | Low |
### Deep Dive: Tool Comparison
For comprehensive tool comparison including:
- Metrics platforms (Prometheus, Datadog, New Relic, CloudWatch, Grafana Cloud)
- Logging platforms (ELK, Loki, Splunk, CloudWatch Logs, Sumo Logic)
- Tracing platforms (Jaeger, Tempo, Datadog APM, X-Ray)
- Full-stack observability comparison
- Recommendations by company size
**→ Read**: [references/tool_comparison.md](references/tool_comparison.md)
---
## 9. Troubleshooting & Analysis
### Health Check Validation
Validate health check endpoints against best practices:
```bash
# Check single endpoint
python3 scripts/health_check_validator.py https://api.example.com/health
# Check multiple endpoints
python3 scripts/health_check_validator.py \
https://api.example.com/health \
https://api.example.com/readiness \
--verbose
```
**Checks for**:
- ✓ Returns 200 status code
- ✓ Response time < 1 second
- ✓ Returns JSON format
- ✓ Contains 'status' field
- ✓ Includes version/build info
- ✓ Checks dependencies
- ✓ Disables caching
**→ Script**: [scripts/health_check_validator.py](scripts/health_check_validator.py)
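A minimal sketch of a few of these checks using the `requests` library (an assumed dependency; the bundled validator script is more thorough):
```python
import requests  # assumed dependency

def quick_health_check(url: str) -> list[str]:
    """Run a handful of the checks above against a health endpoint; return failures."""
    problems = []
    resp = requests.get(url, timeout=5)
    if resp.status_code != 200:
        problems.append(f"status code {resp.status_code}, expected 200")
    if resp.elapsed.total_seconds() >= 1.0:
        problems.append(f"slow response: {resp.elapsed.total_seconds():.2f}s")
    try:
        if "status" not in resp.json():
            problems.append("JSON body has no 'status' field")
    except ValueError:
        problems.append("response body is not JSON")
    cache_control = resp.headers.get("Cache-Control", "")
    if "no-store" not in cache_control and "no-cache" not in cache_control:
        problems.append("caching not disabled (no Cache-Control: no-store/no-cache)")
    return problems

print(quick_health_check("https://api.example.com/health"))
```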
### Common Troubleshooting Workflows
**High Latency Investigation**:
1. Check dashboards for latency spike
2. Query traces for slow operations
3. Check database slow query log
4. Check external API response times
5. Review recent deployments
6. Check resource utilization (CPU, memory)
**High Error Rate Investigation**:
1. Check error logs for patterns
2. Identify affected endpoints
3. Check dependency health
4. Review recent deployments
5. Check resource limits
6. Verify configuration
**Service Down Investigation**:
1. Check if pods/instances are running
2. Check health check endpoint
3. Review recent deployments
4. Check resource availability
5. Check network connectivity
6. Review logs for startup errors
---
## Quick Reference Commands
### Prometheus Queries
```promql
# Request rate
sum(rate(http_requests_total[5m]))
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# P95 latency
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# CPU usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
```
### Kubernetes Commands
```bash
# Check pod status
kubectl get pods -n <namespace>
# View pod logs
kubectl logs -f <pod-name> -n <namespace>
# Check pod resources
kubectl top pods -n <namespace>
# Describe pod for events
kubectl describe pod <pod-name> -n <namespace>
# Check recent deployments
kubectl rollout history deployment/<name> -n <namespace>
```
### Log Queries
**Elasticsearch**:
```json
GET /logs-*/_search
{
"query": {
"bool": {
"must": [
{ "match": { "level": "error" } },
{ "range": { "@timestamp": { "gte": "now-1h" } } }
]
}
}
}
```
**Loki (LogQL)**:
```logql
{job="app", level="error"} |= "error" | json
```
**CloudWatch Insights**:
```
fields @timestamp, level, message
| filter level = "error"
| filter @timestamp > ago(1h)
```
---
## Resources Summary
### Scripts (automation and analysis)
- `analyze_metrics.py` - Detect anomalies in Prometheus/CloudWatch metrics
- `alert_quality_checker.py` - Audit alert rules against best practices
- `slo_calculator.py` - Calculate SLO compliance and error budgets
- `log_analyzer.py` - Parse logs for errors and patterns
- `dashboard_generator.py` - Generate Grafana dashboards from templates
- `health_check_validator.py` - Validate health check endpoints
- `datadog_cost_analyzer.py` - Analyze Datadog usage and find cost waste
### References (deep-dive documentation)
- `metrics_design.md` - Four Golden Signals, RED/USE methods, metric types
- `alerting_best_practices.md` - Alert design, runbooks, on-call practices
- `logging_guide.md` - Structured logging, aggregation patterns
- `tracing_guide.md` - OpenTelemetry, distributed tracing
- `slo_sla_guide.md` - SLI/SLO/SLA definitions, error budgets
- `tool_comparison.md` - Comprehensive comparison of monitoring tools
- `datadog_migration.md` - Complete guide for migrating from Datadog to OSS stack
- `dql_promql_translation.md` - Datadog Query Language to PromQL translation reference
### Templates (ready-to-use configurations)
- `prometheus-alerts/webapp-alerts.yml` - Production-ready web app alerts
- `prometheus-alerts/kubernetes-alerts.yml` - Kubernetes monitoring alerts
- `otel-config/collector-config.yaml` - OpenTelemetry Collector configuration
- `runbooks/incident-runbook-template.md` - Incident response template
---
## Best Practices
### Metrics
- Start with Four Golden Signals
- Use appropriate metric types (counter, gauge, histogram)
- Keep cardinality low (avoid high-cardinality labels)
- Follow naming conventions
### Logging
- Use structured logging (JSON)
- Include request IDs for tracing
- Set appropriate log levels
- Redact PII before logging
### Alerting
- Make every alert actionable
- Alert on symptoms, not causes
- Use multi-window burn rate alerts
- Include runbook links
### Tracing
- Sample appropriately (1-10% in production)
- Always record errors
- Use semantic conventions
- Propagate context between services
### SLOs
- Start with current performance
- Set realistic targets
- Define error budget policies
- Review and adjust quarterly


@@ -0,0 +1,227 @@
# OpenTelemetry Collector Configuration
# Receives metrics, logs, and traces and exports to various backends
receivers:
# OTLP receiver (standard OpenTelemetry protocol)
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
# Prometheus receiver (scrape Prometheus endpoints)
prometheus:
config:
scrape_configs:
- job_name: 'otel-collector'
scrape_interval: 30s
static_configs:
- targets: ['localhost:8888']
# Host metrics (CPU, memory, disk, network)
hostmetrics:
collection_interval: 30s
scrapers:
cpu:
memory:
disk:
network:
filesystem:
load:
# Kubernetes receiver (cluster metrics)
k8s_cluster:
auth_type: serviceAccount
node_conditions_to_report: [Ready, MemoryPressure, DiskPressure]
distribution: kubernetes
# Zipkin receiver (legacy tracing)
zipkin:
endpoint: 0.0.0.0:9411
processors:
# Batch processor (improves performance)
batch:
timeout: 10s
send_batch_size: 1024
send_batch_max_size: 2048
# Memory limiter (prevent OOM)
memory_limiter:
check_interval: 1s
limit_mib: 512
spike_limit_mib: 128
# Resource processor (add resource attributes)
resource:
attributes:
- key: environment
value: production
action: insert
- key: cluster.name
value: prod-cluster
action: insert
# Attributes processor (modify span/metric attributes)
attributes:
actions:
- key: http.url
action: delete # Remove potentially sensitive URLs
- key: db.statement
action: hash # Hash SQL queries for privacy
# Filter processor (drop unwanted data)
filter:
metrics:
# Drop metrics matching criteria
exclude:
match_type: regexp
metric_names:
- ^go_.* # Drop Go runtime metrics
- ^process_.* # Drop process metrics
# Tail sampling (intelligent trace sampling)
tail_sampling:
decision_wait: 10s
num_traces: 100
policies:
# Always sample errors
- name: error-policy
type: status_code
status_code:
status_codes: [ERROR]
# Sample slow traces
- name: latency-policy
type: latency
latency:
threshold_ms: 1000
# Sample 10% of others
- name: probabilistic-policy
type: probabilistic
probabilistic:
sampling_percentage: 10
# Span processor (modify spans)
span:
name:
to_attributes:
rules:
- ^\/api\/v1\/users\/(?P<user_id>.*)$
from_attributes:
- db.name
- http.method
exporters:
# Prometheus exporter (expose metrics endpoint)
prometheus:
endpoint: 0.0.0.0:8889
namespace: otel
# OTLP exporters (send to backends)
otlp/tempo:
endpoint: tempo:4317
tls:
insecure: true
otlp/mimir:
endpoint: mimir:4317
tls:
insecure: true
# Loki exporter (for logs)
loki:
endpoint: http://loki:3100/loki/api/v1/push
labels:
resource:
service.name: "service_name"
service.namespace: "service_namespace"
attributes:
level: "level"
# Jaeger exporter (alternative tracing backend)
jaeger:
endpoint: jaeger:14250
tls:
insecure: true
# Elasticsearch exporter (for logs)
elasticsearch:
endpoints:
- http://elasticsearch:9200
logs_index: otel-logs
traces_index: otel-traces
# CloudWatch exporter (AWS)
awscloudwatch:
region: us-east-1
namespace: MyApp
log_group_name: /aws/otel/logs
log_stream_name: otel-collector
# Datadog exporter
datadog:
api:
key: ${DD_API_KEY}
site: datadoghq.com
# File exporter (debugging)
file:
path: /tmp/otel-output.json
# Logging exporter (console output for debugging)
logging:
verbosity: detailed
sampling_initial: 5
sampling_thereafter: 200
extensions:
# Health check endpoint
health_check:
endpoint: 0.0.0.0:13133
# Pprof endpoint (for profiling)
pprof:
endpoint: 0.0.0.0:1777
# ZPages (internal diagnostics)
zpages:
endpoint: 0.0.0.0:55679
service:
extensions: [health_check, pprof, zpages]
pipelines:
# Traces pipeline
traces:
receivers: [otlp, zipkin]
processors: [memory_limiter, batch, tail_sampling, resource, span]
exporters: [otlp/tempo, jaeger, logging]
# Metrics pipeline
metrics:
receivers: [otlp, prometheus, hostmetrics, k8s_cluster]
processors: [memory_limiter, batch, filter, resource]
exporters: [otlp/mimir, prometheus, awscloudwatch]
# Logs pipeline
logs:
receivers: [otlp]
processors: [memory_limiter, batch, resource, attributes]
exporters: [loki, elasticsearch, awscloudwatch]
# Telemetry (collector's own metrics)
telemetry:
logs:
level: info
metrics:
address: 0.0.0.0:8888
# Notes:
# 1. Replace ${DD_API_KEY} with actual API key or use environment variable
# 2. Adjust endpoints to match your infrastructure
# 3. Comment out exporters you don't use
# 4. Adjust sampling rates based on your volume and needs
# 5. Add TLS configuration for production deployments


@@ -0,0 +1,293 @@
---
# Prometheus Alert Rules for Kubernetes
# Covers pods, nodes, deployments, and resource usage
groups:
- name: kubernetes_pods
interval: 30s
rules:
# Pod crash looping
- alert: PodCrashLooping
expr: |
rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 5m
labels:
severity: warning
team: platform
component: kubernetes
annotations:
summary: "Pod is crash looping - {{ $labels.namespace }}/{{ $labels.pod }}"
description: |
Pod {{ $labels.namespace }}/{{ $labels.pod }} has restarted {{ $value }} times in the last 15 minutes.
Check pod logs:
kubectl logs -n {{ $labels.namespace }} {{ $labels.pod }} --previous
runbook_url: "https://runbooks.example.com/pod-crash-loop"
# Pod not ready
- alert: PodNotReady
expr: |
sum by (namespace, pod) (kube_pod_status_phase{phase!~"Running|Succeeded"}) > 0
for: 10m
labels:
severity: warning
team: platform
component: kubernetes
annotations:
summary: "Pod not ready - {{ $labels.namespace }}/{{ $labels.pod }}"
description: |
Pod {{ $labels.namespace }}/{{ $labels.pod }} is in {{ $labels.phase }} state for 10 minutes.
Investigate:
kubectl describe pod -n {{ $labels.namespace }} {{ $labels.pod }}
runbook_url: "https://runbooks.example.com/pod-not-ready"
# Pod OOMKilled
- alert: PodOOMKilled
expr: |
sum by (namespace, pod) (kube_pod_container_status_terminated_reason{reason="OOMKilled"}) > 0
for: 1m
labels:
severity: warning
team: platform
component: kubernetes
annotations:
summary: "Pod killed due to OOM - {{ $labels.namespace }}/{{ $labels.pod }}"
description: |
Pod {{ $labels.namespace }}/{{ $labels.pod }} was killed due to out-of-memory.
Increase memory limits or investigate memory leak.
runbook_url: "https://runbooks.example.com/oom-killed"
- name: kubernetes_deployments
interval: 30s
rules:
# Deployment replica mismatch
- alert: DeploymentReplicasMismatch
expr: |
kube_deployment_spec_replicas != kube_deployment_status_replicas_available
for: 15m
labels:
severity: warning
team: platform
component: kubernetes
annotations:
summary: "Deployment replicas mismatch - {{ $labels.namespace }}/{{ $labels.deployment }}"
description: |
Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has been running with
fewer replicas than desired for 15 minutes.
Desired: {{ $value }}
Available: Check deployment status
runbook_url: "https://runbooks.example.com/replica-mismatch"
# Deployment rollout stuck
- alert: DeploymentRolloutStuck
expr: |
kube_deployment_status_condition{condition="Progressing", status="false"} > 0
for: 15m
labels:
severity: warning
team: platform
component: kubernetes
annotations:
summary: "Deployment rollout stuck - {{ $labels.namespace }}/{{ $labels.deployment }}"
description: |
Deployment {{ $labels.namespace }}/{{ $labels.deployment }} rollout is stuck.
Check rollout status:
kubectl rollout status deployment/{{ $labels.deployment }} -n {{ $labels.namespace }}
runbook_url: "https://runbooks.example.com/rollout-stuck"
- name: kubernetes_nodes
interval: 30s
rules:
# Node not ready
- alert: NodeNotReady
expr: |
kube_node_status_condition{condition="Ready",status="true"} == 0
for: 5m
labels:
severity: critical
team: platform
component: kubernetes
annotations:
summary: "Node not ready - {{ $labels.node }}"
description: |
Node {{ $labels.node }} has been NotReady for 5 minutes.
This will affect pod scheduling and availability.
Check node status:
kubectl describe node {{ $labels.node }}
runbook_url: "https://runbooks.example.com/node-not-ready"
# Node memory pressure
- alert: NodeMemoryPressure
expr: |
kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
for: 5m
labels:
severity: warning
team: platform
component: kubernetes
annotations:
summary: "Node under memory pressure - {{ $labels.node }}"
description: |
Node {{ $labels.node }} is experiencing memory pressure.
Pods may be evicted. Consider scaling up or evicting low-priority pods.
runbook_url: "https://runbooks.example.com/memory-pressure"
# Node disk pressure
- alert: NodeDiskPressure
expr: |
kube_node_status_condition{condition="DiskPressure",status="true"} == 1
for: 5m
labels:
severity: warning
team: platform
component: kubernetes
annotations:
summary: "Node under disk pressure - {{ $labels.node }}"
description: |
Node {{ $labels.node }} is experiencing disk pressure.
Clean up disk space or add capacity.
runbook_url: "https://runbooks.example.com/disk-pressure"
# Node high CPU
- alert: NodeHighCPU
expr: |
(1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)) * 100 > 80
for: 15m
labels:
severity: warning
team: platform
component: kubernetes
annotations:
summary: "Node high CPU usage - {{ $labels.instance }}"
description: |
Node {{ $labels.instance }} CPU usage is {{ $value | humanize }}%.
Check for resource-intensive pods or scale cluster.
runbook_url: "https://runbooks.example.com/node-high-cpu"
- name: kubernetes_resources
interval: 30s
rules:
# Container CPU throttling
- alert: ContainerCPUThrottling
expr: |
rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.5
for: 10m
labels:
severity: warning
team: platform
component: kubernetes
annotations:
summary: "Container CPU throttling - {{ $labels.namespace }}/{{ $labels.pod }}"
description: |
Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }}
is being CPU throttled.
CPU throttling rate: {{ $value | humanize }}
Consider increasing CPU limits.
runbook_url: "https://runbooks.example.com/cpu-throttling"
# Container memory usage high
- alert: ContainerMemoryUsageHigh
expr: |
(container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.9
for: 10m
labels:
severity: warning
team: platform
component: kubernetes
annotations:
summary: "Container memory usage high - {{ $labels.namespace }}/{{ $labels.pod }}"
description: |
Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }}
is using {{ $value | humanizePercentage }} of its memory limit.
Risk of OOMKill. Consider increasing memory limits.
runbook_url: "https://runbooks.example.com/high-memory"
- name: kubernetes_pv
interval: 30s
rules:
# PersistentVolume nearing full
- alert: PersistentVolumeFillingUp
expr: |
(kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) < 0.15
for: 10m
labels:
severity: warning
team: platform
component: kubernetes
annotations:
summary: "PersistentVolume filling up - {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }}"
description: |
PersistentVolume {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }}
is {{ $value | humanizePercentage }} full.
Available space is running low. Consider expanding volume.
runbook_url: "https://runbooks.example.com/pv-filling-up"
# PersistentVolume critically full
- alert: PersistentVolumeCriticallyFull
expr: |
(kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) < 0.05
for: 5m
labels:
severity: critical
team: platform
component: kubernetes
annotations:
summary: "PersistentVolume critically full - {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }}"
description: |
PersistentVolume {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }}
is {{ $value | humanizePercentage }} full.
Immediate action required to prevent application failures.
runbook_url: "https://runbooks.example.com/pv-critically-full"
- name: kubernetes_jobs
interval: 30s
rules:
# Job failed
- alert: JobFailed
expr: |
kube_job_status_failed > 0
for: 5m
labels:
severity: warning
team: platform
component: kubernetes
annotations:
summary: "Job failed - {{ $labels.namespace }}/{{ $labels.job_name }}"
description: |
Job {{ $labels.namespace }}/{{ $labels.job_name }} has failed.
Check job logs:
kubectl logs job/{{ $labels.job_name }} -n {{ $labels.namespace }}
runbook_url: "https://runbooks.example.com/job-failed"
# CronJob not running
- alert: CronJobNotRunning
expr: |
time() - kube_cronjob_status_last_schedule_time > 3600
for: 10m
labels:
severity: warning
team: platform
component: kubernetes
annotations:
summary: "CronJob not running - {{ $labels.namespace }}/{{ $labels.cronjob }}"
description: |
CronJob {{ $labels.namespace}}/{{ $labels.cronjob }} hasn't run in over an hour.
Check CronJob status:
kubectl describe cronjob {{ $labels.cronjob }} -n {{ $labels.namespace }}
runbook_url: "https://runbooks.example.com/cronjob-not-running"


@@ -0,0 +1,243 @@
---
# Prometheus Alert Rules for Web Applications
# Based on SLO best practices and multi-window burn rate alerting
groups:
- name: webapp_availability
interval: 30s
rules:
# Fast burn rate alert (1h window) - SLO: 99.9%
- alert: ErrorBudgetFastBurn
expr: |
(
sum(rate(http_requests_total{job="webapp",status=~"5.."}[1h]))
/
sum(rate(http_requests_total{job="webapp"}[1h]))
) > (14.4 * 0.001)
for: 2m
labels:
severity: critical
team: backend
component: webapp
annotations:
summary: "Fast error budget burn - {{ $labels.job }}"
description: |
Error rate is {{ $value | humanizePercentage }} over the last hour,
burning through error budget at 14.4x rate.
At this rate, the monthly error budget will be exhausted in 2 days.
Immediate investigation required.
runbook_url: "https://runbooks.example.com/error-budget-burn"
dashboard: "https://grafana.example.com/d/webapp"
# Slow burn rate alert (6h window)
- alert: ErrorBudgetSlowBurn
expr: |
(
sum(rate(http_requests_total{job="webapp",status=~"5.."}[6h]))
/
sum(rate(http_requests_total{job="webapp"}[6h]))
) > (6 * 0.001)
for: 30m
labels:
severity: warning
team: backend
component: webapp
annotations:
summary: "Elevated error budget burn - {{ $labels.job }}"
description: |
Error rate is {{ $value | humanizePercentage }} over the last 6 hours,
burning through error budget at 6x rate.
Monitor closely and investigate if trend continues.
runbook_url: "https://runbooks.example.com/error-budget-burn"
# Service down alert
- alert: WebAppDown
expr: up{job="webapp"} == 0
for: 2m
labels:
severity: critical
team: backend
component: webapp
annotations:
summary: "Web application is down - {{ $labels.instance }}"
description: |
Web application instance {{ $labels.instance }} has been down for 2 minutes.
Check service health and logs immediately.
runbook_url: "https://runbooks.example.com/service-down"
- name: webapp_latency
interval: 30s
rules:
# High latency (p95)
- alert: HighLatencyP95
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{job="webapp"}[5m])) by (le)
) > 0.5
for: 10m
labels:
severity: warning
team: backend
component: webapp
annotations:
summary: "High p95 latency - {{ $labels.job }}"
description: |
P95 request latency is {{ $value }}s, exceeding 500ms threshold.
This may impact user experience. Check for:
- Slow database queries
- External API issues
- Resource saturation
runbook_url: "https://runbooks.example.com/high-latency"
dashboard: "https://grafana.example.com/d/webapp-latency"
# Very high latency (p99)
- alert: HighLatencyP99
expr: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{job="webapp"}[5m])) by (le)
) > 2
for: 5m
labels:
severity: critical
team: backend
component: webapp
annotations:
summary: "Critical latency degradation - {{ $labels.job }}"
description: |
P99 request latency is {{ $value }}s, exceeding 2s threshold.
Severe performance degradation detected.
runbook_url: "https://runbooks.example.com/high-latency"
- name: webapp_resources
interval: 30s
rules:
# High CPU
- alert: HighCPU
expr: |
rate(process_cpu_seconds_total{job="webapp"}[5m]) * 100 > 80
for: 15m
labels:
severity: warning
team: backend
component: webapp
annotations:
summary: "High CPU usage - {{ $labels.instance }}"
description: |
CPU usage is {{ $value | humanize }}% on {{ $labels.instance }}.
Consider scaling up or investigating CPU-intensive operations.
runbook_url: "https://runbooks.example.com/high-cpu"
# High memory
- alert: HighMemory
expr: |
(process_resident_memory_bytes{job="webapp"} / node_memory_MemTotal_bytes) * 100 > 80
for: 15m
labels:
severity: warning
team: backend
component: webapp
annotations:
summary: "High memory usage - {{ $labels.instance }}"
description: |
Memory usage is {{ $value | humanize }}% on {{ $labels.instance }}.
Check for memory leaks or consider scaling up.
runbook_url: "https://runbooks.example.com/high-memory"
- name: webapp_traffic
interval: 30s
rules:
# Traffic spike
- alert: TrafficSpike
expr: |
sum(rate(http_requests_total{job="webapp"}[5m]))
>
1.5 * sum(rate(http_requests_total{job="webapp"}[5m] offset 1h))
for: 10m
labels:
severity: warning
team: backend
component: webapp
annotations:
summary: "Traffic spike detected - {{ $labels.job }}"
description: |
Request rate increased by 50% compared to 1 hour ago.
Current: {{ $value | humanize }} req/s
This could be:
- Legitimate traffic increase
- DDoS attack
- Retry storm
Monitor closely and be ready to scale.
runbook_url: "https://runbooks.example.com/traffic-spike"
# Traffic drop (potential issue)
- alert: TrafficDrop
expr: |
sum(rate(http_requests_total{job="webapp"}[5m]))
<
0.5 * sum(rate(http_requests_total{job="webapp"}[5m] offset 1h))
for: 10m
labels:
severity: warning
team: backend
component: webapp
annotations:
summary: "Traffic drop detected - {{ $labels.job }}"
description: |
Request rate dropped by 50% compared to 1 hour ago.
This could indicate:
- Upstream service issue
- DNS problems
- Load balancer misconfiguration
runbook_url: "https://runbooks.example.com/traffic-drop"
- name: webapp_dependencies
interval: 30s
rules:
# Database connection pool exhaustion
- alert: DatabasePoolExhausted
expr: |
(db_connection_pool_active / db_connection_pool_max) > 0.9
for: 5m
labels:
severity: critical
team: backend
component: database
annotations:
summary: "Database connection pool near exhaustion"
description: |
Connection pool is {{ $value | humanizePercentage }} full.
This will cause request failures. Immediate action required.
runbook_url: "https://runbooks.example.com/db-pool-exhausted"
# External API errors
- alert: ExternalAPIErrors
expr: |
sum(rate(external_api_requests_total{status=~"5.."}[5m])) by (api)
/
sum(rate(external_api_requests_total[5m])) by (api)
> 0.1
for: 5m
labels:
severity: warning
team: backend
component: integration
annotations:
summary: "High error rate from external API - {{ $labels.api }}"
description: |
{{ $labels.api }} is returning errors at {{ $value | humanizePercentage }} rate.
Check API status page and consider enabling circuit breaker.
runbook_url: "https://runbooks.example.com/external-api-errors"


@@ -0,0 +1,409 @@
# Runbook: [Alert Name]
## Overview
**Alert Name**: [e.g., HighLatency, ServiceDown, ErrorBudgetBurn]
**Severity**: [Critical | Warning | Info]
**Team**: [e.g., Backend, Platform, Database]
**Component**: [e.g., API Gateway, User Service, PostgreSQL]
**What it means**: [One-line description of what this alert indicates]
**User impact**: [How does this affect users? High/Medium/Low]
**Urgency**: [How quickly must this be addressed? Immediate/Hours/Days]
---
## Alert Details
### When This Alert Fires
This alert fires when:
- [Specific condition, e.g., "P95 latency exceeds 500ms for 10 minutes"]
- [Any additional conditions]
### Symptoms
Users will experience:
- [ ] Slow response times
- [ ] Errors or failures
- [ ] Service unavailable
- [ ] [Other symptoms]
### Probable Causes
Common causes include:
1. **[Cause 1]**: [Description]
- Example: Database overload due to slow queries
2. **[Cause 2]**: [Description]
- Example: Memory leak causing OOM errors
3. **[Cause 3]**: [Description]
- Example: Upstream service degradation
---
## Investigation Steps
### 1. Check Service Health
**Dashboard**: [Link to primary dashboard]
**Key metrics to check**:
```bash
# Request rate
sum(rate(http_requests_total[5m]))
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Latency (p95, p99)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
**What to look for**:
- [ ] Has traffic spiked recently?
- [ ] Is error rate elevated?
- [ ] Are any endpoints particularly slow?
### 2. Check Recent Changes
**Deployments**:
```bash
# Kubernetes
kubectl rollout history deployment/[service-name] -n [namespace]
# Check when last deployed
kubectl get pods -n [namespace] -o wide | grep [service-name]
```
**What to look for**:
- [ ] Was there a recent deployment?
- [ ] Did alert start after deployment?
- [ ] Any configuration changes?
### 3. Check Logs
**Log query** (adjust for your log system):
```bash
# Kubernetes
kubectl logs deployment/[service-name] -n [namespace] --tail=100 | grep ERROR
# Elasticsearch/Kibana
GET /logs-*/_search
{
"query": {
"bool": {
"must": [
{ "match": { "service": "[service-name]" } },
{ "match": { "level": "error" } },
{ "range": { "@timestamp": { "gte": "now-30m" } } }
]
}
}
}
# Loki/LogQL
{job="[service-name]"} |= "error" | json | level="error"
```
**What to look for**:
- [ ] Repeated error messages
- [ ] Stack traces
- [ ] Connection errors
- [ ] Timeout errors
### 4. Check Dependencies
**Database**:
```bash
# Check active connections
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
# Check slow queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 seconds';
```
**External APIs**:
- [ ] Check status pages: [Link to status pages]
- [ ] Check API error rates in dashboard
- [ ] Test API endpoints manually
**Cache** (Redis/Memcached):
```bash
# Redis info
redis-cli -h [host] INFO stats
# Check memory usage
redis-cli -h [host] INFO memory
```
### 5. Check Resource Usage
**CPU and Memory**:
```bash
# Kubernetes
kubectl top pods -n [namespace] | grep [service-name]
# Node metrics
kubectl top nodes
```
**Prometheus queries**:
```promql
# CPU usage by pod
sum(rate(container_cpu_usage_seconds_total{pod=~"[service-name].*"}[5m])) by (pod)
# Memory usage by pod
sum(container_memory_usage_bytes{pod=~"[service-name].*"}) by (pod)
```
**What to look for**:
- [ ] CPU throttling
- [ ] Memory approaching limits
- [ ] Disk space issues
### 6. Check Traces (if available)
**Trace query**:
```bash
# Jaeger
# Search for slow traces (> 1s) in last 30 minutes
# Tempo/TraceQL
{ duration > 1s && resource.service.name = "[service-name]" }
```
**What to look for**:
- [ ] Which operation is slow?
- [ ] Where is time spent? (DB, external API, service logic)
- [ ] Any N+1 query patterns?
---
## Common Scenarios and Solutions
### Scenario 1: Recent Deployment Caused Issue
**Symptoms**:
- Alert started immediately after deployment
- Error logs correlate with new code
**Solution**:
```bash
# Rollback deployment
kubectl rollout undo deployment/[service-name] -n [namespace]
# Verify rollback succeeded
kubectl rollout status deployment/[service-name] -n [namespace]
# Monitor for alert resolution
```
**Follow-up**:
- [ ] Create incident report
- [ ] Review deployment process
- [ ] Add pre-deployment checks
### Scenario 2: Database Performance Issue
**Symptoms**:
- Slow query logs show problematic queries
- Database CPU or connection pool exhausted
**Solution**:
```bash
# Identify slow query
# Kill long-running query (use with caution)
SELECT pg_cancel_backend([pid]);
# Or terminate if cancel doesn't work
SELECT pg_terminate_backend([pid]);
# Add index if missing (in maintenance window)
CREATE INDEX CONCURRENTLY idx_name ON table_name (column_name);
```
**Follow-up**:
- [ ] Add query performance test
- [ ] Review and optimize query
- [ ] Consider read replicas
### Scenario 3: Memory Leak
**Symptoms**:
- Memory usage gradually increasing
- Eventually OOMKilled
- Restarts temporarily fix issue
**Solution**:
```bash
# Immediate: Restart pods
kubectl rollout restart deployment/[service-name] -n [namespace]
# Increase memory limits (temporary)
kubectl set resources deployment/[service-name] -n [namespace] \
--limits=memory=2Gi
```
**Follow-up**:
- [ ] Profile application for memory leaks
- [ ] Add memory usage alerts
- [ ] Fix root cause
### Scenario 4: Traffic Spike / DDoS
**Symptoms**:
- Sudden traffic increase
- Traffic from unusual sources
- High CPU/memory across all instances
**Solution**:
```bash
# Scale up immediately
kubectl scale deployment/[service-name] -n [namespace] --replicas=10
# Enable rate limiting at load balancer level
# (Specific steps depend on LB)
# Block suspicious IPs if confirmed DDoS
# (Use WAF or network policies)
```
**Follow-up**:
- [ ] Implement rate limiting
- [ ] Add DDoS protection (CloudFlare, WAF)
- [ ] Set up auto-scaling
### Scenario 5: Upstream Service Degradation
**Symptoms**:
- Errors calling external API
- Timeouts to upstream service
- Upstream status page shows issues
**Solution**:
```bash
# Enable circuit breaker (if available)
# Adjust timeout configuration
# Switch to backup service/cached data
# Monitor external service
# Check status page: [Link]
```
**Follow-up**:
- [ ] Implement circuit breaker pattern
- [ ] Add fallback mechanisms
- [ ] Set up external service monitoring
---
## Immediate Actions (< 5 minutes)
These should be done first to mitigate impact:
1. **[Action 1]**: [e.g., "Scale up service"]
```bash
kubectl scale deployment/[service] --replicas=10
```
2. **[Action 2]**: [e.g., "Rollback deployment"]
```bash
kubectl rollout undo deployment/[service]
```
3. **[Action 3]**: [e.g., "Enable circuit breaker"]
---
## Short-term Actions (< 30 minutes)
After immediate mitigation:
1. **[Action 1]**: [e.g., "Investigate root cause"]
2. **[Action 2]**: [e.g., "Optimize slow query"]
3. **[Action 3]**: [e.g., "Clear cache if stale"]
---
## Long-term Actions (Post-Incident)
Preventive measures:
1. **[Action 1]**: [e.g., "Add circuit breaker"]
2. **[Action 2]**: [e.g., "Implement auto-scaling"]
3. **[Action 3]**: [e.g., "Add query performance tests"]
4. **[Action 4]**: [e.g., "Update alert thresholds"]
---
## Escalation
If issue persists after 30 minutes:
**Escalation Path**:
1. **Primary oncall**: @[username] ([slack/email])
2. **Team lead**: @[username] ([slack/email])
3. **Engineering manager**: @[username] ([slack/email])
4. **Incident commander**: @[username] ([slack/email])
**Communication**:
- **Slack channel**: #[incidents-channel]
- **Status page**: [Link]
- **Incident tracking**: [Link to incident management tool]
---
## Related Runbooks
- [Related Runbook 1]
- [Related Runbook 2]
- [Related Runbook 3]
## Related Dashboards
- [Main Service Dashboard]
- [Resource Usage Dashboard]
- [Dependency Dashboard]
## Related Documentation
- [Architecture Diagram]
- [Service Documentation]
- [API Documentation]
---
## Recent Incidents
| Date | Duration | Root Cause | Resolution | Ticket |
|------|----------|------------|------------|--------|
| 2024-10-15 | 23 min | Database pool exhausted | Increased pool size | INC-123 |
| 2024-09-30 | 45 min | Memory leak | Fixed code, restarted | INC-120 |
---
## Runbook Metadata
**Last Updated**: [Date]
**Owner**: [Team name]
**Reviewers**: [Names]
**Next Review**: [Date]
---
## Notes
- This runbook should be reviewed quarterly
- Update after each incident to capture new learnings
- Keep investigation steps concise and actionable
- Include actual commands that can be copy-pasted

Binary file not shown.

125
plugin.lock.json Normal file

@@ -0,0 +1,125 @@
{
"$schema": "internal://schemas/plugin.lock.v1.json",
"pluginId": "gh:ahmedasmar/devops-claude-skills:monitoring-observability",
"normalized": {
"repo": null,
"ref": "refs/tags/v20251128.0",
"commit": "9bb89b1ce889c2df6d7c3c2eedbd6d1301297561",
"treeHash": "9fd50a78a79b6d45553e3372bc2d5142f4c48ba4a945ca724356f89f9ce08825",
"generatedAt": "2025-11-28T10:13:03.403599Z",
"toolVersion": "publish_plugins.py@0.2.0"
},
"origin": {
"remote": "git@github.com:zhongweili/42plugin-data.git",
"branch": "master",
"commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
"repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
},
"manifest": {
"name": "monitoring-observability",
"description": "Monitoring and observability strategy, metrics/logs/traces systems, SLOs/error budgets, Prometheus/Grafana/Loki, OpenTelemetry, and tool comparison",
"version": null
},
"content": {
"files": [
{
"path": "README.md",
"sha256": "b18b6358cf31ab285b751916a5b2c670b5bc2c8748ef17216f2c9106e4997f8e"
},
{
"path": "SKILL.md",
"sha256": "c02fcac42ed2d4d6fcda67a9f835000b1a1198734e4d8d18000546dda81402e4"
},
{
"path": "monitoring-observability.skill",
"sha256": "c2c368577bb73885c887cc824b695fb3d36f4a77e74b2e25dcd7815c331a71c1"
},
{
"path": "references/alerting_best_practices.md",
"sha256": "99cea7a40310b77a4fdff5543a0b1ee44189497508757bee0dc9ebbe11794a53"
},
{
"path": "references/metrics_design.md",
"sha256": "6edc73473e9d3c2ac7e46a4d97576d356d177ed701a2468c5e21d528ff9c29d7"
},
{
"path": "references/tracing_guide.md",
"sha256": "5e419d77a31d8b3ee5c16fb57e1fc6e3e16d31efb8f4a86dd756c7327a482fa0"
},
{
"path": "references/dql_promql_translation.md",
"sha256": "47113e77b03d9ac70fc35121efd93cf5e17e031b878d27791403493b71058c5c"
},
{
"path": "references/tool_comparison.md",
"sha256": "fd0fc7e4fc3641ca0ddc469a14fa1373457f5a4586fe4bc7ec23afe3de9f6171"
},
{
"path": "references/datadog_migration.md",
"sha256": "9ed5e276eb2ea67f72c91e1bb53374b293e164fa28c4c44f31ee9f8660dfaf02"
},
{
"path": "references/logging_guide.md",
"sha256": "2c94b61d6db2c0f6b8927c8092010f3a2f1ea20d2eefd330d8073e7b4bcf4c9d"
},
{
"path": "references/slo_sla_guide.md",
"sha256": "2a0cb69dd120897183f7bcab002a368dbe11bd5038817906da3391ca168e0052"
},
{
"path": "scripts/log_analyzer.py",
"sha256": "c7fb7e13c2d6507c81ee9575fc8514408d36b2f2e786caeb536ba927d517046e"
},
{
"path": "scripts/analyze_metrics.py",
"sha256": "50ad856cb043dfd70b60c6ca685b526d34b8bc5e5454dd0b530033da3da22545"
},
{
"path": "scripts/health_check_validator.py",
"sha256": "cef8c447fabf83dfd9bd28a8d22127b87b66aafa4d151cbccd9fe1f1db0bbcf2"
},
{
"path": "scripts/alert_quality_checker.py",
"sha256": "b561cf9c41e2de8d5f09557c018110553047d0ad54629bdc7a07a654d76263d1"
},
{
"path": "scripts/datadog_cost_analyzer.py",
"sha256": "05a1c6c0033b04f2f5206af015907f2df4c9cf57f4c2b8f10ba2565236a5c97f"
},
{
"path": "scripts/slo_calculator.py",
"sha256": "c26ab0f0a31e5efa830a9f24938ec356bfaef927438bd47b95f4ad0015cff662"
},
{
"path": "scripts/dashboard_generator.py",
"sha256": "6fe98a49ae431d67bc44eb631c542ba29199da72cc348e90ec99d73a05783ee5"
},
{
"path": ".claude-plugin/plugin.json",
"sha256": "7b6a16e6bce66bf87929c2f3c4ea32f4bfadd8d9606edd195f144c82ec85f151"
},
{
"path": "assets/templates/prometheus-alerts/webapp-alerts.yml",
"sha256": "d881081e53650c335ec5cc7d5d96bade03e607e55bff3bcbafe6811377055154"
},
{
"path": "assets/templates/prometheus-alerts/kubernetes-alerts.yml",
"sha256": "cb8c247b245ea1fb2a904f525fce8f74f9237d79eda04c2c60938135a7271415"
},
{
"path": "assets/templates/runbooks/incident-runbook-template.md",
"sha256": "1a5ba8951cf5b1408ea2101232ffe8d88fab75ed4ae63b0c9f1902059373112d"
},
{
"path": "assets/templates/otel-config/collector-config.yaml",
"sha256": "2696548b1c7f4034283cc2387f9730efa4811881d1c9c9219002e7affc8c29f2"
}
],
"dirSha256": "9fd50a78a79b6d45553e3372bc2d5142f4c48ba4a945ca724356f89f9ce08825"
},
"security": {
"scannedAt": null,
"scannerVersion": null,
"flags": []
}
}


@@ -0,0 +1,609 @@
# Alerting Best Practices
## Core Principles
### 1. Every Alert Should Be Actionable
If you can't do something about it, don't alert on it.
❌ Bad: `Alert: CPU > 50%` (What action should be taken?)
✅ Good: `Alert: API latency p95 > 2s for 10m` (Investigate/scale up)
### 2. Alert on Symptoms, Not Causes
Alert on what users experience, not underlying components.
❌ Bad: `Database connection pool 80% full`
✅ Good: `Request latency p95 > 1s` (which might be caused by DB pool)
### 3. Alert on SLO Violations
Tie alerts to Service Level Objectives.
`Error rate exceeds 0.1% (SLO: 99.9% availability)`
### 4. Reduce Noise
Alert fatigue is real. Only page for critical issues.
**Alert Severity Levels**:
- **Critical**: Page on-call immediately (user-facing issue)
- **Warning**: Create ticket, review during business hours
- **Info**: Log for awareness, no action needed
---
## Alert Design Patterns
### Pattern 1: Multi-Window Multi-Burn-Rate
Google's recommended SLO alerting approach.
**Concept**: Alert when error budget burn rate is high enough to exhaust the budget too quickly.
```yaml
# Fast burn (2% of budget in 1 hour)
- alert: FastBurnRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
> (14.4 * 0.001) # 14.4x burn rate for 99.9% SLO
for: 2m
labels:
severity: critical
# Slow burn (5% of budget in 6 hours)
- alert: SlowBurnRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[6h]))
/
sum(rate(http_requests_total[6h]))
> (6 * 0.001) # 6x burn rate for 99.9% SLO
for: 30m
labels:
severity: warning
```
**Burn Rate Multipliers for 99.9% SLO (0.1% error budget)**:
- 1 hour window, 2m grace: 14.4x burn rate
- 6 hour window, 30m grace: 6x burn rate
- 3 day window, 6h grace: 1x burn rate
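The multipliers above fall straight out of the arithmetic: a burn rate is the budget fraction you are willing to spend, divided by the window, scaled to the SLO period. A minimal sketch (assuming a 30-day period, as in the thresholds above):
```python
def burn_rate_threshold(slo, budget_fraction, window_hours, period_hours=30 * 24):
    """Burn rate and error-rate threshold at which `budget_fraction` of the
    error budget is consumed within `window_hours` of a `period_hours` SLO period."""
    error_budget = 1 - slo                                  # 0.001 for a 99.9% SLO
    burn_rate = budget_fraction * period_hours / window_hours
    return burn_rate, burn_rate * error_budget

print(burn_rate_threshold(0.999, 0.02, 1))   # ~ (14.4, 0.0144): fast-burn alert
print(burn_rate_threshold(0.999, 0.05, 6))   # ~ (6.0, 0.006): slow-burn alert
```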
### Pattern 2: Rate of Change
Alert when metrics change rapidly.
```yaml
- alert: TrafficSpike
expr: |
sum(rate(http_requests_total[5m]))
>
1.5 * sum(rate(http_requests_total[5m] offset 1h))
for: 10m
annotations:
summary: "Traffic increased by 50% compared to 1 hour ago"
```
### Pattern 3: Threshold with Hysteresis
Prevent flapping by requiring a higher threshold to start firing than to keep firing. Prometheus has no native hysteresis, so the usual workaround is to let the alert expression reference the `ALERTS` series (a sketch; adjust the `on(...)` matcher to your metric's labels):
```yaml
# Fire above 90%, keep firing until usage drops below 70%
- alert: HighCPU
  expr: |
    cpu_usage > 90
    or
    (cpu_usage > 70 and on(instance) ALERTS{alertname="HighCPU", alertstate="firing"} == 1)
  for: 5m
```
### Pattern 4: Absent Metrics
Alert when expected metrics stop being reported (service down).
```yaml
- alert: ServiceDown
expr: absent(up{job="my-service"})
for: 5m
labels:
severity: critical
annotations:
summary: "Service {{ $labels.job }} is not reporting metrics"
```
### Pattern 5: Aggregate Alerts
Alert on aggregate performance across multiple instances.
```yaml
- alert: HighOverallErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.05
for: 10m
annotations:
summary: "Overall error rate is {{ $value | humanizePercentage }}"
```
---
## Alert Annotation Best Practices
### Required Fields
**summary**: One-line description of the issue
```yaml
summary: "High error rate on {{ $labels.service }}: {{ $value | humanizePercentage }}"
```
**description**: Detailed explanation with context
```yaml
description: |
Error rate on {{ $labels.service }} is {{ $value | humanizePercentage }},
which exceeds the threshold of 1% for more than 10 minutes.
Current value: {{ $value }}
Runbook: https://runbooks.example.com/high-error-rate
```
**runbook_url**: Link to investigation steps
```yaml
runbook_url: "https://runbooks.example.com/alerts/{{ $labels.alertname }}"
```
### Optional but Recommended
**dashboard**: Link to relevant dashboard
```yaml
dashboard: "https://grafana.example.com/d/service-dashboard?var-service={{ $labels.service }}"
```
**logs**: Link to logs
```yaml
logs: "https://kibana.example.com/app/discover#/?_a=(query:(query_string:(query:'service:{{ $labels.service }}')))"
```
---
## Alert Label Best Practices
### Required Labels
**severity**: Critical, warning, or info
```yaml
labels:
severity: critical
```
**team**: Who should handle this alert
```yaml
labels:
team: platform
severity: critical
```
**component**: What part of the system
```yaml
labels:
component: api-gateway
severity: warning
```
### Example Complete Alert
```yaml
- alert: HighLatency
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
) > 1
for: 10m
labels:
severity: warning
team: backend
component: api
environment: "{{ $labels.environment }}"
annotations:
summary: "High latency on {{ $labels.service }}"
description: |
P95 latency on {{ $labels.service }} is {{ $value }}s, exceeding 1s threshold.
This may impact user experience. Check recent deployments and database performance.
Current p95: {{ $value }}s
Threshold: 1s
Duration: 10m+
runbook_url: "https://runbooks.example.com/high-latency"
dashboard: "https://grafana.example.com/d/api-dashboard"
logs: "https://kibana.example.com/app/discover#/?_a=(query:(query_string:(query:'service:{{ $labels.service }} AND level:error')))"
```
---
## Alert Thresholds
### General Guidelines
**Response Time / Latency**:
- Warning: p95 > 500ms or p99 > 1s
- Critical: p95 > 2s or p99 > 5s
**Error Rate**:
- Warning: > 1%
- Critical: > 5%
**Availability**:
- Warning: < 99.9%
- Critical: < 99.5%
**CPU Utilization**:
- Warning: > 70% for 15m
- Critical: > 90% for 5m
**Memory Utilization**:
- Warning: > 80% for 15m
- Critical: > 95% for 5m
**Disk Space**:
- Warning: > 80% full
- Critical: > 90% full
**Queue Depth**:
- Warning: > 70% of max capacity
- Critical: > 90% of max capacity
### Application-Specific Thresholds
Set thresholds based on:
1. **Historical performance**: Use p95 of last 30 days + 20%
2. **SLO requirements**: If the SLO is 99.9%, alert on a tighter threshold (e.g., 99.95%) so you act before the budget is spent
3. **Business impact**: What error rate causes user complaints?
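For the first of these, the historical baseline can be pulled straight from Prometheus and padded with headroom. A minimal sketch using the HTTP API (the Prometheus URL and metric name are assumptions; adjust to your environment):
```python
import requests

PROMETHEUS = "http://prometheus:9090"  # assumed address
QUERY = ('histogram_quantile(0.95, '
         'sum(rate(http_request_duration_seconds_bucket[30d])) by (le))')

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=30)
resp.raise_for_status()
result = resp.json()["data"]["result"]
p95_30d = float(result[0]["value"][1])          # 30-day aggregate p95, in seconds
threshold = round(p95_30d * 1.2, 3)             # + 20% headroom
print(f"30-day p95 = {p95_30d:.3f}s -> suggested warning threshold = {threshold}s")
```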
---
## The "for" Clause
Prevent alert flapping by requiring the condition to be true for a duration.
### Guidelines
**Critical alerts**: Short duration (2-5m)
```yaml
- alert: ServiceDown
expr: up == 0
for: 2m # Quick detection for critical issues
```
**Warning alerts**: Longer duration (10-30m)
```yaml
- alert: HighMemoryUsage
expr: memory_usage > 80
for: 15m # Avoid noise from temporary spikes
```
**Resource saturation**: Medium duration (5-10m)
```yaml
- alert: HighCPU
expr: cpu_usage > 90
for: 5m
```
---
## Alert Routing
### Severity-Based Routing
```yaml
# alertmanager.yml
route:
group_by: ['alertname', 'cluster']
group_wait: 10s
group_interval: 5m
repeat_interval: 4h
receiver: 'default'
routes:
# Critical alerts → PagerDuty
- match:
severity: critical
receiver: pagerduty
group_wait: 10s
repeat_interval: 5m
# Warning alerts → Slack
- match:
severity: warning
receiver: slack
group_wait: 30s
repeat_interval: 12h
# Info alerts → Email
- match:
severity: info
receiver: email
repeat_interval: 24h
```
### Team-Based Routing
```yaml
routes:
# Platform team
- match:
team: platform
receiver: platform-pagerduty
# Backend team
- match:
team: backend
receiver: backend-slack
# Database team
- match:
component: database
receiver: dba-pagerduty
```
### Time-Based Routing
```yaml
# Only page during business hours for non-critical
routes:
- match:
severity: warning
receiver: slack
active_time_intervals:
- business_hours
time_intervals:
- name: business_hours
time_intervals:
- weekdays: ['monday:friday']
times:
- start_time: '09:00'
end_time: '17:00'
location: 'America/New_York'
```
---
## Alert Grouping
### Intelligent Grouping
**Group by service and environment**:
```yaml
route:
group_by: ['alertname', 'service', 'environment']
group_wait: 30s
group_interval: 5m
```
This prevents:
- 50 alerts for "HighCPU" on different pods → 1 grouped alert
- Mixing production and staging alerts
### Inhibition Rules
Suppress related alerts when a parent alert fires.
```yaml
inhibit_rules:
# If service is down, suppress latency alerts
- source_match:
alertname: ServiceDown
target_match:
alertname: HighLatency
equal: ['service']
# If node is down, suppress all pod alerts on that node
- source_match:
alertname: NodeDown
target_match_re:
alertname: '(PodCrashLoop|HighCPU|HighMemory)'
equal: ['node']
```
---
## Runbook Structure
Every alert should link to a runbook with:
### 1. Context
- What does this alert mean?
- What is the user impact?
- What is the urgency?
### 2. Investigation Steps
```markdown
## Investigation
1. Check service health dashboard
https://grafana.example.com/d/service-dashboard
2. Check recent deployments
kubectl rollout history deployment/myapp -n production
3. Check error logs
kubectl logs deployment/myapp -n production --tail=100 | grep ERROR
4. Check dependencies
- Database: Check slow query log
- Redis: Check memory usage
- External APIs: Check status pages
```
### 3. Common Causes
```markdown
## Common Causes
- **Recent deployment**: Check if alert started after deployment
- **Traffic spike**: Check request rate, might need to scale
- **Database issues**: Check query performance and connection pool
- **External API degradation**: Check third-party status pages
```
### 4. Resolution Steps
```markdown
## Resolution
### Immediate Actions (< 5 minutes)
1. Scale up if traffic spike: `kubectl scale deployment myapp --replicas=10`
2. Rollback if recent deployment: `kubectl rollout undo deployment/myapp`
### Short-term Actions (< 30 minutes)
1. Restart pods if memory leak: `kubectl rollout restart deployment/myapp`
2. Clear cache if stale data: `redis-cli -h cache.example.com FLUSHDB`
### Long-term Actions (post-incident)
1. Review and optimize slow queries
2. Implement circuit breakers
3. Add more capacity
4. Update alert thresholds if false positive
```
### 5. Escalation
```markdown
## Escalation
If issue persists after 30 minutes:
- Slack: #backend-oncall
- PagerDuty: Escalate to senior engineer
- Incident Commander: Jane Doe (jane@example.com)
```
---
## Anti-Patterns to Avoid
### 1. Alert on Everything
❌ Don't: Alert on every warning log
✅ Do: Alert on error rate exceeding threshold
### 2. Alert Without Context
❌ Don't: "Error rate high"
✅ Do: "Error rate 5.2% exceeds 1% threshold for 10m, impacting checkout flow"
### 3. Static Thresholds for Dynamic Systems
❌ Don't: `cpu_usage > 70` (fails during scale-up)
✅ Do: Alert on SLO violations or rate of change
### 4. No "for" Clause
❌ Don't: Alert immediately on threshold breach
✅ Do: Use `for: 5m` to avoid flapping
### 5. Too Many Recipients
❌ Don't: Page 10 people for every alert
✅ Do: Route to specific on-call rotation
### 6. Duplicate Alerts
❌ Don't: Alert on both cause and symptom
✅ Do: Alert on symptom, use inhibition for causes
### 7. No Runbook
❌ Don't: Alert without guidance
✅ Do: Include runbook_url in every alert
---
## Alert Testing
### Test Alert Firing
```bash
# Trigger test alert in Prometheus
amtool alert add alertname="TestAlert" \
severity="warning" \
summary="Test alert"
# Or use Alertmanager API
curl -X POST http://alertmanager:9093/api/v1/alerts \
-d '[{
"labels": {"alertname": "TestAlert", "severity": "critical"},
"annotations": {"summary": "Test critical alert"}
}]'
```
### Verify Alert Rules
```bash
# Check syntax
promtool check rules alerts.yml
# Test expression
promtool query instant http://prometheus:9090 \
'sum(rate(http_requests_total{status=~"5.."}[5m]))'
# Unit test alerts
promtool test rules test.yml
```
### Test Alertmanager Routing
```bash
# Test which receiver an alert would go to
amtool config routes test \
--config.file=alertmanager.yml \
alertname="HighLatency" \
severity="critical" \
team="backend"
```
---
## On-Call Best Practices
### Rotation Schedule
- **Primary on-call**: First responder
- **Secondary on-call**: Escalation backup
- **Rotation length**: 1 week (balance load vs context)
- **Handoff**: Monday morning (not Friday evening)
### On-Call Checklist
```markdown
## Pre-shift
- [ ] Test pager/phone
- [ ] Review recent incidents
- [ ] Check upcoming deployments
- [ ] Update contact info
## During shift
- [ ] Respond to pages within 5 minutes
- [ ] Document all incidents
- [ ] Update runbooks if gaps found
- [ ] Communicate in #incidents channel
## Post-shift
- [ ] Hand off open incidents
- [ ] Complete incident reports
- [ ] Suggest improvements
- [ ] Update team documentation
```
### Escalation Policy
1. **Primary**: Responds within 5 minutes
2. **Secondary**: Auto-escalate after 15 minutes
3. **Manager**: Auto-escalate after 30 minutes
4. **Incident Commander**: Critical incidents only
---
## Metrics About Alerts
Monitor your monitoring system!
### Key Metrics
```promql
# Alert firing frequency
sum(ALERTS{alertstate="firing"}) by (alertname)
# How long each alert has been active (ALERTS_FOR_STATE holds the activation timestamp)
time() - ALERTS_FOR_STATE
# Alerts per severity
sum(ALERTS{alertstate="firing"}) by (severity)
# Time to acknowledge (from PagerDuty/etc)
pagerduty_incident_ack_duration_seconds
```
### Alert Quality Metrics
- **Mean Time to Acknowledge (MTTA)**: < 5 minutes
- **Mean Time to Resolve (MTTR)**: < 30 minutes
- **False Positive Rate**: < 10%
- **Alert Coverage**: > 80% of incidents preceded by an alert
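These can be computed directly from whatever incident export your paging tool provides. A minimal sketch over hypothetical incident records (field names are assumptions):
```python
from datetime import datetime, timedelta

# Hypothetical export: one record per paged alert
incidents = [
    {"fired": datetime(2024, 10, 1, 9, 0), "acked": datetime(2024, 10, 1, 9, 3),
     "resolved": datetime(2024, 10, 1, 9, 25), "actionable": True},
    {"fired": datetime(2024, 10, 2, 14, 0), "acked": datetime(2024, 10, 2, 14, 9),
     "resolved": datetime(2024, 10, 2, 14, 12), "actionable": False},
]
n = len(incidents)
mtta = sum(((i["acked"] - i["fired"]) for i in incidents), timedelta()) / n
mttr = sum(((i["resolved"] - i["fired"]) for i in incidents), timedelta()) / n
false_positives = sum(not i["actionable"] for i in incidents) / n
print(f"MTTA: {mtta}, MTTR: {mttr}, false-positive rate: {false_positives:.0%}")
```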

View File

@@ -0,0 +1,649 @@
# Migrating from Datadog to Open-Source Stack
## Overview
This guide helps you migrate from Datadog to a cost-effective open-source observability stack:
- **Metrics**: Datadog → Prometheus + Grafana
- **Logs**: Datadog → Loki + Grafana
- **Traces**: Datadog APM → Tempo/Jaeger + Grafana
- **Dashboards**: Datadog → Grafana
- **Alerts**: Datadog Monitors → Prometheus Alertmanager
**Estimated Cost Savings**: 60-80% of tooling spend for similar functionality (before operations time; see the ROI section below)
---
## Cost Comparison
### Example: 100-host infrastructure
**Datadog**:
- Infrastructure Pro: $1,500/month (100 hosts × $15)
- Custom Metrics: $50/month (5,000 extra metrics beyond included 10,000)
- Logs: $2,000/month (roughly 20GB/day ingested and indexed)
- APM: $3,100/month (100 hosts × $31)
- **Total**: ~$6,650/month ($79,800/year)
**Open-Source Stack** (self-hosted):
- Infrastructure: $1,200/month (EC2/GKE for Prometheus, Grafana, Loki, Tempo)
- Storage: $300/month (S3/GCS for long-term metrics and traces)
- Operations time: Variable
- **Total**: ~$1,500-2,500/month ($18,000-30,000/year)
**Savings**: $49,800-61,800/year
---
## Migration Strategy
### Phase 1: Run Parallel (Month 1-2)
- Deploy open-source stack alongside Datadog
- Migrate metrics first (lowest risk)
- Validate data accuracy
- Build confidence
### Phase 2: Migrate Dashboards & Alerts (Month 2-3)
- Convert Datadog dashboards to Grafana
- Translate alert rules
- Train team on new tools
### Phase 3: Migrate Logs & Traces (Month 3-4)
- Set up Loki for log aggregation
- Deploy Tempo/Jaeger for tracing
- Update application instrumentation
### Phase 4: Decommission Datadog (Month 4-5)
- Confirm all functionality migrated
- Cancel Datadog subscription
- Archive Datadog dashboards/alerts for reference
---
## 1. Metrics Migration (Datadog → Prometheus)
### Step 1: Deploy Prometheus
**Kubernetes** (recommended):
```yaml
# prometheus-values.yaml
prometheus:
prometheusSpec:
retention: 30d
storageSpec:
volumeClaimTemplate:
spec:
resources:
requests:
storage: 100Gi
# Scrape configs
additionalScrapeConfigs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
```
**Install**:
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack -f prometheus-values.yaml
```
**Docker Compose**:
```yaml
version: '3'
services:
prometheus:
image: prom/prometheus:latest
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus-data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.retention.time=30d'
volumes:
prometheus-data:
```
### Step 2: Replace DogStatsD with Prometheus Exporters
**Before (DogStatsD)**:
```python
from datadog import statsd
statsd.increment('page.views')
statsd.histogram('request.duration', 0.5)
statsd.gauge('active_users', 100)
```
**After (Prometheus Python client)**:
```python
from prometheus_client import Counter, Histogram, Gauge
page_views = Counter('page_views_total', 'Page views')
request_duration = Histogram('request_duration_seconds', 'Request duration')
active_users = Gauge('active_users', 'Active users')
# Usage
page_views.inc()
request_duration.observe(0.5)
active_users.set(100)
```
### Step 3: Metric Name Translation
| Datadog Metric | Prometheus Equivalent |
|----------------|----------------------|
| `system.cpu.idle` | `node_cpu_seconds_total{mode="idle"}` |
| `system.mem.free` | `node_memory_MemFree_bytes` |
| `system.disk.used` | `node_filesystem_size_bytes - node_filesystem_free_bytes` |
| `docker.cpu.usage` | `container_cpu_usage_seconds_total` |
| `kubernetes.pods.running` | `kube_pod_status_phase{phase="Running"}` |
### Step 4: Export Existing Datadog Metrics (Optional)
Use Datadog API to export historical data:
```python
import time
from datadog import api, initialize
options = {
'api_key': 'YOUR_API_KEY',
'app_key': 'YOUR_APP_KEY'
}
initialize(**options)
# Query metric
result = api.Metric.query(
start=int(time.time() - 86400), # Last 24h
end=int(time.time()),
query='avg:system.cpu.user{*}'
)
# Convert to Prometheus format and import
```
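One way to carry those exported points over is to render them as OpenMetrics text and backfill them with `promtool tsdb create-blocks-from openmetrics`. A sketch, assuming the v1 query response's `series`/`pointlist` layout (millisecond timestamps) and an illustrative metric name:
```python
def to_openmetrics(result, metric_name="datadog_system_cpu_user"):
    """Render a Datadog metric query result as OpenMetrics text for backfilling."""
    lines = [f"# TYPE {metric_name} gauge"]
    for series in result.get("series", []):
        for ts_ms, value in series.get("pointlist", []):
            if value is None:
                continue
            lines.append(f"{metric_name} {value} {int(ts_ms / 1000)}")
    lines.append("# EOF")                      # required terminator for backfill
    return "\n".join(lines) + "\n"

with open("backfill.om", "w") as f:
    f.write(to_openmetrics(result))
# Then: promtool tsdb create-blocks-from openmetrics backfill.om ./prometheus-data
```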
---
## 2. Dashboard Migration (Datadog → Grafana)
### Step 1: Export Datadog Dashboards
```python
import requests
import json
api_key = "YOUR_API_KEY"
app_key = "YOUR_APP_KEY"
headers = {
'DD-API-KEY': api_key,
'DD-APPLICATION-KEY': app_key
}
# Get all dashboards
response = requests.get(
'https://api.datadoghq.com/api/v1/dashboard',
headers=headers
)
dashboards = response.json()
# Export each dashboard
for dashboard in dashboards['dashboards']:
dash_id = dashboard['id']
detail = requests.get(
f'https://api.datadoghq.com/api/v1/dashboard/{dash_id}',
headers=headers
).json()
with open(f'datadog_{dash_id}.json', 'w') as f:
json.dump(detail, f, indent=2)
```
### Step 2: Convert to Grafana Format
**Manual Conversion Template**:
| Datadog Widget | Grafana Panel Type |
|----------------|-------------------|
| Timeseries | Graph / Time series |
| Query Value | Stat |
| Toplist | Table / Bar gauge |
| Heatmap | Heatmap |
| Distribution | Histogram |
**Automated Conversion** (basic example):
```python
def convert_datadog_to_grafana(datadog_dashboard):
grafana_dashboard = {
"title": datadog_dashboard['title'],
"panels": []
}
for widget in datadog_dashboard['widgets']:
panel = {
"title": widget['definition'].get('title', ''),
"type": map_widget_type(widget['definition']['type']),
"targets": convert_queries(widget['definition']['requests'])
}
grafana_dashboard['panels'].append(panel)
return grafana_dashboard
```
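The two helpers above are left as stubs. A minimal `map_widget_type` based on the mapping table (the widget and panel type names are assumptions and may differ across Datadog/Grafana versions):
```python
def map_widget_type(datadog_type):
    """Map a Datadog widget type to a roughly equivalent Grafana panel type."""
    mapping = {
        "timeseries": "timeseries",
        "query_value": "stat",
        "toplist": "bargauge",
        "heatmap": "heatmap",
        "distribution": "histogram",
    }
    return mapping.get(datadog_type, "timeseries")  # default to a time-series panel
```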
### Step 3: Common Query Translations
See `dql_promql_translation.md` for comprehensive query mappings.
**Example conversions**:
```
Datadog: avg:system.cpu.user{*}
Prometheus: avg(rate(node_cpu_seconds_total{mode="user"}[5m])) * 100
Datadog: sum:requests.count{status:200}.as_rate()
Prometheus: sum(rate(http_requests_total{status="200"}[5m]))
Datadog: p95:request.duration{*}
Prometheus: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
---
## 3. Alert Migration (Datadog Monitors → Prometheus Alerts)
### Step 1: Export Datadog Monitors
```python
import requests
import json
api_key = "YOUR_API_KEY"
app_key = "YOUR_APP_KEY"
headers = {
'DD-API-KEY': api_key,
'DD-APPLICATION-KEY': app_key
}
response = requests.get(
'https://api.datadoghq.com/api/v1/monitor',
headers=headers
)
monitors = response.json()
# Save each monitor
for monitor in monitors:
with open(f'monitor_{monitor["id"]}.json', 'w') as f:
json.dump(monitor, f, indent=2)
```
### Step 2: Convert to Prometheus Alert Rules
**Datadog Monitor**:
```json
{
"name": "High CPU Usage",
"type": "metric alert",
"query": "avg(last_5m):avg:system.cpu.user{*} > 80",
"message": "CPU usage is high on {{host.name}}"
}
```
**Prometheus Alert**:
```yaml
groups:
- name: infrastructure
rules:
- alert: HighCPUUsage
expr: |
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value }}%"
```
### Step 3: Alert Routing (Datadog → Alertmanager)
**Datadog notification channels** → **Alertmanager receivers**
```yaml
# alertmanager.yml
route:
group_by: ['alertname', 'severity']
receiver: 'slack-notifications'
receivers:
- name: 'slack-notifications'
slack_configs:
- api_url: 'YOUR_SLACK_WEBHOOK'
channel: '#alerts'
- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_KEY'
```
---
## 4. Log Migration (Datadog → Loki)
### Step 1: Deploy Loki
**Kubernetes**:
```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
--set loki.persistence.enabled=true \
--set loki.persistence.size=100Gi \
--set promtail.enabled=true
```
**Docker Compose**:
```yaml
version: '3'
services:
loki:
image: grafana/loki:latest
ports:
- "3100:3100"
volumes:
- ./loki-config.yaml:/etc/loki/local-config.yaml
- loki-data:/loki
promtail:
image: grafana/promtail:latest
volumes:
- /var/log:/var/log
- ./promtail-config.yaml:/etc/promtail/config.yml
volumes:
loki-data:
```
### Step 2: Replace Datadog Log Forwarder
**Before (Datadog Agent)**:
```yaml
# datadog.yaml
logs_enabled: true
logs_config:
container_collect_all: true
```
**After (Promtail)**:
```yaml
# promtail-config.yaml
server:
http_listen_port: 9080
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: system
static_configs:
- targets:
- localhost
labels:
job: varlogs
__path__: /var/log/*.log
```
### Step 3: Query Translation
**Datadog Logs Query**:
```
service:my-app status:error
```
**Loki LogQL**:
```logql
{job="my-app", level="error"}
```
**More examples**:
```
Datadog: service:api-gateway status:error @http.status_code:>=500
Loki: {service="api-gateway", level="error"} | json | http_status_code >= 500
Datadog: source:nginx "404"
Loki: {source="nginx"} |= "404"
```
---
## 5. APM Migration (Datadog APM → Tempo/Jaeger)
### Step 1: Choose Tracing Backend
- **Tempo**: Better for high volume, cheaper storage (object storage)
- **Jaeger**: More mature, better UI, requires separate storage
### Step 2: Replace Datadog Tracer with OpenTelemetry
**Before (Datadog Python)**:
```python
from ddtrace import tracer
@tracer.wrap()
def my_function():
pass
```
**After (OpenTelemetry)**:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
# Setup: register a provider and ship spans to the OTLP endpoint (Collector or Tempo)
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="tempo:4317", insecure=True)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
@tracer.start_as_current_span("my_function")
def my_function():
    pass
```
### Step 3: Deploy Tempo
```yaml
# tempo.yaml
server:
http_listen_port: 3200
distributor:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
storage:
trace:
backend: s3
s3:
bucket: tempo-traces
endpoint: s3.amazonaws.com
```
---
## 6. Infrastructure Migration
### Recommended Architecture
```
┌─────────────────────────────────────────┐
│ Grafana (Visualization) │
│ - Dashboards │
│ - Unified view │
└─────────────────────────────────────────┘
↓ ↓ ↓
┌──────────────┐ ┌──────────┐ ┌──────────┐
│ Prometheus │ │ Loki │ │ Tempo │
│ (Metrics) │ │ (Logs) │ │ (Traces) │
└──────────────┘ └──────────┘ └──────────┘
↓ ↓ ↓
┌─────────────────────────────────────────┐
│ Applications (OpenTelemetry) │
└─────────────────────────────────────────┘
```
### Sizing Recommendations
**100-host environment**:
- **Prometheus**: 2-4 CPU, 8-16GB RAM, 100GB SSD
- **Grafana**: 1 CPU, 2GB RAM
- **Loki**: 2-4 CPU, 8GB RAM, 100GB SSD
- **Tempo**: 2-4 CPU, 8GB RAM, S3 for storage
- **Alertmanager**: 1 CPU, 1GB RAM
**Total**: ~8-16 CPU, 32-64GB RAM, 200GB SSD + object storage
---
## 7. Migration Checklist
### Pre-Migration
- [ ] Calculate current Datadog costs
- [ ] Identify all Datadog integrations
- [ ] Export all dashboards
- [ ] Export all monitors
- [ ] Document custom metrics
- [ ] Get stakeholder approval
### During Migration
- [ ] Deploy Prometheus + Grafana
- [ ] Deploy Loki + Promtail
- [ ] Deploy Tempo/Jaeger (if using APM)
- [ ] Migrate metrics instrumentation
- [ ] Convert dashboards (top 10 critical first)
- [ ] Convert alerts (critical alerts first)
- [ ] Update application logging
- [ ] Replace APM instrumentation
- [ ] Run parallel for 2-4 weeks
- [ ] Validate data accuracy
- [ ] Train team on new tools
### Post-Migration
- [ ] Decommission Datadog agent from all hosts
- [ ] Cancel Datadog subscription
- [ ] Archive Datadog configs
- [ ] Document new workflows
- [ ] Create runbooks for common tasks
---
## 8. Common Challenges & Solutions
### Challenge: Missing Datadog Features
**Datadog Synthetic Monitoring**:
- Solution: Use **Blackbox Exporter** (Prometheus) or **Grafana Synthetic Monitoring**
**Datadog Network Performance Monitoring**:
- Solution: Use **Cilium Hubble** (Kubernetes) or **eBPF-based tools**
**Datadog RUM (Real User Monitoring)**:
- Solution: Use **Grafana Faro** or **OpenTelemetry Browser SDK**
### Challenge: Team Learning Curve
**Solution**:
- Provide training sessions (2-3 hours per tool)
- Create internal documentation with examples
- Set up sandbox environment for practice
- Assign champions for each tool
### Challenge: Query Performance
**Prometheus too slow**:
- Use **Thanos** or **Cortex** for scaling
- Implement recording rules for expensive queries
- Increase retention only where needed
**Loki too slow**:
- Add more labels for better filtering
- Use chunk caching
- Consider **parallel query execution**
---
## 9. Maintenance Comparison
### Datadog (Managed)
- **Ops burden**: Low (fully managed)
- **Upgrades**: Automatic
- **Scaling**: Automatic
- **Cost**: High ($6k-10k+/month)
### Open-Source Stack (Self-hosted)
- **Ops burden**: Medium (requires ops team)
- **Upgrades**: Manual (quarterly)
- **Scaling**: Manual planning required
- **Cost**: Low ($1.5k-3k/month infrastructure)
**Hybrid Option**: Use **Grafana Cloud** (managed Prometheus/Loki/Tempo)
- Cost: ~$3k/month for 100 hosts
- Ops burden: Low
- Savings: ~50% vs Datadog
---
## 10. ROI Calculation
### Example Scenario
**Before (Datadog)**:
- Monthly cost: $7,000
- Annual cost: $84,000
**After (Self-hosted OSS)**:
- Infrastructure: $1,800/month
- Operations (0.5 FTE): $4,000/month
- Annual cost: $69,600
**Savings**: $14,400/year
**After (Grafana Cloud)**:
- Monthly cost: $3,500
- Annual cost: $42,000
**Savings**: $42,000/year (50%)
**Break-even**: Immediate (no migration costs beyond engineering time)
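The same comparison as a quick script, if you want to plug in your own numbers (the figures below are the examples from this section, not quotes):
```python
datadog_annual = 7_000 * 12
options = {
    "self-hosted OSS": (1_800 + 4_000) * 12,   # infrastructure + 0.5 FTE ops
    "Grafana Cloud": 3_500 * 12,
}
for name, annual in options.items():
    savings = datadog_annual - annual
    print(f"{name}: ${annual:,}/yr, saves ${savings:,} ({savings / datadog_annual:.0%})")
```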
---
## Resources
- **Prometheus**: https://prometheus.io/docs/
- **Grafana**: https://grafana.com/docs/
- **Loki**: https://grafana.com/docs/loki/
- **Tempo**: https://grafana.com/docs/tempo/
- **OpenTelemetry**: https://opentelemetry.io/
- **Migration Tools**: https://github.com/grafana/dashboard-linter
---
## Support
If you need help with migration:
- Grafana Labs offers migration consulting
- Many SRE consulting firms specialize in this
- Community support via Slack/Discord channels

View File

@@ -0,0 +1,756 @@
# DQL (Datadog Query Language) ↔ PromQL Translation Guide
## Quick Reference
| Concept | Datadog (DQL) | Prometheus (PromQL) |
|---------|---------------|---------------------|
| Aggregation | `avg:`, `sum:`, `min:`, `max:` | `avg()`, `sum()`, `min()`, `max()` |
| Rate | `.as_rate()`, `.as_count()` | `rate()`, `increase()` |
| Percentile | `p50:`, `p95:`, `p99:` | `histogram_quantile()` |
| Filtering | `{tag:value}` | `{label="value"}` |
| Time window | `last_5m`, `last_1h` | `[5m]`, `[1h]` |
---
## Basic Queries
### Simple Metric Query
**Datadog**:
```
system.cpu.user
```
**Prometheus**:
```promql
node_cpu_seconds_total{mode="user"}
```
---
### Metric with Filter
**Datadog**:
```
system.cpu.user{host:web-01}
```
**Prometheus**:
```promql
node_cpu_seconds_total{mode="user", instance="web-01"}
```
---
### Multiple Filters (AND)
**Datadog**:
```
system.cpu.user{host:web-01,env:production}
```
**Prometheus**:
```promql
node_cpu_seconds_total{mode="user", instance="web-01", env="production"}
```
---
### Wildcard Filters
**Datadog**:
```
system.cpu.user{host:web-*}
```
**Prometheus**:
```promql
node_cpu_seconds_total{mode="user", instance=~"web-.*"}
```
---
### OR Filters
**Datadog**:
```
system.cpu.user{host:web-01 OR host:web-02}
```
**Prometheus**:
```promql
node_cpu_seconds_total{mode="user", instance=~"web-01|web-02"}
```
---
## Aggregations
### Average
**Datadog**:
```
avg:system.cpu.user{*}
```
**Prometheus**:
```promql
avg(node_cpu_seconds_total{mode="user"})
```
---
### Sum
**Datadog**:
```
sum:requests.count{*}
```
**Prometheus**:
```promql
sum(http_requests_total)
```
---
### Min/Max
**Datadog**:
```
min:system.mem.free{*}
max:system.mem.free{*}
```
**Prometheus**:
```promql
min(node_memory_MemFree_bytes)
max(node_memory_MemFree_bytes)
```
---
### Aggregation by Tag/Label
**Datadog**:
```
avg:system.cpu.user{*} by {host}
```
**Prometheus**:
```promql
avg by (instance) (node_cpu_seconds_total{mode="user"})
```
---
## Rates and Counts
### Rate (per second)
**Datadog**:
```
sum:requests.count{*}.as_rate()
```
**Prometheus**:
```promql
sum(rate(http_requests_total[5m]))
```
Note: Prometheus requires explicit time window `[5m]`
---
### Count (total over time)
**Datadog**:
```
sum:requests.count{*}.as_count()
```
**Prometheus**:
```promql
sum(increase(http_requests_total[1h]))
```
---
### Derivative (change over time)
**Datadog**:
```
derivative(avg:system.disk.used{*})
```
**Prometheus**:
```promql
deriv(node_filesystem_avail_bytes[5m])
```
---
## Percentiles
### P50 (Median)
**Datadog**:
```
p50:request.duration{*}
```
**Prometheus** (requires histogram):
```promql
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
---
### P95
**Datadog**:
```
p95:request.duration{*}
```
**Prometheus**:
```promql
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
---
### P99
**Datadog**:
```
p99:request.duration{*}
```
**Prometheus**:
```promql
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
---
## Time Windows
### Last 5 minutes
**Datadog**:
```
avg(last_5m):system.cpu.user{*}
```
**Prometheus**:
```promql
avg_over_time(node_cpu_seconds_total{mode="user"}[5m])
```
---
### Last 1 hour
**Datadog**:
```
avg(last_1h):system.cpu.user{*}
```
**Prometheus**:
```promql
avg_over_time(node_cpu_seconds_total{mode="user"}[1h])
```
---
## Math Operations
### Division
**Datadog**:
```
avg:system.mem.used{*} / avg:system.mem.total{*}
```
**Prometheus**:
```promql
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
```
---
### Multiplication
**Datadog**:
```
avg:system.cpu.user{*} * 100
```
**Prometheus**:
```promql
avg(node_cpu_seconds_total{mode="user"}) * 100
```
---
### Percentage Calculation
**Datadog**:
```
(sum:requests.errors{*} / sum:requests.count{*}) * 100
```
**Prometheus**:
```promql
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
```
---
## Common Use Cases
### CPU Usage Percentage
**Datadog**:
```
100 - avg:system.cpu.idle{*}
```
**Prometheus**:
```promql
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```
---
### Memory Usage Percentage
**Datadog**:
```
(avg:system.mem.used{*} / avg:system.mem.total{*}) * 100
```
**Prometheus**:
```promql
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
```
---
### Disk Usage Percentage
**Datadog**:
```
(avg:system.disk.used{*} / avg:system.disk.total{*}) * 100
```
**Prometheus**:
```promql
(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100
```
---
### Request Rate (requests/sec)
**Datadog**:
```
sum:requests.count{*}.as_rate()
```
**Prometheus**:
```promql
sum(rate(http_requests_total[5m]))
```
---
### Error Rate Percentage
**Datadog**:
```
(sum:requests.errors{*}.as_rate() / sum:requests.count{*}.as_rate()) * 100
```
**Prometheus**:
```promql
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
```
---
### Request Latency (P95)
**Datadog**:
```
p95:request.duration{*}
```
**Prometheus**:
```promql
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
---
### Top 5 Hosts by CPU
**Datadog**:
```
top(avg:system.cpu.user{*} by {host}, 5, 'mean', 'desc')
```
**Prometheus**:
```promql
topk(5, avg by (instance) (rate(node_cpu_seconds_total{mode="user"}[5m])))
```
---
## Functions
### Absolute Value
**Datadog**:
```
abs(diff(avg:system.cpu.user{*}))
```
**Prometheus**:
```promql
abs(delta(node_cpu_seconds_total{mode="user"}[5m]))
```
---
### Ceiling/Floor
**Datadog**:
```
ceil(avg:system.cpu.user{*})
floor(avg:system.cpu.user{*})
```
**Prometheus**:
```promql
ceil(avg(node_cpu_seconds_total{mode="user"}))
floor(avg(node_cpu_seconds_total{mode="user"}))
```
---
### Clamp (Limit Range)
**Datadog**:
```
clamp_min(avg:system.cpu.user{*}, 0)
clamp_max(avg:system.cpu.user{*}, 100)
```
**Prometheus**:
```promql
clamp_min(avg(node_cpu_seconds_total{mode="user"}), 0)
clamp_max(avg(node_cpu_seconds_total{mode="user"}), 100)
```
---
### Moving Average
**Datadog**:
```
moving_rollup(avg:system.cpu.user{*}, 60, 'avg')
```
**Prometheus**:
```promql
avg_over_time(node_cpu_seconds_total{mode="user"}[1m])
```
---
## Advanced Patterns
### Compare to Previous Period
**Datadog**:
```
sum:requests.count{*}.as_rate() / timeshift(sum:requests.count{*}.as_rate(), 3600)
```
**Prometheus**:
```promql
sum(rate(http_requests_total[5m])) / sum(rate(http_requests_total[5m] offset 1h))
```
---
### Forecast
**Datadog**:
```
forecast(avg:system.disk.used{*}, 'linear', 1)
```
**Prometheus**:
```promql
predict_linear(node_filesystem_avail_bytes[1h], 3600)
```
Note: Predicts value 1 hour in future based on last 1 hour trend
---
### Anomaly Detection
**Datadog**:
```
anomalies(avg:system.cpu.user{*}, 'basic', 2)
```
**Prometheus**: No built-in function
- Use recording rules with stddev
- External tools like **Robust Perception's anomaly detector**
- Or use **Grafana ML** plugin
---
### Outlier Detection
**Datadog**:
```
outliers(avg:system.cpu.user{*} by {host}, 'mad')
```
**Prometheus**: No built-in function
- Calculate manually with stddev:
```promql
abs(metric - scalar(avg(metric))) > 2 * scalar(stddev(metric))
```
---
## Container & Kubernetes
### Container CPU Usage
**Datadog**:
```
avg:docker.cpu.usage{*} by {container_name}
```
**Prometheus**:
```promql
avg by (container) (rate(container_cpu_usage_seconds_total[5m]))
```
---
### Container Memory Usage
**Datadog**:
```
avg:docker.mem.rss{*} by {container_name}
```
**Prometheus**:
```promql
avg by (container) (container_memory_rss)
```
---
### Pod Count by Status
**Datadog**:
```
sum:kubernetes.pods.running{*} by {kube_namespace}
```
**Prometheus**:
```promql
sum by (namespace) (kube_pod_status_phase{phase="Running"})
```
---
## Database Queries
### MySQL Queries Per Second
**Datadog**:
```
sum:mysql.performance.queries{*}.as_rate()
```
**Prometheus**:
```promql
sum(rate(mysql_global_status_queries[5m]))
```
---
### PostgreSQL Active Connections
**Datadog**:
```
avg:postgresql.connections{*}
```
**Prometheus**:
```promql
avg(pg_stat_database_numbackends)
```
---
### Redis Memory Usage
**Datadog**:
```
avg:redis.mem.used{*}
```
**Prometheus**:
```promql
avg(redis_memory_used_bytes)
```
---
## Network Metrics
### Network Bytes Sent
**Datadog**:
```
sum:system.net.bytes_sent{*}.as_rate()
```
**Prometheus**:
```promql
sum(rate(node_network_transmit_bytes_total[5m]))
```
---
### Network Bytes Received
**Datadog**:
```
sum:system.net.bytes_rcvd{*}.as_rate()
```
**Prometheus**:
```promql
sum(rate(node_network_receive_bytes_total[5m]))
```
---
## Key Differences
### 1. Time Windows
- **Datadog**: Optional, defaults to query time range
- **Prometheus**: Always required for rate/increase functions
### 2. Histograms
- **Datadog**: Percentiles available directly
- **Prometheus**: Requires histogram buckets + `histogram_quantile()`
### 3. Default Aggregation
- **Datadog**: No default, must specify
- **Prometheus**: Returns all time series unless aggregated
### 4. Metric Types
- **Datadog**: All metrics treated similarly
- **Prometheus**: Explicit types (counter, gauge, histogram, summary)
### 5. Tag vs Label
- **Datadog**: Uses "tags" (key:value)
- **Prometheus**: Uses "labels" (key="value")
---
## Migration Tips
1. **Start with dashboards**: Convert most-used dashboards first
2. **Use recording rules**: Pre-calculate expensive PromQL queries
3. **Test in parallel**: Run both systems during migration
4. **Document mappings**: Create team-specific translation guide
5. **Train team**: PromQL has learning curve, invest in training
---
## Tools
- **Datadog Dashboard Exporter**: Export JSON dashboards
- **Grafana Dashboard Linter**: Validate converted dashboards
- **PromQL Learning Resources**: https://prometheus.io/docs/prometheus/latest/querying/basics/
---
## Common Gotchas
### Rate without Time Window
**Wrong**:
```promql
rate(http_requests_total)
```
**Correct**:
```promql
rate(http_requests_total[5m])
```
---
### Aggregating Before Rate
**Wrong**:
```promql
rate(sum(http_requests_total)[5m])
```
**Correct**:
```promql
sum(rate(http_requests_total[5m]))
```
---
### Histogram Quantile Without by (le)
**Wrong**:
```promql
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```
**Correct**:
```promql
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
---
## Quick Conversion Checklist
When converting a Datadog query to PromQL:
- [ ] Replace metric name (e.g., `system.cpu.user` → `node_cpu_seconds_total`)
- [ ] Convert tags to labels (`{tag:value}` → `{label="value"}`)
- [ ] Add time window for rate/increase (`[5m]`)
- [ ] Change aggregation syntax (`avg:` → `avg()`)
- [ ] Convert percentiles to histogram_quantile if needed
- [ ] Test query in Prometheus before adding to dashboard
- [ ] Add `by (label)` for grouped aggregations
---
## Need More Help?
- See `datadog_migration.md` for full migration guide
- PromQL documentation: https://prometheus.io/docs/prometheus/latest/querying/
- Practice at: https://demo.promlens.com/

775
references/logging_guide.md Normal file
View File

@@ -0,0 +1,775 @@
# Logging Guide
## Structured Logging
### Why Structured Logs?
**Unstructured** (text):
```
2024-10-28 14:32:15 User john@example.com logged in from 192.168.1.1
```
**Structured** (JSON):
```json
{
"timestamp": "2024-10-28T14:32:15Z",
"level": "info",
"message": "User logged in",
"user": "john@example.com",
"ip": "192.168.1.1",
"event_type": "user_login"
}
```
**Benefits**:
- Easy to parse and query
- Consistent format
- Machine-readable
- Efficient storage and indexing
---
## Log Levels
Use appropriate log levels for better filtering and alerting.
### DEBUG
**When**: Development, troubleshooting
**Examples**:
- Function entry/exit
- Variable values
- Internal state changes
```python
logger.debug("Processing request", extra={
"request_id": req_id,
"params": params
})
```
### INFO
**When**: Important business events
**Examples**:
- User actions (login, purchase)
- System state changes (started, stopped)
- Significant milestones
```python
logger.info("Order placed", extra={
"order_id": "12345",
"user_id": "user123",
"amount": 99.99
})
```
### WARN
**When**: Potentially problematic situations
**Examples**:
- Deprecated API usage
- Slow operations (but not failing)
- Retry attempts
- Resource usage approaching limits
```python
logger.warning("API response slow", extra={
"endpoint": "/api/users",
"duration_ms": 2500,
"threshold_ms": 1000
})
```
### ERROR
**When**: Error conditions that need attention
**Examples**:
- Failed requests
- Exceptions caught and handled
- Integration failures
- Data validation errors
```python
logger.error("Payment processing failed", extra={
"order_id": "12345",
"error": str(e),
"payment_gateway": "stripe"
}, exc_info=True)
```
### FATAL/CRITICAL
**When**: Severe errors causing shutdown
**Examples**:
- Database connection lost
- Out of memory
- Configuration errors preventing startup
```python
logger.critical("Database connection lost", extra={
"database": "postgres",
"host": "db.example.com",
"attempt": 3
})
```
---
## Required Fields
Every log entry should include:
### 1. Timestamp
ISO 8601 format with timezone:
```json
{
"timestamp": "2024-10-28T14:32:15.123Z"
}
```
### 2. Level
Standard levels: debug, info, warn, error, critical
```json
{
"level": "error"
}
```
### 3. Message
Human-readable description:
```json
{
"message": "User authentication failed"
}
```
### 4. Service/Application
What component logged this:
```json
{
"service": "api-gateway",
"version": "1.2.3"
}
```
### 5. Environment
```json
{
"environment": "production"
}
```
---
## Recommended Fields
### Request Context
```json
{
"request_id": "550e8400-e29b-41d4-a716-446655440000",
"user_id": "user123",
"session_id": "sess_abc",
"ip_address": "192.168.1.1",
"user_agent": "Mozilla/5.0..."
}
```
### Performance Metrics
```json
{
"duration_ms": 245,
"response_size_bytes": 1024
}
```
### Error Details
```json
{
"error_type": "ValidationError",
"error_message": "Invalid email format",
"stack_trace": "...",
"error_code": "VAL_001"
}
```
### Business Context
```json
{
"order_id": "ORD-12345",
"customer_id": "CUST-789",
"transaction_amount": 99.99,
"payment_method": "credit_card"
}
```
---
## Implementation Examples
### Python (using structlog)
```python
import structlog
logger = structlog.get_logger()
# Configure structured logging
structlog.configure(
processors=[
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.add_log_level,
structlog.processors.JSONRenderer()
]
)
# Usage
logger.info(
"user_logged_in",
user_id="user123",
ip_address="192.168.1.1",
login_method="oauth"
)
```
### Node.js (using Winston)
```javascript
const winston = require('winston');
const logger = winston.createLogger({
format: winston.format.json(),
defaultMeta: { service: 'api-gateway' },
transports: [
new winston.transports.Console()
]
});
logger.info('User logged in', {
userId: 'user123',
ipAddress: '192.168.1.1',
loginMethod: 'oauth'
});
```
### Go (using zap)
```go
import "go.uber.org/zap"
logger, _ := zap.NewProduction()
defer logger.Sync()
logger.Info("User logged in",
zap.String("userId", "user123"),
zap.String("ipAddress", "192.168.1.1"),
zap.String("loginMethod", "oauth"),
)
```
### Java (using Logback with JSON)
```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import net.logstash.logback.argument.StructuredArguments;
Logger logger = LoggerFactory.getLogger(MyClass.class);
logger.info("User logged in",
StructuredArguments.kv("userId", "user123"),
StructuredArguments.kv("ipAddress", "192.168.1.1"),
StructuredArguments.kv("loginMethod", "oauth")
);
```
---
## Log Aggregation Patterns
### Pattern 1: ELK Stack (Elasticsearch, Logstash, Kibana)
**Architecture**:
```
Application → Filebeat → Logstash → Elasticsearch → Kibana
```
**filebeat.yml**:
```yaml
filebeat.inputs:
- type: log
enabled: true
paths:
- /var/log/app/*.log
json.keys_under_root: true
json.add_error_key: true
output.logstash:
hosts: ["logstash:5044"]
```
**logstash.conf**:
```
input {
beats {
port => 5044
}
}
filter {
json {
source => "message"
}
date {
match => ["timestamp", "ISO8601"]
}
grok {
match => { "message" => "%{COMBINEDAPACHELOG}" }
}
}
output {
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "app-logs-%{+YYYY.MM.dd}"
}
}
```
### Pattern 2: Loki (Grafana Loki)
**Architecture**:
```
Application → Promtail → Loki → Grafana
```
**promtail-config.yml**:
```yaml
server:
http_listen_port: 9080
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: app
static_configs:
- targets:
- localhost
labels:
job: app
__path__: /var/log/app/*.log
pipeline_stages:
- json:
expressions:
level: level
timestamp: timestamp
- labels:
level:
service:
- timestamp:
source: timestamp
format: RFC3339
```
**Query in Grafana**:
```logql
{job="app"} |= "error" | json | level="error"
```
### Pattern 3: CloudWatch Logs
**Install CloudWatch agent**:
```json
{
"logs": {
"logs_collected": {
"files": {
"collect_list": [
{
"file_path": "/var/log/app/*.log",
"log_group_name": "/aws/app/production",
"log_stream_name": "{instance_id}",
"timezone": "UTC"
}
]
}
}
}
}
```
**Query with CloudWatch Insights**:
```
fields @timestamp, level, message, user_id
| filter level = "error"
| sort @timestamp desc
| limit 100
```
### Pattern 4: Fluentd/Fluent Bit
**fluent-bit.conf**:
```
[INPUT]
Name tail
Path /var/log/app/*.log
Parser json
Tag app.*
[FILTER]
Name record_modifier
Match *
Record hostname ${HOSTNAME}
Record cluster production
[OUTPUT]
Name es
Match *
Host elasticsearch
Port 9200
Index app-logs
Type _doc
```
---
## Query Patterns
### Find Errors in Time Range
**Elasticsearch**:
```json
GET /app-logs-*/_search
{
"query": {
"bool": {
"must": [
{ "match": { "level": "error" } },
{ "range": { "@timestamp": {
"gte": "now-1h",
"lte": "now"
}}}
]
}
}
}
```
**Loki (LogQL)**:
```logql
{job="app", level="error"} |= "error"
```
**CloudWatch Insights**:
```
fields @timestamp, @message
| filter level = "error"
| filter @timestamp > ago(1h)
```
### Count Errors by Type
**Elasticsearch**:
```json
GET /app-logs-*/_search
{
"size": 0,
"query": { "match": { "level": "error" } },
"aggs": {
"error_types": {
"terms": { "field": "error_type.keyword" }
}
}
}
```
**Loki**:
```logql
sum by (error_type) (count_over_time({job="app", level="error"}[1h]))
```
### Find Slow Requests
**Elasticsearch**:
```json
GET /app-logs-*/_search
{
"query": {
"range": { "duration_ms": { "gte": 1000 } }
},
"sort": [ { "duration_ms": "desc" } ]
}
```
### Trace Request Through Services
**Elasticsearch** (using request_id):
```json
GET /_search
{
"query": {
"match": { "request_id": "550e8400-e29b-41d4-a716-446655440000" }
},
"sort": [ { "@timestamp": "asc" } ]
}
```
---
## Sampling and Rate Limiting
### When to Sample
- **High volume services**: > 10,000 logs/second
- **Debug logs in production**: Sample 1-10%
- **Cost optimization**: Reduce storage costs
### Sampling Strategies
**1. Random Sampling**:
```python
import random
if random.random() < 0.1: # Sample 10%
logger.debug("Debug message", ...)
```
**2. Rate Limiting**:
```python
from rate_limiter import RateLimiter  # placeholder: substitute any token-bucket library
limiter = RateLimiter(max_per_second=100)
if limiter.allow():
logger.info("Rate limited log", ...)
```
**3. Error-Biased Sampling**:
```python
# Always log errors, sample successful requests
if level == "error" or random.random() < 0.01:
logger.log(level, message, ...)
```
**4. Head-Based Sampling** (trace-aware):
```python
# If trace is sampled, log all related logs
if trace_context.is_sampled():
logger.info("Traced log", trace_id=trace_context.trace_id)
```
---
## Log Retention
### Retention Strategy
**Hot tier** (fast SSD): 7-30 days
- Recent logs
- Full query performance
- High cost
**Warm tier** (regular disk): 30-90 days
- Older logs
- Slower queries acceptable
- Medium cost
**Cold tier** (object storage): 90+ days
- Archive logs
- Query via restore
- Low cost
### Example: Elasticsearch ILM Policy
```json
{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": {
"max_size": "50GB",
"max_age": "1d"
}
}
},
"warm": {
"min_age": "7d",
"actions": {
"allocate": { "number_of_replicas": 1 },
"shrink": { "number_of_shards": 1 }
}
},
"cold": {
"min_age": "30d",
"actions": {
"allocate": { "require": { "box_type": "cold" } }
}
},
"delete": {
"min_age": "90d",
"actions": {
"delete": {}
}
}
}
}
}
```
---
## Security and Compliance
### PII Redaction
**Before logging**:
```python
import re
def redact_pii(data):
# Redact email
data = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b',
'[EMAIL]', data)
# Redact credit card
data = re.sub(r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',
'[CARD]', data)
# Redact SSN
data = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', data)
return data
logger.info("User data", user_input=redact_pii(user_input))
```
**In Logstash**:
```
filter {
mutate {
gsub => [
"message", "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b", "[EMAIL]",
"message", "\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b", "[CARD]"
]
}
}
```
### Access Control
**Elasticsearch** (with Security):
```yaml
# Role for developers
dev_logs:
indices:
- names: ['app-logs-*']
privileges: ['read']
query: '{"match": {"environment": "development"}}'
```
**CloudWatch** (IAM Policy):
```json
{
"Effect": "Allow",
"Action": [
"logs:DescribeLogGroups",
"logs:GetLogEvents",
"logs:FilterLogEvents"
],
"Resource": "arn:aws:logs:*:*:log-group:/aws/app/production:*"
}
```
---
## Common Pitfalls
### 1. Logging Sensitive Data
❌ `logger.info("Login", password=password)`
✅ `logger.info("Login", user_id=user_id)`
### 2. Excessive Logging
❌ Logging every iteration of a loop
✅ Log aggregate results or sample
### 3. Not Including Context
❌ `logger.error("Failed")`
✅ `logger.error("Payment failed", order_id=order_id, error=str(e))`
### 4. Inconsistent Formats
❌ Mix of JSON and plain text
✅ Pick one format and stick to it
### 5. No Request IDs
❌ Can't trace request across services
✅ Generate and propagate request_id
### 6. Logging to Multiple Places
❌ Log to file AND stdout AND syslog
✅ Log to stdout, let agent handle routing
### 7. Blocking on Log Writes
❌ Synchronous writes to remote systems
✅ Asynchronous buffered writes
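For pitfall 5, a common approach in Python is to keep the request ID in a `contextvars.ContextVar` and attach it to every record with a logging filter. A minimal sketch (the middleware hook and header name are assumptions, and your formatter must include `%(request_id)s`):
```python
import logging
import uuid
from contextvars import ContextVar

request_id_var: ContextVar[str] = ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Attach the current request_id to every log record."""
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

logger = logging.getLogger("app")
logger.addFilter(RequestIdFilter())

def handle_request(request):                      # framework-specific in practice
    request_id_var.set(request.headers.get("X-Request-ID", str(uuid.uuid4())))
    logger.info("request received")               # record now carries request_id
```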
---
## Performance Optimization
### 1. Async Logging
```python
import logging
import queue
from logging.handlers import QueueHandler, QueueListener
# Producers only enqueue records (non-blocking)
log_queue = queue.Queue()
logger = logging.getLogger("app")
logger.addHandler(QueueHandler(log_queue))
# Real handlers (console, file, HTTP, ...) run in a background thread
handlers = [logging.StreamHandler()]
listener = QueueListener(log_queue, *handlers)
listener.start()
```
### 2. Conditional Logging
```python
# Avoid expensive operations if not logging
if logger.isEnabledFor(logging.DEBUG):
logger.debug("Details", data=expensive_serialization(obj))
```
### 3. Batching
```python
# Batch logs before sending
batch = []
for log in logs:
    batch.append(log)
    if len(batch) >= 100:
        send_to_aggregator(batch)
        batch = []
if batch:
    send_to_aggregator(batch)  # flush the final partial batch
```
### 4. Compression
```yaml
# Filebeat with compression
output.logstash:
hosts: ["logstash:5044"]
compression_level: 3
```
---
## Monitoring Log Pipeline
Track pipeline health with metrics:
```promql
# Log ingestion rate
rate(logs_ingested_total[5m])
# Pipeline lag
log_processing_lag_seconds
# Dropped logs
rate(logs_dropped_total[5m])
# Error parsing rate
rate(logs_parse_errors_total[5m])
```
Alert on:
- Sudden drop in log volume (service down?)
- High parse error rate (format changed?)
- Pipeline lag > 1 minute (capacity issue?)

View File

@@ -0,0 +1,406 @@
# Metrics Design Guide
## The Four Golden Signals
The Four Golden Signals from Google's SRE book provide a comprehensive view of system health:
### 1. Latency
**What**: Time to service a request
**Why Monitor**: Directly impacts user experience
**Key Metrics**:
- Request duration (p50, p95, p99, p99.9)
- Time to first byte (TTFB)
- Backend processing time
- Database query latency
**PromQL Examples**:
```promql
# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Average latency by endpoint
avg(rate(http_request_duration_seconds_sum[5m])) by (endpoint)
/
avg(rate(http_request_duration_seconds_count[5m])) by (endpoint)
```
**Alert Thresholds**:
- Warning: p95 > 500ms
- Critical: p99 > 2s
### 2. Traffic
**What**: Demand on your system
**Why Monitor**: Understand load patterns, capacity planning
**Key Metrics**:
- Requests per second (RPS)
- Transactions per second (TPS)
- Concurrent connections
- Network throughput
**PromQL Examples**:
```promql
# Requests per second
sum(rate(http_requests_total[5m]))
# Requests per second by status code
sum(rate(http_requests_total[5m])) by (status)
# Traffic growth rate (week over week)
sum(rate(http_requests_total[5m]))
/
sum(rate(http_requests_total[5m] offset 7d))
```
**Alert Thresholds**:
- Warning: RPS > 80% of capacity
- Critical: RPS > 95% of capacity
### 3. Errors
**What**: Rate of requests that fail
**Why Monitor**: Direct indicator of user-facing problems
**Key Metrics**:
- Error rate (%)
- 5xx response codes
- Failed transactions
- Exception counts
**PromQL Examples**:
```promql
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# Error count by type
sum(rate(http_requests_total{status=~"5.."}[5m])) by (status)
# Application errors
rate(application_errors_total[5m])
```
**Alert Thresholds**:
- Warning: Error rate > 1%
- Critical: Error rate > 5%
### 4. Saturation
**What**: How "full" your service is
**Why Monitor**: Predict capacity issues before they impact users
**Key Metrics**:
- CPU utilization
- Memory utilization
- Disk I/O
- Network bandwidth
- Queue depth
- Thread pool usage
**PromQL Examples**:
```promql
# CPU saturation
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory saturation
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk saturation
rate(node_disk_io_time_seconds_total[5m]) * 100
# Queue depth
queue_depth_current / queue_depth_max * 100
```
**Alert Thresholds**:
- Warning: > 70% utilization
- Critical: > 90% utilization
---
## RED Method (for Services)
**R**ate, **E**rrors, **D**uration - a simplified approach for request-driven services
### Rate
Number of requests per second:
```promql
sum(rate(http_requests_total[5m]))
```
### Errors
Number of failed requests per second:
```promql
sum(rate(http_requests_total{status=~"5.."}[5m]))
```
### Duration
Time taken to process requests:
```promql
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
**When to Use**: Microservices, APIs, web applications
---
## USE Method (for Resources)
**U**tilization, **S**aturation, **E**rrors - for infrastructure resources
### Utilization
Percentage of time resource is busy:
```promql
# CPU utilization
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Disk utilization
(node_filesystem_size_bytes - node_filesystem_avail_bytes)
/ node_filesystem_size_bytes * 100
```
### Saturation
Amount of work the resource cannot service (queued):
```promql
# Load average (saturation indicator)
node_load15
# Disk I/O wait time
rate(node_disk_io_time_weighted_seconds_total[5m])
```
### Errors
Count of error events:
```promql
# Network errors
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])
# Disk errors
rate(node_disk_io_errors_total[5m])
```
**When to Use**: Servers, databases, network devices
---
## Metric Types
### Counter
Monotonically increasing value (never decreases)
**Examples**: Request count, error count, bytes sent
**Usage**:
```promql
# Always use rate() or increase() with counters
rate(http_requests_total[5m]) # Requests per second
increase(http_requests_total[1h]) # Total requests in 1 hour
```
### Gauge
Value that can go up and down
**Examples**: Memory usage, queue depth, concurrent connections
**Usage**:
```promql
# Use directly or with aggregations
avg(memory_usage_bytes)
max(queue_depth)
```
### Histogram
Samples observations and counts them in configurable buckets
**Examples**: Request duration, response size
**Usage**:
```promql
# Calculate percentiles
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Average from histogram
rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])
```
### Summary
Similar to histogram but calculates quantiles on client side
**Usage**: Less flexible than histograms, avoid for new metrics
---
## Cardinality Best Practices
**Cardinality**: Number of unique time series
### High Cardinality Labels (AVOID)
❌ User ID
❌ Email address
❌ IP address
❌ Timestamp
❌ Random IDs
### Low Cardinality Labels (GOOD)
✅ Environment (prod, staging)
✅ Region (us-east-1, eu-west-1)
✅ Service name
✅ HTTP status code category (2xx, 4xx, 5xx)
✅ Endpoint/route
### Calculating Cardinality Impact
```
Time series = unique combinations of labels
Example:
service (5) × environment (3) × region (4) × status (5) = 300 time series ✅
service (5) × environment (3) × region (4) × user_id (1M) = 60M time series ❌
```
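The same arithmetic as a two-line check before adding a label (the label counts are the illustrative ones above):
```python
from math import prod

low_card = {"service": 5, "environment": 3, "region": 4, "status": 5}
high_card = {"service": 5, "environment": 3, "region": 4, "user_id": 1_000_000}
print(f"{prod(low_card.values()):,} series")    # 300: fine
print(f"{prod(high_card.values()):,} series")   # 60,000,000: cardinality explosion
```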
---
## Naming Conventions
### Prometheus Naming
```
<namespace>_<name>_<unit>_total
Examples:
http_requests_total
http_request_duration_seconds
process_cpu_seconds_total
node_memory_MemAvailable_bytes
```
**Rules**:
- Use snake_case
- Include unit in name (seconds, bytes, ratio)
- Use `_total` suffix for counters
- Namespace by application/component
### CloudWatch Naming
```
<Namespace>/<MetricName>
Examples:
AWS/EC2/CPUUtilization
MyApp/RequestCount
```
**Rules**:
- Use PascalCase
- Group by namespace
- No unit in name (specified separately)
---
## Dashboard Design
### Key Principles
1. **Top-Down Layout**: Most important metrics first
2. **Color Coding**: Red (critical), yellow (warning), green (healthy)
3. **Consistent Time Windows**: All panels use same time range
4. **Limit Panels**: 8-12 panels per dashboard maximum
5. **Include Context**: Show related metrics together
### Dashboard Structure
```
┌─────────────────────────────────────────────┐
│ Overall Health (Single Stats) │
│ [Requests/s] [Error%] [P95 Latency] │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Request Rate & Errors (Graphs) │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Latency Distribution (Graphs) │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Resource Usage (Graphs) │
└─────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ Dependencies (Graphs) │
└─────────────────────────────────────────────┘
```
### Template Variables
Use variables for filtering:
- Environment: `$environment`
- Service: `$service`
- Region: `$region`
- Pod: `$pod`
---
## Common Pitfalls
### 1. Monitoring What You Build, Not What Users Experience
❌ `backend_processing_complete`
✅ `user_request_completed`
### 2. Too Many Metrics
- Start with Four Golden Signals
- Add metrics only when needed for specific issues
- Remove unused metrics
### 3. Incorrect Aggregations
❌ `avg(rate(http_request_duration_seconds_sum[5m]))` - averages per-instance rates and hides load imbalance
✅ `sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))` - aggregate before dividing
### 4. Wrong Time Windows
- Too short (< 1m): Noisy data
- Too long (> 15m): Miss short-lived issues
- Sweet spot: 5m for most alerts
### 5. Missing Labels
❌ `http_requests_total`
✅ `http_requests_total{method="GET", status="200", endpoint="/api/users"}`
---
## Metric Collection Best Practices
### Application Instrumentation
```python
from prometheus_client import Counter, Histogram, Gauge
# Counter for requests
requests_total = Counter('http_requests_total',
'Total HTTP requests',
['method', 'endpoint', 'status'])
# Histogram for latency
request_duration = Histogram('http_request_duration_seconds',
'HTTP request duration',
['method', 'endpoint'])
# Gauge for in-progress requests
requests_in_progress = Gauge('http_requests_in_progress',
'HTTP requests currently being processed')
```
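A sketch of how these objects might be wired into a request handler (the wrapper and label values are illustrative):

```python
import time

def handle_request(method: str, endpoint: str) -> None:
    """Illustrative wrapper that records the three metrics defined above."""
    requests_in_progress.inc()
    start = time.monotonic()
    status = "200"
    try:
        pass  # ... real handler logic would run here ...
    except Exception:
        status = "500"
        raise
    finally:
        requests_in_progress.dec()
        request_duration.labels(method=method, endpoint=endpoint).observe(
            time.monotonic() - start
        )
        requests_total.labels(method=method, endpoint=endpoint, status=status).inc()
```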
### Collection Intervals
- Application metrics: 15-30s
- Infrastructure metrics: 30-60s
- Billing/cost metrics: 5-15m
- External API checks: 1-5m
### Retention
- Raw metrics: 15-30 days
- 5m aggregates: 90 days
- 1h aggregates: 1 year
- Daily aggregates: 2+ years

652 references/slo_sla_guide.md Normal file
@@ -0,0 +1,652 @@
# SLI, SLO, and SLA Guide
## Definitions
### SLI (Service Level Indicator)
**What**: A quantitative measure of service quality
**Examples**:
- Request latency (ms)
- Error rate (%)
- Availability (%)
- Throughput (requests/sec)
### SLO (Service Level Objective)
**What**: Target value or range for an SLI
**Examples**:
- "99.9% of requests return in < 500ms"
- "99.95% availability"
- "Error rate < 0.1%"
### SLA (Service Level Agreement)
**What**: Business contract with consequences for SLO violations
**Examples**:
- "99.9% uptime or 10% monthly credit"
- "p95 latency < 1s or refund"
### Relationship
```
SLI = Measurement
SLO = Target (internal goal)
SLA = Promise (customer contract with penalties)
Example:
SLI: Actual availability this month = 99.92%
SLO: Target availability = 99.9%
SLA: Guaranteed availability = 99.5% (with penalties)
```
---
## Choosing SLIs
### The Four Golden Signals as SLIs
1. **Latency SLIs**
- Request duration (p50, p95, p99)
- Time to first byte
- Page load time
2. **Availability/Success SLIs**
- % of successful requests
- % uptime
- % of requests completing
3. **Throughput SLIs** (less common)
- Requests per second
- Transactions per second
4. **Saturation SLIs** (internal only)
- Resource utilization
- Queue depth
### SLI Selection Criteria
**Good SLIs**:
- Measured from user perspective
- Directly impact user experience
- Aggregatable across instances
- Proportional to user happiness
**Bad SLIs**:
- Internal metrics only
- Not user-facing
- Hard to measure consistently
### Examples by Service Type
**Web Application**:
```
SLI 1: Request Success Rate
= successful_requests / total_requests
SLI 2: Request Latency (p95)
= 95th percentile of response times
SLI 3: Availability
= time_service_responding / total_time
```
**API Service**:
```
SLI 1: Error Rate
= (4xx_errors + 5xx_errors) / total_requests
SLI 2: Response Time (p99)
= 99th percentile latency
SLI 3: Throughput
= requests_per_second
```
**Batch Processing**:
```
SLI 1: Job Success Rate
= successful_jobs / total_jobs
SLI 2: Processing Latency
= time_from_submission_to_completion
SLI 3: Freshness
= age_of_oldest_unprocessed_item
```
**Storage Service**:
```
SLI 1: Durability
= data_not_lost / total_data
SLI 2: Read Latency (p99)
= 99th percentile read time
SLI 3: Write Success Rate
= successful_writes / total_writes
```
---
## Setting SLO Targets
### Start with Current Performance
1. **Measure baseline**: Collect 30 days of data
2. **Analyze distribution**: Look at p50, p95, p99, p99.9
3. **Set initial SLO**: Slightly looser than current performance (achievable from day one)
4. **Iterate**: Tighten or loosen based on feasibility
### Example Process
**Current Performance** (30 days):
```
p50 latency: 120ms
p95 latency: 450ms
p99 latency: 1200ms
p99.9 latency: 3500ms
Error rate: 0.05%
Availability: 99.95%
```
**Initial SLOs**:
```
Latency: p95 < 500ms (slightly worse than current p95)
Error rate: < 0.1% (double current rate)
Availability: 99.9% (slightly worse than current)
```
**Rationale**: Start loose, prevent false alarms, tighten over time
### Common SLO Targets
**Availability**:
- **99%** (3.65 days downtime/year): Internal tools
- **99.5%** (1.83 days/year): Non-critical services
- **99.9%** (8.76 hours/year): Standard production
- **99.95%** (4.38 hours/year): Critical services
- **99.99%** (52 minutes/year): High availability
- **99.999%** (5 minutes/year): Mission critical
**Latency**:
- **p50 < 100ms**: Excellent responsiveness
- **p95 < 500ms**: Standard web applications
- **p99 < 1s**: Acceptable for most users
- **p99.9 < 5s**: Acceptable for rare edge cases
**Error Rate**:
- **< 0.01%** (99.99% success): Critical operations
- **< 0.1%** (99.9% success): Standard production
- **< 1%** (99% success): Non-critical services
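The availability figures above follow directly from the target. A small helper reproduces them (a sketch, assuming a 365-day year and a 30-day month):

```python
def allowed_downtime(slo: float, period_days: float) -> float:
    """Allowed downtime in minutes for an availability SLO over a period."""
    return (1 - slo) * period_days * 24 * 60

for slo in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{slo:.3%}: {allowed_downtime(slo, 365):.1f} min/year, "
          f"{allowed_downtime(slo, 30):.1f} min/month")
```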
---
## Error Budgets
### Concept
Error budget = (100% - SLO target)
If SLO is 99.9%, error budget is 0.1%
**Purpose**: Balance reliability with feature velocity
### Calculation
**For availability**:
```
Monthly error budget = (1 - SLO) × time_period
Example (99.9% SLO, 30 days):
Error budget = 0.001 × 30 days = 0.03 days = 43.2 minutes
```
**For request-based SLIs**:
```
Error budget = (1 - SLO) × total_requests
Example (99.9% SLO, 10M requests/month):
Error budget = 0.001 × 10,000,000 = 10,000 failed requests
```
### Error Budget Consumption
**Formula**:
```
Budget consumed = actual_errors / allowed_errors × 100%
Example:
SLO: 99.9% (0.1% error budget)
Total requests: 1,000,000
Failed requests: 500
Allowed failures: 1,000
Budget consumed = 500 / 1,000 × 100% = 50%
Budget remaining = 50%
```
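The same arithmetic in code, for both time-based and request-based budgets (a minimal sketch of the formulas above):

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime in minutes over the window for a time-based SLO."""
    return (1 - slo) * days * 24 * 60

def budget_consumed(failed: int, total: int, slo: float) -> float:
    """Fraction of a request-based error budget consumed (1.0 = exhausted)."""
    allowed = (1 - slo) * total
    return failed / allowed if allowed else float("inf")

print(error_budget_minutes(0.999))             # 43.2 minutes per 30 days
print(budget_consumed(500, 1_000_000, 0.999))  # 0.5 -> 50% consumed, 50% remaining
```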
### Error Budget Policy
**Example policy**:
```markdown
## Error Budget Policy
### If error budget > 50%
- Deploy frequently (multiple times per day)
- Take calculated risks
- Experiment with new features
- Acceptable to have some incidents
### If error budget 20-50%
- Deploy normally (once per day)
- Increase testing
- Review recent changes
- Monitor closely
### If error budget < 20%
- Freeze non-critical deploys
- Focus on reliability improvements
- Postmortem all incidents
- Reduce change velocity
### If error budget exhausted (< 0%)
- Complete deploy freeze except rollbacks
- All hands on reliability
- Mandatory postmortems
- Executive escalation
```
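If the policy is enforced in CI/CD, it can be encoded as a simple lookup. A sketch of the thresholds above (the returned strings are illustrative):

```python
def deploy_policy(budget_remaining: float) -> str:
    """Map remaining error budget (0.0-1.0) to the policy tiers above."""
    if budget_remaining > 0.5:
        return "deploy freely, take calculated risks"
    if budget_remaining > 0.2:
        return "deploy normally, increase testing"
    if budget_remaining > 0.0:
        return "freeze non-critical deploys, focus on reliability"
    return "full deploy freeze except rollbacks"

print(deploy_policy(0.73))   # deploy freely, take calculated risks
```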
---
## Error Budget Burn Rate
### Concept
Burn rate = rate of error budget consumption
**Example**:
- Monthly budget: 43.2 minutes (99.9% SLO)
- If consuming at 2x rate: Budget exhausted in 15 days
- If consuming at 10x rate: Budget exhausted in 3 days
### Burn Rate Calculation
```
Burn rate = (actual_error_rate / allowed_error_rate)
Example:
SLO: 99.9% (0.1% allowed error rate)
Current error rate: 0.5%
Burn rate = 0.5% / 0.1% = 5x
Time to exhaust = 30 days / 5 = 6 days
```
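As code, burn rate and projected time to exhaustion follow directly from the definitions above (a sketch; a 30-day window is assumed):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    return error_rate / (1 - slo)

def days_to_exhaustion(rate: float, window_days: float = 30) -> float:
    return window_days / rate if rate > 0 else float("inf")

rate = burn_rate(error_rate=0.005, slo=0.999)
print(round(rate, 1), round(days_to_exhaustion(rate), 1))   # 5.0 6.0 -> gone in ~6 days
```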
### Multi-Window Alerting
Alert on burn rate across multiple time windows:
**Fast burn** (1 hour window):
```
Burn rate > 14.4x → Exhausts budget in 2 days
Alert after 2 minutes
Severity: Critical (page immediately)
```
**Moderate burn** (6 hour window):
```
Burn rate > 6x → Exhausts budget in 5 days
Alert after 30 minutes
Severity: Warning (create ticket)
```
**Slow burn** (3 day window):
```
Burn rate > 1x → Exhausts budget by end of month
Alert after 6 hours
Severity: Info (monitor)
```
### Implementation
**Prometheus**:
```yaml
# Fast burn alert (1h window, 2m grace period)
- alert: ErrorBudgetFastBurn
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
) > (14.4 * 0.001) # 14.4x burn rate for 99.9% SLO
for: 2m
labels:
severity: critical
annotations:
summary: "Fast error budget burn detected"
description: "Error budget will be exhausted in 2 days at current rate"
# Slow burn alert (6h window, 30m grace period)
- alert: ErrorBudgetSlowBurn
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[6h]))
/
sum(rate(http_requests_total[6h]))
) > (6 * 0.001) # 6x burn rate for 99.9% SLO
for: 30m
labels:
severity: warning
annotations:
summary: "Elevated error budget burn detected"
```
---
## SLO Reporting
### Dashboard Structure
**Overall Health**:
```
┌─────────────────────────────────────────┐
│ SLO Compliance: 99.92% ✅ │
│ Error Budget Remaining: 73% 🟢 │
│ Burn Rate: 0.8x 🟢 │
└─────────────────────────────────────────┘
```
**SLI Performance**:
```
Latency p95: 420ms (Target: 500ms) ✅
Error Rate: 0.08% (Target: < 0.1%) ✅
Availability: 99.95% (Target: > 99.9%) ✅
```
**Error Budget Trend**:
```
Graph showing:
- Error budget consumption over time
- Burn rate spikes
- Incidents marked
- Deploy events overlaid
```
### Monthly SLO Report
**Template**:
```markdown
# SLO Report: October 2024
## Executive Summary
- ✅ All SLOs met this month
- 🟡 Latency SLO came close to violation (99.1% compliance)
- 3 incidents consumed 53% of the error budget
- Error budget remaining: 47%
## SLO Performance
### Availability SLO: 99.9%
- Actual: 99.95%
- Status: ✅ Met
- Error budget consumed: 53%
- Downtime: 23 minutes (allowed: 43.2 minutes)
### Latency SLO: p95 < 500ms
- Actual p95: 445ms
- Status: ✅ Met
- Compliance: 99.1% (target: 99%)
- 0.9% of requests exceeded threshold
### Error Rate SLO: < 0.1%
- Actual: 0.05%
- Status: ✅ Met
- Error budget consumed: 50%
## Incidents
### Incident #1: Database Overload (Oct 5)
- Duration: 15 minutes
- Error budget consumed: 35%
- Root cause: Slow query after schema change
- Prevention: Added query review to deploy checklist
### Incident #2: API Gateway Timeout (Oct 12)
- Duration: 5 minutes
- Error budget consumed: 12%
- Root cause: Configuration error in load balancer
- Prevention: Automated configuration validation
### Incident #3: Upstream Service Degradation (Oct 20)
- Duration: 3 minutes
- Error budget consumed: 7%
- Root cause: Third-party API outage
- Prevention: Implemented circuit breaker
## Recommendations
1. Investigate latency near-miss (Oct 15-17)
2. Add automated rollback for database changes
3. Increase circuit breaker thresholds for third-party APIs
4. Consider tightening availability SLO to 99.95%
## Next Month's Focus
- Reduce p95 latency to 400ms
- Implement automated canary deployments
- Add synthetic monitoring for critical paths
```
---
## SLA Structure
### Components
**Service Description**:
```
The API Service provides RESTful endpoints for user management,
authentication, and data retrieval.
```
**Covered Metrics**:
```
- Availability: Service is reachable and returns valid responses
- Latency: Time from request to response
- Error Rate: Percentage of requests returning errors
```
**SLA Targets**:
```
Service commits to:
1. 99.9% monthly uptime
2. p95 API response time < 1 second
3. Error rate < 0.5%
```
**Measurement**:
```
Metrics calculated from server-side monitoring:
- Uptime: Successful health check probes / total probes
- Latency: Server-side request duration (p95)
- Errors: HTTP 5xx responses / total responses
Calculated monthly (first of month for previous month).
```
**Exclusions**:
```
SLA does not cover:
- Scheduled maintenance (with 7 days notice)
- Client-side network issues
- DDoS attacks or force majeure
- Beta/preview features
- Issues caused by customer misuse
```
**Service Credits**:
```
Monthly Uptime | Service Credit
---------------- | --------------
< 99.9% (SLA) | 10%
< 99.0% | 25%
< 95.0% | 50%
```
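Credit calculation from such a table is straightforward to automate. A sketch matching the tiers above:

```python
def service_credit(monthly_uptime: float) -> int:
    """Return the credit percentage for a month's measured uptime."""
    if monthly_uptime < 0.95:
        return 50
    if monthly_uptime < 0.99:
        return 25
    if monthly_uptime < 0.999:
        return 10
    return 0

print(service_credit(0.9985))   # 10
```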
**Claiming Credits**:
```
To claim credits, the customer must:
1. Report the violation within 30 days
2. Provide ticket numbers for the related support requests
Credits are applied to the next month's invoice and are capped at the monthly fee.
```
### Example SLAs by Industry
**E-commerce**:
```
- 99.95% availability
- p95 page load < 2s
- p99 checkout < 5s
- Credits: 5% per 0.1% below target
```
**Financial Services**:
```
- 99.99% availability
- p99 transaction < 500ms
- Zero data loss
- Penalties: $10,000 per hour of downtime
```
**Media/Content**:
```
- 99.9% availability
- p95 video start < 3s
- No credit system (best effort latency)
```
---
## Best Practices
### 1. SLOs Should Be User-Centric
❌ "Database queries < 100ms"
✅ "API response time p95 < 500ms"
### 2. Start Loose, Tighten Over Time
- Begin with achievable targets
- Build reliability culture
- Gradually raise bar
### 3. Fewer, Better SLOs
- 1-3 SLOs per service
- Focus on user impact
- Avoid SLO sprawl
### 4. SLAs More Conservative Than SLOs
```
Internal SLO: 99.95%
Customer SLA: 99.9%
Margin: 0.05% buffer
```
### 5. Make Error Budgets Actionable
- Define policies at different thresholds
- Empower teams to make tradeoffs
- Review in planning meetings
### 6. Document Everything
- How SLIs are measured
- Why targets were chosen
- Who owns each SLO
- How to interpret metrics
### 7. Review Regularly
- Monthly SLO reviews
- Quarterly SLO adjustments
- Annual SLA renegotiation
---
## Common Pitfalls
### 1. Too Many SLOs
❌ 20 different SLOs per service
✅ 2-3 critical SLOs
### 2. Unrealistic Targets
❌ 99.999% for non-critical service
✅ 99.9% with room to improve
### 3. SLOs Without Error Budgets
❌ "Must always be 99.9%"
✅ "Budget for 0.1% errors"
### 4. No Consequences
❌ Missing SLO has no impact
✅ Deploy freeze when budget exhausted
### 5. SLA Equals SLO
❌ Promise exactly what you target
✅ SLA more conservative than SLO
### 6. Ignoring User Experience
❌ "Our servers are up 99.99%"
✅ "Users can complete actions 99.9% of the time"
### 7. Static Targets
❌ Set once, never revisit
✅ Quarterly reviews and adjustments
---
## Tools and Automation
### SLO Tracking Tools
**Prometheus + Grafana**:
- Use recording rules for SLIs
- Alert on burn rates
- Dashboard for compliance
**Google Cloud SLO Monitoring**:
- Built-in SLO tracking
- Automatic error budget calculation
- Integration with alerting
**Datadog SLOs**:
- UI for SLO definition
- Automatic burn rate alerts
- Status pages
**Custom Tools**:
- sloth: Generate Prometheus rules from SLO definitions
- slo-libsonnet: Jsonnet library for SLO monitoring
### Example: Prometheus Recording Rules
```yaml
groups:
- name: sli_recording
interval: 30s
rules:
# SLI: Request success rate
- record: sli:request_success:ratio
expr: |
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# SLI: Request latency (p95)
- record: sli:request_latency:p95
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# Error budget burn rate (1h window)
- record: slo:error_budget_burn_rate:1h
expr: |
(1 - sli:request_success:ratio) / 0.001
```
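For reporting, the recorded SLI can be averaged over the window via the Prometheus HTTP API. A minimal sketch (the endpoint URL is an assumption; the rule name matches the recording rule above):

```python
import requests

PROMETHEUS = "http://localhost:9090"  # assumed endpoint

def compliance(expr: str = "sli:request_success:ratio", window: str = "30d") -> float:
    """Average an SLI expression over a window using a PromQL subquery."""
    query = f"avg_over_time(({expr})[{window}:5m])"
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query}, timeout=30)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else float("nan")

print(f"30-day compliance: {compliance() * 100:.3f}%")
```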

697 references/tool_comparison.md Normal file
@@ -0,0 +1,697 @@
# Monitoring Tools Comparison
## Overview Matrix
| Tool | Type | Best For | Complexity | Cost | Cloud/Self-Hosted |
|------|------|----------|------------|------|-------------------|
| **Prometheus** | Metrics | Kubernetes, time-series | Medium | Free | Self-hosted |
| **Grafana** | Visualization | Dashboards, multi-source | Low-Medium | Free | Both |
| **Datadog** | Full-stack | Ease of use, APM | Low | High | Cloud |
| **New Relic** | Full-stack | APM, traces | Low | High | Cloud |
| **Elasticsearch (ELK)** | Logs | Log search, analysis | High | Medium | Both |
| **Grafana Loki** | Logs | Cost-effective logs | Medium | Free | Both |
| **CloudWatch** | AWS-native | AWS infrastructure | Low | Medium | Cloud |
| **Jaeger** | Tracing | Distributed tracing | Medium | Free | Self-hosted |
| **Grafana Tempo** | Tracing | Cost-effective tracing | Medium | Free | Self-hosted |
---
## Metrics Platforms
### Prometheus
**Type**: Open-source time-series database
**Strengths**:
- ✅ Industry standard for Kubernetes
- ✅ Powerful query language (PromQL)
- ✅ Pull-based model (no agent config)
- ✅ Service discovery
- ✅ Free and open source
- ✅ Huge ecosystem (exporters for everything)
**Weaknesses**:
- ❌ No built-in dashboards (need Grafana)
- ❌ Single-node only (no HA without federation)
- ❌ Limited long-term storage (need Thanos/Cortex)
- ❌ Steep learning curve for PromQL
**Best For**:
- Kubernetes monitoring
- Infrastructure metrics
- Custom application metrics
- Organizations that need control
**Pricing**: Free (open source)
**Setup Complexity**: Medium
**Example**:
```yaml
# prometheus.yml
scrape_configs:
- job_name: 'app'
static_configs:
- targets: ['localhost:8080']
```
---
### Datadog
**Type**: SaaS monitoring platform
**Strengths**:
- ✅ Easy to set up (install agent, done)
- ✅ Beautiful pre-built dashboards
- ✅ APM, logs, metrics, traces in one platform
- ✅ Great anomaly detection
- ✅ Excellent integrations (500+)
- ✅ Good mobile app
**Weaknesses**:
- ❌ Very expensive at scale
- ❌ Vendor lock-in
- ❌ Cost can be unpredictable (per-host pricing)
- ❌ Limited PromQL support
**Best For**:
- Teams that want quick setup
- Companies prioritizing ease of use over cost
- Organizations needing full observability
**Pricing**: $15-$31/host/month + custom metrics fees
**Setup Complexity**: Low
**Example**:
```bash
# Install agent
DD_API_KEY=xxx bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh)"
```
---
### New Relic
**Type**: SaaS application performance monitoring
**Strengths**:
- ✅ Excellent APM capabilities
- ✅ User-friendly interface
- ✅ Good transaction tracing
- ✅ Comprehensive alerting
- ✅ Generous free tier
**Weaknesses**:
- ❌ Can get expensive at scale
- ❌ Vendor lock-in
- ❌ Query language less powerful than PromQL
- ❌ Limited customization
**Best For**:
- Application performance monitoring
- Teams focused on APM over infrastructure
- Startups (free tier is generous)
**Pricing**: Free up to 100GB/month, then $0.30/GB
**Setup Complexity**: Low
**Example**:
```python
import newrelic.agent
newrelic.agent.initialize('newrelic.ini')
```
---
### CloudWatch
**Type**: AWS-native monitoring
**Strengths**:
- ✅ Zero setup for AWS services
- ✅ Native integration with AWS
- ✅ Automatic dashboards for AWS resources
- ✅ Tightly integrated with other AWS services
- ✅ Good for cost if already on AWS
**Weaknesses**:
- ❌ AWS-only (not multi-cloud)
- ❌ Limited query capabilities
- ❌ High costs for custom metrics
- ❌ Basic visualization
- ❌ 1-minute minimum resolution
**Best For**:
- AWS-centric infrastructure
- Quick setup for AWS services
- Organizations already invested in AWS
**Pricing**:
- First 10 custom metrics: Free
- Additional: $0.30/metric/month
- API calls: $0.01/1000 requests
**Setup Complexity**: Low (for AWS), Medium (for custom metrics)
**Example**:
```python
import boto3
cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_data(
Namespace='MyApp',
MetricData=[{'MetricName': 'RequestCount', 'Value': 1}]
)
```
---
### Grafana Cloud / Mimir
**Type**: Managed Prometheus-compatible
**Strengths**:
- ✅ Prometheus-compatible (PromQL)
- ✅ Managed service (no ops burden)
- ✅ Good cost model (pay for what you use)
- ✅ Grafana dashboards included
- ✅ Long-term storage
**Weaknesses**:
- ❌ Relatively new (less mature)
- ❌ Some Prometheus features missing
- ❌ Requires Grafana for visualization
**Best For**:
- Teams wanting Prometheus without ops overhead
- Multi-cloud environments
- Organizations already using Grafana
**Pricing**: $8/month + $0.29/1M samples
**Setup Complexity**: Low-Medium
---
## Logging Platforms
### Elasticsearch (ELK Stack)
**Type**: Open-source log search and analytics
**Full Stack**: Elasticsearch + Logstash + Kibana
**Strengths**:
- ✅ Powerful search capabilities
- ✅ Rich query language
- ✅ Great for log analysis
- ✅ Mature ecosystem
- ✅ Can handle large volumes
- ✅ Flexible data model
**Weaknesses**:
- ❌ Complex to operate
- ❌ Resource intensive (RAM hungry)
- ❌ Expensive at scale
- ❌ Requires dedicated ops team
- ❌ Slow for high-cardinality queries
**Best For**:
- Large organizations with ops teams
- Deep log analysis needs
- Search-heavy use cases
**Pricing**: Free (open source) + infrastructure costs
**Infrastructure**: ~$500-2000/month for medium scale
**Setup Complexity**: High
**Example**:
```json
PUT /logs-2024.10/_doc/1
{
"timestamp": "2024-10-28T14:32:15Z",
"level": "error",
"message": "Payment failed"
}
```
---
### Grafana Loki
**Type**: Log aggregation system
**Strengths**:
- ✅ Cost-effective (labels only, not full-text indexing)
- ✅ Easy to operate
- ✅ Prometheus-like label model
- ✅ Great Grafana integration
- ✅ Low resource usage
- ✅ Fast time-range queries
**Weaknesses**:
- ❌ Limited full-text search
- ❌ Requires careful label design
- ❌ Younger ecosystem than ELK
- ❌ Not ideal for complex queries
**Best For**:
- Cost-conscious organizations
- Kubernetes environments
- Teams already using Prometheus
- Time-series log queries
**Pricing**: Free (open source) + infrastructure costs
**Infrastructure**: ~$100-500/month for medium scale
**Setup Complexity**: Medium
**Example**:
```logql
{job="api", environment="prod"} |= "error" | json | level="error"
```
---
### Splunk
**Type**: Enterprise log management
**Strengths**:
- ✅ Extremely powerful search
- ✅ Great for security/compliance
- ✅ Mature platform
- ✅ Enterprise support
- ✅ Machine learning features
**Weaknesses**:
- ❌ Very expensive
- ❌ Complex pricing (per GB ingested)
- ❌ Steep learning curve
- ❌ Heavy resource usage
**Best For**:
- Large enterprises
- Security operations centers (SOCs)
- Compliance-heavy industries
**Pricing**: $150-$1800/GB/month (depending on tier)
**Setup Complexity**: Medium-High
---
### CloudWatch Logs
**Type**: AWS-native log management
**Strengths**:
- ✅ Zero setup for AWS services
- ✅ Integrated with AWS ecosystem
- ✅ CloudWatch Insights for queries
- ✅ Reasonable cost for low volume
**Weaknesses**:
- ❌ AWS-only
- ❌ Limited query capabilities
- ❌ Expensive at high volume
- ❌ Basic visualization
**Best For**:
- AWS-centric applications
- Low-volume logging
- Simple log aggregation
**Pricing**: Tiered (as of May 2025)
- Vended Logs: $0.50/GB (first 10TB), $0.25/GB (next 20TB), then lower tiers
- Standard logs: $0.50/GB flat
- Storage: $0.03/GB
**Setup Complexity**: Low (AWS), Medium (custom)
---
### Sumo Logic
**Type**: SaaS log management
**Strengths**:
- ✅ Easy to use
- ✅ Good for cloud-native apps
- ✅ Real-time analytics
- ✅ Good compliance features
**Weaknesses**:
- ❌ Expensive at scale
- ❌ Vendor lock-in
- ❌ Limited customization
**Best For**:
- Cloud-native applications
- Teams wanting managed solution
- Security and compliance use cases
**Pricing**: $90-$180/GB/month
**Setup Complexity**: Low
---
## Tracing Platforms
### Jaeger
**Type**: Open-source distributed tracing
**Strengths**:
- ✅ Industry standard
- ✅ CNCF graduated project
- ✅ Supports OpenTelemetry
- ✅ Good UI
- ✅ Free and open source
**Weaknesses**:
- ❌ Requires separate storage backend
- ❌ Limited query capabilities
- ❌ No built-in analytics
**Best For**:
- Microservices architectures
- Kubernetes environments
- OpenTelemetry users
**Pricing**: Free (open source) + storage costs
**Setup Complexity**: Medium
---
### Grafana Tempo
**Type**: Open-source distributed tracing
**Strengths**:
- ✅ Cost-effective (object storage)
- ✅ Easy to operate
- ✅ Great Grafana integration
- ✅ TraceQL query language
- ✅ Supports OpenTelemetry
**Weaknesses**:
- ❌ Younger than Jaeger
- ❌ Limited third-party integrations
- ❌ Requires Grafana for UI
**Best For**:
- Cost-conscious organizations
- Teams using Grafana stack
- High trace volumes
**Pricing**: Free (open source) + storage costs
**Setup Complexity**: Medium
---
### Datadog APM
**Type**: SaaS application performance monitoring
**Strengths**:
- ✅ Easy to set up
- ✅ Excellent trace visualization
- ✅ Integrated with metrics/logs
- ✅ Automatic service map
- ✅ Good profiling features
**Weaknesses**:
- ❌ Expensive ($31/host/month)
- ❌ Vendor lock-in
- ❌ Limited sampling control
**Best For**:
- Teams wanting ease of use
- Organizations already using Datadog
- Complex microservices
**Pricing**: $31/host/month + $1.70/million spans
**Setup Complexity**: Low
---
### AWS X-Ray
**Type**: AWS-native distributed tracing
**Strengths**:
- ✅ Native AWS integration
- ✅ Automatic instrumentation for AWS services
- ✅ Low cost
**Weaknesses**:
- ❌ AWS-only
- ❌ Basic UI
- ❌ Limited query capabilities
**Best For**:
- AWS-centric applications
- Serverless architectures (Lambda)
- Cost-sensitive projects
**Pricing**: $5/million traces, first 100k free/month
**Setup Complexity**: Low (AWS), Medium (custom)
---
## Full-Stack Observability
### Datadog (Full Platform)
**Components**: Metrics, logs, traces, RUM, synthetics
**Strengths**:
- ✅ Everything in one platform
- ✅ Excellent user experience
- ✅ Correlation across signals
- ✅ Great for teams
**Weaknesses**:
- ❌ Very expensive ($50-100+/host/month)
- ❌ Vendor lock-in
- ❌ Unpredictable costs
**Total Cost** (example 100 hosts):
- Infrastructure: $3,100/month
- APM: $3,100/month
- Logs: ~$2,000/month
- **Total: ~$8,000/month**
---
### Grafana Stack (LGTM)
**Components**: Loki (logs), Grafana (viz), Tempo (traces), Mimir/Prometheus (metrics)
**Strengths**:
- ✅ Open source and cost-effective
- ✅ Unified visualization
- ✅ Prometheus-compatible
- ✅ Great for cloud-native
**Weaknesses**:
- ❌ Requires self-hosting or Grafana Cloud
- ❌ More ops burden
- ❌ Less polished than commercial tools
**Total Cost** (self-hosted, 100 hosts):
- Infrastructure: ~$1,500/month
- Ops time: Variable
- **Total: ~$1,500-3,000/month**
---
### Elastic Observability
**Components**: Elasticsearch (logs), Kibana (viz), APM, metrics
**Strengths**:
- ✅ Powerful search
- ✅ Mature platform
- ✅ Good for log-heavy use cases
**Weaknesses**:
- ❌ Complex to operate
- ❌ Expensive infrastructure
- ❌ Resource intensive
**Total Cost** (self-hosted, 100 hosts):
- Infrastructure: ~$3,000-5,000/month
- Ops time: High
- **Total: ~$4,000-7,000/month**
---
### New Relic One
**Components**: Metrics, logs, traces, synthetics
**Strengths**:
- ✅ Generous free tier (100GB)
- ✅ User-friendly
- ✅ Good for startups
**Weaknesses**:
- ❌ Costs increase quickly after free tier
- ❌ Vendor lock-in
**Total Cost**:
- Free: up to 100GB/month
- Paid: $0.30/GB beyond 100GB
---
## Cloud Provider Native
### AWS (CloudWatch + X-Ray)
**Use When**:
- Primarily on AWS
- Simple monitoring needs
- Want minimal setup
**Avoid When**:
- Multi-cloud environment
- Need advanced features
- High log volume (expensive)
**Cost** (example):
- 100 EC2 instances with basic metrics: ~$150/month
- 1TB logs: ~$500/month ingestion + storage
- X-Ray: ~$50/month
---
### GCP (Cloud Monitoring + Cloud Trace)
**Use When**:
- Primarily on GCP
- Using GKE
- Want tight GCP integration
**Avoid When**:
- Multi-cloud environment
- Need advanced querying
**Cost** (example):
- First 150MB/month per resource: Free
- Additional: $0.2508/MB
---
### Azure (Azure Monitor)
**Use When**:
- Primarily on Azure
- Using AKS
- Need Azure integration
**Avoid When**:
- Multi-cloud
- Need advanced features
**Cost** (example):
- First 5GB: Free
- Additional: $2.76/GB
---
## Decision Matrix
### Choose Prometheus + Grafana If:
- ✅ Using Kubernetes
- ✅ Want control and customization
- ✅ Have ops capacity
- ✅ Budget-conscious
- ✅ Need Prometheus ecosystem
### Choose Datadog If:
- ✅ Want ease of use
- ✅ Need full observability now
- ✅ Budget allows ($8k+/month for 100 hosts)
- ✅ Limited ops team
- ✅ Need excellent UX
### Choose ELK If:
- ✅ Heavy log analysis needs
- ✅ Need powerful search
- ✅ Have dedicated ops team
- ✅ Compliance requirements
- ✅ Willing to invest in infrastructure
### Choose Grafana Stack (LGTM) If:
- ✅ Want open source full stack
- ✅ Cost-effective solution
- ✅ Cloud-native architecture
- ✅ Already using Prometheus
- ✅ Have some ops capacity
### Choose New Relic If:
- ✅ Startup with free tier
- ✅ APM is priority
- ✅ Want easy setup
- ✅ Don't need heavy customization
### Choose Cloud Native (CloudWatch/etc) If:
- ✅ Single cloud provider
- ✅ Simple needs
- ✅ Want minimal setup
- ✅ Low to medium scale
---
## Cost Comparison
**Example: 100 hosts, 1TB logs/month, 1M spans/day**
| Solution | Monthly Cost | Setup | Ops Burden |
|----------|-------------|--------|------------|
| **Prometheus + Loki + Tempo** | $1,500 | Medium | Medium |
| **Grafana Cloud** | $3,000 | Low | Low |
| **Datadog** | $8,000 | Low | None |
| **New Relic** | $3,500 | Low | None |
| **ELK Stack** | $4,000 | High | High |
| **CloudWatch** | $2,000 | Low | Low |
---
## Recommendations by Company Size
### Startup (< 10 engineers)
**Recommendation**: New Relic or Grafana Cloud
- Minimal ops burden
- Good free tiers
- Easy to get started
### Small Company (10-50 engineers)
**Recommendation**: Prometheus + Grafana + Loki (self-hosted or cloud)
- Cost-effective
- Growing ops capacity
- Flexibility
### Medium Company (50-200 engineers)
**Recommendation**: Datadog or Grafana Stack
- Datadog if budget allows
- Grafana Stack if cost-conscious
### Large Enterprise (200+ engineers)
**Recommendation**: Build observability platform
- Mix of tools based on needs
- Dedicated observability team
- Custom integrations

663 references/tracing_guide.md Normal file
@@ -0,0 +1,663 @@
# Distributed Tracing Guide
## What is Distributed Tracing?
Distributed tracing tracks a request as it flows through multiple services in a distributed system.
### Key Concepts
**Trace**: End-to-end journey of a request
**Span**: Single operation within a trace
**Context**: Metadata propagated between services (trace_id, span_id)
### Example Flow
```
User Request → API Gateway → Auth Service → User Service → Database
↓ ↓ ↓
[Trace ID: abc123]
Span 1: gateway (50ms)
Span 2: auth (20ms)
Span 3: user_service (100ms)
Span 4: db_query (80ms)
Total: 250ms with waterfall view showing dependencies
```
---
## OpenTelemetry (OTel)
OpenTelemetry is the industry standard for instrumentation.
### Components
**API**: Instrument code (create spans, add attributes)
**SDK**: Implement API, configure exporters
**Collector**: Receive, process, and export telemetry data
**Exporters**: Send data to backends (Jaeger, Tempo, Zipkin)
### Architecture
```
Application → OTel SDK → OTel Collector → Backend (Jaeger/Tempo)
                                                ↓
                                          Visualization
```
---
## Instrumentation Examples
### Python (using OpenTelemetry)
**Setup**:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
# Setup tracer
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
# Configure exporter
otlp_exporter = OTLPSpanExporter(endpoint="localhost:4317")
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)
```
**Manual instrumentation**:
```python
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
@tracer.start_as_current_span("process_order")
def process_order(order_id):
span = trace.get_current_span()
span.set_attribute("order.id", order_id)
span.set_attribute("order.amount", 99.99)
try:
result = payment_service.charge(order_id)
span.set_attribute("payment.status", "success")
return result
except Exception as e:
span.set_status(trace.Status(trace.StatusCode.ERROR))
span.record_exception(e)
raise
```
**Auto-instrumentation** (Flask example):
```python
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
# Auto-instrument Flask
FlaskInstrumentor().instrument_app(app)
# Auto-instrument requests library
RequestsInstrumentor().instrument()
# Auto-instrument SQLAlchemy
SQLAlchemyInstrumentor().instrument(engine=db.engine)
```
### Node.js (using OpenTelemetry)
**Setup**:
```javascript
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
// Setup provider
const provider = new NodeTracerProvider();
const exporter = new OTLPTraceExporter({ url: 'localhost:4317' });
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();
```
**Manual instrumentation**:
```javascript
const { SpanStatusCode } = require('@opentelemetry/api');
const tracer = provider.getTracer('my-service');
async function processOrder(orderId) {
const span = tracer.startSpan('process_order');
span.setAttribute('order.id', orderId);
try {
const result = await paymentService.charge(orderId);
span.setAttribute('payment.status', 'success');
return result;
} catch (error) {
span.setStatus({ code: SpanStatusCode.ERROR });
span.recordException(error);
throw error;
} finally {
span.end();
}
}
```
**Auto-instrumentation**:
```javascript
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const { MongoDBInstrumentation } = require('@opentelemetry/instrumentation-mongodb');
registerInstrumentations({
instrumentations: [
new HttpInstrumentation(),
new ExpressInstrumentation(),
new MongoDBInstrumentation()
]
});
```
### Go (using OpenTelemetry)
**Setup**:
```go
import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/trace"
)
func initTracer() {
exporter, _ := otlptracegrpc.New(context.Background())
tp := trace.NewTracerProvider(
trace.WithBatcher(exporter),
)
otel.SetTracerProvider(tp)
}
```
**Manual instrumentation**:
```go
import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)
func processOrder(ctx context.Context, orderID string) error {
tracer := otel.Tracer("my-service")
ctx, span := tracer.Start(ctx, "process_order")
defer span.End()
span.SetAttributes(
attribute.String("order.id", orderID),
attribute.Float64("order.amount", 99.99),
)
err := paymentService.Charge(ctx, orderID)
if err != nil {
span.RecordError(err)
return err
}
span.SetAttributes(attribute.String("payment.status", "success"))
return nil
}
```
---
## Span Attributes
### Semantic Conventions
Follow OpenTelemetry semantic conventions for consistency:
**HTTP**:
```python
span.set_attribute("http.method", "GET")
span.set_attribute("http.url", "https://api.example.com/users")
span.set_attribute("http.status_code", 200)
span.set_attribute("http.user_agent", "Mozilla/5.0...")
```
**Database**:
```python
span.set_attribute("db.system", "postgresql")
span.set_attribute("db.name", "users_db")
span.set_attribute("db.statement", "SELECT * FROM users WHERE id = ?")
span.set_attribute("db.operation", "SELECT")
```
**RPC/gRPC**:
```python
span.set_attribute("rpc.system", "grpc")
span.set_attribute("rpc.service", "UserService")
span.set_attribute("rpc.method", "GetUser")
span.set_attribute("rpc.grpc.status_code", 0)
```
**Messaging**:
```python
span.set_attribute("messaging.system", "kafka")
span.set_attribute("messaging.destination", "user-events")
span.set_attribute("messaging.operation", "publish")
span.set_attribute("messaging.message_id", "msg123")
```
### Custom Attributes
Add business context:
```python
span.set_attribute("user.id", "user123")
span.set_attribute("order.id", "ORD-456")
span.set_attribute("feature.flag.checkout_v2", True)
span.set_attribute("cache.hit", False)
```
---
## Context Propagation
### W3C Trace Context (Standard)
Headers propagated between services:
```
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: vendor1=value1,vendor2=value2
```
**Format**: `version-trace_id-parent_span_id-trace_flags`
### Implementation
**Python**:
```python
from opentelemetry.propagate import inject, extract
import requests
# Inject context into outgoing request
headers = {}
inject(headers)
requests.get("https://api.example.com", headers=headers)
# Extract context from incoming request
from flask import request
ctx = extract(request.headers)
```
**Node.js**:
```javascript
const { propagation } = require('@opentelemetry/api');
// Inject
const headers = {};
propagation.inject(context.active(), headers);
axios.get('https://api.example.com', { headers });
// Extract
const ctx = propagation.extract(context.active(), req.headers);
```
**HTTP Example**:
```bash
curl -H "traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01" \
https://api.example.com/users
```
---
## Sampling Strategies
### 1. Always On/Off
```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ALWAYS_ON, ALWAYS_OFF
# Development: trace everything
provider = TracerProvider(sampler=ALWAYS_ON)
# Production: trace nothing (usually not desired)
provider = TracerProvider(sampler=ALWAYS_OFF)
```
### 2. Probability-Based
```python
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
# Sample 10% of traces
provider = TracerProvider(sampler=TraceIdRatioBased(0.1))
```
### 3. Rate Limiting
```python
from opentelemetry.sdk.trace.sampling import ParentBased, RateLimitingSampler
# Sample max 100 traces per second
sampler = ParentBased(root=RateLimitingSampler(100))
provider = TracerProvider(sampler=sampler)
```
### 4. Parent-Based (Default)
```python
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
# If parent span is sampled, sample child spans
sampler = ParentBased(root=TraceIdRatioBased(0.1))
provider = TracerProvider(sampler=sampler)
```
### 5. Custom Sampling
```python
from opentelemetry.sdk.trace.sampling import Decision, Sampler, SamplingResult

class ErrorSampler(Sampler):
    """Always sample errors, sample ~1% of successes"""
    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        attributes = attributes or {}
        # Always sample spans flagged as errors
        if attributes.get('error', False):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        # Sample ~1% of successes (3 of every 256 trace IDs)
        if trace_id & 0xFF < 3:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        return SamplingResult(Decision.DROP)

    def get_description(self):
        return "ErrorSampler"

provider = TracerProvider(sampler=ErrorSampler())
```
---
## Backends
### Jaeger
**Docker Compose**:
```yaml
version: '3'
services:
jaeger:
image: jaegertracing/all-in-one:latest
ports:
- "16686:16686" # UI
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
environment:
- COLLECTOR_OTLP_ENABLED=true
```
**Query traces**:
```bash
# UI: http://localhost:16686
# API: Get trace by ID
curl http://localhost:16686/api/traces/abc123
# Search traces
curl "http://localhost:16686/api/traces?service=my-service&limit=20"
```
### Grafana Tempo
**Docker Compose**:
```yaml
version: '3'
services:
tempo:
image: grafana/tempo:latest
ports:
- "3200:3200" # Tempo
- "4317:4317" # OTLP gRPC
volumes:
- ./tempo.yaml:/etc/tempo.yaml
command: ["-config.file=/etc/tempo.yaml"]
```
**tempo.yaml**:
```yaml
server:
http_listen_port: 3200
distributor:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
storage:
trace:
backend: local
local:
path: /tmp/tempo/traces
```
**Query in Grafana**:
- Install Tempo data source
- Use TraceQL: `{ span.http.status_code = 500 }`
### AWS X-Ray
**Configuration**:
```python
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.ext.flask.middleware import XRayMiddleware
xray_recorder.configure(service='my-service')
XRayMiddleware(app, xray_recorder)
```
**Query**:
```bash
aws xray get-trace-summaries \
--start-time 2024-10-28T00:00:00 \
--end-time 2024-10-28T23:59:59 \
--filter-expression 'error = true'
```
---
## Analysis Patterns
### Find Slow Traces
```
# Jaeger UI
- Filter by service
- Set min duration: 1000ms
- Sort by duration
# TraceQL (Tempo)
{ duration > 1s }
```
### Find Error Traces
```
# Jaeger UI
- Filter by tag: error=true
- Or by HTTP status: http.status_code=500
# TraceQL (Tempo)
{ span.http.status_code >= 500 }
```
### Find Traces by User
```
# Jaeger UI
- Filter by tag: user.id=user123
# TraceQL (Tempo)
{ span.user.id = "user123" }
```
### Find N+1 Query Problems
Look for:
- Many sequential database spans
- Same query repeated multiple times
- Pattern: API call → DB query → DB query → DB query...
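This pattern can also be detected programmatically once a trace's spans are exported. A minimal sketch (the span dict shape — an `attributes` mapping containing `db.statement` — is an assumption, as real backends differ):

```python
from collections import Counter

def possible_n_plus_one(spans, threshold=10):
    """Flag DB statements repeated many times within a single trace."""
    statements = Counter(
        span["attributes"]["db.statement"]
        for span in spans
        if "db.statement" in span.get("attributes", {})
    )
    return [(stmt, count) for stmt, count in statements.most_common() if count >= threshold]

# Example: 25 identical SELECTs in one trace is a strong N+1 signal
spans = [{"attributes": {"db.statement": "SELECT * FROM orders WHERE user_id = ?"}}] * 25
print(possible_n_plus_one(spans))
```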
### Find Service Bottlenecks
- Identify spans with longest duration
- Check if time is spent in service logic or waiting for dependencies
- Look at span relationships (parallel vs sequential)
---
## Integration with Logs
### Trace ID in Logs
**Python**:
```python
from opentelemetry import trace
def add_trace_context():
span = trace.get_current_span()
trace_id = span.get_span_context().trace_id
span_id = span.get_span_context().span_id
return {
"trace_id": format(trace_id, '032x'),
"span_id": format(span_id, '016x')
}
logger.info("Processing order", **add_trace_context(), order_id=order_id)
```
**Query logs for trace**:
```
# Elasticsearch
GET /logs/_search
{
"query": {
"match": { "trace_id": "0af7651916cd43dd8448eb211c80319c" }
}
}
# Loki (LogQL)
{job="app"} |= "0af7651916cd43dd8448eb211c80319c"
```
### Trace from Log (Grafana)
Configure derived fields in Grafana:
```yaml
datasources:
- name: Loki
type: loki
jsonData:
derivedFields:
- name: TraceID
matcherRegex: "trace_id=([\\w]+)"
url: "http://tempo:3200/trace/$${__value.raw}"
datasourceUid: tempo_uid
```
---
## Best Practices
### 1. Span Naming
✅ Use operation names, not IDs
- Good: `GET /api/users`, `UserService.GetUser`, `db.query.users`
- Bad: `/api/users/123`, `span_abc`, `query_1`
### 2. Span Granularity
✅ One span per logical operation
- Too coarse: One span for entire request
- Too fine: Span for every variable assignment
- Just right: Span per service call, database query, external API
### 3. Add Context
Always include:
- Operation name
- Service name
- Error status
- Business identifiers (user_id, order_id)
### 4. Handle Errors
```python
try:
result = operation()
except Exception as e:
span.set_status(trace.Status(trace.StatusCode.ERROR))
span.record_exception(e)
raise
```
### 5. Sampling Strategy
- Development: 100%
- Staging: 50-100%
- Production: 1-10% (or error-based)
### 6. Performance Impact
- Overhead: ~1-5% CPU
- Use async exporters
- Batch span exports
- Sample appropriately
### 7. Cardinality
Avoid high-cardinality attributes:
- ❌ Email addresses
- ❌ Full URLs with unique IDs
- ❌ Timestamps
- ✅ User ID
- ✅ Endpoint pattern
- ✅ Status code
---
## Common Issues
### Missing Traces
**Cause**: Context not propagated
**Solution**: Verify headers are injected/extracted
### Incomplete Traces
**Cause**: Spans not closed properly
**Solution**: Always use `defer span.End()` or context managers
### High Overhead
**Cause**: Too many spans or synchronous export
**Solution**: Reduce span count, use batch processor
### No Error Traces
**Cause**: Errors not recorded on spans
**Solution**: Call `span.record_exception()` and set error status
---
## Metrics from Traces
Generate RED metrics from trace data:
**Rate**: Traces per second
**Errors**: Traces with error status
**Duration**: Span duration percentiles
**Example** (using Tempo + Prometheus):
```yaml
# Generate metrics from spans
metrics_generator:
processor:
span_metrics:
dimensions:
- http.method
- http.status_code
```
**Query**:
```promql
# Request rate
rate(traces_spanmetrics_calls_total[5m])
# Error rate
rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m])
/
rate(traces_spanmetrics_calls_total[5m])
# P95 latency
histogram_quantile(0.95, sum(rate(traces_spanmetrics_latency_bucket[5m])) by (le))
```

315 scripts/alert_quality_checker.py Normal file
@@ -0,0 +1,315 @@
#!/usr/bin/env python3
"""
Audit Prometheus alert rules against best practices.
Checks for: alert naming, severity labels, runbook links, expression quality.
"""
import argparse
import sys
import os
import re
from typing import Dict, List, Any
from pathlib import Path
try:
import yaml
except ImportError:
print("⚠️ Warning: 'PyYAML' library not found. Install with: pip install pyyaml")
sys.exit(1)
class AlertQualityChecker:
def __init__(self):
self.issues = []
self.warnings = []
self.recommendations = []
def check_alert_name(self, alert_name: str) -> List[str]:
"""Check alert naming conventions."""
issues = []
# Should be PascalCase (start with an uppercase letter)
if not re.match(r'^[A-Z][a-zA-Z0-9]*$', alert_name):
issues.append(f"Alert name '{alert_name}' should use PascalCase (e.g., HighCPUUsage)")
# Should be descriptive
if len(alert_name) < 5:
issues.append(f"Alert name '{alert_name}' is too short, use descriptive names")
# Avoid generic names
generic_names = ['Alert', 'Test', 'Warning', 'Error']
if alert_name in generic_names:
issues.append(f"Alert name '{alert_name}' is too generic")
return issues
def check_labels(self, alert: Dict[str, Any]) -> List[str]:
"""Check required and recommended labels."""
issues = []
labels = alert.get('labels', {})
# Required labels
if 'severity' not in labels:
issues.append("Missing required 'severity' label (critical/warning/info)")
elif labels['severity'] not in ['critical', 'warning', 'info']:
issues.append(f"Severity '{labels['severity']}' should be one of: critical, warning, info")
# Recommended labels
if 'team' not in labels:
self.recommendations.append("Consider adding 'team' label for routing")
if 'component' not in labels and 'service' not in labels:
self.recommendations.append("Consider adding 'component' or 'service' label")
return issues
def check_annotations(self, alert: Dict[str, Any]) -> List[str]:
"""Check annotations quality."""
issues = []
annotations = alert.get('annotations', {})
# Required annotations
if 'summary' not in annotations:
issues.append("Missing 'summary' annotation")
elif len(annotations['summary']) < 10:
issues.append("Summary annotation is too short, provide clear description")
if 'description' not in annotations:
issues.append("Missing 'description' annotation")
# Runbook
if 'runbook_url' not in annotations and 'runbook' not in annotations:
self.recommendations.append("Consider adding 'runbook_url' for incident response")
# Check for templating
if 'summary' in annotations:
if '{{ $value }}' not in annotations['summary'] and '{{' not in annotations['summary']:
self.recommendations.append("Consider using template variables in summary (e.g., {{ $value }})")
return issues
def check_expression(self, expr: str, alert_name: str) -> List[str]:
"""Check PromQL expression quality."""
issues = []
# Should have a threshold
if '>' not in expr and '<' not in expr and '==' not in expr and '!=' not in expr:
issues.append("Expression should include a comparison operator")
# Should use rate() for counters
if '_total' in expr and 'rate(' not in expr and 'increase(' not in expr:
self.recommendations.append("Consider using rate() or increase() for counter metrics (*_total)")
# Avoid instant queries without aggregation
if not any(agg in expr for agg in ['sum(', 'avg(', 'min(', 'max(', 'count(']):
if expr.count('{') > 1: # Multiple metrics without aggregation
self.recommendations.append("Consider aggregating metrics with sum(), avg(), etc.")
# Check for proper time windows
if '[' not in expr and 'rate(' in expr:
issues.append("rate() requires a time window (e.g., rate(metric[5m]))")
return issues
def check_for_duration(self, rule: Dict[str, Any]) -> List[str]:
"""Check for 'for' clause to prevent flapping."""
issues = []
severity = rule.get('labels', {}).get('severity', 'unknown')
if 'for' not in rule:
if severity == 'critical':
issues.append("Critical alerts should have 'for' clause to prevent flapping")
else:
self.warnings.append("Consider adding 'for' clause to prevent alert flapping")
else:
# Parse duration
duration = rule['for']
if severity == 'critical' and any(x in duration for x in ['0s', '30s', '1m']):
self.warnings.append(f"'for' duration ({duration}) might be too short for critical alerts")
return issues
def check_alert_rule(self, rule: Dict[str, Any]) -> Dict[str, Any]:
"""Check a single alert rule."""
alert_name = rule.get('alert', 'Unknown')
issues = []
# Check alert name
issues.extend(self.check_alert_name(alert_name))
# Check expression
if 'expr' not in rule:
issues.append("Missing 'expr' field")
else:
issues.extend(self.check_expression(rule['expr'], alert_name))
# Check labels
issues.extend(self.check_labels(rule))
# Check annotations
issues.extend(self.check_annotations(rule))
# Check for duration
issues.extend(self.check_for_duration(rule))
return {
"alert": alert_name,
"issues": issues,
"severity": rule.get('labels', {}).get('severity', 'unknown')
}
def analyze_file(self, filepath: str) -> Dict[str, Any]:
"""Analyze a Prometheus rules file."""
try:
with open(filepath, 'r') as f:
data = yaml.safe_load(f)
if not data:
return {"error": "Empty or invalid YAML file"}
results = []
groups = data.get('groups', [])
for group in groups:
group_name = group.get('name', 'Unknown')
rules = group.get('rules', [])
for rule in rules:
# Only check alerting rules, not recording rules
if 'alert' in rule:
result = self.check_alert_rule(rule)
result['group'] = group_name
results.append(result)
return {
"file": filepath,
"groups": len(groups),
"alerts_checked": len(results),
"results": results
}
except Exception as e:
return {"error": f"Failed to parse file: {e}"}
def print_results(analysis: Dict[str, Any], checker: AlertQualityChecker):
"""Pretty print analysis results."""
print("\n" + "="*60)
print("🚨 ALERT QUALITY CHECK RESULTS")
print("="*60)
if "error" in analysis:
print(f"\n❌ Error: {analysis['error']}")
return
print(f"\n📁 File: {analysis['file']}")
print(f"📊 Groups: {analysis['groups']}")
print(f"🔔 Alerts Checked: {analysis['alerts_checked']}")
# Count issues by severity
critical_count = 0
warning_count = 0
for result in analysis['results']:
if result['issues']:
critical_count += 1
print(f"\n{'='*60}")
print(f"📈 Summary:")
print(f" ❌ Alerts with Issues: {critical_count}")
print(f" ⚠️ Warnings: {len(checker.warnings)}")
print(f" 💡 Recommendations: {len(checker.recommendations)}")
# Print detailed results
if critical_count > 0:
print(f"\n{'='*60}")
print("❌ ALERTS WITH ISSUES:")
print(f"{'='*60}")
for result in analysis['results']:
if result['issues']:
print(f"\n🔔 Alert: {result['alert']} (Group: {result['group']})")
print(f" Severity: {result['severity']}")
print(" Issues:")
for issue in result['issues']:
print(f"{issue}")
# Print warnings
if checker.warnings:
print(f"\n{'='*60}")
print("⚠️ WARNINGS:")
print(f"{'='*60}")
for warning in set(checker.warnings): # Remove duplicates
print(f"{warning}")
# Print recommendations
if checker.recommendations:
print(f"\n{'='*60}")
print("💡 RECOMMENDATIONS:")
print(f"{'='*60}")
for rec in list(set(checker.recommendations))[:10]: # Top 10 unique recommendations
print(f"{rec}")
# Overall score
total_alerts = analysis['alerts_checked']
if total_alerts > 0:
quality_score = ((total_alerts - critical_count) / total_alerts) * 100
print(f"\n{'='*60}")
print(f"📊 Quality Score: {quality_score:.1f}% ({total_alerts - critical_count}/{total_alerts} alerts passing)")
print(f"{'='*60}\n")
def main():
parser = argparse.ArgumentParser(
description="Audit Prometheus alert rules for quality and best practices",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Check a single file
python3 alert_quality_checker.py alerts.yml
# Check all YAML files in a directory
python3 alert_quality_checker.py /path/to/prometheus/rules/
Best Practices Checked:
✓ Alert naming conventions (PascalCase, descriptive)
✓ Required labels (severity)
✓ Required annotations (summary, description)
✓ Runbook URL presence
✓ PromQL expression quality
✓ 'for' clause to prevent flapping
✓ Template variable usage
"""
)
parser.add_argument('path', help='Path to alert rules file or directory')
parser.add_argument('--verbose', action='store_true', help='Show all recommendations')
args = parser.parse_args()
checker = AlertQualityChecker()
# Check if path is file or directory
path = Path(args.path)
if path.is_file():
files = [str(path)]
elif path.is_dir():
files = [str(f) for f in path.rglob('*.yml')] + [str(f) for f in path.rglob('*.yaml')]
else:
print(f"❌ Path not found: {args.path}")
sys.exit(1)
if not files:
print(f"❌ No YAML files found in: {args.path}")
sys.exit(1)
print(f"🔍 Checking {len(files)} file(s)...")
for filepath in files:
analysis = checker.analyze_file(filepath)
print_results(analysis, checker)
if __name__ == "__main__":
    main()

279 scripts/analyze_metrics.py Normal file
@@ -0,0 +1,279 @@
#!/usr/bin/env python3
"""
Analyze metrics from Prometheus or CloudWatch and detect anomalies.
Supports: rate of change analysis, spike detection, trend analysis.
"""
import argparse
import sys
import json
from datetime import datetime, timedelta
from typing import Dict, List, Any, Optional
import statistics
try:
import requests
except ImportError:
print("⚠️ Warning: 'requests' library not found. Install with: pip install requests")
sys.exit(1)
try:
import boto3
except ImportError:
boto3 = None
class MetricAnalyzer:
def __init__(self, source: str, endpoint: Optional[str] = None, region: str = "us-east-1"):
self.source = source
self.endpoint = endpoint
self.region = region
if source == "cloudwatch" and boto3:
self.cloudwatch = boto3.client('cloudwatch', region_name=region)
elif source == "cloudwatch" and not boto3:
print("⚠️ boto3 not installed. Install with: pip install boto3")
sys.exit(1)
def query_prometheus(self, query: str, hours: int = 24) -> List[Dict]:
"""Query Prometheus for metric data."""
if not self.endpoint:
print("❌ Prometheus endpoint required")
sys.exit(1)
try:
# Query range for last N hours
end_time = datetime.now()
start_time = end_time - timedelta(hours=hours)
params = {
'query': query,
'start': start_time.timestamp(),
'end': end_time.timestamp(),
'step': '5m' # 5-minute resolution
}
response = requests.get(f"{self.endpoint}/api/v1/query_range", params=params, timeout=30)
response.raise_for_status()
data = response.json()
if data['status'] != 'success':
print(f"❌ Prometheus query failed: {data}")
return []
return data['data']['result']
except Exception as e:
print(f"❌ Error querying Prometheus: {e}")
return []
def query_cloudwatch(self, namespace: str, metric_name: str, dimensions: Dict[str, str],
hours: int = 24, stat: str = "Average") -> List[Dict]:
"""Query CloudWatch for metric data."""
try:
end_time = datetime.now()
start_time = end_time - timedelta(hours=hours)
dimensions_list = [{'Name': k, 'Value': v} for k, v in dimensions.items()]
response = self.cloudwatch.get_metric_statistics(
Namespace=namespace,
MetricName=metric_name,
Dimensions=dimensions_list,
StartTime=start_time,
EndTime=end_time,
Period=300, # 5-minute intervals
Statistics=[stat]
)
return sorted(response['Datapoints'], key=lambda x: x['Timestamp'])
except Exception as e:
print(f"❌ Error querying CloudWatch: {e}")
return []
def detect_anomalies(self, values: List[float], sensitivity: float = 2.0) -> Dict[str, Any]:
"""Detect anomalies using standard deviation method."""
if len(values) < 10:
return {
"anomalies_detected": False,
"message": "Insufficient data points for anomaly detection"
}
mean = statistics.mean(values)
stdev = statistics.stdev(values)
threshold_upper = mean + (sensitivity * stdev)
threshold_lower = mean - (sensitivity * stdev)
anomalies = []
for i, value in enumerate(values):
if value > threshold_upper or value < threshold_lower:
anomalies.append({
"index": i,
"value": value,
"deviation": abs(value - mean) / stdev if stdev > 0 else 0
})
return {
"anomalies_detected": len(anomalies) > 0,
"count": len(anomalies),
"anomalies": anomalies,
"stats": {
"mean": mean,
"stdev": stdev,
"threshold_upper": threshold_upper,
"threshold_lower": threshold_lower,
"total_points": len(values)
}
}
def analyze_trend(self, values: List[float]) -> Dict[str, Any]:
"""Analyze trend using simple linear regression."""
if len(values) < 2:
return {"trend": "unknown", "message": "Insufficient data"}
n = len(values)
x = list(range(n))
x_mean = sum(x) / n
y_mean = sum(values) / n
numerator = sum((x[i] - x_mean) * (values[i] - y_mean) for i in range(n))
denominator = sum((x[i] - x_mean) ** 2 for i in range(n))
if denominator == 0:
return {"trend": "flat", "slope": 0}
slope = numerator / denominator
# Determine trend direction
        if abs(slope) < 0.01 * abs(y_mean):  # Less than 1% change per interval
trend = "stable"
elif slope > 0:
trend = "increasing"
else:
trend = "decreasing"
return {
"trend": trend,
"slope": slope,
"rate_of_change": (slope / y_mean * 100) if y_mean != 0 else 0
}
def print_results(results: Dict[str, Any]):
"""Pretty print analysis results."""
print("\n" + "="*60)
print("📊 METRIC ANALYSIS RESULTS")
print("="*60)
if "error" in results:
print(f"\n❌ Error: {results['error']}")
return
print(f"\n📈 Data Points: {results.get('data_points', 0)}")
# Trend analysis
if "trend" in results:
trend_emoji = {"increasing": "📈", "decreasing": "📉", "stable": "➡️"}.get(results["trend"]["trend"], "")
print(f"\n{trend_emoji} Trend: {results['trend']['trend'].upper()}")
if "rate_of_change" in results["trend"]:
print(f" Rate of Change: {results['trend']['rate_of_change']:.2f}% per interval")
# Anomaly detection
if "anomalies" in results:
anomaly_data = results["anomalies"]
if anomaly_data["anomalies_detected"]:
print(f"\n⚠️ ANOMALIES DETECTED: {anomaly_data['count']}")
print(f" Mean: {anomaly_data['stats']['mean']:.2f}")
print(f" Std Dev: {anomaly_data['stats']['stdev']:.2f}")
print(f" Threshold: [{anomaly_data['stats']['threshold_lower']:.2f}, {anomaly_data['stats']['threshold_upper']:.2f}]")
print("\n Top Anomalies:")
for anomaly in sorted(anomaly_data['anomalies'], key=lambda x: x['deviation'], reverse=True)[:5]:
print(f" • Index {anomaly['index']}: {anomaly['value']:.2f} ({anomaly['deviation']:.2f}σ)")
else:
print("\n✅ No anomalies detected")
print("\n" + "="*60)
def main():
parser = argparse.ArgumentParser(
description="Analyze metrics from Prometheus or CloudWatch",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Prometheus: Analyze request rate
python3 analyze_metrics.py prometheus \\
--endpoint http://localhost:9090 \\
--query 'rate(http_requests_total[5m])' \\
--hours 24
# CloudWatch: Analyze CPU utilization
python3 analyze_metrics.py cloudwatch \\
--namespace AWS/EC2 \\
--metric CPUUtilization \\
--dimensions InstanceId=i-1234567890abcdef0 \\
--hours 48
"""
)
parser.add_argument('source', choices=['prometheus', 'cloudwatch'],
help='Metric source')
parser.add_argument('--endpoint', help='Prometheus endpoint URL')
parser.add_argument('--query', help='PromQL query')
parser.add_argument('--namespace', help='CloudWatch namespace')
parser.add_argument('--metric', help='CloudWatch metric name')
parser.add_argument('--dimensions', help='CloudWatch dimensions (key=value,key2=value2)')
parser.add_argument('--hours', type=int, default=24, help='Hours of data to analyze (default: 24)')
parser.add_argument('--sensitivity', type=float, default=2.0,
help='Anomaly detection sensitivity (std deviations, default: 2.0)')
parser.add_argument('--region', default='us-east-1', help='AWS region (default: us-east-1)')
args = parser.parse_args()
analyzer = MetricAnalyzer(args.source, args.endpoint, args.region)
# Query metrics
if args.source == 'prometheus':
if not args.query:
print("❌ --query required for Prometheus")
sys.exit(1)
print(f"🔍 Querying Prometheus: {args.query}")
results = analyzer.query_prometheus(args.query, args.hours)
if not results:
print("❌ No data returned")
sys.exit(1)
# Extract values from first result series
values = [float(v[1]) for v in results[0].get('values', [])]
elif args.source == 'cloudwatch':
if not all([args.namespace, args.metric, args.dimensions]):
print("❌ --namespace, --metric, and --dimensions required for CloudWatch")
sys.exit(1)
dims = dict(item.split('=') for item in args.dimensions.split(','))
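# Assumes simple key=value pairs; dimension values containing '=' or ',' are not supported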
print(f"🔍 Querying CloudWatch: {args.namespace}/{args.metric}")
results = analyzer.query_cloudwatch(args.namespace, args.metric, dims, args.hours)
if not results:
print("❌ No data returned")
sys.exit(1)
values = [point['Average'] for point in results]
# Analyze metrics
analysis_results = {
"data_points": len(values),
"trend": analyzer.analyze_trend(values),
"anomalies": analyzer.detect_anomalies(values, args.sensitivity)
}
print_results(analysis_results)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,395 @@
#!/usr/bin/env python3
"""
Generate Grafana dashboards from templates.
Supports: web applications, Kubernetes, databases, Redis, and custom metrics.
"""
import argparse
import sys
import json
from typing import Dict, List, Any, Optional
from pathlib import Path
class DashboardGenerator:
def __init__(self, title: str, datasource: str = "Prometheus"):
self.title = title
self.datasource = datasource
self.dashboard = self._create_base_dashboard()
self.panel_id = 1
self.row_y = 0
def _create_base_dashboard(self) -> Dict[str, Any]:
"""Create base dashboard structure."""
return {
"dashboard": {
"title": self.title,
"tags": [],
"timezone": "browser",
"schemaVersion": 16,
"version": 0,
"refresh": "30s",
"panels": [],
"templating": {
"list": []
},
"time": {
"from": "now-6h",
"to": "now"
}
},
"overwrite": True
}
def add_variable(self, name: str, label: str, query: str):
"""Add a template variable."""
variable = {
"name": name,
"label": label,
"type": "query",
"datasource": self.datasource,
"query": query,
"refresh": 1,
"regex": "",
"multi": False,
"includeAll": False
}
self.dashboard["dashboard"]["templating"]["list"].append(variable)
def add_row(self, title: str):
"""Add a row panel."""
panel = {
"id": self.panel_id,
"type": "row",
"title": title,
"collapsed": False,
"gridPos": {"h": 1, "w": 24, "x": 0, "y": self.row_y}
}
self.dashboard["dashboard"]["panels"].append(panel)
self.panel_id += 1
self.row_y += 1
def add_graph(self, title: str, targets: List[Dict[str, str]], unit: str = "short",
width: int = 12, height: int = 8):
"""Add a graph panel."""
panel = {
"id": self.panel_id,
"type": "graph",
"title": title,
"datasource": self.datasource,
"targets": [
{
"expr": target["query"],
"legendFormat": target.get("legend", ""),
"refId": chr(65 + i) # A, B, C, etc.
}
for i, target in enumerate(targets)
],
"gridPos": {"h": height, "w": width, "x": 0, "y": self.row_y},
"yaxes": [
{"format": unit, "label": None, "show": True},
{"format": "short", "label": None, "show": True}
],
"lines": True,
"fill": 1,
"linewidth": 2,
"legend": {
"show": True,
"alignAsTable": True,
"avg": True,
"current": True,
"max": True,
"min": False,
"total": False,
"values": True
}
}
self.dashboard["dashboard"]["panels"].append(panel)
self.panel_id += 1
self.row_y += height
def add_stat(self, title: str, query: str, unit: str = "short",
width: int = 6, height: int = 4):
"""Add a stat panel (single value)."""
panel = {
"id": self.panel_id,
"type": "stat",
"title": title,
"datasource": self.datasource,
"targets": [
{
"expr": query,
"refId": "A"
}
],
"gridPos": {"h": height, "w": width, "x": 0, "y": self.row_y},
"options": {
"graphMode": "area",
"orientation": "auto",
"reduceOptions": {
"values": False,
"calcs": ["lastNotNull"]
}
},
"fieldConfig": {
"defaults": {
"unit": unit,
"thresholds": {
"mode": "absolute",
"steps": [
{"value": None, "color": "green"},
{"value": 80, "color": "red"}
]
}
}
}
}
self.dashboard["dashboard"]["panels"].append(panel)
self.panel_id += 1
def generate_webapp_dashboard(self, service: str):
"""Generate dashboard for web application."""
self.add_variable("service", "Service", f"label_values({service}_http_requests_total, service)")
# Request metrics
self.add_row("Request Metrics")
self.add_graph(
"Request Rate",
[{"query": f'sum(rate({service}_http_requests_total[5m])) by (status)', "legend": "{{status}}"}],
unit="reqps",
width=12
)
self.add_graph(
"Request Latency (p50, p95, p99)",
[
{"query": f'histogram_quantile(0.50, sum(rate({service}_http_request_duration_seconds_bucket[5m])) by (le))', "legend": "p50"},
{"query": f'histogram_quantile(0.95, sum(rate({service}_http_request_duration_seconds_bucket[5m])) by (le))', "legend": "p95"},
{"query": f'histogram_quantile(0.99, sum(rate({service}_http_request_duration_seconds_bucket[5m])) by (le))', "legend": "p99"}
],
unit="s",
width=12
)
# Error rate
self.add_row("Errors")
self.add_graph(
"Error Rate (%)",
[{"query": f'sum(rate({service}_http_requests_total{{status=~"5.."}}[5m])) / sum(rate({service}_http_requests_total[5m])) * 100', "legend": "Error Rate"}],
unit="percent",
width=12
)
# Resource usage
self.add_row("Resource Usage")
self.add_graph(
"CPU Usage",
[{"query": f'sum(rate(process_cpu_seconds_total{{job="{service}"}}[5m])) * 100', "legend": "CPU %"}],
unit="percent",
width=12
)
self.add_graph(
"Memory Usage",
[{"query": f'process_resident_memory_bytes{{job="{service}"}}', "legend": "Memory"}],
unit="bytes",
width=12
)
def generate_kubernetes_dashboard(self, namespace: str):
"""Generate dashboard for Kubernetes cluster."""
self.add_variable("namespace", "Namespace", f"label_values(kube_pod_info, namespace)")
# Cluster overview
self.add_row("Cluster Overview")
self.add_stat("Total Pods", f'count(kube_pod_info{{namespace="{namespace}"}})', width=6)
self.add_stat("Running Pods", f'count(kube_pod_status_phase{{namespace="{namespace}", phase="Running"}})', width=6)
self.add_stat("Pending Pods", f'count(kube_pod_status_phase{{namespace="{namespace}", phase="Pending"}})', width=6)
self.add_stat("Failed Pods", f'count(kube_pod_status_phase{{namespace="{namespace}", phase="Failed"}})', width=6)
# Resource usage
self.add_row("Resource Usage")
self.add_graph(
"CPU Usage by Pod",
[{"query": f'sum(rate(container_cpu_usage_seconds_total{{namespace="{namespace}"}}[5m])) by (pod)', "legend": "{{pod}}"}],
unit="percent",
width=12
)
self.add_graph(
"Memory Usage by Pod",
[{"query": f'sum(container_memory_usage_bytes{{namespace="{namespace}"}}) by (pod)', "legend": "{{pod}}"}],
unit="bytes",
width=12
)
# Network
self.add_row("Network")
self.add_graph(
"Network I/O",
[
{"query": f'sum(rate(container_network_receive_bytes_total{{namespace="{namespace}"}}[5m])) by (pod)', "legend": "Receive - {{pod}}"},
{"query": f'sum(rate(container_network_transmit_bytes_total{{namespace="{namespace}"}}[5m])) by (pod)', "legend": "Transmit - {{pod}}"}
],
unit="Bps",
width=12
)
def generate_database_dashboard(self, db_type: str, instance: str):
"""Generate dashboard for database (postgres/mysql)."""
if db_type == "postgres":
self._generate_postgres_dashboard(instance)
elif db_type == "mysql":
self._generate_mysql_dashboard(instance)
def _generate_postgres_dashboard(self, instance: str):
"""Generate PostgreSQL dashboard."""
self.add_row("PostgreSQL Metrics")
self.add_graph(
"Connections",
[
{"query": f'pg_stat_database_numbackends{{instance="{instance}"}}', "legend": "{{datname}}"}
],
unit="short",
width=12
)
self.add_graph(
"Transactions per Second",
[
{"query": f'rate(pg_stat_database_xact_commit{{instance="{instance}"}}[5m])', "legend": "Commits"},
{"query": f'rate(pg_stat_database_xact_rollback{{instance="{instance}"}}[5m])', "legend": "Rollbacks"}
],
unit="tps",
width=12
)
self.add_graph(
"Query Duration (p95)",
[
{"query": f'histogram_quantile(0.95, rate(pg_stat_statements_total_time_bucket{{instance="{instance}"}}[5m]))', "legend": "p95"}
],
unit="ms",
width=12
)
def _generate_mysql_dashboard(self, instance: str):
"""Generate MySQL dashboard."""
self.add_row("MySQL Metrics")
self.add_graph(
"Connections",
[
{"query": f'mysql_global_status_threads_connected{{instance="{instance}"}}', "legend": "Connected"},
{"query": f'mysql_global_status_threads_running{{instance="{instance}"}}', "legend": "Running"}
],
unit="short",
width=12
)
self.add_graph(
"Queries per Second",
[
{"query": f'rate(mysql_global_status_queries{{instance="{instance}"}}[5m])', "legend": "Queries"}
],
unit="qps",
width=12
)
def save(self, output_file: str):
"""Save dashboard to file."""
try:
with open(output_file, 'w') as f:
json.dump(self.dashboard, f, indent=2)
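# If importing through the Grafana UI rather than the HTTP API, you may need to
# upload just the inner "dashboard" object instead of this wrapped payload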
return True
except Exception as e:
print(f"❌ Error saving dashboard: {e}")
return False
def main():
parser = argparse.ArgumentParser(
description="Generate Grafana dashboards from templates",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Web application dashboard
python3 dashboard_generator.py webapp \\
--title "My API Dashboard" \\
--service my_api \\
--output dashboard.json
# Kubernetes dashboard
python3 dashboard_generator.py kubernetes \\
--title "K8s Namespace" \\
--namespace production \\
--output k8s-dashboard.json
# Database dashboard
python3 dashboard_generator.py database \\
--title "PostgreSQL" \\
--db-type postgres \\
--instance db.example.com:5432 \\
--output db-dashboard.json
"""
)
parser.add_argument('type', choices=['webapp', 'kubernetes', 'database'],
help='Dashboard type')
parser.add_argument('--title', required=True, help='Dashboard title')
parser.add_argument('--output', required=True, help='Output file path')
parser.add_argument('--datasource', default='Prometheus', help='Data source name')
# Web app specific
parser.add_argument('--service', help='Service name (for webapp)')
# Kubernetes specific
parser.add_argument('--namespace', help='Kubernetes namespace')
# Database specific
parser.add_argument('--db-type', choices=['postgres', 'mysql'], help='Database type')
parser.add_argument('--instance', help='Database instance')
args = parser.parse_args()
print(f"🎨 Generating {args.type} dashboard: {args.title}")
generator = DashboardGenerator(args.title, args.datasource)
if args.type == 'webapp':
if not args.service:
print("❌ --service required for webapp dashboard")
sys.exit(1)
generator.generate_webapp_dashboard(args.service)
elif args.type == 'kubernetes':
if not args.namespace:
print("❌ --namespace required for kubernetes dashboard")
sys.exit(1)
generator.generate_kubernetes_dashboard(args.namespace)
elif args.type == 'database':
if not args.db_type or not args.instance:
print("❌ --db-type and --instance required for database dashboard")
sys.exit(1)
generator.generate_database_dashboard(args.db_type, args.instance)
if generator.save(args.output):
print(f"✅ Dashboard saved to: {args.output}")
print(f"\n📝 Import to Grafana:")
print(f" 1. Go to Grafana → Dashboards → Import")
print(f" 2. Upload {args.output}")
print(f" 3. Select datasource and save")
else:
sys.exit(1)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,477 @@
#!/usr/bin/env python3
"""
Analyze Datadog usage and identify cost optimization opportunities.
Helps find waste in custom metrics, logs, APM, and infrastructure monitoring.
"""
import argparse
import sys
import os
from datetime import datetime, timedelta
from typing import Dict, List, Any, Optional
from collections import defaultdict
try:
import requests
except ImportError:
print("⚠️ Warning: 'requests' library not found. Install with: pip install requests")
sys.exit(1)
try:
from tabulate import tabulate
except ImportError:
tabulate = None
class DatadogCostAnalyzer:
# Pricing (as of 2024-2025)
PRICING = {
'infrastructure_pro': 15, # per host per month
'infrastructure_enterprise': 23,
'custom_metric': 0.01, # per metric per month (first 100 free per host)
'log_ingestion': 0.10, # per GB ingested per month
'apm_host': 31, # APM Pro per host per month
'apm_span': 1.70, # per million indexed spans
}
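# List prices change over time; treat these as rough 2024-2025 estimates and
# substitute your own contract rates where they differ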
def __init__(self, api_key: str, app_key: str, site: str = "datadoghq.com"):
self.api_key = api_key
self.app_key = app_key
self.site = site
self.base_url = f"https://api.{site}"
self.headers = {
'DD-API-KEY': api_key,
'DD-APPLICATION-KEY': app_key,
'Content-Type': 'application/json'
}
def _make_request(self, endpoint: str, params: Optional[Dict] = None) -> Dict:
"""Make API request to Datadog."""
try:
url = f"{self.base_url}{endpoint}"
response = requests.get(url, headers=self.headers, params=params, timeout=30)
response.raise_for_status()
return response.json()
except requests.exceptions.RequestException as e:
print(f"❌ API Error: {e}")
return {}
def get_usage_metrics(self, start_date: str, end_date: str) -> Dict[str, Any]:
"""Get usage metrics for specified date range."""
endpoint = "/api/v1/usage/summary"
params = {
'start_month': start_date,
'end_month': end_date,
'include_org_details': 'true'
}
data = self._make_request(endpoint, params)
return data.get('usage', [])
def get_custom_metrics(self) -> Dict[str, Any]:
"""Get custom metrics usage and identify high-cardinality metrics."""
endpoint = "/api/v1/usage/timeseries"
# Get last 30 days
end_date = datetime.now()
start_date = end_date - timedelta(days=30)
params = {
'start_hr': int(start_date.timestamp()),
'end_hr': int(end_date.timestamp())
}
data = self._make_request(endpoint, params)
if not data:
return {'metrics': [], 'total_count': 0}
# Extract custom metrics info
usage_data = data.get('usage', [])
metrics_summary = {
'total_custom_metrics': 0,
'avg_custom_metrics': 0,
'billable_metrics': 0
}
for day in usage_data:
if 'timeseries' in day:
for ts in day['timeseries']:
if ts.get('metric_category') == 'custom':
metrics_summary['total_custom_metrics'] = max(
metrics_summary['total_custom_metrics'],
ts.get('num_custom_timeseries', 0)
)
# Rough billable estimate: subtract a flat 100-metric allotment. Datadog's included
# allotment actually scales with host count, so treat this as an upper bound.
metrics_summary['billable_metrics'] = max(0, metrics_summary['total_custom_metrics'] - 100)
return metrics_summary
def get_infrastructure_hosts(self) -> Dict[str, Any]:
"""Get infrastructure host count and breakdown."""
endpoint = "/api/v1/usage/hosts"
end_date = datetime.now()
start_date = end_date - timedelta(days=30)
params = {
'start_hr': int(start_date.timestamp()),
'end_hr': int(end_date.timestamp())
}
data = self._make_request(endpoint, params)
if not data:
return {'total_hosts': 0}
usage = data.get('usage', [])
host_summary = {
'total_hosts': 0,
'agent_hosts': 0,
'aws_hosts': 0,
'azure_hosts': 0,
'gcp_hosts': 0,
'container_count': 0
}
for day in usage:
host_summary['total_hosts'] = max(host_summary['total_hosts'], day.get('host_count', 0))
host_summary['agent_hosts'] = max(host_summary['agent_hosts'], day.get('agent_host_count', 0))
host_summary['aws_hosts'] = max(host_summary['aws_hosts'], day.get('aws_host_count', 0))
host_summary['azure_hosts'] = max(host_summary['azure_hosts'], day.get('azure_host_count', 0))
host_summary['gcp_hosts'] = max(host_summary['gcp_hosts'], day.get('gcp_host_count', 0))
host_summary['container_count'] = max(host_summary['container_count'], day.get('container_count', 0))
return host_summary
def get_log_usage(self) -> Dict[str, Any]:
"""Get log ingestion and retention usage."""
endpoint = "/api/v1/usage/logs"
end_date = datetime.now()
start_date = end_date - timedelta(days=30)
params = {
'start_hr': int(start_date.timestamp()),
'end_hr': int(end_date.timestamp())
}
data = self._make_request(endpoint, params)
if not data:
return {'total_gb': 0, 'daily_avg_gb': 0}
usage = data.get('usage', [])
total_ingested = 0
days_count = len(usage)
for day in usage:
total_ingested += day.get('ingested_events_bytes', 0)
total_gb = total_ingested / (1024**3) # Convert to GB
daily_avg_gb = total_gb / max(days_count, 1)
return {
'total_gb': total_gb,
'daily_avg_gb': daily_avg_gb,
'monthly_projected_gb': daily_avg_gb * 30
}
def get_unused_monitors(self) -> List[Dict[str, Any]]:
"""Find monitors that haven't alerted in 30+ days."""
endpoint = "/api/v1/monitor"
data = self._make_request(endpoint)
if not data:
return []
monitors = data if isinstance(data, list) else []
unused = []
now = datetime.now()
for monitor in monitors:
# Check if monitor has triggered recently
overall_state = monitor.get('overall_state')
modified = monitor.get('modified', '')
# If monitor has been in OK state and not modified in 30+ days
try:
if modified:
mod_date = datetime.fromisoformat(modified.replace('Z', '+00:00'))
days_since_modified = (now - mod_date.replace(tzinfo=None)).days
if days_since_modified > 30 and overall_state in ['OK', 'No Data']:
unused.append({
'name': monitor.get('name', 'Unknown'),
'id': monitor.get('id'),
'days_since_modified': days_since_modified,
'state': overall_state
})
except (ValueError, TypeError):
    # Skip monitors whose 'modified' timestamp cannot be parsed
    pass
return unused
def calculate_costs(self, usage_data: Dict[str, Any]) -> Dict[str, float]:
"""Calculate estimated monthly costs."""
costs = {
'infrastructure': 0,
'custom_metrics': 0,
'logs': 0,
'apm': 0,
'total': 0
}
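# Note: APM usage is not collected by this script, so the APM line item stays at 0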
# Infrastructure (assuming Pro tier)
if 'hosts' in usage_data:
costs['infrastructure'] = usage_data['hosts'].get('total_hosts', 0) * self.PRICING['infrastructure_pro']
# Custom metrics
if 'custom_metrics' in usage_data:
billable = usage_data['custom_metrics'].get('billable_metrics', 0)
costs['custom_metrics'] = billable * self.PRICING['custom_metric']
# Logs
if 'logs' in usage_data:
monthly_gb = usage_data['logs'].get('monthly_projected_gb', 0)
costs['logs'] = monthly_gb * self.PRICING['log_ingestion']
costs['total'] = sum(costs.values())
return costs
def get_recommendations(self, usage_data: Dict[str, Any]) -> List[Dict[str, Any]]:
"""Generate cost optimization recommendations."""
recommendations = []
# Custom metrics recommendations
if 'custom_metrics' in usage_data:
billable = usage_data['custom_metrics'].get('billable_metrics', 0)
if billable > 500:
savings = (billable * 0.3) * self.PRICING['custom_metric'] # Assume 30% reduction possible
recommendations.append({
'category': 'Custom Metrics',
'issue': f'High custom metric count: {billable:,} billable metrics',
'action': 'Review metric tags for high cardinality, consider aggregating or dropping unused metrics',
'potential_savings': f'${savings:.2f}/month'
})
# Container vs VM recommendations
if 'hosts' in usage_data:
hosts = usage_data['hosts'].get('total_hosts', 0)
containers = usage_data['hosts'].get('container_count', 0)
if containers > hosts * 10: # Many containers per host
savings = hosts * 0.2 * self.PRICING['infrastructure_pro']
recommendations.append({
'category': 'Infrastructure',
'issue': f'{containers:,} containers running on {hosts} hosts',
'action': 'Consider using container monitoring instead of host-based (can be 50-70% cheaper)',
'potential_savings': f'${savings:.2f}/month'
})
# Unused monitors
if 'unused_monitors' in usage_data:
count = len(usage_data['unused_monitors'])
if count > 10:
recommendations.append({
'category': 'Monitors',
'issue': f'{count} monitors unused for 30+ days',
'action': 'Delete or disable unused monitors to reduce noise and improve performance',
'potential_savings': 'Operational efficiency'
})
# Log volume recommendations
if 'logs' in usage_data:
monthly_gb = usage_data['logs'].get('monthly_projected_gb', 0)
if monthly_gb > 100:
savings = (monthly_gb * 0.4) * self.PRICING['log_ingestion'] # 40% reduction
recommendations.append({
'category': 'Logs',
'issue': f'High log volume: {monthly_gb:.1f} GB/month projected',
'action': 'Review log sources, implement sampling for debug logs, exclude health checks',
'potential_savings': f'${savings:.2f}/month'
})
# Migration recommendation if costs are high
costs = self.calculate_costs(usage_data)
if costs['total'] > 5000:
oss_cost = usage_data['hosts'].get('total_hosts', 0) * 15 # Rough estimate for self-hosted
savings = costs['total'] - oss_cost
recommendations.append({
'category': 'Strategic',
'issue': f'Total monthly cost: ${costs["total"]:.2f}',
'action': 'Consider migrating to open-source stack (Prometheus + Grafana + Loki)',
'potential_savings': f'${savings:.2f}/month (~{(savings/costs["total"]*100):.0f}% reduction)'
})
return recommendations
def print_usage_summary(usage_data: Dict[str, Any]):
"""Print usage summary."""
print("\n" + "="*70)
print("📊 DATADOG USAGE SUMMARY")
print("="*70)
# Infrastructure
if 'hosts' in usage_data:
hosts = usage_data['hosts']
print(f"\n🖥️ Infrastructure:")
print(f" Total Hosts: {hosts.get('total_hosts', 0):,}")
print(f" Agent Hosts: {hosts.get('agent_hosts', 0):,}")
print(f" AWS Hosts: {hosts.get('aws_hosts', 0):,}")
print(f" Azure Hosts: {hosts.get('azure_hosts', 0):,}")
print(f" GCP Hosts: {hosts.get('gcp_hosts', 0):,}")
print(f" Containers: {hosts.get('container_count', 0):,}")
# Custom Metrics
if 'custom_metrics' in usage_data:
metrics = usage_data['custom_metrics']
print(f"\n📈 Custom Metrics:")
print(f" Total: {metrics.get('total_custom_metrics', 0):,}")
print(f" Billable: {metrics.get('billable_metrics', 0):,} (first 100 free)")
# Logs
if 'logs' in usage_data:
logs = usage_data['logs']
print(f"\n📝 Logs:")
print(f" Daily Average: {logs.get('daily_avg_gb', 0):.2f} GB")
print(f" Monthly Projected: {logs.get('monthly_projected_gb', 0):.2f} GB")
# Unused Monitors
if 'unused_monitors' in usage_data:
print(f"\n🔔 Unused Monitors:")
print(f" Count: {len(usage_data['unused_monitors'])}")
def print_cost_breakdown(costs: Dict[str, float]):
"""Print cost breakdown."""
print("\n" + "="*70)
print("💰 ESTIMATED MONTHLY COSTS")
print("="*70)
print(f"\n Infrastructure Monitoring: ${costs['infrastructure']:,.2f}")
print(f" Custom Metrics: ${costs['custom_metrics']:,.2f}")
print(f" Log Management: ${costs['logs']:,.2f}")
print(f" APM: ${costs['apm']:,.2f}")
print(f" " + "-"*40)
print(f" TOTAL: ${costs['total']:,.2f}/month")
print(f" ${costs['total']*12:,.2f}/year")
def print_recommendations(recommendations: List[Dict]):
"""Print recommendations."""
print("\n" + "="*70)
print("💡 COST OPTIMIZATION RECOMMENDATIONS")
print("="*70)
total_savings = 0
for i, rec in enumerate(recommendations, 1):
print(f"\n{i}. {rec['category']}")
print(f" Issue: {rec['issue']}")
print(f" Action: {rec['action']}")
print(f" Potential Savings: {rec['potential_savings']}")
# Extract the dollar amount (strings look like "$1,234.56/month (~40% reduction)")
if '$' in rec['potential_savings']:
    try:
        amount = float(rec['potential_savings'].split('/month')[0].replace('$', '').replace(',', '').strip())
        total_savings += amount
    except ValueError:
        pass
if total_savings > 0:
print(f"\n{'='*70}")
print(f"💵 Total Potential Monthly Savings: ${total_savings:,.2f}")
print(f"💵 Total Potential Annual Savings: ${total_savings*12:,.2f}")
print(f"{'='*70}")
def main():
parser = argparse.ArgumentParser(
description="Analyze Datadog usage and identify cost optimization opportunities",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Analyze current usage
python3 datadog_cost_analyzer.py \\
--api-key DD_API_KEY \\
--app-key DD_APP_KEY
# Use environment variables
export DD_API_KEY=your_api_key
export DD_APP_KEY=your_app_key
python3 datadog_cost_analyzer.py
# Specify site (for EU)
python3 datadog_cost_analyzer.py --site datadoghq.eu
Required Datadog Permissions:
- usage_read
- monitors_read
"""
)
parser.add_argument('--api-key',
default=os.environ.get('DD_API_KEY'),
help='Datadog API key (or set DD_API_KEY env var)')
parser.add_argument('--app-key',
default=os.environ.get('DD_APP_KEY'),
help='Datadog Application key (or set DD_APP_KEY env var)')
parser.add_argument('--site',
default='datadoghq.com',
help='Datadog site (default: datadoghq.com, EU: datadoghq.eu)')
args = parser.parse_args()
if not args.api_key or not args.app_key:
print("❌ Error: API key and Application key required")
print(" Set via --api-key and --app-key flags or DD_API_KEY and DD_APP_KEY env vars")
sys.exit(1)
print("🔍 Analyzing Datadog usage...")
print(" This may take 30-60 seconds...\n")
analyzer = DatadogCostAnalyzer(args.api_key, args.app_key, args.site)
# Gather usage data
usage_data = {}
print(" ⏳ Fetching infrastructure usage...")
usage_data['hosts'] = analyzer.get_infrastructure_hosts()
print(" ⏳ Fetching custom metrics...")
usage_data['custom_metrics'] = analyzer.get_custom_metrics()
print(" ⏳ Fetching log usage...")
usage_data['logs'] = analyzer.get_log_usage()
print(" ⏳ Finding unused monitors...")
usage_data['unused_monitors'] = analyzer.get_unused_monitors()
# Calculate costs
costs = analyzer.calculate_costs(usage_data)
# Generate recommendations
recommendations = analyzer.get_recommendations(usage_data)
# Print results
print_usage_summary(usage_data)
print_cost_breakdown(costs)
print_recommendations(recommendations)
print("\n" + "="*70)
print("✅ Analysis complete!")
print("="*70)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,297 @@
#!/usr/bin/env python3
"""
Validate health check endpoints and analyze response quality.
Checks: response time, status code, response format, dependencies.
"""
import argparse
import sys
import time
import json
from typing import Dict, List, Any, Optional
from urllib.parse import urlparse
try:
import requests
except ImportError:
print("⚠️ Warning: 'requests' library not found. Install with: pip install requests")
sys.exit(1)
class HealthCheckValidator:
def __init__(self, timeout: int = 5):
self.timeout = timeout
self.results = []
def validate_endpoint(self, url: str) -> Dict[str, Any]:
"""Validate a health check endpoint."""
result = {
"url": url,
"timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
"checks": [],
"warnings": [],
"errors": []
}
try:
# Make request
start_time = time.time()
response = requests.get(url, timeout=self.timeout, verify=True)
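# TLS verification stays enabled; endpoints with self-signed certificates will surface as an SSLError below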
response_time = time.time() - start_time
result["status_code"] = response.status_code
result["response_time"] = response_time
# Check 1: Status code
if response.status_code == 200:
result["checks"].append("✅ Status code is 200")
else:
result["errors"].append(f"❌ Unexpected status code: {response.status_code} (expected 200)")
# Check 2: Response time
if response_time < 1.0:
result["checks"].append(f"✅ Response time: {response_time:.3f}s (< 1s)")
elif response_time < 3.0:
result["warnings"].append(f"⚠️ Slow response time: {response_time:.3f}s (should be < 1s)")
else:
result["errors"].append(f"❌ Very slow response time: {response_time:.3f}s (should be < 1s)")
# Check 3: Content type
content_type = response.headers.get('Content-Type', '')
if 'application/json' in content_type:
result["checks"].append("✅ Content-Type is application/json")
# Try to parse JSON
try:
data = response.json()
result["response_data"] = data
# Check for common health check fields
self._validate_json_structure(data, result)
except json.JSONDecodeError:
result["errors"].append("❌ Invalid JSON response")
elif 'text/plain' in content_type:
result["warnings"].append("⚠️ Content-Type is text/plain (JSON recommended)")
result["response_data"] = response.text
else:
result["warnings"].append(f"⚠️ Unexpected Content-Type: {content_type}")
# Check 4: Response headers
self._validate_headers(response.headers, result)
except requests.exceptions.Timeout:
result["errors"].append(f"❌ Request timeout (> {self.timeout}s)")
result["status_code"] = None
result["response_time"] = None
except requests.exceptions.ConnectionError:
result["errors"].append("❌ Connection error (endpoint unreachable)")
result["status_code"] = None
result["response_time"] = None
except requests.exceptions.SSLError:
result["errors"].append("❌ SSL certificate validation failed")
result["status_code"] = None
result["response_time"] = None
except Exception as e:
result["errors"].append(f"❌ Unexpected error: {str(e)}")
result["status_code"] = None
result["response_time"] = None
# Overall status
if result["errors"]:
result["overall_status"] = "UNHEALTHY"
elif result["warnings"]:
result["overall_status"] = "DEGRADED"
else:
result["overall_status"] = "HEALTHY"
return result
def _validate_json_structure(self, data: Dict[str, Any], result: Dict[str, Any]):
"""Validate JSON health check structure."""
# Check for status field
if "status" in data:
status = data["status"]
if status in ["ok", "healthy", "up", "pass"]:
result["checks"].append(f"✅ Status field present: '{status}'")
else:
result["warnings"].append(f"⚠️ Status field has unexpected value: '{status}'")
else:
result["warnings"].append("⚠️ Missing 'status' field (recommended)")
# Check for version/build info
if any(key in data for key in ["version", "build", "commit", "timestamp"]):
result["checks"].append("✅ Version/build information present")
else:
result["warnings"].append("⚠️ No version/build information (recommended)")
# Check for dependencies
if "dependencies" in data or "checks" in data or "components" in data:
result["checks"].append("✅ Dependency checks present")
# Validate dependency structure
deps = data.get("dependencies") or data.get("checks") or data.get("components")
if isinstance(deps, dict):
unhealthy_deps = []
for name, info in deps.items():
if isinstance(info, dict):
dep_status = info.get("status", "unknown")
if dep_status not in ["ok", "healthy", "up", "pass"]:
unhealthy_deps.append(name)
elif isinstance(info, str):
if info not in ["ok", "healthy", "up", "pass"]:
unhealthy_deps.append(name)
if unhealthy_deps:
result["warnings"].append(f"⚠️ Unhealthy dependencies: {', '.join(unhealthy_deps)}")
else:
result["checks"].append(f"✅ All dependencies healthy ({len(deps)} checked)")
else:
result["warnings"].append("⚠️ No dependency checks (recommended for production services)")
# Check for uptime/metrics
if any(key in data for key in ["uptime", "metrics", "stats"]):
result["checks"].append("✅ Metrics/stats present")
def _validate_headers(self, headers: Dict[str, str], result: Dict[str, Any]):
"""Validate response headers."""
# Check for caching headers
cache_control = headers.get('Cache-Control', '')
if 'no-cache' in cache_control or 'no-store' in cache_control:
result["checks"].append("✅ Caching disabled (Cache-Control: no-cache)")
else:
result["warnings"].append("⚠️ Caching not explicitly disabled (add Cache-Control: no-cache)")
def validate_multiple(self, urls: List[str]) -> List[Dict[str, Any]]:
"""Validate multiple health check endpoints."""
results = []
for url in urls:
print(f"🔍 Checking: {url}")
result = self.validate_endpoint(url)
results.append(result)
return results
def print_result(result: Dict[str, Any], verbose: bool = False):
"""Print validation result."""
status_emoji = {
"HEALTHY": "✅",
"DEGRADED": "⚠️",
"UNHEALTHY": "❌"
}
print("\n" + "="*60)
emoji = status_emoji.get(result["overall_status"], "")
print(f"{emoji} {result['overall_status']}: {result['url']}")
print("="*60)
if result.get("status_code"):
print(f"\n📊 Status Code: {result['status_code']}")
print(f"⏱️ Response Time: {result['response_time']:.3f}s")
# Print checks
if result["checks"]:
print(f"\n✅ Passed Checks:")
for check in result["checks"]:
print(f" {check}")
# Print warnings
if result["warnings"]:
print(f"\n⚠️ Warnings:")
for warning in result["warnings"]:
print(f" {warning}")
# Print errors
if result["errors"]:
print(f"\n❌ Errors:")
for error in result["errors"]:
print(f" {error}")
# Print response data if verbose
if verbose and "response_data" in result:
print(f"\n📄 Response Data:")
if isinstance(result["response_data"], dict):
print(json.dumps(result["response_data"], indent=2))
else:
print(result["response_data"])
print("="*60)
def print_summary(results: List[Dict[str, Any]]):
"""Print summary of multiple validations."""
print("\n" + "="*60)
print("📊 HEALTH CHECK VALIDATION SUMMARY")
print("="*60)
healthy = sum(1 for r in results if r["overall_status"] == "HEALTHY")
degraded = sum(1 for r in results if r["overall_status"] == "DEGRADED")
unhealthy = sum(1 for r in results if r["overall_status"] == "UNHEALTHY")
print(f"\n✅ Healthy: {healthy}/{len(results)}")
print(f"⚠️ Degraded: {degraded}/{len(results)}")
print(f"❌ Unhealthy: {unhealthy}/{len(results)}")
# Average only over endpoints that actually returned a response
response_times = [r["response_time"] for r in results if r.get("response_time")]
if response_times:
    avg_response_time = sum(response_times) / len(response_times)
    print(f"\n⏱️ Average Response Time: {avg_response_time:.3f}s")
print("="*60)
def main():
parser = argparse.ArgumentParser(
description="Validate health check endpoints",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Check a single endpoint
python3 health_check_validator.py https://api.example.com/health
# Check multiple endpoints
python3 health_check_validator.py \\
https://api.example.com/health \\
https://api.example.com/readiness
# Verbose output with response data
python3 health_check_validator.py https://api.example.com/health --verbose
# Custom timeout
python3 health_check_validator.py https://api.example.com/health --timeout 10
Best Practices Checked:
✓ Returns 200 status code
✓ Response time < 1 second
✓ Returns JSON format
✓ Contains 'status' field
✓ Includes version/build info
✓ Checks dependencies
✓ Includes metrics
✓ Disables caching
"""
)
parser.add_argument('urls', nargs='+', help='Health check endpoint URL(s)')
parser.add_argument('--timeout', type=int, default=5, help='Request timeout in seconds (default: 5)')
parser.add_argument('--verbose', action='store_true', help='Show detailed response data')
args = parser.parse_args()
validator = HealthCheckValidator(timeout=args.timeout)
results = validator.validate_multiple(args.urls)
# Print individual results
for result in results:
print_result(result, args.verbose)
# Print summary if multiple endpoints
if len(results) > 1:
print_summary(results)
if __name__ == "__main__":
main()

321
scripts/log_analyzer.py Normal file
View File

@@ -0,0 +1,321 @@
#!/usr/bin/env python3
"""
Parse and analyze logs for patterns, errors, and anomalies.
Supports: error detection, frequency analysis, pattern matching.
"""
import argparse
import sys
import re
import json
from collections import Counter, defaultdict
from datetime import datetime
from typing import Dict, List, Any, Optional
from pathlib import Path
try:
from tabulate import tabulate
except ImportError:
tabulate = None
class LogAnalyzer:
# Common log level patterns
LOG_LEVELS = {
'ERROR': r'\b(ERROR|Error|error)\b',
'WARN': r'\b(WARN|Warning|warn|warning)\b',
'INFO': r'\b(INFO|Info|info)\b',
'DEBUG': r'\b(DEBUG|Debug|debug)\b',
'FATAL': r'\b(FATAL|Fatal|fatal|CRITICAL|Critical)\b'
}
# Common error patterns
ERROR_PATTERNS = {
'exception': r'Exception|exception|EXCEPTION',
'stack_trace': r'\s+at\s+.*\(.*:\d+\)',
'http_error': r'\b[45]\d{2}\b', # 4xx and 5xx HTTP codes
'timeout': r'timeout|timed out|TIMEOUT',
'connection_refused': r'connection refused|ECONNREFUSED',
'out_of_memory': r'OutOfMemoryError|OOM|out of memory',
'null_pointer': r'NullPointerException|null pointer|NPE',
'database_error': r'SQLException|database error|DB error'
}
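# Note: these patterns are broad and may produce false positives
# (e.g. the http_error regex matches any standalone number in the 400-599 range)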
def __init__(self, log_file: str):
self.log_file = log_file
self.lines = []
self.log_levels = Counter()
self.error_patterns = Counter()
self.timestamps = []
def parse_file(self) -> bool:
"""Parse log file."""
try:
with open(self.log_file, 'r', encoding='utf-8', errors='ignore') as f:
self.lines = f.readlines()
return True
except Exception as e:
print(f"❌ Error reading file: {e}")
return False
def analyze_log_levels(self):
"""Count log levels."""
for line in self.lines:
for level, pattern in self.LOG_LEVELS.items():
if re.search(pattern, line):
self.log_levels[level] += 1
break # Count each line only once
def analyze_error_patterns(self):
"""Detect common error patterns."""
for line in self.lines:
for pattern_name, pattern in self.ERROR_PATTERNS.items():
if re.search(pattern, line, re.IGNORECASE):
self.error_patterns[pattern_name] += 1
def extract_timestamps(self, timestamp_pattern: Optional[str] = None):
"""Extract timestamps from logs."""
if not timestamp_pattern:
# Common timestamp patterns
patterns = [
r'\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}', # ISO format
r'\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}', # Apache format
r'\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}', # Syslog format
]
else:
patterns = [timestamp_pattern]
for line in self.lines:
for pattern in patterns:
match = re.search(pattern, line)
if match:
self.timestamps.append(match.group())
break
def find_error_lines(self, context: int = 2) -> List[Dict[str, Any]]:
"""Find error lines with context."""
errors = []
for i, line in enumerate(self.lines):
# Check if line contains error keywords
is_error = any(re.search(pattern, line, re.IGNORECASE)
for pattern in [self.LOG_LEVELS['ERROR'], self.LOG_LEVELS['FATAL']])
if is_error:
# Get context lines
start = max(0, i - context)
end = min(len(self.lines), i + context + 1)
context_lines = self.lines[start:end]
errors.append({
'line_number': i + 1,
'line': line.strip(),
'context': ''.join(context_lines)
})
return errors
def analyze_frequency(self, time_window_minutes: int = 5) -> Dict[str, Any]:
"""Analyze log frequency over time."""
if not self.timestamps:
return {"error": "No timestamps found"}
# This is a simplified version - in production you'd parse actual timestamps
total_lines = len(self.lines)
if self.timestamps:
time_span = len(self.timestamps)
avg_per_window = total_lines / max(1, time_span / time_window_minutes)
else:
avg_per_window = 0
return {
"total_lines": total_lines,
"timestamps_found": len(self.timestamps),
"avg_per_window": avg_per_window
}
def extract_unique_messages(self, pattern: str) -> List[str]:
"""Extract unique messages matching a pattern."""
matches = []
seen = set()
for line in self.lines:
match = re.search(pattern, line, re.IGNORECASE)
if match:
msg = match.group() if match.lastindex is None else match.group(1)
if msg not in seen:
matches.append(msg)
seen.add(msg)
return matches
def find_stack_traces(self) -> List[Dict[str, Any]]:
"""Extract complete stack traces."""
stack_traces = []
current_trace = []
in_trace = False
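# Simple state machine: a line mentioning an Exception/Error opens a trace, subsequent
# "at ..." lines extend it, and any other line closes it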
for i, line in enumerate(self.lines):
# Start of stack trace
if re.search(r'Exception|Error.*:', line):
if current_trace:
stack_traces.append({
'line_start': i - len(current_trace) + 1,
'trace': '\n'.join(current_trace)
})
current_trace = [line.strip()]
in_trace = True
# Stack trace continuation
elif in_trace and re.search(r'^\s+at\s+', line):
current_trace.append(line.strip())
# End of stack trace
elif in_trace:
if current_trace:
stack_traces.append({
'line_start': i - len(current_trace) + 1,
'trace': '\n'.join(current_trace)
})
current_trace = []
in_trace = False
# Add last trace if exists
if current_trace:
stack_traces.append({
'line_start': len(self.lines) - len(current_trace) + 1,
'trace': '\n'.join(current_trace)
})
return stack_traces
def print_analysis_results(analyzer: LogAnalyzer, show_errors: bool = False,
show_traces: bool = False):
"""Print analysis results."""
print("\n" + "="*60)
print("📝 LOG ANALYSIS RESULTS")
print("="*60)
print(f"\n📁 File: {analyzer.log_file}")
print(f"📊 Total Lines: {len(analyzer.lines):,}")
# Log levels
if analyzer.log_levels:
print(f"\n{'='*60}")
print("📊 LOG LEVEL DISTRIBUTION:")
print(f"{'='*60}")
level_emoji = {
'FATAL': '🔴',
'ERROR': '❌',
'WARN': '⚠️',
'INFO': 'ℹ️',
'DEBUG': '🐛'
}
for level, count in analyzer.log_levels.most_common():
emoji = level_emoji.get(level, '')
percentage = (count / len(analyzer.lines)) * 100
print(f"{emoji} {level:10s}: {count:6,} ({percentage:5.1f}%)")
# Error patterns
if analyzer.error_patterns:
print(f"\n{'='*60}")
print("🔍 ERROR PATTERNS DETECTED:")
print(f"{'='*60}")
for pattern, count in analyzer.error_patterns.most_common(10):
print(f"{pattern:20s}: {count:,} occurrences")
# Timestamps
if analyzer.timestamps:
print(f"\n{'='*60}")
print(f"⏰ Timestamps Found: {len(analyzer.timestamps):,}")
print(f" First: {analyzer.timestamps[0]}")
print(f" Last: {analyzer.timestamps[-1]}")
# Error lines
if show_errors:
errors = analyzer.find_error_lines(context=1)
if errors:
print(f"\n{'='*60}")
print(f"❌ ERROR LINES (showing first 10 of {len(errors)}):")
print(f"{'='*60}")
for error in errors[:10]:
print(f"\nLine {error['line_number']}:")
print(f" {error['line']}")
# Stack traces
if show_traces:
traces = analyzer.find_stack_traces()
if traces:
print(f"\n{'='*60}")
print(f"📚 STACK TRACES FOUND: {len(traces)}")
print(f"{'='*60}")
for i, trace in enumerate(traces[:5], 1):
print(f"\nTrace {i} (starting at line {trace['line_start']}):")
print(trace['trace'])
if i < len(traces):
print("\n" + "-"*60)
print("\n" + "="*60)
def main():
parser = argparse.ArgumentParser(
description="Analyze log files for errors, patterns, and anomalies",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Basic analysis
python3 log_analyzer.py application.log
# Show error lines with context
python3 log_analyzer.py application.log --show-errors
# Show stack traces
python3 log_analyzer.py application.log --show-traces
# Full analysis
python3 log_analyzer.py application.log --show-errors --show-traces
Features:
• Log level distribution (ERROR, WARN, INFO, DEBUG, FATAL)
• Common error pattern detection
• Timestamp extraction
• Error line identification with context
• Stack trace extraction
• Frequency analysis
"""
)
parser.add_argument('log_file', help='Path to log file')
parser.add_argument('--show-errors', action='store_true', help='Show error lines')
parser.add_argument('--show-traces', action='store_true', help='Show stack traces')
parser.add_argument('--timestamp-pattern', help='Custom regex for timestamp extraction')
args = parser.parse_args()
if not Path(args.log_file).exists():
print(f"❌ File not found: {args.log_file}")
sys.exit(1)
print(f"🔍 Analyzing log file: {args.log_file}")
analyzer = LogAnalyzer(args.log_file)
if not analyzer.parse_file():
sys.exit(1)
# Perform analysis
analyzer.analyze_log_levels()
analyzer.analyze_error_patterns()
analyzer.extract_timestamps(args.timestamp_pattern)
# Print results
print_analysis_results(analyzer, args.show_errors, args.show_traces)
if __name__ == "__main__":
main()

365
scripts/slo_calculator.py Normal file
View File

@@ -0,0 +1,365 @@
#!/usr/bin/env python3
"""
Calculate SLO compliance, error budgets, and burn rates.
Supports availability SLOs and latency SLOs.
"""
import argparse
import sys
from datetime import datetime, timedelta
from typing import Dict, Any, Optional
try:
from tabulate import tabulate
except ImportError:
print("⚠️ Warning: 'tabulate' library not found. Install with: pip install tabulate")
tabulate = None
class SLOCalculator:
# SLO targets and allowed downtime per period
SLO_TARGETS = {
"90.0": {"year": 36.5, "month": 3.0, "week": 0.7, "day": 0.1}, # days
"95.0": {"year": 18.25, "month": 1.5, "week": 0.35, "day": 0.05},
"99.0": {"year": 3.65, "month": 0.3, "week": 0.07, "day": 0.01},
"99.5": {"year": 1.83, "month": 0.15, "week": 0.035, "day": 0.005},
"99.9": {"year": 0.365, "month": 0.03, "week": 0.007, "day": 0.001},
"99.95": {"year": 0.183, "month": 0.015, "week": 0.0035, "day": 0.0005},
"99.99": {"year": 0.0365, "month": 0.003, "week": 0.0007, "day": 0.0001},
}
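# Values are allowed downtime in days, e.g. a 99.9% SLO permits 0.365 days (~8.76 hours) per year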
def __init__(self, slo_target: float, period_days: int = 30):
"""
Initialize SLO calculator.
Args:
slo_target: SLO target percentage (e.g., 99.9)
period_days: Time period in days (default: 30)
"""
self.slo_target = slo_target
self.period_days = period_days
self.error_budget_minutes = self.calculate_error_budget_minutes()
def calculate_error_budget_minutes(self) -> float:
"""Calculate error budget in minutes for the period."""
total_minutes = self.period_days * 24 * 60
allowed_error_rate = (100 - self.slo_target) / 100
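# Example: a 99.9% SLO over 30 days -> 43,200 minutes * 0.001 = 43.2 minutes of error budget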
return total_minutes * allowed_error_rate
def calculate_availability_slo(self, total_requests: int, failed_requests: int) -> Dict[str, Any]:
"""
Calculate availability SLO compliance.
Args:
total_requests: Total number of requests
failed_requests: Number of failed requests
Returns:
Dict with SLO compliance metrics
"""
if total_requests == 0:
return {
"error": "No requests in the period",
"slo_met": False
}
success_rate = ((total_requests - failed_requests) / total_requests) * 100
error_rate = (failed_requests / total_requests) * 100
# Calculate error budget consumption
allowed_failures = total_requests * ((100 - self.slo_target) / 100)
error_budget_consumed = (failed_requests / allowed_failures) * 100 if allowed_failures > 0 else float('inf')
error_budget_remaining = max(0, 100 - error_budget_consumed)
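# error_budget_consumed is the share of allowed failures already used, so 100 means
# the period's budget is fully spent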
# Determine if SLO is met
slo_met = success_rate >= self.slo_target
return {
"slo_target": self.slo_target,
"period_days": self.period_days,
"total_requests": total_requests,
"failed_requests": failed_requests,
"success_requests": total_requests - failed_requests,
"success_rate": success_rate,
"error_rate": error_rate,
"slo_met": slo_met,
"error_budget_total": allowed_failures,
"error_budget_consumed": error_budget_consumed,
"error_budget_remaining": error_budget_remaining,
"margin": success_rate - self.slo_target
}
def calculate_latency_slo(self, total_requests: int, requests_exceeding_threshold: int) -> Dict[str, Any]:
"""
Calculate latency SLO compliance.
Args:
total_requests: Total number of requests
requests_exceeding_threshold: Number of requests exceeding latency threshold
Returns:
Dict with SLO compliance metrics
"""
if total_requests == 0:
return {
"error": "No requests in the period",
"slo_met": False
}
within_threshold_rate = ((total_requests - requests_exceeding_threshold) / total_requests) * 100
# Calculate error budget consumption
allowed_slow_requests = total_requests * ((100 - self.slo_target) / 100)
error_budget_consumed = (requests_exceeding_threshold / allowed_slow_requests) * 100 if allowed_slow_requests > 0 else float('inf')
error_budget_remaining = max(0, 100 - error_budget_consumed)
slo_met = within_threshold_rate >= self.slo_target
return {
"slo_target": self.slo_target,
"period_days": self.period_days,
"total_requests": total_requests,
"requests_exceeding_threshold": requests_exceeding_threshold,
"requests_within_threshold": total_requests - requests_exceeding_threshold,
"within_threshold_rate": within_threshold_rate,
"slo_met": slo_met,
"error_budget_total": allowed_slow_requests,
"error_budget_consumed": error_budget_consumed,
"error_budget_remaining": error_budget_remaining,
"margin": within_threshold_rate - self.slo_target
}
def calculate_burn_rate(self, errors_in_window: int, requests_in_window: int, window_hours: float) -> Dict[str, Any]:
"""
Calculate error budget burn rate.
Args:
errors_in_window: Number of errors in the time window
requests_in_window: Total requests in the time window
window_hours: Size of the time window in hours
Returns:
Dict with burn rate metrics
"""
if requests_in_window == 0:
return {"error": "No requests in window"}
# Calculate actual error rate in this window
actual_error_rate = (errors_in_window / requests_in_window) * 100
# Calculate allowed error rate for SLO
allowed_error_rate = 100 - self.slo_target
# Burn rate = actual error rate / allowed error rate
burn_rate = actual_error_rate / allowed_error_rate if allowed_error_rate > 0 else float('inf')
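# A burn rate of 1.0 consumes the budget exactly over the SLO period; values above 1 exhaust it early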
# Calculate time to exhaustion: the full SLO period divided by the burn rate
if burn_rate > 0:
    hours_to_exhaustion = (self.period_days * 24) / burn_rate
else:
    hours_to_exhaustion = float('inf')
# Determine severity
if burn_rate >= 14.4:  # fast burn: exhausts a 30-day budget in ~2 days
    severity = "critical"
elif burn_rate >= 6:  # slow burn: exhausts a 30-day budget in ~5 days
    severity = "warning"
elif burn_rate >= 1:
severity = "elevated"
else:
severity = "normal"
return {
"window_hours": window_hours,
"requests_in_window": requests_in_window,
"errors_in_window": errors_in_window,
"actual_error_rate": actual_error_rate,
"allowed_error_rate": allowed_error_rate,
"burn_rate": burn_rate,
"hours_to_exhaustion": hours_to_exhaustion,
"severity": severity
}
@staticmethod
def print_slo_table():
"""Print table of common SLO targets and allowed downtime."""
if not tabulate:
print("Install tabulate for formatted output: pip install tabulate")
return
print("\n📊 SLO TARGETS AND ALLOWED DOWNTIME")
print("="*60)
headers = ["SLO", "Year", "Month", "Week", "Day"]
rows = []
for slo, downtimes in sorted(SLOCalculator.SLO_TARGETS.items(), key=lambda kv: float(kv[0]), reverse=True):
    # Show sub-day allowances in hours or minutes so tight SLOs don't render as "0.00 days"
    row = [f"{slo}%"]
    for period in ("year", "month", "week", "day"):
        days = downtimes[period]
        if days >= 1:
            row.append(f"{days:.2f} days")
        elif days * 24 >= 1:
            row.append(f"{days * 24:.1f} hours")
        else:
            row.append(f"{days * 24 * 60:.1f} min")
    rows.append(row)
print(tabulate(rows, headers=headers, tablefmt="grid"))
def print_availability_results(results: Dict[str, Any]):
"""Print availability SLO results."""
print("\n" + "="*60)
print("📊 AVAILABILITY SLO COMPLIANCE")
print("="*60)
if "error" in results:
print(f"\n❌ Error: {results['error']}")
return
status_emoji = "" if results['slo_met'] else ""
print(f"\n{status_emoji} SLO Status: {'MET' if results['slo_met'] else 'VIOLATED'}")
print(f" Target: {results['slo_target']}%")
print(f" Actual: {results['success_rate']:.3f}%")
print(f" Margin: {results['margin']:+.3f}%")
print(f"\n📈 Request Statistics:")
print(f" Total Requests: {results['total_requests']:,}")
print(f" Successful: {results['success_requests']:,}")
print(f" Failed: {results['failed_requests']:,}")
print(f" Error Rate: {results['error_rate']:.3f}%")
print(f"\n💰 Error Budget:")
budget_emoji = "" if results['error_budget_remaining'] > 20 else "⚠️" if results['error_budget_remaining'] > 0 else ""
print(f" {budget_emoji} Remaining: {results['error_budget_remaining']:.1f}%")
print(f" Consumed: {results['error_budget_consumed']:.1f}%")
print(f" Allowed Failures: {results['error_budget_total']:.0f}")
print("\n" + "="*60)
def print_burn_rate_results(results: Dict[str, Any]):
"""Print burn rate results."""
print("\n" + "="*60)
print("🔥 ERROR BUDGET BURN RATE")
print("="*60)
if "error" in results:
print(f"\n❌ Error: {results['error']}")
return
severity_emoji = {
"critical": "🔴",
"warning": "🟡",
"elevated": "🟠",
"normal": "🟢"
}
print(f"\n{severity_emoji.get(results['severity'], '')} Severity: {results['severity'].upper()}")
print(f" Burn Rate: {results['burn_rate']:.2f}x")
print(f" Time to Exhaustion: {results['hours_to_exhaustion']:.1f} hours ({results['hours_to_exhaustion']/24:.1f} days)")
print(f"\n📊 Window Statistics:")
print(f" Window: {results['window_hours']} hours")
print(f" Requests: {results['requests_in_window']:,}")
print(f" Errors: {results['errors_in_window']:,}")
print(f" Actual Error Rate: {results['actual_error_rate']:.3f}%")
print(f" Allowed Error Rate: {results['allowed_error_rate']:.3f}%")
print("\n" + "="*60)
def main():
parser = argparse.ArgumentParser(
description="Calculate SLO compliance and error budgets",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Show SLO reference table
python3 slo_calculator.py --table
# Calculate availability SLO
python3 slo_calculator.py availability \\
--slo 99.9 \\
--total-requests 1000000 \\
--failed-requests 1500 \\
--period-days 30
# Calculate latency SLO
python3 slo_calculator.py latency \\
--slo 99.5 \\
--total-requests 500000 \\
--slow-requests 3000 \\
--period-days 7
# Calculate burn rate
python3 slo_calculator.py burn-rate \\
--slo 99.9 \\
--errors 50 \\
--requests 10000 \\
--window-hours 1
"""
)
parser.add_argument('mode', nargs='?', choices=['availability', 'latency', 'burn-rate'],
help='Calculation mode')
parser.add_argument('--table', action='store_true', help='Show SLO reference table')
parser.add_argument('--slo', type=float, help='SLO target percentage (e.g., 99.9)')
parser.add_argument('--period-days', type=int, default=30, help='Period in days (default: 30)')
# Availability SLO arguments
parser.add_argument('--total-requests', type=int, help='Total number of requests')
parser.add_argument('--failed-requests', type=int, help='Number of failed requests')
# Latency SLO arguments
parser.add_argument('--slow-requests', type=int, help='Number of requests exceeding threshold')
# Burn rate arguments
parser.add_argument('--errors', type=int, help='Number of errors in window')
parser.add_argument('--requests', type=int, help='Number of requests in window')
parser.add_argument('--window-hours', type=float, help='Window size in hours')
args = parser.parse_args()
# Show table if requested
if args.table:
SLOCalculator.print_slo_table()
return
if not args.mode:
parser.print_help()
return
if not args.slo:
print("❌ --slo required")
sys.exit(1)
calculator = SLOCalculator(args.slo, args.period_days)
if args.mode == 'availability':
if not args.total_requests or args.failed_requests is None:
print("❌ --total-requests and --failed-requests required")
sys.exit(1)
results = calculator.calculate_availability_slo(args.total_requests, args.failed_requests)
print_availability_results(results)
elif args.mode == 'latency':
if not args.total_requests or args.slow_requests is None:
print("❌ --total-requests and --slow-requests required")
sys.exit(1)
results = calculator.calculate_latency_slo(args.total_requests, args.slow_requests)
print_availability_results(results) # Same format
elif args.mode == 'burn-rate':
if not all([args.errors is not None, args.requests, args.window_hours]):
print("❌ --errors, --requests, and --window-hours required")
sys.exit(1)
results = calculator.calculate_burn_rate(args.errors, args.requests, args.window_hours)
print_burn_rate_results(results)
if __name__ == "__main__":
main()