Initial commit

12  .claude-plugin/plugin.json  Normal file
@@ -0,0 +1,12 @@
{
  "name": "monitoring-observability",
  "description": "Monitoring and observability strategy, metrics/logs/traces systems, SLOs/error budgets, Prometheus/Grafana/Loki, OpenTelemetry, and tool comparison",
  "version": "0.0.0-2025.11.28",
  "author": {
    "name": "Ahmad Asmar",
    "email": "zhongweili@tubi.tv"
  },
  "skills": [
    "./"
  ]
}
3  README.md  Normal file
@@ -0,0 +1,3 @@
# monitoring-observability

Monitoring and observability strategy, metrics/logs/traces systems, SLOs/error budgets, Prometheus/Grafana/Loki, OpenTelemetry, and tool comparison
869  SKILL.md  Normal file
@@ -0,0 +1,869 @@
---
name: monitoring-observability
description: Monitoring and observability strategy, implementation, and troubleshooting. Use for designing metrics/logs/traces systems, setting up Prometheus/Grafana/Loki, creating alerts and dashboards, calculating SLOs and error budgets, analyzing performance issues, and comparing monitoring tools (Datadog, ELK, CloudWatch). Covers the Four Golden Signals, RED/USE methods, OpenTelemetry instrumentation, log aggregation patterns, and distributed tracing.
---

# Monitoring & Observability

## Overview

This skill provides comprehensive guidance for monitoring and observability workflows including metrics design, log aggregation, distributed tracing, alerting strategies, SLO/SLA management, and tool selection.
|
||||
|
||||
**When to use this skill**:
|
||||
- Setting up monitoring for new services
|
||||
- Designing alerts and dashboards
|
||||
- Troubleshooting performance issues
|
||||
- Implementing SLO tracking and error budgets
|
||||
- Choosing between monitoring tools
|
||||
- Integrating OpenTelemetry instrumentation
|
||||
- Analyzing metrics, logs, and traces
|
||||
- Optimizing Datadog costs and finding waste
|
||||
- Migrating from Datadog to open-source stack
|
||||
|
||||
---
|
||||
|
||||
## Core Workflow: Observability Implementation
|
||||
|
||||
Use this decision tree to determine your starting point:
|
||||
|
||||
```
|
||||
Are you setting up monitoring from scratch?
|
||||
├─ YES → Start with "1. Design Metrics Strategy"
|
||||
└─ NO → Do you have an existing issue?
|
||||
├─ YES → Go to "9. Troubleshooting & Analysis"
|
||||
└─ NO → Are you improving existing monitoring?
|
||||
├─ Alerts → Go to "3. Alert Design"
|
||||
├─ Dashboards → Go to "4. Dashboard & Visualization"
|
||||
├─ SLOs → Go to "5. SLO & Error Budgets"
|
||||
├─ Tool selection → Read references/tool_comparison.md
|
||||
└─ Using Datadog? High costs? → Go to "7. Datadog Cost Optimization & Migration"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 1. Design Metrics Strategy
|
||||
|
||||
### Start with The Four Golden Signals
|
||||
|
||||
Every service should monitor:
|
||||
|
||||
1. **Latency**: Response time (p50, p95, p99)
|
||||
2. **Traffic**: Requests per second
|
||||
3. **Errors**: Failure rate
|
||||
4. **Saturation**: Resource utilization
|
||||
|
||||
**For request-driven services**, use the **RED Method**:
|
||||
- **R**ate: Requests/sec
|
||||
- **E**rrors: Error rate
|
||||
- **D**uration: Response time
|
||||
|
||||
**For infrastructure resources**, use the **USE Method**:
|
||||
- **U**tilization: % time busy
|
||||
- **S**aturation: Queue depth
- **E**rrors: Error count
|
||||
|
||||
**Quick Start - Web Application Example**:
|
||||
```promql
|
||||
# Rate (requests/sec)
|
||||
sum(rate(http_requests_total[5m]))
|
||||
|
||||
# Errors (error rate %)
|
||||
sum(rate(http_requests_total{status=~"5.."}[5m]))
|
||||
/
|
||||
sum(rate(http_requests_total[5m])) * 100
|
||||
|
||||
# Duration (p95 latency)
|
||||
histogram_quantile(0.95,
|
||||
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
|
||||
)
|
||||
```
|
||||
|
||||
### Deep Dive: Metric Design
|
||||
|
||||
For comprehensive metric design guidance including:
|
||||
- Metric types (counter, gauge, histogram, summary)
|
||||
- Cardinality best practices
|
||||
- Naming conventions
|
||||
- Dashboard design principles
|
||||
|
||||
**→ Read**: [references/metrics_design.md](references/metrics_design.md)
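As a concrete companion to the queries above, here is a minimal Python sketch (assuming the `prometheus_client` package) that exposes a request counter and a latency histogram matching those metric names; the handler, labels, and port are illustrative choices, not a prescribed layout:

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

# Exposed to Prometheus as http_requests_total (the client appends _total).
REQUESTS = Counter("http_requests", "Total HTTP requests", ["method", "status"])
# Exposed as http_request_duration_seconds_bucket/_sum/_count.
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds", ["method"])

def handle_request(method: str) -> None:
    with LATENCY.labels(method).time():  # records the duration into the histogram
        time.sleep(0.05)                 # stand-in for real request handling
    REQUESTS.labels(method, "200").inc()

if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics for Prometheus to scrape
    while True:
        handle_request("GET")
```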
|
||||
|
||||
### Automated Metric Analysis
|
||||
|
||||
Detect anomalies and trends in your metrics:
|
||||
|
||||
```bash
|
||||
# Analyze Prometheus metrics for anomalies
|
||||
python3 scripts/analyze_metrics.py prometheus \
|
||||
--endpoint http://localhost:9090 \
|
||||
--query 'rate(http_requests_total[5m])' \
|
||||
--hours 24
|
||||
|
||||
# Analyze CloudWatch metrics
|
||||
python3 scripts/analyze_metrics.py cloudwatch \
|
||||
--namespace AWS/EC2 \
|
||||
--metric CPUUtilization \
|
||||
--dimensions InstanceId=i-1234567890abcdef0 \
|
||||
--hours 48
|
||||
```
|
||||
|
||||
**→ Script**: [scripts/analyze_metrics.py](scripts/analyze_metrics.py)
|
||||
|
||||
---
|
||||
|
||||
## 2. Log Aggregation & Analysis
|
||||
|
||||
### Structured Logging Checklist
|
||||
|
||||
Every log entry should include:
|
||||
- ✅ Timestamp (ISO 8601 format)
|
||||
- ✅ Log level (DEBUG, INFO, WARN, ERROR, FATAL)
|
||||
- ✅ Message (human-readable)
|
||||
- ✅ Service name
|
||||
- ✅ Request ID (for tracing)
|
||||
|
||||
**Example structured log (JSON)**:
|
||||
```json
{
  "timestamp": "2024-10-28T14:32:15Z",
  "level": "error",
  "message": "Payment processing failed",
  "service": "payment-service",
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "user_id": "user123",
  "order_id": "ORD-456",
  "error_type": "GatewayTimeout",
  "duration_ms": 5000
}
```
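A stdlib-only Python sketch that produces logs in roughly this shape (field names mirror the example above; the service name and extra fields are illustrative):

```python
import json
import logging
import sys
import time
import uuid

class JsonFormatter(logging.Formatter):
    converter = time.gmtime  # emit UTC timestamps to justify the trailing "Z"

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": "payment-service",
            "request_id": getattr(record, "request_id", None),
        }
        entry.update(getattr(record, "fields", {}))  # arbitrary structured context
        return json.dumps(entry)

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment processing failed",
    extra={"request_id": str(uuid.uuid4()),
           "fields": {"order_id": "ORD-456", "error_type": "GatewayTimeout", "duration_ms": 5000}},
)
```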
|
||||
|
||||
### Log Aggregation Patterns
|
||||
|
||||
**ELK Stack** (Elasticsearch, Logstash, Kibana):
|
||||
- Best for: Deep log analysis, complex queries
|
||||
- Cost: High (infrastructure + operations)
|
||||
- Complexity: High
|
||||
|
||||
**Grafana Loki**:
|
||||
- Best for: Cost-effective logging, Kubernetes
|
||||
- Cost: Low
|
||||
- Complexity: Medium
|
||||
|
||||
**CloudWatch Logs**:
|
||||
- Best for: AWS-centric applications
|
||||
- Cost: Medium
|
||||
- Complexity: Low
|
||||
|
||||
### Log Analysis
|
||||
|
||||
Analyze logs for errors, patterns, and anomalies:
|
||||
|
||||
```bash
|
||||
# Analyze log file for patterns
|
||||
python3 scripts/log_analyzer.py application.log
|
||||
|
||||
# Show error lines with context
|
||||
python3 scripts/log_analyzer.py application.log --show-errors
|
||||
|
||||
# Extract stack traces
|
||||
python3 scripts/log_analyzer.py application.log --show-traces
|
||||
```
|
||||
|
||||
**→ Script**: [scripts/log_analyzer.py](scripts/log_analyzer.py)
|
||||
|
||||
### Deep Dive: Logging
|
||||
|
||||
For comprehensive logging guidance including:
|
||||
- Structured logging implementation examples (Python, Node.js, Go, Java)
|
||||
- Log aggregation patterns (ELK, Loki, CloudWatch, Fluentd)
|
||||
- Query patterns and best practices
|
||||
- PII redaction and security
|
||||
- Sampling and rate limiting
|
||||
|
||||
**→ Read**: [references/logging_guide.md](references/logging_guide.md)
|
||||
|
||||
---
|
||||
|
||||
## 3. Alert Design
|
||||
|
||||
### Alert Design Principles
|
||||
|
||||
1. **Every alert must be actionable** - If you can't do something, don't alert
|
||||
2. **Alert on symptoms, not causes** - Alert on user experience, not components
|
||||
3. **Tie alerts to SLOs** - Connect to business impact
|
||||
4. **Reduce noise** - Only page for critical issues
|
||||
|
||||
### Alert Severity Levels
|
||||
|
||||
| Severity | Response Time | Example |
|
||||
|----------|--------------|---------|
|
||||
| **Critical** | Page immediately | Service down, SLO violation |
|
||||
| **Warning** | Ticket, review in hours | Elevated error rate, resource warning |
|
||||
| **Info** | Log for awareness | Deployment completed, scaling event |
|
||||
|
||||
### Multi-Window Burn Rate Alerting
|
||||
|
||||
Alert when error budget is consumed too quickly:
|
||||
|
||||
```yaml
# Fast burn (1h window) - Critical
- alert: ErrorBudgetFastBurn
  expr: |
    (error_rate / 0.001) > 14.4  # 99.9% SLO
  for: 2m
  labels:
    severity: critical

# Slow burn (6h window) - Warning
- alert: ErrorBudgetSlowBurn
  expr: |
    (error_rate / 0.001) > 6  # 99.9% SLO
  for: 30m
  labels:
    severity: warning
```
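The 14.4 and 6 multipliers follow from how much of a 30-day error budget a sustained burn rate consumes inside the alert window (a burn rate of 1 means errors arrive exactly at the budgeted rate, 0.1% here). A quick sketch of that arithmetic:

```python
def budget_consumed(burn_rate: float, window_hours: float, period_days: int = 30) -> float:
    """Fraction of the SLO period's error budget consumed in the window."""
    return burn_rate * window_hours / (period_days * 24)

print(f"{budget_consumed(14.4, 1):.1%}")  # 2.0% of the monthly budget in 1 hour
print(f"{budget_consumed(6, 6):.1%}")     # 5.0% of the monthly budget in 6 hours
```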
|
||||
|
||||
### Alert Quality Checker
|
||||
|
||||
Audit your alert rules against best practices:
|
||||
|
||||
```bash
|
||||
# Check single file
|
||||
python3 scripts/alert_quality_checker.py alerts.yml
|
||||
|
||||
# Check all rules in directory
|
||||
python3 scripts/alert_quality_checker.py /path/to/prometheus/rules/
|
||||
```
|
||||
|
||||
**Checks for**:
|
||||
- Alert naming conventions
|
||||
- Required labels (severity, team)
|
||||
- Required annotations (summary, description, runbook_url)
|
||||
- PromQL expression quality
|
||||
- 'for' clause to prevent flapping
|
||||
|
||||
**→ Script**: [scripts/alert_quality_checker.py](scripts/alert_quality_checker.py)
|
||||
|
||||
### Alert Templates
|
||||
|
||||
Production-ready alert rule templates:
|
||||
|
||||
**→ Templates**:
|
||||
- [assets/templates/prometheus-alerts/webapp-alerts.yml](assets/templates/prometheus-alerts/webapp-alerts.yml) - Web application alerts
|
||||
- [assets/templates/prometheus-alerts/kubernetes-alerts.yml](assets/templates/prometheus-alerts/kubernetes-alerts.yml) - Kubernetes alerts
|
||||
|
||||
### Deep Dive: Alerting
|
||||
|
||||
For comprehensive alerting guidance including:
|
||||
- Alert design patterns (multi-window, rate of change, threshold with hysteresis)
|
||||
- Alert annotation best practices
|
||||
- Alert routing (severity-based, team-based, time-based)
|
||||
- Inhibition rules
|
||||
- Runbook structure
|
||||
- On-call best practices
|
||||
|
||||
**→ Read**: [references/alerting_best_practices.md](references/alerting_best_practices.md)
|
||||
|
||||
### Runbook Template
|
||||
|
||||
Create comprehensive runbooks for your alerts:
|
||||
|
||||
**→ Template**: [assets/templates/runbooks/incident-runbook-template.md](assets/templates/runbooks/incident-runbook-template.md)
|
||||
|
||||
---
|
||||
|
||||
## 4. Dashboard & Visualization
|
||||
|
||||
### Dashboard Design Principles
|
||||
|
||||
1. **Top-down layout**: Most important metrics first
|
||||
2. **Color coding**: Red (critical), yellow (warning), green (healthy)
|
||||
3. **Consistent time windows**: All panels use same time range
|
||||
4. **Limit panels**: 8-12 panels per dashboard maximum
|
||||
5. **Include context**: Show related metrics together
|
||||
|
||||
### Recommended Dashboard Structure
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────┐
|
||||
│ Overall Health (Single Stats) │
|
||||
│ [Requests/s] [Error%] [P95 Latency]│
|
||||
└─────────────────────────────────────┘
|
||||
┌─────────────────────────────────────┐
|
||||
│ Request Rate & Errors (Graphs) │
|
||||
└─────────────────────────────────────┘
|
||||
┌─────────────────────────────────────┐
|
||||
│ Latency Distribution (Graphs) │
|
||||
└─────────────────────────────────────┘
|
||||
┌─────────────────────────────────────┐
|
||||
│ Resource Usage (Graphs) │
|
||||
└─────────────────────────────────────┘
|
||||
```
|
||||
|
||||
### Generate Grafana Dashboards
|
||||
|
||||
Automatically generate dashboards from templates:
|
||||
|
||||
```bash
|
||||
# Web application dashboard
|
||||
python3 scripts/dashboard_generator.py webapp \
|
||||
--title "My API Dashboard" \
|
||||
--service my_api \
|
||||
--output dashboard.json
|
||||
|
||||
# Kubernetes dashboard
|
||||
python3 scripts/dashboard_generator.py kubernetes \
|
||||
--title "K8s Production" \
|
||||
--namespace production \
|
||||
--output k8s-dashboard.json
|
||||
|
||||
# Database dashboard
|
||||
python3 scripts/dashboard_generator.py database \
|
||||
--title "PostgreSQL" \
|
||||
--db-type postgres \
|
||||
--instance db.example.com:5432 \
|
||||
--output db-dashboard.json
|
||||
```
|
||||
|
||||
**Supports**:
|
||||
- Web applications (requests, errors, latency, resources)
|
||||
- Kubernetes (pods, nodes, resources, network)
|
||||
- Databases (PostgreSQL, MySQL)
|
||||
|
||||
**→ Script**: [scripts/dashboard_generator.py](scripts/dashboard_generator.py)
|
||||
|
||||
---
|
||||
|
||||
## 5. SLO & Error Budgets
|
||||
|
||||
### SLO Fundamentals
|
||||
|
||||
**SLI** (Service Level Indicator): Measurement of service quality
|
||||
- Example: Request latency, error rate, availability
|
||||
|
||||
**SLO** (Service Level Objective): Target value for an SLI
|
||||
- Example: "99.9% of requests return in < 500ms"
|
||||
|
||||
**Error Budget**: Allowed failure amount = (100% - SLO)
|
||||
- Example: 99.9% SLO = 0.1% error budget = 43.2 minutes/month
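The minutes-per-month figures here and in the table below come directly from the period length; a small sketch of the arithmetic, assuming a 30-day month:

```python
def downtime_budget_minutes(slo_percent: float, period_days: int = 30) -> float:
    """Allowed downtime for an availability SLO over the period."""
    total_minutes = period_days * 24 * 60          # 43,200 minutes in 30 days
    return total_minutes * (100.0 - slo_percent) / 100.0

print(downtime_budget_minutes(99.9))   # 43.2
print(downtime_budget_minutes(99.99))  # 4.32
```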
|
||||
|
||||
### Common SLO Targets
|
||||
|
||||
| Availability | Downtime/Month | Use Case |
|
||||
|--------------|----------------|----------|
|
||||
| **99%** | 7.2 hours | Internal tools |
|
||||
| **99.9%** | 43.2 minutes | Standard production |
|
||||
| **99.95%** | 21.6 minutes | Critical services |
|
||||
| **99.99%** | 4.3 minutes | High availability |
|
||||
|
||||
### SLO Calculator
|
||||
|
||||
Calculate compliance, error budgets, and burn rates:
|
||||
|
||||
```bash
|
||||
# Show SLO reference table
|
||||
python3 scripts/slo_calculator.py --table
|
||||
|
||||
# Calculate availability SLO
|
||||
python3 scripts/slo_calculator.py availability \
|
||||
--slo 99.9 \
|
||||
--total-requests 1000000 \
|
||||
--failed-requests 1500 \
|
||||
--period-days 30
|
||||
|
||||
# Calculate burn rate
|
||||
python3 scripts/slo_calculator.py burn-rate \
|
||||
--slo 99.9 \
|
||||
--errors 50 \
|
||||
--requests 10000 \
|
||||
--window-hours 1
|
||||
```
|
||||
|
||||
**→ Script**: [scripts/slo_calculator.py](scripts/slo_calculator.py)
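Independent of the script, the arithmetic behind the availability example above is simple enough to sanity-check by hand (this is a sketch, not `slo_calculator.py`'s actual implementation):

```python
def error_budget_report(slo_percent: float, total: int, failed: int) -> dict:
    allowed_failures = total * (100.0 - slo_percent) / 100.0
    return {
        "availability_pct": 100.0 * (total - failed) / total,
        "allowed_failures": allowed_failures,
        "budget_consumed_pct": 100.0 * failed / allowed_failures,
    }

# 1,500 failures out of 1,000,000 requests against a 99.9% SLO:
print(error_budget_report(99.9, 1_000_000, 1_500))
# {'availability_pct': 99.85, 'allowed_failures': 1000.0, 'budget_consumed_pct': 150.0}
```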
|
||||
|
||||
### Deep Dive: SLO/SLA
|
||||
|
||||
For comprehensive SLO/SLA guidance including:
|
||||
- Choosing appropriate SLIs
|
||||
- Setting realistic SLO targets
|
||||
- Error budget policies
|
||||
- Burn rate alerting
|
||||
- SLA structure and contracts
|
||||
- Monthly reporting templates
|
||||
|
||||
**→ Read**: [references/slo_sla_guide.md](references/slo_sla_guide.md)
|
||||
|
||||
---
|
||||
|
||||
## 6. Distributed Tracing
|
||||
|
||||
### When to Use Tracing
|
||||
|
||||
Use distributed tracing when you need to:
|
||||
- Debug performance issues across services
|
||||
- Understand request flow through microservices
|
||||
- Identify bottlenecks in distributed systems
|
||||
- Find N+1 query problems
|
||||
|
||||
### OpenTelemetry Implementation
|
||||
|
||||
**Python example**:
|
||||
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("process_order")
def process_order(order_id):
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)

    try:
        result = payment_service.charge(order_id)
        span.set_attribute("payment.status", "success")
        return result
    except Exception as e:
        span.set_status(trace.Status(trace.StatusCode.ERROR))
        span.record_exception(e)
        raise
```
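The snippet above assumes a tracer provider has already been configured. A minimal setup sketch, assuming the `opentelemetry-sdk` and `opentelemetry-exporter-otlp` packages and a collector listening on its default OTLP gRPC port:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(
    resource=Resource.create({"service.name": "payment-service"})  # illustrative name
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
```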
|
||||
|
||||
### Sampling Strategies
|
||||
|
||||
- **Development**: 100% (ALWAYS_ON)
|
||||
- **Staging**: 50-100%
|
||||
- **Production**: 1-10% (or error-based sampling)
|
||||
|
||||
**Error-based sampling** (always sample errors, 1% of successes):
|
||||
```python
from opentelemetry.sdk.trace.sampling import Decision, Sampler, SamplingResult

class ErrorSampler(Sampler):
    """Always sample error spans, plus roughly 1% of everything else."""

    def should_sample(self, parent_context, trace_id, name, kind=None,
                      attributes=None, links=None, trace_state=None):
        attributes = attributes or {}

        if attributes.get("error", False):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)

        if trace_id & 0xFF < 3:  # 3/256 ≈ 1%
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)

        return SamplingResult(Decision.DROP)

    def get_description(self) -> str:
        return "ErrorSampler"
```
|
||||
|
||||
### OTel Collector Configuration
|
||||
|
||||
Production-ready OpenTelemetry Collector configuration:
|
||||
|
||||
**→ Template**: [assets/templates/otel-config/collector-config.yaml](assets/templates/otel-config/collector-config.yaml)
|
||||
|
||||
**Features**:
|
||||
- Receives OTLP, Prometheus, and host metrics
|
||||
- Batching and memory limiting
|
||||
- Tail sampling (error-based, latency-based, probabilistic)
|
||||
- Multiple exporters (Tempo, Jaeger, Loki, Prometheus, CloudWatch, Datadog)
|
||||
|
||||
### Deep Dive: Tracing
|
||||
|
||||
For comprehensive tracing guidance including:
|
||||
- OpenTelemetry instrumentation (Python, Node.js, Go, Java)
|
||||
- Span attributes and semantic conventions
|
||||
- Context propagation (W3C Trace Context)
|
||||
- Backend comparison (Jaeger, Tempo, X-Ray, Datadog APM)
|
||||
- Analysis patterns (finding slow traces, N+1 queries)
|
||||
- Integration with logs
|
||||
|
||||
**→ Read**: [references/tracing_guide.md](references/tracing_guide.md)
|
||||
|
||||
---
|
||||
|
||||
## 7. Datadog Cost Optimization & Migration
|
||||
|
||||
### Scenario 1: I'm Using Datadog and Costs Are Too High
|
||||
|
||||
If your Datadog bill is growing out of control, start by identifying waste:
|
||||
|
||||
#### Cost Analysis Script
|
||||
|
||||
Automatically analyze your Datadog usage and find cost optimization opportunities:
|
||||
|
||||
```bash
|
||||
# Analyze Datadog usage (requires API key and APP key)
|
||||
python3 scripts/datadog_cost_analyzer.py \
|
||||
--api-key $DD_API_KEY \
|
||||
--app-key $DD_APP_KEY
|
||||
|
||||
# Show detailed breakdown by category
|
||||
python3 scripts/datadog_cost_analyzer.py \
|
||||
--api-key $DD_API_KEY \
|
||||
--app-key $DD_APP_KEY \
|
||||
--show-details
|
||||
```
|
||||
|
||||
**What it checks**:
|
||||
- Infrastructure host count and cost
|
||||
- Custom metrics usage and high-cardinality metrics
|
||||
- Log ingestion volume and trends
|
||||
- APM host usage
|
||||
- Unused or noisy monitors
|
||||
- Container vs VM optimization opportunities
|
||||
|
||||
**→ Script**: [scripts/datadog_cost_analyzer.py](scripts/datadog_cost_analyzer.py)
|
||||
|
||||
#### Common Cost Optimization Strategies
|
||||
|
||||
**1. Custom Metrics Optimization** (typical savings: 20-40%):
|
||||
- Remove high-cardinality tags (user IDs, request IDs)
|
||||
- Delete unused custom metrics
|
||||
- Aggregate metrics before sending
|
||||
- Use metric prefixes to identify teams/services
|
||||
|
||||
**2. Log Management** (typical savings: 30-50%):
|
||||
- Implement log sampling for high-volume services
|
||||
- Use exclusion filters for debug/trace logs in production
|
||||
- Archive cold logs to S3/GCS after 7 days
|
||||
- Set log retention policies (15 days instead of 30)
|
||||
|
||||
**3. APM Optimization** (typical savings: 15-25%):
|
||||
- Reduce trace sampling rates (10% → 5% in prod)
|
||||
- Use head-based sampling instead of complete sampling
|
||||
- Remove APM from non-critical services
|
||||
- Use trace search with lower retention
|
||||
|
||||
**4. Infrastructure Monitoring** (typical savings: 10-20%):
|
||||
- Switch from VM-based to container-based pricing where possible
|
||||
- Remove agents from ephemeral instances
|
||||
- Use Datadog's host reduction strategies
|
||||
- Consolidate staging environments
|
||||
|
||||
### Scenario 2: Migrating Away from Datadog
|
||||
|
||||
If you're considering migrating to a more cost-effective open-source stack:
|
||||
|
||||
#### Migration Overview
|
||||
|
||||
**From Datadog** → **To Open Source Stack**:
|
||||
- Metrics: Datadog → **Prometheus + Grafana**
|
||||
- Logs: Datadog Logs → **Grafana Loki**
|
||||
- Traces: Datadog APM → **Tempo or Jaeger**
|
||||
- Dashboards: Datadog → **Grafana**
|
||||
- Alerts: Datadog Monitors → **Prometheus Alertmanager**
|
||||
|
||||
**Estimated Cost Savings**: 60-77% ($49.8k-61.8k/year for 100-host environment)
|
||||
|
||||
#### Migration Strategy
|
||||
|
||||
**Phase 1: Run Parallel** (Month 1-2):
|
||||
- Deploy open-source stack alongside Datadog
|
||||
- Migrate metrics first (lowest risk)
|
||||
- Validate data accuracy
|
||||
|
||||
**Phase 2: Migrate Dashboards & Alerts** (Month 2-3):
|
||||
- Convert Datadog dashboards to Grafana
|
||||
- Translate alert rules (use DQL → PromQL guide below)
|
||||
- Train team on new tools
|
||||
|
||||
**Phase 3: Migrate Logs & Traces** (Month 3-4):
|
||||
- Set up Loki for log aggregation
|
||||
- Deploy Tempo/Jaeger for tracing
|
||||
- Update application instrumentation
|
||||
|
||||
**Phase 4: Decommission Datadog** (Month 4-5):
|
||||
- Confirm all functionality migrated
|
||||
- Cancel Datadog subscription
|
||||
|
||||
#### Query Translation: DQL → PromQL
|
||||
|
||||
When migrating dashboards and alerts, you'll need to translate Datadog queries to PromQL:
|
||||
|
||||
**Quick examples**:
|
||||
```
|
||||
# Average CPU
|
||||
Datadog: avg:system.cpu.user{*}
|
||||
Prometheus: avg(node_cpu_seconds_total{mode="user"})
|
||||
|
||||
# Request rate
|
||||
Datadog: sum:requests.count{*}.as_rate()
|
||||
Prometheus: sum(rate(http_requests_total[5m]))
|
||||
|
||||
# P95 latency
|
||||
Datadog: p95:request.duration{*}
|
||||
Prometheus: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
|
||||
# Error rate percentage
|
||||
Datadog: (sum:requests.errors{*}.as_rate() / sum:requests.count{*}.as_rate()) * 100
|
||||
Prometheus: (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
|
||||
```
|
||||
|
||||
**→ Full Translation Guide**: [references/dql_promql_translation.md](references/dql_promql_translation.md)
|
||||
|
||||
#### Cost Comparison
|
||||
|
||||
**Example: 100-host infrastructure**
|
||||
|
||||
| Component | Datadog (Annual) | Open Source (Annual) | Savings |
|
||||
|-----------|-----------------|---------------------|---------|
|
||||
| Infrastructure | $18,000 | $10,000 (self-hosted infra) | $8,000 |
|
||||
| Custom Metrics | $600 | Included | $600 |
|
||||
| Logs | $24,000 | $3,000 (storage) | $21,000 |
|
||||
| APM/Traces | $37,200 | $5,000 (storage) | $32,200 |
|
||||
| **Total** | **$79,800** | **$18,000** | **$61,800 (77%)** |
|
||||
|
||||
### Deep Dive: Datadog Migration
|
||||
|
||||
For comprehensive migration guidance including:
|
||||
- Detailed cost comparison and ROI calculations
|
||||
- Step-by-step migration instructions
|
||||
- Infrastructure sizing recommendations (CPU, RAM, storage)
|
||||
- Dashboard conversion tools and examples
|
||||
- Alert rule translation patterns
|
||||
- Application instrumentation changes (DogStatsD → Prometheus client)
|
||||
- Python scripts for exporting Datadog dashboards and monitors
|
||||
- Common challenges and solutions
|
||||
|
||||
**→ Read**: [references/datadog_migration.md](references/datadog_migration.md)
|
||||
|
||||
---
|
||||
|
||||
## 8. Tool Selection & Comparison
|
||||
|
||||
### Decision Matrix
|
||||
|
||||
**Choose Prometheus + Grafana if**:
|
||||
- ✅ Using Kubernetes
|
||||
- ✅ Want control and customization
|
||||
- ✅ Have ops capacity
|
||||
- ✅ Budget-conscious
|
||||
|
||||
**Choose Datadog if**:
|
||||
- ✅ Want ease of use
|
||||
- ✅ Need full observability now
|
||||
- ✅ Budget allows ($8k+/month for 100 hosts)
|
||||
|
||||
**Choose Grafana Stack (LGTM) if**:
|
||||
- ✅ Want open source full stack
|
||||
- ✅ Cost-effective solution
|
||||
- ✅ Cloud-native architecture
|
||||
|
||||
**Choose ELK Stack if**:
|
||||
- ✅ Heavy log analysis needs
|
||||
- ✅ Need powerful search
|
||||
- ✅ Have dedicated ops team
|
||||
|
||||
**Choose Cloud Native (CloudWatch/etc) if**:
|
||||
- ✅ Single cloud provider
|
||||
- ✅ Simple needs
|
||||
- ✅ Want minimal setup
|
||||
|
||||
### Cost Comparison (100 hosts, 1TB logs/month)
|
||||
|
||||
| Solution | Monthly Cost | Setup | Ops Burden |
|
||||
|----------|-------------|--------|------------|
|
||||
| Prometheus + Loki + Tempo | $1,500 | Medium | Medium |
|
||||
| Grafana Cloud | $3,000 | Low | Low |
|
||||
| Datadog | $8,000 | Low | None |
|
||||
| ELK Stack | $4,000 | High | High |
|
||||
| CloudWatch | $2,000 | Low | Low |
|
||||
|
||||
### Deep Dive: Tool Comparison
|
||||
|
||||
For comprehensive tool comparison including:
|
||||
- Metrics platforms (Prometheus, Datadog, New Relic, CloudWatch, Grafana Cloud)
|
||||
- Logging platforms (ELK, Loki, Splunk, CloudWatch Logs, Sumo Logic)
|
||||
- Tracing platforms (Jaeger, Tempo, Datadog APM, X-Ray)
|
||||
- Full-stack observability comparison
|
||||
- Recommendations by company size
|
||||
|
||||
**→ Read**: [references/tool_comparison.md](references/tool_comparison.md)
|
||||
|
||||
---
|
||||
|
||||
## 9. Troubleshooting & Analysis
|
||||
|
||||
### Health Check Validation
|
||||
|
||||
Validate health check endpoints against best practices:
|
||||
|
||||
```bash
|
||||
# Check single endpoint
|
||||
python3 scripts/health_check_validator.py https://api.example.com/health
|
||||
|
||||
# Check multiple endpoints
|
||||
python3 scripts/health_check_validator.py \
|
||||
https://api.example.com/health \
|
||||
https://api.example.com/readiness \
|
||||
--verbose
|
||||
```
|
||||
|
||||
**Checks for**:
|
||||
- ✓ Returns 200 status code
|
||||
- ✓ Response time < 1 second
|
||||
- ✓ Returns JSON format
|
||||
- ✓ Contains 'status' field
|
||||
- ✓ Includes version/build info
|
||||
- ✓ Checks dependencies
|
||||
- ✓ Disables caching
|
||||
|
||||
**→ Script**: [scripts/health_check_validator.py](scripts/health_check_validator.py)
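For illustration, a hypothetical Flask endpoint shaped to satisfy these checks; the version string and dependency probe are placeholders:

```python
from flask import Flask, jsonify

app = Flask(__name__)
APP_VERSION = "1.4.2"  # placeholder build/version info

def check_database() -> bool:
    return True  # replace with a real connectivity probe

@app.route("/health")
def health():
    deps = {"database": "ok" if check_database() else "failing"}
    status = "ok" if all(v == "ok" for v in deps.values()) else "degraded"
    resp = jsonify(status=status, version=APP_VERSION, dependencies=deps)
    resp.headers["Cache-Control"] = "no-store"  # keep intermediaries from caching health state
    return resp, 200 if status == "ok" else 503

if __name__ == "__main__":
    app.run(port=8080)
```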
|
||||
|
||||
### Common Troubleshooting Workflows
|
||||
|
||||
**High Latency Investigation**:
|
||||
1. Check dashboards for latency spike
|
||||
2. Query traces for slow operations
|
||||
3. Check database slow query log
|
||||
4. Check external API response times
|
||||
5. Review recent deployments
|
||||
6. Check resource utilization (CPU, memory)
|
||||
|
||||
**High Error Rate Investigation**:
|
||||
1. Check error logs for patterns
|
||||
2. Identify affected endpoints
|
||||
3. Check dependency health
|
||||
4. Review recent deployments
|
||||
5. Check resource limits
|
||||
6. Verify configuration
|
||||
|
||||
**Service Down Investigation**:
|
||||
1. Check if pods/instances are running
|
||||
2. Check health check endpoint
|
||||
3. Review recent deployments
|
||||
4. Check resource availability
|
||||
5. Check network connectivity
|
||||
6. Review logs for startup errors
|
||||
|
||||
---
|
||||
|
||||
## Quick Reference Commands
|
||||
|
||||
### Prometheus Queries
|
||||
|
||||
```promql
|
||||
# Request rate
|
||||
sum(rate(http_requests_total[5m]))
|
||||
|
||||
# Error rate
|
||||
sum(rate(http_requests_total{status=~"5.."}[5m]))
|
||||
/
|
||||
sum(rate(http_requests_total[5m])) * 100
|
||||
|
||||
# P95 latency
|
||||
histogram_quantile(0.95,
|
||||
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
|
||||
)
|
||||
|
||||
# CPU usage
|
||||
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
|
||||
|
||||
# Memory usage
|
||||
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
|
||||
```
|
||||
|
||||
### Kubernetes Commands
|
||||
|
||||
```bash
|
||||
# Check pod status
|
||||
kubectl get pods -n <namespace>
|
||||
|
||||
# View pod logs
|
||||
kubectl logs -f <pod-name> -n <namespace>
|
||||
|
||||
# Check pod resources
|
||||
kubectl top pods -n <namespace>
|
||||
|
||||
# Describe pod for events
|
||||
kubectl describe pod <pod-name> -n <namespace>
|
||||
|
||||
# Check recent deployments
|
||||
kubectl rollout history deployment/<name> -n <namespace>
|
||||
```
|
||||
|
||||
### Log Queries
|
||||
|
||||
**Elasticsearch**:
|
||||
```json
|
||||
GET /logs-*/_search
|
||||
{
|
||||
"query": {
|
||||
"bool": {
|
||||
"must": [
|
||||
{ "match": { "level": "error" } },
|
||||
{ "range": { "@timestamp": { "gte": "now-1h" } } }
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Loki (LogQL)**:
|
||||
```logql
|
||||
{job="app", level="error"} |= "error" | json
|
||||
```
|
||||
|
||||
**CloudWatch Insights**:
|
||||
```
|
||||
fields @timestamp, level, message
|
||||
| filter level = "error"
|
||||
| filter @timestamp > ago(1h)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resources Summary
|
||||
|
||||
### Scripts (automation and analysis)
|
||||
- `analyze_metrics.py` - Detect anomalies in Prometheus/CloudWatch metrics
|
||||
- `alert_quality_checker.py` - Audit alert rules against best practices
|
||||
- `slo_calculator.py` - Calculate SLO compliance and error budgets
|
||||
- `log_analyzer.py` - Parse logs for errors and patterns
|
||||
- `dashboard_generator.py` - Generate Grafana dashboards from templates
|
||||
- `health_check_validator.py` - Validate health check endpoints
|
||||
- `datadog_cost_analyzer.py` - Analyze Datadog usage and find cost waste
|
||||
|
||||
### References (deep-dive documentation)
|
||||
- `metrics_design.md` - Four Golden Signals, RED/USE methods, metric types
|
||||
- `alerting_best_practices.md` - Alert design, runbooks, on-call practices
|
||||
- `logging_guide.md` - Structured logging, aggregation patterns
|
||||
- `tracing_guide.md` - OpenTelemetry, distributed tracing
|
||||
- `slo_sla_guide.md` - SLI/SLO/SLA definitions, error budgets
|
||||
- `tool_comparison.md` - Comprehensive comparison of monitoring tools
|
||||
- `datadog_migration.md` - Complete guide for migrating from Datadog to OSS stack
|
||||
- `dql_promql_translation.md` - Datadog Query Language to PromQL translation reference
|
||||
|
||||
### Templates (ready-to-use configurations)
|
||||
- `prometheus-alerts/webapp-alerts.yml` - Production-ready web app alerts
|
||||
- `prometheus-alerts/kubernetes-alerts.yml` - Kubernetes monitoring alerts
|
||||
- `otel-config/collector-config.yaml` - OpenTelemetry Collector configuration
|
||||
- `runbooks/incident-runbook-template.md` - Incident response template
|
||||
|
||||
---
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Metrics
|
||||
- Start with Four Golden Signals
|
||||
- Use appropriate metric types (counter, gauge, histogram)
|
||||
- Keep cardinality low (avoid high-cardinality labels)
|
||||
- Follow naming conventions
|
||||
|
||||
### Logging
|
||||
- Use structured logging (JSON)
|
||||
- Include request IDs for tracing
|
||||
- Set appropriate log levels
|
||||
- Redact PII before logging
|
||||
|
||||
### Alerting
|
||||
- Make every alert actionable
|
||||
- Alert on symptoms, not causes
|
||||
- Use multi-window burn rate alerts
|
||||
- Include runbook links
|
||||
|
||||
### Tracing
|
||||
- Sample appropriately (1-10% in production)
|
||||
- Always record errors
|
||||
- Use semantic conventions
|
||||
- Propagate context between services
|
||||
|
||||
### SLOs
|
||||
- Start with current performance
|
||||
- Set realistic targets
|
||||
- Define error budget policies
|
||||
- Review and adjust quarterly
|
||||
227  assets/templates/otel-config/collector-config.yaml  Normal file
@@ -0,0 +1,227 @@
|
||||
# OpenTelemetry Collector Configuration
|
||||
# Receives metrics, logs, and traces and exports to various backends
|
||||
|
||||
receivers:
|
||||
# OTLP receiver (standard OpenTelemetry protocol)
|
||||
otlp:
|
||||
protocols:
|
||||
grpc:
|
||||
endpoint: 0.0.0.0:4317
|
||||
http:
|
||||
endpoint: 0.0.0.0:4318
|
||||
|
||||
# Prometheus receiver (scrape Prometheus endpoints)
|
||||
prometheus:
|
||||
config:
|
||||
scrape_configs:
|
||||
- job_name: 'otel-collector'
|
||||
scrape_interval: 30s
|
||||
static_configs:
|
||||
- targets: ['localhost:8888']
|
||||
|
||||
# Host metrics (CPU, memory, disk, network)
|
||||
hostmetrics:
|
||||
collection_interval: 30s
|
||||
scrapers:
|
||||
cpu:
|
||||
memory:
|
||||
disk:
|
||||
network:
|
||||
filesystem:
|
||||
load:
|
||||
|
||||
# Kubernetes receiver (cluster metrics)
|
||||
k8s_cluster:
|
||||
auth_type: serviceAccount
|
||||
node_conditions_to_report: [Ready, MemoryPressure, DiskPressure]
|
||||
distribution: kubernetes
|
||||
|
||||
# Zipkin receiver (legacy tracing)
|
||||
zipkin:
|
||||
endpoint: 0.0.0.0:9411
|
||||
|
||||
processors:
|
||||
# Batch processor (improves performance)
|
||||
batch:
|
||||
timeout: 10s
|
||||
send_batch_size: 1024
|
||||
send_batch_max_size: 2048
|
||||
|
||||
# Memory limiter (prevent OOM)
|
||||
memory_limiter:
|
||||
check_interval: 1s
|
||||
limit_mib: 512
|
||||
spike_limit_mib: 128
|
||||
|
||||
# Resource processor (add resource attributes)
|
||||
resource:
|
||||
attributes:
|
||||
- key: environment
|
||||
value: production
|
||||
action: insert
|
||||
- key: cluster.name
|
||||
value: prod-cluster
|
||||
action: insert
|
||||
|
||||
# Attributes processor (modify span/metric attributes)
|
||||
attributes:
|
||||
actions:
|
||||
- key: http.url
|
||||
action: delete # Remove potentially sensitive URLs
|
||||
- key: db.statement
|
||||
action: hash # Hash SQL queries for privacy
|
||||
|
||||
# Filter processor (drop unwanted data)
|
||||
filter:
|
||||
metrics:
|
||||
# Drop metrics matching criteria
|
||||
exclude:
|
||||
match_type: regexp
|
||||
metric_names:
|
||||
- ^go_.* # Drop Go runtime metrics
|
||||
- ^process_.* # Drop process metrics
|
||||
|
||||
# Tail sampling (intelligent trace sampling)
|
||||
tail_sampling:
|
||||
decision_wait: 10s
|
||||
num_traces: 100
|
||||
policies:
|
||||
# Always sample errors
|
||||
- name: error-policy
|
||||
type: status_code
|
||||
status_code:
|
||||
status_codes: [ERROR]
|
||||
|
||||
# Sample slow traces
|
||||
- name: latency-policy
|
||||
type: latency
|
||||
latency:
|
||||
threshold_ms: 1000
|
||||
|
||||
# Sample 10% of others
|
||||
- name: probabilistic-policy
|
||||
type: probabilistic
|
||||
probabilistic:
|
||||
sampling_percentage: 10
|
||||
|
||||
# Span processor (modify spans)
|
||||
span:
|
||||
name:
|
||||
to_attributes:
|
||||
rules:
|
||||
- ^\/api\/v1\/users\/(?P<user_id>.*)$
|
||||
from_attributes:
|
||||
- db.name
|
||||
- http.method
|
||||
|
||||
exporters:
|
||||
# Prometheus exporter (expose metrics endpoint)
|
||||
prometheus:
|
||||
endpoint: 0.0.0.0:8889
|
||||
namespace: otel
|
||||
|
||||
# OTLP exporters (send to backends)
|
||||
otlp/tempo:
|
||||
endpoint: tempo:4317
|
||||
tls:
|
||||
insecure: true
|
||||
|
||||
otlp/mimir:
|
||||
endpoint: mimir:4317
|
||||
tls:
|
||||
insecure: true
|
||||
|
||||
# Loki exporter (for logs)
|
||||
loki:
|
||||
endpoint: http://loki:3100/loki/api/v1/push
|
||||
labels:
|
||||
resource:
|
||||
service.name: "service_name"
|
||||
service.namespace: "service_namespace"
|
||||
attributes:
|
||||
level: "level"
|
||||
|
||||
# Jaeger exporter (alternative tracing backend)
|
||||
jaeger:
|
||||
endpoint: jaeger:14250
|
||||
tls:
|
||||
insecure: true
|
||||
|
||||
# Elasticsearch exporter (for logs)
|
||||
elasticsearch:
|
||||
endpoints:
|
||||
- http://elasticsearch:9200
|
||||
logs_index: otel-logs
|
||||
traces_index: otel-traces
|
||||
|
||||
# CloudWatch exporter (AWS)
|
||||
awscloudwatch:
|
||||
region: us-east-1
|
||||
namespace: MyApp
|
||||
log_group_name: /aws/otel/logs
|
||||
log_stream_name: otel-collector
|
||||
|
||||
# Datadog exporter
|
||||
datadog:
|
||||
api:
|
||||
key: ${DD_API_KEY}
|
||||
site: datadoghq.com
|
||||
|
||||
# File exporter (debugging)
|
||||
file:
|
||||
path: /tmp/otel-output.json
|
||||
|
||||
# Logging exporter (console output for debugging)
|
||||
logging:
|
||||
verbosity: detailed
|
||||
sampling_initial: 5
|
||||
sampling_thereafter: 200
|
||||
|
||||
extensions:
|
||||
# Health check endpoint
|
||||
health_check:
|
||||
endpoint: 0.0.0.0:13133
|
||||
|
||||
# Pprof endpoint (for profiling)
|
||||
pprof:
|
||||
endpoint: 0.0.0.0:1777
|
||||
|
||||
# ZPages (internal diagnostics)
|
||||
zpages:
|
||||
endpoint: 0.0.0.0:55679
|
||||
|
||||
service:
|
||||
extensions: [health_check, pprof, zpages]
|
||||
|
||||
pipelines:
|
||||
# Traces pipeline
|
||||
traces:
|
||||
receivers: [otlp, zipkin]
|
||||
processors: [memory_limiter, batch, tail_sampling, resource, span]
|
||||
exporters: [otlp/tempo, jaeger, logging]
|
||||
|
||||
# Metrics pipeline
|
||||
metrics:
|
||||
receivers: [otlp, prometheus, hostmetrics, k8s_cluster]
|
||||
processors: [memory_limiter, batch, filter, resource]
|
||||
exporters: [otlp/mimir, prometheus, awscloudwatch]
|
||||
|
||||
# Logs pipeline
|
||||
logs:
|
||||
receivers: [otlp]
|
||||
processors: [memory_limiter, batch, resource, attributes]
|
||||
exporters: [loki, elasticsearch, awscloudwatch]
|
||||
|
||||
# Telemetry (collector's own metrics)
|
||||
telemetry:
|
||||
logs:
|
||||
level: info
|
||||
metrics:
|
||||
address: 0.0.0.0:8888
|
||||
|
||||
# Notes:
|
||||
# 1. Replace ${DD_API_KEY} with actual API key or use environment variable
|
||||
# 2. Adjust endpoints to match your infrastructure
|
||||
# 3. Comment out exporters you don't use
|
||||
# 4. Adjust sampling rates based on your volume and needs
|
||||
# 5. Add TLS configuration for production deployments
|
||||
293  assets/templates/prometheus-alerts/kubernetes-alerts.yml  Normal file
@@ -0,0 +1,293 @@
|
||||
---
|
||||
# Prometheus Alert Rules for Kubernetes
|
||||
# Covers pods, nodes, deployments, and resource usage
|
||||
|
||||
groups:
|
||||
- name: kubernetes_pods
|
||||
interval: 30s
|
||||
rules:
|
||||
# Pod crash looping
|
||||
- alert: PodCrashLooping
|
||||
expr: |
|
||||
rate(kube_pod_container_status_restarts_total[15m]) > 0
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "Pod is crash looping - {{ $labels.namespace }}/{{ $labels.pod }}"
|
||||
description: |
|
||||
Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly ({{ $value }} restarts/second averaged over the last 15 minutes).
|
||||
|
||||
Check pod logs:
|
||||
kubectl logs -n {{ $labels.namespace }} {{ $labels.pod }} --previous
|
||||
runbook_url: "https://runbooks.example.com/pod-crash-loop"
|
||||
|
||||
# Pod not ready
|
||||
- alert: PodNotReady
|
||||
expr: |
|
||||
sum by (namespace, pod, phase) (kube_pod_status_phase{phase!~"Running|Succeeded"}) > 0
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "Pod not ready - {{ $labels.namespace }}/{{ $labels.pod }}"
|
||||
description: |
|
||||
Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in the {{ $labels.phase }} phase for 10 minutes.
|
||||
|
||||
Investigate:
|
||||
kubectl describe pod -n {{ $labels.namespace }} {{ $labels.pod }}
|
||||
runbook_url: "https://runbooks.example.com/pod-not-ready"
|
||||
|
||||
# Pod OOMKilled
|
||||
- alert: PodOOMKilled
|
||||
expr: |
|
||||
sum by (namespace, pod) (kube_pod_container_status_terminated_reason{reason="OOMKilled"}) > 0
|
||||
for: 1m
|
||||
labels:
|
||||
severity: warning
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "Pod killed due to OOM - {{ $labels.namespace }}/{{ $labels.pod }}"
|
||||
description: |
|
||||
Pod {{ $labels.namespace }}/{{ $labels.pod }} was killed due to out-of-memory.
|
||||
|
||||
Increase memory limits or investigate memory leak.
|
||||
runbook_url: "https://runbooks.example.com/oom-killed"
|
||||
|
||||
- name: kubernetes_deployments
|
||||
interval: 30s
|
||||
rules:
|
||||
# Deployment replica mismatch
|
||||
- alert: DeploymentReplicasMismatch
|
||||
expr: |
|
||||
kube_deployment_spec_replicas != kube_deployment_status_replicas_available
|
||||
for: 15m
|
||||
labels:
|
||||
severity: warning
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "Deployment replicas mismatch - {{ $labels.namespace }}/{{ $labels.deployment }}"
|
||||
description: |
|
||||
Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has been running with
|
||||
fewer replicas than desired for 15 minutes.
|
||||
|
||||
Desired: {{ $value }}
|
||||
Available: Check deployment status
|
||||
runbook_url: "https://runbooks.example.com/replica-mismatch"
|
||||
|
||||
# Deployment rollout stuck
|
||||
- alert: DeploymentRolloutStuck
|
||||
expr: |
|
||||
kube_deployment_status_condition{condition="Progressing", status="false"} > 0
|
||||
for: 15m
|
||||
labels:
|
||||
severity: warning
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "Deployment rollout stuck - {{ $labels.namespace }}/{{ $labels.deployment }}"
|
||||
description: |
|
||||
Deployment {{ $labels.namespace }}/{{ $labels.deployment }} rollout is stuck.
|
||||
|
||||
Check rollout status:
|
||||
kubectl rollout status deployment/{{ $labels.deployment }} -n {{ $labels.namespace }}
|
||||
runbook_url: "https://runbooks.example.com/rollout-stuck"
|
||||
|
||||
- name: kubernetes_nodes
|
||||
interval: 30s
|
||||
rules:
|
||||
# Node not ready
|
||||
- alert: NodeNotReady
|
||||
expr: |
|
||||
kube_node_status_condition{condition="Ready",status="true"} == 0
|
||||
for: 5m
|
||||
labels:
|
||||
severity: critical
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "Node not ready - {{ $labels.node }}"
|
||||
description: |
|
||||
Node {{ $labels.node }} has been NotReady for 5 minutes.
|
||||
|
||||
This will affect pod scheduling and availability.
|
||||
|
||||
Check node status:
|
||||
kubectl describe node {{ $labels.node }}
|
||||
runbook_url: "https://runbooks.example.com/node-not-ready"
|
||||
|
||||
# Node memory pressure
|
||||
- alert: NodeMemoryPressure
|
||||
expr: |
|
||||
kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "Node under memory pressure - {{ $labels.node }}"
|
||||
description: |
|
||||
Node {{ $labels.node }} is experiencing memory pressure.
|
||||
|
||||
Pods may be evicted. Consider scaling up or evicting low-priority pods.
|
||||
runbook_url: "https://runbooks.example.com/memory-pressure"
|
||||
|
||||
# Node disk pressure
|
||||
- alert: NodeDiskPressure
|
||||
expr: |
|
||||
kube_node_status_condition{condition="DiskPressure",status="true"} == 1
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "Node under disk pressure - {{ $labels.node }}"
|
||||
description: |
|
||||
Node {{ $labels.node }} is experiencing disk pressure.
|
||||
|
||||
Clean up disk space or add capacity.
|
||||
runbook_url: "https://runbooks.example.com/disk-pressure"
|
||||
|
||||
# Node high CPU
|
||||
- alert: NodeHighCPU
|
||||
expr: |
|
||||
(1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)) * 100 > 80
|
||||
for: 15m
|
||||
labels:
|
||||
severity: warning
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "Node high CPU usage - {{ $labels.instance }}"
|
||||
description: |
|
||||
Node {{ $labels.instance }} CPU usage is {{ $value | humanize }}%.
|
||||
|
||||
Check for resource-intensive pods or scale cluster.
|
||||
runbook_url: "https://runbooks.example.com/node-high-cpu"
|
||||
|
||||
- name: kubernetes_resources
|
||||
interval: 30s
|
||||
rules:
|
||||
# Container CPU throttling
|
||||
- alert: ContainerCPUThrottling
|
||||
expr: |
|
||||
rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.5
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "Container CPU throttling - {{ $labels.namespace }}/{{ $labels.pod }}"
|
||||
description: |
|
||||
Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }}
|
||||
is being CPU throttled.
|
||||
|
||||
CPU throttling rate: {{ $value | humanize }}
|
||||
|
||||
Consider increasing CPU limits.
|
||||
runbook_url: "https://runbooks.example.com/cpu-throttling"
|
||||
|
||||
# Container memory usage high
|
||||
- alert: ContainerMemoryUsageHigh
|
||||
expr: |
|
||||
(container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.9
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "Container memory usage high - {{ $labels.namespace }}/{{ $labels.pod }}"
|
||||
description: |
|
||||
Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }}
|
||||
is using {{ $value | humanizePercentage }} of its memory limit.
|
||||
|
||||
Risk of OOMKill. Consider increasing memory limits.
|
||||
runbook_url: "https://runbooks.example.com/high-memory"
|
||||
|
||||
- name: kubernetes_pv
|
||||
interval: 30s
|
||||
rules:
|
||||
# PersistentVolume nearing full
|
||||
- alert: PersistentVolumeFillingUp
|
||||
expr: |
|
||||
(kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) < 0.15
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "PersistentVolume filling up - {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }}"
|
||||
description: |
|
||||
PersistentVolume {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }}
|
||||
is {{ $value | humanizePercentage }} full.
|
||||
|
||||
Available space is running low. Consider expanding volume.
|
||||
runbook_url: "https://runbooks.example.com/pv-filling-up"
|
||||
|
||||
# PersistentVolume critically full
|
||||
- alert: PersistentVolumeCriticallyFull
|
||||
expr: |
|
||||
(kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) < 0.05
|
||||
for: 5m
|
||||
labels:
|
||||
severity: critical
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "PersistentVolume critically full - {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }}"
|
||||
description: |
|
||||
PersistentVolume {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }}
|
||||
is {{ $value | humanizePercentage }} full.
|
||||
|
||||
Immediate action required to prevent application failures.
|
||||
runbook_url: "https://runbooks.example.com/pv-critically-full"
|
||||
|
||||
- name: kubernetes_jobs
|
||||
interval: 30s
|
||||
rules:
|
||||
# Job failed
|
||||
- alert: JobFailed
|
||||
expr: |
|
||||
kube_job_status_failed > 0
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "Job failed - {{ $labels.namespace }}/{{ $labels.job_name }}"
|
||||
description: |
|
||||
Job {{ $labels.namespace }}/{{ $labels.job_name }} has failed.
|
||||
|
||||
Check job logs:
|
||||
kubectl logs job/{{ $labels.job_name }} -n {{ $labels.namespace }}
|
||||
runbook_url: "https://runbooks.example.com/job-failed"
|
||||
|
||||
# CronJob not running
|
||||
- alert: CronJobNotRunning
|
||||
expr: |
|
||||
time() - kube_cronjob_status_last_schedule_time > 3600
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "CronJob not running - {{ $labels.namespace }}/{{ $labels.cronjob }}"
|
||||
description: |
|
||||
CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} hasn't run in over an hour.
|
||||
|
||||
Check CronJob status:
|
||||
kubectl describe cronjob {{ $labels.cronjob }} -n {{ $labels.namespace }}
|
||||
runbook_url: "https://runbooks.example.com/cronjob-not-running"
|
||||
243  assets/templates/prometheus-alerts/webapp-alerts.yml  Normal file
@@ -0,0 +1,243 @@
|
||||
---
|
||||
# Prometheus Alert Rules for Web Applications
|
||||
# Based on SLO best practices and multi-window burn rate alerting
|
||||
|
||||
groups:
|
||||
- name: webapp_availability
|
||||
interval: 30s
|
||||
rules:
|
||||
# Fast burn rate alert (1h window) - SLO: 99.9%
|
||||
- alert: ErrorBudgetFastBurn
|
||||
expr: |
|
||||
(
|
||||
sum(rate(http_requests_total{job="webapp",status=~"5.."}[1h]))
|
||||
/
|
||||
sum(rate(http_requests_total{job="webapp"}[1h]))
|
||||
) > (14.4 * 0.001)
|
||||
for: 2m
|
||||
labels:
|
||||
severity: critical
|
||||
team: backend
|
||||
component: webapp
|
||||
annotations:
|
||||
summary: "Fast error budget burn - {{ $labels.job }}"
|
||||
description: |
|
||||
Error rate is {{ $value | humanizePercentage }} over the last hour,
|
||||
burning through error budget at 14.4x rate.
|
||||
|
||||
At this rate, the monthly error budget will be exhausted in 2 days.
|
||||
|
||||
Immediate investigation required.
|
||||
runbook_url: "https://runbooks.example.com/error-budget-burn"
|
||||
dashboard: "https://grafana.example.com/d/webapp"
|
||||
|
||||
# Slow burn rate alert (6h window)
|
||||
- alert: ErrorBudgetSlowBurn
|
||||
expr: |
|
||||
(
|
||||
sum(rate(http_requests_total{job="webapp",status=~"5.."}[6h]))
|
||||
/
|
||||
sum(rate(http_requests_total{job="webapp"}[6h]))
|
||||
) > (6 * 0.001)
|
||||
for: 30m
|
||||
labels:
|
||||
severity: warning
|
||||
team: backend
|
||||
component: webapp
|
||||
annotations:
|
||||
summary: "Elevated error budget burn - {{ $labels.job }}"
|
||||
description: |
|
||||
Error rate is {{ $value | humanizePercentage }} over the last 6 hours,
|
||||
burning through error budget at 6x rate.
|
||||
|
||||
Monitor closely and investigate if trend continues.
|
||||
runbook_url: "https://runbooks.example.com/error-budget-burn"
|
||||
|
||||
# Service down alert
|
||||
- alert: WebAppDown
|
||||
expr: up{job="webapp"} == 0
|
||||
for: 2m
|
||||
labels:
|
||||
severity: critical
|
||||
team: backend
|
||||
component: webapp
|
||||
annotations:
|
||||
summary: "Web application is down - {{ $labels.instance }}"
|
||||
description: |
|
||||
Web application instance {{ $labels.instance }} has been down for 2 minutes.
|
||||
|
||||
Check service health and logs immediately.
|
||||
runbook_url: "https://runbooks.example.com/service-down"
|
||||
|
||||
- name: webapp_latency
|
||||
interval: 30s
|
||||
rules:
|
||||
# High latency (p95)
|
||||
- alert: HighLatencyP95
|
||||
expr: |
|
||||
histogram_quantile(0.95,
|
||||
sum(rate(http_request_duration_seconds_bucket{job="webapp"}[5m])) by (le)
|
||||
) > 0.5
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
team: backend
|
||||
component: webapp
|
||||
annotations:
|
||||
summary: "High p95 latency - {{ $labels.job }}"
|
||||
description: |
|
||||
P95 request latency is {{ $value }}s, exceeding 500ms threshold.
|
||||
|
||||
This may impact user experience. Check for:
|
||||
- Slow database queries
|
||||
- External API issues
|
||||
- Resource saturation
|
||||
runbook_url: "https://runbooks.example.com/high-latency"
|
||||
dashboard: "https://grafana.example.com/d/webapp-latency"
|
||||
|
||||
# Very high latency (p99)
|
||||
- alert: HighLatencyP99
|
||||
expr: |
|
||||
histogram_quantile(0.99,
|
||||
sum(rate(http_request_duration_seconds_bucket{job="webapp"}[5m])) by (le)
|
||||
) > 2
|
||||
for: 5m
|
||||
labels:
|
||||
severity: critical
|
||||
team: backend
|
||||
component: webapp
|
||||
annotations:
|
||||
summary: "Critical latency degradation - {{ $labels.job }}"
|
||||
description: |
|
||||
P99 request latency is {{ $value }}s, exceeding 2s threshold.
|
||||
|
||||
Severe performance degradation detected.
|
||||
runbook_url: "https://runbooks.example.com/high-latency"
|
||||
|
||||
- name: webapp_resources
|
||||
interval: 30s
|
||||
rules:
|
||||
# High CPU
|
||||
- alert: HighCPU
|
||||
expr: |
|
||||
rate(process_cpu_seconds_total{job="webapp"}[5m]) * 100 > 80
|
||||
for: 15m
|
||||
labels:
|
||||
severity: warning
|
||||
team: backend
|
||||
component: webapp
|
||||
annotations:
|
||||
summary: "High CPU usage - {{ $labels.instance }}"
|
||||
description: |
|
||||
CPU usage is {{ $value | humanize }}% on {{ $labels.instance }}.
|
||||
|
||||
Consider scaling up or investigating CPU-intensive operations.
|
||||
runbook_url: "https://runbooks.example.com/high-cpu"
|
||||
|
||||
# High memory
|
||||
- alert: HighMemory
|
||||
expr: |
|
||||
(process_resident_memory_bytes{job="webapp"} / node_memory_MemTotal_bytes) * 100 > 80
|
||||
for: 15m
|
||||
labels:
|
||||
severity: warning
|
||||
team: backend
|
||||
component: webapp
|
||||
annotations:
|
||||
summary: "High memory usage - {{ $labels.instance }}"
|
||||
description: |
|
||||
Memory usage is {{ $value | humanize }}% on {{ $labels.instance }}.
|
||||
|
||||
Check for memory leaks or consider scaling up.
|
||||
runbook_url: "https://runbooks.example.com/high-memory"
|
||||
|
||||
- name: webapp_traffic
|
||||
interval: 30s
|
||||
rules:
|
||||
# Traffic spike
|
||||
- alert: TrafficSpike
|
||||
expr: |
|
||||
sum(rate(http_requests_total{job="webapp"}[5m]))
|
||||
>
|
||||
1.5 * sum(rate(http_requests_total{job="webapp"}[5m] offset 1h))
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
team: backend
|
||||
component: webapp
|
||||
annotations:
|
||||
summary: "Traffic spike detected - {{ $labels.job }}"
|
||||
description: |
|
||||
Request rate increased by 50% compared to 1 hour ago.
|
||||
|
||||
Current: {{ $value | humanize }} req/s
|
||||
|
||||
This could be:
|
||||
- Legitimate traffic increase
|
||||
- DDoS attack
|
||||
- Retry storm
|
||||
|
||||
Monitor closely and be ready to scale.
|
||||
runbook_url: "https://runbooks.example.com/traffic-spike"
|
||||
|
||||
# Traffic drop (potential issue)
|
||||
- alert: TrafficDrop
|
||||
expr: |
|
||||
sum(rate(http_requests_total{job="webapp"}[5m]))
|
||||
<
|
||||
0.5 * sum(rate(http_requests_total{job="webapp"}[5m] offset 1h))
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
team: backend
|
||||
component: webapp
|
||||
annotations:
|
||||
summary: "Traffic drop detected - {{ $labels.job }}"
|
||||
description: |
|
||||
Request rate dropped by 50% compared to 1 hour ago.
|
||||
|
||||
This could indicate:
|
||||
- Upstream service issue
|
||||
- DNS problems
|
||||
- Load balancer misconfiguration
|
||||
runbook_url: "https://runbooks.example.com/traffic-drop"
|
||||
|
||||
- name: webapp_dependencies
|
||||
interval: 30s
|
||||
rules:
|
||||
# Database connection pool exhaustion
|
||||
- alert: DatabasePoolExhausted
|
||||
expr: |
|
||||
(db_connection_pool_active / db_connection_pool_max) > 0.9
|
||||
for: 5m
|
||||
labels:
|
||||
severity: critical
|
||||
team: backend
|
||||
component: database
|
||||
annotations:
|
||||
summary: "Database connection pool near exhaustion"
|
||||
description: |
|
||||
Connection pool is {{ $value | humanizePercentage }} full.
|
||||
|
||||
This will cause request failures. Immediate action required.
|
||||
runbook_url: "https://runbooks.example.com/db-pool-exhausted"
|
||||
|
||||
# External API errors
|
||||
- alert: ExternalAPIErrors
|
||||
expr: |
|
||||
sum(rate(external_api_requests_total{status=~"5.."}[5m])) by (api)
|
||||
/
|
||||
sum(rate(external_api_requests_total[5m])) by (api)
|
||||
> 0.1
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
team: backend
|
||||
component: integration
|
||||
annotations:
|
||||
summary: "High error rate from external API - {{ $labels.api }}"
|
||||
description: |
|
||||
{{ $labels.api }} is returning errors at {{ $value | humanizePercentage }} rate.
|
||||
|
||||
Check API status page and consider enabling circuit breaker.
|
||||
runbook_url: "https://runbooks.example.com/external-api-errors"
|
||||
409  assets/templates/runbooks/incident-runbook-template.md  Normal file
@@ -0,0 +1,409 @@
|
||||
# Runbook: [Alert Name]
|
||||
|
||||
## Overview
|
||||
|
||||
**Alert Name**: [e.g., HighLatency, ServiceDown, ErrorBudgetBurn]
|
||||
|
||||
**Severity**: [Critical | Warning | Info]
|
||||
|
||||
**Team**: [e.g., Backend, Platform, Database]
|
||||
|
||||
**Component**: [e.g., API Gateway, User Service, PostgreSQL]
|
||||
|
||||
**What it means**: [One-line description of what this alert indicates]
|
||||
|
||||
**User impact**: [How does this affect users? High/Medium/Low]
|
||||
|
||||
**Urgency**: [How quickly must this be addressed? Immediate/Hours/Days]
|
||||
|
||||
---
|
||||
|
||||
## Alert Details
|
||||
|
||||
### When This Alert Fires
|
||||
|
||||
This alert fires when:
|
||||
- [Specific condition, e.g., "P95 latency exceeds 500ms for 10 minutes"]
|
||||
- [Any additional conditions]
|
||||
|
||||
### Symptoms
|
||||
|
||||
Users will experience:
|
||||
- [ ] Slow response times
|
||||
- [ ] Errors or failures
|
||||
- [ ] Service unavailable
|
||||
- [ ] [Other symptoms]
|
||||
|
||||
### Probable Causes
|
||||
|
||||
Common causes include:
|
||||
1. **[Cause 1]**: [Description]
|
||||
- Example: Database overload due to slow queries
|
||||
2. **[Cause 2]**: [Description]
|
||||
- Example: Memory leak causing OOM errors
|
||||
3. **[Cause 3]**: [Description]
|
||||
- Example: Upstream service degradation
|
||||
|
||||
---
|
||||
|
||||
## Investigation Steps
|
||||
|
||||
### 1. Check Service Health
|
||||
|
||||
**Dashboard**: [Link to primary dashboard]
|
||||
|
||||
**Key metrics to check**:
|
||||
```promql
|
||||
# Request rate
|
||||
sum(rate(http_requests_total[5m]))
|
||||
|
||||
# Error rate
|
||||
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
|
||||
|
||||
# Latency (p95, p99)
|
||||
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
```
|
||||
|
||||
**What to look for**:
|
||||
- [ ] Has traffic spiked recently?
|
||||
- [ ] Is error rate elevated?
|
||||
- [ ] Are any endpoints particularly slow?
|
||||
|
||||
### 2. Check Recent Changes
|
||||
|
||||
**Deployments**:
|
||||
```bash
|
||||
# Kubernetes
|
||||
kubectl rollout history deployment/[service-name] -n [namespace]
|
||||
|
||||
# Check when last deployed
|
||||
kubectl get pods -n [namespace] -o wide | grep [service-name]
|
||||
```
|
||||
|
||||
**What to look for**:
|
||||
- [ ] Was there a recent deployment?
|
||||
- [ ] Did alert start after deployment?
|
||||
- [ ] Any configuration changes?
|
||||
|
||||
### 3. Check Logs
|
||||
|
||||
**Log query** (adjust for your log system):
|
||||
```bash
|
||||
# Kubernetes
|
||||
kubectl logs deployment/[service-name] -n [namespace] --tail=100 | grep ERROR
|
||||
|
||||
# Elasticsearch/Kibana
|
||||
GET /logs-*/_search
|
||||
{
|
||||
"query": {
|
||||
"bool": {
|
||||
"must": [
|
||||
{ "match": { "service": "[service-name]" } },
|
||||
{ "match": { "level": "error" } },
|
||||
{ "range": { "@timestamp": { "gte": "now-30m" } } }
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
# Loki/LogQL
|
||||
{job="[service-name]"} |= "error" | json | level="error"
|
||||
```
|
||||
|
||||
**What to look for**:
|
||||
- [ ] Repeated error messages
|
||||
- [ ] Stack traces
|
||||
- [ ] Connection errors
|
||||
- [ ] Timeout errors
|
||||
|
||||
### 4. Check Dependencies
|
||||
|
||||
**Database**:
|
||||
```bash
|
||||
# Check active connections
|
||||
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
|
||||
|
||||
# Check slow queries
|
||||
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
|
||||
FROM pg_stat_activity
|
||||
WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 seconds';
|
||||
```
|
||||
|
||||
**External APIs**:
|
||||
- [ ] Check status pages: [Link to status pages]
|
||||
- [ ] Check API error rates in dashboard
|
||||
- [ ] Test API endpoints manually
|
||||
|
||||
**Cache** (Redis/Memcached):
|
||||
```bash
|
||||
# Redis info
|
||||
redis-cli -h [host] INFO stats
|
||||
|
||||
# Check memory usage
|
||||
redis-cli -h [host] INFO memory
|
||||
```
|
||||
|
||||
### 5. Check Resource Usage
|
||||
|
||||
**CPU and Memory**:
|
||||
```bash
|
||||
# Kubernetes
|
||||
kubectl top pods -n [namespace] | grep [service-name]
|
||||
|
||||
# Node metrics
|
||||
kubectl top nodes
|
||||
```
|
||||
|
||||
**Prometheus queries**:
|
||||
```promql
|
||||
# CPU usage by pod
|
||||
sum(rate(container_cpu_usage_seconds_total{pod=~"[service-name].*"}[5m])) by (pod)
|
||||
|
||||
# Memory usage by pod
|
||||
sum(container_memory_usage_bytes{pod=~"[service-name].*"}) by (pod)
|
||||
```
|
||||
|
||||
**What to look for**:
|
||||
- [ ] CPU throttling
|
||||
- [ ] Memory approaching limits
|
||||
- [ ] Disk space issues
|
||||
|
||||
### 6. Check Traces (if available)
|
||||
|
||||
**Trace query**:
|
||||
```bash
|
||||
# Jaeger
|
||||
# Search for slow traces (> 1s) in last 30 minutes
|
||||
|
||||
# Tempo/TraceQL
|
||||
{ duration > 1s && resource.service.name = "[service-name]" }
|
||||
```
|
||||
|
||||
**What to look for**:
|
||||
- [ ] Which operation is slow?
|
||||
- [ ] Where is time spent? (DB, external API, service logic)
|
||||
- [ ] Any N+1 query patterns?
|
||||
|
||||
---
|
||||
|
||||
## Common Scenarios and Solutions
|
||||
|
||||
### Scenario 1: Recent Deployment Caused Issue
|
||||
|
||||
**Symptoms**:
|
||||
- Alert started immediately after deployment
|
||||
- Error logs correlate with new code
|
||||
|
||||
**Solution**:
|
||||
```bash
|
||||
# Rollback deployment
|
||||
kubectl rollout undo deployment/[service-name] -n [namespace]
|
||||
|
||||
# Verify rollback succeeded
|
||||
kubectl rollout status deployment/[service-name] -n [namespace]
|
||||
|
||||
# Monitor for alert resolution
|
||||
```
|
||||
|
||||
**Follow-up**:
|
||||
- [ ] Create incident report
|
||||
- [ ] Review deployment process
|
||||
- [ ] Add pre-deployment checks
|
||||
|
||||
### Scenario 2: Database Performance Issue
|
||||
|
||||
**Symptoms**:
|
||||
- Slow query logs show problematic queries
|
||||
- Database CPU or connection pool exhausted
|
||||
|
||||
**Solution**:
|
||||
```bash
|
||||
# Identify slow query
|
||||
# Kill long-running query (use with caution)
|
||||
SELECT pg_cancel_backend([pid]);
|
||||
|
||||
# Or terminate if cancel doesn't work
|
||||
SELECT pg_terminate_backend([pid]);
|
||||
|
||||
# Add index if missing (in maintenance window)
|
||||
CREATE INDEX CONCURRENTLY idx_name ON table_name (column_name);
|
||||
```
|
||||
|
||||
**Follow-up**:
|
||||
- [ ] Add query performance test
|
||||
- [ ] Review and optimize query
|
||||
- [ ] Consider read replicas
|
||||
|
||||
### Scenario 3: Memory Leak
|
||||
|
||||
**Symptoms**:
|
||||
- Memory usage gradually increasing
|
||||
- Eventually OOMKilled
|
||||
- Restarts temporarily fix issue
|
||||
|
||||
**Solution**:
|
||||
```bash
|
||||
# Immediate: Restart pods
|
||||
kubectl rollout restart deployment/[service-name] -n [namespace]
|
||||
|
||||
# Increase memory limits (temporary)
|
||||
kubectl set resources deployment/[service-name] -n [namespace] \
|
||||
--limits=memory=2Gi
|
||||
```
|
||||
|
||||
**Follow-up**:
|
||||
- [ ] Profile application for memory leaks
|
||||
- [ ] Add memory usage alerts
|
||||
- [ ] Fix root cause
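For the profiling follow-up above, a low-overhead starting point on a Python service is the standard library's `tracemalloc`; a minimal sketch (frame depth and reporting hook are illustrative):

```python
import tracemalloc

tracemalloc.start(25)                      # keep 25 stack frames per allocation for useful tracebacks
baseline = tracemalloc.take_snapshot()     # snapshot taken at startup

def report_growth(limit: int = 10) -> None:
    """Call later (e.g. from a debug endpoint) to see which code paths keep allocating."""
    current = tracemalloc.take_snapshot()
    for stat in current.compare_to(baseline, "lineno")[:limit]:
        print(stat)
```

For non-Python services, the language's own profiler (pprof, async-profiler, etc.) plays the same role.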
|
||||
|
||||
### Scenario 4: Traffic Spike / DDoS
|
||||
|
||||
**Symptoms**:
|
||||
- Sudden traffic increase
|
||||
- Traffic from unusual sources
|
||||
- High CPU/memory across all instances
|
||||
|
||||
**Solution**:
|
||||
```bash
|
||||
# Scale up immediately
|
||||
kubectl scale deployment/[service-name] -n [namespace] --replicas=10
|
||||
|
||||
# Enable rate limiting at load balancer level
|
||||
# (Specific steps depend on LB)
|
||||
|
||||
# Block suspicious IPs if confirmed DDoS
|
||||
# (Use WAF or network policies)
|
||||
```
|
||||
|
||||
**Follow-up**:
|
||||
- [ ] Implement rate limiting
|
||||
- [ ] Add DDoS protection (CloudFlare, WAF)
|
||||
- [ ] Set up auto-scaling
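If rate limiting has to land in the application while the load-balancer change is pending, a minimal token-bucket sketch can serve as a stopgap (the per-IP key, rate, and burst size are illustrative):

```python
import time
from collections import defaultdict

class TokenBucket:
    """Allow up to `rate` requests per second per key, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = defaultdict(lambda: capacity)
        self.updated = defaultdict(time.monotonic)

    def allow(self, key: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.updated[key]
        self.updated[key] = now
        self.tokens[key] = min(self.capacity, self.tokens[key] + elapsed * self.rate)
        if self.tokens[key] >= 1:
            self.tokens[key] -= 1
            return True
        return False

limiter = TokenBucket(rate=10, capacity=20)   # ~10 req/s per client IP, bursts of 20

def handle_request(client_ip: str):
    if not limiter.allow(client_ip):
        return 429, "Too Many Requests"
    return 200, "OK"
```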
|
||||
|
||||
### Scenario 5: Upstream Service Degradation
|
||||
|
||||
**Symptoms**:
|
||||
- Errors calling external API
|
||||
- Timeouts to upstream service
|
||||
- Upstream status page shows issues
|
||||
|
||||
**Solution**:
|
||||
```bash
|
||||
# Enable circuit breaker (if available)
|
||||
# Adjust timeout configuration
|
||||
# Switch to backup service/cached data
|
||||
|
||||
# Monitor external service
|
||||
# Check status page: [Link]
|
||||
```
|
||||
|
||||
**Follow-up**:
|
||||
- [ ] Implement circuit breaker pattern
|
||||
- [ ] Add fallback mechanisms
|
||||
- [ ] Set up external service monitoring
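If the service has no circuit breaker yet, a minimal in-process sketch can buy time until a proper library or service-mesh policy is in place (thresholds and the wrapped call are illustrative):

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive errors; allow a trial call after `reset_timeout` seconds."""

    def __init__(self, max_failures: int = 5, reset_timeout: float = 30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: upstream considered unhealthy")
            self.opened_at = None          # half-open: let one trial request through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

breaker = CircuitBreaker()
# response = breaker.call(requests.get, "https://api.vendor.example/v1/prices", timeout=2)
```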
|
||||
|
||||
---
|
||||
|
||||
## Immediate Actions (< 5 minutes)
|
||||
|
||||
These should be done first to mitigate impact:
|
||||
|
||||
1. **[Action 1]**: [e.g., "Scale up service"]
|
||||
```bash
|
||||
kubectl scale deployment/[service] --replicas=10
|
||||
```
|
||||
|
||||
2. **[Action 2]**: [e.g., "Rollback deployment"]
|
||||
```bash
|
||||
kubectl rollout undo deployment/[service]
|
||||
```
|
||||
|
||||
3. **[Action 3]**: [e.g., "Enable circuit breaker"]
|
||||
|
||||
---
|
||||
|
||||
## Short-term Actions (< 30 minutes)
|
||||
|
||||
After immediate mitigation:
|
||||
|
||||
1. **[Action 1]**: [e.g., "Investigate root cause"]
|
||||
2. **[Action 2]**: [e.g., "Optimize slow query"]
|
||||
3. **[Action 3]**: [e.g., "Clear cache if stale"]
|
||||
|
||||
---
|
||||
|
||||
## Long-term Actions (Post-Incident)
|
||||
|
||||
Preventive measures:
|
||||
|
||||
1. **[Action 1]**: [e.g., "Add circuit breaker"]
|
||||
2. **[Action 2]**: [e.g., "Implement auto-scaling"]
|
||||
3. **[Action 3]**: [e.g., "Add query performance tests"]
|
||||
4. **[Action 4]**: [e.g., "Update alert thresholds"]
|
||||
|
||||
---
|
||||
|
||||
## Escalation
|
||||
|
||||
If issue persists after 30 minutes:
|
||||
|
||||
**Escalation Path**:
|
||||
1. **Primary oncall**: @[username] ([slack/email])
|
||||
2. **Team lead**: @[username] ([slack/email])
|
||||
3. **Engineering manager**: @[username] ([slack/email])
|
||||
4. **Incident commander**: @[username] ([slack/email])
|
||||
|
||||
**Communication**:
|
||||
- **Slack channel**: #[incidents-channel]
|
||||
- **Status page**: [Link]
|
||||
- **Incident tracking**: [Link to incident management tool]
|
||||
|
||||
---
|
||||
|
||||
## Related Runbooks
|
||||
|
||||
- [Related Runbook 1]
|
||||
- [Related Runbook 2]
|
||||
- [Related Runbook 3]
|
||||
|
||||
## Related Dashboards
|
||||
|
||||
- [Main Service Dashboard]
|
||||
- [Resource Usage Dashboard]
|
||||
- [Dependency Dashboard]
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [Architecture Diagram]
|
||||
- [Service Documentation]
|
||||
- [API Documentation]
|
||||
|
||||
---
|
||||
|
||||
## Recent Incidents
|
||||
|
||||
| Date | Duration | Root Cause | Resolution | Ticket |
|
||||
|------|----------|------------|------------|--------|
|
||||
| 2024-10-15 | 23 min | Database pool exhausted | Increased pool size | INC-123 |
|
||||
| 2024-09-30 | 45 min | Memory leak | Fixed code, restarted | INC-120 |
|
||||
|
||||
---
|
||||
|
||||
## Runbook Metadata
|
||||
|
||||
**Last Updated**: [Date]
|
||||
|
||||
**Owner**: [Team name]
|
||||
|
||||
**Reviewers**: [Names]
|
||||
|
||||
**Next Review**: [Date]
|
||||
|
||||
---
|
||||
|
||||
## Notes
|
||||
|
||||
- This runbook should be reviewed quarterly
|
||||
- Update after each incident to capture new learnings
|
||||
- Keep investigation steps concise and actionable
|
||||
- Include actual commands that can be copy-pasted
|
||||
BIN
monitoring-observability.skill
Normal file
Binary file not shown.
125
plugin.lock.json
Normal file
@@ -0,0 +1,125 @@
|
||||
{
|
||||
"$schema": "internal://schemas/plugin.lock.v1.json",
|
||||
"pluginId": "gh:ahmedasmar/devops-claude-skills:monitoring-observability",
|
||||
"normalized": {
|
||||
"repo": null,
|
||||
"ref": "refs/tags/v20251128.0",
|
||||
"commit": "9bb89b1ce889c2df6d7c3c2eedbd6d1301297561",
|
||||
"treeHash": "9fd50a78a79b6d45553e3372bc2d5142f4c48ba4a945ca724356f89f9ce08825",
|
||||
"generatedAt": "2025-11-28T10:13:03.403599Z",
|
||||
"toolVersion": "publish_plugins.py@0.2.0"
|
||||
},
|
||||
"origin": {
|
||||
"remote": "git@github.com:zhongweili/42plugin-data.git",
|
||||
"branch": "master",
|
||||
"commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
|
||||
"repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
|
||||
},
|
||||
"manifest": {
|
||||
"name": "monitoring-observability",
|
||||
"description": "Monitoring and observability strategy, metrics/logs/traces systems, SLOs/error budgets, Prometheus/Grafana/Loki, OpenTelemetry, and tool comparison",
|
||||
"version": null
|
||||
},
|
||||
"content": {
|
||||
"files": [
|
||||
{
|
||||
"path": "README.md",
|
||||
"sha256": "b18b6358cf31ab285b751916a5b2c670b5bc2c8748ef17216f2c9106e4997f8e"
|
||||
},
|
||||
{
|
||||
"path": "SKILL.md",
|
||||
"sha256": "c02fcac42ed2d4d6fcda67a9f835000b1a1198734e4d8d18000546dda81402e4"
|
||||
},
|
||||
{
|
||||
"path": "monitoring-observability.skill",
|
||||
"sha256": "c2c368577bb73885c887cc824b695fb3d36f4a77e74b2e25dcd7815c331a71c1"
|
||||
},
|
||||
{
|
||||
"path": "references/alerting_best_practices.md",
|
||||
"sha256": "99cea7a40310b77a4fdff5543a0b1ee44189497508757bee0dc9ebbe11794a53"
|
||||
},
|
||||
{
|
||||
"path": "references/metrics_design.md",
|
||||
"sha256": "6edc73473e9d3c2ac7e46a4d97576d356d177ed701a2468c5e21d528ff9c29d7"
|
||||
},
|
||||
{
|
||||
"path": "references/tracing_guide.md",
|
||||
"sha256": "5e419d77a31d8b3ee5c16fb57e1fc6e3e16d31efb8f4a86dd756c7327a482fa0"
|
||||
},
|
||||
{
|
||||
"path": "references/dql_promql_translation.md",
|
||||
"sha256": "47113e77b03d9ac70fc35121efd93cf5e17e031b878d27791403493b71058c5c"
|
||||
},
|
||||
{
|
||||
"path": "references/tool_comparison.md",
|
||||
"sha256": "fd0fc7e4fc3641ca0ddc469a14fa1373457f5a4586fe4bc7ec23afe3de9f6171"
|
||||
},
|
||||
{
|
||||
"path": "references/datadog_migration.md",
|
||||
"sha256": "9ed5e276eb2ea67f72c91e1bb53374b293e164fa28c4c44f31ee9f8660dfaf02"
|
||||
},
|
||||
{
|
||||
"path": "references/logging_guide.md",
|
||||
"sha256": "2c94b61d6db2c0f6b8927c8092010f3a2f1ea20d2eefd330d8073e7b4bcf4c9d"
|
||||
},
|
||||
{
|
||||
"path": "references/slo_sla_guide.md",
|
||||
"sha256": "2a0cb69dd120897183f7bcab002a368dbe11bd5038817906da3391ca168e0052"
|
||||
},
|
||||
{
|
||||
"path": "scripts/log_analyzer.py",
|
||||
"sha256": "c7fb7e13c2d6507c81ee9575fc8514408d36b2f2e786caeb536ba927d517046e"
|
||||
},
|
||||
{
|
||||
"path": "scripts/analyze_metrics.py",
|
||||
"sha256": "50ad856cb043dfd70b60c6ca685b526d34b8bc5e5454dd0b530033da3da22545"
|
||||
},
|
||||
{
|
||||
"path": "scripts/health_check_validator.py",
|
||||
"sha256": "cef8c447fabf83dfd9bd28a8d22127b87b66aafa4d151cbccd9fe1f1db0bbcf2"
|
||||
},
|
||||
{
|
||||
"path": "scripts/alert_quality_checker.py",
|
||||
"sha256": "b561cf9c41e2de8d5f09557c018110553047d0ad54629bdc7a07a654d76263d1"
|
||||
},
|
||||
{
|
||||
"path": "scripts/datadog_cost_analyzer.py",
|
||||
"sha256": "05a1c6c0033b04f2f5206af015907f2df4c9cf57f4c2b8f10ba2565236a5c97f"
|
||||
},
|
||||
{
|
||||
"path": "scripts/slo_calculator.py",
|
||||
"sha256": "c26ab0f0a31e5efa830a9f24938ec356bfaef927438bd47b95f4ad0015cff662"
|
||||
},
|
||||
{
|
||||
"path": "scripts/dashboard_generator.py",
|
||||
"sha256": "6fe98a49ae431d67bc44eb631c542ba29199da72cc348e90ec99d73a05783ee5"
|
||||
},
|
||||
{
|
||||
"path": ".claude-plugin/plugin.json",
|
||||
"sha256": "7b6a16e6bce66bf87929c2f3c4ea32f4bfadd8d9606edd195f144c82ec85f151"
|
||||
},
|
||||
{
|
||||
"path": "assets/templates/prometheus-alerts/webapp-alerts.yml",
|
||||
"sha256": "d881081e53650c335ec5cc7d5d96bade03e607e55bff3bcbafe6811377055154"
|
||||
},
|
||||
{
|
||||
"path": "assets/templates/prometheus-alerts/kubernetes-alerts.yml",
|
||||
"sha256": "cb8c247b245ea1fb2a904f525fce8f74f9237d79eda04c2c60938135a7271415"
|
||||
},
|
||||
{
|
||||
"path": "assets/templates/runbooks/incident-runbook-template.md",
|
||||
"sha256": "1a5ba8951cf5b1408ea2101232ffe8d88fab75ed4ae63b0c9f1902059373112d"
|
||||
},
|
||||
{
|
||||
"path": "assets/templates/otel-config/collector-config.yaml",
|
||||
"sha256": "2696548b1c7f4034283cc2387f9730efa4811881d1c9c9219002e7affc8c29f2"
|
||||
}
|
||||
],
|
||||
"dirSha256": "9fd50a78a79b6d45553e3372bc2d5142f4c48ba4a945ca724356f89f9ce08825"
|
||||
},
|
||||
"security": {
|
||||
"scannedAt": null,
|
||||
"scannerVersion": null,
|
||||
"flags": []
|
||||
}
|
||||
}
|
||||
609
references/alerting_best_practices.md
Normal file
@@ -0,0 +1,609 @@
|
||||
# Alerting Best Practices
|
||||
|
||||
## Core Principles
|
||||
|
||||
### 1. Every Alert Should Be Actionable
|
||||
If you can't do something about it, don't alert on it.
|
||||
|
||||
❌ Bad: `Alert: CPU > 50%` (What action should be taken?)
|
||||
✅ Good: `Alert: API latency p95 > 2s for 10m` (Investigate/scale up)
|
||||
|
||||
### 2. Alert on Symptoms, Not Causes
|
||||
Alert on what users experience, not underlying components.
|
||||
|
||||
❌ Bad: `Database connection pool 80% full`
|
||||
✅ Good: `Request latency p95 > 1s` (which might be caused by DB pool)
|
||||
|
||||
### 3. Alert on SLO Violations
|
||||
Tie alerts to Service Level Objectives.
|
||||
|
||||
✅ `Error rate exceeds 0.1% (SLO: 99.9% availability)`
|
||||
|
||||
### 4. Reduce Noise
|
||||
Alert fatigue is real. Only page for critical issues.
|
||||
|
||||
**Alert Severity Levels**:
|
||||
- **Critical**: Page on-call immediately (user-facing issue)
|
||||
- **Warning**: Create ticket, review during business hours
|
||||
- **Info**: Log for awareness, no action needed
|
||||
|
||||
---
|
||||
|
||||
## Alert Design Patterns
|
||||
|
||||
### Pattern 1: Multi-Window Multi-Burn-Rate
|
||||
|
||||
Google's recommended SLO alerting approach.
|
||||
|
||||
**Concept**: Alert when error budget burn rate is high enough to exhaust the budget too quickly.
|
||||
|
||||
```yaml
|
||||
# Fast burn: at this rate ~2% of a 30-day error budget is consumed per hour
|
||||
- alert: FastBurnRate
|
||||
expr: |
|
||||
sum(rate(http_requests_total{status=~"5.."}[1h]))
|
||||
/
|
||||
sum(rate(http_requests_total[1h]))
|
||||
> (14.4 * 0.001) # 14.4x burn rate for 99.9% SLO
|
||||
for: 2m
|
||||
labels:
|
||||
severity: critical
|
||||
|
||||
# Slow burn: at this rate ~5% of a 30-day error budget is consumed in 6 hours
|
||||
- alert: SlowBurnRate
|
||||
expr: |
|
||||
sum(rate(http_requests_total{status=~"5.."}[6h]))
|
||||
/
|
||||
sum(rate(http_requests_total[6h]))
|
||||
> (6 * 0.001) # 6x burn rate for 99.9% SLO
|
||||
for: 30m
|
||||
labels:
|
||||
severity: warning
|
||||
```
|
||||
|
||||
**Burn Rate Multipliers for 99.9% SLO (0.1% error budget)**:
|
||||
- 1 hour window, 2m grace: 14.4x burn rate
|
||||
- 6 hour window, 30m grace: 6x burn rate
|
||||
- 3 day window, 6h grace: 1x burn rate
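These multipliers follow from simple arithmetic: budget consumed = burn rate × window / SLO period. A quick sanity check in Python, assuming a 30-day SLO period:

```python
SLO_PERIOD_HOURS = 30 * 24  # 30-day rolling window

def budget_consumed(burn_rate: float, window_hours: float) -> float:
    """Fraction of the total error budget consumed over `window_hours` at `burn_rate`."""
    return burn_rate * window_hours / SLO_PERIOD_HOURS

print(budget_consumed(14.4, 1))   # 0.02 -> 2% of the budget gone in 1 hour
print(budget_consumed(6, 6))      # 0.05 -> 5% of the budget gone in 6 hours
print(budget_consumed(1, 72))     # 0.10 -> 10% of the budget gone in 3 days
```

Multiplying the burn rate by the error budget (e.g. 14.4 × 0.001) gives the error-rate threshold used in the expressions above.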
|
||||
|
||||
### Pattern 2: Rate of Change
|
||||
Alert when metrics change rapidly.
|
||||
|
||||
```yaml
|
||||
- alert: TrafficSpike
|
||||
expr: |
|
||||
sum(rate(http_requests_total[5m]))
|
||||
>
|
||||
1.5 * sum(rate(http_requests_total[5m] offset 1h))
|
||||
for: 10m
|
||||
annotations:
|
||||
summary: "Traffic increased by 50% compared to 1 hour ago"
|
||||
```
|
||||
|
||||
### Pattern 3: Threshold with Hysteresis
|
||||
Prevent flapping with different thresholds for firing and resolving.
|
||||
|
||||
```yaml
|
||||
# Fire at 90%; keep firing until usage drops back below 70%
# (uses the built-in ALERTS series to reference the alert's own firing state)
- alert: HighCPU
  expr: |
    cpu_usage > 90
    or
    (cpu_usage > 70 and on(instance) ALERTS{alertname="HighCPU", alertstate="firing"})
  for: 5m
|
||||
```
|
||||
|
||||
### Pattern 4: Absent Metrics
|
||||
Alert when expected metrics stop being reported (service down).
|
||||
|
||||
```yaml
|
||||
- alert: ServiceDown
|
||||
expr: absent(up{job="my-service"})
|
||||
for: 5m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "Service {{ $labels.job }} is not reporting metrics"
|
||||
```
|
||||
|
||||
### Pattern 5: Aggregate Alerts
|
||||
Alert on aggregate performance across multiple instances.
|
||||
|
||||
```yaml
|
||||
- alert: HighOverallErrorRate
|
||||
expr: |
|
||||
sum(rate(http_requests_total{status=~"5.."}[5m]))
|
||||
/
|
||||
sum(rate(http_requests_total[5m]))
|
||||
> 0.05
|
||||
for: 10m
|
||||
annotations:
|
||||
summary: "Overall error rate is {{ $value | humanizePercentage }}"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Alert Annotation Best Practices
|
||||
|
||||
### Required Fields
|
||||
|
||||
**summary**: One-line description of the issue
|
||||
```yaml
|
||||
summary: "High error rate on {{ $labels.service }}: {{ $value | humanizePercentage }}"
|
||||
```
|
||||
|
||||
**description**: Detailed explanation with context
|
||||
```yaml
|
||||
description: |
|
||||
Error rate on {{ $labels.service }} is {{ $value | humanizePercentage }},
|
||||
which exceeds the threshold of 1% for more than 10 minutes.
|
||||
|
||||
Current value: {{ $value }}
|
||||
Runbook: https://runbooks.example.com/high-error-rate
|
||||
```
|
||||
|
||||
**runbook_url**: Link to investigation steps
|
||||
```yaml
|
||||
runbook_url: "https://runbooks.example.com/alerts/{{ $labels.alertname }}"
|
||||
```
|
||||
|
||||
### Optional but Recommended
|
||||
|
||||
**dashboard**: Link to relevant dashboard
|
||||
```yaml
|
||||
dashboard: "https://grafana.example.com/d/service-dashboard?var-service={{ $labels.service }}"
|
||||
```
|
||||
|
||||
**logs**: Link to logs
|
||||
```yaml
|
||||
logs: "https://kibana.example.com/app/discover#/?_a=(query:(query_string:(query:'service:{{ $labels.service }}')))"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Alert Label Best Practices
|
||||
|
||||
### Required Labels
|
||||
|
||||
**severity**: Critical, warning, or info
|
||||
```yaml
|
||||
labels:
|
||||
severity: critical
|
||||
```
|
||||
|
||||
**team**: Who should handle this alert
|
||||
```yaml
|
||||
labels:
|
||||
team: platform
|
||||
severity: critical
|
||||
```
|
||||
|
||||
**component**: What part of the system
|
||||
```yaml
|
||||
labels:
|
||||
component: api-gateway
|
||||
severity: warning
|
||||
```
|
||||
|
||||
### Example Complete Alert
|
||||
```yaml
|
||||
- alert: HighLatency
|
||||
expr: |
|
||||
histogram_quantile(0.95,
|
||||
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
|
||||
) > 1
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
team: backend
|
||||
component: api
|
||||
environment: "{{ $labels.environment }}"
|
||||
annotations:
|
||||
summary: "High latency on {{ $labels.service }}"
|
||||
description: |
|
||||
P95 latency on {{ $labels.service }} is {{ $value }}s, exceeding 1s threshold.
|
||||
|
||||
This may impact user experience. Check recent deployments and database performance.
|
||||
|
||||
Current p95: {{ $value }}s
|
||||
Threshold: 1s
|
||||
Duration: 10m+
|
||||
runbook_url: "https://runbooks.example.com/high-latency"
|
||||
dashboard: "https://grafana.example.com/d/api-dashboard"
|
||||
logs: "https://kibana.example.com/app/discover#/?_a=(query:(query_string:(query:'service:{{ $labels.service }} AND level:error')))"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Alert Thresholds
|
||||
|
||||
### General Guidelines
|
||||
|
||||
**Response Time / Latency**:
|
||||
- Warning: p95 > 500ms or p99 > 1s
|
||||
- Critical: p95 > 2s or p99 > 5s
|
||||
|
||||
**Error Rate**:
|
||||
- Warning: > 1%
|
||||
- Critical: > 5%
|
||||
|
||||
**Availability**:
|
||||
- Warning: < 99.9%
|
||||
- Critical: < 99.5%
|
||||
|
||||
**CPU Utilization**:
|
||||
- Warning: > 70% for 15m
|
||||
- Critical: > 90% for 5m
|
||||
|
||||
**Memory Utilization**:
|
||||
- Warning: > 80% for 15m
|
||||
- Critical: > 95% for 5m
|
||||
|
||||
**Disk Space**:
|
||||
- Warning: > 80% full
|
||||
- Critical: > 90% full
|
||||
|
||||
**Queue Depth**:
|
||||
- Warning: > 70% of max capacity
|
||||
- Critical: > 90% of max capacity
|
||||
|
||||
### Application-Specific Thresholds
|
||||
|
||||
Set thresholds based on:
|
||||
1. **Historical performance**: Use p95 of last 30 days + 20%
|
||||
2. **SLO requirements**: If SLO is 99.9%, alert at 99.5%
|
||||
3. **Business impact**: What error rate causes user complaints?
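For the first approach, the historical baseline can be pulled straight from Prometheus' HTTP API; a sketch (the Prometheus URL, metric, subquery resolution, and 20% margin are illustrative):

```python
import requests

PROMETHEUS = "http://prometheus:9090"

def latency_threshold(margin: float = 1.2) -> float:
    """Return the p95-of-the-last-30-days latency plus a 20% margin, as an alert threshold."""
    query = (
        'quantile_over_time(0.95, '
        'histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'
        '[30d:1h])'
    )
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query}, timeout=30)
    resp.raise_for_status()
    value = float(resp.json()["data"]["result"][0]["value"][1])
    return value * margin

print(f"Suggested warning threshold: {latency_threshold():.3f}s")
```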
|
||||
|
||||
---
|
||||
|
||||
## The "for" Clause
|
||||
|
||||
Prevent alert flapping by requiring the condition to be true for a duration.
|
||||
|
||||
### Guidelines
|
||||
|
||||
**Critical alerts**: Short duration (2-5m)
|
||||
```yaml
|
||||
- alert: ServiceDown
|
||||
expr: up == 0
|
||||
for: 2m # Quick detection for critical issues
|
||||
```
|
||||
|
||||
**Warning alerts**: Longer duration (10-30m)
|
||||
```yaml
|
||||
- alert: HighMemoryUsage
|
||||
expr: memory_usage > 80
|
||||
for: 15m # Avoid noise from temporary spikes
|
||||
```
|
||||
|
||||
**Resource saturation**: Medium duration (5-10m)
|
||||
```yaml
|
||||
- alert: HighCPU
|
||||
expr: cpu_usage > 90
|
||||
for: 5m
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Alert Routing
|
||||
|
||||
### Severity-Based Routing
|
||||
|
||||
```yaml
|
||||
# alertmanager.yml
|
||||
route:
|
||||
group_by: ['alertname', 'cluster']
|
||||
group_wait: 10s
|
||||
group_interval: 5m
|
||||
repeat_interval: 4h
|
||||
receiver: 'default'
|
||||
|
||||
routes:
|
||||
# Critical alerts → PagerDuty
|
||||
- match:
|
||||
severity: critical
|
||||
receiver: pagerduty
|
||||
group_wait: 10s
|
||||
repeat_interval: 5m
|
||||
|
||||
# Warning alerts → Slack
|
||||
- match:
|
||||
severity: warning
|
||||
receiver: slack
|
||||
group_wait: 30s
|
||||
repeat_interval: 12h
|
||||
|
||||
# Info alerts → Email
|
||||
- match:
|
||||
severity: info
|
||||
receiver: email
|
||||
repeat_interval: 24h
|
||||
```
|
||||
|
||||
### Team-Based Routing
|
||||
|
||||
```yaml
|
||||
routes:
|
||||
# Platform team
|
||||
- match:
|
||||
team: platform
|
||||
receiver: platform-pagerduty
|
||||
|
||||
# Backend team
|
||||
- match:
|
||||
team: backend
|
||||
receiver: backend-slack
|
||||
|
||||
# Database team
|
||||
- match:
|
||||
component: database
|
||||
receiver: dba-pagerduty
|
||||
```
|
||||
|
||||
### Time-Based Routing
|
||||
|
||||
```yaml
|
||||
# Only page during business hours for non-critical
|
||||
routes:
|
||||
- match:
|
||||
severity: warning
|
||||
receiver: slack
|
||||
active_time_intervals:
|
||||
- business_hours
|
||||
|
||||
time_intervals:
|
||||
- name: business_hours
|
||||
time_intervals:
|
||||
- weekdays: ['monday:friday']
|
||||
times:
|
||||
- start_time: '09:00'
|
||||
end_time: '17:00'
|
||||
location: 'America/New_York'
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Alert Grouping
|
||||
|
||||
### Intelligent Grouping
|
||||
|
||||
**Group by service and environment**:
|
||||
```yaml
|
||||
route:
|
||||
group_by: ['alertname', 'service', 'environment']
|
||||
group_wait: 30s
|
||||
group_interval: 5m
|
||||
```
|
||||
|
||||
This prevents:
|
||||
- 50 alerts for "HighCPU" on different pods → 1 grouped alert
|
||||
- Mixing production and staging alerts
|
||||
|
||||
### Inhibition Rules
|
||||
|
||||
Suppress related alerts when a parent alert fires.
|
||||
|
||||
```yaml
|
||||
inhibit_rules:
|
||||
# If service is down, suppress latency alerts
|
||||
- source_match:
|
||||
alertname: ServiceDown
|
||||
target_match:
|
||||
alertname: HighLatency
|
||||
equal: ['service']
|
||||
|
||||
# If node is down, suppress all pod alerts on that node
|
||||
- source_match:
|
||||
alertname: NodeDown
|
||||
target_match_re:
|
||||
alertname: '(PodCrashLoop|HighCPU|HighMemory)'
|
||||
equal: ['node']
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Runbook Structure
|
||||
|
||||
Every alert should link to a runbook with:
|
||||
|
||||
### 1. Context
|
||||
- What does this alert mean?
|
||||
- What is the user impact?
|
||||
- What is the urgency?
|
||||
|
||||
### 2. Investigation Steps
|
||||
```markdown
|
||||
## Investigation
|
||||
|
||||
1. Check service health dashboard
|
||||
https://grafana.example.com/d/service-dashboard
|
||||
|
||||
2. Check recent deployments
|
||||
kubectl rollout history deployment/myapp -n production
|
||||
|
||||
3. Check error logs
|
||||
kubectl logs deployment/myapp -n production --tail=100 | grep ERROR
|
||||
|
||||
4. Check dependencies
|
||||
- Database: Check slow query log
|
||||
- Redis: Check memory usage
|
||||
- External APIs: Check status pages
|
||||
```
|
||||
|
||||
### 3. Common Causes
|
||||
```markdown
|
||||
## Common Causes
|
||||
|
||||
- **Recent deployment**: Check if alert started after deployment
|
||||
- **Traffic spike**: Check request rate, might need to scale
|
||||
- **Database issues**: Check query performance and connection pool
|
||||
- **External API degradation**: Check third-party status pages
|
||||
```
|
||||
|
||||
### 4. Resolution Steps
|
||||
```markdown
|
||||
## Resolution
|
||||
|
||||
### Immediate Actions (< 5 minutes)
|
||||
1. Scale up if traffic spike: `kubectl scale deployment myapp --replicas=10`
|
||||
2. Rollback if recent deployment: `kubectl rollout undo deployment/myapp`
|
||||
|
||||
### Short-term Actions (< 30 minutes)
|
||||
1. Restart pods if memory leak: `kubectl rollout restart deployment/myapp`
|
||||
2. Clear cache if stale data: `redis-cli -h cache.example.com FLUSHDB`
|
||||
|
||||
### Long-term Actions (post-incident)
|
||||
1. Review and optimize slow queries
|
||||
2. Implement circuit breakers
|
||||
3. Add more capacity
|
||||
4. Update alert thresholds if false positive
|
||||
```
|
||||
|
||||
### 5. Escalation
|
||||
```markdown
|
||||
## Escalation
|
||||
|
||||
If issue persists after 30 minutes:
|
||||
- Slack: #backend-oncall
|
||||
- PagerDuty: Escalate to senior engineer
|
||||
- Incident Commander: Jane Doe (jane@example.com)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Anti-Patterns to Avoid
|
||||
|
||||
### 1. Alert on Everything
|
||||
❌ Don't: Alert on every warning log
|
||||
✅ Do: Alert on error rate exceeding threshold
|
||||
|
||||
### 2. Alert Without Context
|
||||
❌ Don't: "Error rate high"
|
||||
✅ Do: "Error rate 5.2% exceeds 1% threshold for 10m, impacting checkout flow"
|
||||
|
||||
### 3. Static Thresholds for Dynamic Systems
|
||||
❌ Don't: `cpu_usage > 70` (fails during scale-up)
|
||||
✅ Do: Alert on SLO violations or rate of change
|
||||
|
||||
### 4. No "for" Clause
|
||||
❌ Don't: Alert immediately on threshold breach
|
||||
✅ Do: Use `for: 5m` to avoid flapping
|
||||
|
||||
### 5. Too Many Recipients
|
||||
❌ Don't: Page 10 people for every alert
|
||||
✅ Do: Route to specific on-call rotation
|
||||
|
||||
### 6. Duplicate Alerts
|
||||
❌ Don't: Alert on both cause and symptom
|
||||
✅ Do: Alert on symptom, use inhibition for causes
|
||||
|
||||
### 7. No Runbook
|
||||
❌ Don't: Alert without guidance
|
||||
✅ Do: Include runbook_url in every alert
|
||||
|
||||
---
|
||||
|
||||
## Alert Testing
|
||||
|
||||
### Test Alert Firing
|
||||
```bash
|
||||
# Trigger test alert in Prometheus
|
||||
amtool alert add alertname="TestAlert" \
|
||||
severity="warning" \
|
||||
summary="Test alert"
|
||||
|
||||
# Or use Alertmanager API
|
||||
curl -X POST http://alertmanager:9093/api/v1/alerts \
|
||||
-d '[{
|
||||
"labels": {"alertname": "TestAlert", "severity": "critical"},
|
||||
"annotations": {"summary": "Test critical alert"}
|
||||
}]'
|
||||
```
|
||||
|
||||
### Verify Alert Rules
|
||||
```bash
|
||||
# Check syntax
|
||||
promtool check rules alerts.yml
|
||||
|
||||
# Test expression
|
||||
promtool query instant http://prometheus:9090 \
|
||||
'sum(rate(http_requests_total{status=~"5.."}[5m]))'
|
||||
|
||||
# Unit test alerts
|
||||
promtool test rules test.yml
|
||||
```
|
||||
|
||||
### Test Alertmanager Routing
|
||||
```bash
|
||||
# Test which receiver an alert would go to
|
||||
amtool config routes test \
|
||||
--config.file=alertmanager.yml \
|
||||
alertname="HighLatency" \
|
||||
severity="critical" \
|
||||
team="backend"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## On-Call Best Practices
|
||||
|
||||
### Rotation Schedule
|
||||
- **Primary on-call**: First responder
|
||||
- **Secondary on-call**: Escalation backup
|
||||
- **Rotation length**: 1 week (balance load vs context)
|
||||
- **Handoff**: Monday morning (not Friday evening)
|
||||
|
||||
### On-Call Checklist
|
||||
```markdown
|
||||
## Pre-shift
|
||||
- [ ] Test pager/phone
|
||||
- [ ] Review recent incidents
|
||||
- [ ] Check upcoming deployments
|
||||
- [ ] Update contact info
|
||||
|
||||
## During shift
|
||||
- [ ] Respond to pages within 5 minutes
|
||||
- [ ] Document all incidents
|
||||
- [ ] Update runbooks if gaps found
|
||||
- [ ] Communicate in #incidents channel
|
||||
|
||||
## Post-shift
|
||||
- [ ] Hand off open incidents
|
||||
- [ ] Complete incident reports
|
||||
- [ ] Suggest improvements
|
||||
- [ ] Update team documentation
|
||||
```
|
||||
|
||||
### Escalation Policy
|
||||
1. **Primary**: Responds within 5 minutes
|
||||
2. **Secondary**: Auto-escalate after 15 minutes
|
||||
3. **Manager**: Auto-escalate after 30 minutes
|
||||
4. **Incident Commander**: Critical incidents only
|
||||
|
||||
---
|
||||
|
||||
## Metrics About Alerts
|
||||
|
||||
Monitor your monitoring system!
|
||||
|
||||
### Key Metrics
|
||||
```promql
|
||||
# Alert firing frequency
|
||||
sum(ALERTS{alertstate="firing"}) by (alertname)
|
||||
|
||||
# Alert duration
|
||||
ALERTS_FOR_STATE{alertstate="firing"}
|
||||
|
||||
# Alerts per severity
|
||||
sum(ALERTS{alertstate="firing"}) by (severity)
|
||||
|
||||
# Time to acknowledge (from PagerDuty/etc)
|
||||
pagerduty_incident_ack_duration_seconds
|
||||
```
|
||||
|
||||
### Alert Quality Metrics
|
||||
- **Mean Time to Acknowledge (MTTA)**: < 5 minutes
|
||||
- **Mean Time to Resolve (MTTR)**: < 30 minutes
|
||||
- **False Positive Rate**: < 10%
|
||||
- **Alert Coverage**: % of incidents with preceding alert > 80%
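MTTA and MTTR can be computed directly from incident records exported from your paging tool; a minimal sketch (field names and data are illustrative):

```python
from datetime import datetime
from statistics import mean

incidents = [
    {"triggered": "2024-10-15T10:00:00", "acked": "2024-10-15T10:03:00", "resolved": "2024-10-15T10:23:00"},
    {"triggered": "2024-09-30T14:00:00", "acked": "2024-09-30T14:06:00", "resolved": "2024-09-30T14:45:00"},
]

def minutes_between(start: str, end: str) -> float:
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() / 60

mtta = mean(minutes_between(i["triggered"], i["acked"]) for i in incidents)
mttr = mean(minutes_between(i["triggered"], i["resolved"]) for i in incidents)
print(f"MTTA: {mtta:.1f} min (target < 5), MTTR: {mttr:.1f} min (target < 30)")
```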
|
||||
649
references/datadog_migration.md
Normal file
@@ -0,0 +1,649 @@
|
||||
# Migrating from Datadog to Open-Source Stack
|
||||
|
||||
## Overview
|
||||
|
||||
This guide helps you migrate from Datadog to a cost-effective open-source observability stack:
|
||||
- **Metrics**: Datadog → Prometheus + Grafana
|
||||
- **Logs**: Datadog → Loki + Grafana
|
||||
- **Traces**: Datadog APM → Tempo/Jaeger + Grafana
|
||||
- **Dashboards**: Datadog → Grafana
|
||||
- **Alerts**: Datadog Monitors → Prometheus Alertmanager
|
||||
|
||||
**Estimated Cost Savings**: 60-80% for similar functionality
|
||||
|
||||
---
|
||||
|
||||
## Cost Comparison
|
||||
|
||||
### Example: 100-host infrastructure
|
||||
|
||||
**Datadog**:
|
||||
- Infrastructure Pro: $1,500/month (100 hosts × $15)
|
||||
- Custom Metrics: $50/month (5,000 extra metrics beyond included 10,000)
|
||||
- Logs: $2,000/month (20GB/day × $0.10/GB × 30 days)
|
||||
- APM: $3,100/month (100 hosts × $31)
|
||||
- **Total**: ~$6,650/month ($79,800/year)
|
||||
|
||||
**Open-Source Stack** (self-hosted):
|
||||
- Infrastructure: $1,200/month (EC2/GKE for Prometheus, Grafana, Loki, Tempo)
|
||||
- Storage: $300/month (S3/GCS for long-term metrics and traces)
|
||||
- Operations time: Variable
|
||||
- **Total**: ~$1,500-2,500/month ($18,000-30,000/year)
|
||||
|
||||
**Savings**: $49,800-61,800/year
|
||||
|
||||
---
|
||||
|
||||
## Migration Strategy
|
||||
|
||||
### Phase 1: Run Parallel (Month 1-2)
|
||||
- Deploy open-source stack alongside Datadog
|
||||
- Migrate metrics first (lowest risk)
|
||||
- Validate data accuracy
|
||||
- Build confidence
|
||||
|
||||
### Phase 2: Migrate Dashboards & Alerts (Month 2-3)
|
||||
- Convert Datadog dashboards to Grafana
|
||||
- Translate alert rules
|
||||
- Train team on new tools
|
||||
|
||||
### Phase 3: Migrate Logs & Traces (Month 3-4)
|
||||
- Set up Loki for log aggregation
|
||||
- Deploy Tempo/Jaeger for tracing
|
||||
- Update application instrumentation
|
||||
|
||||
### Phase 4: Decommission Datadog (Month 4-5)
|
||||
- Confirm all functionality migrated
|
||||
- Cancel Datadog subscription
|
||||
- Archive Datadog dashboards/alerts for reference
|
||||
|
||||
---
|
||||
|
||||
## 1. Metrics Migration (Datadog → Prometheus)
|
||||
|
||||
### Step 1: Deploy Prometheus
|
||||
|
||||
**Kubernetes** (recommended):
|
||||
```yaml
|
||||
# prometheus-values.yaml
|
||||
prometheus:
|
||||
prometheusSpec:
|
||||
retention: 30d
|
||||
storageSpec:
|
||||
volumeClaimTemplate:
|
||||
spec:
|
||||
resources:
|
||||
requests:
|
||||
storage: 100Gi
|
||||
|
||||
# Scrape configs
|
||||
additionalScrapeConfigs:
|
||||
- job_name: 'kubernetes-pods'
|
||||
kubernetes_sd_configs:
|
||||
- role: pod
|
||||
```
|
||||
|
||||
**Install**:
|
||||
```bash
|
||||
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
|
||||
helm install prometheus prometheus-community/kube-prometheus-stack -f prometheus-values.yaml
|
||||
```
|
||||
|
||||
**Docker Compose**:
|
||||
```yaml
|
||||
version: '3'
|
||||
services:
|
||||
prometheus:
|
||||
image: prom/prometheus:latest
|
||||
ports:
|
||||
- "9090:9090"
|
||||
volumes:
|
||||
- ./prometheus.yml:/etc/prometheus/prometheus.yml
|
||||
- prometheus-data:/prometheus
|
||||
command:
|
||||
- '--config.file=/etc/prometheus/prometheus.yml'
|
||||
- '--storage.tsdb.retention.time=30d'
|
||||
|
||||
volumes:
|
||||
prometheus-data:
|
||||
```
|
||||
|
||||
### Step 2: Replace DogStatsD with Prometheus Exporters
|
||||
|
||||
**Before (DogStatsD)**:
|
||||
```python
|
||||
from datadog import statsd
|
||||
|
||||
statsd.increment('page.views')
|
||||
statsd.histogram('request.duration', 0.5)
|
||||
statsd.gauge('active_users', 100)
|
||||
```
|
||||
|
||||
**After (Prometheus Python client)**:
|
||||
```python
|
||||
from prometheus_client import Counter, Histogram, Gauge
|
||||
|
||||
page_views = Counter('page_views_total', 'Page views')
|
||||
request_duration = Histogram('request_duration_seconds', 'Request duration')
|
||||
active_users = Gauge('active_users', 'Active users')
|
||||
|
||||
# Usage
|
||||
page_views.inc()
|
||||
request_duration.observe(0.5)
|
||||
active_users.set(100)
|
||||
```
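One operational difference to keep in mind: DogStatsD pushes metrics to the local agent, while the Prometheus client is pull-based, so the process must also expose an HTTP endpoint for Prometheus to scrape. A minimal sketch (the port is illustrative and must match your scrape config):

```python
import time

from prometheus_client import Counter, start_http_server

page_views = Counter('page_views_total', 'Page views')

if __name__ == "__main__":
    start_http_server(8000)   # serves /metrics on :8000 for Prometheus to scrape
    while True:
        page_views.inc()       # stand-in for real application work
        time.sleep(1)
```

Web frameworks can instead mount the client's WSGI/ASGI metrics app under `/metrics`.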
|
||||
|
||||
### Step 3: Metric Name Translation
|
||||
|
||||
| Datadog Metric | Prometheus Equivalent |
|
||||
|----------------|----------------------|
|
||||
| `system.cpu.idle` | `node_cpu_seconds_total{mode="idle"}` |
|
||||
| `system.mem.free` | `node_memory_MemFree_bytes` |
|
||||
| `system.disk.used` | `node_filesystem_size_bytes - node_filesystem_free_bytes` |
|
||||
| `docker.cpu.usage` | `container_cpu_usage_seconds_total` |
|
||||
| `kubernetes.pods.running` | `kube_pod_status_phase{phase="Running"}` |
|
||||
|
||||
### Step 4: Export Existing Datadog Metrics (Optional)
|
||||
|
||||
Use Datadog API to export historical data:
|
||||
|
||||
```python
|
||||
import time

from datadog import api, initialize
|
||||
|
||||
options = {
|
||||
'api_key': 'YOUR_API_KEY',
|
||||
'app_key': 'YOUR_APP_KEY'
|
||||
}
|
||||
initialize(**options)
|
||||
|
||||
# Query metric
|
||||
result = api.Metric.query(
|
||||
start=int(time.time() - 86400), # Last 24h
|
||||
end=int(time.time()),
|
||||
query='avg:system.cpu.user{*}'
|
||||
)
|
||||
|
||||
# Convert to Prometheus format and import
|
||||
```
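If you do want to backfill history, one route is to write the exported points as OpenMetrics text and load them with `promtool tsdb create-blocks-from openmetrics`. A sketch continuing from `result` above, assuming the usual Datadog `series`/`pointlist` response layout (timestamps in milliseconds):

```python
def to_openmetrics(result: dict, metric_name: str, path: str) -> None:
    """Write Datadog query results as an OpenMetrics file suitable for promtool backfill."""
    with open(path, "w") as f:
        f.write(f"# TYPE {metric_name} gauge\n")
        for series in result.get("series", []):
            for ts_ms, value in series.get("pointlist", []):
                if value is None:
                    continue
                f.write(f"{metric_name} {value} {int(ts_ms / 1000)}\n")
        f.write("# EOF\n")

to_openmetrics(result, "datadog_system_cpu_user", "cpu_user.om")
# Then: promtool tsdb create-blocks-from openmetrics cpu_user.om ./data
```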
|
||||
|
||||
---
|
||||
|
||||
## 2. Dashboard Migration (Datadog → Grafana)
|
||||
|
||||
### Step 1: Export Datadog Dashboards
|
||||
|
||||
```python
|
||||
import requests
|
||||
import json
|
||||
|
||||
api_key = "YOUR_API_KEY"
|
||||
app_key = "YOUR_APP_KEY"
|
||||
|
||||
headers = {
|
||||
'DD-API-KEY': api_key,
|
||||
'DD-APPLICATION-KEY': app_key
|
||||
}
|
||||
|
||||
# Get all dashboards
|
||||
response = requests.get(
|
||||
'https://api.datadoghq.com/api/v1/dashboard',
|
||||
headers=headers
|
||||
)
|
||||
|
||||
dashboards = response.json()
|
||||
|
||||
# Export each dashboard
|
||||
for dashboard in dashboards['dashboards']:
|
||||
dash_id = dashboard['id']
|
||||
detail = requests.get(
|
||||
f'https://api.datadoghq.com/api/v1/dashboard/{dash_id}',
|
||||
headers=headers
|
||||
).json()
|
||||
|
||||
with open(f'datadog_{dash_id}.json', 'w') as f:
|
||||
json.dump(detail, f, indent=2)
|
||||
```
|
||||
|
||||
### Step 2: Convert to Grafana Format
|
||||
|
||||
**Manual Conversion Template**:
|
||||
|
||||
| Datadog Widget | Grafana Panel Type |
|
||||
|----------------|-------------------|
|
||||
| Timeseries | Graph / Time series |
|
||||
| Query Value | Stat |
|
||||
| Toplist | Table / Bar gauge |
|
||||
| Heatmap | Heatmap |
|
||||
| Distribution | Histogram |
|
||||
|
||||
**Automated Conversion** (basic example):
|
||||
```python
|
||||
def convert_datadog_to_grafana(datadog_dashboard):
|
||||
grafana_dashboard = {
|
||||
"title": datadog_dashboard['title'],
|
||||
"panels": []
|
||||
}
|
||||
|
||||
for widget in datadog_dashboard['widgets']:
|
||||
panel = {
|
||||
"title": widget['definition'].get('title', ''),
|
||||
"type": map_widget_type(widget['definition']['type']),
|
||||
"targets": convert_queries(widget['definition']['requests'])
|
||||
}
|
||||
grafana_dashboard['panels'].append(panel)
|
||||
|
||||
return grafana_dashboard
|
||||
```
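The example above leans on two helpers that are not shown; minimal sketches follow (panel-type names follow current Grafana conventions, and the query conversion is deliberately a stub to be replaced with the translations in `dql_promql_translation.md`):

```python
WIDGET_TO_PANEL = {
    "timeseries": "timeseries",
    "query_value": "stat",
    "toplist": "bargauge",
    "heatmap": "heatmap",
    "distribution": "histogram",
}

def map_widget_type(widget_type: str) -> str:
    """Map a Datadog widget type to the closest Grafana panel type."""
    return WIDGET_TO_PANEL.get(widget_type, "timeseries")

def convert_queries(requests_block) -> list:
    """Placeholder: carry the raw Datadog queries across; translate them to PromQL by hand."""
    targets = []
    for req in requests_block:
        for query in req.get("queries", []):
            targets.append({"expr": query.get("query", ""), "legendFormat": query.get("name", "")})
    return targets
```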
|
||||
|
||||
### Step 3: Common Query Translations
|
||||
|
||||
See `dql_promql_translation.md` for comprehensive query mappings.
|
||||
|
||||
**Example conversions**:
|
||||
|
||||
```
|
||||
Datadog: avg:system.cpu.user{*}
|
||||
Prometheus: avg(rate(node_cpu_seconds_total{mode="user"}[5m])) * 100
|
||||
|
||||
Datadog: sum:requests.count{status:200}.as_rate()
|
||||
Prometheus: sum(rate(http_requests_total{status="200"}[5m]))
|
||||
|
||||
Datadog: p95:request.duration{*}
|
||||
Prometheus: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 3. Alert Migration (Datadog Monitors → Prometheus Alerts)
|
||||
|
||||
### Step 1: Export Datadog Monitors
|
||||
|
||||
```python
|
||||
import json
import requests
|
||||
|
||||
api_key = "YOUR_API_KEY"
|
||||
app_key = "YOUR_APP_KEY"
|
||||
|
||||
headers = {
|
||||
'DD-API-KEY': api_key,
|
||||
'DD-APPLICATION-KEY': app_key
|
||||
}
|
||||
|
||||
response = requests.get(
|
||||
'https://api.datadoghq.com/api/v1/monitor',
|
||||
headers=headers
|
||||
)
|
||||
|
||||
monitors = response.json()
|
||||
|
||||
# Save each monitor
|
||||
for monitor in monitors:
|
||||
with open(f'monitor_{monitor["id"]}.json', 'w') as f:
|
||||
json.dump(monitor, f, indent=2)
|
||||
```
|
||||
|
||||
### Step 2: Convert to Prometheus Alert Rules
|
||||
|
||||
**Datadog Monitor**:
|
||||
```json
|
||||
{
|
||||
"name": "High CPU Usage",
|
||||
"type": "metric alert",
|
||||
"query": "avg(last_5m):avg:system.cpu.user{*} > 80",
|
||||
"message": "CPU usage is high on {{host.name}}"
|
||||
}
|
||||
```
|
||||
|
||||
**Prometheus Alert**:
|
||||
```yaml
|
||||
groups:
|
||||
- name: infrastructure
|
||||
rules:
|
||||
- alert: HighCPUUsage
|
||||
expr: |
|
||||
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "High CPU usage on {{ $labels.instance }}"
|
||||
description: "CPU usage is {{ $value }}%"
|
||||
```
|
||||
|
||||
### Step 3: Alert Routing (Datadog → Alertmanager)
|
||||
|
||||
**Datadog notification channels** → **Alertmanager receivers**
|
||||
|
||||
```yaml
|
||||
# alertmanager.yml
|
||||
route:
|
||||
group_by: ['alertname', 'severity']
|
||||
receiver: 'slack-notifications'
|
||||
|
||||
receivers:
|
||||
- name: 'slack-notifications'
|
||||
slack_configs:
|
||||
- api_url: 'YOUR_SLACK_WEBHOOK'
|
||||
channel: '#alerts'
|
||||
|
||||
- name: 'pagerduty-critical'
|
||||
pagerduty_configs:
|
||||
- service_key: 'YOUR_PAGERDUTY_KEY'
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4. Log Migration (Datadog → Loki)
|
||||
|
||||
### Step 1: Deploy Loki
|
||||
|
||||
**Kubernetes**:
|
||||
```bash
|
||||
helm repo add grafana https://grafana.github.io/helm-charts
|
||||
helm install loki grafana/loki-stack \
|
||||
--set loki.persistence.enabled=true \
|
||||
--set loki.persistence.size=100Gi \
|
||||
--set promtail.enabled=true
|
||||
```
|
||||
|
||||
**Docker Compose**:
|
||||
```yaml
|
||||
version: '3'
|
||||
services:
|
||||
loki:
|
||||
image: grafana/loki:latest
|
||||
ports:
|
||||
- "3100:3100"
|
||||
volumes:
|
||||
- ./loki-config.yaml:/etc/loki/local-config.yaml
|
||||
- loki-data:/loki
|
||||
|
||||
promtail:
|
||||
image: grafana/promtail:latest
|
||||
volumes:
|
||||
- /var/log:/var/log
|
||||
- ./promtail-config.yaml:/etc/promtail/config.yml
|
||||
|
||||
volumes:
|
||||
loki-data:
|
||||
```
|
||||
|
||||
### Step 2: Replace Datadog Log Forwarder
|
||||
|
||||
**Before (Datadog Agent)**:
|
||||
```yaml
|
||||
# datadog.yaml
|
||||
logs_enabled: true
|
||||
|
||||
logs_config:
|
||||
container_collect_all: true
|
||||
```
|
||||
|
||||
**After (Promtail)**:
|
||||
```yaml
|
||||
# promtail-config.yaml
|
||||
server:
|
||||
http_listen_port: 9080
|
||||
|
||||
positions:
|
||||
filename: /tmp/positions.yaml
|
||||
|
||||
clients:
|
||||
- url: http://loki:3100/loki/api/v1/push
|
||||
|
||||
scrape_configs:
|
||||
- job_name: system
|
||||
static_configs:
|
||||
- targets:
|
||||
- localhost
|
||||
labels:
|
||||
job: varlogs
|
||||
__path__: /var/log/*.log
|
||||
```
|
||||
|
||||
### Step 3: Query Translation
|
||||
|
||||
**Datadog Logs Query**:
|
||||
```
|
||||
service:my-app status:error
|
||||
```
|
||||
|
||||
**Loki LogQL**:
|
||||
```logql
|
||||
{job="my-app", level="error"}
|
||||
```
|
||||
|
||||
**More examples**:
|
||||
```
|
||||
Datadog: service:api-gateway status:error @http.status_code:>=500
|
||||
Loki: {service="api-gateway", level="error"} | json | http_status_code >= 500
|
||||
|
||||
Datadog: source:nginx "404"
|
||||
Loki: {source="nginx"} |= "404"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 5. APM Migration (Datadog APM → Tempo/Jaeger)
|
||||
|
||||
### Step 1: Choose Tracing Backend
|
||||
|
||||
- **Tempo**: Better for high volume, cheaper storage (object storage)
|
||||
- **Jaeger**: More mature, better UI, requires separate storage
|
||||
|
||||
### Step 2: Replace Datadog Tracer with OpenTelemetry
|
||||
|
||||
**Before (Datadog Python)**:
|
||||
```python
|
||||
from ddtrace import tracer
|
||||
|
||||
@tracer.wrap()
|
||||
def my_function():
|
||||
pass
|
||||
```
|
||||
|
||||
**After (OpenTelemetry)**:
|
||||
```python
|
||||
from opentelemetry import trace
|
||||
from opentelemetry.sdk.trace import TracerProvider
|
||||
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
|
||||
|
||||
# Setup
|
||||
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="tempo:4317")))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
|
||||
|
||||
@tracer.start_as_current_span("my_function")
|
||||
def my_function():
|
||||
pass
|
||||
```
|
||||
|
||||
### Step 3: Deploy Tempo
|
||||
|
||||
```yaml
|
||||
# tempo.yaml
|
||||
server:
|
||||
http_listen_port: 3200
|
||||
|
||||
distributor:
|
||||
receivers:
|
||||
otlp:
|
||||
protocols:
|
||||
grpc:
|
||||
endpoint: 0.0.0.0:4317
|
||||
|
||||
storage:
|
||||
trace:
|
||||
backend: s3
|
||||
s3:
|
||||
bucket: tempo-traces
|
||||
endpoint: s3.amazonaws.com
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 6. Infrastructure Migration
|
||||
|
||||
### Recommended Architecture
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Grafana (Visualization) │
|
||||
│ - Dashboards │
|
||||
│ - Unified view │
|
||||
└─────────────────────────────────────────┘
|
||||
↓ ↓ ↓
|
||||
┌──────────────┐ ┌──────────┐ ┌──────────┐
|
||||
│ Prometheus │ │ Loki │ │ Tempo │
|
||||
│ (Metrics) │ │ (Logs) │ │ (Traces) │
|
||||
└──────────────┘ └──────────┘ └──────────┘
|
||||
↓ ↓ ↓
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Applications (OpenTelemetry) │
|
||||
└─────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
### Sizing Recommendations
|
||||
|
||||
**100-host environment**:
|
||||
|
||||
- **Prometheus**: 2-4 CPU, 8-16GB RAM, 100GB SSD
|
||||
- **Grafana**: 1 CPU, 2GB RAM
|
||||
- **Loki**: 2-4 CPU, 8GB RAM, 100GB SSD
|
||||
- **Tempo**: 2-4 CPU, 8GB RAM, S3 for storage
|
||||
- **Alertmanager**: 1 CPU, 1GB RAM
|
||||
|
||||
**Total**: ~8-16 CPU, 32-64GB RAM, 200GB SSD + object storage
|
||||
|
||||
---
|
||||
|
||||
## 7. Migration Checklist
|
||||
|
||||
### Pre-Migration
|
||||
- [ ] Calculate current Datadog costs
|
||||
- [ ] Identify all Datadog integrations
|
||||
- [ ] Export all dashboards
|
||||
- [ ] Export all monitors
|
||||
- [ ] Document custom metrics
|
||||
- [ ] Get stakeholder approval
|
||||
|
||||
### During Migration
|
||||
- [ ] Deploy Prometheus + Grafana
|
||||
- [ ] Deploy Loki + Promtail
|
||||
- [ ] Deploy Tempo/Jaeger (if using APM)
|
||||
- [ ] Migrate metrics instrumentation
|
||||
- [ ] Convert dashboards (top 10 critical first)
|
||||
- [ ] Convert alerts (critical alerts first)
|
||||
- [ ] Update application logging
|
||||
- [ ] Replace APM instrumentation
|
||||
- [ ] Run parallel for 2-4 weeks
|
||||
- [ ] Validate data accuracy
|
||||
- [ ] Train team on new tools
|
||||
|
||||
### Post-Migration
|
||||
- [ ] Decommission Datadog agent from all hosts
|
||||
- [ ] Cancel Datadog subscription
|
||||
- [ ] Archive Datadog configs
|
||||
- [ ] Document new workflows
|
||||
- [ ] Create runbooks for common tasks
|
||||
|
||||
---
|
||||
|
||||
## 8. Common Challenges & Solutions
|
||||
|
||||
### Challenge: Missing Datadog Features
|
||||
|
||||
**Datadog Synthetic Monitoring**:
|
||||
- Solution: Use **Blackbox Exporter** (Prometheus) or **Grafana Synthetic Monitoring**
|
||||
|
||||
**Datadog Network Performance Monitoring**:
|
||||
- Solution: Use **Cilium Hubble** (Kubernetes) or **eBPF-based tools**
|
||||
|
||||
**Datadog RUM (Real User Monitoring)**:
|
||||
- Solution: Use **Grafana Faro** or **OpenTelemetry Browser SDK**
|
||||
|
||||
### Challenge: Team Learning Curve
|
||||
|
||||
**Solution**:
|
||||
- Provide training sessions (2-3 hours per tool)
|
||||
- Create internal documentation with examples
|
||||
- Set up sandbox environment for practice
|
||||
- Assign champions for each tool
|
||||
|
||||
### Challenge: Query Performance
|
||||
|
||||
**Prometheus too slow**:
|
||||
- Use **Thanos** or **Cortex** for scaling
|
||||
- Implement recording rules for expensive queries
|
||||
- Increase retention only where needed
|
||||
|
||||
**Loki too slow**:
|
||||
- Add more labels for better filtering
|
||||
- Use chunk caching
|
||||
- Consider **parallel query execution**
|
||||
|
||||
---
|
||||
|
||||
## 9. Maintenance Comparison
|
||||
|
||||
### Datadog (Managed)
|
||||
- **Ops burden**: Low (fully managed)
|
||||
- **Upgrades**: Automatic
|
||||
- **Scaling**: Automatic
|
||||
- **Cost**: High ($6k-10k+/month)
|
||||
|
||||
### Open-Source Stack (Self-hosted)
|
||||
- **Ops burden**: Medium (requires ops team)
|
||||
- **Upgrades**: Manual (quarterly)
|
||||
- **Scaling**: Manual planning required
|
||||
- **Cost**: Low ($1.5k-3k/month infrastructure)
|
||||
|
||||
**Hybrid Option**: Use **Grafana Cloud** (managed Prometheus/Loki/Tempo)
|
||||
- Cost: ~$3k/month for 100 hosts
|
||||
- Ops burden: Low
|
||||
- Savings: ~50% vs Datadog
|
||||
|
||||
---
|
||||
|
||||
## 10. ROI Calculation
|
||||
|
||||
### Example Scenario
|
||||
|
||||
**Before (Datadog)**:
|
||||
- Monthly cost: $7,000
|
||||
- Annual cost: $84,000
|
||||
|
||||
**After (Self-hosted OSS)**:
|
||||
- Infrastructure: $1,800/month
|
||||
- Operations (0.5 FTE): $4,000/month
|
||||
- Annual cost: $69,600
|
||||
|
||||
**Savings**: $14,400/year
|
||||
|
||||
**After (Grafana Cloud)**:
|
||||
- Monthly cost: $3,500
|
||||
- Annual cost: $42,000
|
||||
|
||||
**Savings**: $42,000/year (50%)
|
||||
|
||||
**Break-even**: Immediate (no migration costs beyond engineering time)
|
||||
|
||||
---
|
||||
|
||||
## Resources
|
||||
|
||||
- **Prometheus**: https://prometheus.io/docs/
|
||||
- **Grafana**: https://grafana.com/docs/
|
||||
- **Loki**: https://grafana.com/docs/loki/
|
||||
- **Tempo**: https://grafana.com/docs/tempo/
|
||||
- **OpenTelemetry**: https://opentelemetry.io/
|
||||
- **Migration Tools**: https://github.com/grafana/dashboard-linter
|
||||
|
||||
---
|
||||
|
||||
## Support
|
||||
|
||||
If you need help with migration:
|
||||
- Grafana Labs offers migration consulting
|
||||
- Many SRE consulting firms specialize in this
|
||||
- Community support via Slack/Discord channels
|
||||
756
references/dql_promql_translation.md
Normal file
@@ -0,0 +1,756 @@
|
||||
# DQL (Datadog Query Language) ↔ PromQL Translation Guide
|
||||
|
||||
## Quick Reference
|
||||
|
||||
| Concept | Datadog (DQL) | Prometheus (PromQL) |
|
||||
|---------|---------------|---------------------|
|
||||
| Aggregation | `avg:`, `sum:`, `min:`, `max:` | `avg()`, `sum()`, `min()`, `max()` |
|
||||
| Rate | `.as_rate()`, `.as_count()` | `rate()`, `increase()` |
|
||||
| Percentile | `p50:`, `p95:`, `p99:` | `histogram_quantile()` |
|
||||
| Filtering | `{tag:value}` | `{label="value"}` |
|
||||
| Time window | `last_5m`, `last_1h` | `[5m]`, `[1h]` |
|
||||
|
||||
---
|
||||
|
||||
## Basic Queries
|
||||
|
||||
### Simple Metric Query
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
system.cpu.user
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
node_cpu_seconds_total{mode="user"}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Metric with Filter
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
system.cpu.user{host:web-01}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
node_cpu_seconds_total{mode="user", instance="web-01"}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Multiple Filters (AND)
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
system.cpu.user{host:web-01,env:production}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
node_cpu_seconds_total{mode="user", instance="web-01", env="production"}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Wildcard Filters
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
system.cpu.user{host:web-*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
node_cpu_seconds_total{mode="user", instance=~"web-.*"}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### OR Filters
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
system.cpu.user{host:web-01 OR host:web-02}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
node_cpu_seconds_total{mode="user", instance=~"web-01|web-02"}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Aggregations
|
||||
|
||||
### Average
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg:system.cpu.user{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg(node_cpu_seconds_total{mode="user"})
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Sum
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:requests.count{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum(http_requests_total)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Min/Max
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
min:system.mem.free{*}
|
||||
max:system.mem.free{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
min(node_memory_MemFree_bytes)
|
||||
max(node_memory_MemFree_bytes)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Aggregation by Tag/Label
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg:system.cpu.user{*} by {host}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg by (instance) (node_cpu_seconds_total{mode="user"})
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Rates and Counts
|
||||
|
||||
### Rate (per second)
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:requests.count{*}.as_rate()
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum(rate(http_requests_total[5m]))
|
||||
```
|
||||
|
||||
Note: Prometheus requires explicit time window `[5m]`
|
||||
|
||||
---
|
||||
|
||||
### Count (total over time)
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:requests.count{*}.as_count()
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum(increase(http_requests_total[1h]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Derivative (change over time)
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
derivative(avg:system.disk.used{*})
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
deriv(node_filesystem_size_bytes[5m])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Percentiles
|
||||
|
||||
### P50 (Median)
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
p50:request.duration{*}
|
||||
```
|
||||
|
||||
**Prometheus** (requires histogram):
|
||||
```promql
|
||||
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### P95
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
p95:request.duration{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### P99
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
p99:request.duration{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Time Windows
|
||||
|
||||
### Last 5 minutes
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg(last_5m):system.cpu.user{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg_over_time(node_cpu_seconds_total{mode="user"}[5m])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Last 1 hour
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg(last_1h):system.cpu.user{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg_over_time(node_cpu_seconds_total{mode="user"}[1h])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Math Operations
|
||||
|
||||
### Division
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg:system.mem.used{*} / avg:system.mem.total{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Multiplication
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg:system.cpu.user{*} * 100
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg(node_cpu_seconds_total{mode="user"}) * 100
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Percentage Calculation
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
(sum:requests.errors{*} / sum:requests.count{*}) * 100
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Common Use Cases
|
||||
|
||||
### CPU Usage Percentage
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
100 - avg:system.cpu.idle{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Memory Usage Percentage
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
(avg:system.mem.used{*} / avg:system.mem.total{*}) * 100
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Disk Usage Percentage
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
(avg:system.disk.used{*} / avg:system.disk.total{*}) * 100
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Request Rate (requests/sec)
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:requests.count{*}.as_rate()
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum(rate(http_requests_total[5m]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Error Rate Percentage
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
(sum:requests.errors{*}.as_rate() / sum:requests.count{*}.as_rate()) * 100
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Request Latency (P95)
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
p95:request.duration{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Top 5 Hosts by CPU
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
top(avg:system.cpu.user{*} by {host}, 5, 'mean', 'desc')
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
topk(5, avg by (instance) (rate(node_cpu_seconds_total{mode="user"}[5m])))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Functions
|
||||
|
||||
### Absolute Value
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
abs(diff(avg:system.cpu.user{*}))
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
abs(delta(node_cpu_seconds_total{mode="user"}[5m]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Ceiling/Floor
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
ceil(avg:system.cpu.user{*})
|
||||
floor(avg:system.cpu.user{*})
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
ceil(avg(node_cpu_seconds_total{mode="user"}))
|
||||
floor(avg(node_cpu_seconds_total{mode="user"}))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Clamp (Limit Range)
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
clamp_min(avg:system.cpu.user{*}, 0)
|
||||
clamp_max(avg:system.cpu.user{*}, 100)
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
clamp_min(avg(node_cpu_seconds_total{mode="user"}), 0)
|
||||
clamp_max(avg(node_cpu_seconds_total{mode="user"}), 100)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Moving Average
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
moving_rollup(avg:system.cpu.user{*}, 60, 'avg')
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg_over_time(node_cpu_seconds_total{mode="user"}[1h])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Advanced Patterns
|
||||
|
||||
### Compare to Previous Period
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:requests.count{*}.as_rate() / timeshift(sum:requests.count{*}.as_rate(), -3600)
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum(rate(http_requests_total[5m])) / sum(rate(http_requests_total[5m] offset 1h))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Forecast
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
forecast(avg:system.disk.used{*}, 'linear', 1)
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
# Forecast free space one hour ahead (the Datadog example forecasts used space)
predict_linear(node_filesystem_avail_bytes[1h], 3600)
|
||||
```
|
||||
|
||||
Note: Predicts the value 1 hour in the future based on the trend over the last hour
|
||||
|
||||
---
|
||||
|
||||
### Anomaly Detection
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
anomalies(avg:system.cpu.user{*}, 'basic', 2)
|
||||
```
|
||||
|
||||
**Prometheus**: No built-in function
|
||||
- Use recording rules with stddev
|
||||
- External tools like **Robust Perception's anomaly detector**
|
||||
- Or use **Grafana ML** plugin
|
||||
|
||||
---
|
||||
|
||||
### Outlier Detection
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
outliers(avg:system.cpu.user{*} by {host}, 'mad')
|
||||
```
|
||||
|
||||
**Prometheus**: No built-in function
|
||||
- Calculate manually with stddev:
|
||||
```promql
|
||||
abs(metric - scalar(avg(metric))) > 2 * scalar(stddev(metric))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Container & Kubernetes
|
||||
|
||||
### Container CPU Usage
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg:docker.cpu.usage{*} by {container_name}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg by (container) (rate(container_cpu_usage_seconds_total[5m]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Container Memory Usage
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg:docker.mem.rss{*} by {container_name}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg by (container) (container_memory_rss)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Pod Count by Status
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:kubernetes.pods.running{*} by {kube_namespace}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum by (namespace) (kube_pod_status_phase{phase="Running"})
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Database Queries
|
||||
|
||||
### MySQL Queries Per Second
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:mysql.performance.queries{*}.as_rate()
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum(rate(mysql_global_status_queries[5m]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### PostgreSQL Active Connections
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg:postgresql.connections{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg(pg_stat_database_numbackends)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Redis Memory Usage
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg:redis.mem.used{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg(redis_memory_used_bytes)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Network Metrics
|
||||
|
||||
### Network Bytes Sent
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:system.net.bytes_sent{*}.as_rate()
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum(rate(node_network_transmit_bytes_total[5m]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Network Bytes Received
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:system.net.bytes_rcvd{*}.as_rate()
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum(rate(node_network_receive_bytes_total[5m]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Key Differences
|
||||
|
||||
### 1. Time Windows
|
||||
- **Datadog**: Optional, defaults to query time range
|
||||
- **Prometheus**: Always required for rate/increase functions
|
||||
|
||||
### 2. Histograms
|
||||
- **Datadog**: Percentiles available directly
|
||||
- **Prometheus**: Requires histogram buckets + `histogram_quantile()`
|
||||
|
||||
### 3. Default Aggregation
|
||||
- **Datadog**: No default, must specify
|
||||
- **Prometheus**: Returns all time series unless aggregated
|
||||
|
||||
### 4. Metric Types
|
||||
- **Datadog**: All metrics treated similarly
|
||||
- **Prometheus**: Explicit types (counter, gauge, histogram, summary)
|
||||
|
||||
### 5. Tag vs Label
|
||||
- **Datadog**: Uses "tags" (key:value)
|
||||
- **Prometheus**: Uses "labels" (key="value")
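Of these, difference #2 is the one that bites hardest during a migration: `histogram_quantile()` only works if the application actually exports bucketed histograms. A minimal Python sketch using the official `prometheus_client` library (the metric name matches the examples above; the bucket boundaries are an assumption to tune around your own SLO thresholds):

```python
from prometheus_client import Histogram, start_http_server
import random
import time

# Bucket boundaries are an assumption -- pick values around your latency targets.
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration in seconds",
    ["method", "endpoint"],
    buckets=[0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
)

def handle_request():
    # .time() records the elapsed wall-clock time into the histogram buckets
    with REQUEST_DURATION.labels(method="GET", endpoint="/api/users").time():
        time.sleep(random.uniform(0.01, 0.3))  # simulated work

start_http_server(8000)  # exposes /metrics for Prometheus to scrape
handle_request()
```

With buckets exported, the `histogram_quantile(0.95, ...)` queries shown earlier work as written.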
|
||||
|
||||
---
|
||||
|
||||
## Migration Tips
|
||||
|
||||
1. **Start with dashboards**: Convert most-used dashboards first
|
||||
2. **Use recording rules**: Pre-calculate expensive PromQL queries
|
||||
3. **Test in parallel**: Run both systems during migration
|
||||
4. **Document mappings**: Create team-specific translation guide
|
||||
5. **Train team**: PromQL has a learning curve; invest in training
|
||||
|
||||
---
|
||||
|
||||
## Tools
|
||||
|
||||
- **Datadog Dashboard Exporter**: Export JSON dashboards
|
||||
- **Grafana Dashboard Linter**: Validate converted dashboards
|
||||
- **PromQL Learning Resources**: https://prometheus.io/docs/prometheus/latest/querying/basics/
|
||||
|
||||
---
|
||||
|
||||
## Common Gotchas
|
||||
|
||||
### Rate without Time Window
|
||||
|
||||
❌ **Wrong**:
|
||||
```promql
|
||||
rate(http_requests_total)
|
||||
```
|
||||
|
||||
✅ **Correct**:
|
||||
```promql
|
||||
rate(http_requests_total[5m])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Aggregating Before Rate
|
||||
|
||||
❌ **Wrong**:
|
||||
```promql
|
||||
rate(sum(http_requests_total)[5m])
|
||||
```
|
||||
|
||||
✅ **Correct**:
|
||||
```promql
|
||||
sum(rate(http_requests_total[5m]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Histogram Quantile Without by (le)
|
||||
|
||||
❌ **Wrong**:
|
||||
```promql
|
||||
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
|
||||
```
|
||||
|
||||
✅ **Correct**:
|
||||
```promql
|
||||
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Quick Conversion Checklist
|
||||
|
||||
When converting a Datadog query to PromQL:
|
||||
|
||||
- [ ] Replace metric name (e.g., `system.cpu.user` → `node_cpu_seconds_total`)
|
||||
- [ ] Convert tags to labels (`{tag:value}` → `{label="value"}`)
|
||||
- [ ] Add time window for rate/increase (`[5m]`)
|
||||
- [ ] Change aggregation syntax (`avg:` → `avg()`)
|
||||
- [ ] Convert percentiles to histogram_quantile if needed
|
||||
- [ ] Test query in Prometheus before adding to dashboard
|
||||
- [ ] Add `by (label)` for grouped aggregations
|
||||
|
||||
---
|
||||
|
||||
## Need More Help?
|
||||
|
||||
- See `datadog_migration.md` for full migration guide
|
||||
- PromQL documentation: https://prometheus.io/docs/prometheus/latest/querying/
|
||||
- Practice at: https://demo.promlens.com/
|
||||
775
references/logging_guide.md
Normal file
@@ -0,0 +1,775 @@
|
||||
# Logging Guide
|
||||
|
||||
## Structured Logging
|
||||
|
||||
### Why Structured Logs?
|
||||
|
||||
**Unstructured** (text):
|
||||
```
|
||||
2024-10-28 14:32:15 User john@example.com logged in from 192.168.1.1
|
||||
```
|
||||
|
||||
**Structured** (JSON):
|
||||
```json
|
||||
{
|
||||
"timestamp": "2024-10-28T14:32:15Z",
|
||||
"level": "info",
|
||||
"message": "User logged in",
|
||||
"user": "john@example.com",
|
||||
"ip": "192.168.1.1",
|
||||
"event_type": "user_login"
|
||||
}
|
||||
```
|
||||
|
||||
**Benefits**:
|
||||
- Easy to parse and query
|
||||
- Consistent format
|
||||
- Machine-readable
|
||||
- Efficient storage and indexing
|
||||
|
||||
---
|
||||
|
||||
## Log Levels
|
||||
|
||||
Use appropriate log levels for better filtering and alerting.
|
||||
|
||||
### DEBUG
|
||||
**When**: Development, troubleshooting
|
||||
**Examples**:
|
||||
- Function entry/exit
|
||||
- Variable values
|
||||
- Internal state changes
|
||||
|
||||
```python
|
||||
logger.debug("Processing request", extra={
|
||||
"request_id": req_id,
|
||||
"params": params
|
||||
})
|
||||
```
|
||||
|
||||
### INFO
|
||||
**When**: Important business events
|
||||
**Examples**:
|
||||
- User actions (login, purchase)
|
||||
- System state changes (started, stopped)
|
||||
- Significant milestones
|
||||
|
||||
```python
|
||||
logger.info("Order placed", extra={
|
||||
"order_id": "12345",
|
||||
"user_id": "user123",
|
||||
"amount": 99.99
|
||||
})
|
||||
```
|
||||
|
||||
### WARN
|
||||
**When**: Potentially problematic situations
|
||||
**Examples**:
|
||||
- Deprecated API usage
|
||||
- Slow operations (but not failing)
|
||||
- Retry attempts
|
||||
- Resource usage approaching limits
|
||||
|
||||
```python
|
||||
logger.warning("API response slow", extra={
|
||||
"endpoint": "/api/users",
|
||||
"duration_ms": 2500,
|
||||
"threshold_ms": 1000
|
||||
})
|
||||
```
|
||||
|
||||
### ERROR
|
||||
**When**: Error conditions that need attention
|
||||
**Examples**:
|
||||
- Failed requests
|
||||
- Exceptions caught and handled
|
||||
- Integration failures
|
||||
- Data validation errors
|
||||
|
||||
```python
|
||||
logger.error("Payment processing failed", extra={
|
||||
"order_id": "12345",
|
||||
"error": str(e),
|
||||
"payment_gateway": "stripe"
|
||||
}, exc_info=True)
|
||||
```
|
||||
|
||||
### FATAL/CRITICAL
|
||||
**When**: Severe errors causing shutdown
|
||||
**Examples**:
|
||||
- Database connection lost
|
||||
- Out of memory
|
||||
- Configuration errors preventing startup
|
||||
|
||||
```python
|
||||
logger.critical("Database connection lost", extra={
|
||||
"database": "postgres",
|
||||
"host": "db.example.com",
|
||||
"attempt": 3
|
||||
})
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Required Fields
|
||||
|
||||
Every log entry should include:
|
||||
|
||||
### 1. Timestamp
|
||||
ISO 8601 format with timezone:
|
||||
```json
|
||||
{
|
||||
"timestamp": "2024-10-28T14:32:15.123Z"
|
||||
}
|
||||
```
|
||||
|
||||
### 2. Level
|
||||
Standard levels: debug, info, warn, error, critical
|
||||
```json
|
||||
{
|
||||
"level": "error"
|
||||
}
|
||||
```
|
||||
|
||||
### 3. Message
|
||||
Human-readable description:
|
||||
```json
|
||||
{
|
||||
"message": "User authentication failed"
|
||||
}
|
||||
```
|
||||
|
||||
### 4. Service/Application
|
||||
What component logged this:
|
||||
```json
|
||||
{
|
||||
"service": "api-gateway",
|
||||
"version": "1.2.3"
|
||||
}
|
||||
```
|
||||
|
||||
### 5. Environment
|
||||
```json
|
||||
{
|
||||
"environment": "production"
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Recommended Fields
|
||||
|
||||
### Request Context
|
||||
```json
|
||||
{
|
||||
"request_id": "550e8400-e29b-41d4-a716-446655440000",
|
||||
"user_id": "user123",
|
||||
"session_id": "sess_abc",
|
||||
"ip_address": "192.168.1.1",
|
||||
"user_agent": "Mozilla/5.0..."
|
||||
}
|
||||
```
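The request context is only useful if the same `request_id` ends up on every log line for that request. A minimal sketch using `contextvars` with structlog (the handler function is a placeholder; wire the ID assignment into whatever middleware your framework provides):

```python
import contextvars
import uuid
import structlog

# Holds the current request's ID for the duration of that request
request_id_var = contextvars.ContextVar("request_id", default=None)

def bind_request_id(logger, method_name, event_dict):
    # structlog processor: attach the request_id (if any) to every event
    request_id = request_id_var.get()
    if request_id is not None:
        event_dict["request_id"] = request_id
    return event_dict

structlog.configure(
    processors=[
        bind_request_id,
        structlog.processors.JSONRenderer(),
    ]
)

log = structlog.get_logger()

def handle_request(incoming_request_id=None):
    # Reuse the caller's ID if it was propagated, otherwise mint a new one
    request_id_var.set(incoming_request_id or str(uuid.uuid4()))
    log.info("request_started")    # both lines carry the same request_id
    log.info("request_finished")
```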
|
||||
|
||||
### Performance Metrics
|
||||
```json
|
||||
{
|
||||
"duration_ms": 245,
|
||||
"response_size_bytes": 1024
|
||||
}
|
||||
```
|
||||
|
||||
### Error Details
|
||||
```json
|
||||
{
|
||||
"error_type": "ValidationError",
|
||||
"error_message": "Invalid email format",
|
||||
"stack_trace": "...",
|
||||
"error_code": "VAL_001"
|
||||
}
|
||||
```
|
||||
|
||||
### Business Context
|
||||
```json
|
||||
{
|
||||
"order_id": "ORD-12345",
|
||||
"customer_id": "CUST-789",
|
||||
"transaction_amount": 99.99,
|
||||
"payment_method": "credit_card"
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Implementation Examples
|
||||
|
||||
### Python (using structlog)
|
||||
```python
|
||||
import structlog
|
||||
|
||||
logger = structlog.get_logger()
|
||||
|
||||
# Configure structured logging
|
||||
structlog.configure(
|
||||
processors=[
|
||||
structlog.processors.TimeStamper(fmt="iso"),
|
||||
structlog.processors.add_log_level,
|
||||
structlog.processors.JSONRenderer()
|
||||
]
|
||||
)
|
||||
|
||||
# Usage
|
||||
logger.info(
|
||||
"user_logged_in",
|
||||
user_id="user123",
|
||||
ip_address="192.168.1.1",
|
||||
login_method="oauth"
|
||||
)
|
||||
```
|
||||
|
||||
### Node.js (using Winston)
|
||||
```javascript
|
||||
const winston = require('winston');
|
||||
|
||||
const logger = winston.createLogger({
|
||||
format: winston.format.json(),
|
||||
defaultMeta: { service: 'api-gateway' },
|
||||
transports: [
|
||||
new winston.transports.Console()
|
||||
]
|
||||
});
|
||||
|
||||
logger.info('User logged in', {
|
||||
userId: 'user123',
|
||||
ipAddress: '192.168.1.1',
|
||||
loginMethod: 'oauth'
|
||||
});
|
||||
```
|
||||
|
||||
### Go (using zap)
|
||||
```go
|
||||
import "go.uber.org/zap"
|
||||
|
||||
logger, _ := zap.NewProduction()
|
||||
defer logger.Sync()
|
||||
|
||||
logger.Info("User logged in",
|
||||
zap.String("userId", "user123"),
|
||||
zap.String("ipAddress", "192.168.1.1"),
|
||||
zap.String("loginMethod", "oauth"),
|
||||
)
|
||||
```
|
||||
|
||||
### Java (using Logback with JSON)
|
||||
```java
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
import net.logstash.logback.argument.StructuredArguments;
|
||||
|
||||
Logger logger = LoggerFactory.getLogger(MyClass.class);
|
||||
|
||||
logger.info("User logged in",
|
||||
StructuredArguments.kv("userId", "user123"),
|
||||
StructuredArguments.kv("ipAddress", "192.168.1.1"),
|
||||
StructuredArguments.kv("loginMethod", "oauth")
|
||||
);
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Log Aggregation Patterns
|
||||
|
||||
### Pattern 1: ELK Stack (Elasticsearch, Logstash, Kibana)
|
||||
|
||||
**Architecture**:
|
||||
```
|
||||
Application → Filebeat → Logstash → Elasticsearch → Kibana
|
||||
```
|
||||
|
||||
**filebeat.yml**:
|
||||
```yaml
|
||||
filebeat.inputs:
|
||||
- type: log
|
||||
enabled: true
|
||||
paths:
|
||||
- /var/log/app/*.log
|
||||
json.keys_under_root: true
|
||||
json.add_error_key: true
|
||||
|
||||
output.logstash:
|
||||
hosts: ["logstash:5044"]
|
||||
```
|
||||
|
||||
**logstash.conf**:
|
||||
```
|
||||
input {
|
||||
beats {
|
||||
port => 5044
|
||||
}
|
||||
}
|
||||
|
||||
filter {
|
||||
json {
|
||||
source => "message"
|
||||
}
|
||||
|
||||
date {
|
||||
match => ["timestamp", "ISO8601"]
|
||||
}
|
||||
|
||||
grok {
|
||||
match => { "message" => "%{COMBINEDAPACHELOG}" }
|
||||
}
|
||||
}
|
||||
|
||||
output {
|
||||
elasticsearch {
|
||||
hosts => ["elasticsearch:9200"]
|
||||
index => "app-logs-%{+YYYY.MM.dd}"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Pattern 2: Loki (Grafana Loki)
|
||||
|
||||
**Architecture**:
|
||||
```
|
||||
Application → Promtail → Loki → Grafana
|
||||
```
|
||||
|
||||
**promtail-config.yml**:
|
||||
```yaml
|
||||
server:
|
||||
http_listen_port: 9080
|
||||
|
||||
positions:
|
||||
filename: /tmp/positions.yaml
|
||||
|
||||
clients:
|
||||
- url: http://loki:3100/loki/api/v1/push
|
||||
|
||||
scrape_configs:
|
||||
- job_name: app
|
||||
static_configs:
|
||||
- targets:
|
||||
- localhost
|
||||
labels:
|
||||
job: app
|
||||
__path__: /var/log/app/*.log
|
||||
pipeline_stages:
|
||||
- json:
|
||||
expressions:
|
||||
level: level
|
||||
timestamp: timestamp
|
||||
- labels:
|
||||
level:
|
||||
service:
|
||||
- timestamp:
|
||||
source: timestamp
|
||||
format: RFC3339
|
||||
```
|
||||
|
||||
**Query in Grafana**:
|
||||
```logql
|
||||
{job="app"} |= "error" | json | level="error"
|
||||
```
|
||||
|
||||
### Pattern 3: CloudWatch Logs
|
||||
|
||||
**Install CloudWatch agent**:
|
||||
```json
|
||||
{
|
||||
"logs": {
|
||||
"logs_collected": {
|
||||
"files": {
|
||||
"collect_list": [
|
||||
{
|
||||
"file_path": "/var/log/app/*.log",
|
||||
"log_group_name": "/aws/app/production",
|
||||
"log_stream_name": "{instance_id}",
|
||||
"timezone": "UTC"
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Query with CloudWatch Insights**:
|
||||
```
|
||||
fields @timestamp, level, message, user_id
|
||||
| filter level = "error"
|
||||
| sort @timestamp desc
|
||||
| limit 100
|
||||
```
|
||||
|
||||
### Pattern 4: Fluentd/Fluent Bit
|
||||
|
||||
**fluent-bit.conf**:
|
||||
```
|
||||
[INPUT]
|
||||
Name tail
|
||||
Path /var/log/app/*.log
|
||||
Parser json
|
||||
Tag app.*
|
||||
|
||||
[FILTER]
|
||||
Name record_modifier
|
||||
Match *
|
||||
Record hostname ${HOSTNAME}
|
||||
Record cluster production
|
||||
|
||||
[OUTPUT]
|
||||
Name es
|
||||
Match *
|
||||
Host elasticsearch
|
||||
Port 9200
|
||||
Index app-logs
|
||||
Type _doc
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Query Patterns
|
||||
|
||||
### Find Errors in Time Range
|
||||
**Elasticsearch**:
|
||||
```json
|
||||
GET /app-logs-*/_search
|
||||
{
|
||||
"query": {
|
||||
"bool": {
|
||||
"must": [
|
||||
{ "match": { "level": "error" } },
|
||||
{ "range": { "@timestamp": {
|
||||
"gte": "now-1h",
|
||||
"lte": "now"
|
||||
}}}
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Loki (LogQL)**:
|
||||
```logql
|
||||
{job="app", level="error"} |= "error"
|
||||
```
|
||||
|
||||
**CloudWatch Insights**:
|
||||
```
|
||||
fields @timestamp, @message
|
||||
| filter level = "error"
|
||||
| filter @timestamp > ago(1h)
|
||||
```
|
||||
|
||||
### Count Errors by Type
|
||||
**Elasticsearch**:
|
||||
```json
|
||||
GET /app-logs-*/_search
|
||||
{
|
||||
"size": 0,
|
||||
"query": { "match": { "level": "error" } },
|
||||
"aggs": {
|
||||
"error_types": {
|
||||
"terms": { "field": "error_type.keyword" }
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Loki**:
|
||||
```logql
|
||||
sum by (error_type) (count_over_time({job="app", level="error"} | json [1h]))
|
||||
```
|
||||
|
||||
### Find Slow Requests
|
||||
**Elasticsearch**:
|
||||
```json
|
||||
GET /app-logs-*/_search
|
||||
{
|
||||
"query": {
|
||||
"range": { "duration_ms": { "gte": 1000 } }
|
||||
},
|
||||
"sort": [ { "duration_ms": "desc" } ]
|
||||
}
|
||||
```
|
||||
|
||||
### Trace Request Through Services
|
||||
**Elasticsearch** (using request_id):
|
||||
```json
|
||||
GET /_search
|
||||
{
|
||||
"query": {
|
||||
"match": { "request_id": "550e8400-e29b-41d4-a716-446655440000" }
|
||||
},
|
||||
"sort": [ { "@timestamp": "asc" } ]
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Sampling and Rate Limiting
|
||||
|
||||
### When to Sample
|
||||
- **High volume services**: > 10,000 logs/second
|
||||
- **Debug logs in production**: Sample 1-10%
|
||||
- **Cost optimization**: Reduce storage costs
|
||||
|
||||
### Sampling Strategies
|
||||
|
||||
**1. Random Sampling**:
|
||||
```python
|
||||
import random
|
||||
|
||||
if random.random() < 0.1: # Sample 10%
|
||||
logger.debug("Debug message", ...)
|
||||
```
|
||||
|
||||
**2. Rate Limiting**:
|
||||
```python
|
||||
from rate_limiter import RateLimiter
|
||||
|
||||
limiter = RateLimiter(max_per_second=100)
|
||||
|
||||
if limiter.allow():
|
||||
logger.info("Rate limited log", ...)
|
||||
```
|
||||
|
||||
**3. Error-Biased Sampling**:
|
||||
```python
|
||||
# Always log errors, sample successful requests
|
||||
if level == "error" or random.random() < 0.01:
|
||||
logger.log(level, message, ...)
|
||||
```
|
||||
|
||||
**4. Head-Based Sampling** (trace-aware):
|
||||
```python
|
||||
# If trace is sampled, log all related logs
|
||||
if trace_context.is_sampled():
|
||||
logger.info("Traced log", trace_id=trace_context.trace_id)
|
||||
```
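The `rate_limiter` import in strategy 2 is illustrative rather than a specific library. A self-contained sketch combining strategies 1–3 using only the standard library: warnings and errors always pass, everything else is randomly sampled and then capped by a simple token bucket (the thresholds are assumptions):

```python
import logging
import random
import time

class SampledFilter(logging.Filter):
    """Always pass WARNING and above; sample and rate-limit everything else."""

    def __init__(self, max_per_second=100, sample_rate=0.1):
        super().__init__()
        self.capacity = max_per_second
        self.tokens = float(max_per_second)
        self.sample_rate = sample_rate
        self.last_refill = time.monotonic()

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True                      # error-biased: never drop problems
        if random.random() > self.sample_rate:
            return False                     # random sampling
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.capacity)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1                 # rate limit: one token per record
            return True
        return False

logger = logging.getLogger("app")
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())
logger.addFilter(SampledFilter(max_per_second=100, sample_rate=0.1))
```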
|
||||
|
||||
---
|
||||
|
||||
## Log Retention
|
||||
|
||||
### Retention Strategy
|
||||
|
||||
**Hot tier** (fast SSD): 7-30 days
|
||||
- Recent logs
|
||||
- Full query performance
|
||||
- High cost
|
||||
|
||||
**Warm tier** (regular disk): 30-90 days
|
||||
- Older logs
|
||||
- Slower queries acceptable
|
||||
- Medium cost
|
||||
|
||||
**Cold tier** (object storage): 90+ days
|
||||
- Archive logs
|
||||
- Query via restore
|
||||
- Low cost
|
||||
|
||||
### Example: Elasticsearch ILM Policy
|
||||
```json
|
||||
{
|
||||
"policy": {
|
||||
"phases": {
|
||||
"hot": {
|
||||
"actions": {
|
||||
"rollover": {
|
||||
"max_size": "50GB",
|
||||
"max_age": "1d"
|
||||
}
|
||||
}
|
||||
},
|
||||
"warm": {
|
||||
"min_age": "7d",
|
||||
"actions": {
|
||||
"allocate": { "number_of_replicas": 1 },
|
||||
"shrink": { "number_of_shards": 1 }
|
||||
}
|
||||
},
|
||||
"cold": {
|
||||
"min_age": "30d",
|
||||
"actions": {
|
||||
"allocate": { "require": { "box_type": "cold" } }
|
||||
}
|
||||
},
|
||||
"delete": {
|
||||
"min_age": "90d",
|
||||
"actions": {
|
||||
"delete": {}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Security and Compliance
|
||||
|
||||
### PII Redaction
|
||||
|
||||
**Before logging**:
|
||||
```python
|
||||
import re
|
||||
|
||||
def redact_pii(data):
|
||||
# Redact email
|
||||
data = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b',
|
||||
'[EMAIL]', data)
|
||||
# Redact credit card
|
||||
data = re.sub(r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',
|
||||
'[CARD]', data)
|
||||
# Redact SSN
|
||||
data = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', data)
|
||||
return data
|
||||
|
||||
logger.info("User data", user_input=redact_pii(user_input))
|
||||
```
|
||||
|
||||
**In Logstash**:
|
||||
```
|
||||
filter {
|
||||
mutate {
|
||||
gsub => [
|
||||
"message", "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b", "[EMAIL]",
|
||||
"message", "\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b", "[CARD]"
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Access Control
|
||||
|
||||
**Elasticsearch** (with Security):
|
||||
```yaml
|
||||
# Role for developers
|
||||
dev_logs:
|
||||
indices:
|
||||
- names: ['app-logs-*']
|
||||
privileges: ['read']
|
||||
query: '{"match": {"environment": "development"}}'
|
||||
```
|
||||
|
||||
**CloudWatch** (IAM Policy):
|
||||
```json
|
||||
{
|
||||
"Effect": "Allow",
|
||||
"Action": [
|
||||
"logs:DescribeLogGroups",
|
||||
"logs:GetLogEvents",
|
||||
"logs:FilterLogEvents"
|
||||
],
|
||||
"Resource": "arn:aws:logs:*:*:log-group:/aws/app/production:*"
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
### 1. Logging Sensitive Data
|
||||
❌ `logger.info("Login", password=password)`
|
||||
✅ `logger.info("Login", user_id=user_id)`
|
||||
|
||||
### 2. Excessive Logging
|
||||
❌ Logging every iteration of a loop
|
||||
✅ Log aggregate results or sample
|
||||
|
||||
### 3. Not Including Context
|
||||
❌ `logger.error("Failed")`
|
||||
✅ `logger.error("Payment failed", order_id=order_id, error=str(e))`
|
||||
|
||||
### 4. Inconsistent Formats
|
||||
❌ Mix of JSON and plain text
|
||||
✅ Pick one format and stick to it
|
||||
|
||||
### 5. No Request IDs
|
||||
❌ Can't trace request across services
|
||||
✅ Generate and propagate request_id
|
||||
|
||||
### 6. Logging to Multiple Places
|
||||
❌ Log to file AND stdout AND syslog
|
||||
✅ Log to stdout, let agent handle routing
|
||||
|
||||
### 7. Blocking on Log Writes
|
||||
❌ Synchronous writes to remote systems
|
||||
✅ Asynchronous buffered writes
|
||||
|
||||
---
|
||||
|
||||
## Performance Optimization
|
||||
|
||||
### 1. Async Logging
|
||||
```python
|
||||
import logging
|
||||
from logging.handlers import QueueHandler, QueueListener
|
||||
import queue
|
||||
|
||||
# Create queue and a concrete handler for the background thread to use
log_queue = queue.Queue()
stream_handler = logging.StreamHandler()

# Route application logging through the queue (non-blocking for the caller)
logger = logging.getLogger("app")
logger.addHandler(QueueHandler(log_queue))

# Process queued records in a background thread
listener = QueueListener(log_queue, stream_handler)
listener.start()
```
|
||||
|
||||
### 2. Conditional Logging
|
||||
```python
|
||||
# Avoid expensive operations if not logging
|
||||
if logger.isEnabledFor(logging.DEBUG):
|
||||
logger.debug("Details", data=expensive_serialization(obj))
|
||||
```
|
||||
|
||||
### 3. Batching
|
||||
```python
|
||||
# Batch logs before sending
|
||||
batch = []
|
||||
for log in logs:
|
||||
batch.append(log)
|
||||
if len(batch) >= 100:
|
||||
send_to_aggregator(batch)
|
||||
batch = []
|
||||
```
|
||||
|
||||
### 4. Compression
|
||||
```yaml
|
||||
# Filebeat with compression
|
||||
output.logstash:
|
||||
hosts: ["logstash:5044"]
|
||||
compression_level: 3
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Monitoring Log Pipeline
|
||||
|
||||
Track pipeline health with metrics:
|
||||
|
||||
```promql
|
||||
# Log ingestion rate
|
||||
rate(logs_ingested_total[5m])
|
||||
|
||||
# Pipeline lag
|
||||
log_processing_lag_seconds
|
||||
|
||||
# Dropped logs
|
||||
rate(logs_dropped_total[5m])
|
||||
|
||||
# Error parsing rate
|
||||
rate(logs_parse_errors_total[5m])
|
||||
```
|
||||
|
||||
Alert on:
|
||||
- Sudden drop in log volume (service down?)
|
||||
- High parse error rate (format changed?)
|
||||
- Pipeline lag > 1 minute (capacity issue?)
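The metric names above are generic placeholders; whatever ships your logs has to expose them. A minimal sketch for a custom Python shipper using `prometheus_client`, with the counter names mirroring the queries above (the JSON-line format and buffer limit are assumptions):

```python
import json
import time
from prometheus_client import Counter, Gauge, start_http_server

LOGS_INGESTED = Counter("logs_ingested_total", "Log lines read from sources")
LOGS_DROPPED = Counter("logs_dropped_total", "Log lines dropped before delivery")
PARSE_ERRORS = Counter("logs_parse_errors_total", "Log lines that failed to parse")
PIPELINE_LAG = Gauge("log_processing_lag_seconds", "Age of the newest processed log line")

def ship(line: str, produced_at: float, sink: list) -> None:
    """Parse one JSON log line, hand it to the sink, and update pipeline metrics."""
    LOGS_INGESTED.inc()
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        PARSE_ERRORS.inc()
        return
    if len(sink) >= 10_000:          # crude back-pressure: drop when the buffer is full
        LOGS_DROPPED.inc()
        return
    sink.append(record)
    PIPELINE_LAG.set(time.time() - produced_at)

start_http_server(9100)              # scrape target for the queries above
buffer: list = []
ship('{"level": "info", "message": "hello"}', time.time(), buffer)
```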
|
||||
406
references/metrics_design.md
Normal file
@@ -0,0 +1,406 @@
|
||||
# Metrics Design Guide
|
||||
|
||||
## The Four Golden Signals
|
||||
|
||||
The Four Golden Signals from Google's SRE book provide a comprehensive view of system health:
|
||||
|
||||
### 1. Latency
|
||||
**What**: Time to service a request
|
||||
|
||||
**Why Monitor**: Directly impacts user experience
|
||||
|
||||
**Key Metrics**:
|
||||
- Request duration (p50, p95, p99, p99.9)
|
||||
- Time to first byte (TTFB)
|
||||
- Backend processing time
|
||||
- Database query latency
|
||||
|
||||
**PromQL Examples**:
|
||||
```promql
|
||||
# P95 latency
|
||||
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
|
||||
# Average latency by endpoint
|
||||
avg(rate(http_request_duration_seconds_sum[5m])) by (endpoint)
|
||||
/
|
||||
avg(rate(http_request_duration_seconds_count[5m])) by (endpoint)
|
||||
```
|
||||
|
||||
**Alert Thresholds**:
|
||||
- Warning: p95 > 500ms
|
||||
- Critical: p99 > 2s
|
||||
|
||||
### 2. Traffic
|
||||
**What**: Demand on your system
|
||||
|
||||
**Why Monitor**: Understand load patterns, capacity planning
|
||||
|
||||
**Key Metrics**:
|
||||
- Requests per second (RPS)
|
||||
- Transactions per second (TPS)
|
||||
- Concurrent connections
|
||||
- Network throughput
|
||||
|
||||
**PromQL Examples**:
|
||||
```promql
|
||||
# Requests per second
|
||||
sum(rate(http_requests_total[5m]))
|
||||
|
||||
# Requests per second by status code
|
||||
sum(rate(http_requests_total[5m])) by (status)
|
||||
|
||||
# Traffic growth rate (week over week)
|
||||
sum(rate(http_requests_total[5m]))
|
||||
/
|
||||
sum(rate(http_requests_total[5m] offset 7d))
|
||||
```
|
||||
|
||||
**Alert Thresholds**:
|
||||
- Warning: RPS > 80% of capacity
|
||||
- Critical: RPS > 95% of capacity
|
||||
|
||||
### 3. Errors
|
||||
**What**: Rate of requests that fail
|
||||
|
||||
**Why Monitor**: Direct indicator of user-facing problems
|
||||
|
||||
**Key Metrics**:
|
||||
- Error rate (%)
|
||||
- 5xx response codes
|
||||
- Failed transactions
|
||||
- Exception counts
|
||||
|
||||
**PromQL Examples**:
|
||||
```promql
|
||||
# Error rate percentage
|
||||
sum(rate(http_requests_total{status=~"5.."}[5m]))
|
||||
/
|
||||
sum(rate(http_requests_total[5m])) * 100
|
||||
|
||||
# Error count by type
|
||||
sum(rate(http_requests_total{status=~"5.."}[5m])) by (status)
|
||||
|
||||
# Application errors
|
||||
rate(application_errors_total[5m])
|
||||
```
|
||||
|
||||
**Alert Thresholds**:
|
||||
- Warning: Error rate > 1%
|
||||
- Critical: Error rate > 5%
|
||||
|
||||
### 4. Saturation
|
||||
**What**: How "full" your service is
|
||||
|
||||
**Why Monitor**: Predict capacity issues before they impact users
|
||||
|
||||
**Key Metrics**:
|
||||
- CPU utilization
|
||||
- Memory utilization
|
||||
- Disk I/O
|
||||
- Network bandwidth
|
||||
- Queue depth
|
||||
- Thread pool usage
|
||||
|
||||
**PromQL Examples**:
|
||||
```promql
|
||||
# CPU saturation
|
||||
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
|
||||
|
||||
# Memory saturation
|
||||
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
|
||||
|
||||
# Disk saturation
|
||||
rate(node_disk_io_time_seconds_total[5m]) * 100
|
||||
|
||||
# Queue depth
|
||||
queue_depth_current / queue_depth_max * 100
|
||||
```
|
||||
|
||||
**Alert Thresholds**:
|
||||
- Warning: > 70% utilization
|
||||
- Critical: > 90% utilization
|
||||
|
||||
---
|
||||
|
||||
## RED Method (for Services)
|
||||
|
||||
**R**ate, **E**rrors, **D**uration - a simplified approach for request-driven services
|
||||
|
||||
### Rate
|
||||
Number of requests per second:
|
||||
```promql
|
||||
sum(rate(http_requests_total[5m]))
|
||||
```
|
||||
|
||||
### Errors
|
||||
Number of failed requests per second:
|
||||
```promql
|
||||
sum(rate(http_requests_total{status=~"5.."}[5m]))
|
||||
```
|
||||
|
||||
### Duration
|
||||
Time taken to process requests:
|
||||
```promql
|
||||
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
```
|
||||
|
||||
**When to Use**: Microservices, APIs, web applications
|
||||
|
||||
---
|
||||
|
||||
## USE Method (for Resources)
|
||||
|
||||
**U**tilization, **S**aturation, **E**rrors - for infrastructure resources
|
||||
|
||||
### Utilization
|
||||
Percentage of time resource is busy:
|
||||
```promql
|
||||
# CPU utilization
|
||||
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
|
||||
|
||||
# Disk utilization
|
||||
(node_filesystem_size_bytes - node_filesystem_avail_bytes)
|
||||
/ node_filesystem_size_bytes * 100
|
||||
```
|
||||
|
||||
### Saturation
|
||||
Amount of work the resource cannot service (queued):
|
||||
```promql
|
||||
# Load average (saturation indicator)
|
||||
node_load15
|
||||
|
||||
# Disk I/O wait time
|
||||
rate(node_disk_io_time_weighted_seconds_total[5m])
|
||||
```
|
||||
|
||||
### Errors
|
||||
Count of error events:
|
||||
```promql
|
||||
# Network errors
|
||||
rate(node_network_receive_errs_total[5m])
|
||||
rate(node_network_transmit_errs_total[5m])
|
||||
|
||||
# Disk errors
|
||||
rate(node_disk_io_errors_total[5m])
|
||||
```
|
||||
|
||||
**When to Use**: Servers, databases, network devices
|
||||
|
||||
---
|
||||
|
||||
## Metric Types
|
||||
|
||||
### Counter
|
||||
Monotonically increasing value (never decreases)
|
||||
|
||||
**Examples**: Request count, error count, bytes sent
|
||||
|
||||
**Usage**:
|
||||
```promql
|
||||
# Always use rate() or increase() with counters
|
||||
rate(http_requests_total[5m]) # Requests per second
|
||||
increase(http_requests_total[1h]) # Total requests in 1 hour
|
||||
```
|
||||
|
||||
### Gauge
|
||||
Value that can go up and down
|
||||
|
||||
**Examples**: Memory usage, queue depth, concurrent connections
|
||||
|
||||
**Usage**:
|
||||
```promql
|
||||
# Use directly or with aggregations
|
||||
avg(memory_usage_bytes)
|
||||
max(queue_depth)
|
||||
```
|
||||
|
||||
### Histogram
|
||||
Samples observations and counts them in configurable buckets
|
||||
|
||||
**Examples**: Request duration, response size
|
||||
|
||||
**Usage**:
|
||||
```promql
|
||||
# Calculate percentiles
|
||||
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
|
||||
# Average from histogram
|
||||
rate(http_request_duration_seconds_sum[5m])
|
||||
/
|
||||
rate(http_request_duration_seconds_count[5m])
|
||||
```
|
||||
|
||||
### Summary
|
||||
Similar to histogram but calculates quantiles on client side
|
||||
|
||||
**Usage**: Less flexible than histograms; prefer histograms for new metrics
|
||||
|
||||
---
|
||||
|
||||
## Cardinality Best Practices
|
||||
|
||||
**Cardinality**: Number of unique time series
|
||||
|
||||
### High Cardinality Labels (AVOID)
|
||||
❌ User ID
|
||||
❌ Email address
|
||||
❌ IP address
|
||||
❌ Timestamp
|
||||
❌ Random IDs
|
||||
|
||||
### Low Cardinality Labels (GOOD)
|
||||
✅ Environment (prod, staging)
|
||||
✅ Region (us-east-1, eu-west-1)
|
||||
✅ Service name
|
||||
✅ HTTP status code category (2xx, 4xx, 5xx)
|
||||
✅ Endpoint/route
|
||||
|
||||
### Calculating Cardinality Impact
|
||||
```
|
||||
Time series = unique combinations of labels
|
||||
|
||||
Example:
|
||||
service (5) × environment (3) × region (4) × status (5) = 300 time series ✅
|
||||
|
||||
service (5) × environment (3) × region (4) × user_id (1M) = 60M time series ❌
|
||||
```
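A quick way to sanity-check a proposed label set before shipping it, as a rough sketch (the per-label counts are the assumptions you plug in):

```python
from math import prod

def estimated_series(label_cardinalities: dict[str, int]) -> int:
    """Worst-case series count for one metric: the product of label cardinalities."""
    return prod(label_cardinalities.values()) if label_cardinalities else 1

ok = estimated_series({"service": 5, "environment": 3, "region": 4, "status": 5})
bad = estimated_series({"service": 5, "environment": 3, "region": 4, "user_id": 1_000_000})
print(ok)    # 300        -- fine
print(bad)   # 60000000   -- will overwhelm the TSDB
```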
|
||||
|
||||
---
|
||||
|
||||
## Naming Conventions
|
||||
|
||||
### Prometheus Naming
|
||||
```
|
||||
<namespace>_<name>_<unit>_total
|
||||
|
||||
Examples:
|
||||
http_requests_total
|
||||
http_request_duration_seconds
|
||||
process_cpu_seconds_total
|
||||
node_memory_MemAvailable_bytes
|
||||
```
|
||||
|
||||
**Rules**:
|
||||
- Use snake_case
|
||||
- Include unit in name (seconds, bytes, ratio)
|
||||
- Use `_total` suffix for counters
|
||||
- Namespace by application/component
|
||||
|
||||
### CloudWatch Naming
|
||||
```
|
||||
<Namespace>/<MetricName>
|
||||
|
||||
Examples:
|
||||
AWS/EC2/CPUUtilization
|
||||
MyApp/RequestCount
|
||||
```
|
||||
|
||||
**Rules**:
|
||||
- Use PascalCase
|
||||
- Group by namespace
|
||||
- No unit in name (specified separately)
|
||||
|
||||
---
|
||||
|
||||
## Dashboard Design
|
||||
|
||||
### Key Principles
|
||||
|
||||
1. **Top-Down Layout**: Most important metrics first
|
||||
2. **Color Coding**: Red (critical), yellow (warning), green (healthy)
|
||||
3. **Consistent Time Windows**: All panels use same time range
|
||||
4. **Limit Panels**: 8-12 panels per dashboard maximum
|
||||
5. **Include Context**: Show related metrics together
|
||||
|
||||
### Dashboard Structure
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────┐
|
||||
│ Overall Health (Single Stats) │
|
||||
│ [Requests/s] [Error%] [P95 Latency] │
|
||||
└─────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────┐
|
||||
│ Request Rate & Errors (Graphs) │
|
||||
└─────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────┐
|
||||
│ Latency Distribution (Graphs) │
|
||||
└─────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────┐
|
||||
│ Resource Usage (Graphs) │
|
||||
└─────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────┐
|
||||
│ Dependencies (Graphs) │
|
||||
└─────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
### Template Variables
|
||||
Use variables for filtering:
|
||||
- Environment: `$environment`
|
||||
- Service: `$service`
|
||||
- Region: `$region`
|
||||
- Pod: `$pod`
|
||||
|
||||
---
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
### 1. Monitoring What You Build, Not What Users Experience
|
||||
❌ `backend_processing_complete`
|
||||
✅ `user_request_completed`
|
||||
|
||||
### 2. Too Many Metrics
|
||||
- Start with Four Golden Signals
|
||||
- Add metrics only when needed for specific issues
|
||||
- Remove unused metrics
|
||||
|
||||
### 3. Incorrect Aggregations
|
||||
❌ `avg(rate(http_request_duration_seconds_sum[5m]))` - averaging per-instance rates hides unevenly loaded instances
✅ `sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))` - correct overall average latency
|
||||
|
||||
### 4. Wrong Time Windows
|
||||
- Too short (< 1m): Noisy data
|
||||
- Too long (> 15m): Miss short-lived issues
|
||||
- Sweet spot: 5m for most alerts
|
||||
|
||||
### 5. Missing Labels
|
||||
❌ `http_requests_total`
|
||||
✅ `http_requests_total{method="GET", status="200", endpoint="/api/users"}`
|
||||
|
||||
---
|
||||
|
||||
## Metric Collection Best Practices
|
||||
|
||||
### Application Instrumentation
|
||||
```python
|
||||
from prometheus_client import Counter, Histogram, Gauge
|
||||
|
||||
# Counter for requests
|
||||
requests_total = Counter('http_requests_total',
|
||||
'Total HTTP requests',
|
||||
['method', 'endpoint', 'status'])
|
||||
|
||||
# Histogram for latency
|
||||
request_duration = Histogram('http_request_duration_seconds',
|
||||
'HTTP request duration',
|
||||
['method', 'endpoint'])
|
||||
|
||||
# Gauge for in-progress requests
|
||||
requests_in_progress = Gauge('http_requests_in_progress',
|
||||
'HTTP requests currently being processed')
|
||||
```
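A short follow-up sketch showing how those three metrics get updated inside a request handler; it continues the snippet above, and the handler body is a placeholder to adapt to your framework:

```python
import time

def handle_request(method: str, endpoint: str):
    requests_in_progress.inc()
    start = time.perf_counter()
    status = "200"
    try:
        ...  # actual request handling goes here
    except Exception:
        status = "500"
        raise
    finally:
        # Record duration and outcome even when the handler raises
        request_duration.labels(method=method, endpoint=endpoint).observe(
            time.perf_counter() - start
        )
        requests_total.labels(method=method, endpoint=endpoint, status=status).inc()
        requests_in_progress.dec()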
|
||||
|
||||
### Collection Intervals
|
||||
- Application metrics: 15-30s
|
||||
- Infrastructure metrics: 30-60s
|
||||
- Billing/cost metrics: 5-15m
|
||||
- External API checks: 1-5m
|
||||
|
||||
### Retention
|
||||
- Raw metrics: 15-30 days
|
||||
- 5m aggregates: 90 days
|
||||
- 1h aggregates: 1 year
|
||||
- Daily aggregates: 2+ years
|
||||
652
references/slo_sla_guide.md
Normal file
@@ -0,0 +1,652 @@
|
||||
# SLI, SLO, and SLA Guide
|
||||
|
||||
## Definitions
|
||||
|
||||
### SLI (Service Level Indicator)
|
||||
**What**: A quantitative measure of service quality
|
||||
|
||||
**Examples**:
|
||||
- Request latency (ms)
|
||||
- Error rate (%)
|
||||
- Availability (%)
|
||||
- Throughput (requests/sec)
|
||||
|
||||
### SLO (Service Level Objective)
|
||||
**What**: Target value or range for an SLI
|
||||
|
||||
**Examples**:
|
||||
- "99.9% of requests return in < 500ms"
|
||||
- "99.95% availability"
|
||||
- "Error rate < 0.1%"
|
||||
|
||||
### SLA (Service Level Agreement)
|
||||
**What**: Business contract with consequences for SLO violations
|
||||
|
||||
**Examples**:
|
||||
- "99.9% uptime or 10% monthly credit"
|
||||
- "p95 latency < 1s or refund"
|
||||
|
||||
### Relationship
|
||||
```
|
||||
SLI = Measurement
|
||||
SLO = Target (internal goal)
|
||||
SLA = Promise (customer contract with penalties)
|
||||
|
||||
Example:
|
||||
SLI: Actual availability this month = 99.92%
|
||||
SLO: Target availability = 99.9%
|
||||
SLA: Guaranteed availability = 99.5% (with penalties)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Choosing SLIs
|
||||
|
||||
### The Four Golden Signals as SLIs
|
||||
|
||||
1. **Latency SLIs**
|
||||
- Request duration (p50, p95, p99)
|
||||
- Time to first byte
|
||||
- Page load time
|
||||
|
||||
2. **Availability/Success SLIs**
|
||||
- % of successful requests
|
||||
- % uptime
|
||||
- % of requests completing
|
||||
|
||||
3. **Throughput SLIs** (less common)
|
||||
- Requests per second
|
||||
- Transactions per second
|
||||
|
||||
4. **Saturation SLIs** (internal only)
|
||||
- Resource utilization
|
||||
- Queue depth
|
||||
|
||||
### SLI Selection Criteria
|
||||
|
||||
✅ **Good SLIs**:
|
||||
- Measured from user perspective
|
||||
- Directly impact user experience
|
||||
- Aggregatable across instances
|
||||
- Proportional to user happiness
|
||||
|
||||
❌ **Bad SLIs**:
|
||||
- Internal metrics only
|
||||
- Not user-facing
|
||||
- Hard to measure consistently
|
||||
|
||||
### Examples by Service Type
|
||||
|
||||
**Web Application**:
|
||||
```
|
||||
SLI 1: Request Success Rate
|
||||
= successful_requests / total_requests
|
||||
|
||||
SLI 2: Request Latency (p95)
|
||||
= 95th percentile of response times
|
||||
|
||||
SLI 3: Availability
|
||||
= time_service_responding / total_time
|
||||
```
|
||||
|
||||
**API Service**:
|
||||
```
|
||||
SLI 1: Error Rate
|
||||
= (4xx_errors + 5xx_errors) / total_requests
|
||||
|
||||
SLI 2: Response Time (p99)
|
||||
= 99th percentile latency
|
||||
|
||||
SLI 3: Throughput
|
||||
= requests_per_second
|
||||
```
|
||||
|
||||
**Batch Processing**:
|
||||
```
|
||||
SLI 1: Job Success Rate
|
||||
= successful_jobs / total_jobs
|
||||
|
||||
SLI 2: Processing Latency
|
||||
= time_from_submission_to_completion
|
||||
|
||||
SLI 3: Freshness
|
||||
= age_of_oldest_unprocessed_item
|
||||
```
|
||||
|
||||
**Storage Service**:
|
||||
```
|
||||
SLI 1: Durability
|
||||
= data_not_lost / total_data
|
||||
|
||||
SLI 2: Read Latency (p99)
|
||||
= 99th percentile read time
|
||||
|
||||
SLI 3: Write Success Rate
|
||||
= successful_writes / total_writes
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Setting SLO Targets
|
||||
|
||||
### Start with Current Performance
|
||||
|
||||
1. **Measure baseline**: Collect 30 days of data
|
||||
2. **Analyze distribution**: Look at p50, p95, p99, p99.9
|
||||
3. **Set initial SLO**: Slightly looser than current performance, so it is consistently achievable
|
||||
4. **Iterate**: Tighten or loosen based on feasibility
|
||||
|
||||
### Example Process
|
||||
|
||||
**Current Performance** (30 days):
|
||||
```
|
||||
p50 latency: 120ms
|
||||
p95 latency: 450ms
|
||||
p99 latency: 1200ms
|
||||
p99.9 latency: 3500ms
|
||||
|
||||
Error rate: 0.05%
|
||||
Availability: 99.95%
|
||||
```
|
||||
|
||||
**Initial SLOs**:
|
||||
```
|
||||
Latency: p95 < 500ms (slightly worse than current p95)
|
||||
Error rate: < 0.1% (double current rate)
|
||||
Availability: 99.9% (slightly worse than current)
|
||||
```
|
||||
|
||||
**Rationale**: Start loose, prevent false alarms, tighten over time
|
||||
|
||||
### Common SLO Targets
|
||||
|
||||
**Availability**:
|
||||
- **99%** (3.65 days downtime/year): Internal tools
|
||||
- **99.5%** (1.83 days/year): Non-critical services
|
||||
- **99.9%** (8.76 hours/year): Standard production
|
||||
- **99.95%** (4.38 hours/year): Critical services
|
||||
- **99.99%** (52 minutes/year): High availability
|
||||
- **99.999%** (5 minutes/year): Mission critical
|
||||
|
||||
**Latency**:
|
||||
- **p50 < 100ms**: Excellent responsiveness
|
||||
- **p95 < 500ms**: Standard web applications
|
||||
- **p99 < 1s**: Acceptable for most users
|
||||
- **p99.9 < 5s**: Acceptable for rare edge cases
|
||||
|
||||
**Error Rate**:
|
||||
- **< 0.01%** (99.99% success): Critical operations
|
||||
- **< 0.1%** (99.9% success): Standard production
|
||||
- **< 1%** (99% success): Non-critical services
|
||||
|
||||
---
|
||||
|
||||
## Error Budgets
|
||||
|
||||
### Concept
|
||||
|
||||
Error budget = (100% - SLO target)
|
||||
|
||||
If SLO is 99.9%, error budget is 0.1%
|
||||
|
||||
**Purpose**: Balance reliability with feature velocity
|
||||
|
||||
### Calculation
|
||||
|
||||
**For availability**:
|
||||
```
|
||||
Monthly error budget = (1 - SLO) × time_period
|
||||
|
||||
Example (99.9% SLO, 30 days):
|
||||
Error budget = 0.001 × 30 days = 0.03 days = 43.2 minutes
|
||||
```
|
||||
|
||||
**For request-based SLIs**:
|
||||
```
|
||||
Error budget = (1 - SLO) × total_requests
|
||||
|
||||
Example (99.9% SLO, 10M requests/month):
|
||||
Error budget = 0.001 × 10,000,000 = 10,000 failed requests
|
||||
```
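The same arithmetic as a small helper, handy for reports and runbooks (a sketch; the inputs are whatever your SLO defines):

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Allowed downtime for a time-based SLO over the period."""
    return (1 - slo) * period_days * 24 * 60

def error_budget_requests(slo: float, total_requests: int) -> int:
    """Allowed failed requests for a request-based SLO."""
    return int((1 - slo) * total_requests)

print(error_budget_minutes(0.999))                 # 43.2 minutes per 30 days
print(error_budget_requests(0.999, 10_000_000))    # 10000 failed requests
```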
|
||||
|
||||
### Error Budget Consumption
|
||||
|
||||
**Formula**:
|
||||
```
|
||||
Budget consumed = actual_errors / allowed_errors × 100%
|
||||
|
||||
Example:
|
||||
SLO: 99.9% (0.1% error budget)
|
||||
Total requests: 1,000,000
|
||||
Failed requests: 500
|
||||
Allowed failures: 1,000
|
||||
|
||||
Budget consumed = 500 / 1,000 × 100% = 50%
|
||||
Budget remaining = 50%
|
||||
```
|
||||
|
||||
### Error Budget Policy
|
||||
|
||||
**Example policy**:
|
||||
|
||||
```markdown
|
||||
## Error Budget Policy
|
||||
|
||||
### If error budget > 50%
|
||||
- Deploy frequently (multiple times per day)
|
||||
- Take calculated risks
|
||||
- Experiment with new features
|
||||
- Acceptable to have some incidents
|
||||
|
||||
### If error budget 20-50%
|
||||
- Deploy normally (once per day)
|
||||
- Increase testing
|
||||
- Review recent changes
|
||||
- Monitor closely
|
||||
|
||||
### If error budget < 20%
|
||||
- Freeze non-critical deploys
|
||||
- Focus on reliability improvements
|
||||
- Postmortem all incidents
|
||||
- Reduce change velocity
|
||||
|
||||
### If error budget exhausted (< 0%)
|
||||
- Complete deploy freeze except rollbacks
|
||||
- All hands on reliability
|
||||
- Mandatory postmortems
|
||||
- Executive escalation
|
||||
```
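If you automate this (for example, to gate a deploy pipeline), the policy reduces to a small lookup. A sketch using the thresholds from the example policy above:

```python
def deploy_policy(budget_remaining_pct: float) -> str:
    """Map remaining error budget to the example policy's deploy posture."""
    if budget_remaining_pct < 0:
        return "freeze: rollbacks only, all hands on reliability"
    if budget_remaining_pct < 20:
        return "freeze non-critical deploys, focus on reliability"
    if budget_remaining_pct < 50:
        return "deploy normally, increase testing and monitoring"
    return "deploy freely, take calculated risks"

print(deploy_policy(73))   # deploy freely, take calculated risks
```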
|
||||
|
||||
---
|
||||
|
||||
## Error Budget Burn Rate
|
||||
|
||||
### Concept
|
||||
|
||||
Burn rate = rate of error budget consumption
|
||||
|
||||
**Example**:
|
||||
- Monthly budget: 43.2 minutes (99.9% SLO)
|
||||
- If consuming at 2x rate: Budget exhausted in 15 days
|
||||
- If consuming at 10x rate: Budget exhausted in 3 days
|
||||
|
||||
### Burn Rate Calculation
|
||||
|
||||
```
|
||||
Burn rate = (actual_error_rate / allowed_error_rate)
|
||||
|
||||
Example:
|
||||
SLO: 99.9% (0.1% allowed error rate)
|
||||
Current error rate: 0.5%
|
||||
|
||||
Burn rate = 0.5% / 0.1% = 5x
|
||||
Time to exhaust = 30 days / 5 = 6 days
|
||||
```
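And the burn-rate math from the example as a helper (a sketch):

```python
def burn_rate(actual_error_rate: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    allowed_error_rate = 1 - slo
    return actual_error_rate / allowed_error_rate

def days_to_exhaustion(rate: float, period_days: int = 30) -> float:
    return period_days / rate

r = burn_rate(actual_error_rate=0.005, slo=0.999)
print(r)                        # 5.0
print(days_to_exhaustion(r))    # 6.0 days, matching the example above
```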
|
||||
|
||||
### Multi-Window Alerting
|
||||
|
||||
Alert on burn rate across multiple time windows:
|
||||
|
||||
**Fast burn** (1 hour window):
|
||||
```
|
||||
Burn rate > 14.4x → Exhausts budget in 2 days
|
||||
Alert after 2 minutes
|
||||
Severity: Critical (page immediately)
|
||||
```
|
||||
|
||||
**Moderate burn** (6 hour window):
|
||||
```
|
||||
Burn rate > 6x → Exhausts budget in 5 days
|
||||
Alert after 30 minutes
|
||||
Severity: Warning (create ticket)
|
||||
```
|
||||
|
||||
**Slow burn** (3 day window):
|
||||
```
|
||||
Burn rate > 1x → Exhausts budget by end of month
|
||||
Alert after 6 hours
|
||||
Severity: Info (monitor)
|
||||
```
|
||||
|
||||
### Implementation
|
||||
|
||||
**Prometheus**:
|
||||
```yaml
|
||||
# Fast burn alert (1h window, 2m grace period)
|
||||
- alert: ErrorBudgetFastBurn
|
||||
expr: |
|
||||
(
|
||||
sum(rate(http_requests_total{status=~"5.."}[1h]))
|
||||
/
|
||||
sum(rate(http_requests_total[1h]))
|
||||
) > (14.4 * 0.001) # 14.4x burn rate for 99.9% SLO
|
||||
for: 2m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "Fast error budget burn detected"
|
||||
description: "Error budget will be exhausted in 2 days at current rate"
|
||||
|
||||
# Moderate burn alert (6h window, 30m grace period)
- alert: ErrorBudgetModerateBurn
|
||||
expr: |
|
||||
(
|
||||
sum(rate(http_requests_total{status=~"5.."}[6h]))
|
||||
/
|
||||
sum(rate(http_requests_total[6h]))
|
||||
) > (6 * 0.001) # 6x burn rate for 99.9% SLO
|
||||
for: 30m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "Elevated error budget burn detected"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## SLO Reporting
|
||||
|
||||
### Dashboard Structure
|
||||
|
||||
**Overall Health**:
|
||||
```
|
||||
┌─────────────────────────────────────────┐
|
||||
│ SLO Compliance: 99.92% ✅ │
|
||||
│ Error Budget Remaining: 73% 🟢 │
|
||||
│ Burn Rate: 0.8x 🟢 │
|
||||
└─────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
**SLI Performance**:
|
||||
```
|
||||
Latency p95: 420ms (Target: 500ms) ✅
|
||||
Error Rate: 0.08% (Target: < 0.1%) ✅
|
||||
Availability: 99.95% (Target: > 99.9%) ✅
|
||||
```
|
||||
|
||||
**Error Budget Trend**:
|
||||
```
|
||||
Graph showing:
|
||||
- Error budget consumption over time
|
||||
- Burn rate spikes
|
||||
- Incidents marked
|
||||
- Deploy events overlaid
|
||||
```
|
||||
|
||||
### Monthly SLO Report
|
||||
|
||||
**Template**:
|
||||
```markdown
|
||||
# SLO Report: October 2024
|
||||
|
||||
## Executive Summary
|
||||
- ✅ All SLOs met this month
|
||||
- 🟡 Latency SLO came close to violation (99.1% compliance)
|
||||
- 3 incidents consumed 47% of error budget
|
||||
- Error budget remaining: 53%
|
||||
|
||||
## SLO Performance
|
||||
|
||||
### Availability SLO: 99.9%
|
||||
- Actual: 99.92%
|
||||
- Status: ✅ Met
|
||||
- Error budget consumed: 53%
|
||||
- Downtime: 23 minutes (allowed: 43.2 minutes)
|
||||
|
||||
### Latency SLO: p95 < 500ms
|
||||
- Actual p95: 445ms
|
||||
- Status: ✅ Met
|
||||
- Compliance: 99.1% (target: 99%)
|
||||
- 0.9% of requests exceeded threshold
|
||||
|
||||
### Error Rate SLO: < 0.1%
|
||||
- Actual: 0.05%
|
||||
- Status: ✅ Met
|
||||
- Error budget consumed: 50%
|
||||
|
||||
## Incidents
|
||||
|
||||
### Incident #1: Database Overload (Oct 5)
|
||||
- Duration: 15 minutes
|
||||
- Error budget consumed: 35%
|
||||
- Root cause: Slow query after schema change
|
||||
- Prevention: Added query review to deploy checklist
|
||||
|
||||
### Incident #2: API Gateway Timeout (Oct 12)
|
||||
- Duration: 5 minutes
|
||||
- Error budget consumed: 10%
|
||||
- Root cause: Configuration error in load balancer
|
||||
- Prevention: Automated configuration validation
|
||||
|
||||
### Incident #3: Upstream Service Degradation (Oct 20)
|
||||
- Duration: 3 minutes
|
||||
- Error budget consumed: 2%
|
||||
- Root cause: Third-party API outage
|
||||
- Prevention: Implemented circuit breaker
|
||||
|
||||
## Recommendations
|
||||
1. Investigate latency near-miss (Oct 15-17)
|
||||
2. Add automated rollback for database changes
|
||||
3. Increase circuit breaker thresholds for third-party APIs
|
||||
4. Consider tightening availability SLO to 99.95%
|
||||
|
||||
## Next Month's Focus
|
||||
- Reduce p95 latency to 400ms
|
||||
- Implement automated canary deployments
|
||||
- Add synthetic monitoring for critical paths
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## SLA Structure
|
||||
|
||||
### Components
|
||||
|
||||
**Service Description**:
|
||||
```
|
||||
The API Service provides RESTful endpoints for user management,
|
||||
authentication, and data retrieval.
|
||||
```
|
||||
|
||||
**Covered Metrics**:
|
||||
```
|
||||
- Availability: Service is reachable and returns valid responses
|
||||
- Latency: Time from request to response
|
||||
- Error Rate: Percentage of requests returning errors
|
||||
```
|
||||
|
||||
**SLA Targets**:
|
||||
```
|
||||
Service commits to:
|
||||
1. 99.9% monthly uptime
|
||||
2. p95 API response time < 1 second
|
||||
3. Error rate < 0.5%
|
||||
```
|
||||
|
||||
**Measurement**:
|
||||
```
|
||||
Metrics calculated from server-side monitoring:
|
||||
- Uptime: Successful health check probes / total probes
|
||||
- Latency: Server-side request duration (p95)
|
||||
- Errors: HTTP 5xx responses / total responses
|
||||
|
||||
Calculated monthly (first of month for previous month).
|
||||
```
|
||||
|
||||
**Exclusions**:
|
||||
```
|
||||
SLA does not cover:
|
||||
- Scheduled maintenance (with 7 days notice)
|
||||
- Client-side network issues
|
||||
- DDoS attacks or force majeure
|
||||
- Beta/preview features
|
||||
- Issues caused by customer misuse
|
||||
```
|
||||
|
||||
**Service Credits**:
|
||||
```
|
||||
Monthly Uptime | Service Credit
|
||||
---------------- | --------------
|
||||
< 99.9% (SLA) | 10%
|
||||
< 99.0% | 25%
|
||||
< 95.0% | 50%
|
||||
```
|
||||
|
||||
**Claiming Credits**:
|
||||
```
|
||||
Customer must:
|
||||
1. Report violation within 30 days
|
||||
2. Provide ticket numbers for support requests
|
||||
3. Credits applied to next month's invoice
|
||||
4. Credits do not exceed monthly fee
|
||||
```
|
||||
|
||||
### Example SLAs by Industry
|
||||
|
||||
**E-commerce**:
|
||||
```
|
||||
- 99.95% availability
|
||||
- p95 page load < 2s
|
||||
- p99 checkout < 5s
|
||||
- Credits: 5% per 0.1% below target
|
||||
```
|
||||
|
||||
**Financial Services**:
|
||||
```
|
||||
- 99.99% availability
|
||||
- p99 transaction < 500ms
|
||||
- Zero data loss
|
||||
- Penalties: $10,000 per hour of downtime
|
||||
```
|
||||
|
||||
**Media/Content**:
|
||||
```
|
||||
- 99.9% availability
|
||||
- p95 video start < 3s
|
||||
- No credit system (best effort latency)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Best Practices
|
||||
|
||||
### 1. SLOs Should Be User-Centric
|
||||
❌ "Database queries < 100ms"
|
||||
✅ "API response time p95 < 500ms"
|
||||
|
||||
### 2. Start Loose, Tighten Over Time
|
||||
- Begin with achievable targets
|
||||
- Build reliability culture
|
||||
- Gradually raise bar
|
||||
|
||||
### 3. Fewer, Better SLOs
|
||||
- 1-3 SLOs per service
|
||||
- Focus on user impact
|
||||
- Avoid SLO sprawl
|
||||
|
||||
### 4. SLAs More Conservative Than SLOs
|
||||
```
|
||||
Internal SLO: 99.95%
|
||||
Customer SLA: 99.9%
|
||||
Margin: 0.05% buffer
|
||||
```
|
||||
|
||||
### 5. Make Error Budgets Actionable
|
||||
- Define policies at different thresholds
|
||||
- Empower teams to make tradeoffs
|
||||
- Review in planning meetings
|
||||
|
||||
### 6. Document Everything
|
||||
- How SLIs are measured
|
||||
- Why targets were chosen
|
||||
- Who owns each SLO
|
||||
- How to interpret metrics
|
||||
|
||||
### 7. Review Regularly
|
||||
- Monthly SLO reviews
|
||||
- Quarterly SLO adjustments
|
||||
- Annual SLA renegotiation

---

## Common Pitfalls

### 1. Too Many SLOs
❌ 20 different SLOs per service
✅ 2-3 critical SLOs

### 2. Unrealistic Targets
❌ 99.999% for a non-critical service
✅ 99.9% with room to improve

### 3. SLOs Without Error Budgets
❌ "Must always be 99.9%"
✅ "Budget for 0.1% errors"

### 4. No Consequences
❌ Missing an SLO has no impact
✅ Deploy freeze when the budget is exhausted

### 5. SLA Equals SLO
❌ Promise exactly what you target
✅ SLA more conservative than SLO

### 6. Ignoring User Experience
❌ "Our servers are up 99.99%"
✅ "Users can complete actions 99.9% of the time"

### 7. Static Targets
❌ Set once, never revisit
✅ Quarterly reviews and adjustments

---

## Tools and Automation

### SLO Tracking Tools

**Prometheus + Grafana**:
- Use recording rules for SLIs
- Alert on burn rates
- Dashboard for compliance

**Google Cloud SLO Monitoring**:
- Built-in SLO tracking
- Automatic error budget calculation
- Integration with alerting

**Datadog SLOs**:
- UI for SLO definition
- Automatic burn rate alerts
- Status pages

**Custom Tools**:
- sloth: Generate Prometheus rules from SLO definitions
- slo-libsonnet: Jsonnet library for SLO monitoring

### Example: Prometheus Recording Rules

```yaml
groups:
  - name: sli_recording
    interval: 30s
    rules:
      # SLI: Request success rate
      - record: sli:request_success:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      # SLI: Request latency (p95)
      - record: sli:request_latency:p95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          )

      # Error budget burn rate, based on the success-ratio SLI above
      # (0.001 = error budget for a 99.9% SLO)
      - record: slo:error_budget_burn_rate:1h
        expr: |
          (1 - sli:request_success:ratio) / 0.001
```
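
To make the burn-rate arithmetic concrete, a small Python sketch; the SLO target and window are illustrative:

```python
# Error budget math for a 99.9% SLO over a 30-day window (illustrative values).
SLO_TARGET = 0.999
WINDOW_MINUTES = 30 * 24 * 60
ERROR_BUDGET = 1 - SLO_TARGET            # 0.1% of requests may fail

def burn_rate(observed_error_ratio: float) -> float:
    """How many times faster than 'exactly on budget' we are burning."""
    return observed_error_ratio / ERROR_BUDGET

def budget_exhausted_in_minutes(observed_error_ratio: float) -> float:
    """Time to exhaust the whole budget at the current error ratio."""
    return WINDOW_MINUTES / burn_rate(observed_error_ratio)

print(burn_rate(0.014))                    # 14x burn: in the fast-burn alerting range
print(budget_exhausted_in_minutes(0.014))  # ~3086 minutes (~2.1 days)
```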
|
||||
697
references/tool_comparison.md
Normal file
697
references/tool_comparison.md
Normal file
@@ -0,0 +1,697 @@
|
||||
# Monitoring Tools Comparison
|
||||
|
||||
## Overview Matrix
|
||||
|
||||
| Tool | Type | Best For | Complexity | Cost | Cloud/Self-Hosted |
|
||||
|------|------|----------|------------|------|-------------------|
|
||||
| **Prometheus** | Metrics | Kubernetes, time-series | Medium | Free | Self-hosted |
|
||||
| **Grafana** | Visualization | Dashboards, multi-source | Low-Medium | Free | Both |
|
||||
| **Datadog** | Full-stack | Ease of use, APM | Low | High | Cloud |
|
||||
| **New Relic** | Full-stack | APM, traces | Low | High | Cloud |
|
||||
| **Elasticsearch (ELK)** | Logs | Log search, analysis | High | Medium | Both |
|
||||
| **Grafana Loki** | Logs | Cost-effective logs | Medium | Free | Both |
|
||||
| **CloudWatch** | AWS-native | AWS infrastructure | Low | Medium | Cloud |
|
||||
| **Jaeger** | Tracing | Distributed tracing | Medium | Free | Self-hosted |
|
||||
| **Grafana Tempo** | Tracing | Cost-effective tracing | Medium | Free | Self-hosted |
|
||||
|
||||
---
|
||||
|
||||
## Metrics Platforms
|
||||
|
||||
### Prometheus
|
||||
|
||||
**Type**: Open-source time-series database
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Industry standard for Kubernetes
|
||||
- ✅ Powerful query language (PromQL)
|
||||
- ✅ Pull-based model (no agent config)
|
||||
- ✅ Service discovery
|
||||
- ✅ Free and open source
|
||||
- ✅ Huge ecosystem (exporters for everything)
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ No built-in dashboards (need Grafana)
|
||||
- ❌ Single-node only (no HA without federation)
|
||||
- ❌ Limited long-term storage (need Thanos/Cortex)
|
||||
- ❌ Steep learning curve for PromQL
|
||||
|
||||
**Best For**:
|
||||
- Kubernetes monitoring
|
||||
- Infrastructure metrics
|
||||
- Custom application metrics
|
||||
- Organizations that need control
|
||||
|
||||
**Pricing**: Free (open source)
|
||||
|
||||
**Setup Complexity**: Medium
|
||||
|
||||
**Example**:
|
||||
```yaml
|
||||
# prometheus.yml
|
||||
scrape_configs:
|
||||
- job_name: 'app'
|
||||
static_configs:
|
||||
- targets: ['localhost:8080']
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Datadog
|
||||
|
||||
**Type**: SaaS monitoring platform
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Easy to set up (install agent, done)
|
||||
- ✅ Beautiful pre-built dashboards
|
||||
- ✅ APM, logs, metrics, traces in one platform
|
||||
- ✅ Great anomaly detection
|
||||
- ✅ Excellent integrations (500+)
|
||||
- ✅ Good mobile app
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Very expensive at scale
|
||||
- ❌ Vendor lock-in
|
||||
- ❌ Cost can be unpredictable (per-host pricing)
|
||||
- ❌ Limited PromQL support
|
||||
|
||||
**Best For**:
|
||||
- Teams that want quick setup
|
||||
- Companies prioritizing ease of use over cost
|
||||
- Organizations needing full observability
|
||||
|
||||
**Pricing**: $15-$31/host/month + custom metrics fees
|
||||
|
||||
**Setup Complexity**: Low
|
||||
|
||||
**Example**:
|
||||
```bash
|
||||
# Install agent
|
||||
DD_API_KEY=xxx bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### New Relic
|
||||
|
||||
**Type**: SaaS application performance monitoring
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Excellent APM capabilities
|
||||
- ✅ User-friendly interface
|
||||
- ✅ Good transaction tracing
|
||||
- ✅ Comprehensive alerting
|
||||
- ✅ Generous free tier
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Can get expensive at scale
|
||||
- ❌ Vendor lock-in
|
||||
- ❌ Query language less powerful than PromQL
|
||||
- ❌ Limited customization
|
||||
|
||||
**Best For**:
|
||||
- Application performance monitoring
|
||||
- Teams focused on APM over infrastructure
|
||||
- Startups (free tier is generous)
|
||||
|
||||
**Pricing**: Free up to 100GB/month, then $0.30/GB
|
||||
|
||||
**Setup Complexity**: Low
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
import newrelic.agent
|
||||
newrelic.agent.initialize('newrelic.ini')
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### CloudWatch
|
||||
|
||||
**Type**: AWS-native monitoring
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Zero setup for AWS services
|
||||
- ✅ Native integration with AWS
|
||||
- ✅ Automatic dashboards for AWS resources
|
||||
- ✅ Tightly integrated with other AWS services
|
||||
- ✅ Good for cost if already on AWS
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ AWS-only (not multi-cloud)
|
||||
- ❌ Limited query capabilities
|
||||
- ❌ High costs for custom metrics
|
||||
- ❌ Basic visualization
|
||||
- ❌ 1-minute minimum resolution
|
||||
|
||||
**Best For**:
|
||||
- AWS-centric infrastructure
|
||||
- Quick setup for AWS services
|
||||
- Organizations already invested in AWS
|
||||
|
||||
**Pricing**:
|
||||
- First 10 custom metrics: Free
|
||||
- Additional: $0.30/metric/month
|
||||
- API calls: $0.01/1000 requests
|
||||
|
||||
**Setup Complexity**: Low (for AWS), Medium (for custom metrics)
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
import boto3
|
||||
cloudwatch = boto3.client('cloudwatch')
|
||||
cloudwatch.put_metric_data(
|
||||
Namespace='MyApp',
|
||||
MetricData=[{'MetricName': 'RequestCount', 'Value': 1}]
|
||||
)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Grafana Cloud / Mimir
|
||||
|
||||
**Type**: Managed Prometheus-compatible
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Prometheus-compatible (PromQL)
|
||||
- ✅ Managed service (no ops burden)
|
||||
- ✅ Good cost model (pay for what you use)
|
||||
- ✅ Grafana dashboards included
|
||||
- ✅ Long-term storage
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Relatively new (less mature)
|
||||
- ❌ Some Prometheus features missing
|
||||
- ❌ Requires Grafana for visualization
|
||||
|
||||
**Best For**:
|
||||
- Teams wanting Prometheus without ops overhead
|
||||
- Multi-cloud environments
|
||||
- Organizations already using Grafana
|
||||
|
||||
**Pricing**: $8/month + $0.29/1M samples
|
||||
|
||||
**Setup Complexity**: Low-Medium
|
||||
|
||||
---
|
||||
|
||||
## Logging Platforms
|
||||
|
||||
### Elasticsearch (ELK Stack)
|
||||
|
||||
**Type**: Open-source log search and analytics
|
||||
|
||||
**Full Stack**: Elasticsearch + Logstash + Kibana
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Powerful search capabilities
|
||||
- ✅ Rich query language
|
||||
- ✅ Great for log analysis
|
||||
- ✅ Mature ecosystem
|
||||
- ✅ Can handle large volumes
|
||||
- ✅ Flexible data model
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Complex to operate
|
||||
- ❌ Resource intensive (RAM hungry)
|
||||
- ❌ Expensive at scale
|
||||
- ❌ Requires dedicated ops team
|
||||
- ❌ Slow for high-cardinality queries
|
||||
|
||||
**Best For**:
|
||||
- Large organizations with ops teams
|
||||
- Deep log analysis needs
|
||||
- Search-heavy use cases
|
||||
|
||||
**Pricing**: Free (open source) + infrastructure costs
|
||||
|
||||
**Infrastructure**: ~$500-2000/month for medium scale
|
||||
|
||||
**Setup Complexity**: High
|
||||
|
||||
**Example**:
|
||||
```json
|
||||
PUT /logs-2024.10/_doc/1
|
||||
{
|
||||
"timestamp": "2024-10-28T14:32:15Z",
|
||||
"level": "error",
|
||||
"message": "Payment failed"
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Grafana Loki
|
||||
|
||||
**Type**: Log aggregation system
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Cost-effective (labels only, not full-text indexing)
|
||||
- ✅ Easy to operate
|
||||
- ✅ Prometheus-like label model
|
||||
- ✅ Great Grafana integration
|
||||
- ✅ Low resource usage
|
||||
- ✅ Fast time-range queries
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Limited full-text search
|
||||
- ❌ Requires careful label design
|
||||
- ❌ Younger ecosystem than ELK
|
||||
- ❌ Not ideal for complex queries
|
||||
|
||||
**Best For**:
|
||||
- Cost-conscious organizations
|
||||
- Kubernetes environments
|
||||
- Teams already using Prometheus
|
||||
- Time-series log queries
|
||||
|
||||
**Pricing**: Free (open source) + infrastructure costs
|
||||
|
||||
**Infrastructure**: ~$100-500/month for medium scale
|
||||
|
||||
**Setup Complexity**: Medium
|
||||
|
||||
**Example**:
|
||||
```logql
|
||||
{job="api", environment="prod"} |= "error" | json | level="error"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Splunk
|
||||
|
||||
**Type**: Enterprise log management
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Extremely powerful search
|
||||
- ✅ Great for security/compliance
|
||||
- ✅ Mature platform
|
||||
- ✅ Enterprise support
|
||||
- ✅ Machine learning features
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Very expensive
|
||||
- ❌ Complex pricing (per GB ingested)
|
||||
- ❌ Steep learning curve
|
||||
- ❌ Heavy resource usage
|
||||
|
||||
**Best For**:
|
||||
- Large enterprises
|
||||
- Security operations centers (SOCs)
|
||||
- Compliance-heavy industries
|
||||
|
||||
**Pricing**: $150-$1800/GB/month (depending on tier)
|
||||
|
||||
**Setup Complexity**: Medium-High
|
||||
|
||||
---
|
||||
|
||||
### CloudWatch Logs
|
||||
|
||||
**Type**: AWS-native log management
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Zero setup for AWS services
|
||||
- ✅ Integrated with AWS ecosystem
|
||||
- ✅ CloudWatch Insights for queries
|
||||
- ✅ Reasonable cost for low volume
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ AWS-only
|
||||
- ❌ Limited query capabilities
|
||||
- ❌ Expensive at high volume
|
||||
- ❌ Basic visualization
|
||||
|
||||
**Best For**:
|
||||
- AWS-centric applications
|
||||
- Low-volume logging
|
||||
- Simple log aggregation
|
||||
|
||||
**Pricing**: Tiered (as of May 2025)
|
||||
- Vended Logs: $0.50/GB (first 10TB), $0.25/GB (next 20TB), then lower tiers
|
||||
- Standard logs: $0.50/GB flat
|
||||
- Storage: $0.03/GB
|
||||
|
||||
**Setup Complexity**: Low (AWS), Medium (custom)
|
||||
|
||||
---
|
||||
|
||||
### Sumo Logic
|
||||
|
||||
**Type**: SaaS log management
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Easy to use
|
||||
- ✅ Good for cloud-native apps
|
||||
- ✅ Real-time analytics
|
||||
- ✅ Good compliance features
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Expensive at scale
|
||||
- ❌ Vendor lock-in
|
||||
- ❌ Limited customization
|
||||
|
||||
**Best For**:
|
||||
- Cloud-native applications
|
||||
- Teams wanting managed solution
|
||||
- Security and compliance use cases
|
||||
|
||||
**Pricing**: $90-$180/GB/month
|
||||
|
||||
**Setup Complexity**: Low
|
||||
|
||||
---
|
||||
|
||||
## Tracing Platforms
|
||||
|
||||
### Jaeger
|
||||
|
||||
**Type**: Open-source distributed tracing
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Industry standard
|
||||
- ✅ CNCF graduated project
|
||||
- ✅ Supports OpenTelemetry
|
||||
- ✅ Good UI
|
||||
- ✅ Free and open source
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Requires separate storage backend
|
||||
- ❌ Limited query capabilities
|
||||
- ❌ No built-in analytics
|
||||
|
||||
**Best For**:
|
||||
- Microservices architectures
|
||||
- Kubernetes environments
|
||||
- OpenTelemetry users
|
||||
|
||||
**Pricing**: Free (open source) + storage costs
|
||||
|
||||
**Setup Complexity**: Medium
|
||||
|
||||
---
|
||||
|
||||
### Grafana Tempo
|
||||
|
||||
**Type**: Open-source distributed tracing
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Cost-effective (object storage)
|
||||
- ✅ Easy to operate
|
||||
- ✅ Great Grafana integration
|
||||
- ✅ TraceQL query language
|
||||
- ✅ Supports OpenTelemetry
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Younger than Jaeger
|
||||
- ❌ Limited third-party integrations
|
||||
- ❌ Requires Grafana for UI
|
||||
|
||||
**Best For**:
|
||||
- Cost-conscious organizations
|
||||
- Teams using Grafana stack
|
||||
- High trace volumes
|
||||
|
||||
**Pricing**: Free (open source) + storage costs
|
||||
|
||||
**Setup Complexity**: Medium
|
||||
|
||||
---
|
||||
|
||||
### Datadog APM
|
||||
|
||||
**Type**: SaaS application performance monitoring
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Easy to set up
|
||||
- ✅ Excellent trace visualization
|
||||
- ✅ Integrated with metrics/logs
|
||||
- ✅ Automatic service map
|
||||
- ✅ Good profiling features
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Expensive ($31/host/month)
|
||||
- ❌ Vendor lock-in
|
||||
- ❌ Limited sampling control
|
||||
|
||||
**Best For**:
|
||||
- Teams wanting ease of use
|
||||
- Organizations already using Datadog
|
||||
- Complex microservices
|
||||
|
||||
**Pricing**: $31/host/month + $1.70/million spans
|
||||
|
||||
**Setup Complexity**: Low
|
||||
|
||||
---
|
||||
|
||||
### AWS X-Ray
|
||||
|
||||
**Type**: AWS-native distributed tracing
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Native AWS integration
|
||||
- ✅ Automatic instrumentation for AWS services
|
||||
- ✅ Low cost
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ AWS-only
|
||||
- ❌ Basic UI
|
||||
- ❌ Limited query capabilities
|
||||
|
||||
**Best For**:
|
||||
- AWS-centric applications
|
||||
- Serverless architectures (Lambda)
|
||||
- Cost-sensitive projects
|
||||
|
||||
**Pricing**: $5/million traces, first 100k free/month
|
||||
|
||||
**Setup Complexity**: Low (AWS), Medium (custom)
|
||||
|
||||
---
|
||||
|
||||
## Full-Stack Observability
|
||||
|
||||
### Datadog (Full Platform)
|
||||
|
||||
**Components**: Metrics, logs, traces, RUM, synthetics
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Everything in one platform
|
||||
- ✅ Excellent user experience
|
||||
- ✅ Correlation across signals
|
||||
- ✅ Great for teams
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Very expensive ($50-100+/host/month)
|
||||
- ❌ Vendor lock-in
|
||||
- ❌ Unpredictable costs
|
||||
|
||||
**Total Cost** (example 100 hosts):
|
||||
- Infrastructure: $3,100/month
|
||||
- APM: $3,100/month
|
||||
- Logs: ~$2,000/month
|
||||
- **Total: ~$8,000/month**
|
||||
|
||||
---
|
||||
|
||||
### Grafana Stack (LGTM)
|
||||
|
||||
**Components**: Loki (logs), Grafana (viz), Tempo (traces), Mimir/Prometheus (metrics)
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Open source and cost-effective
|
||||
- ✅ Unified visualization
|
||||
- ✅ Prometheus-compatible
|
||||
- ✅ Great for cloud-native
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Requires self-hosting or Grafana Cloud
|
||||
- ❌ More ops burden
|
||||
- ❌ Less polished than commercial tools
|
||||
|
||||
**Total Cost** (self-hosted, 100 hosts):
|
||||
- Infrastructure: ~$1,500/month
|
||||
- Ops time: Variable
|
||||
- **Total: ~$1,500-3,000/month**
|
||||
|
||||
---
|
||||
|
||||
### Elastic Observability
|
||||
|
||||
**Components**: Elasticsearch (logs), Kibana (viz), APM, metrics
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Powerful search
|
||||
- ✅ Mature platform
|
||||
- ✅ Good for log-heavy use cases
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Complex to operate
|
||||
- ❌ Expensive infrastructure
|
||||
- ❌ Resource intensive
|
||||
|
||||
**Total Cost** (self-hosted, 100 hosts):
|
||||
- Infrastructure: ~$3,000-5,000/month
|
||||
- Ops time: High
|
||||
- **Total: ~$4,000-7,000/month**
|
||||
|
||||
---
|
||||
|
||||
### New Relic One
|
||||
|
||||
**Components**: Metrics, logs, traces, synthetics
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Generous free tier (100GB)
|
||||
- ✅ User-friendly
|
||||
- ✅ Good for startups
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Costs increase quickly after free tier
|
||||
- ❌ Vendor lock-in
|
||||
|
||||
**Total Cost**:
|
||||
- Free: up to 100GB/month
|
||||
- Paid: $0.30/GB beyond 100GB
|
||||
|
||||
---
|
||||
|
||||
## Cloud Provider Native
|
||||
|
||||
### AWS (CloudWatch + X-Ray)
|
||||
|
||||
**Use When**:
|
||||
- Primarily on AWS
|
||||
- Simple monitoring needs
|
||||
- Want minimal setup
|
||||
|
||||
**Avoid When**:
|
||||
- Multi-cloud environment
|
||||
- Need advanced features
|
||||
- High log volume (expensive)
|
||||
|
||||
**Cost** (example):
|
||||
- 100 EC2 instances with basic metrics: ~$150/month
|
||||
- 1TB logs: ~$500/month ingestion + storage
|
||||
- X-Ray: ~$50/month
|
||||
|
||||
---
|
||||
|
||||
### GCP (Cloud Monitoring + Cloud Trace)
|
||||
|
||||
**Use When**:
|
||||
- Primarily on GCP
|
||||
- Using GKE
|
||||
- Want tight GCP integration
|
||||
|
||||
**Avoid When**:
|
||||
- Multi-cloud environment
|
||||
- Need advanced querying
|
||||
|
||||
**Cost** (example):
|
||||
- First 150MB/month per resource: Free
|
||||
- Additional: $0.2508/MB
|
||||
|
||||
---
|
||||
|
||||
### Azure (Azure Monitor)
|
||||
|
||||
**Use When**:
|
||||
- Primarily on Azure
|
||||
- Using AKS
|
||||
- Need Azure integration
|
||||
|
||||
**Avoid When**:
|
||||
- Multi-cloud
|
||||
- Need advanced features
|
||||
|
||||
**Cost** (example):
|
||||
- First 5GB: Free
|
||||
- Additional: $2.76/GB
|
||||
|
||||
---
|
||||
|
||||
## Decision Matrix
|
||||
|
||||
### Choose Prometheus + Grafana If:
|
||||
- ✅ Using Kubernetes
|
||||
- ✅ Want control and customization
|
||||
- ✅ Have ops capacity
|
||||
- ✅ Budget-conscious
|
||||
- ✅ Need Prometheus ecosystem
|
||||
|
||||
### Choose Datadog If:
|
||||
- ✅ Want ease of use
|
||||
- ✅ Need full observability now
|
||||
- ✅ Budget allows ($8k+/month for 100 hosts)
|
||||
- ✅ Limited ops team
|
||||
- ✅ Need excellent UX
|
||||
|
||||
### Choose ELK If:
|
||||
- ✅ Heavy log analysis needs
|
||||
- ✅ Need powerful search
|
||||
- ✅ Have dedicated ops team
|
||||
- ✅ Compliance requirements
|
||||
- ✅ Willing to invest in infrastructure
|
||||
|
||||
### Choose Grafana Stack (LGTM) If:
|
||||
- ✅ Want open source full stack
|
||||
- ✅ Cost-effective solution
|
||||
- ✅ Cloud-native architecture
|
||||
- ✅ Already using Prometheus
|
||||
- ✅ Have some ops capacity
|
||||
|
||||
### Choose New Relic If:
|
||||
- ✅ Startup with free tier
|
||||
- ✅ APM is priority
|
||||
- ✅ Want easy setup
|
||||
- ✅ Don't need heavy customization
|
||||
|
||||
### Choose Cloud Native (CloudWatch/etc) If:
|
||||
- ✅ Single cloud provider
|
||||
- ✅ Simple needs
|
||||
- ✅ Want minimal setup
|
||||
- ✅ Low to medium scale
|
||||
|
||||
---
|
||||
|
||||
## Cost Comparison

**Example: 100 hosts, 1TB logs/month, 1M spans/day**

| Solution | Monthly Cost | Setup | Ops Burden |
|----------|-------------|--------|------------|
| **Prometheus + Loki + Tempo** | $1,500 | Medium | Medium |
| **Grafana Cloud** | $3,000 | Low | Low |
| **Datadog** | $8,000 | Low | None |
| **New Relic** | $3,500 | Low | None |
| **ELK Stack** | $4,000 | High | High |
| **CloudWatch** | $2,000 | Low | Low |
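
As a rough sanity check on figures like these, a small Python sketch that multiplies out per-unit list prices; the unit prices are assumptions based on the vendor sections above and will drift over time:

```python
# Rough cost model; unit prices are illustrative and change over time.
def datadog_monthly_estimate(hosts: int, log_gb: float, spans_millions: float) -> float:
    infra = hosts * 31          # infrastructure monitoring, $/host
    apm = hosts * 31            # APM, $/host
    logs = log_gb * 2.0         # assumed ~$/GB ingested + indexed (varies by retention)
    spans = spans_millions * 1.70
    return infra + apm + logs + spans

# 100 hosts, 1 TB logs, ~30M spans/month: in the ballpark of the table above
print(round(datadog_monthly_estimate(100, 1000, 30)))  # ~8251
```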
|
||||
|
||||
---
|
||||
|
||||
## Recommendations by Company Size
|
||||
|
||||
### Startup (< 10 engineers)
|
||||
**Recommendation**: New Relic or Grafana Cloud
|
||||
- Minimal ops burden
|
||||
- Good free tiers
|
||||
- Easy to get started
|
||||
|
||||
### Small Company (10-50 engineers)
|
||||
**Recommendation**: Prometheus + Grafana + Loki (self-hosted or cloud)
|
||||
- Cost-effective
|
||||
- Growing ops capacity
|
||||
- Flexibility
|
||||
|
||||
### Medium Company (50-200 engineers)
|
||||
**Recommendation**: Datadog or Grafana Stack
|
||||
- Datadog if budget allows
|
||||
- Grafana Stack if cost-conscious
|
||||
|
||||
### Large Enterprise (200+ engineers)
|
||||
**Recommendation**: Build observability platform
|
||||
- Mix of tools based on needs
|
||||
- Dedicated observability team
|
||||
- Custom integrations
|
||||
663
references/tracing_guide.md
Normal file
663
references/tracing_guide.md
Normal file
@@ -0,0 +1,663 @@
|
||||
# Distributed Tracing Guide
|
||||
|
||||
## What is Distributed Tracing?
|
||||
|
||||
Distributed tracing tracks a request as it flows through multiple services in a distributed system.
|
||||
|
||||
### Key Concepts
|
||||
|
||||
**Trace**: End-to-end journey of a request
|
||||
**Span**: Single operation within a trace
|
||||
**Context**: Metadata propagated between services (trace_id, span_id)
|
||||
|
||||
### Example Flow
|
||||
```
|
||||
User Request → API Gateway → Auth Service → User Service → Database
|
||||
↓ ↓ ↓
|
||||
[Trace ID: abc123]
|
||||
Span 1: gateway (50ms)
|
||||
Span 2: auth (20ms)
|
||||
Span 3: user_service (100ms)
|
||||
Span 4: db_query (80ms)
|
||||
|
||||
Total: 250ms with waterfall view showing dependencies
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## OpenTelemetry (OTel)
|
||||
|
||||
OpenTelemetry is the industry standard for instrumentation.
|
||||
|
||||
### Components
|
||||
|
||||
**API**: Instrument code (create spans, add attributes)
|
||||
**SDK**: Implement API, configure exporters
|
||||
**Collector**: Receive, process, and export telemetry data
|
||||
**Exporters**: Send data to backends (Jaeger, Tempo, Zipkin)
|
||||
|
||||
### Architecture
|
||||
```
|
||||
Application → OTel SDK → OTel Collector → Backend (Jaeger/Tempo)
|
||||
↓
|
||||
Visualization
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Instrumentation Examples
|
||||
|
||||
### Python (using OpenTelemetry)
|
||||
|
||||
**Setup**:
|
||||
```python
|
||||
from opentelemetry import trace
|
||||
from opentelemetry.sdk.trace import TracerProvider
|
||||
from opentelemetry.sdk.trace.export import BatchSpanProcessor
|
||||
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
|
||||
|
||||
# Setup tracer
|
||||
trace.set_tracer_provider(TracerProvider())
|
||||
tracer = trace.get_tracer(__name__)
|
||||
|
||||
# Configure exporter
|
||||
otlp_exporter = OTLPSpanExporter(endpoint="localhost:4317")
|
||||
span_processor = BatchSpanProcessor(otlp_exporter)
|
||||
trace.get_tracer_provider().add_span_processor(span_processor)
|
||||
```
|
||||
|
||||
**Manual instrumentation**:
|
||||
```python
|
||||
from opentelemetry import trace
|
||||
|
||||
tracer = trace.get_tracer(__name__)
|
||||
|
||||
@tracer.start_as_current_span("process_order")
|
||||
def process_order(order_id):
|
||||
span = trace.get_current_span()
|
||||
span.set_attribute("order.id", order_id)
|
||||
span.set_attribute("order.amount", 99.99)
|
||||
|
||||
try:
|
||||
result = payment_service.charge(order_id)
|
||||
span.set_attribute("payment.status", "success")
|
||||
return result
|
||||
except Exception as e:
|
||||
span.set_status(trace.Status(trace.StatusCode.ERROR))
|
||||
span.record_exception(e)
|
||||
raise
|
||||
```
|
||||
|
||||
**Auto-instrumentation** (Flask example):
|
||||
```python
|
||||
from opentelemetry.instrumentation.flask import FlaskInstrumentor
|
||||
from opentelemetry.instrumentation.requests import RequestsInstrumentor
|
||||
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
|
||||
|
||||
# Auto-instrument Flask
|
||||
FlaskInstrumentor().instrument_app(app)
|
||||
|
||||
# Auto-instrument requests library
|
||||
RequestsInstrumentor().instrument()
|
||||
|
||||
# Auto-instrument SQLAlchemy
|
||||
SQLAlchemyInstrumentor().instrument(engine=db.engine)
|
||||
```
|
||||
|
||||
### Node.js (using OpenTelemetry)
|
||||
|
||||
**Setup**:
|
||||
```javascript
|
||||
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
|
||||
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
|
||||
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
|
||||
|
||||
// Setup provider
|
||||
const provider = new NodeTracerProvider();
|
||||
const exporter = new OTLPTraceExporter({ url: 'localhost:4317' });
|
||||
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
|
||||
provider.register();
|
||||
```
|
||||
|
||||
**Manual instrumentation**:
|
||||
```javascript
|
||||
const { SpanStatusCode } = require('@opentelemetry/api');

const tracer = provider.getTracer('my-service');
|
||||
|
||||
async function processOrder(orderId) {
|
||||
const span = tracer.startSpan('process_order');
|
||||
span.setAttribute('order.id', orderId);
|
||||
|
||||
try {
|
||||
const result = await paymentService.charge(orderId);
|
||||
span.setAttribute('payment.status', 'success');
|
||||
return result;
|
||||
} catch (error) {
|
||||
span.setStatus({ code: SpanStatusCode.ERROR });
|
||||
span.recordException(error);
|
||||
throw error;
|
||||
} finally {
|
||||
span.end();
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Auto-instrumentation**:
|
||||
```javascript
|
||||
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
|
||||
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
|
||||
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
|
||||
const { MongoDBInstrumentation } = require('@opentelemetry/instrumentation-mongodb');
|
||||
|
||||
registerInstrumentations({
|
||||
instrumentations: [
|
||||
new HttpInstrumentation(),
|
||||
new ExpressInstrumentation(),
|
||||
new MongoDBInstrumentation()
|
||||
]
|
||||
});
|
||||
```
|
||||
|
||||
### Go (using OpenTelemetry)
|
||||
|
||||
**Setup**:
|
||||
```go
|
||||
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/trace"
)
|
||||
|
||||
func initTracer() {
|
||||
exporter, _ := otlptracegrpc.New(context.Background())
|
||||
tp := trace.NewTracerProvider(
|
||||
trace.WithBatcher(exporter),
|
||||
)
|
||||
otel.SetTracerProvider(tp)
|
||||
}
|
||||
```
|
||||
|
||||
**Manual instrumentation**:
|
||||
```go
|
||||
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)
|
||||
|
||||
func processOrder(ctx context.Context, orderID string) error {
|
||||
tracer := otel.Tracer("my-service")
|
||||
ctx, span := tracer.Start(ctx, "process_order")
|
||||
defer span.End()
|
||||
|
||||
span.SetAttributes(
|
||||
attribute.String("order.id", orderID),
|
||||
attribute.Float64("order.amount", 99.99),
|
||||
)
|
||||
|
||||
err := paymentService.Charge(ctx, orderID)
|
||||
if err != nil {
|
||||
span.RecordError(err)
|
||||
return err
|
||||
}
|
||||
|
||||
span.SetAttributes(attribute.String("payment.status", "success"))
|
||||
return nil
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Span Attributes
|
||||
|
||||
### Semantic Conventions
|
||||
|
||||
Follow OpenTelemetry semantic conventions for consistency:
|
||||
|
||||
**HTTP**:
|
||||
```python
|
||||
span.set_attribute("http.method", "GET")
|
||||
span.set_attribute("http.url", "https://api.example.com/users")
|
||||
span.set_attribute("http.status_code", 200)
|
||||
span.set_attribute("http.user_agent", "Mozilla/5.0...")
|
||||
```
|
||||
|
||||
**Database**:
|
||||
```python
|
||||
span.set_attribute("db.system", "postgresql")
|
||||
span.set_attribute("db.name", "users_db")
|
||||
span.set_attribute("db.statement", "SELECT * FROM users WHERE id = ?")
|
||||
span.set_attribute("db.operation", "SELECT")
|
||||
```
|
||||
|
||||
**RPC/gRPC**:
|
||||
```python
|
||||
span.set_attribute("rpc.system", "grpc")
|
||||
span.set_attribute("rpc.service", "UserService")
|
||||
span.set_attribute("rpc.method", "GetUser")
|
||||
span.set_attribute("rpc.grpc.status_code", 0)
|
||||
```
|
||||
|
||||
**Messaging**:
|
||||
```python
|
||||
span.set_attribute("messaging.system", "kafka")
|
||||
span.set_attribute("messaging.destination", "user-events")
|
||||
span.set_attribute("messaging.operation", "publish")
|
||||
span.set_attribute("messaging.message_id", "msg123")
|
||||
```
|
||||
|
||||
### Custom Attributes
|
||||
|
||||
Add business context:
|
||||
```python
|
||||
span.set_attribute("user.id", "user123")
|
||||
span.set_attribute("order.id", "ORD-456")
|
||||
span.set_attribute("feature.flag.checkout_v2", True)
|
||||
span.set_attribute("cache.hit", False)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Context Propagation

### W3C Trace Context (Standard)

Headers propagated between services:
```
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: vendor1=value1,vendor2=value2
```

**Format**: `version-trace_id-parent_span_id-trace_flags`
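
For intuition, a minimal Python sketch that splits the header into its fields; production code should use the SDK propagators shown below:

```python
# Minimal parser for the traceparent format above (W3C Trace Context).
def parse_traceparent(header: str) -> dict:
    version, trace_id, parent_span_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,              # 32 hex chars (16 bytes)
        "parent_span_id": parent_span_id,  # 16 hex chars (8 bytes)
        "sampled": int(flags, 16) & 0x01 == 1,
    }

ctx = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
print(ctx["sampled"])  # True: trace-flags 01 has the sampled bit set
```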
|
||||
|
||||
### Implementation
|
||||
|
||||
**Python**:
|
||||
```python
|
||||
from opentelemetry.propagate import inject, extract
|
||||
import requests
|
||||
|
||||
# Inject context into outgoing request
|
||||
headers = {}
|
||||
inject(headers)
|
||||
requests.get("https://api.example.com", headers=headers)
|
||||
|
||||
# Extract context from incoming request
|
||||
from flask import request
|
||||
ctx = extract(request.headers)
|
||||
```
|
||||
|
||||
**Node.js**:
|
||||
```javascript
|
||||
const { propagation, context } = require('@opentelemetry/api');
|
||||
|
||||
// Inject
|
||||
const headers = {};
|
||||
propagation.inject(context.active(), headers);
|
||||
axios.get('https://api.example.com', { headers });
|
||||
|
||||
// Extract
|
||||
const ctx = propagation.extract(context.active(), req.headers);
|
||||
```
|
||||
|
||||
**HTTP Example**:
|
||||
```bash
|
||||
curl -H "traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01" \
|
||||
https://api.example.com/users
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Sampling Strategies
|
||||
|
||||
### 1. Always On/Off
|
||||
```python
|
||||
from opentelemetry.sdk.trace import TracerProvider
|
||||
from opentelemetry.sdk.trace.sampling import ALWAYS_ON, ALWAYS_OFF
|
||||
|
||||
# Development: trace everything
|
||||
provider = TracerProvider(sampler=ALWAYS_ON)
|
||||
|
||||
# Production: trace nothing (usually not desired)
|
||||
provider = TracerProvider(sampler=ALWAYS_OFF)
|
||||
```
|
||||
|
||||
### 2. Probability-Based
|
||||
```python
|
||||
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
|
||||
|
||||
# Sample 10% of traces
|
||||
provider = TracerProvider(sampler=TraceIdRatioBased(0.1))
|
||||
```
|
||||
|
||||
### 3. Rate Limiting
The core Python SDK does not ship a rate-limiting sampler; cap trace volume in the
OpenTelemetry Collector (tail sampling), via a vendor/contrib sampler, or with a
custom sampler (see "Custom Sampling" below). Wrapped in `ParentBased`, the limit
applies only to root spans:
```python
from opentelemetry.sdk.trace.sampling import ParentBased

# RateLimitingSampler is assumed to be a custom Sampler you implement
# (see the ErrorSampler pattern below) admitting at most N traces/second.
sampler = ParentBased(root=RateLimitingSampler(100))
provider = TracerProvider(sampler=sampler)
```
|
||||
|
||||
### 4. Parent-Based (Default)
|
||||
```python
|
||||
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
|
||||
|
||||
# If parent span is sampled, sample child spans
|
||||
sampler = ParentBased(root=TraceIdRatioBased(0.1))
|
||||
provider = TracerProvider(sampler=sampler)
|
||||
```
|
||||
|
||||
### 5. Custom Sampling
A custom sampler implements `should_sample()` (returning a `SamplingResult`) and
`get_description()`:
```python
from opentelemetry.sdk.trace.sampling import Sampler, SamplingResult, Decision

class ErrorSampler(Sampler):
    """Always sample errors, sample ~1% of successes."""

    def should_sample(self, parent_context, trace_id, name, kind=None,
                      attributes=None, links=None, trace_state=None):
        attributes = attributes or {}

        # Always sample if the caller marked the span as an error
        if attributes.get("error", False):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)

        # Sample ~1% of successes (3 of 256 possible low bytes of the trace ID)
        if trace_id & 0xFF < 3:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)

        return SamplingResult(Decision.DROP)

    def get_description(self) -> str:
        return "ErrorSampler"

provider = TracerProvider(sampler=ErrorSampler())
```
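
To apply this policy only at trace roots and follow the parent decision elsewhere, it can be wrapped as `sampler = ParentBased(root=ErrorSampler())`.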
|
||||
|
||||
---
|
||||
|
||||
## Backends
|
||||
|
||||
### Jaeger
|
||||
|
||||
**Docker Compose**:
|
||||
```yaml
|
||||
version: '3'
|
||||
services:
|
||||
jaeger:
|
||||
image: jaegertracing/all-in-one:latest
|
||||
ports:
|
||||
- "16686:16686" # UI
|
||||
- "4317:4317" # OTLP gRPC
|
||||
- "4318:4318" # OTLP HTTP
|
||||
environment:
|
||||
- COLLECTOR_OTLP_ENABLED=true
|
||||
```
|
||||
|
||||
**Query traces**:
|
||||
```bash
|
||||
# UI: http://localhost:16686
|
||||
|
||||
# API: Get trace by ID
|
||||
curl http://localhost:16686/api/traces/abc123
|
||||
|
||||
# Search traces
|
||||
curl "http://localhost:16686/api/traces?service=my-service&limit=20"
|
||||
```
|
||||
|
||||
### Grafana Tempo
|
||||
|
||||
**Docker Compose**:
|
||||
```yaml
|
||||
version: '3'
|
||||
services:
|
||||
tempo:
|
||||
image: grafana/tempo:latest
|
||||
ports:
|
||||
- "3200:3200" # Tempo
|
||||
- "4317:4317" # OTLP gRPC
|
||||
volumes:
|
||||
- ./tempo.yaml:/etc/tempo.yaml
|
||||
command: ["-config.file=/etc/tempo.yaml"]
|
||||
```
|
||||
|
||||
**tempo.yaml**:
|
||||
```yaml
|
||||
server:
|
||||
http_listen_port: 3200
|
||||
|
||||
distributor:
|
||||
receivers:
|
||||
otlp:
|
||||
protocols:
|
||||
grpc:
|
||||
endpoint: 0.0.0.0:4317
|
||||
|
||||
storage:
|
||||
trace:
|
||||
backend: local
|
||||
local:
|
||||
path: /tmp/tempo/traces
|
||||
```
|
||||
|
||||
**Query in Grafana**:
|
||||
- Install Tempo data source
|
||||
- Use TraceQL: `{ span.http.status_code = 500 }`
|
||||
|
||||
### AWS X-Ray
|
||||
|
||||
**Configuration**:
|
||||
```python
|
||||
from aws_xray_sdk.core import xray_recorder
|
||||
from aws_xray_sdk.ext.flask.middleware import XRayMiddleware
|
||||
|
||||
xray_recorder.configure(service='my-service')
|
||||
XRayMiddleware(app, xray_recorder)
|
||||
```
|
||||
|
||||
**Query**:
|
||||
```bash
|
||||
aws xray get-trace-summaries \
|
||||
--start-time 2024-10-28T00:00:00 \
|
||||
--end-time 2024-10-28T23:59:59 \
|
||||
--filter-expression 'error = true'
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Analysis Patterns
|
||||
|
||||
### Find Slow Traces
|
||||
```
|
||||
# Jaeger UI
|
||||
- Filter by service
|
||||
- Set min duration: 1000ms
|
||||
- Sort by duration
|
||||
|
||||
# TraceQL (Tempo)
|
||||
{ duration > 1s }
|
||||
```
|
||||
|
||||
### Find Error Traces
|
||||
```
|
||||
# Jaeger UI
|
||||
- Filter by tag: error=true
|
||||
- Or by HTTP status: http.status_code=500
|
||||
|
||||
# TraceQL (Tempo)
|
||||
{ span.http.status_code >= 500 }
|
||||
```
|
||||
|
||||
### Find Traces by User
|
||||
```
|
||||
# Jaeger UI
|
||||
- Filter by tag: user.id=user123
|
||||
|
||||
# TraceQL (Tempo)
|
||||
{ span.user.id = "user123" }
|
||||
```
|
||||
|
||||
### Find N+1 Query Problems
Look for:
- Many sequential database spans
- Same query repeated multiple times
- Pattern: API call → DB query → DB query → DB query... (a detection sketch follows below)
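
A rough sketch of flagging this pattern from exported span data, assuming each span is a dict with `parent_id` and a `db.statement` attribute (field names are illustrative, not a specific backend's export format):

```python
from collections import Counter

def find_n_plus_one(spans, threshold=5):
    """Flag parent spans whose children repeat the same DB statement many times."""
    by_parent = Counter()
    for span in spans:
        stmt = span.get("attributes", {}).get("db.statement")
        if stmt and span.get("parent_id"):
            by_parent[(span["parent_id"], stmt)] += 1
    return [
        {"parent_id": parent, "statement": stmt, "count": count}
        for (parent, stmt), count in by_parent.items()
        if count >= threshold
    ]

# Example: 20 identical SELECTs under one API span -> flagged
spans = [{"parent_id": "api1",
          "attributes": {"db.statement": "SELECT * FROM users WHERE id = ?"}}] * 20
print(find_n_plus_one(spans))
```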
|
||||
|
||||
### Find Service Bottlenecks
- Identify spans with the longest duration
- Check if time is spent in service logic or waiting for dependencies
- Look at span relationships (parallel vs sequential); the self-time sketch below separates own work from time waiting on children
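
A minimal sketch of the self-time idea, assuming spans are dicts with `span_id`, `parent_id`, and start/end timestamps in milliseconds, and that child spans do not overlap:

```python
# Self-time = span duration minus time covered by its (non-overlapping) children.
def self_times(spans):
    children = {}
    for s in spans:
        children.setdefault(s.get("parent_id"), []).append(s)
    result = {}
    for s in spans:
        child_time = sum(c["end"] - c["start"] for c in children.get(s["span_id"], []))
        result[s["span_id"]] = (s["end"] - s["start"]) - child_time
    return result

spans = [
    {"span_id": "api", "parent_id": None, "start": 0, "end": 250},
    {"span_id": "db", "parent_id": "api", "start": 50, "end": 200},
]
print(self_times(spans))  # {'api': 100, 'db': 150}: most of the time is the DB call
```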
|
||||
|
||||
---
|
||||
|
||||
## Integration with Logs
|
||||
|
||||
### Trace ID in Logs
|
||||
|
||||
**Python**:
|
||||
```python
|
||||
from opentelemetry import trace
|
||||
|
||||
def add_trace_context():
|
||||
span = trace.get_current_span()
|
||||
trace_id = span.get_span_context().trace_id
|
||||
span_id = span.get_span_context().span_id
|
||||
|
||||
return {
|
||||
"trace_id": format(trace_id, '032x'),
|
||||
"span_id": format(span_id, '016x')
|
||||
}
|
||||
|
||||
logger.info("Processing order", **add_trace_context(), order_id=order_id)
|
||||
```
|
||||
|
||||
**Query logs for trace**:
|
||||
```
|
||||
# Elasticsearch
|
||||
GET /logs/_search
|
||||
{
|
||||
"query": {
|
||||
"match": { "trace_id": "0af7651916cd43dd8448eb211c80319c" }
|
||||
}
|
||||
}
|
||||
|
||||
# Loki (LogQL)
|
||||
{job="app"} |= "0af7651916cd43dd8448eb211c80319c"
|
||||
```
|
||||
|
||||
### Trace from Log (Grafana)
|
||||
|
||||
Configure derived fields in Grafana:
|
||||
```yaml
|
||||
datasources:
|
||||
- name: Loki
|
||||
type: loki
|
||||
jsonData:
|
||||
derivedFields:
|
||||
- name: TraceID
|
||||
matcherRegex: "trace_id=([\\w]+)"
|
||||
url: "http://tempo:3200/trace/$${__value.raw}"
|
||||
datasourceUid: tempo_uid
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Best Practices
|
||||
|
||||
### 1. Span Naming
|
||||
✅ Use operation names, not IDs
|
||||
- Good: `GET /api/users`, `UserService.GetUser`, `db.query.users`
|
||||
- Bad: `/api/users/123`, `span_abc`, `query_1`
|
||||
|
||||
### 2. Span Granularity
|
||||
✅ One span per logical operation
|
||||
- Too coarse: One span for entire request
|
||||
- Too fine: Span for every variable assignment
|
||||
- Just right: Span per service call, database query, external API
|
||||
|
||||
### 3. Add Context
|
||||
Always include:
|
||||
- Operation name
|
||||
- Service name
|
||||
- Error status
|
||||
- Business identifiers (user_id, order_id)
|
||||
|
||||
### 4. Handle Errors
|
||||
```python
|
||||
try:
|
||||
result = operation()
|
||||
except Exception as e:
|
||||
span.set_status(trace.Status(trace.StatusCode.ERROR))
|
||||
span.record_exception(e)
|
||||
raise
|
||||
```
|
||||
|
||||
### 5. Sampling Strategy
- Development: 100%
- Staging: 50-100%
- Production: 1-10% (or error-based; see the sketch below)
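
One way to wire this up, choosing the sampler from an environment variable (the variable name and ratios are illustrative):

```python
import os

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ALWAYS_ON, ParentBased, TraceIdRatioBased

# Hypothetical per-environment sampling ratios; None means "sample everything".
RATIOS = {"development": None, "staging": 0.5, "production": 0.05}

env = os.getenv("DEPLOY_ENV", "development")
ratio = RATIOS.get(env)
sampler = ALWAYS_ON if ratio is None else ParentBased(root=TraceIdRatioBased(ratio))

provider = TracerProvider(sampler=sampler)
```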
|
||||
|
||||
### 6. Performance Impact
|
||||
- Overhead: ~1-5% CPU
|
||||
- Use async exporters
|
||||
- Batch span exports
|
||||
- Sample appropriately
|
||||
|
||||
### 7. Cardinality
|
||||
Avoid high-cardinality attributes:
|
||||
- ❌ Email addresses
|
||||
- ❌ Full URLs with unique IDs
|
||||
- ❌ Timestamps
|
||||
- ✅ User ID
|
||||
- ✅ Endpoint pattern
|
||||
- ✅ Status code
|
||||
|
||||
---
|
||||
|
||||
## Common Issues
|
||||
|
||||
### Missing Traces
|
||||
**Cause**: Context not propagated
|
||||
**Solution**: Verify headers are injected/extracted
|
||||
|
||||
### Incomplete Traces
|
||||
**Cause**: Spans not closed properly
|
||||
**Solution**: Always use `defer span.End()` or context managers
|
||||
|
||||
### High Overhead
|
||||
**Cause**: Too many spans or synchronous export
|
||||
**Solution**: Reduce span count, use batch processor
|
||||
|
||||
### No Error Traces
|
||||
**Cause**: Errors not recorded on spans
|
||||
**Solution**: Call `span.record_exception()` and set error status
|
||||
|
||||
---
|
||||
|
||||
## Metrics from Traces
|
||||
|
||||
Generate RED metrics from trace data:
|
||||
|
||||
**Rate**: Traces per second
|
||||
**Errors**: Traces with error status
|
||||
**Duration**: Span duration percentiles
|
||||
|
||||
**Example** (using Tempo + Prometheus):
|
||||
```yaml
|
||||
# Generate metrics from spans
|
||||
metrics_generator:
|
||||
processor:
|
||||
span_metrics:
|
||||
dimensions:
|
||||
- http.method
|
||||
- http.status_code
|
||||
```
|
||||
|
||||
**Query**:
|
||||
```promql
|
||||
# Request rate
|
||||
rate(traces_spanmetrics_calls_total[5m])
|
||||
|
||||
# Error rate
|
||||
rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m])
|
||||
/
|
||||
rate(traces_spanmetrics_calls_total[5m])
|
||||
|
||||
# P95 latency
|
||||
histogram_quantile(0.95, traces_spanmetrics_latency_bucket)
|
||||
```
|
||||
315
scripts/alert_quality_checker.py
Normal file
315
scripts/alert_quality_checker.py
Normal file
@@ -0,0 +1,315 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Audit Prometheus alert rules against best practices.
|
||||
Checks for: alert naming, severity labels, runbook links, expression quality.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
import os
|
||||
import re
|
||||
from typing import Dict, List, Any
|
||||
from pathlib import Path
|
||||
|
||||
try:
|
||||
import yaml
|
||||
except ImportError:
|
||||
print("⚠️ Warning: 'PyYAML' library not found. Install with: pip install pyyaml")
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
class AlertQualityChecker:
|
||||
def __init__(self):
|
||||
self.issues = []
|
||||
self.warnings = []
|
||||
self.recommendations = []
|
||||
|
||||
def check_alert_name(self, alert_name: str) -> List[str]:
|
||||
"""Check alert naming conventions."""
|
||||
issues = []
|
||||
|
||||
# Should be PascalCase (leading capital, alphanumeric)
|
||||
if not re.match(r'^[A-Z][a-zA-Z0-9]*$', alert_name):
|
||||
issues.append(f"Alert name '{alert_name}' should use PascalCase (e.g., HighCPUUsage)")
|
||||
|
||||
# Should be descriptive
|
||||
if len(alert_name) < 5:
|
||||
issues.append(f"Alert name '{alert_name}' is too short, use descriptive names")
|
||||
|
||||
# Avoid generic names
|
||||
generic_names = ['Alert', 'Test', 'Warning', 'Error']
|
||||
if alert_name in generic_names:
|
||||
issues.append(f"Alert name '{alert_name}' is too generic")
|
||||
|
||||
return issues
|
||||
|
||||
def check_labels(self, alert: Dict[str, Any]) -> List[str]:
|
||||
"""Check required and recommended labels."""
|
||||
issues = []
|
||||
labels = alert.get('labels', {})
|
||||
|
||||
# Required labels
|
||||
if 'severity' not in labels:
|
||||
issues.append("Missing required 'severity' label (critical/warning/info)")
|
||||
elif labels['severity'] not in ['critical', 'warning', 'info']:
|
||||
issues.append(f"Severity '{labels['severity']}' should be one of: critical, warning, info")
|
||||
|
||||
# Recommended labels
|
||||
if 'team' not in labels:
|
||||
self.recommendations.append("Consider adding 'team' label for routing")
|
||||
|
||||
if 'component' not in labels and 'service' not in labels:
|
||||
self.recommendations.append("Consider adding 'component' or 'service' label")
|
||||
|
||||
return issues
|
||||
|
||||
def check_annotations(self, alert: Dict[str, Any]) -> List[str]:
|
||||
"""Check annotations quality."""
|
||||
issues = []
|
||||
annotations = alert.get('annotations', {})
|
||||
|
||||
# Required annotations
|
||||
if 'summary' not in annotations:
|
||||
issues.append("Missing 'summary' annotation")
|
||||
elif len(annotations['summary']) < 10:
|
||||
issues.append("Summary annotation is too short, provide clear description")
|
||||
|
||||
if 'description' not in annotations:
|
||||
issues.append("Missing 'description' annotation")
|
||||
|
||||
# Runbook
|
||||
if 'runbook_url' not in annotations and 'runbook' not in annotations:
|
||||
self.recommendations.append("Consider adding 'runbook_url' for incident response")
|
||||
|
||||
# Check for templating
|
||||
if 'summary' in annotations:
|
||||
if '{{ $value }}' not in annotations['summary'] and '{{' not in annotations['summary']:
|
||||
self.recommendations.append("Consider using template variables in summary (e.g., {{ $value }})")
|
||||
|
||||
return issues
|
||||
|
||||
def check_expression(self, expr: str, alert_name: str) -> List[str]:
|
||||
"""Check PromQL expression quality."""
|
||||
issues = []
|
||||
|
||||
# Should have a threshold
|
||||
if '>' not in expr and '<' not in expr and '==' not in expr and '!=' not in expr:
|
||||
issues.append("Expression should include a comparison operator")
|
||||
|
||||
# Should use rate() for counters
|
||||
if '_total' in expr and 'rate(' not in expr and 'increase(' not in expr:
|
||||
self.recommendations.append("Consider using rate() or increase() for counter metrics (*_total)")
|
||||
|
||||
# Avoid instant queries without aggregation
|
||||
if not any(agg in expr for agg in ['sum(', 'avg(', 'min(', 'max(', 'count(']):
|
||||
if expr.count('{') > 1: # Multiple metrics without aggregation
|
||||
self.recommendations.append("Consider aggregating metrics with sum(), avg(), etc.")
|
||||
|
||||
# Check for proper time windows
|
||||
if '[' not in expr and 'rate(' in expr:
|
||||
issues.append("rate() requires a time window (e.g., rate(metric[5m]))")
|
||||
|
||||
return issues
|
||||
|
||||
def check_for_duration(self, rule: Dict[str, Any]) -> List[str]:
|
||||
"""Check for 'for' clause to prevent flapping."""
|
||||
issues = []
|
||||
severity = rule.get('labels', {}).get('severity', 'unknown')
|
||||
|
||||
if 'for' not in rule:
|
||||
if severity == 'critical':
|
||||
issues.append("Critical alerts should have 'for' clause to prevent flapping")
|
||||
else:
|
||||
self.warnings.append("Consider adding 'for' clause to prevent alert flapping")
|
||||
else:
|
||||
# Parse duration
|
||||
duration = rule['for']
|
||||
if severity == 'critical' and any(x in duration for x in ['0s', '30s', '1m']):
|
||||
self.warnings.append(f"'for' duration ({duration}) might be too short for critical alerts")
|
||||
|
||||
return issues
|
||||
|
||||
def check_alert_rule(self, rule: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""Check a single alert rule."""
|
||||
alert_name = rule.get('alert', 'Unknown')
|
||||
issues = []
|
||||
|
||||
# Check alert name
|
||||
issues.extend(self.check_alert_name(alert_name))
|
||||
|
||||
# Check expression
|
||||
if 'expr' not in rule:
|
||||
issues.append("Missing 'expr' field")
|
||||
else:
|
||||
issues.extend(self.check_expression(rule['expr'], alert_name))
|
||||
|
||||
# Check labels
|
||||
issues.extend(self.check_labels(rule))
|
||||
|
||||
# Check annotations
|
||||
issues.extend(self.check_annotations(rule))
|
||||
|
||||
# Check for duration
|
||||
issues.extend(self.check_for_duration(rule))
|
||||
|
||||
return {
|
||||
"alert": alert_name,
|
||||
"issues": issues,
|
||||
"severity": rule.get('labels', {}).get('severity', 'unknown')
|
||||
}
|
||||
|
||||
def analyze_file(self, filepath: str) -> Dict[str, Any]:
|
||||
"""Analyze a Prometheus rules file."""
|
||||
try:
|
||||
with open(filepath, 'r') as f:
|
||||
data = yaml.safe_load(f)
|
||||
|
||||
if not data:
|
||||
return {"error": "Empty or invalid YAML file"}
|
||||
|
||||
results = []
|
||||
groups = data.get('groups', [])
|
||||
|
||||
for group in groups:
|
||||
group_name = group.get('name', 'Unknown')
|
||||
rules = group.get('rules', [])
|
||||
|
||||
for rule in rules:
|
||||
# Only check alerting rules, not recording rules
|
||||
if 'alert' in rule:
|
||||
result = self.check_alert_rule(rule)
|
||||
result['group'] = group_name
|
||||
results.append(result)
|
||||
|
||||
return {
|
||||
"file": filepath,
|
||||
"groups": len(groups),
|
||||
"alerts_checked": len(results),
|
||||
"results": results
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
return {"error": f"Failed to parse file: {e}"}
|
||||
|
||||
|
||||
def print_results(analysis: Dict[str, Any], checker: AlertQualityChecker):
|
||||
"""Pretty print analysis results."""
|
||||
print("\n" + "="*60)
|
||||
print("🚨 ALERT QUALITY CHECK RESULTS")
|
||||
print("="*60)
|
||||
|
||||
if "error" in analysis:
|
||||
print(f"\n❌ Error: {analysis['error']}")
|
||||
return
|
||||
|
||||
print(f"\n📁 File: {analysis['file']}")
|
||||
print(f"📊 Groups: {analysis['groups']}")
|
||||
print(f"🔔 Alerts Checked: {analysis['alerts_checked']}")
|
||||
|
||||
# Count issues by severity
|
||||
critical_count = 0
|
||||
warning_count = 0
|
||||
|
||||
for result in analysis['results']:
|
||||
if result['issues']:
|
||||
critical_count += 1
|
||||
|
||||
print(f"\n{'='*60}")
|
||||
print(f"📈 Summary:")
|
||||
print(f" ❌ Alerts with Issues: {critical_count}")
|
||||
print(f" ⚠️ Warnings: {len(checker.warnings)}")
|
||||
print(f" 💡 Recommendations: {len(checker.recommendations)}")
|
||||
|
||||
# Print detailed results
|
||||
if critical_count > 0:
|
||||
print(f"\n{'='*60}")
|
||||
print("❌ ALERTS WITH ISSUES:")
|
||||
print(f"{'='*60}")
|
||||
|
||||
for result in analysis['results']:
|
||||
if result['issues']:
|
||||
print(f"\n🔔 Alert: {result['alert']} (Group: {result['group']})")
|
||||
print(f" Severity: {result['severity']}")
|
||||
print(" Issues:")
|
||||
for issue in result['issues']:
|
||||
print(f" • {issue}")
|
||||
|
||||
# Print warnings
|
||||
if checker.warnings:
|
||||
print(f"\n{'='*60}")
|
||||
print("⚠️ WARNINGS:")
|
||||
print(f"{'='*60}")
|
||||
for warning in set(checker.warnings): # Remove duplicates
|
||||
print(f"• {warning}")
|
||||
|
||||
# Print recommendations
|
||||
if checker.recommendations:
|
||||
print(f"\n{'='*60}")
|
||||
print("💡 RECOMMENDATIONS:")
|
||||
print(f"{'='*60}")
|
||||
for rec in list(set(checker.recommendations))[:10]: # Top 10 unique recommendations
|
||||
print(f"• {rec}")
|
||||
|
||||
# Overall score
|
||||
total_alerts = analysis['alerts_checked']
|
||||
if total_alerts > 0:
|
||||
quality_score = ((total_alerts - critical_count) / total_alerts) * 100
|
||||
print(f"\n{'='*60}")
|
||||
print(f"📊 Quality Score: {quality_score:.1f}% ({total_alerts - critical_count}/{total_alerts} alerts passing)")
|
||||
print(f"{'='*60}\n")
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Audit Prometheus alert rules for quality and best practices",
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Examples:
|
||||
# Check a single file
|
||||
python3 alert_quality_checker.py alerts.yml
|
||||
|
||||
# Check all YAML files in a directory
|
||||
python3 alert_quality_checker.py /path/to/prometheus/rules/
|
||||
|
||||
Best Practices Checked:
|
||||
✓ Alert naming conventions (PascalCase, descriptive)
|
||||
✓ Required labels (severity)
|
||||
✓ Required annotations (summary, description)
|
||||
✓ Runbook URL presence
|
||||
✓ PromQL expression quality
|
||||
✓ 'for' clause to prevent flapping
|
||||
✓ Template variable usage
|
||||
"""
|
||||
)
|
||||
|
||||
parser.add_argument('path', help='Path to alert rules file or directory')
|
||||
parser.add_argument('--verbose', action='store_true', help='Show all recommendations')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
checker = AlertQualityChecker()
|
||||
|
||||
# Check if path is file or directory
|
||||
path = Path(args.path)
|
||||
|
||||
if path.is_file():
|
||||
files = [str(path)]
|
||||
elif path.is_dir():
|
||||
files = [str(f) for f in path.rglob('*.yml')] + [str(f) for f in path.rglob('*.yaml')]
|
||||
else:
|
||||
print(f"❌ Path not found: {args.path}")
|
||||
sys.exit(1)
|
||||
|
||||
if not files:
|
||||
print(f"❌ No YAML files found in: {args.path}")
|
||||
sys.exit(1)
|
||||
|
||||
print(f"🔍 Checking {len(files)} file(s)...")
|
||||
|
||||
for filepath in files:
|
||||
analysis = checker.analyze_file(filepath)
|
||||
print_results(analysis, checker)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
279
scripts/analyze_metrics.py
Normal file
279
scripts/analyze_metrics.py
Normal file
@@ -0,0 +1,279 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Analyze metrics from Prometheus or CloudWatch and detect anomalies.
|
||||
Supports: rate of change analysis, spike detection, trend analysis.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
import json
|
||||
from datetime import datetime, timedelta
|
||||
from typing import Dict, List, Any, Optional
|
||||
import statistics
|
||||
|
||||
try:
|
||||
import requests
|
||||
except ImportError:
|
||||
print("⚠️ Warning: 'requests' library not found. Install with: pip install requests")
|
||||
sys.exit(1)
|
||||
|
||||
try:
|
||||
import boto3
|
||||
except ImportError:
|
||||
boto3 = None
|
||||
|
||||
|
||||
class MetricAnalyzer:
|
||||
def __init__(self, source: str, endpoint: Optional[str] = None, region: str = "us-east-1"):
|
||||
self.source = source
|
||||
self.endpoint = endpoint
|
||||
self.region = region
|
||||
if source == "cloudwatch" and boto3:
|
||||
self.cloudwatch = boto3.client('cloudwatch', region_name=region)
|
||||
elif source == "cloudwatch" and not boto3:
|
||||
print("⚠️ boto3 not installed. Install with: pip install boto3")
|
||||
sys.exit(1)
|
||||
|
||||
def query_prometheus(self, query: str, hours: int = 24) -> List[Dict]:
|
||||
"""Query Prometheus for metric data."""
|
||||
if not self.endpoint:
|
||||
print("❌ Prometheus endpoint required")
|
||||
sys.exit(1)
|
||||
|
||||
try:
|
||||
# Query range for last N hours
|
||||
end_time = datetime.now()
|
||||
start_time = end_time - timedelta(hours=hours)
|
||||
|
||||
params = {
|
||||
'query': query,
|
||||
'start': start_time.timestamp(),
|
||||
'end': end_time.timestamp(),
|
||||
'step': '5m' # 5-minute resolution
|
||||
}
|
||||
|
||||
response = requests.get(f"{self.endpoint}/api/v1/query_range", params=params, timeout=30)
|
||||
response.raise_for_status()
|
||||
|
||||
data = response.json()
|
||||
if data['status'] != 'success':
|
||||
print(f"❌ Prometheus query failed: {data}")
|
||||
return []
|
||||
|
||||
return data['data']['result']
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ Error querying Prometheus: {e}")
|
||||
return []
|
||||
|
||||
def query_cloudwatch(self, namespace: str, metric_name: str, dimensions: Dict[str, str],
|
||||
hours: int = 24, stat: str = "Average") -> List[Dict]:
|
||||
"""Query CloudWatch for metric data."""
|
||||
try:
|
||||
end_time = datetime.now()
|
||||
start_time = end_time - timedelta(hours=hours)
|
||||
|
||||
dimensions_list = [{'Name': k, 'Value': v} for k, v in dimensions.items()]
|
||||
|
||||
response = self.cloudwatch.get_metric_statistics(
|
||||
Namespace=namespace,
|
||||
MetricName=metric_name,
|
||||
Dimensions=dimensions_list,
|
||||
StartTime=start_time,
|
||||
EndTime=end_time,
|
||||
Period=300, # 5-minute intervals
|
||||
Statistics=[stat]
|
||||
)
|
||||
|
||||
return sorted(response['Datapoints'], key=lambda x: x['Timestamp'])
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ Error querying CloudWatch: {e}")
|
||||
return []
|
||||
|
||||
def detect_anomalies(self, values: List[float], sensitivity: float = 2.0) -> Dict[str, Any]:
|
||||
"""Detect anomalies using standard deviation method."""
|
||||
if len(values) < 10:
|
||||
return {
|
||||
"anomalies_detected": False,
|
||||
"message": "Insufficient data points for anomaly detection"
|
||||
}
|
||||
|
||||
mean = statistics.mean(values)
|
||||
stdev = statistics.stdev(values)
|
||||
threshold_upper = mean + (sensitivity * stdev)
|
||||
threshold_lower = mean - (sensitivity * stdev)
|
||||
|
||||
anomalies = []
|
||||
for i, value in enumerate(values):
|
||||
if value > threshold_upper or value < threshold_lower:
|
||||
anomalies.append({
|
||||
"index": i,
|
||||
"value": value,
|
||||
"deviation": abs(value - mean) / stdev if stdev > 0 else 0
|
||||
})
|
||||
|
||||
return {
|
||||
"anomalies_detected": len(anomalies) > 0,
|
||||
"count": len(anomalies),
|
||||
"anomalies": anomalies,
|
||||
"stats": {
|
||||
"mean": mean,
|
||||
"stdev": stdev,
|
||||
"threshold_upper": threshold_upper,
|
||||
"threshold_lower": threshold_lower,
|
||||
"total_points": len(values)
|
||||
}
|
||||
}
|
||||
|
||||
def analyze_trend(self, values: List[float]) -> Dict[str, Any]:
|
||||
"""Analyze trend using simple linear regression."""
|
||||
if len(values) < 2:
|
||||
return {"trend": "unknown", "message": "Insufficient data"}
|
||||
|
||||
n = len(values)
|
||||
x = list(range(n))
|
||||
x_mean = sum(x) / n
|
||||
y_mean = sum(values) / n
|
||||
|
||||
numerator = sum((x[i] - x_mean) * (values[i] - y_mean) for i in range(n))
|
||||
denominator = sum((x[i] - x_mean) ** 2 for i in range(n))
|
||||
|
||||
if denominator == 0:
|
||||
return {"trend": "flat", "slope": 0}
|
||||
|
||||
slope = numerator / denominator
|
||||
|
||||
# Determine trend direction
|
||||
if abs(slope) < 0.01 * y_mean: # Less than 1% change per interval
|
||||
trend = "stable"
|
||||
elif slope > 0:
|
||||
trend = "increasing"
|
||||
else:
|
||||
trend = "decreasing"
|
||||
|
||||
return {
|
||||
"trend": trend,
|
||||
"slope": slope,
|
||||
"rate_of_change": (slope / y_mean * 100) if y_mean != 0 else 0
|
||||
}
|
||||
|
||||
|
||||
def print_results(results: Dict[str, Any]):
|
||||
"""Pretty print analysis results."""
|
||||
print("\n" + "="*60)
|
||||
print("📊 METRIC ANALYSIS RESULTS")
|
||||
print("="*60)
|
||||
|
||||
if "error" in results:
|
||||
print(f"\n❌ Error: {results['error']}")
|
||||
return
|
||||
|
||||
print(f"\n📈 Data Points: {results.get('data_points', 0)}")
|
||||
|
||||
# Trend analysis
|
||||
if "trend" in results:
|
||||
trend_emoji = {"increasing": "📈", "decreasing": "📉", "stable": "➡️"}.get(results["trend"]["trend"], "❓")
|
||||
print(f"\n{trend_emoji} Trend: {results['trend']['trend'].upper()}")
|
||||
if "rate_of_change" in results["trend"]:
|
||||
print(f" Rate of Change: {results['trend']['rate_of_change']:.2f}% per interval")
|
||||
|
||||
# Anomaly detection
|
||||
if "anomalies" in results:
|
||||
anomaly_data = results["anomalies"]
|
||||
if anomaly_data["anomalies_detected"]:
|
||||
print(f"\n⚠️ ANOMALIES DETECTED: {anomaly_data['count']}")
|
||||
print(f" Mean: {anomaly_data['stats']['mean']:.2f}")
|
||||
print(f" Std Dev: {anomaly_data['stats']['stdev']:.2f}")
|
||||
print(f" Threshold: [{anomaly_data['stats']['threshold_lower']:.2f}, {anomaly_data['stats']['threshold_upper']:.2f}]")
|
||||
|
||||
print("\n Top Anomalies:")
|
||||
for anomaly in sorted(anomaly_data['anomalies'], key=lambda x: x['deviation'], reverse=True)[:5]:
|
||||
print(f" • Index {anomaly['index']}: {anomaly['value']:.2f} ({anomaly['deviation']:.2f}σ)")
|
||||
else:
|
||||
print("\n✅ No anomalies detected")
|
||||
|
||||
print("\n" + "="*60)
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Analyze metrics from Prometheus or CloudWatch",
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Examples:
|
||||
# Prometheus: Analyze request rate
|
||||
python3 analyze_metrics.py prometheus \\
|
||||
--endpoint http://localhost:9090 \\
|
||||
--query 'rate(http_requests_total[5m])' \\
|
||||
--hours 24
|
||||
|
||||
# CloudWatch: Analyze CPU utilization
|
||||
python3 analyze_metrics.py cloudwatch \\
|
||||
--namespace AWS/EC2 \\
|
||||
--metric CPUUtilization \\
|
||||
--dimensions InstanceId=i-1234567890abcdef0 \\
|
||||
--hours 48
|
||||
"""
|
||||
)
|
||||
|
||||
parser.add_argument('source', choices=['prometheus', 'cloudwatch'],
|
||||
help='Metric source')
|
||||
parser.add_argument('--endpoint', help='Prometheus endpoint URL')
|
||||
parser.add_argument('--query', help='PromQL query')
|
||||
parser.add_argument('--namespace', help='CloudWatch namespace')
|
||||
parser.add_argument('--metric', help='CloudWatch metric name')
|
||||
parser.add_argument('--dimensions', help='CloudWatch dimensions (key=value,key2=value2)')
|
||||
parser.add_argument('--hours', type=int, default=24, help='Hours of data to analyze (default: 24)')
|
||||
parser.add_argument('--sensitivity', type=float, default=2.0,
|
||||
help='Anomaly detection sensitivity (std deviations, default: 2.0)')
|
||||
parser.add_argument('--region', default='us-east-1', help='AWS region (default: us-east-1)')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
analyzer = MetricAnalyzer(args.source, args.endpoint, args.region)
|
||||
|
||||
# Query metrics
|
||||
if args.source == 'prometheus':
|
||||
if not args.query:
|
||||
print("❌ --query required for Prometheus")
|
||||
sys.exit(1)
|
||||
|
||||
print(f"🔍 Querying Prometheus: {args.query}")
|
||||
results = analyzer.query_prometheus(args.query, args.hours)
|
||||
|
||||
if not results:
|
||||
print("❌ No data returned")
|
||||
sys.exit(1)
|
||||
|
||||
# Extract values from first result series
|
||||
values = [float(v[1]) for v in results[0].get('values', [])]
|
||||
|
||||
elif args.source == 'cloudwatch':
|
||||
if not all([args.namespace, args.metric, args.dimensions]):
|
||||
print("❌ --namespace, --metric, and --dimensions required for CloudWatch")
|
||||
sys.exit(1)
|
||||
|
||||
dims = dict(item.split('=', 1) for item in args.dimensions.split(','))  # split on first '=' so values may contain '='
|
||||
|
||||
print(f"🔍 Querying CloudWatch: {args.namespace}/{args.metric}")
|
||||
results = analyzer.query_cloudwatch(args.namespace, args.metric, dims, args.hours)
|
||||
|
||||
if not results:
|
||||
print("❌ No data returned")
|
||||
sys.exit(1)
|
||||
|
||||
values = [point['Average'] for point in results]
|
||||
|
||||
# Analyze metrics
|
||||
analysis_results = {
|
||||
"data_points": len(values),
|
||||
"trend": analyzer.analyze_trend(values),
|
||||
"anomalies": analyzer.detect_anomalies(values, args.sensitivity)
|
||||
}
|
||||
|
||||
print_results(analysis_results)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
395
scripts/dashboard_generator.py
Normal file
@@ -0,0 +1,395 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Generate Grafana dashboards from templates.
|
||||
Supports: web applications, Kubernetes, databases, Redis, and custom metrics.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
import json
|
||||
from typing import Dict, List, Any, Optional
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
class DashboardGenerator:
|
||||
def __init__(self, title: str, datasource: str = "Prometheus"):
|
||||
self.title = title
|
||||
self.datasource = datasource
|
||||
self.dashboard = self._create_base_dashboard()
|
||||
self.panel_id = 1
|
||||
self.row_y = 0
|
||||
|
||||
def _create_base_dashboard(self) -> Dict[str, Any]:
|
||||
"""Create base dashboard structure."""
|
||||
return {
|
||||
"dashboard": {
|
||||
"title": self.title,
|
||||
"tags": [],
|
||||
"timezone": "browser",
|
||||
"schemaVersion": 16,
|
||||
"version": 0,
|
||||
"refresh": "30s",
|
||||
"panels": [],
|
||||
"templating": {
|
||||
"list": []
|
||||
},
|
||||
"time": {
|
||||
"from": "now-6h",
|
||||
"to": "now"
|
||||
}
|
||||
},
|
||||
"overwrite": True
|
||||
}
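# Note: the {"dashboard": {...}, "overwrite": true} wrapper mirrors the
# payload accepted by Grafana's dashboard import API, so the generated file
# can also be POSTed to /api/dashboards/db (assumption: a reasonably recent
# Grafana; verify against your version's API docs).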
|
||||
|
||||
def add_variable(self, name: str, label: str, query: str):
|
||||
"""Add a template variable."""
|
||||
variable = {
|
||||
"name": name,
|
||||
"label": label,
|
||||
"type": "query",
|
||||
"datasource": self.datasource,
|
||||
"query": query,
|
||||
"refresh": 1,
|
||||
"regex": "",
|
||||
"multi": False,
|
||||
"includeAll": False
|
||||
}
|
||||
self.dashboard["dashboard"]["templating"]["list"].append(variable)
|
||||
|
||||
def add_row(self, title: str):
|
||||
"""Add a row panel."""
|
||||
panel = {
|
||||
"id": self.panel_id,
|
||||
"type": "row",
|
||||
"title": title,
|
||||
"collapsed": False,
|
||||
"gridPos": {"h": 1, "w": 24, "x": 0, "y": self.row_y}
|
||||
}
|
||||
self.dashboard["dashboard"]["panels"].append(panel)
|
||||
self.panel_id += 1
|
||||
self.row_y += 1
|
||||
|
||||
def add_graph(self, title: str, targets: List[Dict[str, str]], unit: str = "short",
|
||||
width: int = 12, height: int = 8):
|
||||
"""Add a graph panel."""
|
||||
panel = {
|
||||
"id": self.panel_id,
|
||||
"type": "graph",
|
||||
"title": title,
|
||||
"datasource": self.datasource,
|
||||
"targets": [
|
||||
{
|
||||
"expr": target["query"],
|
||||
"legendFormat": target.get("legend", ""),
|
||||
"refId": chr(65 + i) # A, B, C, etc.
|
||||
}
|
||||
for i, target in enumerate(targets)
|
||||
],
|
||||
"gridPos": {"h": height, "w": width, "x": 0, "y": self.row_y},
|
||||
"yaxes": [
|
||||
{"format": unit, "label": None, "show": True},
|
||||
{"format": "short", "label": None, "show": True}
|
||||
],
|
||||
"lines": True,
|
||||
"fill": 1,
|
||||
"linewidth": 2,
|
||||
"legend": {
|
||||
"show": True,
|
||||
"alignAsTable": True,
|
||||
"avg": True,
|
||||
"current": True,
|
||||
"max": True,
|
||||
"min": False,
|
||||
"total": False,
|
||||
"values": True
|
||||
}
|
||||
}
|
||||
self.dashboard["dashboard"]["panels"].append(panel)
|
||||
self.panel_id += 1
|
||||
self.row_y += height
|
||||
|
||||
def add_stat(self, title: str, query: str, unit: str = "short",
|
||||
width: int = 6, height: int = 4):
|
||||
"""Add a stat panel (single value)."""
|
||||
panel = {
|
||||
"id": self.panel_id,
|
||||
"type": "stat",
|
||||
"title": title,
|
||||
"datasource": self.datasource,
|
||||
"targets": [
|
||||
{
|
||||
"expr": query,
|
||||
"refId": "A"
|
||||
}
|
||||
],
|
||||
"gridPos": {"h": height, "w": width, "x": 0, "y": self.row_y},
|
||||
"options": {
|
||||
"graphMode": "area",
|
||||
"orientation": "auto",
|
||||
"reduceOptions": {
|
||||
"values": False,
|
||||
"calcs": ["lastNotNull"]
|
||||
}
|
||||
},
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": unit,
|
||||
"thresholds": {
|
||||
"mode": "absolute",
|
||||
"steps": [
|
||||
{"value": None, "color": "green"},
|
||||
{"value": 80, "color": "red"}
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
self.dashboard["dashboard"]["panels"].append(panel)
|
||||
self.panel_id += 1
|
||||
|
||||
def generate_webapp_dashboard(self, service: str):
|
||||
"""Generate dashboard for web application."""
|
||||
self.add_variable("service", "Service", f"label_values({service}_http_requests_total, service)")
|
||||
|
||||
# Request metrics
|
||||
self.add_row("Request Metrics")
|
||||
|
||||
self.add_graph(
|
||||
"Request Rate",
|
||||
[{"query": f'sum(rate({service}_http_requests_total[5m])) by (status)', "legend": "{{status}}"}],
|
||||
unit="reqps",
|
||||
width=12
|
||||
)
|
||||
|
||||
self.add_graph(
|
||||
"Request Latency (p50, p95, p99)",
|
||||
[
|
||||
{"query": f'histogram_quantile(0.50, sum(rate({service}_http_request_duration_seconds_bucket[5m])) by (le))', "legend": "p50"},
|
||||
{"query": f'histogram_quantile(0.95, sum(rate({service}_http_request_duration_seconds_bucket[5m])) by (le))', "legend": "p95"},
|
||||
{"query": f'histogram_quantile(0.99, sum(rate({service}_http_request_duration_seconds_bucket[5m])) by (le))', "legend": "p99"}
|
||||
],
|
||||
unit="s",
|
||||
width=12
|
||||
)
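# Assumption: the service exposes Prometheus histogram buckets named
# <service>_http_request_duration_seconds_bucket; if your instrumentation
# uses different metric names, adjust the queries above accordingly.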
|
||||
|
||||
# Error rate
|
||||
self.add_row("Errors")
|
||||
|
||||
self.add_graph(
|
||||
"Error Rate (%)",
|
||||
[{"query": f'sum(rate({service}_http_requests_total{{status=~"5.."}}[5m])) / sum(rate({service}_http_requests_total[5m])) * 100', "legend": "Error Rate"}],
|
||||
unit="percent",
|
||||
width=12
|
||||
)
|
||||
|
||||
# Resource usage
|
||||
self.add_row("Resource Usage")
|
||||
|
||||
self.add_graph(
|
||||
"CPU Usage",
|
||||
[{"query": f'sum(rate(process_cpu_seconds_total{{job="{service}"}}[5m])) * 100', "legend": "CPU %"}],
|
||||
unit="percent",
|
||||
width=12
|
||||
)
|
||||
|
||||
self.add_graph(
|
||||
"Memory Usage",
|
||||
[{"query": f'process_resident_memory_bytes{{job="{service}"}}', "legend": "Memory"}],
|
||||
unit="bytes",
|
||||
width=12
|
||||
)
|
||||
|
||||
def generate_kubernetes_dashboard(self, namespace: str):
|
||||
"""Generate dashboard for Kubernetes cluster."""
|
||||
self.add_variable("namespace", "Namespace", f"label_values(kube_pod_info, namespace)")
|
||||
|
||||
# Cluster overview
|
||||
self.add_row("Cluster Overview")
|
||||
|
||||
self.add_stat("Total Pods", f'count(kube_pod_info{{namespace="{namespace}"}})', width=6)
|
||||
self.add_stat("Running Pods", f'count(kube_pod_status_phase{{namespace="{namespace}", phase="Running"}})', width=6)
|
||||
self.add_stat("Pending Pods", f'count(kube_pod_status_phase{{namespace="{namespace}", phase="Pending"}})', width=6)
|
||||
self.add_stat("Failed Pods", f'count(kube_pod_status_phase{{namespace="{namespace}", phase="Failed"}})', width=6)
|
||||
|
||||
# Resource usage
|
||||
self.add_row("Resource Usage")
|
||||
|
||||
self.add_graph(
|
||||
"CPU Usage by Pod",
|
||||
[{"query": f'sum(rate(container_cpu_usage_seconds_total{{namespace="{namespace}"}}[5m])) by (pod)', "legend": "{{pod}}"}],
|
||||
unit="percent",
|
||||
width=12
|
||||
)
|
||||
|
||||
self.add_graph(
|
||||
"Memory Usage by Pod",
|
||||
[{"query": f'sum(container_memory_usage_bytes{{namespace="{namespace}"}}) by (pod)', "legend": "{{pod}}"}],
|
||||
unit="bytes",
|
||||
width=12
|
||||
)
|
||||
|
||||
# Network
|
||||
self.add_row("Network")
|
||||
|
||||
self.add_graph(
|
||||
"Network I/O",
|
||||
[
|
||||
{"query": f'sum(rate(container_network_receive_bytes_total{{namespace="{namespace}"}}[5m])) by (pod)', "legend": "Receive - {{pod}}"},
|
||||
{"query": f'sum(rate(container_network_transmit_bytes_total{{namespace="{namespace}"}}[5m])) by (pod)', "legend": "Transmit - {{pod}}"}
|
||||
],
|
||||
unit="Bps",
|
||||
width=12
|
||||
)
|
||||
|
||||
def generate_database_dashboard(self, db_type: str, instance: str):
|
||||
"""Generate dashboard for database (postgres/mysql)."""
|
||||
if db_type == "postgres":
|
||||
self._generate_postgres_dashboard(instance)
|
||||
elif db_type == "mysql":
|
||||
self._generate_mysql_dashboard(instance)
|
||||
|
||||
def _generate_postgres_dashboard(self, instance: str):
|
||||
"""Generate PostgreSQL dashboard."""
|
||||
self.add_row("PostgreSQL Metrics")
|
||||
|
||||
self.add_graph(
|
||||
"Connections",
|
||||
[
|
||||
{"query": f'pg_stat_database_numbackends{{instance="{instance}"}}', "legend": "{{datname}}"}
|
||||
],
|
||||
unit="short",
|
||||
width=12
|
||||
)
|
||||
|
||||
self.add_graph(
|
||||
"Transactions per Second",
|
||||
[
|
||||
{"query": f'rate(pg_stat_database_xact_commit{{instance="{instance}"}}[5m])', "legend": "Commits"},
|
||||
{"query": f'rate(pg_stat_database_xact_rollback{{instance="{instance}"}}[5m])', "legend": "Rollbacks"}
|
||||
],
|
||||
unit="tps",
|
||||
width=12
|
||||
)
|
||||
|
||||
self.add_graph(
|
||||
"Query Duration (p95)",
|
||||
[
|
||||
{"query": f'histogram_quantile(0.95, rate(pg_stat_statements_total_time_bucket{{instance="{instance}"}}[5m]))', "legend": "p95"}
|
||||
],
|
||||
unit="ms",
|
||||
width=12
|
||||
)
|
||||
|
||||
def _generate_mysql_dashboard(self, instance: str):
|
||||
"""Generate MySQL dashboard."""
|
||||
self.add_row("MySQL Metrics")
|
||||
|
||||
self.add_graph(
|
||||
"Connections",
|
||||
[
|
||||
{"query": f'mysql_global_status_threads_connected{{instance="{instance}"}}', "legend": "Connected"},
|
||||
{"query": f'mysql_global_status_threads_running{{instance="{instance}"}}', "legend": "Running"}
|
||||
],
|
||||
unit="short",
|
||||
width=12
|
||||
)
|
||||
|
||||
self.add_graph(
|
||||
"Queries per Second",
|
||||
[
|
||||
{"query": f'rate(mysql_global_status_queries{{instance="{instance}"}}[5m])', "legend": "Queries"}
|
||||
],
|
||||
unit="qps",
|
||||
width=12
|
||||
)
|
||||
|
||||
def save(self, output_file: str):
|
||||
"""Save dashboard to file."""
|
||||
try:
|
||||
with open(output_file, 'w') as f:
|
||||
json.dump(self.dashboard, f, indent=2)
|
||||
return True
|
||||
except Exception as e:
|
||||
print(f"❌ Error saving dashboard: {e}")
|
||||
return False
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Generate Grafana dashboards from templates",
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Examples:
|
||||
# Web application dashboard
|
||||
python3 dashboard_generator.py webapp \\
|
||||
--title "My API Dashboard" \\
|
||||
--service my_api \\
|
||||
--output dashboard.json
|
||||
|
||||
# Kubernetes dashboard
|
||||
python3 dashboard_generator.py kubernetes \\
|
||||
--title "K8s Namespace" \\
|
||||
--namespace production \\
|
||||
--output k8s-dashboard.json
|
||||
|
||||
# Database dashboard
|
||||
python3 dashboard_generator.py database \\
|
||||
--title "PostgreSQL" \\
|
||||
--db-type postgres \\
|
||||
--instance db.example.com:5432 \\
|
||||
--output db-dashboard.json
|
||||
"""
|
||||
)
|
||||
|
||||
parser.add_argument('type', choices=['webapp', 'kubernetes', 'database'],
|
||||
help='Dashboard type')
|
||||
parser.add_argument('--title', required=True, help='Dashboard title')
|
||||
parser.add_argument('--output', required=True, help='Output file path')
|
||||
parser.add_argument('--datasource', default='Prometheus', help='Data source name')
|
||||
|
||||
# Web app specific
|
||||
parser.add_argument('--service', help='Service name (for webapp)')
|
||||
|
||||
# Kubernetes specific
|
||||
parser.add_argument('--namespace', help='Kubernetes namespace')
|
||||
|
||||
# Database specific
|
||||
parser.add_argument('--db-type', choices=['postgres', 'mysql'], help='Database type')
|
||||
parser.add_argument('--instance', help='Database instance')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
print(f"🎨 Generating {args.type} dashboard: {args.title}")
|
||||
|
||||
generator = DashboardGenerator(args.title, args.datasource)
|
||||
|
||||
if args.type == 'webapp':
|
||||
if not args.service:
|
||||
print("❌ --service required for webapp dashboard")
|
||||
sys.exit(1)
|
||||
generator.generate_webapp_dashboard(args.service)
|
||||
|
||||
elif args.type == 'kubernetes':
|
||||
if not args.namespace:
|
||||
print("❌ --namespace required for kubernetes dashboard")
|
||||
sys.exit(1)
|
||||
generator.generate_kubernetes_dashboard(args.namespace)
|
||||
|
||||
elif args.type == 'database':
|
||||
if not args.db_type or not args.instance:
|
||||
print("❌ --db-type and --instance required for database dashboard")
|
||||
sys.exit(1)
|
||||
generator.generate_database_dashboard(args.db_type, args.instance)
|
||||
|
||||
if generator.save(args.output):
|
||||
print(f"✅ Dashboard saved to: {args.output}")
|
||||
print(f"\n📝 Import to Grafana:")
|
||||
print(f" 1. Go to Grafana → Dashboards → Import")
|
||||
print(f" 2. Upload {args.output}")
|
||||
print(f" 3. Select datasource and save")
|
||||
else:
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
477
scripts/datadog_cost_analyzer.py
Normal file
@@ -0,0 +1,477 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Analyze Datadog usage and identify cost optimization opportunities.
|
||||
Helps find waste in custom metrics, logs, APM, and infrastructure monitoring.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
import os
|
||||
from datetime import datetime, timedelta
|
||||
from typing import Dict, List, Any, Optional
|
||||
from collections import defaultdict
|
||||
|
||||
try:
|
||||
import requests
|
||||
except ImportError:
|
||||
print("⚠️ Warning: 'requests' library not found. Install with: pip install requests")
|
||||
sys.exit(1)
|
||||
|
||||
try:
|
||||
from tabulate import tabulate
|
||||
except ImportError:
|
||||
tabulate = None
|
||||
|
||||
|
||||
class DatadogCostAnalyzer:
|
||||
# Pricing (as of 2024-2025)
|
||||
PRICING = {
|
||||
'infrastructure_pro': 15, # per host per month
|
||||
'infrastructure_enterprise': 23,
|
||||
'custom_metric': 0.01, # per metric per month (first 100 free per host)
|
||||
'log_ingestion': 0.10, # per GB ingested per month
|
||||
'apm_host': 31, # APM Pro per host per month
|
||||
'apm_span': 1.70, # per million indexed spans
|
||||
}
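# Illustrative cost check using the assumed list prices above:
# 50 Pro hosts = 50 * $15 = $750/month before custom metrics, logs, and APM.
# Prices change over time; treat these figures as rough estimates only.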
|
||||
|
||||
def __init__(self, api_key: str, app_key: str, site: str = "datadoghq.com"):
|
||||
self.api_key = api_key
|
||||
self.app_key = app_key
|
||||
self.site = site
|
||||
self.base_url = f"https://api.{site}"
|
||||
self.headers = {
|
||||
'DD-API-KEY': api_key,
|
||||
'DD-APPLICATION-KEY': app_key,
|
||||
'Content-Type': 'application/json'
|
||||
}
|
||||
|
||||
def _make_request(self, endpoint: str, params: Optional[Dict] = None) -> Dict:
|
||||
"""Make API request to Datadog."""
|
||||
try:
|
||||
url = f"{self.base_url}{endpoint}"
|
||||
response = requests.get(url, headers=self.headers, params=params, timeout=30)
|
||||
response.raise_for_status()
|
||||
return response.json()
|
||||
except requests.exceptions.RequestException as e:
|
||||
print(f"❌ API Error: {e}")
|
||||
return {}
|
||||
|
||||
def get_usage_metrics(self, start_date: str, end_date: str) -> List[Dict[str, Any]]:
|
||||
"""Get usage metrics for specified date range."""
|
||||
endpoint = "/api/v1/usage/summary"
|
||||
params = {
|
||||
'start_month': start_date,
|
||||
'end_month': end_date,
|
||||
'include_org_details': 'true'
|
||||
}
|
||||
|
||||
data = self._make_request(endpoint, params)
|
||||
return data.get('usage', [])
|
||||
|
||||
def get_custom_metrics(self) -> Dict[str, Any]:
|
||||
"""Get custom metrics usage and identify high-cardinality metrics."""
|
||||
endpoint = "/api/v1/usage/timeseries"
|
||||
|
||||
# Get last 30 days
|
||||
end_date = datetime.now()
|
||||
start_date = end_date - timedelta(days=30)
|
||||
|
||||
params = {
|
||||
'start_hr': int(start_date.timestamp()),
|
||||
'end_hr': int(end_date.timestamp())
|
||||
}
|
||||
|
||||
data = self._make_request(endpoint, params)
|
||||
|
||||
if not data:
|
||||
return {'metrics': [], 'total_count': 0}
|
||||
|
||||
# Extract custom metrics info
|
||||
usage_data = data.get('usage', [])
|
||||
|
||||
metrics_summary = {
|
||||
'total_custom_metrics': 0,
|
||||
'avg_custom_metrics': 0,
|
||||
'billable_metrics': 0
|
||||
}
|
||||
|
||||
for day in usage_data:
|
||||
if 'timeseries' in day:
|
||||
for ts in day['timeseries']:
|
||||
if ts.get('metric_category') == 'custom':
|
||||
metrics_summary['total_custom_metrics'] = max(
|
||||
metrics_summary['total_custom_metrics'],
|
||||
ts.get('num_custom_timeseries', 0)
|
||||
)
|
||||
|
||||
# Calculate billable (first 100 free)
|
||||
metrics_summary['billable_metrics'] = max(0, metrics_summary['total_custom_metrics'] - 100)
|
||||
|
||||
return metrics_summary
|
||||
|
||||
def get_infrastructure_hosts(self) -> Dict[str, Any]:
|
||||
"""Get infrastructure host count and breakdown."""
|
||||
endpoint = "/api/v1/usage/hosts"
|
||||
|
||||
end_date = datetime.now()
|
||||
start_date = end_date - timedelta(days=30)
|
||||
|
||||
params = {
|
||||
'start_hr': int(start_date.timestamp()),
|
||||
'end_hr': int(end_date.timestamp())
|
||||
}
|
||||
|
||||
data = self._make_request(endpoint, params)
|
||||
|
||||
if not data:
|
||||
return {'total_hosts': 0}
|
||||
|
||||
usage = data.get('usage', [])
|
||||
|
||||
host_summary = {
|
||||
'total_hosts': 0,
|
||||
'agent_hosts': 0,
|
||||
'aws_hosts': 0,
|
||||
'azure_hosts': 0,
|
||||
'gcp_hosts': 0,
|
||||
'container_count': 0
|
||||
}
|
||||
|
||||
for day in usage:
|
||||
host_summary['total_hosts'] = max(host_summary['total_hosts'], day.get('host_count', 0))
|
||||
host_summary['agent_hosts'] = max(host_summary['agent_hosts'], day.get('agent_host_count', 0))
|
||||
host_summary['aws_hosts'] = max(host_summary['aws_hosts'], day.get('aws_host_count', 0))
|
||||
host_summary['azure_hosts'] = max(host_summary['azure_hosts'], day.get('azure_host_count', 0))
|
||||
host_summary['gcp_hosts'] = max(host_summary['gcp_hosts'], day.get('gcp_host_count', 0))
|
||||
host_summary['container_count'] = max(host_summary['container_count'], day.get('container_count', 0))
|
||||
|
||||
return host_summary
|
||||
|
||||
def get_log_usage(self) -> Dict[str, Any]:
|
||||
"""Get log ingestion and retention usage."""
|
||||
endpoint = "/api/v1/usage/logs"
|
||||
|
||||
end_date = datetime.now()
|
||||
start_date = end_date - timedelta(days=30)
|
||||
|
||||
params = {
|
||||
'start_hr': int(start_date.timestamp()),
|
||||
'end_hr': int(end_date.timestamp())
|
||||
}
|
||||
|
||||
data = self._make_request(endpoint, params)
|
||||
|
||||
if not data:
|
||||
return {'total_gb': 0, 'daily_avg_gb': 0}
|
||||
|
||||
usage = data.get('usage', [])
|
||||
|
||||
total_ingested = 0
|
||||
days_count = len(usage)
|
||||
|
||||
for day in usage:
|
||||
total_ingested += day.get('ingested_events_bytes', 0)
|
||||
|
||||
total_gb = total_ingested / (1024**3) # Convert to GB
|
||||
daily_avg_gb = total_gb / max(days_count, 1)
|
||||
|
||||
return {
|
||||
'total_gb': total_gb,
|
||||
'daily_avg_gb': daily_avg_gb,
|
||||
'monthly_projected_gb': daily_avg_gb * 30
|
||||
}
|
||||
|
||||
def get_unused_monitors(self) -> List[Dict[str, Any]]:
|
||||
"""Find monitors that haven't alerted in 30+ days."""
|
||||
endpoint = "/api/v1/monitor"
|
||||
|
||||
data = self._make_request(endpoint)
|
||||
|
||||
if not data:
|
||||
return []
|
||||
|
||||
monitors = data if isinstance(data, list) else []
|
||||
|
||||
unused = []
|
||||
now = datetime.now()
|
||||
|
||||
for monitor in monitors:
|
||||
# Check if monitor has triggered recently
|
||||
overall_state = monitor.get('overall_state')
|
||||
modified = monitor.get('modified', '')
|
||||
|
||||
# If monitor has been in OK state and not modified in 30+ days
|
||||
try:
|
||||
if modified:
|
||||
mod_date = datetime.fromisoformat(modified.replace('Z', '+00:00'))
|
||||
days_since_modified = (now - mod_date.replace(tzinfo=None)).days
|
||||
|
||||
if days_since_modified > 30 and overall_state in ['OK', 'No Data']:
|
||||
unused.append({
|
||||
'name': monitor.get('name', 'Unknown'),
|
||||
'id': monitor.get('id'),
|
||||
'days_since_modified': days_since_modified,
|
||||
'state': overall_state
|
||||
})
|
||||
except (ValueError, AttributeError):  # skip monitors with malformed 'modified' timestamps
|
||||
pass
|
||||
|
||||
return unused
|
||||
|
||||
def calculate_costs(self, usage_data: Dict[str, Any]) -> Dict[str, float]:
|
||||
"""Calculate estimated monthly costs."""
|
||||
costs = {
|
||||
'infrastructure': 0,
|
||||
'custom_metrics': 0,
|
||||
'logs': 0,
|
||||
'apm': 0,
|
||||
'total': 0
|
||||
}
|
||||
|
||||
# Infrastructure (assuming Pro tier)
|
||||
if 'hosts' in usage_data:
|
||||
costs['infrastructure'] = usage_data['hosts'].get('total_hosts', 0) * self.PRICING['infrastructure_pro']
|
||||
|
||||
# Custom metrics
|
||||
if 'custom_metrics' in usage_data:
|
||||
billable = usage_data['custom_metrics'].get('billable_metrics', 0)
|
||||
costs['custom_metrics'] = billable * self.PRICING['custom_metric']
|
||||
|
||||
# Logs
|
||||
if 'logs' in usage_data:
|
||||
monthly_gb = usage_data['logs'].get('monthly_projected_gb', 0)
|
||||
costs['logs'] = monthly_gb * self.PRICING['log_ingestion']
|
||||
|
||||
costs['total'] = sum(costs.values())
|
||||
|
||||
return costs
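# Worked example with assumed inputs: 40 hosts, 2,500 billable custom metrics,
# and 200 GB/month of logs => 40*$15 + 2,500*$0.01 + 200*$0.10 = $645/month.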
|
||||
|
||||
def get_recommendations(self, usage_data: Dict[str, Any]) -> List[Dict[str, str]]:
|
||||
"""Generate cost optimization recommendations."""
|
||||
recommendations = []
|
||||
|
||||
# Custom metrics recommendations
|
||||
if 'custom_metrics' in usage_data:
|
||||
billable = usage_data['custom_metrics'].get('billable_metrics', 0)
|
||||
if billable > 500:
|
||||
savings = (billable * 0.3) * self.PRICING['custom_metric'] # Assume 30% reduction possible
|
||||
recommendations.append({
|
||||
'category': 'Custom Metrics',
|
||||
'issue': f'High custom metric count: {billable:,} billable metrics',
|
||||
'action': 'Review metric tags for high cardinality, consider aggregating or dropping unused metrics',
|
||||
'potential_savings': f'${savings:.2f}/month'
|
||||
})
|
||||
|
||||
# Container vs VM recommendations
|
||||
if 'hosts' in usage_data:
|
||||
hosts = usage_data['hosts'].get('total_hosts', 0)
|
||||
containers = usage_data['hosts'].get('container_count', 0)
|
||||
|
||||
if containers > hosts * 10: # Many containers per host
|
||||
savings = hosts * 0.2 * self.PRICING['infrastructure_pro']
|
||||
recommendations.append({
|
||||
'category': 'Infrastructure',
|
||||
'issue': f'{containers:,} containers running on {hosts} hosts',
|
||||
'action': 'Consider using container monitoring instead of host-based (can be 50-70% cheaper)',
|
||||
'potential_savings': f'${savings:.2f}/month'
|
||||
})
|
||||
|
||||
# Unused monitors
|
||||
if 'unused_monitors' in usage_data:
|
||||
count = len(usage_data['unused_monitors'])
|
||||
if count > 10:
|
||||
recommendations.append({
|
||||
'category': 'Monitors',
|
||||
'issue': f'{count} monitors unused for 30+ days',
|
||||
'action': 'Delete or disable unused monitors to reduce noise and improve performance',
|
||||
'potential_savings': 'Operational efficiency'
|
||||
})
|
||||
|
||||
# Log volume recommendations
|
||||
if 'logs' in usage_data:
|
||||
monthly_gb = usage_data['logs'].get('monthly_projected_gb', 0)
|
||||
if monthly_gb > 100:
|
||||
savings = (monthly_gb * 0.4) * self.PRICING['log_ingestion'] # 40% reduction
|
||||
recommendations.append({
|
||||
'category': 'Logs',
|
||||
'issue': f'High log volume: {monthly_gb:.1f} GB/month projected',
|
||||
'action': 'Review log sources, implement sampling for debug logs, exclude health checks',
|
||||
'potential_savings': f'${savings:.2f}/month'
|
||||
})
|
||||
|
||||
# Migration recommendation if costs are high
|
||||
costs = self.calculate_costs(usage_data)
|
||||
if costs['total'] > 5000:
|
||||
oss_cost = usage_data.get('hosts', {}).get('total_hosts', 0) * 15  # Rough estimate for self-hosted
|
||||
savings = costs['total'] - oss_cost
|
||||
recommendations.append({
|
||||
'category': 'Strategic',
|
||||
'issue': f'Total monthly cost: ${costs["total"]:.2f}',
|
||||
'action': 'Consider migrating to open-source stack (Prometheus + Grafana + Loki)',
|
||||
'potential_savings': f'${savings:.2f}/month (~{(savings/costs["total"]*100):.0f}% reduction)'
|
||||
})
|
||||
|
||||
return recommendations
|
||||
|
||||
|
||||
def print_usage_summary(usage_data: Dict[str, Any]):
|
||||
"""Print usage summary."""
|
||||
print("\n" + "="*70)
|
||||
print("📊 DATADOG USAGE SUMMARY")
|
||||
print("="*70)
|
||||
|
||||
# Infrastructure
|
||||
if 'hosts' in usage_data:
|
||||
hosts = usage_data['hosts']
|
||||
print(f"\n🖥️ Infrastructure:")
|
||||
print(f" Total Hosts: {hosts.get('total_hosts', 0):,}")
|
||||
print(f" Agent Hosts: {hosts.get('agent_hosts', 0):,}")
|
||||
print(f" AWS Hosts: {hosts.get('aws_hosts', 0):,}")
|
||||
print(f" Azure Hosts: {hosts.get('azure_hosts', 0):,}")
|
||||
print(f" GCP Hosts: {hosts.get('gcp_hosts', 0):,}")
|
||||
print(f" Containers: {hosts.get('container_count', 0):,}")
|
||||
|
||||
# Custom Metrics
|
||||
if 'custom_metrics' in usage_data:
|
||||
metrics = usage_data['custom_metrics']
|
||||
print(f"\n📈 Custom Metrics:")
|
||||
print(f" Total: {metrics.get('total_custom_metrics', 0):,}")
|
||||
print(f" Billable: {metrics.get('billable_metrics', 0):,} (first 100 free)")
|
||||
|
||||
# Logs
|
||||
if 'logs' in usage_data:
|
||||
logs = usage_data['logs']
|
||||
print(f"\n📝 Logs:")
|
||||
print(f" Daily Average: {logs.get('daily_avg_gb', 0):.2f} GB")
|
||||
print(f" Monthly Projected: {logs.get('monthly_projected_gb', 0):.2f} GB")
|
||||
|
||||
# Unused Monitors
|
||||
if 'unused_monitors' in usage_data:
|
||||
print(f"\n🔔 Unused Monitors:")
|
||||
print(f" Count: {len(usage_data['unused_monitors'])}")
|
||||
|
||||
|
||||
def print_cost_breakdown(costs: Dict[str, float]):
|
||||
"""Print cost breakdown."""
|
||||
print("\n" + "="*70)
|
||||
print("💰 ESTIMATED MONTHLY COSTS")
|
||||
print("="*70)
|
||||
|
||||
print(f"\n Infrastructure Monitoring: ${costs['infrastructure']:,.2f}")
|
||||
print(f" Custom Metrics: ${costs['custom_metrics']:,.2f}")
|
||||
print(f" Log Management: ${costs['logs']:,.2f}")
|
||||
print(f" APM: ${costs['apm']:,.2f}")
|
||||
print(f" " + "-"*40)
|
||||
print(f" TOTAL: ${costs['total']:,.2f}/month")
|
||||
print(f" ${costs['total']*12:,.2f}/year")
|
||||
|
||||
|
||||
def print_recommendations(recommendations: List[Dict]):
|
||||
"""Print recommendations."""
|
||||
print("\n" + "="*70)
|
||||
print("💡 COST OPTIMIZATION RECOMMENDATIONS")
|
||||
print("="*70)
|
||||
|
||||
total_savings = 0
|
||||
|
||||
for i, rec in enumerate(recommendations, 1):
|
||||
print(f"\n{i}. {rec['category']}")
|
||||
print(f" Issue: {rec['issue']}")
|
||||
print(f" Action: {rec['action']}")
|
||||
print(f" Potential Savings: {rec['potential_savings']}")
|
||||
|
||||
# Extract savings amount if it's a dollar value
|
||||
if '$' in rec['potential_savings']:
|
||||
try:
|
||||
amount = float(rec['potential_savings'].replace('$', '').replace('/month', '').replace(',', ''))
|
||||
total_savings += amount
|
||||
except ValueError:  # savings string is not a plain dollar amount
|
||||
pass
|
||||
|
||||
if total_savings > 0:
|
||||
print(f"\n{'='*70}")
|
||||
print(f"💵 Total Potential Monthly Savings: ${total_savings:,.2f}")
|
||||
print(f"💵 Total Potential Annual Savings: ${total_savings*12:,.2f}")
|
||||
print(f"{'='*70}")
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Analyze Datadog usage and identify cost optimization opportunities",
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Examples:
|
||||
# Analyze current usage
|
||||
python3 datadog_cost_analyzer.py \\
|
||||
--api-key DD_API_KEY \\
|
||||
--app-key DD_APP_KEY
|
||||
|
||||
# Use environment variables
|
||||
export DD_API_KEY=your_api_key
|
||||
export DD_APP_KEY=your_app_key
|
||||
python3 datadog_cost_analyzer.py
|
||||
|
||||
# Specify site (for EU)
|
||||
python3 datadog_cost_analyzer.py --site datadoghq.eu
|
||||
|
||||
Required Datadog Permissions:
|
||||
- usage_read
|
||||
- monitors_read
|
||||
"""
|
||||
)
|
||||
|
||||
parser.add_argument('--api-key',
|
||||
default=os.environ.get('DD_API_KEY'),
|
||||
help='Datadog API key (or set DD_API_KEY env var)')
|
||||
parser.add_argument('--app-key',
|
||||
default=os.environ.get('DD_APP_KEY'),
|
||||
help='Datadog Application key (or set DD_APP_KEY env var)')
|
||||
parser.add_argument('--site',
|
||||
default='datadoghq.com',
|
||||
help='Datadog site (default: datadoghq.com, EU: datadoghq.eu)')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if not args.api_key or not args.app_key:
|
||||
print("❌ Error: API key and Application key required")
|
||||
print(" Set via --api-key and --app-key flags or DD_API_KEY and DD_APP_KEY env vars")
|
||||
sys.exit(1)
|
||||
|
||||
print("🔍 Analyzing Datadog usage...")
|
||||
print(" This may take 30-60 seconds...\n")
|
||||
|
||||
analyzer = DatadogCostAnalyzer(args.api_key, args.app_key, args.site)
|
||||
|
||||
# Gather usage data
|
||||
usage_data = {}
|
||||
|
||||
print(" ⏳ Fetching infrastructure usage...")
|
||||
usage_data['hosts'] = analyzer.get_infrastructure_hosts()
|
||||
|
||||
print(" ⏳ Fetching custom metrics...")
|
||||
usage_data['custom_metrics'] = analyzer.get_custom_metrics()
|
||||
|
||||
print(" ⏳ Fetching log usage...")
|
||||
usage_data['logs'] = analyzer.get_log_usage()
|
||||
|
||||
print(" ⏳ Finding unused monitors...")
|
||||
usage_data['unused_monitors'] = analyzer.get_unused_monitors()
|
||||
|
||||
# Calculate costs
|
||||
costs = analyzer.calculate_costs(usage_data)
|
||||
|
||||
# Generate recommendations
|
||||
recommendations = analyzer.get_recommendations(usage_data)
|
||||
|
||||
# Print results
|
||||
print_usage_summary(usage_data)
|
||||
print_cost_breakdown(costs)
|
||||
print_recommendations(recommendations)
|
||||
|
||||
print("\n" + "="*70)
|
||||
print("✅ Analysis complete!")
|
||||
print("="*70)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
297
scripts/health_check_validator.py
Normal file
@@ -0,0 +1,297 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Validate health check endpoints and analyze response quality.
|
||||
Checks: response time, status code, response format, dependencies.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
import time
|
||||
import json
|
||||
from typing import Dict, List, Any, Optional
|
||||
from urllib.parse import urlparse
|
||||
|
||||
try:
|
||||
import requests
|
||||
except ImportError:
|
||||
print("⚠️ Warning: 'requests' library not found. Install with: pip install requests")
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
class HealthCheckValidator:
|
||||
def __init__(self, timeout: int = 5):
|
||||
self.timeout = timeout
|
||||
self.results = []
|
||||
|
||||
def validate_endpoint(self, url: str) -> Dict[str, Any]:
|
||||
"""Validate a health check endpoint."""
|
||||
result = {
|
||||
"url": url,
|
||||
"timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
|
||||
"checks": [],
|
||||
"warnings": [],
|
||||
"errors": []
|
||||
}
|
||||
|
||||
try:
|
||||
# Make request
|
||||
start_time = time.time()
|
||||
response = requests.get(url, timeout=self.timeout, verify=True)
|
||||
response_time = time.time() - start_time
|
||||
|
||||
result["status_code"] = response.status_code
|
||||
result["response_time"] = response_time
|
||||
|
||||
# Check 1: Status code
|
||||
if response.status_code == 200:
|
||||
result["checks"].append("✅ Status code is 200")
|
||||
else:
|
||||
result["errors"].append(f"❌ Unexpected status code: {response.status_code} (expected 200)")
|
||||
|
||||
# Check 2: Response time
|
||||
if response_time < 1.0:
|
||||
result["checks"].append(f"✅ Response time: {response_time:.3f}s (< 1s)")
|
||||
elif response_time < 3.0:
|
||||
result["warnings"].append(f"⚠️ Slow response time: {response_time:.3f}s (should be < 1s)")
|
||||
else:
|
||||
result["errors"].append(f"❌ Very slow response time: {response_time:.3f}s (should be < 1s)")
|
||||
|
||||
# Check 3: Content type
|
||||
content_type = response.headers.get('Content-Type', '')
|
||||
if 'application/json' in content_type:
|
||||
result["checks"].append("✅ Content-Type is application/json")
|
||||
|
||||
# Try to parse JSON
|
||||
try:
|
||||
data = response.json()
|
||||
result["response_data"] = data
|
||||
|
||||
# Check for common health check fields
|
||||
self._validate_json_structure(data, result)
|
||||
|
||||
except json.JSONDecodeError:
|
||||
result["errors"].append("❌ Invalid JSON response")
|
||||
elif 'text/plain' in content_type:
|
||||
result["warnings"].append("⚠️ Content-Type is text/plain (JSON recommended)")
|
||||
result["response_data"] = response.text
|
||||
else:
|
||||
result["warnings"].append(f"⚠️ Unexpected Content-Type: {content_type}")
|
||||
|
||||
# Check 4: Response headers
|
||||
self._validate_headers(response.headers, result)
|
||||
|
||||
except requests.exceptions.Timeout:
|
||||
result["errors"].append(f"❌ Request timeout (> {self.timeout}s)")
|
||||
result["status_code"] = None
|
||||
result["response_time"] = None
|
||||
|
||||
except requests.exceptions.ConnectionError:
|
||||
result["errors"].append("❌ Connection error (endpoint unreachable)")
|
||||
result["status_code"] = None
|
||||
result["response_time"] = None
|
||||
|
||||
except requests.exceptions.SSLError:
|
||||
result["errors"].append("❌ SSL certificate validation failed")
|
||||
result["status_code"] = None
|
||||
result["response_time"] = None
|
||||
|
||||
except Exception as e:
|
||||
result["errors"].append(f"❌ Unexpected error: {str(e)}")
|
||||
result["status_code"] = None
|
||||
result["response_time"] = None
|
||||
|
||||
# Overall status
|
||||
if result["errors"]:
|
||||
result["overall_status"] = "UNHEALTHY"
|
||||
elif result["warnings"]:
|
||||
result["overall_status"] = "DEGRADED"
|
||||
else:
|
||||
result["overall_status"] = "HEALTHY"
|
||||
|
||||
return result
|
||||
|
||||
def _validate_json_structure(self, data: Dict[str, Any], result: Dict[str, Any]):
|
||||
"""Validate JSON health check structure."""
|
||||
# Check for status field
|
||||
if "status" in data:
|
||||
status = data["status"]
|
||||
if status in ["ok", "healthy", "up", "pass"]:
|
||||
result["checks"].append(f"✅ Status field present: '{status}'")
|
||||
else:
|
||||
result["warnings"].append(f"⚠️ Status field has unexpected value: '{status}'")
|
||||
else:
|
||||
result["warnings"].append("⚠️ Missing 'status' field (recommended)")
|
||||
|
||||
# Check for version/build info
|
||||
if any(key in data for key in ["version", "build", "commit", "timestamp"]):
|
||||
result["checks"].append("✅ Version/build information present")
|
||||
else:
|
||||
result["warnings"].append("⚠️ No version/build information (recommended)")
|
||||
|
||||
# Check for dependencies
|
||||
if "dependencies" in data or "checks" in data or "components" in data:
|
||||
result["checks"].append("✅ Dependency checks present")
|
||||
|
||||
# Validate dependency structure
|
||||
deps = data.get("dependencies") or data.get("checks") or data.get("components")
|
||||
if isinstance(deps, dict):
|
||||
unhealthy_deps = []
|
||||
for name, info in deps.items():
|
||||
if isinstance(info, dict):
|
||||
dep_status = info.get("status", "unknown")
|
||||
if dep_status not in ["ok", "healthy", "up", "pass"]:
|
||||
unhealthy_deps.append(name)
|
||||
elif isinstance(info, str):
|
||||
if info not in ["ok", "healthy", "up", "pass"]:
|
||||
unhealthy_deps.append(name)
|
||||
|
||||
if unhealthy_deps:
|
||||
result["warnings"].append(f"⚠️ Unhealthy dependencies: {', '.join(unhealthy_deps)}")
|
||||
else:
|
||||
result["checks"].append(f"✅ All dependencies healthy ({len(deps)} checked)")
|
||||
else:
|
||||
result["warnings"].append("⚠️ No dependency checks (recommended for production services)")
|
||||
|
||||
# Check for uptime/metrics
|
||||
if any(key in data for key in ["uptime", "metrics", "stats"]):
|
||||
result["checks"].append("✅ Metrics/stats present")
|
||||
|
||||
def _validate_headers(self, headers: Dict[str, str], result: Dict[str, Any]):
|
||||
"""Validate response headers."""
|
||||
# Check for caching headers
|
||||
cache_control = headers.get('Cache-Control', '')
|
||||
if 'no-cache' in cache_control or 'no-store' in cache_control:
|
||||
result["checks"].append("✅ Caching disabled (Cache-Control: no-cache)")
|
||||
else:
|
||||
result["warnings"].append("⚠️ Caching not explicitly disabled (add Cache-Control: no-cache)")
|
||||
|
||||
def validate_multiple(self, urls: List[str]) -> List[Dict[str, Any]]:
|
||||
"""Validate multiple health check endpoints."""
|
||||
results = []
|
||||
for url in urls:
|
||||
print(f"🔍 Checking: {url}")
|
||||
result = self.validate_endpoint(url)
|
||||
results.append(result)
|
||||
return results
|
||||
|
||||
|
||||
def print_result(result: Dict[str, Any], verbose: bool = False):
|
||||
"""Print validation result."""
|
||||
status_emoji = {
|
||||
"HEALTHY": "✅",
|
||||
"DEGRADED": "⚠️",
|
||||
"UNHEALTHY": "❌"
|
||||
}
|
||||
|
||||
print("\n" + "="*60)
|
||||
emoji = status_emoji.get(result["overall_status"], "❓")
|
||||
print(f"{emoji} {result['overall_status']}: {result['url']}")
|
||||
print("="*60)
|
||||
|
||||
if result.get("status_code"):
|
||||
print(f"\n📊 Status Code: {result['status_code']}")
|
||||
print(f"⏱️ Response Time: {result['response_time']:.3f}s")
|
||||
|
||||
# Print checks
|
||||
if result["checks"]:
|
||||
print(f"\n✅ Passed Checks:")
|
||||
for check in result["checks"]:
|
||||
print(f" {check}")
|
||||
|
||||
# Print warnings
|
||||
if result["warnings"]:
|
||||
print(f"\n⚠️ Warnings:")
|
||||
for warning in result["warnings"]:
|
||||
print(f" {warning}")
|
||||
|
||||
# Print errors
|
||||
if result["errors"]:
|
||||
print(f"\n❌ Errors:")
|
||||
for error in result["errors"]:
|
||||
print(f" {error}")
|
||||
|
||||
# Print response data if verbose
|
||||
if verbose and "response_data" in result:
|
||||
print(f"\n📄 Response Data:")
|
||||
if isinstance(result["response_data"], dict):
|
||||
print(json.dumps(result["response_data"], indent=2))
|
||||
else:
|
||||
print(result["response_data"])
|
||||
|
||||
print("="*60)
|
||||
|
||||
|
||||
def print_summary(results: List[Dict[str, Any]]):
|
||||
"""Print summary of multiple validations."""
|
||||
print("\n" + "="*60)
|
||||
print("📊 HEALTH CHECK VALIDATION SUMMARY")
|
||||
print("="*60)
|
||||
|
||||
healthy = sum(1 for r in results if r["overall_status"] == "HEALTHY")
|
||||
degraded = sum(1 for r in results if r["overall_status"] == "DEGRADED")
|
||||
unhealthy = sum(1 for r in results if r["overall_status"] == "UNHEALTHY")
|
||||
|
||||
print(f"\n✅ Healthy: {healthy}/{len(results)}")
|
||||
print(f"⚠️ Degraded: {degraded}/{len(results)}")
|
||||
print(f"❌ Unhealthy: {unhealthy}/{len(results)}")
|
||||
|
||||
if results:
|
||||
timed = [r["response_time"] for r in results if r.get("response_time")]
avg_response_time = sum(timed) / len(timed) if timed else 0.0
|
||||
print(f"\n⏱️ Average Response Time: {avg_response_time:.3f}s")
|
||||
|
||||
print("="*60)
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Validate health check endpoints",
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Examples:
|
||||
# Check a single endpoint
|
||||
python3 health_check_validator.py https://api.example.com/health
|
||||
|
||||
# Check multiple endpoints
|
||||
python3 health_check_validator.py \\
|
||||
https://api.example.com/health \\
|
||||
https://api.example.com/readiness
|
||||
|
||||
# Verbose output with response data
|
||||
python3 health_check_validator.py https://api.example.com/health --verbose
|
||||
|
||||
# Custom timeout
|
||||
python3 health_check_validator.py https://api.example.com/health --timeout 10
|
||||
|
||||
Best Practices Checked:
|
||||
✓ Returns 200 status code
|
||||
✓ Response time < 1 second
|
||||
✓ Returns JSON format
|
||||
✓ Contains 'status' field
|
||||
✓ Includes version/build info
|
||||
✓ Checks dependencies
|
||||
✓ Includes metrics
|
||||
✓ Disables caching
|
||||
"""
|
||||
)
|
||||
|
||||
parser.add_argument('urls', nargs='+', help='Health check endpoint URL(s)')
|
||||
parser.add_argument('--timeout', type=int, default=5, help='Request timeout in seconds (default: 5)')
|
||||
parser.add_argument('--verbose', action='store_true', help='Show detailed response data')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
validator = HealthCheckValidator(timeout=args.timeout)
|
||||
|
||||
results = validator.validate_multiple(args.urls)
|
||||
|
||||
# Print individual results
|
||||
for result in results:
|
||||
print_result(result, args.verbose)
|
||||
|
||||
# Print summary if multiple endpoints
|
||||
if len(results) > 1:
|
||||
print_summary(results)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
321
scripts/log_analyzer.py
Normal file
@@ -0,0 +1,321 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Parse and analyze logs for patterns, errors, and anomalies.
|
||||
Supports: error detection, frequency analysis, pattern matching.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
import re
|
||||
import json
|
||||
from collections import Counter, defaultdict
|
||||
from datetime import datetime
|
||||
from typing import Dict, List, Any, Optional
|
||||
from pathlib import Path
|
||||
|
||||
try:
|
||||
from tabulate import tabulate
|
||||
except ImportError:
|
||||
tabulate = None
|
||||
|
||||
|
||||
class LogAnalyzer:
|
||||
# Common log level patterns
|
||||
LOG_LEVELS = {
|
||||
'ERROR': r'\b(ERROR|Error|error)\b',
|
||||
'WARN': r'\b(WARN|Warning|warn|warning)\b',
|
||||
'INFO': r'\b(INFO|Info|info)\b',
|
||||
'DEBUG': r'\b(DEBUG|Debug|debug)\b',
|
||||
'FATAL': r'\b(FATAL|Fatal|fatal|CRITICAL|Critical)\b'
|
||||
}
|
||||
|
||||
# Common error patterns
|
||||
ERROR_PATTERNS = {
|
||||
'exception': r'Exception|exception|EXCEPTION',
|
||||
'stack_trace': r'\s+at\s+.*\(.*:\d+\)',
|
||||
'http_error': r'\b[45]\d{2}\b', # 4xx and 5xx HTTP codes
|
||||
'timeout': r'timeout|timed out|TIMEOUT',
|
||||
'connection_refused': r'connection refused|ECONNREFUSED',
|
||||
'out_of_memory': r'OutOfMemoryError|OOM|out of memory',
|
||||
'null_pointer': r'NullPointerException|null pointer|NPE',
|
||||
'database_error': r'SQLException|database error|DB error'
|
||||
}
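# Example: a line such as "ERROR 504 GatewayTimeout: upstream timed out"
# would be counted under both 'http_error' and 'timeout'; the patterns are
# intentionally not mutually exclusive.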
|
||||
|
||||
def __init__(self, log_file: str):
|
||||
self.log_file = log_file
|
||||
self.lines = []
|
||||
self.log_levels = Counter()
|
||||
self.error_patterns = Counter()
|
||||
self.timestamps = []
|
||||
|
||||
def parse_file(self) -> bool:
|
||||
"""Parse log file."""
|
||||
try:
|
||||
with open(self.log_file, 'r', encoding='utf-8', errors='ignore') as f:
|
||||
self.lines = f.readlines()
|
||||
return True
|
||||
except Exception as e:
|
||||
print(f"❌ Error reading file: {e}")
|
||||
return False
|
||||
|
||||
def analyze_log_levels(self):
|
||||
"""Count log levels."""
|
||||
for line in self.lines:
|
||||
for level, pattern in self.LOG_LEVELS.items():
|
||||
if re.search(pattern, line):
|
||||
self.log_levels[level] += 1
|
||||
break # Count each line only once
|
||||
|
||||
def analyze_error_patterns(self):
|
||||
"""Detect common error patterns."""
|
||||
for line in self.lines:
|
||||
for pattern_name, pattern in self.ERROR_PATTERNS.items():
|
||||
if re.search(pattern, line, re.IGNORECASE):
|
||||
self.error_patterns[pattern_name] += 1
|
||||
|
||||
def extract_timestamps(self, timestamp_pattern: Optional[str] = None):
|
||||
"""Extract timestamps from logs."""
|
||||
if not timestamp_pattern:
|
||||
# Common timestamp patterns
|
||||
patterns = [
|
||||
r'\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}', # ISO format
|
||||
r'\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}', # Apache format
|
||||
r'\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}', # Syslog format
|
||||
]
|
||||
else:
|
||||
patterns = [timestamp_pattern]
|
||||
|
||||
for line in self.lines:
|
||||
for pattern in patterns:
|
||||
match = re.search(pattern, line)
|
||||
if match:
|
||||
self.timestamps.append(match.group())
|
||||
break
|
||||
|
||||
def find_error_lines(self, context: int = 2) -> List[Dict[str, Any]]:
|
||||
"""Find error lines with context."""
|
||||
errors = []
|
||||
|
||||
for i, line in enumerate(self.lines):
|
||||
# Check if line contains error keywords
|
||||
is_error = any(re.search(pattern, line, re.IGNORECASE)
|
||||
for pattern in [self.LOG_LEVELS['ERROR'], self.LOG_LEVELS['FATAL']])
|
||||
|
||||
if is_error:
|
||||
# Get context lines
|
||||
start = max(0, i - context)
|
||||
end = min(len(self.lines), i + context + 1)
|
||||
context_lines = self.lines[start:end]
|
||||
|
||||
errors.append({
|
||||
'line_number': i + 1,
|
||||
'line': line.strip(),
|
||||
'context': ''.join(context_lines)
|
||||
})
|
||||
|
||||
return errors
|
||||
|
||||
def analyze_frequency(self, time_window_minutes: int = 5) -> Dict[str, Any]:
|
||||
"""Analyze log frequency over time."""
|
||||
if not self.timestamps:
|
||||
return {"error": "No timestamps found"}
|
||||
|
||||
# This is a simplified version - in production you'd parse actual timestamps
|
||||
total_lines = len(self.lines)
|
||||
if self.timestamps:
|
||||
time_span = len(self.timestamps)
|
||||
avg_per_window = total_lines / max(1, time_span / time_window_minutes)
|
||||
else:
|
||||
avg_per_window = 0
|
||||
|
||||
return {
|
||||
"total_lines": total_lines,
|
||||
"timestamps_found": len(self.timestamps),
|
||||
"avg_per_window": avg_per_window
|
||||
}
|
||||
|
||||
def extract_unique_messages(self, pattern: str) -> List[str]:
|
||||
"""Extract unique messages matching a pattern."""
|
||||
matches = []
|
||||
seen = set()
|
||||
|
||||
for line in self.lines:
|
||||
match = re.search(pattern, line, re.IGNORECASE)
|
||||
if match:
|
||||
msg = match.group() if match.lastindex is None else match.group(1)
|
||||
if msg not in seen:
|
||||
matches.append(msg)
|
||||
seen.add(msg)
|
||||
|
||||
return matches
|
||||
|
||||
def find_stack_traces(self) -> List[Dict[str, Any]]:
|
||||
"""Extract complete stack traces."""
|
||||
stack_traces = []
|
||||
current_trace = []
|
||||
in_trace = False
|
||||
|
||||
for i, line in enumerate(self.lines):
|
||||
# Start of stack trace
|
||||
if re.search(r'Exception|Error.*:', line):
|
||||
if current_trace:
|
||||
stack_traces.append({
|
||||
'line_start': i - len(current_trace) + 1,
|
||||
'trace': '\n'.join(current_trace)
|
||||
})
|
||||
current_trace = [line.strip()]
|
||||
in_trace = True
|
||||
# Stack trace continuation
|
||||
elif in_trace and re.search(r'^\s+at\s+', line):
|
||||
current_trace.append(line.strip())
|
||||
# End of stack trace
|
||||
elif in_trace:
|
||||
if current_trace:
|
||||
stack_traces.append({
|
||||
'line_start': i - len(current_trace) + 1,
|
||||
'trace': '\n'.join(current_trace)
|
||||
})
|
||||
current_trace = []
|
||||
in_trace = False
|
||||
|
||||
# Add last trace if exists
|
||||
if current_trace:
|
||||
stack_traces.append({
|
||||
'line_start': len(self.lines) - len(current_trace) + 1,
|
||||
'trace': '\n'.join(current_trace)
|
||||
})
|
||||
|
||||
return stack_traces
|
||||
|
||||
|
||||
def print_analysis_results(analyzer: LogAnalyzer, show_errors: bool = False,
|
||||
show_traces: bool = False):
|
||||
"""Print analysis results."""
|
||||
print("\n" + "="*60)
|
||||
print("📝 LOG ANALYSIS RESULTS")
|
||||
print("="*60)
|
||||
|
||||
print(f"\n📁 File: {analyzer.log_file}")
|
||||
print(f"📊 Total Lines: {len(analyzer.lines):,}")
|
||||
|
||||
# Log levels
|
||||
if analyzer.log_levels:
|
||||
print(f"\n{'='*60}")
|
||||
print("📊 LOG LEVEL DISTRIBUTION:")
|
||||
print(f"{'='*60}")
|
||||
|
||||
level_emoji = {
|
||||
'FATAL': '🔴',
|
||||
'ERROR': '❌',
|
||||
'WARN': '⚠️',
|
||||
'INFO': 'ℹ️',
|
||||
'DEBUG': '🐛'
|
||||
}
|
||||
|
||||
for level, count in analyzer.log_levels.most_common():
|
||||
emoji = level_emoji.get(level, '•')
|
||||
percentage = (count / len(analyzer.lines)) * 100
|
||||
print(f"{emoji} {level:10s}: {count:6,} ({percentage:5.1f}%)")
|
||||
|
||||
# Error patterns
|
||||
if analyzer.error_patterns:
|
||||
print(f"\n{'='*60}")
|
||||
print("🔍 ERROR PATTERNS DETECTED:")
|
||||
print(f"{'='*60}")
|
||||
|
||||
for pattern, count in analyzer.error_patterns.most_common(10):
|
||||
print(f"• {pattern:20s}: {count:,} occurrences")
|
||||
|
||||
# Timestamps
|
||||
if analyzer.timestamps:
|
||||
print(f"\n{'='*60}")
|
||||
print(f"⏰ Timestamps Found: {len(analyzer.timestamps):,}")
|
||||
print(f" First: {analyzer.timestamps[0]}")
|
||||
print(f" Last: {analyzer.timestamps[-1]}")
|
||||
|
||||
# Error lines
|
||||
if show_errors:
|
||||
errors = analyzer.find_error_lines(context=1)
|
||||
if errors:
|
||||
print(f"\n{'='*60}")
|
||||
print(f"❌ ERROR LINES (showing first 10 of {len(errors)}):")
|
||||
print(f"{'='*60}")
|
||||
|
||||
for error in errors[:10]:
|
||||
print(f"\nLine {error['line_number']}:")
|
||||
print(f" {error['line']}")
|
||||
|
||||
# Stack traces
|
||||
if show_traces:
|
||||
traces = analyzer.find_stack_traces()
|
||||
if traces:
|
||||
print(f"\n{'='*60}")
|
||||
print(f"📚 STACK TRACES FOUND: {len(traces)}")
|
||||
print(f"{'='*60}")
|
||||
|
||||
for i, trace in enumerate(traces[:5], 1):
|
||||
print(f"\nTrace {i} (starting at line {trace['line_start']}):")
|
||||
print(trace['trace'])
|
||||
if i < len(traces):
|
||||
print("\n" + "-"*60)
|
||||
|
||||
print("\n" + "="*60)
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Analyze log files for errors, patterns, and anomalies",
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Examples:
|
||||
# Basic analysis
|
||||
python3 log_analyzer.py application.log
|
||||
|
||||
# Show error lines with context
|
||||
python3 log_analyzer.py application.log --show-errors
|
||||
|
||||
# Show stack traces
|
||||
python3 log_analyzer.py application.log --show-traces
|
||||
|
||||
# Full analysis
|
||||
python3 log_analyzer.py application.log --show-errors --show-traces
|
||||
|
||||
Features:
|
||||
• Log level distribution (ERROR, WARN, INFO, DEBUG, FATAL)
|
||||
• Common error pattern detection
|
||||
• Timestamp extraction
|
||||
• Error line identification with context
|
||||
• Stack trace extraction
|
||||
• Frequency analysis
|
||||
"""
|
||||
)
|
||||
|
||||
parser.add_argument('log_file', help='Path to log file')
|
||||
parser.add_argument('--show-errors', action='store_true', help='Show error lines')
|
||||
parser.add_argument('--show-traces', action='store_true', help='Show stack traces')
|
||||
parser.add_argument('--timestamp-pattern', help='Custom regex for timestamp extraction')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if not Path(args.log_file).exists():
|
||||
print(f"❌ File not found: {args.log_file}")
|
||||
sys.exit(1)
|
||||
|
||||
print(f"🔍 Analyzing log file: {args.log_file}")
|
||||
|
||||
analyzer = LogAnalyzer(args.log_file)
|
||||
|
||||
if not analyzer.parse_file():
|
||||
sys.exit(1)
|
||||
|
||||
# Perform analysis
|
||||
analyzer.analyze_log_levels()
|
||||
analyzer.analyze_error_patterns()
|
||||
analyzer.extract_timestamps(args.timestamp_pattern)
|
||||
|
||||
# Print results
|
||||
print_analysis_results(analyzer, args.show_errors, args.show_traces)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
365
scripts/slo_calculator.py
Normal file
@@ -0,0 +1,365 @@
#!/usr/bin/env python3
"""
Calculate SLO compliance, error budgets, and burn rates.
Supports availability SLOs and latency SLOs.
"""

import argparse
import sys
from datetime import datetime, timedelta
from typing import Dict, Any, Optional

try:
    from tabulate import tabulate
except ImportError:
    print("⚠️ Warning: 'tabulate' library not found. Install with: pip install tabulate")
    tabulate = None


class SLOCalculator:
    # SLO targets and allowed downtime per period
    SLO_TARGETS = {
        "90.0": {"year": 36.5, "month": 3.0, "week": 0.7, "day": 0.1},  # days
        "95.0": {"year": 18.25, "month": 1.5, "week": 0.35, "day": 0.05},
        "99.0": {"year": 3.65, "month": 0.3, "week": 0.07, "day": 0.01},
        "99.5": {"year": 1.83, "month": 0.15, "week": 0.035, "day": 0.005},
        "99.9": {"year": 0.365, "month": 0.03, "week": 0.007, "day": 0.001},
        "99.95": {"year": 0.183, "month": 0.015, "week": 0.0035, "day": 0.0005},
        "99.99": {"year": 0.0365, "month": 0.003, "week": 0.0007, "day": 0.0001},
    }

    def __init__(self, slo_target: float, period_days: int = 30):
        """
        Initialize SLO calculator.

        Args:
            slo_target: SLO target percentage (e.g., 99.9)
            period_days: Time period in days (default: 30)
        """
        self.slo_target = slo_target
        self.period_days = period_days
        self.error_budget_minutes = self.calculate_error_budget_minutes()

    def calculate_error_budget_minutes(self) -> float:
        """Calculate error budget in minutes for the period."""
        total_minutes = self.period_days * 24 * 60
        allowed_error_rate = (100 - self.slo_target) / 100
        return total_minutes * allowed_error_rate
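    # Worked example (illustrative): with slo_target=99.9 and period_days=30,
    # total_minutes = 30 * 24 * 60 = 43,200 and allowed_error_rate = 0.001,
    # so the error budget is 43,200 * 0.001 = 43.2 minutes of unavailability.
    # The same arithmetic produces the SLO_TARGETS table above, e.g. 99.9% over a
    # year allows 0.001 * 365 = 0.365 days (about 8.76 hours).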
    def calculate_availability_slo(self, total_requests: int, failed_requests: int) -> Dict[str, Any]:
        """
        Calculate availability SLO compliance.

        Args:
            total_requests: Total number of requests
            failed_requests: Number of failed requests

        Returns:
            Dict with SLO compliance metrics
        """
        if total_requests == 0:
            return {
                "error": "No requests in the period",
                "slo_met": False
            }

        success_rate = ((total_requests - failed_requests) / total_requests) * 100
        error_rate = (failed_requests / total_requests) * 100

        # Calculate error budget consumption
        allowed_failures = total_requests * ((100 - self.slo_target) / 100)
        error_budget_consumed = (failed_requests / allowed_failures) * 100 if allowed_failures > 0 else float('inf')
        error_budget_remaining = max(0, 100 - error_budget_consumed)

        # Determine if SLO is met
        slo_met = success_rate >= self.slo_target

        return {
            "slo_target": self.slo_target,
            "period_days": self.period_days,
            "total_requests": total_requests,
            "failed_requests": failed_requests,
            "success_requests": total_requests - failed_requests,
            "success_rate": success_rate,
            "error_rate": error_rate,
            "slo_met": slo_met,
            "error_budget_total": allowed_failures,
            "error_budget_consumed": error_budget_consumed,
            "error_budget_remaining": error_budget_remaining,
            "margin": success_rate - self.slo_target
        }
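    # Worked example (illustrative), matching the CLI example in main() below:
    # total_requests=1_000_000 and failed_requests=1_500 against a 99.9% target give
    # success_rate = 99.85% (margin -0.05%, SLO violated), allowed_failures = 1,000,
    # so error_budget_consumed = 150% and error_budget_remaining = 0%.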
    def calculate_latency_slo(self, total_requests: int, requests_exceeding_threshold: int) -> Dict[str, Any]:
        """
        Calculate latency SLO compliance.

        Args:
            total_requests: Total number of requests
            requests_exceeding_threshold: Number of requests exceeding latency threshold

        Returns:
            Dict with SLO compliance metrics
        """
        if total_requests == 0:
            return {
                "error": "No requests in the period",
                "slo_met": False
            }

        within_threshold_rate = ((total_requests - requests_exceeding_threshold) / total_requests) * 100

        # Calculate error budget consumption
        allowed_slow_requests = total_requests * ((100 - self.slo_target) / 100)
        error_budget_consumed = (requests_exceeding_threshold / allowed_slow_requests) * 100 if allowed_slow_requests > 0 else float('inf')
        error_budget_remaining = max(0, 100 - error_budget_consumed)

        slo_met = within_threshold_rate >= self.slo_target

        return {
            "slo_target": self.slo_target,
            "period_days": self.period_days,
            "total_requests": total_requests,
            "requests_exceeding_threshold": requests_exceeding_threshold,
            "requests_within_threshold": total_requests - requests_exceeding_threshold,
            "within_threshold_rate": within_threshold_rate,
            # Aliases so print_availability_results() can render latency results with
            # the same layout (main() reuses that printer for latency mode).
            "success_rate": within_threshold_rate,
            "error_rate": 100 - within_threshold_rate,
            "success_requests": total_requests - requests_exceeding_threshold,
            "failed_requests": requests_exceeding_threshold,
            "slo_met": slo_met,
            "error_budget_total": allowed_slow_requests,
            "error_budget_consumed": error_budget_consumed,
            "error_budget_remaining": error_budget_remaining,
            "margin": within_threshold_rate - self.slo_target
        }
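    # Worked example (illustrative), matching the CLI example in main() below:
    # total_requests=500_000 with 3_000 requests over the latency threshold against
    # a 99.5% target gives within_threshold_rate = 99.4% (SLO violated),
    # allowed_slow_requests = 2,500, so error_budget_consumed = 120%.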
    def calculate_burn_rate(self, errors_in_window: int, requests_in_window: int, window_hours: float) -> Dict[str, Any]:
        """
        Calculate error budget burn rate.

        Args:
            errors_in_window: Number of errors in the time window
            requests_in_window: Total requests in the time window
            window_hours: Size of the time window in hours

        Returns:
            Dict with burn rate metrics
        """
        if requests_in_window == 0:
            return {"error": "No requests in window"}

        # Calculate actual error rate in this window
        actual_error_rate = (errors_in_window / requests_in_window) * 100

        # Calculate allowed error rate for SLO
        allowed_error_rate = 100 - self.slo_target

        # Burn rate = actual error rate / allowed error rate
        burn_rate = actual_error_rate / allowed_error_rate if allowed_error_rate > 0 else float('inf')

        # Calculate time to exhaustion: at a burn rate of 1x the budget lasts the
        # full period, so exhaustion time scales as period / burn_rate.
        if burn_rate > 0:
            period_hours = self.period_days * 24
            hours_to_exhaustion = period_hours / burn_rate
        else:
            hours_to_exhaustion = float('inf')

        # Determine severity
        if burn_rate >= 14.4:  # 1 hour window, burns budget in 2 days
            severity = "critical"
        elif burn_rate >= 6:  # 6 hour window, burns budget in 5 days
            severity = "warning"
        elif burn_rate >= 1:
            severity = "elevated"
        else:
            severity = "normal"

        return {
            "window_hours": window_hours,
            "requests_in_window": requests_in_window,
            "errors_in_window": errors_in_window,
            "actual_error_rate": actual_error_rate,
            "allowed_error_rate": allowed_error_rate,
            "burn_rate": burn_rate,
            "hours_to_exhaustion": hours_to_exhaustion,
            "severity": severity
        }
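    # Worked example (illustrative), matching the CLI example in main() below:
    # a 1-hour window with 50 errors out of 10,000 requests against a 99.9% / 30-day
    # SLO gives actual_error_rate = 0.5%, allowed_error_rate = 0.1%, burn_rate = 5x,
    # hours_to_exhaustion = 720 / 5 = 144 hours (6 days), severity "elevated".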
    @staticmethod
    def print_slo_table():
        """Print table of common SLO targets and allowed downtime."""
        if not tabulate:
            print("Install tabulate for formatted output: pip install tabulate")
            return

        print("\n📊 SLO TARGETS AND ALLOWED DOWNTIME")
        print("="*60)

        headers = ["SLO", "Year", "Month", "Week", "Day"]
        rows = []

        for slo, downtimes in sorted(SLOCalculator.SLO_TARGETS.items(), reverse=True):
            row = [
                f"{slo}%",
                f"{downtimes['year']:.2f} days",
                f"{downtimes['month']:.2f} days",
                f"{downtimes['week']:.2f} days",
                f"{downtimes['day']:.2f} days"
            ]
            rows.append(row)

        print(tabulate(rows, headers=headers, tablefmt="grid"))


def print_availability_results(results: Dict[str, Any]):
    """Print availability SLO results."""
    print("\n" + "="*60)
    print("📊 AVAILABILITY SLO COMPLIANCE")
    print("="*60)

    if "error" in results:
        print(f"\n❌ Error: {results['error']}")
        return

    status_emoji = "✅" if results['slo_met'] else "❌"
    print(f"\n{status_emoji} SLO Status: {'MET' if results['slo_met'] else 'VIOLATED'}")
    print(f"   Target: {results['slo_target']}%")
    print(f"   Actual: {results['success_rate']:.3f}%")
    print(f"   Margin: {results['margin']:+.3f}%")

    print(f"\n📈 Request Statistics:")
    print(f"   Total Requests: {results['total_requests']:,}")
    print(f"   Successful: {results['success_requests']:,}")
    print(f"   Failed: {results['failed_requests']:,}")
    print(f"   Error Rate: {results['error_rate']:.3f}%")

    print(f"\n💰 Error Budget:")
    budget_emoji = "✅" if results['error_budget_remaining'] > 20 else "⚠️" if results['error_budget_remaining'] > 0 else "❌"
    print(f"   {budget_emoji} Remaining: {results['error_budget_remaining']:.1f}%")
    print(f"   Consumed: {results['error_budget_consumed']:.1f}%")
    print(f"   Allowed Failures: {results['error_budget_total']:.0f}")

    print("\n" + "="*60)


def print_burn_rate_results(results: Dict[str, Any]):
    """Print burn rate results."""
    print("\n" + "="*60)
    print("🔥 ERROR BUDGET BURN RATE")
    print("="*60)

    if "error" in results:
        print(f"\n❌ Error: {results['error']}")
        return

    severity_emoji = {
        "critical": "🔴",
        "warning": "🟡",
        "elevated": "🟠",
        "normal": "🟢"
    }

    print(f"\n{severity_emoji.get(results['severity'], '❓')} Severity: {results['severity'].upper()}")
    print(f"   Burn Rate: {results['burn_rate']:.2f}x")
    print(f"   Time to Exhaustion: {results['hours_to_exhaustion']:.1f} hours ({results['hours_to_exhaustion']/24:.1f} days)")

    print(f"\n📊 Window Statistics:")
    print(f"   Window: {results['window_hours']} hours")
    print(f"   Requests: {results['requests_in_window']:,}")
    print(f"   Errors: {results['errors_in_window']:,}")
    print(f"   Actual Error Rate: {results['actual_error_rate']:.3f}%")
    print(f"   Allowed Error Rate: {results['allowed_error_rate']:.3f}%")

    print("\n" + "="*60)


def main():
    parser = argparse.ArgumentParser(
        description="Calculate SLO compliance and error budgets",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Show SLO reference table
  python3 slo_calculator.py --table

  # Calculate availability SLO
  python3 slo_calculator.py availability \\
      --slo 99.9 \\
      --total-requests 1000000 \\
      --failed-requests 1500 \\
      --period-days 30

  # Calculate latency SLO
  python3 slo_calculator.py latency \\
      --slo 99.5 \\
      --total-requests 500000 \\
      --slow-requests 3000 \\
      --period-days 7

  # Calculate burn rate
  python3 slo_calculator.py burn-rate \\
      --slo 99.9 \\
      --errors 50 \\
      --requests 10000 \\
      --window-hours 1
        """
    )

    parser.add_argument('mode', nargs='?', choices=['availability', 'latency', 'burn-rate'],
                        help='Calculation mode')
    parser.add_argument('--table', action='store_true', help='Show SLO reference table')
    parser.add_argument('--slo', type=float, help='SLO target percentage (e.g., 99.9)')
    parser.add_argument('--period-days', type=int, default=30, help='Period in days (default: 30)')

    # Availability SLO arguments
    parser.add_argument('--total-requests', type=int, help='Total number of requests')
    parser.add_argument('--failed-requests', type=int, help='Number of failed requests')

    # Latency SLO arguments
    parser.add_argument('--slow-requests', type=int, help='Number of requests exceeding threshold')

    # Burn rate arguments
    parser.add_argument('--errors', type=int, help='Number of errors in window')
    parser.add_argument('--requests', type=int, help='Number of requests in window')
    parser.add_argument('--window-hours', type=float, help='Window size in hours')

    args = parser.parse_args()

    # Show table if requested
    if args.table:
        SLOCalculator.print_slo_table()
        return

    if not args.mode:
        parser.print_help()
        return

    if not args.slo:
        print("❌ --slo required")
        sys.exit(1)

    calculator = SLOCalculator(args.slo, args.period_days)

    if args.mode == 'availability':
        if not args.total_requests or args.failed_requests is None:
            print("❌ --total-requests and --failed-requests required")
            sys.exit(1)

        results = calculator.calculate_availability_slo(args.total_requests, args.failed_requests)
        print_availability_results(results)

    elif args.mode == 'latency':
        if not args.total_requests or args.slow_requests is None:
            print("❌ --total-requests and --slow-requests required")
            sys.exit(1)

        results = calculator.calculate_latency_slo(args.total_requests, args.slow_requests)
        print_availability_results(results)  # Same format

    elif args.mode == 'burn-rate':
        if not all([args.errors is not None, args.requests, args.window_hours]):
            print("❌ --errors, --requests, and --window-hours required")
            sys.exit(1)

        results = calculator.calculate_burn_rate(args.errors, args.requests, args.window_hours)
        print_burn_rate_results(results)


if __name__ == "__main__":
    main()
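The calculator can also be used as a library from other tooling, for example a periodic job that pages on fast burn. The sketch below is illustrative only and not part of the committed script; it assumes `scripts/` is on the import path so the file loads as `slo_calculator`, and it uses only the methods and result keys defined above.

```python
# Minimal programmatic use of SLOCalculator — illustrative, mirrors the CLI examples above.
from slo_calculator import SLOCalculator

calc = SLOCalculator(slo_target=99.9, period_days=30)

availability = calc.calculate_availability_slo(total_requests=1_000_000, failed_requests=1_500)
burn = calc.calculate_burn_rate(errors_in_window=50, requests_in_window=10_000, window_hours=1)

if not availability["slo_met"] or burn["severity"] in ("critical", "warning"):
    print(f"SLO at risk: burn rate {burn['burn_rate']:.1f}x, "
          f"~{burn['hours_to_exhaustion']:.0f}h of budget left")
```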