Initial commit

12  .claude-plugin/plugin.json  Normal file
@@ -0,0 +1,12 @@
{
  "name": "monitoring-observability",
  "description": "Monitoring and observability strategy, metrics/logs/traces systems, SLOs/error budgets, Prometheus/Grafana/Loki, OpenTelemetry, and tool comparison",
  "version": "0.0.0-2025.11.28",
  "author": {
    "name": "Ahmad Asmar",
    "email": "zhongweili@tubi.tv"
  },
  "skills": [
    "./"
  ]
}
3  README.md  Normal file
@@ -0,0 +1,3 @@
# monitoring-observability

Monitoring and observability strategy, metrics/logs/traces systems, SLOs/error budgets, Prometheus/Grafana/Loki, OpenTelemetry, and tool comparison
869  SKILL.md  Normal file
@@ -0,0 +1,869 @@
---
name: monitoring-observability
description: Monitoring and observability strategy, implementation, and troubleshooting. Use for designing metrics/logs/traces systems, setting up Prometheus/Grafana/Loki, creating alerts and dashboards, calculating SLOs and error budgets, analyzing performance issues, and comparing monitoring tools (Datadog, ELK, CloudWatch). Covers the Four Golden Signals, RED/USE methods, OpenTelemetry instrumentation, log aggregation patterns, and distributed tracing.
---

# Monitoring & Observability

## Overview

This skill provides comprehensive guidance for monitoring and observability workflows including metrics design, log aggregation, distributed tracing, alerting strategies, SLO/SLA management, and tool selection.
|
||||
|
||||
**When to use this skill**:
|
||||
- Setting up monitoring for new services
|
||||
- Designing alerts and dashboards
|
||||
- Troubleshooting performance issues
|
||||
- Implementing SLO tracking and error budgets
|
||||
- Choosing between monitoring tools
|
||||
- Integrating OpenTelemetry instrumentation
|
||||
- Analyzing metrics, logs, and traces
|
||||
- Optimizing Datadog costs and finding waste
|
||||
- Migrating from Datadog to open-source stack
|
||||
|
||||
---
|
||||
|
||||
## Core Workflow: Observability Implementation
|
||||
|
||||
Use this decision tree to determine your starting point:
|
||||
|
||||
```
|
||||
Are you setting up monitoring from scratch?
|
||||
├─ YES → Start with "1. Design Metrics Strategy"
|
||||
└─ NO → Do you have an existing issue?
|
||||
├─ YES → Go to "9. Troubleshooting & Analysis"
|
||||
└─ NO → Are you improving existing monitoring?
|
||||
├─ Alerts → Go to "3. Alert Design"
|
||||
├─ Dashboards → Go to "4. Dashboard & Visualization"
|
||||
├─ SLOs → Go to "5. SLO & Error Budgets"
|
||||
├─ Tool selection → Read references/tool_comparison.md
|
||||
└─ Using Datadog? High costs? → Go to "7. Datadog Cost Optimization & Migration"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 1. Design Metrics Strategy
|
||||
|
||||
### Start with The Four Golden Signals
|
||||
|
||||
Every service should monitor:
|
||||
|
||||
1. **Latency**: Response time (p50, p95, p99)
|
||||
2. **Traffic**: Requests per second
|
||||
3. **Errors**: Failure rate
|
||||
4. **Saturation**: Resource utilization
|
||||
|
||||
**For request-driven services**, use the **RED Method**:
|
||||
- **R**ate: Requests/sec
|
||||
- **E**rrors: Error rate
|
||||
- **D**uration: Response time
|
||||
|
||||
**For infrastructure resources**, use the **USE Method**:
|
||||
- **U**tilization: % time busy
|
||||
- **S**aturation: Queue depth
- **E**rrors: Error count
|
||||
|
||||
**Quick Start - Web Application Example**:
|
||||
```promql
|
||||
# Rate (requests/sec)
|
||||
sum(rate(http_requests_total[5m]))
|
||||
|
||||
# Errors (error rate %)
|
||||
sum(rate(http_requests_total{status=~"5.."}[5m]))
|
||||
/
|
||||
sum(rate(http_requests_total[5m])) * 100
|
||||
|
||||
# Duration (p95 latency)
|
||||
histogram_quantile(0.95,
|
||||
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
|
||||
)
|
||||
```
|
||||
|
||||
### Deep Dive: Metric Design
|
||||
|
||||
For comprehensive metric design guidance including:
|
||||
- Metric types (counter, gauge, histogram, summary)
|
||||
- Cardinality best practices
|
||||
- Naming conventions
|
||||
- Dashboard design principles
|
||||
|
||||
**→ Read**: [references/metrics_design.md](references/metrics_design.md)
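As a concrete companion to the queries above, here is a minimal Python sketch (assuming the `prometheus_client` package) that exposes a request counter and a latency histogram matching those metric names; the handler, labels, and port are illustrative choices, not a prescribed layout:

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

# Exposed to Prometheus as http_requests_total (the client appends _total).
REQUESTS = Counter("http_requests", "Total HTTP requests", ["method", "status"])
# Exposed as http_request_duration_seconds_bucket/_sum/_count.
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds", ["method"])

def handle_request(method: str) -> None:
    with LATENCY.labels(method).time():  # records the duration into the histogram
        time.sleep(0.05)                 # stand-in for real request handling
    REQUESTS.labels(method, "200").inc()

if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics for Prometheus to scrape
    while True:
        handle_request("GET")
```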
|
||||
|
||||
### Automated Metric Analysis
|
||||
|
||||
Detect anomalies and trends in your metrics:
|
||||
|
||||
```bash
|
||||
# Analyze Prometheus metrics for anomalies
|
||||
python3 scripts/analyze_metrics.py prometheus \
|
||||
--endpoint http://localhost:9090 \
|
||||
--query 'rate(http_requests_total[5m])' \
|
||||
--hours 24
|
||||
|
||||
# Analyze CloudWatch metrics
|
||||
python3 scripts/analyze_metrics.py cloudwatch \
|
||||
--namespace AWS/EC2 \
|
||||
--metric CPUUtilization \
|
||||
--dimensions InstanceId=i-1234567890abcdef0 \
|
||||
--hours 48
|
||||
```
|
||||
|
||||
**→ Script**: [scripts/analyze_metrics.py](scripts/analyze_metrics.py)
|
||||
|
||||
---
|
||||
|
||||
## 2. Log Aggregation & Analysis
|
||||
|
||||
### Structured Logging Checklist
|
||||
|
||||
Every log entry should include:
|
||||
- ✅ Timestamp (ISO 8601 format)
|
||||
- ✅ Log level (DEBUG, INFO, WARN, ERROR, FATAL)
|
||||
- ✅ Message (human-readable)
|
||||
- ✅ Service name
|
||||
- ✅ Request ID (for tracing)
|
||||
|
||||
**Example structured log (JSON)**:
|
||||
```json
{
  "timestamp": "2024-10-28T14:32:15Z",
  "level": "error",
  "message": "Payment processing failed",
  "service": "payment-service",
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "user_id": "user123",
  "order_id": "ORD-456",
  "error_type": "GatewayTimeout",
  "duration_ms": 5000
}
```
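A stdlib-only Python sketch that produces logs in roughly this shape (field names mirror the example above; the service name and extra fields are illustrative):

```python
import json
import logging
import sys
import time
import uuid

class JsonFormatter(logging.Formatter):
    converter = time.gmtime  # emit UTC timestamps to justify the trailing "Z"

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": "payment-service",
            "request_id": getattr(record, "request_id", None),
        }
        entry.update(getattr(record, "fields", {}))  # arbitrary structured context
        return json.dumps(entry)

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment processing failed",
    extra={"request_id": str(uuid.uuid4()),
           "fields": {"order_id": "ORD-456", "error_type": "GatewayTimeout", "duration_ms": 5000}},
)
```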
|
||||
|
||||
### Log Aggregation Patterns
|
||||
|
||||
**ELK Stack** (Elasticsearch, Logstash, Kibana):
|
||||
- Best for: Deep log analysis, complex queries
|
||||
- Cost: High (infrastructure + operations)
|
||||
- Complexity: High
|
||||
|
||||
**Grafana Loki**:
|
||||
- Best for: Cost-effective logging, Kubernetes
|
||||
- Cost: Low
|
||||
- Complexity: Medium
|
||||
|
||||
**CloudWatch Logs**:
|
||||
- Best for: AWS-centric applications
|
||||
- Cost: Medium
|
||||
- Complexity: Low
|
||||
|
||||
### Log Analysis
|
||||
|
||||
Analyze logs for errors, patterns, and anomalies:
|
||||
|
||||
```bash
|
||||
# Analyze log file for patterns
|
||||
python3 scripts/log_analyzer.py application.log
|
||||
|
||||
# Show error lines with context
|
||||
python3 scripts/log_analyzer.py application.log --show-errors
|
||||
|
||||
# Extract stack traces
|
||||
python3 scripts/log_analyzer.py application.log --show-traces
|
||||
```
|
||||
|
||||
**→ Script**: [scripts/log_analyzer.py](scripts/log_analyzer.py)
|
||||
|
||||
### Deep Dive: Logging
|
||||
|
||||
For comprehensive logging guidance including:
|
||||
- Structured logging implementation examples (Python, Node.js, Go, Java)
|
||||
- Log aggregation patterns (ELK, Loki, CloudWatch, Fluentd)
|
||||
- Query patterns and best practices
|
||||
- PII redaction and security
|
||||
- Sampling and rate limiting
|
||||
|
||||
**→ Read**: [references/logging_guide.md](references/logging_guide.md)
|
||||
|
||||
---
|
||||
|
||||
## 3. Alert Design
|
||||
|
||||
### Alert Design Principles
|
||||
|
||||
1. **Every alert must be actionable** - If you can't do something, don't alert
|
||||
2. **Alert on symptoms, not causes** - Alert on user experience, not components
|
||||
3. **Tie alerts to SLOs** - Connect to business impact
|
||||
4. **Reduce noise** - Only page for critical issues
|
||||
|
||||
### Alert Severity Levels
|
||||
|
||||
| Severity | Response Time | Example |
|
||||
|----------|--------------|---------|
|
||||
| **Critical** | Page immediately | Service down, SLO violation |
|
||||
| **Warning** | Ticket, review in hours | Elevated error rate, resource warning |
|
||||
| **Info** | Log for awareness | Deployment completed, scaling event |
|
||||
|
||||
### Multi-Window Burn Rate Alerting
|
||||
|
||||
Alert when error budget is consumed too quickly:
|
||||
|
||||
```yaml
# Fast burn (1h window) - Critical
- alert: ErrorBudgetFastBurn
  expr: |
    (error_rate / 0.001) > 14.4  # 99.9% SLO
  for: 2m
  labels:
    severity: critical

# Slow burn (6h window) - Warning
- alert: ErrorBudgetSlowBurn
  expr: |
    (error_rate / 0.001) > 6  # 99.9% SLO
  for: 30m
  labels:
    severity: warning
```
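The 14.4 and 6 multipliers follow from how much of a 30-day error budget a sustained burn rate consumes inside the alert window (a burn rate of 1 means errors arrive exactly at the budgeted rate, 0.1% here). A quick sketch of that arithmetic:

```python
def budget_consumed(burn_rate: float, window_hours: float, period_days: int = 30) -> float:
    """Fraction of the SLO period's error budget consumed in the window."""
    return burn_rate * window_hours / (period_days * 24)

print(f"{budget_consumed(14.4, 1):.1%}")  # 2.0% of the monthly budget in 1 hour
print(f"{budget_consumed(6, 6):.1%}")     # 5.0% of the monthly budget in 6 hours
```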
|
||||
|
||||
### Alert Quality Checker
|
||||
|
||||
Audit your alert rules against best practices:
|
||||
|
||||
```bash
|
||||
# Check single file
|
||||
python3 scripts/alert_quality_checker.py alerts.yml
|
||||
|
||||
# Check all rules in directory
|
||||
python3 scripts/alert_quality_checker.py /path/to/prometheus/rules/
|
||||
```
|
||||
|
||||
**Checks for**:
|
||||
- Alert naming conventions
|
||||
- Required labels (severity, team)
|
||||
- Required annotations (summary, description, runbook_url)
|
||||
- PromQL expression quality
|
||||
- 'for' clause to prevent flapping
|
||||
|
||||
**→ Script**: [scripts/alert_quality_checker.py](scripts/alert_quality_checker.py)
|
||||
|
||||
### Alert Templates
|
||||
|
||||
Production-ready alert rule templates:
|
||||
|
||||
**→ Templates**:
|
||||
- [assets/templates/prometheus-alerts/webapp-alerts.yml](assets/templates/prometheus-alerts/webapp-alerts.yml) - Web application alerts
|
||||
- [assets/templates/prometheus-alerts/kubernetes-alerts.yml](assets/templates/prometheus-alerts/kubernetes-alerts.yml) - Kubernetes alerts
|
||||
|
||||
### Deep Dive: Alerting
|
||||
|
||||
For comprehensive alerting guidance including:
|
||||
- Alert design patterns (multi-window, rate of change, threshold with hysteresis)
|
||||
- Alert annotation best practices
|
||||
- Alert routing (severity-based, team-based, time-based)
|
||||
- Inhibition rules
|
||||
- Runbook structure
|
||||
- On-call best practices
|
||||
|
||||
**→ Read**: [references/alerting_best_practices.md](references/alerting_best_practices.md)
|
||||
|
||||
### Runbook Template
|
||||
|
||||
Create comprehensive runbooks for your alerts:
|
||||
|
||||
**→ Template**: [assets/templates/runbooks/incident-runbook-template.md](assets/templates/runbooks/incident-runbook-template.md)
|
||||
|
||||
---
|
||||
|
||||
## 4. Dashboard & Visualization
|
||||
|
||||
### Dashboard Design Principles
|
||||
|
||||
1. **Top-down layout**: Most important metrics first
|
||||
2. **Color coding**: Red (critical), yellow (warning), green (healthy)
|
||||
3. **Consistent time windows**: All panels use same time range
|
||||
4. **Limit panels**: 8-12 panels per dashboard maximum
|
||||
5. **Include context**: Show related metrics together
|
||||
|
||||
### Recommended Dashboard Structure
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────┐
|
||||
│ Overall Health (Single Stats) │
|
||||
│ [Requests/s] [Error%] [P95 Latency]│
|
||||
└─────────────────────────────────────┘
|
||||
┌─────────────────────────────────────┐
|
||||
│ Request Rate & Errors (Graphs) │
|
||||
└─────────────────────────────────────┘
|
||||
┌─────────────────────────────────────┐
|
||||
│ Latency Distribution (Graphs) │
|
||||
└─────────────────────────────────────┘
|
||||
┌─────────────────────────────────────┐
|
||||
│ Resource Usage (Graphs) │
|
||||
└─────────────────────────────────────┘
|
||||
```
|
||||
|
||||
### Generate Grafana Dashboards
|
||||
|
||||
Automatically generate dashboards from templates:
|
||||
|
||||
```bash
|
||||
# Web application dashboard
|
||||
python3 scripts/dashboard_generator.py webapp \
|
||||
--title "My API Dashboard" \
|
||||
--service my_api \
|
||||
--output dashboard.json
|
||||
|
||||
# Kubernetes dashboard
|
||||
python3 scripts/dashboard_generator.py kubernetes \
|
||||
--title "K8s Production" \
|
||||
--namespace production \
|
||||
--output k8s-dashboard.json
|
||||
|
||||
# Database dashboard
|
||||
python3 scripts/dashboard_generator.py database \
|
||||
--title "PostgreSQL" \
|
||||
--db-type postgres \
|
||||
--instance db.example.com:5432 \
|
||||
--output db-dashboard.json
|
||||
```
|
||||
|
||||
**Supports**:
|
||||
- Web applications (requests, errors, latency, resources)
|
||||
- Kubernetes (pods, nodes, resources, network)
|
||||
- Databases (PostgreSQL, MySQL)
|
||||
|
||||
**→ Script**: [scripts/dashboard_generator.py](scripts/dashboard_generator.py)
|
||||
|
||||
---
|
||||
|
||||
## 5. SLO & Error Budgets
|
||||
|
||||
### SLO Fundamentals
|
||||
|
||||
**SLI** (Service Level Indicator): Measurement of service quality
|
||||
- Example: Request latency, error rate, availability
|
||||
|
||||
**SLO** (Service Level Objective): Target value for an SLI
|
||||
- Example: "99.9% of requests return in < 500ms"
|
||||
|
||||
**Error Budget**: Allowed failure amount = (100% - SLO)
|
||||
- Example: 99.9% SLO = 0.1% error budget = 43.2 minutes/month
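The minutes-per-month figures here and in the table below come directly from the period length; a small sketch of the arithmetic, assuming a 30-day month:

```python
def downtime_budget_minutes(slo_percent: float, period_days: int = 30) -> float:
    """Allowed downtime for an availability SLO over the period."""
    total_minutes = period_days * 24 * 60          # 43,200 minutes in 30 days
    return total_minutes * (100.0 - slo_percent) / 100.0

print(downtime_budget_minutes(99.9))   # 43.2
print(downtime_budget_minutes(99.99))  # 4.32
```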
|
||||
|
||||
### Common SLO Targets
|
||||
|
||||
| Availability | Downtime/Month | Use Case |
|
||||
|--------------|----------------|----------|
|
||||
| **99%** | 7.2 hours | Internal tools |
|
||||
| **99.9%** | 43.2 minutes | Standard production |
|
||||
| **99.95%** | 21.6 minutes | Critical services |
|
||||
| **99.99%** | 4.3 minutes | High availability |
|
||||
|
||||
### SLO Calculator
|
||||
|
||||
Calculate compliance, error budgets, and burn rates:
|
||||
|
||||
```bash
|
||||
# Show SLO reference table
|
||||
python3 scripts/slo_calculator.py --table
|
||||
|
||||
# Calculate availability SLO
|
||||
python3 scripts/slo_calculator.py availability \
|
||||
--slo 99.9 \
|
||||
--total-requests 1000000 \
|
||||
--failed-requests 1500 \
|
||||
--period-days 30
|
||||
|
||||
# Calculate burn rate
|
||||
python3 scripts/slo_calculator.py burn-rate \
|
||||
--slo 99.9 \
|
||||
--errors 50 \
|
||||
--requests 10000 \
|
||||
--window-hours 1
|
||||
```
|
||||
|
||||
**→ Script**: [scripts/slo_calculator.py](scripts/slo_calculator.py)
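Independent of the script, the arithmetic behind the availability example above is simple enough to sanity-check by hand (this is a sketch, not `slo_calculator.py`'s actual implementation):

```python
def error_budget_report(slo_percent: float, total: int, failed: int) -> dict:
    allowed_failures = total * (100.0 - slo_percent) / 100.0
    return {
        "availability_pct": 100.0 * (total - failed) / total,
        "allowed_failures": allowed_failures,
        "budget_consumed_pct": 100.0 * failed / allowed_failures,
    }

# 1,500 failures out of 1,000,000 requests against a 99.9% SLO:
print(error_budget_report(99.9, 1_000_000, 1_500))
# {'availability_pct': 99.85, 'allowed_failures': 1000.0, 'budget_consumed_pct': 150.0}
```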
|
||||
|
||||
### Deep Dive: SLO/SLA
|
||||
|
||||
For comprehensive SLO/SLA guidance including:
|
||||
- Choosing appropriate SLIs
|
||||
- Setting realistic SLO targets
|
||||
- Error budget policies
|
||||
- Burn rate alerting
|
||||
- SLA structure and contracts
|
||||
- Monthly reporting templates
|
||||
|
||||
**→ Read**: [references/slo_sla_guide.md](references/slo_sla_guide.md)
|
||||
|
||||
---
|
||||
|
||||
## 6. Distributed Tracing
|
||||
|
||||
### When to Use Tracing
|
||||
|
||||
Use distributed tracing when you need to:
|
||||
- Debug performance issues across services
|
||||
- Understand request flow through microservices
|
||||
- Identify bottlenecks in distributed systems
|
||||
- Find N+1 query problems
|
||||
|
||||
### OpenTelemetry Implementation
|
||||
|
||||
**Python example**:
|
||||
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("process_order")
def process_order(order_id):
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)

    try:
        result = payment_service.charge(order_id)
        span.set_attribute("payment.status", "success")
        return result
    except Exception as e:
        span.set_status(trace.Status(trace.StatusCode.ERROR))
        span.record_exception(e)
        raise
```
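The snippet above assumes a tracer provider has already been configured. A minimal setup sketch, assuming the `opentelemetry-sdk` and `opentelemetry-exporter-otlp` packages and a collector listening on its default OTLP gRPC port:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(
    resource=Resource.create({"service.name": "payment-service"})  # illustrative name
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
```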
|
||||
|
||||
### Sampling Strategies
|
||||
|
||||
- **Development**: 100% (ALWAYS_ON)
|
||||
- **Staging**: 50-100%
|
||||
- **Production**: 1-10% (or error-based sampling)
|
||||
|
||||
**Error-based sampling** (always sample errors, 1% of successes):
|
||||
```python
from opentelemetry.sdk.trace.sampling import Decision, Sampler, SamplingResult

class ErrorSampler(Sampler):
    """Always sample error spans, plus roughly 1% of everything else."""

    def should_sample(self, parent_context, trace_id, name, kind=None,
                      attributes=None, links=None, trace_state=None):
        attributes = attributes or {}

        if attributes.get("error", False):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)

        if trace_id & 0xFF < 3:  # 3/256 ≈ 1%
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)

        return SamplingResult(Decision.DROP)

    def get_description(self) -> str:
        return "ErrorSampler"
```
|
||||
|
||||
### OTel Collector Configuration
|
||||
|
||||
Production-ready OpenTelemetry Collector configuration:
|
||||
|
||||
**→ Template**: [assets/templates/otel-config/collector-config.yaml](assets/templates/otel-config/collector-config.yaml)
|
||||
|
||||
**Features**:
|
||||
- Receives OTLP, Prometheus, and host metrics
|
||||
- Batching and memory limiting
|
||||
- Tail sampling (error-based, latency-based, probabilistic)
|
||||
- Multiple exporters (Tempo, Jaeger, Loki, Prometheus, CloudWatch, Datadog)
|
||||
|
||||
### Deep Dive: Tracing
|
||||
|
||||
For comprehensive tracing guidance including:
|
||||
- OpenTelemetry instrumentation (Python, Node.js, Go, Java)
|
||||
- Span attributes and semantic conventions
|
||||
- Context propagation (W3C Trace Context)
|
||||
- Backend comparison (Jaeger, Tempo, X-Ray, Datadog APM)
|
||||
- Analysis patterns (finding slow traces, N+1 queries)
|
||||
- Integration with logs
|
||||
|
||||
**→ Read**: [references/tracing_guide.md](references/tracing_guide.md)
|
||||
|
||||
---
|
||||
|
||||
## 7. Datadog Cost Optimization & Migration
|
||||
|
||||
### Scenario 1: I'm Using Datadog and Costs Are Too High
|
||||
|
||||
If your Datadog bill is growing out of control, start by identifying waste:
|
||||
|
||||
#### Cost Analysis Script
|
||||
|
||||
Automatically analyze your Datadog usage and find cost optimization opportunities:
|
||||
|
||||
```bash
|
||||
# Analyze Datadog usage (requires API key and APP key)
|
||||
python3 scripts/datadog_cost_analyzer.py \
|
||||
--api-key $DD_API_KEY \
|
||||
--app-key $DD_APP_KEY
|
||||
|
||||
# Show detailed breakdown by category
|
||||
python3 scripts/datadog_cost_analyzer.py \
|
||||
--api-key $DD_API_KEY \
|
||||
--app-key $DD_APP_KEY \
|
||||
--show-details
|
||||
```
|
||||
|
||||
**What it checks**:
|
||||
- Infrastructure host count and cost
|
||||
- Custom metrics usage and high-cardinality metrics
|
||||
- Log ingestion volume and trends
|
||||
- APM host usage
|
||||
- Unused or noisy monitors
|
||||
- Container vs VM optimization opportunities
|
||||
|
||||
**→ Script**: [scripts/datadog_cost_analyzer.py](scripts/datadog_cost_analyzer.py)
|
||||
|
||||
#### Common Cost Optimization Strategies
|
||||
|
||||
**1. Custom Metrics Optimization** (typical savings: 20-40%):
|
||||
- Remove high-cardinality tags (user IDs, request IDs)
|
||||
- Delete unused custom metrics
|
||||
- Aggregate metrics before sending
|
||||
- Use metric prefixes to identify teams/services
|
||||
|
||||
**2. Log Management** (typical savings: 30-50%):
|
||||
- Implement log sampling for high-volume services
|
||||
- Use exclusion filters for debug/trace logs in production
|
||||
- Archive cold logs to S3/GCS after 7 days
|
||||
- Set log retention policies (15 days instead of 30)
|
||||
|
||||
**3. APM Optimization** (typical savings: 15-25%):
|
||||
- Reduce trace sampling rates (10% → 5% in prod)
|
||||
- Use head-based sampling instead of complete sampling
|
||||
- Remove APM from non-critical services
|
||||
- Use trace search with lower retention
|
||||
|
||||
**4. Infrastructure Monitoring** (typical savings: 10-20%):
|
||||
- Switch from VM-based to container-based pricing where possible
|
||||
- Remove agents from ephemeral instances
|
||||
- Use Datadog's host reduction strategies
|
||||
- Consolidate staging environments
|
||||
|
||||
### Scenario 2: Migrating Away from Datadog
|
||||
|
||||
If you're considering migrating to a more cost-effective open-source stack:
|
||||
|
||||
#### Migration Overview
|
||||
|
||||
**From Datadog** → **To Open Source Stack**:
|
||||
- Metrics: Datadog → **Prometheus + Grafana**
|
||||
- Logs: Datadog Logs → **Grafana Loki**
|
||||
- Traces: Datadog APM → **Tempo or Jaeger**
|
||||
- Dashboards: Datadog → **Grafana**
|
||||
- Alerts: Datadog Monitors → **Prometheus Alertmanager**
|
||||
|
||||
**Estimated Cost Savings**: 60-77% ($49.8k-61.8k/year for 100-host environment)
|
||||
|
||||
#### Migration Strategy
|
||||
|
||||
**Phase 1: Run Parallel** (Month 1-2):
|
||||
- Deploy open-source stack alongside Datadog
|
||||
- Migrate metrics first (lowest risk)
|
||||
- Validate data accuracy
|
||||
|
||||
**Phase 2: Migrate Dashboards & Alerts** (Month 2-3):
|
||||
- Convert Datadog dashboards to Grafana
|
||||
- Translate alert rules (use DQL → PromQL guide below)
|
||||
- Train team on new tools
|
||||
|
||||
**Phase 3: Migrate Logs & Traces** (Month 3-4):
|
||||
- Set up Loki for log aggregation
|
||||
- Deploy Tempo/Jaeger for tracing
|
||||
- Update application instrumentation
|
||||
|
||||
**Phase 4: Decommission Datadog** (Month 4-5):
|
||||
- Confirm all functionality migrated
|
||||
- Cancel Datadog subscription
|
||||
|
||||
#### Query Translation: DQL → PromQL
|
||||
|
||||
When migrating dashboards and alerts, you'll need to translate Datadog queries to PromQL:
|
||||
|
||||
**Quick examples**:
|
||||
```
|
||||
# Average CPU
|
||||
Datadog: avg:system.cpu.user{*}
|
||||
Prometheus: avg(node_cpu_seconds_total{mode="user"})
|
||||
|
||||
# Request rate
|
||||
Datadog: sum:requests.count{*}.as_rate()
|
||||
Prometheus: sum(rate(http_requests_total[5m]))
|
||||
|
||||
# P95 latency
|
||||
Datadog: p95:request.duration{*}
|
||||
Prometheus: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
|
||||
# Error rate percentage
|
||||
Datadog: (sum:requests.errors{*}.as_rate() / sum:requests.count{*}.as_rate()) * 100
|
||||
Prometheus: (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
|
||||
```
|
||||
|
||||
**→ Full Translation Guide**: [references/dql_promql_translation.md](references/dql_promql_translation.md)
|
||||
|
||||
#### Cost Comparison
|
||||
|
||||
**Example: 100-host infrastructure**
|
||||
|
||||
| Component | Datadog (Annual) | Open Source (Annual) | Savings |
|
||||
|-----------|-----------------|---------------------|---------|
|
||||
| Infrastructure | $18,000 | $10,000 (self-hosted infra) | $8,000 |
|
||||
| Custom Metrics | $600 | Included | $600 |
|
||||
| Logs | $24,000 | $3,000 (storage) | $21,000 |
|
||||
| APM/Traces | $37,200 | $5,000 (storage) | $32,200 |
|
||||
| **Total** | **$79,800** | **$18,000** | **$61,800 (77%)** |
|
||||
|
||||
### Deep Dive: Datadog Migration
|
||||
|
||||
For comprehensive migration guidance including:
|
||||
- Detailed cost comparison and ROI calculations
|
||||
- Step-by-step migration instructions
|
||||
- Infrastructure sizing recommendations (CPU, RAM, storage)
|
||||
- Dashboard conversion tools and examples
|
||||
- Alert rule translation patterns
|
||||
- Application instrumentation changes (DogStatsD → Prometheus client)
|
||||
- Python scripts for exporting Datadog dashboards and monitors
|
||||
- Common challenges and solutions
|
||||
|
||||
**→ Read**: [references/datadog_migration.md](references/datadog_migration.md)
|
||||
|
||||
---
|
||||
|
||||
## 8. Tool Selection & Comparison
|
||||
|
||||
### Decision Matrix
|
||||
|
||||
**Choose Prometheus + Grafana if**:
|
||||
- ✅ Using Kubernetes
|
||||
- ✅ Want control and customization
|
||||
- ✅ Have ops capacity
|
||||
- ✅ Budget-conscious
|
||||
|
||||
**Choose Datadog if**:
|
||||
- ✅ Want ease of use
|
||||
- ✅ Need full observability now
|
||||
- ✅ Budget allows ($8k+/month for 100 hosts)
|
||||
|
||||
**Choose Grafana Stack (LGTM) if**:
|
||||
- ✅ Want open source full stack
|
||||
- ✅ Cost-effective solution
|
||||
- ✅ Cloud-native architecture
|
||||
|
||||
**Choose ELK Stack if**:
|
||||
- ✅ Heavy log analysis needs
|
||||
- ✅ Need powerful search
|
||||
- ✅ Have dedicated ops team
|
||||
|
||||
**Choose Cloud Native (CloudWatch/etc) if**:
|
||||
- ✅ Single cloud provider
|
||||
- ✅ Simple needs
|
||||
- ✅ Want minimal setup
|
||||
|
||||
### Cost Comparison (100 hosts, 1TB logs/month)
|
||||
|
||||
| Solution | Monthly Cost | Setup | Ops Burden |
|
||||
|----------|-------------|--------|------------|
|
||||
| Prometheus + Loki + Tempo | $1,500 | Medium | Medium |
|
||||
| Grafana Cloud | $3,000 | Low | Low |
|
||||
| Datadog | $8,000 | Low | None |
|
||||
| ELK Stack | $4,000 | High | High |
|
||||
| CloudWatch | $2,000 | Low | Low |
|
||||
|
||||
### Deep Dive: Tool Comparison
|
||||
|
||||
For comprehensive tool comparison including:
|
||||
- Metrics platforms (Prometheus, Datadog, New Relic, CloudWatch, Grafana Cloud)
|
||||
- Logging platforms (ELK, Loki, Splunk, CloudWatch Logs, Sumo Logic)
|
||||
- Tracing platforms (Jaeger, Tempo, Datadog APM, X-Ray)
|
||||
- Full-stack observability comparison
|
||||
- Recommendations by company size
|
||||
|
||||
**→ Read**: [references/tool_comparison.md](references/tool_comparison.md)
|
||||
|
||||
---
|
||||
|
||||
## 9. Troubleshooting & Analysis
|
||||
|
||||
### Health Check Validation
|
||||
|
||||
Validate health check endpoints against best practices:
|
||||
|
||||
```bash
|
||||
# Check single endpoint
|
||||
python3 scripts/health_check_validator.py https://api.example.com/health
|
||||
|
||||
# Check multiple endpoints
|
||||
python3 scripts/health_check_validator.py \
|
||||
https://api.example.com/health \
|
||||
https://api.example.com/readiness \
|
||||
--verbose
|
||||
```
|
||||
|
||||
**Checks for**:
|
||||
- ✓ Returns 200 status code
|
||||
- ✓ Response time < 1 second
|
||||
- ✓ Returns JSON format
|
||||
- ✓ Contains 'status' field
|
||||
- ✓ Includes version/build info
|
||||
- ✓ Checks dependencies
|
||||
- ✓ Disables caching
|
||||
|
||||
**→ Script**: [scripts/health_check_validator.py](scripts/health_check_validator.py)
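For illustration, a hypothetical Flask endpoint shaped to satisfy these checks; the version string and dependency probe are placeholders:

```python
from flask import Flask, jsonify

app = Flask(__name__)
APP_VERSION = "1.4.2"  # placeholder build/version info

def check_database() -> bool:
    return True  # replace with a real connectivity probe

@app.route("/health")
def health():
    deps = {"database": "ok" if check_database() else "failing"}
    status = "ok" if all(v == "ok" for v in deps.values()) else "degraded"
    resp = jsonify(status=status, version=APP_VERSION, dependencies=deps)
    resp.headers["Cache-Control"] = "no-store"  # keep intermediaries from caching health state
    return resp, 200 if status == "ok" else 503

if __name__ == "__main__":
    app.run(port=8080)
```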
|
||||
|
||||
### Common Troubleshooting Workflows
|
||||
|
||||
**High Latency Investigation**:
|
||||
1. Check dashboards for latency spike
|
||||
2. Query traces for slow operations
|
||||
3. Check database slow query log
|
||||
4. Check external API response times
|
||||
5. Review recent deployments
|
||||
6. Check resource utilization (CPU, memory)
|
||||
|
||||
**High Error Rate Investigation**:
|
||||
1. Check error logs for patterns
|
||||
2. Identify affected endpoints
|
||||
3. Check dependency health
|
||||
4. Review recent deployments
|
||||
5. Check resource limits
|
||||
6. Verify configuration
|
||||
|
||||
**Service Down Investigation**:
|
||||
1. Check if pods/instances are running
|
||||
2. Check health check endpoint
|
||||
3. Review recent deployments
|
||||
4. Check resource availability
|
||||
5. Check network connectivity
|
||||
6. Review logs for startup errors
|
||||
|
||||
---
|
||||
|
||||
## Quick Reference Commands
|
||||
|
||||
### Prometheus Queries
|
||||
|
||||
```promql
|
||||
# Request rate
|
||||
sum(rate(http_requests_total[5m]))
|
||||
|
||||
# Error rate
|
||||
sum(rate(http_requests_total{status=~"5.."}[5m]))
|
||||
/
|
||||
sum(rate(http_requests_total[5m])) * 100
|
||||
|
||||
# P95 latency
|
||||
histogram_quantile(0.95,
|
||||
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
|
||||
)
|
||||
|
||||
# CPU usage
|
||||
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
|
||||
|
||||
# Memory usage
|
||||
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
|
||||
```
|
||||
|
||||
### Kubernetes Commands
|
||||
|
||||
```bash
|
||||
# Check pod status
|
||||
kubectl get pods -n <namespace>
|
||||
|
||||
# View pod logs
|
||||
kubectl logs -f <pod-name> -n <namespace>
|
||||
|
||||
# Check pod resources
|
||||
kubectl top pods -n <namespace>
|
||||
|
||||
# Describe pod for events
|
||||
kubectl describe pod <pod-name> -n <namespace>
|
||||
|
||||
# Check recent deployments
|
||||
kubectl rollout history deployment/<name> -n <namespace>
|
||||
```
|
||||
|
||||
### Log Queries
|
||||
|
||||
**Elasticsearch**:
|
||||
```json
|
||||
GET /logs-*/_search
|
||||
{
|
||||
"query": {
|
||||
"bool": {
|
||||
"must": [
|
||||
{ "match": { "level": "error" } },
|
||||
{ "range": { "@timestamp": { "gte": "now-1h" } } }
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Loki (LogQL)**:
|
||||
```logql
|
||||
{job="app", level="error"} |= "error" | json
|
||||
```
|
||||
|
||||
**CloudWatch Insights**:
|
||||
```
|
||||
fields @timestamp, level, message
|
||||
| filter level = "error"
|
||||
| filter @timestamp > ago(1h)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resources Summary
|
||||
|
||||
### Scripts (automation and analysis)
|
||||
- `analyze_metrics.py` - Detect anomalies in Prometheus/CloudWatch metrics
|
||||
- `alert_quality_checker.py` - Audit alert rules against best practices
|
||||
- `slo_calculator.py` - Calculate SLO compliance and error budgets
|
||||
- `log_analyzer.py` - Parse logs for errors and patterns
|
||||
- `dashboard_generator.py` - Generate Grafana dashboards from templates
|
||||
- `health_check_validator.py` - Validate health check endpoints
|
||||
- `datadog_cost_analyzer.py` - Analyze Datadog usage and find cost waste
|
||||
|
||||
### References (deep-dive documentation)
|
||||
- `metrics_design.md` - Four Golden Signals, RED/USE methods, metric types
|
||||
- `alerting_best_practices.md` - Alert design, runbooks, on-call practices
|
||||
- `logging_guide.md` - Structured logging, aggregation patterns
|
||||
- `tracing_guide.md` - OpenTelemetry, distributed tracing
|
||||
- `slo_sla_guide.md` - SLI/SLO/SLA definitions, error budgets
|
||||
- `tool_comparison.md` - Comprehensive comparison of monitoring tools
|
||||
- `datadog_migration.md` - Complete guide for migrating from Datadog to OSS stack
|
||||
- `dql_promql_translation.md` - Datadog Query Language to PromQL translation reference
|
||||
|
||||
### Templates (ready-to-use configurations)
|
||||
- `prometheus-alerts/webapp-alerts.yml` - Production-ready web app alerts
|
||||
- `prometheus-alerts/kubernetes-alerts.yml` - Kubernetes monitoring alerts
|
||||
- `otel-config/collector-config.yaml` - OpenTelemetry Collector configuration
|
||||
- `runbooks/incident-runbook-template.md` - Incident response template
|
||||
|
||||
---
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Metrics
|
||||
- Start with Four Golden Signals
|
||||
- Use appropriate metric types (counter, gauge, histogram)
|
||||
- Keep cardinality low (avoid high-cardinality labels)
|
||||
- Follow naming conventions
|
||||
|
||||
### Logging
|
||||
- Use structured logging (JSON)
|
||||
- Include request IDs for tracing
|
||||
- Set appropriate log levels
|
||||
- Redact PII before logging
|
||||
|
||||
### Alerting
|
||||
- Make every alert actionable
|
||||
- Alert on symptoms, not causes
|
||||
- Use multi-window burn rate alerts
|
||||
- Include runbook links
|
||||
|
||||
### Tracing
|
||||
- Sample appropriately (1-10% in production)
|
||||
- Always record errors
|
||||
- Use semantic conventions
|
||||
- Propagate context between services
|
||||
|
||||
### SLOs
|
||||
- Start with current performance
|
||||
- Set realistic targets
|
||||
- Define error budget policies
|
||||
- Review and adjust quarterly
|
||||
227  assets/templates/otel-config/collector-config.yaml  Normal file
@@ -0,0 +1,227 @@
|
||||
# OpenTelemetry Collector Configuration
|
||||
# Receives metrics, logs, and traces and exports to various backends
|
||||
|
||||
receivers:
|
||||
# OTLP receiver (standard OpenTelemetry protocol)
|
||||
otlp:
|
||||
protocols:
|
||||
grpc:
|
||||
endpoint: 0.0.0.0:4317
|
||||
http:
|
||||
endpoint: 0.0.0.0:4318
|
||||
|
||||
# Prometheus receiver (scrape Prometheus endpoints)
|
||||
prometheus:
|
||||
config:
|
||||
scrape_configs:
|
||||
- job_name: 'otel-collector'
|
||||
scrape_interval: 30s
|
||||
static_configs:
|
||||
- targets: ['localhost:8888']
|
||||
|
||||
# Host metrics (CPU, memory, disk, network)
|
||||
hostmetrics:
|
||||
collection_interval: 30s
|
||||
scrapers:
|
||||
cpu:
|
||||
memory:
|
||||
disk:
|
||||
network:
|
||||
filesystem:
|
||||
load:
|
||||
|
||||
# Kubernetes receiver (cluster metrics)
|
||||
k8s_cluster:
|
||||
auth_type: serviceAccount
|
||||
node_conditions_to_report: [Ready, MemoryPressure, DiskPressure]
|
||||
distribution: kubernetes
|
||||
|
||||
# Zipkin receiver (legacy tracing)
|
||||
zipkin:
|
||||
endpoint: 0.0.0.0:9411
|
||||
|
||||
processors:
|
||||
# Batch processor (improves performance)
|
||||
batch:
|
||||
timeout: 10s
|
||||
send_batch_size: 1024
|
||||
send_batch_max_size: 2048
|
||||
|
||||
# Memory limiter (prevent OOM)
|
||||
memory_limiter:
|
||||
check_interval: 1s
|
||||
limit_mib: 512
|
||||
spike_limit_mib: 128
|
||||
|
||||
# Resource processor (add resource attributes)
|
||||
resource:
|
||||
attributes:
|
||||
- key: environment
|
||||
value: production
|
||||
action: insert
|
||||
- key: cluster.name
|
||||
value: prod-cluster
|
||||
action: insert
|
||||
|
||||
# Attributes processor (modify span/metric attributes)
|
||||
attributes:
|
||||
actions:
|
||||
- key: http.url
|
||||
action: delete # Remove potentially sensitive URLs
|
||||
- key: db.statement
|
||||
action: hash # Hash SQL queries for privacy
|
||||
|
||||
# Filter processor (drop unwanted data)
|
||||
filter:
|
||||
metrics:
|
||||
# Drop metrics matching criteria
|
||||
exclude:
|
||||
match_type: regexp
|
||||
metric_names:
|
||||
- ^go_.* # Drop Go runtime metrics
|
||||
- ^process_.* # Drop process metrics
|
||||
|
||||
# Tail sampling (intelligent trace sampling)
|
||||
tail_sampling:
|
||||
decision_wait: 10s
|
||||
num_traces: 100
|
||||
policies:
|
||||
# Always sample errors
|
||||
- name: error-policy
|
||||
type: status_code
|
||||
status_code:
|
||||
status_codes: [ERROR]
|
||||
|
||||
# Sample slow traces
|
||||
- name: latency-policy
|
||||
type: latency
|
||||
latency:
|
||||
threshold_ms: 1000
|
||||
|
||||
# Sample 10% of others
|
||||
- name: probabilistic-policy
|
||||
type: probabilistic
|
||||
probabilistic:
|
||||
sampling_percentage: 10
|
||||
|
||||
# Span processor (modify spans)
|
||||
span:
|
||||
name:
|
||||
to_attributes:
|
||||
rules:
|
||||
- ^\/api\/v1\/users\/(?P<user_id>.*)$
|
||||
from_attributes:
|
||||
- db.name
|
||||
- http.method
|
||||
|
||||
exporters:
|
||||
# Prometheus exporter (expose metrics endpoint)
|
||||
prometheus:
|
||||
endpoint: 0.0.0.0:8889
|
||||
namespace: otel
|
||||
|
||||
# OTLP exporters (send to backends)
|
||||
otlp/tempo:
|
||||
endpoint: tempo:4317
|
||||
tls:
|
||||
insecure: true
|
||||
|
||||
otlp/mimir:
|
||||
endpoint: mimir:4317
|
||||
tls:
|
||||
insecure: true
|
||||
|
||||
# Loki exporter (for logs)
|
||||
loki:
|
||||
endpoint: http://loki:3100/loki/api/v1/push
|
||||
labels:
|
||||
resource:
|
||||
service.name: "service_name"
|
||||
service.namespace: "service_namespace"
|
||||
attributes:
|
||||
level: "level"
|
||||
|
||||
# Jaeger exporter (alternative tracing backend)
|
||||
jaeger:
|
||||
endpoint: jaeger:14250
|
||||
tls:
|
||||
insecure: true
|
||||
|
||||
# Elasticsearch exporter (for logs)
|
||||
elasticsearch:
|
||||
endpoints:
|
||||
- http://elasticsearch:9200
|
||||
logs_index: otel-logs
|
||||
traces_index: otel-traces
|
||||
|
||||
# CloudWatch exporter (AWS)
|
||||
awscloudwatch:
|
||||
region: us-east-1
|
||||
namespace: MyApp
|
||||
log_group_name: /aws/otel/logs
|
||||
log_stream_name: otel-collector
|
||||
|
||||
# Datadog exporter
|
||||
datadog:
|
||||
api:
|
||||
key: ${DD_API_KEY}
|
||||
site: datadoghq.com
|
||||
|
||||
# File exporter (debugging)
|
||||
file:
|
||||
path: /tmp/otel-output.json
|
||||
|
||||
# Logging exporter (console output for debugging)
|
||||
logging:
|
||||
verbosity: detailed
|
||||
sampling_initial: 5
|
||||
sampling_thereafter: 200
|
||||
|
||||
extensions:
|
||||
# Health check endpoint
|
||||
health_check:
|
||||
endpoint: 0.0.0.0:13133
|
||||
|
||||
# Pprof endpoint (for profiling)
|
||||
pprof:
|
||||
endpoint: 0.0.0.0:1777
|
||||
|
||||
# ZPages (internal diagnostics)
|
||||
zpages:
|
||||
endpoint: 0.0.0.0:55679
|
||||
|
||||
service:
|
||||
extensions: [health_check, pprof, zpages]
|
||||
|
||||
pipelines:
|
||||
# Traces pipeline
|
||||
traces:
|
||||
receivers: [otlp, zipkin]
|
||||
processors: [memory_limiter, batch, tail_sampling, resource, span]
|
||||
exporters: [otlp/tempo, jaeger, logging]
|
||||
|
||||
# Metrics pipeline
|
||||
metrics:
|
||||
receivers: [otlp, prometheus, hostmetrics, k8s_cluster]
|
||||
processors: [memory_limiter, batch, filter, resource]
|
||||
exporters: [otlp/mimir, prometheus, awscloudwatch]
|
||||
|
||||
# Logs pipeline
|
||||
logs:
|
||||
receivers: [otlp]
|
||||
processors: [memory_limiter, batch, resource, attributes]
|
||||
exporters: [loki, elasticsearch, awscloudwatch]
|
||||
|
||||
# Telemetry (collector's own metrics)
|
||||
telemetry:
|
||||
logs:
|
||||
level: info
|
||||
metrics:
|
||||
address: 0.0.0.0:8888
|
||||
|
||||
# Notes:
|
||||
# 1. Replace ${DD_API_KEY} with actual API key or use environment variable
|
||||
# 2. Adjust endpoints to match your infrastructure
|
||||
# 3. Comment out exporters you don't use
|
||||
# 4. Adjust sampling rates based on your volume and needs
|
||||
# 5. Add TLS configuration for production deployments
|
||||
293  assets/templates/prometheus-alerts/kubernetes-alerts.yml  Normal file
@@ -0,0 +1,293 @@
|
||||
---
|
||||
# Prometheus Alert Rules for Kubernetes
|
||||
# Covers pods, nodes, deployments, and resource usage
|
||||
|
||||
groups:
|
||||
- name: kubernetes_pods
|
||||
interval: 30s
|
||||
rules:
|
||||
# Pod crash looping
|
||||
- alert: PodCrashLooping
|
||||
expr: |
|
||||
rate(kube_pod_container_status_restarts_total[15m]) > 0
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "Pod is crash looping - {{ $labels.namespace }}/{{ $labels.pod }}"
|
||||
description: |
|
||||
Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly ({{ $value }} restarts/second averaged over the last 15 minutes).
|
||||
|
||||
Check pod logs:
|
||||
kubectl logs -n {{ $labels.namespace }} {{ $labels.pod }} --previous
|
||||
runbook_url: "https://runbooks.example.com/pod-crash-loop"
|
||||
|
||||
# Pod not ready
|
||||
- alert: PodNotReady
|
||||
expr: |
|
||||
sum by (namespace, pod, phase) (kube_pod_status_phase{phase!~"Running|Succeeded"}) > 0
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "Pod not ready - {{ $labels.namespace }}/{{ $labels.pod }}"
|
||||
description: |
|
||||
Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in the {{ $labels.phase }} phase for 10 minutes.
|
||||
|
||||
Investigate:
|
||||
kubectl describe pod -n {{ $labels.namespace }} {{ $labels.pod }}
|
||||
runbook_url: "https://runbooks.example.com/pod-not-ready"
|
||||
|
||||
# Pod OOMKilled
|
||||
- alert: PodOOMKilled
|
||||
expr: |
|
||||
sum by (namespace, pod) (kube_pod_container_status_terminated_reason{reason="OOMKilled"}) > 0
|
||||
for: 1m
|
||||
labels:
|
||||
severity: warning
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "Pod killed due to OOM - {{ $labels.namespace }}/{{ $labels.pod }}"
|
||||
description: |
|
||||
Pod {{ $labels.namespace }}/{{ $labels.pod }} was killed due to out-of-memory.
|
||||
|
||||
Increase memory limits or investigate memory leak.
|
||||
runbook_url: "https://runbooks.example.com/oom-killed"
|
||||
|
||||
- name: kubernetes_deployments
|
||||
interval: 30s
|
||||
rules:
|
||||
# Deployment replica mismatch
|
||||
- alert: DeploymentReplicasMismatch
|
||||
expr: |
|
||||
kube_deployment_spec_replicas != kube_deployment_status_replicas_available
|
||||
for: 15m
|
||||
labels:
|
||||
severity: warning
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "Deployment replicas mismatch - {{ $labels.namespace }}/{{ $labels.deployment }}"
|
||||
description: |
|
||||
Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has been running with
|
||||
fewer replicas than desired for 15 minutes.
|
||||
|
||||
Desired: {{ $value }}
|
||||
Available: Check deployment status
|
||||
runbook_url: "https://runbooks.example.com/replica-mismatch"
|
||||
|
||||
# Deployment rollout stuck
|
||||
- alert: DeploymentRolloutStuck
|
||||
expr: |
|
||||
kube_deployment_status_condition{condition="Progressing", status="false"} > 0
|
||||
for: 15m
|
||||
labels:
|
||||
severity: warning
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "Deployment rollout stuck - {{ $labels.namespace }}/{{ $labels.deployment }}"
|
||||
description: |
|
||||
Deployment {{ $labels.namespace }}/{{ $labels.deployment }} rollout is stuck.
|
||||
|
||||
Check rollout status:
|
||||
kubectl rollout status deployment/{{ $labels.deployment }} -n {{ $labels.namespace }}
|
||||
runbook_url: "https://runbooks.example.com/rollout-stuck"
|
||||
|
||||
- name: kubernetes_nodes
|
||||
interval: 30s
|
||||
rules:
|
||||
# Node not ready
|
||||
- alert: NodeNotReady
|
||||
expr: |
|
||||
kube_node_status_condition{condition="Ready",status="true"} == 0
|
||||
for: 5m
|
||||
labels:
|
||||
severity: critical
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "Node not ready - {{ $labels.node }}"
|
||||
description: |
|
||||
Node {{ $labels.node }} has been NotReady for 5 minutes.
|
||||
|
||||
This will affect pod scheduling and availability.
|
||||
|
||||
Check node status:
|
||||
kubectl describe node {{ $labels.node }}
|
||||
runbook_url: "https://runbooks.example.com/node-not-ready"
|
||||
|
||||
# Node memory pressure
|
||||
- alert: NodeMemoryPressure
|
||||
expr: |
|
||||
kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "Node under memory pressure - {{ $labels.node }}"
|
||||
description: |
|
||||
Node {{ $labels.node }} is experiencing memory pressure.
|
||||
|
||||
Pods may be evicted. Consider scaling up or evicting low-priority pods.
|
||||
runbook_url: "https://runbooks.example.com/memory-pressure"
|
||||
|
||||
# Node disk pressure
|
||||
- alert: NodeDiskPressure
|
||||
expr: |
|
||||
kube_node_status_condition{condition="DiskPressure",status="true"} == 1
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "Node under disk pressure - {{ $labels.node }}"
|
||||
description: |
|
||||
Node {{ $labels.node }} is experiencing disk pressure.
|
||||
|
||||
Clean up disk space or add capacity.
|
||||
runbook_url: "https://runbooks.example.com/disk-pressure"
|
||||
|
||||
# Node high CPU
|
||||
- alert: NodeHighCPU
|
||||
expr: |
|
||||
(1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)) * 100 > 80
|
||||
for: 15m
|
||||
labels:
|
||||
severity: warning
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "Node high CPU usage - {{ $labels.instance }}"
|
||||
description: |
|
||||
Node {{ $labels.instance }} CPU usage is {{ $value | humanize }}%.
|
||||
|
||||
Check for resource-intensive pods or scale cluster.
|
||||
runbook_url: "https://runbooks.example.com/node-high-cpu"
|
||||
|
||||
- name: kubernetes_resources
|
||||
interval: 30s
|
||||
rules:
|
||||
# Container CPU throttling
|
||||
- alert: ContainerCPUThrottling
|
||||
expr: |
|
||||
rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.5
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "Container CPU throttling - {{ $labels.namespace }}/{{ $labels.pod }}"
|
||||
description: |
|
||||
Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }}
|
||||
is being CPU throttled.
|
||||
|
||||
CPU throttling rate: {{ $value | humanize }}
|
||||
|
||||
Consider increasing CPU limits.
|
||||
runbook_url: "https://runbooks.example.com/cpu-throttling"
|
||||
|
||||
# Container memory usage high
|
||||
- alert: ContainerMemoryUsageHigh
|
||||
expr: |
|
||||
(container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.9
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "Container memory usage high - {{ $labels.namespace }}/{{ $labels.pod }}"
|
||||
description: |
|
||||
Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }}
|
||||
is using {{ $value | humanizePercentage }} of its memory limit.
|
||||
|
||||
Risk of OOMKill. Consider increasing memory limits.
|
||||
runbook_url: "https://runbooks.example.com/high-memory"
|
||||
|
||||
- name: kubernetes_pv
|
||||
interval: 30s
|
||||
rules:
|
||||
# PersistentVolume nearing full
|
||||
- alert: PersistentVolumeFillingUp
|
||||
expr: |
|
||||
(kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) < 0.15
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "PersistentVolume filling up - {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }}"
|
||||
description: |
|
||||
PersistentVolume {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }}
|
||||
is {{ $value | humanizePercentage }} full.
|
||||
|
||||
Available space is running low. Consider expanding volume.
|
||||
runbook_url: "https://runbooks.example.com/pv-filling-up"
|
||||
|
||||
# PersistentVolume critically full
|
||||
- alert: PersistentVolumeCriticallyFull
|
||||
expr: |
|
||||
(kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) < 0.05
|
||||
for: 5m
|
||||
labels:
|
||||
severity: critical
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "PersistentVolume critically full - {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }}"
|
||||
description: |
|
||||
PersistentVolume {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }}
|
||||
is {{ $value | humanizePercentage }} full.
|
||||
|
||||
Immediate action required to prevent application failures.
|
||||
runbook_url: "https://runbooks.example.com/pv-critically-full"
|
||||
|
||||
- name: kubernetes_jobs
|
||||
interval: 30s
|
||||
rules:
|
||||
# Job failed
|
||||
- alert: JobFailed
|
||||
expr: |
|
||||
kube_job_status_failed > 0
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "Job failed - {{ $labels.namespace }}/{{ $labels.job_name }}"
|
||||
description: |
|
||||
Job {{ $labels.namespace }}/{{ $labels.job_name }} has failed.
|
||||
|
||||
Check job logs:
|
||||
kubectl logs job/{{ $labels.job_name }} -n {{ $labels.namespace }}
|
||||
runbook_url: "https://runbooks.example.com/job-failed"
|
||||
|
||||
# CronJob not running
|
||||
- alert: CronJobNotRunning
|
||||
expr: |
|
||||
time() - kube_cronjob_status_last_schedule_time > 3600
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
team: platform
|
||||
component: kubernetes
|
||||
annotations:
|
||||
summary: "CronJob not running - {{ $labels.namespace }}/{{ $labels.cronjob }}"
|
||||
description: |
|
||||
CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} hasn't run in over an hour.
|
||||
|
||||
Check CronJob status:
|
||||
kubectl describe cronjob {{ $labels.cronjob }} -n {{ $labels.namespace }}
|
||||
runbook_url: "https://runbooks.example.com/cronjob-not-running"
|
||||
243  assets/templates/prometheus-alerts/webapp-alerts.yml  Normal file
@@ -0,0 +1,243 @@
|
||||
---
|
||||
# Prometheus Alert Rules for Web Applications
|
||||
# Based on SLO best practices and multi-window burn rate alerting
|
||||
|
||||
groups:
|
||||
- name: webapp_availability
|
||||
interval: 30s
|
||||
rules:
|
||||
# Fast burn rate alert (1h window) - SLO: 99.9%
|
||||
- alert: ErrorBudgetFastBurn
|
||||
expr: |
|
||||
(
|
||||
sum(rate(http_requests_total{job="webapp",status=~"5.."}[1h]))
|
||||
/
|
||||
sum(rate(http_requests_total{job="webapp"}[1h]))
|
||||
) > (14.4 * 0.001)
|
||||
for: 2m
|
||||
labels:
|
||||
severity: critical
|
||||
team: backend
|
||||
component: webapp
|
||||
annotations:
|
||||
summary: "Fast error budget burn - {{ $labels.job }}"
|
||||
description: |
|
||||
Error rate is {{ $value | humanizePercentage }} over the last hour,
|
||||
burning through error budget at 14.4x rate.
|
||||
|
||||
At this rate, the monthly error budget will be exhausted in 2 days.
|
||||
|
||||
Immediate investigation required.
|
||||
runbook_url: "https://runbooks.example.com/error-budget-burn"
|
||||
dashboard: "https://grafana.example.com/d/webapp"
|
||||
|
||||
# Slow burn rate alert (6h window)
|
||||
- alert: ErrorBudgetSlowBurn
|
||||
expr: |
|
||||
(
|
||||
sum(rate(http_requests_total{job="webapp",status=~"5.."}[6h]))
|
||||
/
|
||||
sum(rate(http_requests_total{job="webapp"}[6h]))
|
||||
) > (6 * 0.001)
|
||||
for: 30m
|
||||
labels:
|
||||
severity: warning
|
||||
team: backend
|
||||
component: webapp
|
||||
annotations:
|
||||
summary: "Elevated error budget burn - {{ $labels.job }}"
|
||||
description: |
|
||||
Error rate is {{ $value | humanizePercentage }} over the last 6 hours,
|
||||
burning through error budget at 6x rate.
|
||||
|
||||
Monitor closely and investigate if trend continues.
|
||||
runbook_url: "https://runbooks.example.com/error-budget-burn"
|
||||
|
||||
# Service down alert
|
||||
- alert: WebAppDown
|
||||
expr: up{job="webapp"} == 0
|
||||
for: 2m
|
||||
labels:
|
||||
severity: critical
|
||||
team: backend
|
||||
component: webapp
|
||||
annotations:
|
||||
summary: "Web application is down - {{ $labels.instance }}"
|
||||
description: |
|
||||
Web application instance {{ $labels.instance }} has been down for 2 minutes.
|
||||
|
||||
Check service health and logs immediately.
|
||||
runbook_url: "https://runbooks.example.com/service-down"
|
||||
|
||||
- name: webapp_latency
|
||||
interval: 30s
|
||||
rules:
|
||||
# High latency (p95)
|
||||
- alert: HighLatencyP95
|
||||
expr: |
|
||||
histogram_quantile(0.95,
|
||||
sum(rate(http_request_duration_seconds_bucket{job="webapp"}[5m])) by (le)
|
||||
) > 0.5
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
team: backend
|
||||
component: webapp
|
||||
annotations:
|
||||
summary: "High p95 latency - {{ $labels.job }}"
|
||||
description: |
|
||||
P95 request latency is {{ $value }}s, exceeding 500ms threshold.
|
||||
|
||||
This may impact user experience. Check for:
|
||||
- Slow database queries
|
||||
- External API issues
|
||||
- Resource saturation
|
||||
runbook_url: "https://runbooks.example.com/high-latency"
|
||||
dashboard: "https://grafana.example.com/d/webapp-latency"
|
||||
|
||||
# Very high latency (p99)
|
||||
- alert: HighLatencyP99
|
||||
expr: |
|
||||
histogram_quantile(0.99,
|
||||
sum(rate(http_request_duration_seconds_bucket{job="webapp"}[5m])) by (le)
|
||||
) > 2
|
||||
for: 5m
|
||||
labels:
|
||||
severity: critical
|
||||
team: backend
|
||||
component: webapp
|
||||
annotations:
|
||||
summary: "Critical latency degradation - {{ $labels.job }}"
|
||||
description: |
|
||||
P99 request latency is {{ $value }}s, exceeding 2s threshold.
|
||||
|
||||
Severe performance degradation detected.
|
||||
runbook_url: "https://runbooks.example.com/high-latency"
|
||||
|
||||
- name: webapp_resources
|
||||
interval: 30s
|
||||
rules:
|
||||
# High CPU
|
||||
- alert: HighCPU
|
||||
expr: |
|
||||
rate(process_cpu_seconds_total{job="webapp"}[5m]) * 100 > 80
|
||||
for: 15m
|
||||
labels:
|
||||
severity: warning
|
||||
team: backend
|
||||
component: webapp
|
||||
annotations:
|
||||
summary: "High CPU usage - {{ $labels.instance }}"
|
||||
description: |
|
||||
CPU usage is {{ $value | humanize }}% on {{ $labels.instance }}.
|
||||
|
||||
Consider scaling up or investigating CPU-intensive operations.
|
||||
runbook_url: "https://runbooks.example.com/high-cpu"
|
||||
|
||||
# High memory
|
||||
- alert: HighMemory
|
||||
expr: |
|
||||
(process_resident_memory_bytes{job="webapp"} / node_memory_MemTotal_bytes) * 100 > 80
|
||||
for: 15m
|
||||
labels:
|
||||
severity: warning
|
||||
team: backend
|
||||
component: webapp
|
||||
annotations:
|
||||
summary: "High memory usage - {{ $labels.instance }}"
|
||||
description: |
|
||||
Memory usage is {{ $value | humanize }}% on {{ $labels.instance }}.
|
||||
|
||||
Check for memory leaks or consider scaling up.
|
||||
runbook_url: "https://runbooks.example.com/high-memory"
|
||||
|
||||
- name: webapp_traffic
|
||||
interval: 30s
|
||||
rules:
|
||||
# Traffic spike
|
||||
- alert: TrafficSpike
|
||||
expr: |
|
||||
sum(rate(http_requests_total{job="webapp"}[5m]))
|
||||
>
|
||||
1.5 * sum(rate(http_requests_total{job="webapp"}[5m] offset 1h))
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
team: backend
|
||||
component: webapp
|
||||
annotations:
|
||||
summary: "Traffic spike detected - {{ $labels.job }}"
|
||||
description: |
|
||||
Request rate increased by 50% compared to 1 hour ago.
|
||||
|
||||
Current: {{ $value | humanize }} req/s
|
||||
|
||||
This could be:
|
||||
- Legitimate traffic increase
|
||||
- DDoS attack
|
||||
- Retry storm
|
||||
|
||||
Monitor closely and be ready to scale.
|
||||
runbook_url: "https://runbooks.example.com/traffic-spike"
|
||||
|
||||
# Traffic drop (potential issue)
|
||||
- alert: TrafficDrop
|
||||
expr: |
|
||||
sum(rate(http_requests_total{job="webapp"}[5m]))
|
||||
<
|
||||
0.5 * sum(rate(http_requests_total{job="webapp"}[5m] offset 1h))
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
team: backend
|
||||
component: webapp
|
||||
annotations:
|
||||
summary: "Traffic drop detected - {{ $labels.job }}"
|
||||
description: |
|
||||
Request rate dropped by 50% compared to 1 hour ago.
|
||||
|
||||
This could indicate:
|
||||
- Upstream service issue
|
||||
- DNS problems
|
||||
- Load balancer misconfiguration
|
||||
runbook_url: "https://runbooks.example.com/traffic-drop"
|
||||
|
||||
- name: webapp_dependencies
|
||||
interval: 30s
|
||||
rules:
|
||||
# Database connection pool exhaustion
|
||||
- alert: DatabasePoolExhausted
|
||||
expr: |
|
||||
(db_connection_pool_active / db_connection_pool_max) > 0.9
|
||||
for: 5m
|
||||
labels:
|
||||
severity: critical
|
||||
team: backend
|
||||
component: database
|
||||
annotations:
|
||||
summary: "Database connection pool near exhaustion"
|
||||
description: |
|
||||
Connection pool is {{ $value | humanizePercentage }} full.
|
||||
|
||||
This will cause request failures. Immediate action required.
|
||||
runbook_url: "https://runbooks.example.com/db-pool-exhausted"
|
||||
|
||||
# External API errors
|
||||
- alert: ExternalAPIErrors
|
||||
expr: |
|
||||
sum(rate(external_api_requests_total{status=~"5.."}[5m])) by (api)
|
||||
/
|
||||
sum(rate(external_api_requests_total[5m])) by (api)
|
||||
> 0.1
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
team: backend
|
||||
component: integration
|
||||
annotations:
|
||||
summary: "High error rate from external API - {{ $labels.api }}"
|
||||
description: |
|
||||
{{ $labels.api }} is returning errors at {{ $value | humanizePercentage }} rate.
|
||||
|
||||
Check API status page and consider enabling circuit breaker.
|
||||
runbook_url: "https://runbooks.example.com/external-api-errors"
|
||||
409  assets/templates/runbooks/incident-runbook-template.md  Normal file
@@ -0,0 +1,409 @@
|
||||
# Runbook: [Alert Name]
|
||||
|
||||
## Overview
|
||||
|
||||
**Alert Name**: [e.g., HighLatency, ServiceDown, ErrorBudgetBurn]
|
||||
|
||||
**Severity**: [Critical | Warning | Info]
|
||||
|
||||
**Team**: [e.g., Backend, Platform, Database]
|
||||
|
||||
**Component**: [e.g., API Gateway, User Service, PostgreSQL]
|
||||
|
||||
**What it means**: [One-line description of what this alert indicates]
|
||||
|
||||
**User impact**: [How does this affect users? High/Medium/Low]
|
||||
|
||||
**Urgency**: [How quickly must this be addressed? Immediate/Hours/Days]
|
||||
|
||||
---
|
||||
|
||||
## Alert Details
|
||||
|
||||
### When This Alert Fires
|
||||
|
||||
This alert fires when:
|
||||
- [Specific condition, e.g., "P95 latency exceeds 500ms for 10 minutes"]
|
||||
- [Any additional conditions]
|
||||
|
||||
### Symptoms
|
||||
|
||||
Users will experience:
|
||||
- [ ] Slow response times
|
||||
- [ ] Errors or failures
|
||||
- [ ] Service unavailable
|
||||
- [ ] [Other symptoms]
|
||||
|
||||
### Probable Causes
|
||||
|
||||
Common causes include:
|
||||
1. **[Cause 1]**: [Description]
|
||||
- Example: Database overload due to slow queries
|
||||
2. **[Cause 2]**: [Description]
|
||||
- Example: Memory leak causing OOM errors
|
||||
3. **[Cause 3]**: [Description]
|
||||
- Example: Upstream service degradation
|
||||
|
||||
---
|
||||
|
||||
## Investigation Steps
|
||||
|
||||
### 1. Check Service Health
|
||||
|
||||
**Dashboard**: [Link to primary dashboard]
|
||||
|
||||
**Key metrics to check**:
|
||||
```promql
|
||||
# Request rate
|
||||
sum(rate(http_requests_total[5m]))
|
||||
|
||||
# Error rate
|
||||
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
|
||||
|
||||
# Latency (p95, p99)
|
||||
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
```
|
||||
|
||||
**What to look for**:
|
||||
- [ ] Has traffic spiked recently?
|
||||
- [ ] Is error rate elevated?
|
||||
- [ ] Are any endpoints particularly slow?
|
||||
|
||||
### 2. Check Recent Changes
|
||||
|
||||
**Deployments**:
|
||||
```bash
|
||||
# Kubernetes
|
||||
kubectl rollout history deployment/[service-name] -n [namespace]
|
||||
|
||||
# Check when last deployed
|
||||
kubectl get pods -n [namespace] -o wide | grep [service-name]
|
||||
```
|
||||
|
||||
**What to look for**:
|
||||
- [ ] Was there a recent deployment?
|
||||
- [ ] Did alert start after deployment?
|
||||
- [ ] Any configuration changes?
|
||||
|
||||
### 3. Check Logs
|
||||
|
||||
**Log query** (adjust for your log system):
|
||||
```bash
|
||||
# Kubernetes
|
||||
kubectl logs deployment/[service-name] -n [namespace] --tail=100 | grep ERROR
|
||||
|
||||
# Elasticsearch/Kibana
|
||||
GET /logs-*/_search
|
||||
{
|
||||
"query": {
|
||||
"bool": {
|
||||
"must": [
|
||||
{ "match": { "service": "[service-name]" } },
|
||||
{ "match": { "level": "error" } },
|
||||
{ "range": { "@timestamp": { "gte": "now-30m" } } }
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
# Loki/LogQL
|
||||
{job="[service-name]"} |= "error" | json | level="error"
|
||||
```
|
||||
|
||||
**What to look for**:
|
||||
- [ ] Repeated error messages
|
||||
- [ ] Stack traces
|
||||
- [ ] Connection errors
|
||||
- [ ] Timeout errors
|
||||
|
||||
### 4. Check Dependencies
|
||||
|
||||
**Database**:
|
||||
```bash
|
||||
# Check active connections
|
||||
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
|
||||
|
||||
# Check slow queries
|
||||
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
|
||||
FROM pg_stat_activity
|
||||
WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 seconds';
|
||||
```
|
||||
|
||||
**External APIs**:
|
||||
- [ ] Check status pages: [Link to status pages]
|
||||
- [ ] Check API error rates in dashboard
|
||||
- [ ] Test API endpoints manually
|
||||
|
||||
**Cache** (Redis/Memcached):
|
||||
```bash
|
||||
# Redis info
|
||||
redis-cli -h [host] INFO stats
|
||||
|
||||
# Check memory usage
|
||||
redis-cli -h [host] INFO memory
|
||||
```
|
||||
|
||||
### 5. Check Resource Usage
|
||||
|
||||
**CPU and Memory**:
|
||||
```bash
|
||||
# Kubernetes
|
||||
kubectl top pods -n [namespace] | grep [service-name]
|
||||
|
||||
# Node metrics
|
||||
kubectl top nodes
|
||||
```
|
||||
|
||||
**Prometheus queries**:
|
||||
```promql
|
||||
# CPU usage by pod
|
||||
sum(rate(container_cpu_usage_seconds_total{pod=~"[service-name].*"}[5m])) by (pod)
|
||||
|
||||
# Memory usage by pod
|
||||
sum(container_memory_usage_bytes{pod=~"[service-name].*"}) by (pod)
|
||||
```
|
||||
|
||||
**What to look for**:
|
||||
- [ ] CPU throttling
|
||||
- [ ] Memory approaching limits
|
||||
- [ ] Disk space issues
|
||||
|
||||
### 6. Check Traces (if available)
|
||||
|
||||
**Trace query**:
|
||||
```bash
|
||||
# Jaeger
|
||||
# Search for slow traces (> 1s) in last 30 minutes
|
||||
|
||||
# Tempo/TraceQL
|
||||
{ duration > 1s && resource.service.name = "[service-name]" }
|
||||
```
|
||||
|
||||
**What to look for**:
|
||||
- [ ] Which operation is slow?
|
||||
- [ ] Where is time spent? (DB, external API, service logic)
|
||||
- [ ] Any N+1 query patterns?
|
||||
|
||||
---
|
||||
|
||||
## Common Scenarios and Solutions
|
||||
|
||||
### Scenario 1: Recent Deployment Caused Issue
|
||||
|
||||
**Symptoms**:
|
||||
- Alert started immediately after deployment
|
||||
- Error logs correlate with new code
|
||||
|
||||
**Solution**:
|
||||
```bash
|
||||
# Rollback deployment
|
||||
kubectl rollout undo deployment/[service-name] -n [namespace]
|
||||
|
||||
# Verify rollback succeeded
|
||||
kubectl rollout status deployment/[service-name] -n [namespace]
|
||||
|
||||
# Monitor for alert resolution
|
||||
```
|
||||
|
||||
**Follow-up**:
|
||||
- [ ] Create incident report
|
||||
- [ ] Review deployment process
|
||||
- [ ] Add pre-deployment checks
|
||||
|
||||
### Scenario 2: Database Performance Issue
|
||||
|
||||
**Symptoms**:
|
||||
- Slow query logs show problematic queries
|
||||
- Database CPU or connection pool exhausted
|
||||
|
||||
**Solution**:
|
||||
```bash
|
||||
# Identify slow query
|
||||
# Kill long-running query (use with caution)
|
||||
SELECT pg_cancel_backend([pid]);
|
||||
|
||||
# Or terminate if cancel doesn't work
|
||||
SELECT pg_terminate_backend([pid]);
|
||||
|
||||
# Add index if missing (in maintenance window)
|
||||
CREATE INDEX CONCURRENTLY idx_name ON table_name (column_name);
|
||||
```
|
||||
|
||||
**Follow-up**:
|
||||
- [ ] Add query performance test
|
||||
- [ ] Review and optimize query
|
||||
- [ ] Consider read replicas
|
||||
|
||||
### Scenario 3: Memory Leak
|
||||
|
||||
**Symptoms**:
|
||||
- Memory usage gradually increasing
|
||||
- Eventually OOMKilled
|
||||
- Restarts temporarily fix issue
|
||||
|
||||
**Solution**:
|
||||
```bash
|
||||
# Immediate: Restart pods
|
||||
kubectl rollout restart deployment/[service-name] -n [namespace]
|
||||
|
||||
# Increase memory limits (temporary)
|
||||
kubectl set resources deployment/[service-name] -n [namespace] \
|
||||
--limits=memory=2Gi
|
||||
```
|
||||
|
||||
**Follow-up**:
|
||||
- [ ] Profile application for memory leaks
|
||||
- [ ] Add memory usage alerts
|
||||
- [ ] Fix root cause
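For the profiling follow-up above, a low-overhead starting point on a Python service is the standard library's `tracemalloc`; a minimal sketch (frame depth and reporting hook are illustrative):

```python
import tracemalloc

tracemalloc.start(25)                      # keep 25 stack frames per allocation for useful tracebacks
baseline = tracemalloc.take_snapshot()     # snapshot taken at startup

def report_growth(limit: int = 10) -> None:
    """Call later (e.g. from a debug endpoint) to see which code paths keep allocating."""
    current = tracemalloc.take_snapshot()
    for stat in current.compare_to(baseline, "lineno")[:limit]:
        print(stat)
```

For non-Python services, the language's own profiler (pprof, async-profiler, etc.) plays the same role.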
|
||||
|
||||
### Scenario 4: Traffic Spike / DDoS
|
||||
|
||||
**Symptoms**:
|
||||
- Sudden traffic increase
|
||||
- Traffic from unusual sources
|
||||
- High CPU/memory across all instances
|
||||
|
||||
**Solution**:
|
||||
```bash
|
||||
# Scale up immediately
|
||||
kubectl scale deployment/[service-name] -n [namespace] --replicas=10
|
||||
|
||||
# Enable rate limiting at load balancer level
|
||||
# (Specific steps depend on LB)
|
||||
|
||||
# Block suspicious IPs if confirmed DDoS
|
||||
# (Use WAF or network policies)
|
||||
```
|
||||
|
||||
**Follow-up**:
|
||||
- [ ] Implement rate limiting
|
||||
- [ ] Add DDoS protection (CloudFlare, WAF)
|
||||
- [ ] Set up auto-scaling
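If rate limiting has to land in the application while the load-balancer change is pending, a minimal token-bucket sketch can serve as a stopgap (the per-IP key, rate, and burst size are illustrative):

```python
import time
from collections import defaultdict

class TokenBucket:
    """Allow up to `rate` requests per second per key, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = defaultdict(lambda: capacity)
        self.updated = defaultdict(time.monotonic)

    def allow(self, key: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.updated[key]
        self.updated[key] = now
        self.tokens[key] = min(self.capacity, self.tokens[key] + elapsed * self.rate)
        if self.tokens[key] >= 1:
            self.tokens[key] -= 1
            return True
        return False

limiter = TokenBucket(rate=10, capacity=20)   # ~10 req/s per client IP, bursts of 20

def handle_request(client_ip: str):
    if not limiter.allow(client_ip):
        return 429, "Too Many Requests"
    return 200, "OK"
```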
|
||||
|
||||
### Scenario 5: Upstream Service Degradation
|
||||
|
||||
**Symptoms**:
|
||||
- Errors calling external API
|
||||
- Timeouts to upstream service
|
||||
- Upstream status page shows issues
|
||||
|
||||
**Solution**:
|
||||
```bash
|
||||
# Enable circuit breaker (if available)
|
||||
# Adjust timeout configuration
|
||||
# Switch to backup service/cached data
|
||||
|
||||
# Monitor external service
|
||||
# Check status page: [Link]
|
||||
```
|
||||
|
||||
**Follow-up**:
|
||||
- [ ] Implement circuit breaker pattern
|
||||
- [ ] Add fallback mechanisms
|
||||
- [ ] Set up external service monitoring
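If the service has no circuit breaker yet, a minimal in-process sketch can buy time until a proper library or service-mesh policy is in place (thresholds and the wrapped call are illustrative):

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive errors; allow a trial call after `reset_timeout` seconds."""

    def __init__(self, max_failures: int = 5, reset_timeout: float = 30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: upstream considered unhealthy")
            self.opened_at = None          # half-open: let one trial request through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

breaker = CircuitBreaker()
# response = breaker.call(requests.get, "https://api.vendor.example/v1/prices", timeout=2)
```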
|
||||
|
||||
---
|
||||
|
||||
## Immediate Actions (< 5 minutes)
|
||||
|
||||
These should be done first to mitigate impact:
|
||||
|
||||
1. **[Action 1]**: [e.g., "Scale up service"]
|
||||
```bash
|
||||
kubectl scale deployment/[service] --replicas=10
|
||||
```
|
||||
|
||||
2. **[Action 2]**: [e.g., "Rollback deployment"]
|
||||
```bash
|
||||
kubectl rollout undo deployment/[service]
|
||||
```
|
||||
|
||||
3. **[Action 3]**: [e.g., "Enable circuit breaker"]
|
||||
|
||||
---
|
||||
|
||||
## Short-term Actions (< 30 minutes)
|
||||
|
||||
After immediate mitigation:
|
||||
|
||||
1. **[Action 1]**: [e.g., "Investigate root cause"]
|
||||
2. **[Action 2]**: [e.g., "Optimize slow query"]
|
||||
3. **[Action 3]**: [e.g., "Clear cache if stale"]
|
||||
|
||||
---
|
||||
|
||||
## Long-term Actions (Post-Incident)
|
||||
|
||||
Preventive measures:
|
||||
|
||||
1. **[Action 1]**: [e.g., "Add circuit breaker"]
|
||||
2. **[Action 2]**: [e.g., "Implement auto-scaling"]
|
||||
3. **[Action 3]**: [e.g., "Add query performance tests"]
|
||||
4. **[Action 4]**: [e.g., "Update alert thresholds"]
|
||||
|
||||
---
|
||||
|
||||
## Escalation
|
||||
|
||||
If issue persists after 30 minutes:
|
||||
|
||||
**Escalation Path**:
|
||||
1. **Primary oncall**: @[username] ([slack/email])
|
||||
2. **Team lead**: @[username] ([slack/email])
|
||||
3. **Engineering manager**: @[username] ([slack/email])
|
||||
4. **Incident commander**: @[username] ([slack/email])
|
||||
|
||||
**Communication**:
|
||||
- **Slack channel**: #[incidents-channel]
|
||||
- **Status page**: [Link]
|
||||
- **Incident tracking**: [Link to incident management tool]
|
||||
|
||||
---
|
||||
|
||||
## Related Runbooks
|
||||
|
||||
- [Related Runbook 1]
|
||||
- [Related Runbook 2]
|
||||
- [Related Runbook 3]
|
||||
|
||||
## Related Dashboards
|
||||
|
||||
- [Main Service Dashboard]
|
||||
- [Resource Usage Dashboard]
|
||||
- [Dependency Dashboard]
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [Architecture Diagram]
|
||||
- [Service Documentation]
|
||||
- [API Documentation]
|
||||
|
||||
---
|
||||
|
||||
## Recent Incidents
|
||||
|
||||
| Date | Duration | Root Cause | Resolution | Ticket |
|
||||
|------|----------|------------|------------|--------|
|
||||
| 2024-10-15 | 23 min | Database pool exhausted | Increased pool size | INC-123 |
|
||||
| 2024-09-30 | 45 min | Memory leak | Fixed code, restarted | INC-120 |
|
||||
|
||||
---
|
||||
|
||||
## Runbook Metadata
|
||||
|
||||
**Last Updated**: [Date]
|
||||
|
||||
**Owner**: [Team name]
|
||||
|
||||
**Reviewers**: [Names]
|
||||
|
||||
**Next Review**: [Date]
|
||||
|
||||
---
|
||||
|
||||
## Notes
|
||||
|
||||
- This runbook should be reviewed quarterly
|
||||
- Update after each incident to capture new learnings
|
||||
- Keep investigation steps concise and actionable
|
||||
- Include actual commands that can be copy-pasted
|
||||
BIN
monitoring-observability.skill
Normal file
Binary file not shown.
125
plugin.lock.json
Normal file
@@ -0,0 +1,125 @@
|
||||
{
|
||||
"$schema": "internal://schemas/plugin.lock.v1.json",
|
||||
"pluginId": "gh:ahmedasmar/devops-claude-skills:monitoring-observability",
|
||||
"normalized": {
|
||||
"repo": null,
|
||||
"ref": "refs/tags/v20251128.0",
|
||||
"commit": "9bb89b1ce889c2df6d7c3c2eedbd6d1301297561",
|
||||
"treeHash": "9fd50a78a79b6d45553e3372bc2d5142f4c48ba4a945ca724356f89f9ce08825",
|
||||
"generatedAt": "2025-11-28T10:13:03.403599Z",
|
||||
"toolVersion": "publish_plugins.py@0.2.0"
|
||||
},
|
||||
"origin": {
|
||||
"remote": "git@github.com:zhongweili/42plugin-data.git",
|
||||
"branch": "master",
|
||||
"commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
|
||||
"repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
|
||||
},
|
||||
"manifest": {
|
||||
"name": "monitoring-observability",
|
||||
"description": "Monitoring and observability strategy, metrics/logs/traces systems, SLOs/error budgets, Prometheus/Grafana/Loki, OpenTelemetry, and tool comparison",
|
||||
"version": null
|
||||
},
|
||||
"content": {
|
||||
"files": [
|
||||
{
|
||||
"path": "README.md",
|
||||
"sha256": "b18b6358cf31ab285b751916a5b2c670b5bc2c8748ef17216f2c9106e4997f8e"
|
||||
},
|
||||
{
|
||||
"path": "SKILL.md",
|
||||
"sha256": "c02fcac42ed2d4d6fcda67a9f835000b1a1198734e4d8d18000546dda81402e4"
|
||||
},
|
||||
{
|
||||
"path": "monitoring-observability.skill",
|
||||
"sha256": "c2c368577bb73885c887cc824b695fb3d36f4a77e74b2e25dcd7815c331a71c1"
|
||||
},
|
||||
{
|
||||
"path": "references/alerting_best_practices.md",
|
||||
"sha256": "99cea7a40310b77a4fdff5543a0b1ee44189497508757bee0dc9ebbe11794a53"
|
||||
},
|
||||
{
|
||||
"path": "references/metrics_design.md",
|
||||
"sha256": "6edc73473e9d3c2ac7e46a4d97576d356d177ed701a2468c5e21d528ff9c29d7"
|
||||
},
|
||||
{
|
||||
"path": "references/tracing_guide.md",
|
||||
"sha256": "5e419d77a31d8b3ee5c16fb57e1fc6e3e16d31efb8f4a86dd756c7327a482fa0"
|
||||
},
|
||||
{
|
||||
"path": "references/dql_promql_translation.md",
|
||||
"sha256": "47113e77b03d9ac70fc35121efd93cf5e17e031b878d27791403493b71058c5c"
|
||||
},
|
||||
{
|
||||
"path": "references/tool_comparison.md",
|
||||
"sha256": "fd0fc7e4fc3641ca0ddc469a14fa1373457f5a4586fe4bc7ec23afe3de9f6171"
|
||||
},
|
||||
{
|
||||
"path": "references/datadog_migration.md",
|
||||
"sha256": "9ed5e276eb2ea67f72c91e1bb53374b293e164fa28c4c44f31ee9f8660dfaf02"
|
||||
},
|
||||
{
|
||||
"path": "references/logging_guide.md",
|
||||
"sha256": "2c94b61d6db2c0f6b8927c8092010f3a2f1ea20d2eefd330d8073e7b4bcf4c9d"
|
||||
},
|
||||
{
|
||||
"path": "references/slo_sla_guide.md",
|
||||
"sha256": "2a0cb69dd120897183f7bcab002a368dbe11bd5038817906da3391ca168e0052"
|
||||
},
|
||||
{
|
||||
"path": "scripts/log_analyzer.py",
|
||||
"sha256": "c7fb7e13c2d6507c81ee9575fc8514408d36b2f2e786caeb536ba927d517046e"
|
||||
},
|
||||
{
|
||||
"path": "scripts/analyze_metrics.py",
|
||||
"sha256": "50ad856cb043dfd70b60c6ca685b526d34b8bc5e5454dd0b530033da3da22545"
|
||||
},
|
||||
{
|
||||
"path": "scripts/health_check_validator.py",
|
||||
"sha256": "cef8c447fabf83dfd9bd28a8d22127b87b66aafa4d151cbccd9fe1f1db0bbcf2"
|
||||
},
|
||||
{
|
||||
"path": "scripts/alert_quality_checker.py",
|
||||
"sha256": "b561cf9c41e2de8d5f09557c018110553047d0ad54629bdc7a07a654d76263d1"
|
||||
},
|
||||
{
|
||||
"path": "scripts/datadog_cost_analyzer.py",
|
||||
"sha256": "05a1c6c0033b04f2f5206af015907f2df4c9cf57f4c2b8f10ba2565236a5c97f"
|
||||
},
|
||||
{
|
||||
"path": "scripts/slo_calculator.py",
|
||||
"sha256": "c26ab0f0a31e5efa830a9f24938ec356bfaef927438bd47b95f4ad0015cff662"
|
||||
},
|
||||
{
|
||||
"path": "scripts/dashboard_generator.py",
|
||||
"sha256": "6fe98a49ae431d67bc44eb631c542ba29199da72cc348e90ec99d73a05783ee5"
|
||||
},
|
||||
{
|
||||
"path": ".claude-plugin/plugin.json",
|
||||
"sha256": "7b6a16e6bce66bf87929c2f3c4ea32f4bfadd8d9606edd195f144c82ec85f151"
|
||||
},
|
||||
{
|
||||
"path": "assets/templates/prometheus-alerts/webapp-alerts.yml",
|
||||
"sha256": "d881081e53650c335ec5cc7d5d96bade03e607e55bff3bcbafe6811377055154"
|
||||
},
|
||||
{
|
||||
"path": "assets/templates/prometheus-alerts/kubernetes-alerts.yml",
|
||||
"sha256": "cb8c247b245ea1fb2a904f525fce8f74f9237d79eda04c2c60938135a7271415"
|
||||
},
|
||||
{
|
||||
"path": "assets/templates/runbooks/incident-runbook-template.md",
|
||||
"sha256": "1a5ba8951cf5b1408ea2101232ffe8d88fab75ed4ae63b0c9f1902059373112d"
|
||||
},
|
||||
{
|
||||
"path": "assets/templates/otel-config/collector-config.yaml",
|
||||
"sha256": "2696548b1c7f4034283cc2387f9730efa4811881d1c9c9219002e7affc8c29f2"
|
||||
}
|
||||
],
|
||||
"dirSha256": "9fd50a78a79b6d45553e3372bc2d5142f4c48ba4a945ca724356f89f9ce08825"
|
||||
},
|
||||
"security": {
|
||||
"scannedAt": null,
|
||||
"scannerVersion": null,
|
||||
"flags": []
|
||||
}
|
||||
}
|
||||
609
references/alerting_best_practices.md
Normal file
@@ -0,0 +1,609 @@
|
||||
# Alerting Best Practices
|
||||
|
||||
## Core Principles
|
||||
|
||||
### 1. Every Alert Should Be Actionable
|
||||
If you can't do something about it, don't alert on it.
|
||||
|
||||
❌ Bad: `Alert: CPU > 50%` (What action should be taken?)
|
||||
✅ Good: `Alert: API latency p95 > 2s for 10m` (Investigate/scale up)
|
||||
|
||||
### 2. Alert on Symptoms, Not Causes
|
||||
Alert on what users experience, not underlying components.
|
||||
|
||||
❌ Bad: `Database connection pool 80% full`
|
||||
✅ Good: `Request latency p95 > 1s` (which might be caused by DB pool)
|
||||
|
||||
### 3. Alert on SLO Violations
|
||||
Tie alerts to Service Level Objectives.
|
||||
|
||||
✅ `Error rate exceeds 0.1% (SLO: 99.9% availability)`
|
||||
|
||||
### 4. Reduce Noise
|
||||
Alert fatigue is real. Only page for critical issues.
|
||||
|
||||
**Alert Severity Levels**:
|
||||
- **Critical**: Page on-call immediately (user-facing issue)
|
||||
- **Warning**: Create ticket, review during business hours
|
||||
- **Info**: Log for awareness, no action needed
|
||||
|
||||
---
|
||||
|
||||
## Alert Design Patterns
|
||||
|
||||
### Pattern 1: Multi-Window Multi-Burn-Rate
|
||||
|
||||
Google's recommended SLO alerting approach.
|
||||
|
||||
**Concept**: Alert when error budget burn rate is high enough to exhaust the budget too quickly.
|
||||
|
||||
```yaml
|
||||
# Fast burn: at this rate ~2% of a 30-day error budget is consumed per hour
|
||||
- alert: FastBurnRate
|
||||
expr: |
|
||||
sum(rate(http_requests_total{status=~"5.."}[1h]))
|
||||
/
|
||||
sum(rate(http_requests_total[1h]))
|
||||
> (14.4 * 0.001) # 14.4x burn rate for 99.9% SLO
|
||||
for: 2m
|
||||
labels:
|
||||
severity: critical
|
||||
|
||||
# Slow burn: at this rate ~5% of a 30-day error budget is consumed in 6 hours
|
||||
- alert: SlowBurnRate
|
||||
expr: |
|
||||
sum(rate(http_requests_total{status=~"5.."}[6h]))
|
||||
/
|
||||
sum(rate(http_requests_total[6h]))
|
||||
> (6 * 0.001) # 6x burn rate for 99.9% SLO
|
||||
for: 30m
|
||||
labels:
|
||||
severity: warning
|
||||
```
|
||||
|
||||
**Burn Rate Multipliers for 99.9% SLO (0.1% error budget)**:
|
||||
- 1 hour window, 2m grace: 14.4x burn rate
|
||||
- 6 hour window, 30m grace: 6x burn rate
|
||||
- 3 day window, 6h grace: 1x burn rate
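These multipliers follow from simple arithmetic: budget consumed = burn rate × window / SLO period. A quick sanity check in Python, assuming a 30-day SLO period:

```python
SLO_PERIOD_HOURS = 30 * 24  # 30-day rolling window

def budget_consumed(burn_rate: float, window_hours: float) -> float:
    """Fraction of the total error budget consumed over `window_hours` at `burn_rate`."""
    return burn_rate * window_hours / SLO_PERIOD_HOURS

print(budget_consumed(14.4, 1))   # 0.02 -> 2% of the budget gone in 1 hour
print(budget_consumed(6, 6))      # 0.05 -> 5% of the budget gone in 6 hours
print(budget_consumed(1, 72))     # 0.10 -> 10% of the budget gone in 3 days
```

Multiplying the burn rate by the error budget (e.g. 14.4 × 0.001) gives the error-rate threshold used in the expressions above.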
|
||||
|
||||
### Pattern 2: Rate of Change
|
||||
Alert when metrics change rapidly.
|
||||
|
||||
```yaml
|
||||
- alert: TrafficSpike
|
||||
expr: |
|
||||
sum(rate(http_requests_total[5m]))
|
||||
>
|
||||
1.5 * sum(rate(http_requests_total[5m] offset 1h))
|
||||
for: 10m
|
||||
annotations:
|
||||
summary: "Traffic increased by 50% compared to 1 hour ago"
|
||||
```
|
||||
|
||||
### Pattern 3: Threshold with Hysteresis
|
||||
Prevent flapping with different thresholds for firing and resolving.
|
||||
|
||||
```yaml
|
||||
# Fire at 90%; keep firing until usage drops back below 70%
# (uses the built-in ALERTS series to reference the alert's own firing state)
- alert: HighCPU
  expr: |
    cpu_usage > 90
    or
    (cpu_usage > 70 and on(instance) ALERTS{alertname="HighCPU", alertstate="firing"})
  for: 5m
|
||||
```
|
||||
|
||||
### Pattern 4: Absent Metrics
|
||||
Alert when expected metrics stop being reported (service down).
|
||||
|
||||
```yaml
|
||||
- alert: ServiceDown
|
||||
expr: absent(up{job="my-service"})
|
||||
for: 5m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "Service {{ $labels.job }} is not reporting metrics"
|
||||
```
|
||||
|
||||
### Pattern 5: Aggregate Alerts
|
||||
Alert on aggregate performance across multiple instances.
|
||||
|
||||
```yaml
|
||||
- alert: HighOverallErrorRate
|
||||
expr: |
|
||||
sum(rate(http_requests_total{status=~"5.."}[5m]))
|
||||
/
|
||||
sum(rate(http_requests_total[5m]))
|
||||
> 0.05
|
||||
for: 10m
|
||||
annotations:
|
||||
summary: "Overall error rate is {{ $value | humanizePercentage }}"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Alert Annotation Best Practices
|
||||
|
||||
### Required Fields
|
||||
|
||||
**summary**: One-line description of the issue
|
||||
```yaml
|
||||
summary: "High error rate on {{ $labels.service }}: {{ $value | humanizePercentage }}"
|
||||
```
|
||||
|
||||
**description**: Detailed explanation with context
|
||||
```yaml
|
||||
description: |
|
||||
Error rate on {{ $labels.service }} is {{ $value | humanizePercentage }},
|
||||
which exceeds the threshold of 1% for more than 10 minutes.
|
||||
|
||||
Current value: {{ $value }}
|
||||
Runbook: https://runbooks.example.com/high-error-rate
|
||||
```
|
||||
|
||||
**runbook_url**: Link to investigation steps
|
||||
```yaml
|
||||
runbook_url: "https://runbooks.example.com/alerts/{{ $labels.alertname }}"
|
||||
```
|
||||
|
||||
### Optional but Recommended
|
||||
|
||||
**dashboard**: Link to relevant dashboard
|
||||
```yaml
|
||||
dashboard: "https://grafana.example.com/d/service-dashboard?var-service={{ $labels.service }}"
|
||||
```
|
||||
|
||||
**logs**: Link to logs
|
||||
```yaml
|
||||
logs: "https://kibana.example.com/app/discover#/?_a=(query:(query_string:(query:'service:{{ $labels.service }}')))"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Alert Label Best Practices
|
||||
|
||||
### Required Labels
|
||||
|
||||
**severity**: Critical, warning, or info
|
||||
```yaml
|
||||
labels:
|
||||
severity: critical
|
||||
```
|
||||
|
||||
**team**: Who should handle this alert
|
||||
```yaml
|
||||
labels:
|
||||
team: platform
|
||||
severity: critical
|
||||
```
|
||||
|
||||
**component**: What part of the system
|
||||
```yaml
|
||||
labels:
|
||||
component: api-gateway
|
||||
severity: warning
|
||||
```
|
||||
|
||||
### Example Complete Alert
|
||||
```yaml
|
||||
- alert: HighLatency
|
||||
expr: |
|
||||
histogram_quantile(0.95,
|
||||
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
|
||||
) > 1
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
team: backend
|
||||
component: api
|
||||
environment: "{{ $labels.environment }}"
|
||||
annotations:
|
||||
summary: "High latency on {{ $labels.service }}"
|
||||
description: |
|
||||
P95 latency on {{ $labels.service }} is {{ $value }}s, exceeding 1s threshold.
|
||||
|
||||
This may impact user experience. Check recent deployments and database performance.
|
||||
|
||||
Current p95: {{ $value }}s
|
||||
Threshold: 1s
|
||||
Duration: 10m+
|
||||
runbook_url: "https://runbooks.example.com/high-latency"
|
||||
dashboard: "https://grafana.example.com/d/api-dashboard"
|
||||
logs: "https://kibana.example.com/app/discover#/?_a=(query:(query_string:(query:'service:{{ $labels.service }} AND level:error')))"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Alert Thresholds
|
||||
|
||||
### General Guidelines
|
||||
|
||||
**Response Time / Latency**:
|
||||
- Warning: p95 > 500ms or p99 > 1s
|
||||
- Critical: p95 > 2s or p99 > 5s
|
||||
|
||||
**Error Rate**:
|
||||
- Warning: > 1%
|
||||
- Critical: > 5%
|
||||
|
||||
**Availability**:
|
||||
- Warning: < 99.9%
|
||||
- Critical: < 99.5%
|
||||
|
||||
**CPU Utilization**:
|
||||
- Warning: > 70% for 15m
|
||||
- Critical: > 90% for 5m
|
||||
|
||||
**Memory Utilization**:
|
||||
- Warning: > 80% for 15m
|
||||
- Critical: > 95% for 5m
|
||||
|
||||
**Disk Space**:
|
||||
- Warning: > 80% full
|
||||
- Critical: > 90% full
|
||||
|
||||
**Queue Depth**:
|
||||
- Warning: > 70% of max capacity
|
||||
- Critical: > 90% of max capacity
|
||||
|
||||
### Application-Specific Thresholds
|
||||
|
||||
Set thresholds based on:
|
||||
1. **Historical performance**: Use p95 of last 30 days + 20%
|
||||
2. **SLO requirements**: If SLO is 99.9%, alert at 99.5%
|
||||
3. **Business impact**: What error rate causes user complaints?
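For the first approach, the historical baseline can be pulled straight from Prometheus' HTTP API; a sketch (the Prometheus URL, metric, subquery resolution, and 20% margin are illustrative):

```python
import requests

PROMETHEUS = "http://prometheus:9090"

def latency_threshold(margin: float = 1.2) -> float:
    """Return the p95-of-the-last-30-days latency plus a 20% margin, as an alert threshold."""
    query = (
        'quantile_over_time(0.95, '
        'histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'
        '[30d:1h])'
    )
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query}, timeout=30)
    resp.raise_for_status()
    value = float(resp.json()["data"]["result"][0]["value"][1])
    return value * margin

print(f"Suggested warning threshold: {latency_threshold():.3f}s")
```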
|
||||
|
||||
---
|
||||
|
||||
## The "for" Clause
|
||||
|
||||
Prevent alert flapping by requiring the condition to be true for a duration.
|
||||
|
||||
### Guidelines
|
||||
|
||||
**Critical alerts**: Short duration (2-5m)
|
||||
```yaml
|
||||
- alert: ServiceDown
|
||||
expr: up == 0
|
||||
for: 2m # Quick detection for critical issues
|
||||
```
|
||||
|
||||
**Warning alerts**: Longer duration (10-30m)
|
||||
```yaml
|
||||
- alert: HighMemoryUsage
|
||||
expr: memory_usage > 80
|
||||
for: 15m # Avoid noise from temporary spikes
|
||||
```
|
||||
|
||||
**Resource saturation**: Medium duration (5-10m)
|
||||
```yaml
|
||||
- alert: HighCPU
|
||||
expr: cpu_usage > 90
|
||||
for: 5m
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Alert Routing
|
||||
|
||||
### Severity-Based Routing
|
||||
|
||||
```yaml
|
||||
# alertmanager.yml
|
||||
route:
|
||||
group_by: ['alertname', 'cluster']
|
||||
group_wait: 10s
|
||||
group_interval: 5m
|
||||
repeat_interval: 4h
|
||||
receiver: 'default'
|
||||
|
||||
routes:
|
||||
# Critical alerts → PagerDuty
|
||||
- match:
|
||||
severity: critical
|
||||
receiver: pagerduty
|
||||
group_wait: 10s
|
||||
repeat_interval: 5m
|
||||
|
||||
# Warning alerts → Slack
|
||||
- match:
|
||||
severity: warning
|
||||
receiver: slack
|
||||
group_wait: 30s
|
||||
repeat_interval: 12h
|
||||
|
||||
# Info alerts → Email
|
||||
- match:
|
||||
severity: info
|
||||
receiver: email
|
||||
repeat_interval: 24h
|
||||
```
|
||||
|
||||
### Team-Based Routing
|
||||
|
||||
```yaml
|
||||
routes:
|
||||
# Platform team
|
||||
- match:
|
||||
team: platform
|
||||
receiver: platform-pagerduty
|
||||
|
||||
# Backend team
|
||||
- match:
|
||||
team: backend
|
||||
receiver: backend-slack
|
||||
|
||||
# Database team
|
||||
- match:
|
||||
component: database
|
||||
receiver: dba-pagerduty
|
||||
```
|
||||
|
||||
### Time-Based Routing
|
||||
|
||||
```yaml
|
||||
# Only page during business hours for non-critical
|
||||
routes:
|
||||
- match:
|
||||
severity: warning
|
||||
receiver: slack
|
||||
active_time_intervals:
|
||||
- business_hours
|
||||
|
||||
time_intervals:
|
||||
- name: business_hours
|
||||
time_intervals:
|
||||
- weekdays: ['monday:friday']
|
||||
times:
|
||||
- start_time: '09:00'
|
||||
end_time: '17:00'
|
||||
location: 'America/New_York'
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Alert Grouping
|
||||
|
||||
### Intelligent Grouping
|
||||
|
||||
**Group by service and environment**:
|
||||
```yaml
|
||||
route:
|
||||
group_by: ['alertname', 'service', 'environment']
|
||||
group_wait: 30s
|
||||
group_interval: 5m
|
||||
```
|
||||
|
||||
This prevents:
|
||||
- 50 alerts for "HighCPU" on different pods → 1 grouped alert
|
||||
- Mixing production and staging alerts
|
||||
|
||||
### Inhibition Rules
|
||||
|
||||
Suppress related alerts when a parent alert fires.
|
||||
|
||||
```yaml
|
||||
inhibit_rules:
|
||||
# If service is down, suppress latency alerts
|
||||
- source_match:
|
||||
alertname: ServiceDown
|
||||
target_match:
|
||||
alertname: HighLatency
|
||||
equal: ['service']
|
||||
|
||||
# If node is down, suppress all pod alerts on that node
|
||||
- source_match:
|
||||
alertname: NodeDown
|
||||
target_match_re:
|
||||
alertname: '(PodCrashLoop|HighCPU|HighMemory)'
|
||||
equal: ['node']
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Runbook Structure
|
||||
|
||||
Every alert should link to a runbook with:
|
||||
|
||||
### 1. Context
|
||||
- What does this alert mean?
|
||||
- What is the user impact?
|
||||
- What is the urgency?
|
||||
|
||||
### 2. Investigation Steps
|
||||
```markdown
|
||||
## Investigation
|
||||
|
||||
1. Check service health dashboard
|
||||
https://grafana.example.com/d/service-dashboard
|
||||
|
||||
2. Check recent deployments
|
||||
kubectl rollout history deployment/myapp -n production
|
||||
|
||||
3. Check error logs
|
||||
kubectl logs deployment/myapp -n production --tail=100 | grep ERROR
|
||||
|
||||
4. Check dependencies
|
||||
- Database: Check slow query log
|
||||
- Redis: Check memory usage
|
||||
- External APIs: Check status pages
|
||||
```
|
||||
|
||||
### 3. Common Causes
|
||||
```markdown
|
||||
## Common Causes
|
||||
|
||||
- **Recent deployment**: Check if alert started after deployment
|
||||
- **Traffic spike**: Check request rate, might need to scale
|
||||
- **Database issues**: Check query performance and connection pool
|
||||
- **External API degradation**: Check third-party status pages
|
||||
```
|
||||
|
||||
### 4. Resolution Steps
|
||||
```markdown
|
||||
## Resolution
|
||||
|
||||
### Immediate Actions (< 5 minutes)
|
||||
1. Scale up if traffic spike: `kubectl scale deployment myapp --replicas=10`
|
||||
2. Rollback if recent deployment: `kubectl rollout undo deployment/myapp`
|
||||
|
||||
### Short-term Actions (< 30 minutes)
|
||||
1. Restart pods if memory leak: `kubectl rollout restart deployment/myapp`
|
||||
2. Clear cache if stale data: `redis-cli -h cache.example.com FLUSHDB`
|
||||
|
||||
### Long-term Actions (post-incident)
|
||||
1. Review and optimize slow queries
|
||||
2. Implement circuit breakers
|
||||
3. Add more capacity
|
||||
4. Update alert thresholds if false positive
|
||||
```
|
||||
|
||||
### 5. Escalation
|
||||
```markdown
|
||||
## Escalation
|
||||
|
||||
If issue persists after 30 minutes:
|
||||
- Slack: #backend-oncall
|
||||
- PagerDuty: Escalate to senior engineer
|
||||
- Incident Commander: Jane Doe (jane@example.com)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Anti-Patterns to Avoid
|
||||
|
||||
### 1. Alert on Everything
|
||||
❌ Don't: Alert on every warning log
|
||||
✅ Do: Alert on error rate exceeding threshold
|
||||
|
||||
### 2. Alert Without Context
|
||||
❌ Don't: "Error rate high"
|
||||
✅ Do: "Error rate 5.2% exceeds 1% threshold for 10m, impacting checkout flow"
|
||||
|
||||
### 3. Static Thresholds for Dynamic Systems
|
||||
❌ Don't: `cpu_usage > 70` (fails during scale-up)
|
||||
✅ Do: Alert on SLO violations or rate of change
|
||||
|
||||
### 4. No "for" Clause
|
||||
❌ Don't: Alert immediately on threshold breach
|
||||
✅ Do: Use `for: 5m` to avoid flapping
|
||||
|
||||
### 5. Too Many Recipients
|
||||
❌ Don't: Page 10 people for every alert
|
||||
✅ Do: Route to specific on-call rotation
|
||||
|
||||
### 6. Duplicate Alerts
|
||||
❌ Don't: Alert on both cause and symptom
|
||||
✅ Do: Alert on symptom, use inhibition for causes
|
||||
|
||||
### 7. No Runbook
|
||||
❌ Don't: Alert without guidance
|
||||
✅ Do: Include runbook_url in every alert
|
||||
|
||||
---
|
||||
|
||||
## Alert Testing
|
||||
|
||||
### Test Alert Firing
|
||||
```bash
|
||||
# Trigger test alert in Prometheus
|
||||
amtool alert add alertname="TestAlert" \
|
||||
severity="warning" \
|
||||
summary="Test alert"
|
||||
|
||||
# Or use Alertmanager API
|
||||
curl -X POST http://alertmanager:9093/api/v1/alerts \
|
||||
-d '[{
|
||||
"labels": {"alertname": "TestAlert", "severity": "critical"},
|
||||
"annotations": {"summary": "Test critical alert"}
|
||||
}]'
|
||||
```
|
||||
|
||||
### Verify Alert Rules
|
||||
```bash
|
||||
# Check syntax
|
||||
promtool check rules alerts.yml
|
||||
|
||||
# Test expression
|
||||
promtool query instant http://prometheus:9090 \
|
||||
'sum(rate(http_requests_total{status=~"5.."}[5m]))'
|
||||
|
||||
# Unit test alerts
|
||||
promtool test rules test.yml
|
||||
```
|
||||
|
||||
### Test Alertmanager Routing
|
||||
```bash
|
||||
# Test which receiver an alert would go to
|
||||
amtool config routes test \
|
||||
--config.file=alertmanager.yml \
|
||||
alertname="HighLatency" \
|
||||
severity="critical" \
|
||||
team="backend"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## On-Call Best Practices
|
||||
|
||||
### Rotation Schedule
|
||||
- **Primary on-call**: First responder
|
||||
- **Secondary on-call**: Escalation backup
|
||||
- **Rotation length**: 1 week (balance load vs context)
|
||||
- **Handoff**: Monday morning (not Friday evening)
|
||||
|
||||
### On-Call Checklist
|
||||
```markdown
|
||||
## Pre-shift
|
||||
- [ ] Test pager/phone
|
||||
- [ ] Review recent incidents
|
||||
- [ ] Check upcoming deployments
|
||||
- [ ] Update contact info
|
||||
|
||||
## During shift
|
||||
- [ ] Respond to pages within 5 minutes
|
||||
- [ ] Document all incidents
|
||||
- [ ] Update runbooks if gaps found
|
||||
- [ ] Communicate in #incidents channel
|
||||
|
||||
## Post-shift
|
||||
- [ ] Hand off open incidents
|
||||
- [ ] Complete incident reports
|
||||
- [ ] Suggest improvements
|
||||
- [ ] Update team documentation
|
||||
```
|
||||
|
||||
### Escalation Policy
|
||||
1. **Primary**: Responds within 5 minutes
|
||||
2. **Secondary**: Auto-escalate after 15 minutes
|
||||
3. **Manager**: Auto-escalate after 30 minutes
|
||||
4. **Incident Commander**: Critical incidents only
|
||||
|
||||
---
|
||||
|
||||
## Metrics About Alerts
|
||||
|
||||
Monitor your monitoring system!
|
||||
|
||||
### Key Metrics
|
||||
```promql
|
||||
# Alert firing frequency
|
||||
sum(ALERTS{alertstate="firing"}) by (alertname)
|
||||
|
||||
# Alert duration
|
||||
ALERTS_FOR_STATE{alertstate="firing"}
|
||||
|
||||
# Alerts per severity
|
||||
sum(ALERTS{alertstate="firing"}) by (severity)
|
||||
|
||||
# Time to acknowledge (from PagerDuty/etc)
|
||||
pagerduty_incident_ack_duration_seconds
|
||||
```
|
||||
|
||||
### Alert Quality Metrics
|
||||
- **Mean Time to Acknowledge (MTTA)**: < 5 minutes
|
||||
- **Mean Time to Resolve (MTTR)**: < 30 minutes
|
||||
- **False Positive Rate**: < 10%
|
||||
- **Alert Coverage**: % of incidents with preceding alert > 80%
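MTTA and MTTR can be computed directly from incident records exported from your paging tool; a minimal sketch (field names and data are illustrative):

```python
from datetime import datetime
from statistics import mean

incidents = [
    {"triggered": "2024-10-15T10:00:00", "acked": "2024-10-15T10:03:00", "resolved": "2024-10-15T10:23:00"},
    {"triggered": "2024-09-30T14:00:00", "acked": "2024-09-30T14:06:00", "resolved": "2024-09-30T14:45:00"},
]

def minutes_between(start: str, end: str) -> float:
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() / 60

mtta = mean(minutes_between(i["triggered"], i["acked"]) for i in incidents)
mttr = mean(minutes_between(i["triggered"], i["resolved"]) for i in incidents)
print(f"MTTA: {mtta:.1f} min (target < 5), MTTR: {mttr:.1f} min (target < 30)")
```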
|
||||
649
references/datadog_migration.md
Normal file
@@ -0,0 +1,649 @@
|
||||
# Migrating from Datadog to Open-Source Stack
|
||||
|
||||
## Overview
|
||||
|
||||
This guide helps you migrate from Datadog to a cost-effective open-source observability stack:
|
||||
- **Metrics**: Datadog → Prometheus + Grafana
|
||||
- **Logs**: Datadog → Loki + Grafana
|
||||
- **Traces**: Datadog APM → Tempo/Jaeger + Grafana
|
||||
- **Dashboards**: Datadog → Grafana
|
||||
- **Alerts**: Datadog Monitors → Prometheus Alertmanager
|
||||
|
||||
**Estimated Cost Savings**: 60-80% for similar functionality
|
||||
|
||||
---
|
||||
|
||||
## Cost Comparison
|
||||
|
||||
### Example: 100-host infrastructure
|
||||
|
||||
**Datadog**:
|
||||
- Infrastructure Pro: $1,500/month (100 hosts × $15)
|
||||
- Custom Metrics: $50/month (5,000 extra metrics beyond included 10,000)
|
||||
- Logs: $2,000/month (20GB/day × $0.10/GB × 30 days)
|
||||
- APM: $3,100/month (100 hosts × $31)
|
||||
- **Total**: ~$6,650/month ($79,800/year)
|
||||
|
||||
**Open-Source Stack** (self-hosted):
|
||||
- Infrastructure: $1,200/month (EC2/GKE for Prometheus, Grafana, Loki, Tempo)
|
||||
- Storage: $300/month (S3/GCS for long-term metrics and traces)
|
||||
- Operations time: Variable
|
||||
- **Total**: ~$1,500-2,500/month ($18,000-30,000/year)
|
||||
|
||||
**Savings**: $49,800-61,800/year
|
||||
|
||||
---
|
||||
|
||||
## Migration Strategy
|
||||
|
||||
### Phase 1: Run Parallel (Month 1-2)
|
||||
- Deploy open-source stack alongside Datadog
|
||||
- Migrate metrics first (lowest risk)
|
||||
- Validate data accuracy
|
||||
- Build confidence
|
||||
|
||||
### Phase 2: Migrate Dashboards & Alerts (Month 2-3)
|
||||
- Convert Datadog dashboards to Grafana
|
||||
- Translate alert rules
|
||||
- Train team on new tools
|
||||
|
||||
### Phase 3: Migrate Logs & Traces (Month 3-4)
|
||||
- Set up Loki for log aggregation
|
||||
- Deploy Tempo/Jaeger for tracing
|
||||
- Update application instrumentation
|
||||
|
||||
### Phase 4: Decommission Datadog (Month 4-5)
|
||||
- Confirm all functionality migrated
|
||||
- Cancel Datadog subscription
|
||||
- Archive Datadog dashboards/alerts for reference
|
||||
|
||||
---
|
||||
|
||||
## 1. Metrics Migration (Datadog → Prometheus)
|
||||
|
||||
### Step 1: Deploy Prometheus
|
||||
|
||||
**Kubernetes** (recommended):
|
||||
```yaml
|
||||
# prometheus-values.yaml
|
||||
prometheus:
|
||||
prometheusSpec:
|
||||
retention: 30d
|
||||
storageSpec:
|
||||
volumeClaimTemplate:
|
||||
spec:
|
||||
resources:
|
||||
requests:
|
||||
storage: 100Gi
|
||||
|
||||
# Scrape configs
|
||||
additionalScrapeConfigs:
|
||||
- job_name: 'kubernetes-pods'
|
||||
kubernetes_sd_configs:
|
||||
- role: pod
|
||||
```
|
||||
|
||||
**Install**:
|
||||
```bash
|
||||
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
|
||||
helm install prometheus prometheus-community/kube-prometheus-stack -f prometheus-values.yaml
|
||||
```
|
||||
|
||||
**Docker Compose**:
|
||||
```yaml
|
||||
version: '3'
|
||||
services:
|
||||
prometheus:
|
||||
image: prom/prometheus:latest
|
||||
ports:
|
||||
- "9090:9090"
|
||||
volumes:
|
||||
- ./prometheus.yml:/etc/prometheus/prometheus.yml
|
||||
- prometheus-data:/prometheus
|
||||
command:
|
||||
- '--config.file=/etc/prometheus/prometheus.yml'
|
||||
- '--storage.tsdb.retention.time=30d'
|
||||
|
||||
volumes:
|
||||
prometheus-data:
|
||||
```
|
||||
|
||||
### Step 2: Replace DogStatsD with Prometheus Exporters
|
||||
|
||||
**Before (DogStatsD)**:
|
||||
```python
|
||||
from datadog import statsd
|
||||
|
||||
statsd.increment('page.views')
|
||||
statsd.histogram('request.duration', 0.5)
|
||||
statsd.gauge('active_users', 100)
|
||||
```
|
||||
|
||||
**After (Prometheus Python client)**:
|
||||
```python
|
||||
from prometheus_client import Counter, Histogram, Gauge
|
||||
|
||||
page_views = Counter('page_views_total', 'Page views')
|
||||
request_duration = Histogram('request_duration_seconds', 'Request duration')
|
||||
active_users = Gauge('active_users', 'Active users')
|
||||
|
||||
# Usage
|
||||
page_views.inc()
|
||||
request_duration.observe(0.5)
|
||||
active_users.set(100)
|
||||
```
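One operational difference to keep in mind: DogStatsD pushes metrics to the local agent, while the Prometheus client is pull-based, so the process must also expose an HTTP endpoint for Prometheus to scrape. A minimal sketch (the port is illustrative and must match your scrape config):

```python
import time

from prometheus_client import Counter, start_http_server

page_views = Counter('page_views_total', 'Page views')

if __name__ == "__main__":
    start_http_server(8000)   # serves /metrics on :8000 for Prometheus to scrape
    while True:
        page_views.inc()       # stand-in for real application work
        time.sleep(1)
```

Web frameworks can instead mount the client's WSGI/ASGI metrics app under `/metrics`.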
|
||||
|
||||
### Step 3: Metric Name Translation
|
||||
|
||||
| Datadog Metric | Prometheus Equivalent |
|
||||
|----------------|----------------------|
|
||||
| `system.cpu.idle` | `node_cpu_seconds_total{mode="idle"}` |
|
||||
| `system.mem.free` | `node_memory_MemFree_bytes` |
|
||||
| `system.disk.used` | `node_filesystem_size_bytes - node_filesystem_free_bytes` |
|
||||
| `docker.cpu.usage` | `container_cpu_usage_seconds_total` |
|
||||
| `kubernetes.pods.running` | `kube_pod_status_phase{phase="Running"}` |
|
||||
|
||||
### Step 4: Export Existing Datadog Metrics (Optional)
|
||||
|
||||
Use Datadog API to export historical data:
|
||||
|
||||
```python
|
||||
import time

from datadog import api, initialize
|
||||
|
||||
options = {
|
||||
'api_key': 'YOUR_API_KEY',
|
||||
'app_key': 'YOUR_APP_KEY'
|
||||
}
|
||||
initialize(**options)
|
||||
|
||||
# Query metric
|
||||
result = api.Metric.query(
|
||||
start=int(time.time() - 86400), # Last 24h
|
||||
end=int(time.time()),
|
||||
query='avg:system.cpu.user{*}'
|
||||
)
|
||||
|
||||
# Convert to Prometheus format and import
|
||||
```
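If you do want to backfill history, one route is to write the exported points as OpenMetrics text and load them with `promtool tsdb create-blocks-from openmetrics`. A sketch continuing from `result` above, assuming the usual Datadog `series`/`pointlist` response layout (timestamps in milliseconds):

```python
def to_openmetrics(result: dict, metric_name: str, path: str) -> None:
    """Write Datadog query results as an OpenMetrics file suitable for promtool backfill."""
    with open(path, "w") as f:
        f.write(f"# TYPE {metric_name} gauge\n")
        for series in result.get("series", []):
            for ts_ms, value in series.get("pointlist", []):
                if value is None:
                    continue
                f.write(f"{metric_name} {value} {int(ts_ms / 1000)}\n")
        f.write("# EOF\n")

to_openmetrics(result, "datadog_system_cpu_user", "cpu_user.om")
# Then: promtool tsdb create-blocks-from openmetrics cpu_user.om ./data
```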
|
||||
|
||||
---
|
||||
|
||||
## 2. Dashboard Migration (Datadog → Grafana)
|
||||
|
||||
### Step 1: Export Datadog Dashboards
|
||||
|
||||
```python
|
||||
import requests
|
||||
import json
|
||||
|
||||
api_key = "YOUR_API_KEY"
|
||||
app_key = "YOUR_APP_KEY"
|
||||
|
||||
headers = {
|
||||
'DD-API-KEY': api_key,
|
||||
'DD-APPLICATION-KEY': app_key
|
||||
}
|
||||
|
||||
# Get all dashboards
|
||||
response = requests.get(
|
||||
'https://api.datadoghq.com/api/v1/dashboard',
|
||||
headers=headers
|
||||
)
|
||||
|
||||
dashboards = response.json()
|
||||
|
||||
# Export each dashboard
|
||||
for dashboard in dashboards['dashboards']:
|
||||
dash_id = dashboard['id']
|
||||
detail = requests.get(
|
||||
f'https://api.datadoghq.com/api/v1/dashboard/{dash_id}',
|
||||
headers=headers
|
||||
).json()
|
||||
|
||||
with open(f'datadog_{dash_id}.json', 'w') as f:
|
||||
json.dump(detail, f, indent=2)
|
||||
```
|
||||
|
||||
### Step 2: Convert to Grafana Format
|
||||
|
||||
**Manual Conversion Template**:
|
||||
|
||||
| Datadog Widget | Grafana Panel Type |
|
||||
|----------------|-------------------|
|
||||
| Timeseries | Graph / Time series |
|
||||
| Query Value | Stat |
|
||||
| Toplist | Table / Bar gauge |
|
||||
| Heatmap | Heatmap |
|
||||
| Distribution | Histogram |
|
||||
|
||||
**Automated Conversion** (basic example):
|
||||
```python
|
||||
def convert_datadog_to_grafana(datadog_dashboard):
|
||||
grafana_dashboard = {
|
||||
"title": datadog_dashboard['title'],
|
||||
"panels": []
|
||||
}
|
||||
|
||||
for widget in datadog_dashboard['widgets']:
|
||||
panel = {
|
||||
"title": widget['definition'].get('title', ''),
|
||||
"type": map_widget_type(widget['definition']['type']),
|
||||
"targets": convert_queries(widget['definition']['requests'])
|
||||
}
|
||||
grafana_dashboard['panels'].append(panel)
|
||||
|
||||
return grafana_dashboard
|
||||
```
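The example above leans on two helpers that are not shown; minimal sketches follow (panel-type names follow current Grafana conventions, and the query conversion is deliberately a stub to be replaced with the translations in `dql_promql_translation.md`):

```python
WIDGET_TO_PANEL = {
    "timeseries": "timeseries",
    "query_value": "stat",
    "toplist": "bargauge",
    "heatmap": "heatmap",
    "distribution": "histogram",
}

def map_widget_type(widget_type: str) -> str:
    """Map a Datadog widget type to the closest Grafana panel type."""
    return WIDGET_TO_PANEL.get(widget_type, "timeseries")

def convert_queries(requests_block) -> list:
    """Placeholder: carry the raw Datadog queries across; translate them to PromQL by hand."""
    targets = []
    for req in requests_block:
        for query in req.get("queries", []):
            targets.append({"expr": query.get("query", ""), "legendFormat": query.get("name", "")})
    return targets
```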
|
||||
|
||||
### Step 3: Common Query Translations
|
||||
|
||||
See `dql_promql_translation.md` for comprehensive query mappings.
|
||||
|
||||
**Example conversions**:
|
||||
|
||||
```
|
||||
Datadog: avg:system.cpu.user{*}
|
||||
Prometheus: avg(rate(node_cpu_seconds_total{mode="user"}[5m])) * 100
|
||||
|
||||
Datadog: sum:requests.count{status:200}.as_rate()
|
||||
Prometheus: sum(rate(http_requests_total{status="200"}[5m]))
|
||||
|
||||
Datadog: p95:request.duration{*}
|
||||
Prometheus: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 3. Alert Migration (Datadog Monitors → Prometheus Alerts)
|
||||
|
||||
### Step 1: Export Datadog Monitors
|
||||
|
||||
```python
|
||||
import json
import requests
|
||||
|
||||
api_key = "YOUR_API_KEY"
|
||||
app_key = "YOUR_APP_KEY"
|
||||
|
||||
headers = {
|
||||
'DD-API-KEY': api_key,
|
||||
'DD-APPLICATION-KEY': app_key
|
||||
}
|
||||
|
||||
response = requests.get(
|
||||
'https://api.datadoghq.com/api/v1/monitor',
|
||||
headers=headers
|
||||
)
|
||||
|
||||
monitors = response.json()
|
||||
|
||||
# Save each monitor
|
||||
for monitor in monitors:
|
||||
with open(f'monitor_{monitor["id"]}.json', 'w') as f:
|
||||
json.dump(monitor, f, indent=2)
|
||||
```
|
||||
|
||||
### Step 2: Convert to Prometheus Alert Rules
|
||||
|
||||
**Datadog Monitor**:
|
||||
```json
|
||||
{
|
||||
"name": "High CPU Usage",
|
||||
"type": "metric alert",
|
||||
"query": "avg(last_5m):avg:system.cpu.user{*} > 80",
|
||||
"message": "CPU usage is high on {{host.name}}"
|
||||
}
|
||||
```
|
||||
|
||||
**Prometheus Alert**:
|
||||
```yaml
|
||||
groups:
|
||||
- name: infrastructure
|
||||
rules:
|
||||
- alert: HighCPUUsage
|
||||
expr: |
|
||||
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "High CPU usage on {{ $labels.instance }}"
|
||||
description: "CPU usage is {{ $value }}%"
|
||||
```
|
||||
|
||||
### Step 3: Alert Routing (Datadog → Alertmanager)
|
||||
|
||||
**Datadog notification channels** → **Alertmanager receivers**
|
||||
|
||||
```yaml
|
||||
# alertmanager.yml
|
||||
route:
|
||||
group_by: ['alertname', 'severity']
|
||||
receiver: 'slack-notifications'
|
||||
|
||||
receivers:
|
||||
- name: 'slack-notifications'
|
||||
slack_configs:
|
||||
- api_url: 'YOUR_SLACK_WEBHOOK'
|
||||
channel: '#alerts'
|
||||
|
||||
- name: 'pagerduty-critical'
|
||||
pagerduty_configs:
|
||||
- service_key: 'YOUR_PAGERDUTY_KEY'
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4. Log Migration (Datadog → Loki)
|
||||
|
||||
### Step 1: Deploy Loki
|
||||
|
||||
**Kubernetes**:
|
||||
```bash
|
||||
helm repo add grafana https://grafana.github.io/helm-charts
|
||||
helm install loki grafana/loki-stack \
|
||||
--set loki.persistence.enabled=true \
|
||||
--set loki.persistence.size=100Gi \
|
||||
--set promtail.enabled=true
|
||||
```
|
||||
|
||||
**Docker Compose**:
|
||||
```yaml
|
||||
version: '3'
|
||||
services:
|
||||
loki:
|
||||
image: grafana/loki:latest
|
||||
ports:
|
||||
- "3100:3100"
|
||||
volumes:
|
||||
- ./loki-config.yaml:/etc/loki/local-config.yaml
|
||||
- loki-data:/loki
|
||||
|
||||
promtail:
|
||||
image: grafana/promtail:latest
|
||||
volumes:
|
||||
- /var/log:/var/log
|
||||
- ./promtail-config.yaml:/etc/promtail/config.yml
|
||||
|
||||
volumes:
|
||||
loki-data:
|
||||
```
|
||||
|
||||
### Step 2: Replace Datadog Log Forwarder
|
||||
|
||||
**Before (Datadog Agent)**:
|
||||
```yaml
|
||||
# datadog.yaml
|
||||
logs_enabled: true
|
||||
|
||||
logs_config:
|
||||
container_collect_all: true
|
||||
```
|
||||
|
||||
**After (Promtail)**:
|
||||
```yaml
|
||||
# promtail-config.yaml
|
||||
server:
|
||||
http_listen_port: 9080
|
||||
|
||||
positions:
|
||||
filename: /tmp/positions.yaml
|
||||
|
||||
clients:
|
||||
- url: http://loki:3100/loki/api/v1/push
|
||||
|
||||
scrape_configs:
|
||||
- job_name: system
|
||||
static_configs:
|
||||
- targets:
|
||||
- localhost
|
||||
labels:
|
||||
job: varlogs
|
||||
__path__: /var/log/*.log
|
||||
```
|
||||
|
||||
### Step 3: Query Translation
|
||||
|
||||
**Datadog Logs Query**:
|
||||
```
|
||||
service:my-app status:error
|
||||
```
|
||||
|
||||
**Loki LogQL**:
|
||||
```logql
|
||||
{job="my-app", level="error"}
|
||||
```
|
||||
|
||||
**More examples**:
|
||||
```
|
||||
Datadog: service:api-gateway status:error @http.status_code:>=500
|
||||
Loki: {service="api-gateway", level="error"} | json | http_status_code >= 500
|
||||
|
||||
Datadog: source:nginx "404"
|
||||
Loki: {source="nginx"} |= "404"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 5. APM Migration (Datadog APM → Tempo/Jaeger)
|
||||
|
||||
### Step 1: Choose Tracing Backend
|
||||
|
||||
- **Tempo**: Better for high volume, cheaper storage (object storage)
|
||||
- **Jaeger**: More mature, better UI, requires separate storage
|
||||
|
||||
### Step 2: Replace Datadog Tracer with OpenTelemetry
|
||||
|
||||
**Before (Datadog Python)**:
|
||||
```python
|
||||
from ddtrace import tracer
|
||||
|
||||
@tracer.wrap()
|
||||
def my_function():
|
||||
pass
|
||||
```
|
||||
|
||||
**After (OpenTelemetry)**:
|
||||
```python
|
||||
from opentelemetry import trace
|
||||
from opentelemetry.sdk.trace import TracerProvider
|
||||
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
|
||||
|
||||
# Setup
|
||||
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="tempo:4317")))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
|
||||
|
||||
@tracer.start_as_current_span("my_function")
|
||||
def my_function():
|
||||
pass
|
||||
```
|
||||
|
||||
### Step 3: Deploy Tempo
|
||||
|
||||
```yaml
|
||||
# tempo.yaml
|
||||
server:
|
||||
http_listen_port: 3200
|
||||
|
||||
distributor:
|
||||
receivers:
|
||||
otlp:
|
||||
protocols:
|
||||
grpc:
|
||||
endpoint: 0.0.0.0:4317
|
||||
|
||||
storage:
|
||||
trace:
|
||||
backend: s3
|
||||
s3:
|
||||
bucket: tempo-traces
|
||||
endpoint: s3.amazonaws.com
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 6. Infrastructure Migration
|
||||
|
||||
### Recommended Architecture
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Grafana (Visualization) │
|
||||
│ - Dashboards │
|
||||
│ - Unified view │
|
||||
└─────────────────────────────────────────┘
|
||||
↓ ↓ ↓
|
||||
┌──────────────┐ ┌──────────┐ ┌──────────┐
|
||||
│ Prometheus │ │ Loki │ │ Tempo │
|
||||
│ (Metrics) │ │ (Logs) │ │ (Traces) │
|
||||
└──────────────┘ └──────────┘ └──────────┘
|
||||
↓ ↓ ↓
|
||||
┌─────────────────────────────────────────┐
|
||||
│ Applications (OpenTelemetry) │
|
||||
└─────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
### Sizing Recommendations
|
||||
|
||||
**100-host environment**:
|
||||
|
||||
- **Prometheus**: 2-4 CPU, 8-16GB RAM, 100GB SSD
|
||||
- **Grafana**: 1 CPU, 2GB RAM
|
||||
- **Loki**: 2-4 CPU, 8GB RAM, 100GB SSD
|
||||
- **Tempo**: 2-4 CPU, 8GB RAM, S3 for storage
|
||||
- **Alertmanager**: 1 CPU, 1GB RAM
|
||||
|
||||
**Total**: ~8-16 CPU, 32-64GB RAM, 200GB SSD + object storage
|
||||
|
||||
---
|
||||
|
||||
## 7. Migration Checklist
|
||||
|
||||
### Pre-Migration
|
||||
- [ ] Calculate current Datadog costs
|
||||
- [ ] Identify all Datadog integrations
|
||||
- [ ] Export all dashboards
|
||||
- [ ] Export all monitors
|
||||
- [ ] Document custom metrics
|
||||
- [ ] Get stakeholder approval
|
||||
|
||||
### During Migration
|
||||
- [ ] Deploy Prometheus + Grafana
|
||||
- [ ] Deploy Loki + Promtail
|
||||
- [ ] Deploy Tempo/Jaeger (if using APM)
|
||||
- [ ] Migrate metrics instrumentation
|
||||
- [ ] Convert dashboards (top 10 critical first)
|
||||
- [ ] Convert alerts (critical alerts first)
|
||||
- [ ] Update application logging
|
||||
- [ ] Replace APM instrumentation
|
||||
- [ ] Run parallel for 2-4 weeks
|
||||
- [ ] Validate data accuracy
|
||||
- [ ] Train team on new tools
|
||||
|
||||
### Post-Migration
|
||||
- [ ] Decommission Datadog agent from all hosts
|
||||
- [ ] Cancel Datadog subscription
|
||||
- [ ] Archive Datadog configs
|
||||
- [ ] Document new workflows
|
||||
- [ ] Create runbooks for common tasks
|
||||
|
||||
---
|
||||
|
||||
## 8. Common Challenges & Solutions
|
||||
|
||||
### Challenge: Missing Datadog Features
|
||||
|
||||
**Datadog Synthetic Monitoring**:
|
||||
- Solution: Use **Blackbox Exporter** (Prometheus) or **Grafana Synthetic Monitoring**
|
||||
|
||||
**Datadog Network Performance Monitoring**:
|
||||
- Solution: Use **Cilium Hubble** (Kubernetes) or **eBPF-based tools**
|
||||
|
||||
**Datadog RUM (Real User Monitoring)**:
|
||||
- Solution: Use **Grafana Faro** or **OpenTelemetry Browser SDK**
|
||||
|
||||
### Challenge: Team Learning Curve
|
||||
|
||||
**Solution**:
|
||||
- Provide training sessions (2-3 hours per tool)
|
||||
- Create internal documentation with examples
|
||||
- Set up sandbox environment for practice
|
||||
- Assign champions for each tool
|
||||
|
||||
### Challenge: Query Performance
|
||||
|
||||
**Prometheus too slow**:
|
||||
- Use **Thanos** or **Cortex** for scaling
|
||||
- Implement recording rules for expensive queries
|
||||
- Increase retention only where needed
|
||||
|
||||
**Loki too slow**:
|
||||
- Add more labels for better filtering
|
||||
- Use chunk caching
|
||||
- Consider **parallel query execution**
|
||||
|
||||
---
|
||||
|
||||
## 9. Maintenance Comparison
|
||||
|
||||
### Datadog (Managed)
|
||||
- **Ops burden**: Low (fully managed)
|
||||
- **Upgrades**: Automatic
|
||||
- **Scaling**: Automatic
|
||||
- **Cost**: High ($6k-10k+/month)
|
||||
|
||||
### Open-Source Stack (Self-hosted)
|
||||
- **Ops burden**: Medium (requires ops team)
|
||||
- **Upgrades**: Manual (quarterly)
|
||||
- **Scaling**: Manual planning required
|
||||
- **Cost**: Low ($1.5k-3k/month infrastructure)
|
||||
|
||||
**Hybrid Option**: Use **Grafana Cloud** (managed Prometheus/Loki/Tempo)
|
||||
- Cost: ~$3k/month for 100 hosts
|
||||
- Ops burden: Low
|
||||
- Savings: ~50% vs Datadog
|
||||
|
||||
---
|
||||
|
||||
## 10. ROI Calculation
|
||||
|
||||
### Example Scenario
|
||||
|
||||
**Before (Datadog)**:
|
||||
- Monthly cost: $7,000
|
||||
- Annual cost: $84,000
|
||||
|
||||
**After (Self-hosted OSS)**:
|
||||
- Infrastructure: $1,800/month
|
||||
- Operations (0.5 FTE): $4,000/month
|
||||
- Annual cost: $69,600
|
||||
|
||||
**Savings**: $14,400/year
|
||||
|
||||
**After (Grafana Cloud)**:
|
||||
- Monthly cost: $3,500
|
||||
- Annual cost: $42,000
|
||||
|
||||
**Savings**: $42,000/year (50%)
|
||||
|
||||
**Break-even**: Immediate (no migration costs beyond engineering time)
|
||||
|
||||
---
|
||||
|
||||
## Resources
|
||||
|
||||
- **Prometheus**: https://prometheus.io/docs/
|
||||
- **Grafana**: https://grafana.com/docs/
|
||||
- **Loki**: https://grafana.com/docs/loki/
|
||||
- **Tempo**: https://grafana.com/docs/tempo/
|
||||
- **OpenTelemetry**: https://opentelemetry.io/
|
||||
- **Migration Tools**: https://github.com/grafana/dashboard-linter
|
||||
|
||||
---
|
||||
|
||||
## Support
|
||||
|
||||
If you need help with migration:
|
||||
- Grafana Labs offers migration consulting
|
||||
- Many SRE consulting firms specialize in this
|
||||
- Community support via Slack/Discord channels
|
||||
756
references/dql_promql_translation.md
Normal file
@@ -0,0 +1,756 @@
|
||||
# DQL (Datadog Query Language) ↔ PromQL Translation Guide
|
||||
|
||||
## Quick Reference
|
||||
|
||||
| Concept | Datadog (DQL) | Prometheus (PromQL) |
|
||||
|---------|---------------|---------------------|
|
||||
| Aggregation | `avg:`, `sum:`, `min:`, `max:` | `avg()`, `sum()`, `min()`, `max()` |
|
||||
| Rate | `.as_rate()`, `.as_count()` | `rate()`, `increase()` |
|
||||
| Percentile | `p50:`, `p95:`, `p99:` | `histogram_quantile()` |
|
||||
| Filtering | `{tag:value}` | `{label="value"}` |
|
||||
| Time window | `last_5m`, `last_1h` | `[5m]`, `[1h]` |
|
||||
|
||||
---
|
||||
|
||||
## Basic Queries
|
||||
|
||||
### Simple Metric Query
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
system.cpu.user
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
node_cpu_seconds_total{mode="user"}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Metric with Filter
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
system.cpu.user{host:web-01}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
node_cpu_seconds_total{mode="user", instance="web-01"}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Multiple Filters (AND)
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
system.cpu.user{host:web-01,env:production}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
node_cpu_seconds_total{mode="user", instance="web-01", env="production"}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Wildcard Filters
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
system.cpu.user{host:web-*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
node_cpu_seconds_total{mode="user", instance=~"web-.*"}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### OR Filters
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
system.cpu.user{host:web-01 OR host:web-02}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
node_cpu_seconds_total{mode="user", instance=~"web-01|web-02"}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Aggregations
|
||||
|
||||
### Average
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg:system.cpu.user{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg(node_cpu_seconds_total{mode="user"})
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Sum
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:requests.count{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum(http_requests_total)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Min/Max
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
min:system.mem.free{*}
|
||||
max:system.mem.free{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
min(node_memory_MemFree_bytes)
|
||||
max(node_memory_MemFree_bytes)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Aggregation by Tag/Label
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg:system.cpu.user{*} by {host}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg by (instance) (node_cpu_seconds_total{mode="user"})
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Rates and Counts
|
||||
|
||||
### Rate (per second)
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:requests.count{*}.as_rate()
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum(rate(http_requests_total[5m]))
|
||||
```
|
||||
|
||||
Note: Prometheus requires explicit time window `[5m]`
|
||||
|
||||
---
|
||||
|
||||
### Count (total over time)
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:requests.count{*}.as_count()
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum(increase(http_requests_total[1h]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Derivative (change over time)
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
derivative(avg:system.disk.used{*})
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
deriv(node_filesystem_size_bytes[5m])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Percentiles
|
||||
|
||||
### P50 (Median)
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
p50:request.duration{*}
|
||||
```
|
||||
|
||||
**Prometheus** (requires histogram):
|
||||
```promql
|
||||
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### P95
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
p95:request.duration{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### P99
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
p99:request.duration{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Time Windows
|
||||
|
||||
### Last 5 minutes
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg(last_5m):system.cpu.user{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg_over_time(node_cpu_seconds_total{mode="user"}[5m])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Last 1 hour
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg(last_1h):system.cpu.user{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg_over_time(node_cpu_seconds_total{mode="user"}[1h])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Math Operations
|
||||
|
||||
### Division
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg:system.mem.used{*} / avg:system.mem.total{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Multiplication
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg:system.cpu.user{*} * 100
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg(node_cpu_seconds_total{mode="user"}) * 100
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Percentage Calculation
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
(sum:requests.errors{*} / sum:requests.count{*}) * 100
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Common Use Cases
|
||||
|
||||
### CPU Usage Percentage
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
100 - avg:system.cpu.idle{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Memory Usage Percentage
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
(avg:system.mem.used{*} / avg:system.mem.total{*}) * 100
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Disk Usage Percentage
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
(avg:system.disk.used{*} / avg:system.disk.total{*}) * 100
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Request Rate (requests/sec)
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:requests.count{*}.as_rate()
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum(rate(http_requests_total[5m]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Error Rate Percentage
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
(sum:requests.errors{*}.as_rate() / sum:requests.count{*}.as_rate()) * 100
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Request Latency (P95)
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
p95:request.duration{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Top 5 Hosts by CPU
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
top(avg:system.cpu.user{*} by {host}, 5, 'mean', 'desc')
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
topk(5, avg by (instance) (rate(node_cpu_seconds_total{mode="user"}[5m])))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Functions
|
||||
|
||||
### Absolute Value
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
abs(diff(avg:system.cpu.user{*}))
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
abs(delta(node_cpu_seconds_total{mode="user"}[5m]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Ceiling/Floor
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
ceil(avg:system.cpu.user{*})
|
||||
floor(avg:system.cpu.user{*})
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
ceil(avg(node_cpu_seconds_total{mode="user"}))
|
||||
floor(avg(node_cpu_seconds_total{mode="user"}))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Clamp (Limit Range)
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
clamp_min(avg:system.cpu.user{*}, 0)
|
||||
clamp_max(avg:system.cpu.user{*}, 100)
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
clamp_min(avg(node_cpu_seconds_total{mode="user"}), 0)
|
||||
clamp_max(avg(node_cpu_seconds_total{mode="user"}), 100)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Moving Average
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
moving_rollup(avg:system.cpu.user{*}, 60, 'avg')
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg_over_time(node_cpu_seconds_total{mode="user"}[1h])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Advanced Patterns
|
||||
|
||||
### Compare to Previous Period
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:requests.count{*}.as_rate() / timeshift(sum:requests.count{*}.as_rate(), -3600)
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum(rate(http_requests_total[5m])) / sum(rate(http_requests_total[5m] offset 1h))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Forecast
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
forecast(avg:system.disk.used{*}, 'linear', 1)
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
# Forecast free space one hour ahead (the Datadog example forecasts used space)
predict_linear(node_filesystem_avail_bytes[1h], 3600)
|
||||
```
|
||||
|
||||
Note: Predicts the value 1 hour in the future based on the trend over the last hour
|
||||
|
||||
---
|
||||
|
||||
### Anomaly Detection
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
anomalies(avg:system.cpu.user{*}, 'basic', 2)
|
||||
```
|
||||
|
||||
**Prometheus**: No built-in function
|
||||
- Use recording rules with stddev
|
||||
- External tools like **Robust Perception's anomaly detector**
|
||||
- Or use **Grafana ML** plugin
|
||||
|
||||
---
|
||||
|
||||
### Outlier Detection
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
outliers(avg:system.cpu.user{*} by {host}, 'mad')
|
||||
```
|
||||
|
||||
**Prometheus**: No built-in function
|
||||
- Calculate manually with stddev:
|
||||
```promql
|
||||
abs(metric - scalar(avg(metric))) > 2 * scalar(stddev(metric))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Container & Kubernetes
|
||||
|
||||
### Container CPU Usage
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg:docker.cpu.usage{*} by {container_name}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg by (container) (rate(container_cpu_usage_seconds_total[5m]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Container Memory Usage
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg:docker.mem.rss{*} by {container_name}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg by (container) (container_memory_rss)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Pod Count by Status
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:kubernetes.pods.running{*} by {kube_namespace}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum by (namespace) (kube_pod_status_phase{phase="Running"})
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Database Queries
|
||||
|
||||
### MySQL Queries Per Second
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:mysql.performance.queries{*}.as_rate()
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum(rate(mysql_global_status_queries[5m]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### PostgreSQL Active Connections
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg:postgresql.connections{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg(pg_stat_database_numbackends)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Redis Memory Usage
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
avg:redis.mem.used{*}
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
avg(redis_memory_used_bytes)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Network Metrics
|
||||
|
||||
### Network Bytes Sent
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:system.net.bytes_sent{*}.as_rate()
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum(rate(node_network_transmit_bytes_total[5m]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Network Bytes Received
|
||||
|
||||
**Datadog**:
|
||||
```
|
||||
sum:system.net.bytes_rcvd{*}.as_rate()
|
||||
```
|
||||
|
||||
**Prometheus**:
|
||||
```promql
|
||||
sum(rate(node_network_receive_bytes_total[5m]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Key Differences
|
||||
|
||||
### 1. Time Windows
|
||||
- **Datadog**: Optional, defaults to query time range
|
||||
- **Prometheus**: Always required for rate/increase functions
|
||||
|
||||
### 2. Histograms
|
||||
- **Datadog**: Percentiles available directly
|
||||
- **Prometheus**: Requires histogram buckets + `histogram_quantile()`
|
||||
|
||||
### 3. Default Aggregation
|
||||
- **Datadog**: No default, must specify
|
||||
- **Prometheus**: Returns all time series unless aggregated
|
||||
|
||||
### 4. Metric Types
|
||||
- **Datadog**: All metrics treated similarly
|
||||
- **Prometheus**: Explicit types (counter, gauge, histogram, summary)
|
||||
|
||||
### 5. Tag vs Label
|
||||
- **Datadog**: Uses "tags" (key:value)
|
||||
- **Prometheus**: Uses "labels" (key="value")
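Of these, difference #2 is the one that bites hardest during a migration: `histogram_quantile()` only works if the application actually exports bucketed histograms. A minimal Python sketch using the official `prometheus_client` library (the metric name matches the examples above; the bucket boundaries are an assumption to tune around your own SLO thresholds):

```python
from prometheus_client import Histogram, start_http_server
import random
import time

# Bucket boundaries are an assumption -- pick values around your latency targets.
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration in seconds",
    ["method", "endpoint"],
    buckets=[0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
)

def handle_request():
    # .time() records the elapsed wall-clock time into the histogram buckets
    with REQUEST_DURATION.labels(method="GET", endpoint="/api/users").time():
        time.sleep(random.uniform(0.01, 0.3))  # simulated work

start_http_server(8000)  # exposes /metrics for Prometheus to scrape
handle_request()
```

With buckets exported, the `histogram_quantile(0.95, ...)` queries shown earlier work as written.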
|
||||
|
||||
---
|
||||
|
||||
## Migration Tips
|
||||
|
||||
1. **Start with dashboards**: Convert most-used dashboards first
|
||||
2. **Use recording rules**: Pre-calculate expensive PromQL queries
|
||||
3. **Test in parallel**: Run both systems during migration
|
||||
4. **Document mappings**: Create team-specific translation guide
|
||||
5. **Train team**: PromQL has a learning curve; invest in training
|
||||
|
||||
---
|
||||
|
||||
## Tools
|
||||
|
||||
- **Datadog Dashboard Exporter**: Export JSON dashboards
|
||||
- **Grafana Dashboard Linter**: Validate converted dashboards
|
||||
- **PromQL Learning Resources**: https://prometheus.io/docs/prometheus/latest/querying/basics/
|
||||
|
||||
---
|
||||
|
||||
## Common Gotchas
|
||||
|
||||
### Rate without Time Window
|
||||
|
||||
❌ **Wrong**:
|
||||
```promql
|
||||
rate(http_requests_total)
|
||||
```
|
||||
|
||||
✅ **Correct**:
|
||||
```promql
|
||||
rate(http_requests_total[5m])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Aggregating Before Rate
|
||||
|
||||
❌ **Wrong**:
|
||||
```promql
|
||||
rate(sum(http_requests_total)[5m])
|
||||
```
|
||||
|
||||
✅ **Correct**:
|
||||
```promql
|
||||
sum(rate(http_requests_total[5m]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Histogram Quantile Without by (le)
|
||||
|
||||
❌ **Wrong**:
|
||||
```promql
|
||||
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
|
||||
```
|
||||
|
||||
✅ **Correct**:
|
||||
```promql
|
||||
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Quick Conversion Checklist
|
||||
|
||||
When converting a Datadog query to PromQL:
|
||||
|
||||
- [ ] Replace metric name (e.g., `system.cpu.user` → `node_cpu_seconds_total`)
|
||||
- [ ] Convert tags to labels (`{tag:value}` → `{label="value"}`)
|
||||
- [ ] Add time window for rate/increase (`[5m]`)
|
||||
- [ ] Change aggregation syntax (`avg:` → `avg()`)
|
||||
- [ ] Convert percentiles to histogram_quantile if needed
|
||||
- [ ] Test query in Prometheus before adding to dashboard
|
||||
- [ ] Add `by (label)` for grouped aggregations
|
||||
|
||||
---
|
||||
|
||||
## Need More Help?
|
||||
|
||||
- See `datadog_migration.md` for full migration guide
|
||||
- PromQL documentation: https://prometheus.io/docs/prometheus/latest/querying/
|
||||
- Practice at: https://demo.promlens.com/
|
||||
775
references/logging_guide.md
Normal file
@@ -0,0 +1,775 @@
|
||||
# Logging Guide
|
||||
|
||||
## Structured Logging
|
||||
|
||||
### Why Structured Logs?
|
||||
|
||||
**Unstructured** (text):
|
||||
```
|
||||
2024-10-28 14:32:15 User john@example.com logged in from 192.168.1.1
|
||||
```
|
||||
|
||||
**Structured** (JSON):
|
||||
```json
|
||||
{
|
||||
"timestamp": "2024-10-28T14:32:15Z",
|
||||
"level": "info",
|
||||
"message": "User logged in",
|
||||
"user": "john@example.com",
|
||||
"ip": "192.168.1.1",
|
||||
"event_type": "user_login"
|
||||
}
|
||||
```
|
||||
|
||||
**Benefits**:
|
||||
- Easy to parse and query
|
||||
- Consistent format
|
||||
- Machine-readable
|
||||
- Efficient storage and indexing
|
||||
|
||||
---
|
||||
|
||||
## Log Levels
|
||||
|
||||
Use appropriate log levels for better filtering and alerting.
|
||||
|
||||
### DEBUG
|
||||
**When**: Development, troubleshooting
|
||||
**Examples**:
|
||||
- Function entry/exit
|
||||
- Variable values
|
||||
- Internal state changes
|
||||
|
||||
```python
|
||||
logger.debug("Processing request", extra={
|
||||
"request_id": req_id,
|
||||
"params": params
|
||||
})
|
||||
```
|
||||
|
||||
### INFO
|
||||
**When**: Important business events
|
||||
**Examples**:
|
||||
- User actions (login, purchase)
|
||||
- System state changes (started, stopped)
|
||||
- Significant milestones
|
||||
|
||||
```python
|
||||
logger.info("Order placed", extra={
|
||||
"order_id": "12345",
|
||||
"user_id": "user123",
|
||||
"amount": 99.99
|
||||
})
|
||||
```
|
||||
|
||||
### WARN
|
||||
**When**: Potentially problematic situations
|
||||
**Examples**:
|
||||
- Deprecated API usage
|
||||
- Slow operations (but not failing)
|
||||
- Retry attempts
|
||||
- Resource usage approaching limits
|
||||
|
||||
```python
|
||||
logger.warning("API response slow", extra={
|
||||
"endpoint": "/api/users",
|
||||
"duration_ms": 2500,
|
||||
"threshold_ms": 1000
|
||||
})
|
||||
```
|
||||
|
||||
### ERROR
|
||||
**When**: Error conditions that need attention
|
||||
**Examples**:
|
||||
- Failed requests
|
||||
- Exceptions caught and handled
|
||||
- Integration failures
|
||||
- Data validation errors
|
||||
|
||||
```python
|
||||
logger.error("Payment processing failed", extra={
|
||||
"order_id": "12345",
|
||||
"error": str(e),
|
||||
"payment_gateway": "stripe"
|
||||
}, exc_info=True)
|
||||
```
|
||||
|
||||
### FATAL/CRITICAL
|
||||
**When**: Severe errors causing shutdown
|
||||
**Examples**:
|
||||
- Database connection lost
|
||||
- Out of memory
|
||||
- Configuration errors preventing startup
|
||||
|
||||
```python
|
||||
logger.critical("Database connection lost", extra={
|
||||
"database": "postgres",
|
||||
"host": "db.example.com",
|
||||
"attempt": 3
|
||||
})
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Required Fields
|
||||
|
||||
Every log entry should include:
|
||||
|
||||
### 1. Timestamp
|
||||
ISO 8601 format with timezone:
|
||||
```json
|
||||
{
|
||||
"timestamp": "2024-10-28T14:32:15.123Z"
|
||||
}
|
||||
```
|
||||
|
||||
### 2. Level
|
||||
Standard levels: debug, info, warn, error, critical
|
||||
```json
|
||||
{
|
||||
"level": "error"
|
||||
}
|
||||
```
|
||||
|
||||
### 3. Message
|
||||
Human-readable description:
|
||||
```json
|
||||
{
|
||||
"message": "User authentication failed"
|
||||
}
|
||||
```
|
||||
|
||||
### 4. Service/Application
|
||||
What component logged this:
|
||||
```json
|
||||
{
|
||||
"service": "api-gateway",
|
||||
"version": "1.2.3"
|
||||
}
|
||||
```
|
||||
|
||||
### 5. Environment
|
||||
```json
|
||||
{
|
||||
"environment": "production"
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Recommended Fields
|
||||
|
||||
### Request Context
|
||||
```json
|
||||
{
|
||||
"request_id": "550e8400-e29b-41d4-a716-446655440000",
|
||||
"user_id": "user123",
|
||||
"session_id": "sess_abc",
|
||||
"ip_address": "192.168.1.1",
|
||||
"user_agent": "Mozilla/5.0..."
|
||||
}
|
||||
```
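The request context is only useful if the same `request_id` ends up on every log line for that request. A minimal sketch using `contextvars` with structlog (the handler function is a placeholder; wire the ID assignment into whatever middleware your framework provides):

```python
import contextvars
import uuid
import structlog

# Holds the current request's ID for the duration of that request
request_id_var = contextvars.ContextVar("request_id", default=None)

def bind_request_id(logger, method_name, event_dict):
    # structlog processor: attach the request_id (if any) to every event
    request_id = request_id_var.get()
    if request_id is not None:
        event_dict["request_id"] = request_id
    return event_dict

structlog.configure(
    processors=[
        bind_request_id,
        structlog.processors.JSONRenderer(),
    ]
)

log = structlog.get_logger()

def handle_request(incoming_request_id=None):
    # Reuse the caller's ID if it was propagated, otherwise mint a new one
    request_id_var.set(incoming_request_id or str(uuid.uuid4()))
    log.info("request_started")    # both lines carry the same request_id
    log.info("request_finished")
```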
|
||||
|
||||
### Performance Metrics
|
||||
```json
|
||||
{
|
||||
"duration_ms": 245,
|
||||
"response_size_bytes": 1024
|
||||
}
|
||||
```
|
||||
|
||||
### Error Details
|
||||
```json
|
||||
{
|
||||
"error_type": "ValidationError",
|
||||
"error_message": "Invalid email format",
|
||||
"stack_trace": "...",
|
||||
"error_code": "VAL_001"
|
||||
}
|
||||
```
|
||||
|
||||
### Business Context
|
||||
```json
|
||||
{
|
||||
"order_id": "ORD-12345",
|
||||
"customer_id": "CUST-789",
|
||||
"transaction_amount": 99.99,
|
||||
"payment_method": "credit_card"
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Implementation Examples
|
||||
|
||||
### Python (using structlog)
|
||||
```python
|
||||
import structlog
|
||||
|
||||
logger = structlog.get_logger()
|
||||
|
||||
# Configure structured logging
|
||||
structlog.configure(
|
||||
processors=[
|
||||
structlog.processors.TimeStamper(fmt="iso"),
|
||||
structlog.processors.add_log_level,
|
||||
structlog.processors.JSONRenderer()
|
||||
]
|
||||
)
|
||||
|
||||
# Usage
|
||||
logger.info(
|
||||
"user_logged_in",
|
||||
user_id="user123",
|
||||
ip_address="192.168.1.1",
|
||||
login_method="oauth"
|
||||
)
|
||||
```
|
||||
|
||||
### Node.js (using Winston)
|
||||
```javascript
|
||||
const winston = require('winston');
|
||||
|
||||
const logger = winston.createLogger({
|
||||
format: winston.format.json(),
|
||||
defaultMeta: { service: 'api-gateway' },
|
||||
transports: [
|
||||
new winston.transports.Console()
|
||||
]
|
||||
});
|
||||
|
||||
logger.info('User logged in', {
|
||||
userId: 'user123',
|
||||
ipAddress: '192.168.1.1',
|
||||
loginMethod: 'oauth'
|
||||
});
|
||||
```
|
||||
|
||||
### Go (using zap)
|
||||
```go
|
||||
import "go.uber.org/zap"
|
||||
|
||||
logger, _ := zap.NewProduction()
|
||||
defer logger.Sync()
|
||||
|
||||
logger.Info("User logged in",
|
||||
zap.String("userId", "user123"),
|
||||
zap.String("ipAddress", "192.168.1.1"),
|
||||
zap.String("loginMethod", "oauth"),
|
||||
)
|
||||
```
|
||||
|
||||
### Java (using Logback with JSON)
|
||||
```java
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
import net.logstash.logback.argument.StructuredArguments;
|
||||
|
||||
Logger logger = LoggerFactory.getLogger(MyClass.class);
|
||||
|
||||
logger.info("User logged in",
|
||||
StructuredArguments.kv("userId", "user123"),
|
||||
StructuredArguments.kv("ipAddress", "192.168.1.1"),
|
||||
StructuredArguments.kv("loginMethod", "oauth")
|
||||
);
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Log Aggregation Patterns
|
||||
|
||||
### Pattern 1: ELK Stack (Elasticsearch, Logstash, Kibana)
|
||||
|
||||
**Architecture**:
|
||||
```
|
||||
Application → Filebeat → Logstash → Elasticsearch → Kibana
|
||||
```
|
||||
|
||||
**filebeat.yml**:
|
||||
```yaml
|
||||
filebeat.inputs:
|
||||
- type: log
|
||||
enabled: true
|
||||
paths:
|
||||
- /var/log/app/*.log
|
||||
json.keys_under_root: true
|
||||
json.add_error_key: true
|
||||
|
||||
output.logstash:
|
||||
hosts: ["logstash:5044"]
|
||||
```
|
||||
|
||||
**logstash.conf**:
|
||||
```
|
||||
input {
|
||||
beats {
|
||||
port => 5044
|
||||
}
|
||||
}
|
||||
|
||||
filter {
|
||||
json {
|
||||
source => "message"
|
||||
}
|
||||
|
||||
date {
|
||||
match => ["timestamp", "ISO8601"]
|
||||
}
|
||||
|
||||
grok {
|
||||
match => { "message" => "%{COMBINEDAPACHELOG}" }
|
||||
}
|
||||
}
|
||||
|
||||
output {
|
||||
elasticsearch {
|
||||
hosts => ["elasticsearch:9200"]
|
||||
index => "app-logs-%{+YYYY.MM.dd}"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Pattern 2: Loki (Grafana Loki)
|
||||
|
||||
**Architecture**:
|
||||
```
|
||||
Application → Promtail → Loki → Grafana
|
||||
```
|
||||
|
||||
**promtail-config.yml**:
|
||||
```yaml
|
||||
server:
|
||||
http_listen_port: 9080
|
||||
|
||||
positions:
|
||||
filename: /tmp/positions.yaml
|
||||
|
||||
clients:
|
||||
- url: http://loki:3100/loki/api/v1/push
|
||||
|
||||
scrape_configs:
|
||||
- job_name: app
|
||||
static_configs:
|
||||
- targets:
|
||||
- localhost
|
||||
labels:
|
||||
job: app
|
||||
__path__: /var/log/app/*.log
|
||||
pipeline_stages:
|
||||
- json:
|
||||
expressions:
|
||||
level: level
|
||||
timestamp: timestamp
|
||||
- labels:
|
||||
level:
|
||||
service:
|
||||
- timestamp:
|
||||
source: timestamp
|
||||
format: RFC3339
|
||||
```
|
||||
|
||||
**Query in Grafana**:
|
||||
```logql
|
||||
{job="app"} |= "error" | json | level="error"
|
||||
```
|
||||
|
||||
### Pattern 3: CloudWatch Logs
|
||||
|
||||
**Install CloudWatch agent**:
|
||||
```json
|
||||
{
|
||||
"logs": {
|
||||
"logs_collected": {
|
||||
"files": {
|
||||
"collect_list": [
|
||||
{
|
||||
"file_path": "/var/log/app/*.log",
|
||||
"log_group_name": "/aws/app/production",
|
||||
"log_stream_name": "{instance_id}",
|
||||
"timezone": "UTC"
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Query with CloudWatch Insights**:
|
||||
```
|
||||
fields @timestamp, level, message, user_id
|
||||
| filter level = "error"
|
||||
| sort @timestamp desc
|
||||
| limit 100
|
||||
```
|
||||
|
||||
### Pattern 4: Fluentd/Fluent Bit
|
||||
|
||||
**fluent-bit.conf**:
|
||||
```
|
||||
[INPUT]
|
||||
Name tail
|
||||
Path /var/log/app/*.log
|
||||
Parser json
|
||||
Tag app.*
|
||||
|
||||
[FILTER]
|
||||
Name record_modifier
|
||||
Match *
|
||||
Record hostname ${HOSTNAME}
|
||||
Record cluster production
|
||||
|
||||
[OUTPUT]
|
||||
Name es
|
||||
Match *
|
||||
Host elasticsearch
|
||||
Port 9200
|
||||
Index app-logs
|
||||
Type _doc
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Query Patterns
|
||||
|
||||
### Find Errors in Time Range
|
||||
**Elasticsearch**:
|
||||
```json
|
||||
GET /app-logs-*/_search
|
||||
{
|
||||
"query": {
|
||||
"bool": {
|
||||
"must": [
|
||||
{ "match": { "level": "error" } },
|
||||
{ "range": { "@timestamp": {
|
||||
"gte": "now-1h",
|
||||
"lte": "now"
|
||||
}}}
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Loki (LogQL)**:
|
||||
```logql
|
||||
{job="app", level="error"} |= "error"
|
||||
```
|
||||
|
||||
**CloudWatch Insights**:
|
||||
```
|
||||
fields @timestamp, @message
|
||||
| filter level = "error"
|
||||
| filter @timestamp > ago(1h)
|
||||
```
|
||||
|
||||
### Count Errors by Type
|
||||
**Elasticsearch**:
|
||||
```json
|
||||
GET /app-logs-*/_search
|
||||
{
|
||||
"size": 0,
|
||||
"query": { "match": { "level": "error" } },
|
||||
"aggs": {
|
||||
"error_types": {
|
||||
"terms": { "field": "error_type.keyword" }
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Loki**:
|
||||
```logql
|
||||
sum by (error_type) (count_over_time({job="app", level="error"} | json [1h]))
|
||||
```
|
||||
|
||||
### Find Slow Requests
|
||||
**Elasticsearch**:
|
||||
```json
|
||||
GET /app-logs-*/_search
|
||||
{
|
||||
"query": {
|
||||
"range": { "duration_ms": { "gte": 1000 } }
|
||||
},
|
||||
"sort": [ { "duration_ms": "desc" } ]
|
||||
}
|
||||
```
|
||||
|
||||
### Trace Request Through Services
|
||||
**Elasticsearch** (using request_id):
|
||||
```json
|
||||
GET /_search
|
||||
{
|
||||
"query": {
|
||||
"match": { "request_id": "550e8400-e29b-41d4-a716-446655440000" }
|
||||
},
|
||||
"sort": [ { "@timestamp": "asc" } ]
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Sampling and Rate Limiting
|
||||
|
||||
### When to Sample
|
||||
- **High volume services**: > 10,000 logs/second
|
||||
- **Debug logs in production**: Sample 1-10%
|
||||
- **Cost optimization**: Reduce storage costs
|
||||
|
||||
### Sampling Strategies
|
||||
|
||||
**1. Random Sampling**:
|
||||
```python
|
||||
import random
|
||||
|
||||
if random.random() < 0.1: # Sample 10%
|
||||
logger.debug("Debug message", ...)
|
||||
```
|
||||
|
||||
**2. Rate Limiting**:
|
||||
```python
|
||||
from rate_limiter import RateLimiter
|
||||
|
||||
limiter = RateLimiter(max_per_second=100)
|
||||
|
||||
if limiter.allow():
|
||||
logger.info("Rate limited log", ...)
|
||||
```
|
||||
|
||||
**3. Error-Biased Sampling**:
|
||||
```python
|
||||
# Always log errors, sample successful requests
|
||||
if level == "error" or random.random() < 0.01:
|
||||
logger.log(level, message, ...)
|
||||
```
|
||||
|
||||
**4. Head-Based Sampling** (trace-aware):
|
||||
```python
|
||||
# If trace is sampled, log all related logs
|
||||
if trace_context.is_sampled():
|
||||
logger.info("Traced log", trace_id=trace_context.trace_id)
|
||||
```
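The `rate_limiter` import in strategy 2 is illustrative rather than a specific library. A self-contained sketch combining strategies 1–3 using only the standard library: warnings and errors always pass, everything else is randomly sampled and then capped by a simple token bucket (the thresholds are assumptions):

```python
import logging
import random
import time

class SampledFilter(logging.Filter):
    """Always pass WARNING and above; sample and rate-limit everything else."""

    def __init__(self, max_per_second=100, sample_rate=0.1):
        super().__init__()
        self.capacity = max_per_second
        self.tokens = float(max_per_second)
        self.sample_rate = sample_rate
        self.last_refill = time.monotonic()

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True                      # error-biased: never drop problems
        if random.random() > self.sample_rate:
            return False                     # random sampling
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.capacity)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1                 # rate limit: one token per record
            return True
        return False

logger = logging.getLogger("app")
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())
logger.addFilter(SampledFilter(max_per_second=100, sample_rate=0.1))
```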
|
||||
|
||||
---
|
||||
|
||||
## Log Retention
|
||||
|
||||
### Retention Strategy
|
||||
|
||||
**Hot tier** (fast SSD): 7-30 days
|
||||
- Recent logs
|
||||
- Full query performance
|
||||
- High cost
|
||||
|
||||
**Warm tier** (regular disk): 30-90 days
|
||||
- Older logs
|
||||
- Slower queries acceptable
|
||||
- Medium cost
|
||||
|
||||
**Cold tier** (object storage): 90+ days
|
||||
- Archive logs
|
||||
- Query via restore
|
||||
- Low cost
|
||||
|
||||
### Example: Elasticsearch ILM Policy
|
||||
```json
|
||||
{
|
||||
"policy": {
|
||||
"phases": {
|
||||
"hot": {
|
||||
"actions": {
|
||||
"rollover": {
|
||||
"max_size": "50GB",
|
||||
"max_age": "1d"
|
||||
}
|
||||
}
|
||||
},
|
||||
"warm": {
|
||||
"min_age": "7d",
|
||||
"actions": {
|
||||
"allocate": { "number_of_replicas": 1 },
|
||||
"shrink": { "number_of_shards": 1 }
|
||||
}
|
||||
},
|
||||
"cold": {
|
||||
"min_age": "30d",
|
||||
"actions": {
|
||||
"allocate": { "require": { "box_type": "cold" } }
|
||||
}
|
||||
},
|
||||
"delete": {
|
||||
"min_age": "90d",
|
||||
"actions": {
|
||||
"delete": {}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Security and Compliance
|
||||
|
||||
### PII Redaction
|
||||
|
||||
**Before logging**:
|
||||
```python
|
||||
import re
|
||||
|
||||
def redact_pii(data):
|
||||
# Redact email
|
||||
data = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b',
|
||||
'[EMAIL]', data)
|
||||
# Redact credit card
|
||||
data = re.sub(r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',
|
||||
'[CARD]', data)
|
||||
# Redact SSN
|
||||
data = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', data)
|
||||
return data
|
||||
|
||||
logger.info("User data", user_input=redact_pii(user_input))
|
||||
```
|
||||
|
||||
**In Logstash**:
|
||||
```
|
||||
filter {
|
||||
mutate {
|
||||
gsub => [
|
||||
"message", "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b", "[EMAIL]",
|
||||
"message", "\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b", "[CARD]"
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Access Control
|
||||
|
||||
**Elasticsearch** (with Security):
|
||||
```yaml
|
||||
# Role for developers
|
||||
dev_logs:
|
||||
indices:
|
||||
- names: ['app-logs-*']
|
||||
privileges: ['read']
|
||||
query: '{"match": {"environment": "development"}}'
|
||||
```
|
||||
|
||||
**CloudWatch** (IAM Policy):
|
||||
```json
|
||||
{
|
||||
"Effect": "Allow",
|
||||
"Action": [
|
||||
"logs:DescribeLogGroups",
|
||||
"logs:GetLogEvents",
|
||||
"logs:FilterLogEvents"
|
||||
],
|
||||
"Resource": "arn:aws:logs:*:*:log-group:/aws/app/production:*"
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
### 1. Logging Sensitive Data
|
||||
❌ `logger.info("Login", password=password)`
|
||||
✅ `logger.info("Login", user_id=user_id)`
|
||||
|
||||
### 2. Excessive Logging
|
||||
❌ Logging every iteration of a loop
|
||||
✅ Log aggregate results or sample
|
||||
|
||||
### 3. Not Including Context
|
||||
❌ `logger.error("Failed")`
|
||||
✅ `logger.error("Payment failed", order_id=order_id, error=str(e))`
|
||||
|
||||
### 4. Inconsistent Formats
|
||||
❌ Mix of JSON and plain text
|
||||
✅ Pick one format and stick to it
|
||||
|
||||
### 5. No Request IDs
|
||||
❌ Can't trace request across services
|
||||
✅ Generate and propagate request_id
|
||||
|
||||
### 6. Logging to Multiple Places
|
||||
❌ Log to file AND stdout AND syslog
|
||||
✅ Log to stdout, let agent handle routing
|
||||
|
||||
### 7. Blocking on Log Writes
|
||||
❌ Synchronous writes to remote systems
|
||||
✅ Asynchronous buffered writes
|
||||
|
||||
---
|
||||
|
||||
## Performance Optimization
|
||||
|
||||
### 1. Async Logging
|
||||
```python
|
||||
import logging
|
||||
from logging.handlers import QueueHandler, QueueListener
|
||||
import queue
|
||||
|
||||
# Create queue and a concrete handler for the background thread to use
log_queue = queue.Queue()
stream_handler = logging.StreamHandler()

# Route application logging through the queue (non-blocking for the caller)
logger = logging.getLogger("app")
logger.addHandler(QueueHandler(log_queue))

# Process queued records in a background thread
listener = QueueListener(log_queue, stream_handler)
listener.start()
```
|
||||
|
||||
### 2. Conditional Logging
|
||||
```python
|
||||
# Avoid expensive operations if not logging
|
||||
if logger.isEnabledFor(logging.DEBUG):
|
||||
logger.debug("Details", data=expensive_serialization(obj))
|
||||
```
|
||||
|
||||
### 3. Batching
|
||||
```python
|
||||
# Batch logs before sending
|
||||
batch = []
|
||||
for log in logs:
|
||||
batch.append(log)
|
||||
if len(batch) >= 100:
|
||||
send_to_aggregator(batch)
|
||||
batch = []
|
||||
```
|
||||
|
||||
### 4. Compression
|
||||
```yaml
|
||||
# Filebeat with compression
|
||||
output.logstash:
|
||||
hosts: ["logstash:5044"]
|
||||
compression_level: 3
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Monitoring Log Pipeline
|
||||
|
||||
Track pipeline health with metrics:
|
||||
|
||||
```promql
|
||||
# Log ingestion rate
|
||||
rate(logs_ingested_total[5m])
|
||||
|
||||
# Pipeline lag
|
||||
log_processing_lag_seconds
|
||||
|
||||
# Dropped logs
|
||||
rate(logs_dropped_total[5m])
|
||||
|
||||
# Error parsing rate
|
||||
rate(logs_parse_errors_total[5m])
|
||||
```
|
||||
|
||||
Alert on:
|
||||
- Sudden drop in log volume (service down?)
|
||||
- High parse error rate (format changed?)
|
||||
- Pipeline lag > 1 minute (capacity issue?)
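The metric names above are generic placeholders; whatever ships your logs has to expose them. A minimal sketch for a custom Python shipper using `prometheus_client`, with the counter names mirroring the queries above (the JSON-line format and buffer limit are assumptions):

```python
import json
import time
from prometheus_client import Counter, Gauge, start_http_server

LOGS_INGESTED = Counter("logs_ingested_total", "Log lines read from sources")
LOGS_DROPPED = Counter("logs_dropped_total", "Log lines dropped before delivery")
PARSE_ERRORS = Counter("logs_parse_errors_total", "Log lines that failed to parse")
PIPELINE_LAG = Gauge("log_processing_lag_seconds", "Age of the newest processed log line")

def ship(line: str, produced_at: float, sink: list) -> None:
    """Parse one JSON log line, hand it to the sink, and update pipeline metrics."""
    LOGS_INGESTED.inc()
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        PARSE_ERRORS.inc()
        return
    if len(sink) >= 10_000:          # crude back-pressure: drop when the buffer is full
        LOGS_DROPPED.inc()
        return
    sink.append(record)
    PIPELINE_LAG.set(time.time() - produced_at)

start_http_server(9100)              # scrape target for the queries above
buffer: list = []
ship('{"level": "info", "message": "hello"}', time.time(), buffer)
```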
|
||||
406
references/metrics_design.md
Normal file
@@ -0,0 +1,406 @@
|
||||
# Metrics Design Guide
|
||||
|
||||
## The Four Golden Signals
|
||||
|
||||
The Four Golden Signals from Google's SRE book provide a comprehensive view of system health:
|
||||
|
||||
### 1. Latency
|
||||
**What**: Time to service a request
|
||||
|
||||
**Why Monitor**: Directly impacts user experience
|
||||
|
||||
**Key Metrics**:
|
||||
- Request duration (p50, p95, p99, p99.9)
|
||||
- Time to first byte (TTFB)
|
||||
- Backend processing time
|
||||
- Database query latency
|
||||
|
||||
**PromQL Examples**:
|
||||
```promql
|
||||
# P95 latency
|
||||
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
|
||||
# Average latency by endpoint
|
||||
avg(rate(http_request_duration_seconds_sum[5m])) by (endpoint)
|
||||
/
|
||||
avg(rate(http_request_duration_seconds_count[5m])) by (endpoint)
|
||||
```
|
||||
|
||||
**Alert Thresholds**:
|
||||
- Warning: p95 > 500ms
|
||||
- Critical: p99 > 2s
|
||||
|
||||
### 2. Traffic
|
||||
**What**: Demand on your system
|
||||
|
||||
**Why Monitor**: Understand load patterns, capacity planning
|
||||
|
||||
**Key Metrics**:
|
||||
- Requests per second (RPS)
|
||||
- Transactions per second (TPS)
|
||||
- Concurrent connections
|
||||
- Network throughput
|
||||
|
||||
**PromQL Examples**:
|
||||
```promql
|
||||
# Requests per second
|
||||
sum(rate(http_requests_total[5m]))
|
||||
|
||||
# Requests per second by status code
|
||||
sum(rate(http_requests_total[5m])) by (status)
|
||||
|
||||
# Traffic growth rate (week over week)
|
||||
sum(rate(http_requests_total[5m]))
|
||||
/
|
||||
sum(rate(http_requests_total[5m] offset 7d))
|
||||
```
|
||||
|
||||
**Alert Thresholds**:
|
||||
- Warning: RPS > 80% of capacity
|
||||
- Critical: RPS > 95% of capacity
|
||||
|
||||
### 3. Errors
|
||||
**What**: Rate of requests that fail
|
||||
|
||||
**Why Monitor**: Direct indicator of user-facing problems
|
||||
|
||||
**Key Metrics**:
|
||||
- Error rate (%)
|
||||
- 5xx response codes
|
||||
- Failed transactions
|
||||
- Exception counts
|
||||
|
||||
**PromQL Examples**:
|
||||
```promql
|
||||
# Error rate percentage
|
||||
sum(rate(http_requests_total{status=~"5.."}[5m]))
|
||||
/
|
||||
sum(rate(http_requests_total[5m])) * 100
|
||||
|
||||
# Error count by type
|
||||
sum(rate(http_requests_total{status=~"5.."}[5m])) by (status)
|
||||
|
||||
# Application errors
|
||||
rate(application_errors_total[5m])
|
||||
```
|
||||
|
||||
**Alert Thresholds**:
|
||||
- Warning: Error rate > 1%
|
||||
- Critical: Error rate > 5%
|
||||
|
||||
### 4. Saturation
|
||||
**What**: How "full" your service is
|
||||
|
||||
**Why Monitor**: Predict capacity issues before they impact users
|
||||
|
||||
**Key Metrics**:
|
||||
- CPU utilization
|
||||
- Memory utilization
|
||||
- Disk I/O
|
||||
- Network bandwidth
|
||||
- Queue depth
|
||||
- Thread pool usage
|
||||
|
||||
**PromQL Examples**:
|
||||
```promql
|
||||
# CPU saturation
|
||||
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
|
||||
|
||||
# Memory saturation
|
||||
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
|
||||
|
||||
# Disk saturation
|
||||
rate(node_disk_io_time_seconds_total[5m]) * 100
|
||||
|
||||
# Queue depth
|
||||
queue_depth_current / queue_depth_max * 100
|
||||
```
|
||||
|
||||
**Alert Thresholds**:
|
||||
- Warning: > 70% utilization
|
||||
- Critical: > 90% utilization
|
||||
|
||||
---
|
||||
|
||||
## RED Method (for Services)
|
||||
|
||||
**R**ate, **E**rrors, **D**uration - a simplified approach for request-driven services
|
||||
|
||||
### Rate
|
||||
Number of requests per second:
|
||||
```promql
|
||||
sum(rate(http_requests_total[5m]))
|
||||
```
|
||||
|
||||
### Errors
|
||||
Number of failed requests per second:
|
||||
```promql
|
||||
sum(rate(http_requests_total{status=~"5.."}[5m]))
|
||||
```
|
||||
|
||||
### Duration
|
||||
Time taken to process requests:
|
||||
```promql
|
||||
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
```
|
||||
|
||||
**When to Use**: Microservices, APIs, web applications
|
||||
|
||||
---
|
||||
|
||||
## USE Method (for Resources)
|
||||
|
||||
**U**tilization, **S**aturation, **E**rrors - for infrastructure resources
|
||||
|
||||
### Utilization
|
||||
Percentage of time resource is busy:
|
||||
```promql
|
||||
# CPU utilization
|
||||
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
|
||||
|
||||
# Disk utilization
|
||||
(node_filesystem_size_bytes - node_filesystem_avail_bytes)
|
||||
/ node_filesystem_size_bytes * 100
|
||||
```
|
||||
|
||||
### Saturation
|
||||
Amount of work the resource cannot service (queued):
|
||||
```promql
|
||||
# Load average (saturation indicator)
|
||||
node_load15
|
||||
|
||||
# Disk I/O wait time
|
||||
rate(node_disk_io_time_weighted_seconds_total[5m])
|
||||
```
|
||||
|
||||
### Errors
|
||||
Count of error events:
|
||||
```promql
|
||||
# Network errors
|
||||
rate(node_network_receive_errs_total[5m])
|
||||
rate(node_network_transmit_errs_total[5m])
|
||||
|
||||
# Disk errors
|
||||
rate(node_disk_io_errors_total[5m])
|
||||
```
|
||||
|
||||
**When to Use**: Servers, databases, network devices
|
||||
|
||||
---
|
||||
|
||||
## Metric Types
|
||||
|
||||
### Counter
|
||||
Monotonically increasing value (never decreases)
|
||||
|
||||
**Examples**: Request count, error count, bytes sent
|
||||
|
||||
**Usage**:
|
||||
```promql
|
||||
# Always use rate() or increase() with counters
|
||||
rate(http_requests_total[5m]) # Requests per second
|
||||
increase(http_requests_total[1h]) # Total requests in 1 hour
|
||||
```
|
||||
|
||||
### Gauge
|
||||
Value that can go up and down
|
||||
|
||||
**Examples**: Memory usage, queue depth, concurrent connections
|
||||
|
||||
**Usage**:
|
||||
```promql
|
||||
# Use directly or with aggregations
|
||||
avg(memory_usage_bytes)
|
||||
max(queue_depth)
|
||||
```
|
||||
|
||||
### Histogram
|
||||
Samples observations and counts them in configurable buckets
|
||||
|
||||
**Examples**: Request duration, response size
|
||||
|
||||
**Usage**:
|
||||
```promql
|
||||
# Calculate percentiles
|
||||
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
|
||||
|
||||
# Average from histogram
|
||||
rate(http_request_duration_seconds_sum[5m])
|
||||
/
|
||||
rate(http_request_duration_seconds_count[5m])
|
||||
```
|
||||
|
||||
### Summary
|
||||
Similar to histogram but calculates quantiles on client side
|
||||
|
||||
**Usage**: Less flexible than histograms; prefer histograms for new metrics
|
||||
|
||||
---
|
||||
|
||||
## Cardinality Best Practices
|
||||
|
||||
**Cardinality**: Number of unique time series
|
||||
|
||||
### High Cardinality Labels (AVOID)
|
||||
❌ User ID
|
||||
❌ Email address
|
||||
❌ IP address
|
||||
❌ Timestamp
|
||||
❌ Random IDs
|
||||
|
||||
### Low Cardinality Labels (GOOD)
|
||||
✅ Environment (prod, staging)
|
||||
✅ Region (us-east-1, eu-west-1)
|
||||
✅ Service name
|
||||
✅ HTTP status code category (2xx, 4xx, 5xx)
|
||||
✅ Endpoint/route
|
||||
|
||||
### Calculating Cardinality Impact
|
||||
```
|
||||
Time series = unique combinations of labels
|
||||
|
||||
Example:
|
||||
service (5) × environment (3) × region (4) × status (5) = 300 time series ✅
|
||||
|
||||
service (5) × environment (3) × region (4) × user_id (1M) = 60M time series ❌
|
||||
```
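A quick way to sanity-check a proposed label set before shipping it, as a rough sketch (the per-label counts are the assumptions you plug in):

```python
from math import prod

def estimated_series(label_cardinalities: dict[str, int]) -> int:
    """Worst-case series count for one metric: the product of label cardinalities."""
    return prod(label_cardinalities.values()) if label_cardinalities else 1

ok = estimated_series({"service": 5, "environment": 3, "region": 4, "status": 5})
bad = estimated_series({"service": 5, "environment": 3, "region": 4, "user_id": 1_000_000})
print(ok)    # 300        -- fine
print(bad)   # 60000000   -- will overwhelm the TSDB
```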
|
||||
|
||||
---
|
||||
|
||||
## Naming Conventions
|
||||
|
||||
### Prometheus Naming
|
||||
```
|
||||
<namespace>_<name>_<unit>_total
|
||||
|
||||
Examples:
|
||||
http_requests_total
|
||||
http_request_duration_seconds
|
||||
process_cpu_seconds_total
|
||||
node_memory_MemAvailable_bytes
|
||||
```
|
||||
|
||||
**Rules**:
|
||||
- Use snake_case
|
||||
- Include unit in name (seconds, bytes, ratio)
|
||||
- Use `_total` suffix for counters
|
||||
- Namespace by application/component
|
||||
|
||||
### CloudWatch Naming
|
||||
```
|
||||
<Namespace>/<MetricName>
|
||||
|
||||
Examples:
|
||||
AWS/EC2/CPUUtilization
|
||||
MyApp/RequestCount
|
||||
```
|
||||
|
||||
**Rules**:
|
||||
- Use PascalCase
|
||||
- Group by namespace
|
||||
- No unit in name (specified separately)
|
||||
|
||||
---
|
||||
|
||||
## Dashboard Design
|
||||
|
||||
### Key Principles
|
||||
|
||||
1. **Top-Down Layout**: Most important metrics first
|
||||
2. **Color Coding**: Red (critical), yellow (warning), green (healthy)
|
||||
3. **Consistent Time Windows**: All panels use same time range
|
||||
4. **Limit Panels**: 8-12 panels per dashboard maximum
|
||||
5. **Include Context**: Show related metrics together
|
||||
|
||||
### Dashboard Structure
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────┐
|
||||
│ Overall Health (Single Stats) │
|
||||
│ [Requests/s] [Error%] [P95 Latency] │
|
||||
└─────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────┐
|
||||
│ Request Rate & Errors (Graphs) │
|
||||
└─────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────┐
|
||||
│ Latency Distribution (Graphs) │
|
||||
└─────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────┐
|
||||
│ Resource Usage (Graphs) │
|
||||
└─────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────┐
|
||||
│ Dependencies (Graphs) │
|
||||
└─────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
### Template Variables
|
||||
Use variables for filtering:
|
||||
- Environment: `$environment`
|
||||
- Service: `$service`
|
||||
- Region: `$region`
|
||||
- Pod: `$pod`
|
||||
|
||||
---
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
### 1. Monitoring What You Build, Not What Users Experience
|
||||
❌ `backend_processing_complete`
|
||||
✅ `user_request_completed`
|
||||
|
||||
### 2. Too Many Metrics
|
||||
- Start with Four Golden Signals
|
||||
- Add metrics only when needed for specific issues
|
||||
- Remove unused metrics
|
||||
|
||||
### 3. Incorrect Aggregations
|
||||
❌ `avg(rate(http_request_duration_seconds_sum[5m]))` - averaging per-instance rates hides unevenly loaded instances
✅ `sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))` - correct overall average latency
|
||||
|
||||
### 4. Wrong Time Windows
|
||||
- Too short (< 1m): Noisy data
|
||||
- Too long (> 15m): Miss short-lived issues
|
||||
- Sweet spot: 5m for most alerts
|
||||
|
||||
### 5. Missing Labels
|
||||
❌ `http_requests_total`
|
||||
✅ `http_requests_total{method="GET", status="200", endpoint="/api/users"}`
|
||||
|
||||
---
|
||||
|
||||
## Metric Collection Best Practices
|
||||
|
||||
### Application Instrumentation
|
||||
```python
|
||||
from prometheus_client import Counter, Histogram, Gauge
|
||||
|
||||
# Counter for requests
|
||||
requests_total = Counter('http_requests_total',
|
||||
'Total HTTP requests',
|
||||
['method', 'endpoint', 'status'])
|
||||
|
||||
# Histogram for latency
|
||||
request_duration = Histogram('http_request_duration_seconds',
|
||||
'HTTP request duration',
|
||||
['method', 'endpoint'])
|
||||
|
||||
# Gauge for in-progress requests
|
||||
requests_in_progress = Gauge('http_requests_in_progress',
|
||||
'HTTP requests currently being processed')
|
||||
```
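A short follow-up sketch showing how those three metrics get updated inside a request handler; it continues the snippet above, and the handler body is a placeholder to adapt to your framework:

```python
import time

def handle_request(method: str, endpoint: str):
    requests_in_progress.inc()
    start = time.perf_counter()
    status = "200"
    try:
        ...  # actual request handling goes here
    except Exception:
        status = "500"
        raise
    finally:
        # Record duration and outcome even when the handler raises
        request_duration.labels(method=method, endpoint=endpoint).observe(
            time.perf_counter() - start
        )
        requests_total.labels(method=method, endpoint=endpoint, status=status).inc()
        requests_in_progress.dec()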
|
||||
|
||||
### Collection Intervals
|
||||
- Application metrics: 15-30s
|
||||
- Infrastructure metrics: 30-60s
|
||||
- Billing/cost metrics: 5-15m
|
||||
- External API checks: 1-5m
|
||||
|
||||
### Retention
|
||||
- Raw metrics: 15-30 days
|
||||
- 5m aggregates: 90 days
|
||||
- 1h aggregates: 1 year
|
||||
- Daily aggregates: 2+ years
|
||||
652
references/slo_sla_guide.md
Normal file
@@ -0,0 +1,652 @@
|
||||
# SLI, SLO, and SLA Guide
|
||||
|
||||
## Definitions
|
||||
|
||||
### SLI (Service Level Indicator)
|
||||
**What**: A quantitative measure of service quality
|
||||
|
||||
**Examples**:
|
||||
- Request latency (ms)
|
||||
- Error rate (%)
|
||||
- Availability (%)
|
||||
- Throughput (requests/sec)
|
||||
|
||||
### SLO (Service Level Objective)
|
||||
**What**: Target value or range for an SLI
|
||||
|
||||
**Examples**:
|
||||
- "99.9% of requests return in < 500ms"
|
||||
- "99.95% availability"
|
||||
- "Error rate < 0.1%"
|
||||
|
||||
### SLA (Service Level Agreement)
|
||||
**What**: Business contract with consequences for SLO violations
|
||||
|
||||
**Examples**:
|
||||
- "99.9% uptime or 10% monthly credit"
|
||||
- "p95 latency < 1s or refund"
|
||||
|
||||
### Relationship
|
||||
```
|
||||
SLI = Measurement
|
||||
SLO = Target (internal goal)
|
||||
SLA = Promise (customer contract with penalties)
|
||||
|
||||
Example:
|
||||
SLI: Actual availability this month = 99.92%
|
||||
SLO: Target availability = 99.9%
|
||||
SLA: Guaranteed availability = 99.5% (with penalties)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Choosing SLIs
|
||||
|
||||
### The Four Golden Signals as SLIs
|
||||
|
||||
1. **Latency SLIs**
|
||||
- Request duration (p50, p95, p99)
|
||||
- Time to first byte
|
||||
- Page load time
|
||||
|
||||
2. **Availability/Success SLIs**
|
||||
- % of successful requests
|
||||
- % uptime
|
||||
- % of requests completing
|
||||
|
||||
3. **Throughput SLIs** (less common)
|
||||
- Requests per second
|
||||
- Transactions per second
|
||||
|
||||
4. **Saturation SLIs** (internal only)
|
||||
- Resource utilization
|
||||
- Queue depth
|
||||
|
||||
### SLI Selection Criteria
|
||||
|
||||
✅ **Good SLIs**:
|
||||
- Measured from user perspective
|
||||
- Directly impact user experience
|
||||
- Aggregatable across instances
|
||||
- Proportional to user happiness
|
||||
|
||||
❌ **Bad SLIs**:
|
||||
- Internal metrics only
|
||||
- Not user-facing
|
||||
- Hard to measure consistently
|
||||
|
||||
### Examples by Service Type
|
||||
|
||||
**Web Application**:
|
||||
```
|
||||
SLI 1: Request Success Rate
|
||||
= successful_requests / total_requests
|
||||
|
||||
SLI 2: Request Latency (p95)
|
||||
= 95th percentile of response times
|
||||
|
||||
SLI 3: Availability
|
||||
= time_service_responding / total_time
|
||||
```
|
||||
|
||||
**API Service**:
|
||||
```
|
||||
SLI 1: Error Rate
|
||||
= (4xx_errors + 5xx_errors) / total_requests
|
||||
|
||||
SLI 2: Response Time (p99)
|
||||
= 99th percentile latency
|
||||
|
||||
SLI 3: Throughput
|
||||
= requests_per_second
|
||||
```
|
||||
|
||||
**Batch Processing**:
|
||||
```
|
||||
SLI 1: Job Success Rate
|
||||
= successful_jobs / total_jobs
|
||||
|
||||
SLI 2: Processing Latency
|
||||
= time_from_submission_to_completion
|
||||
|
||||
SLI 3: Freshness
|
||||
= age_of_oldest_unprocessed_item
|
||||
```
|
||||
|
||||
**Storage Service**:
|
||||
```
|
||||
SLI 1: Durability
|
||||
= data_not_lost / total_data
|
||||
|
||||
SLI 2: Read Latency (p99)
|
||||
= 99th percentile read time
|
||||
|
||||
SLI 3: Write Success Rate
|
||||
= successful_writes / total_writes
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Setting SLO Targets
|
||||
|
||||
### Start with Current Performance
|
||||
|
||||
1. **Measure baseline**: Collect 30 days of data
|
||||
2. **Analyze distribution**: Look at p50, p95, p99, p99.9
|
||||
3. **Set initial SLO**: Slightly looser than current performance, so it is consistently achievable
|
||||
4. **Iterate**: Tighten or loosen based on feasibility
|
||||
|
||||
### Example Process
|
||||
|
||||
**Current Performance** (30 days):
|
||||
```
|
||||
p50 latency: 120ms
|
||||
p95 latency: 450ms
|
||||
p99 latency: 1200ms
|
||||
p99.9 latency: 3500ms
|
||||
|
||||
Error rate: 0.05%
|
||||
Availability: 99.95%
|
||||
```
|
||||
|
||||
**Initial SLOs**:
|
||||
```
|
||||
Latency: p95 < 500ms (slightly worse than current p95)
|
||||
Error rate: < 0.1% (double current rate)
|
||||
Availability: 99.9% (slightly worse than current)
|
||||
```
|
||||
|
||||
**Rationale**: Start loose, prevent false alarms, tighten over time
|
||||
|
||||
### Common SLO Targets
|
||||
|
||||
**Availability**:
|
||||
- **99%** (3.65 days downtime/year): Internal tools
|
||||
- **99.5%** (1.83 days/year): Non-critical services
|
||||
- **99.9%** (8.76 hours/year): Standard production
|
||||
- **99.95%** (4.38 hours/year): Critical services
|
||||
- **99.99%** (52 minutes/year): High availability
|
||||
- **99.999%** (5 minutes/year): Mission critical
|
||||
|
||||
**Latency**:
|
||||
- **p50 < 100ms**: Excellent responsiveness
|
||||
- **p95 < 500ms**: Standard web applications
|
||||
- **p99 < 1s**: Acceptable for most users
|
||||
- **p99.9 < 5s**: Acceptable for rare edge cases
|
||||
|
||||
**Error Rate**:
|
||||
- **< 0.01%** (99.99% success): Critical operations
|
||||
- **< 0.1%** (99.9% success): Standard production
|
||||
- **< 1%** (99% success): Non-critical services
|
||||
|
||||
---
|
||||
|
||||
## Error Budgets
|
||||
|
||||
### Concept
|
||||
|
||||
Error budget = (100% - SLO target)
|
||||
|
||||
If SLO is 99.9%, error budget is 0.1%
|
||||
|
||||
**Purpose**: Balance reliability with feature velocity
|
||||
|
||||
### Calculation
|
||||
|
||||
**For availability**:
|
||||
```
|
||||
Monthly error budget = (1 - SLO) × time_period
|
||||
|
||||
Example (99.9% SLO, 30 days):
|
||||
Error budget = 0.001 × 30 days = 0.03 days = 43.2 minutes
|
||||
```
|
||||
|
||||
**For request-based SLIs**:
|
||||
```
|
||||
Error budget = (1 - SLO) × total_requests
|
||||
|
||||
Example (99.9% SLO, 10M requests/month):
|
||||
Error budget = 0.001 × 10,000,000 = 10,000 failed requests
|
||||
```
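The same arithmetic as a small helper, handy for reports and runbooks (a sketch; the inputs are whatever your SLO defines):

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Allowed downtime for a time-based SLO over the period."""
    return (1 - slo) * period_days * 24 * 60

def error_budget_requests(slo: float, total_requests: int) -> int:
    """Allowed failed requests for a request-based SLO."""
    return int((1 - slo) * total_requests)

print(error_budget_minutes(0.999))                 # 43.2 minutes per 30 days
print(error_budget_requests(0.999, 10_000_000))    # 10000 failed requests
```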
|
||||
|
||||
### Error Budget Consumption
|
||||
|
||||
**Formula**:
|
||||
```
|
||||
Budget consumed = actual_errors / allowed_errors × 100%
|
||||
|
||||
Example:
|
||||
SLO: 99.9% (0.1% error budget)
|
||||
Total requests: 1,000,000
|
||||
Failed requests: 500
|
||||
Allowed failures: 1,000
|
||||
|
||||
Budget consumed = 500 / 1,000 × 100% = 50%
|
||||
Budget remaining = 50%
|
||||
```
|
||||
|
||||
### Error Budget Policy
|
||||
|
||||
**Example policy**:
|
||||
|
||||
```markdown
|
||||
## Error Budget Policy
|
||||
|
||||
### If error budget > 50%
|
||||
- Deploy frequently (multiple times per day)
|
||||
- Take calculated risks
|
||||
- Experiment with new features
|
||||
- Acceptable to have some incidents
|
||||
|
||||
### If error budget 20-50%
|
||||
- Deploy normally (once per day)
|
||||
- Increase testing
|
||||
- Review recent changes
|
||||
- Monitor closely
|
||||
|
||||
### If error budget < 20%
|
||||
- Freeze non-critical deploys
|
||||
- Focus on reliability improvements
|
||||
- Postmortem all incidents
|
||||
- Reduce change velocity
|
||||
|
||||
### If error budget exhausted (< 0%)
|
||||
- Complete deploy freeze except rollbacks
|
||||
- All hands on reliability
|
||||
- Mandatory postmortems
|
||||
- Executive escalation
|
||||
```
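If you automate this (for example, to gate a deploy pipeline), the policy reduces to a small lookup. A sketch using the thresholds from the example policy above:

```python
def deploy_policy(budget_remaining_pct: float) -> str:
    """Map remaining error budget to the example policy's deploy posture."""
    if budget_remaining_pct < 0:
        return "freeze: rollbacks only, all hands on reliability"
    if budget_remaining_pct < 20:
        return "freeze non-critical deploys, focus on reliability"
    if budget_remaining_pct < 50:
        return "deploy normally, increase testing and monitoring"
    return "deploy freely, take calculated risks"

print(deploy_policy(73))   # deploy freely, take calculated risks
```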
|
||||
|
||||
---
|
||||
|
||||
## Error Budget Burn Rate
|
||||
|
||||
### Concept
|
||||
|
||||
Burn rate = rate of error budget consumption
|
||||
|
||||
**Example**:
|
||||
- Monthly budget: 43.2 minutes (99.9% SLO)
|
||||
- If consuming at 2x rate: Budget exhausted in 15 days
|
||||
- If consuming at 10x rate: Budget exhausted in 3 days
|
||||
|
||||
### Burn Rate Calculation
|
||||
|
||||
```
|
||||
Burn rate = (actual_error_rate / allowed_error_rate)
|
||||
|
||||
Example:
|
||||
SLO: 99.9% (0.1% allowed error rate)
|
||||
Current error rate: 0.5%
|
||||
|
||||
Burn rate = 0.5% / 0.1% = 5x
|
||||
Time to exhaust = 30 days / 5 = 6 days
|
||||
```
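And the burn-rate math from the example as a helper (a sketch):

```python
def burn_rate(actual_error_rate: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    allowed_error_rate = 1 - slo
    return actual_error_rate / allowed_error_rate

def days_to_exhaustion(rate: float, period_days: int = 30) -> float:
    return period_days / rate

r = burn_rate(actual_error_rate=0.005, slo=0.999)
print(r)                        # 5.0
print(days_to_exhaustion(r))    # 6.0 days, matching the example above
```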
|
||||
|
||||
### Multi-Window Alerting
|
||||
|
||||
Alert on burn rate across multiple time windows:
|
||||
|
||||
**Fast burn** (1 hour window):
|
||||
```
|
||||
Burn rate > 14.4x → Exhausts budget in 2 days
|
||||
Alert after 2 minutes
|
||||
Severity: Critical (page immediately)
|
||||
```
|
||||
|
||||
**Moderate burn** (6 hour window):
|
||||
```
|
||||
Burn rate > 6x → Exhausts budget in 5 days
|
||||
Alert after 30 minutes
|
||||
Severity: Warning (create ticket)
|
||||
```
|
||||
|
||||
**Slow burn** (3 day window):
|
||||
```
|
||||
Burn rate > 1x → Exhausts budget by end of month
|
||||
Alert after 6 hours
|
||||
Severity: Info (monitor)
|
||||
```
|
||||
|
||||
### Implementation
|
||||
|
||||
**Prometheus**:
|
||||
```yaml
|
||||
# Fast burn alert (1h window, 2m grace period)
|
||||
- alert: ErrorBudgetFastBurn
|
||||
expr: |
|
||||
(
|
||||
sum(rate(http_requests_total{status=~"5.."}[1h]))
|
||||
/
|
||||
sum(rate(http_requests_total[1h]))
|
||||
) > (14.4 * 0.001) # 14.4x burn rate for 99.9% SLO
|
||||
for: 2m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "Fast error budget burn detected"
|
||||
description: "Error budget will be exhausted in 2 days at current rate"
|
||||
|
||||
# Moderate burn alert (6h window, 30m grace period)
- alert: ErrorBudgetModerateBurn
|
||||
expr: |
|
||||
(
|
||||
sum(rate(http_requests_total{status=~"5.."}[6h]))
|
||||
/
|
||||
sum(rate(http_requests_total[6h]))
|
||||
) > (6 * 0.001) # 6x burn rate for 99.9% SLO
|
||||
for: 30m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "Elevated error budget burn detected"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## SLO Reporting
|
||||
|
||||
### Dashboard Structure
|
||||
|
||||
**Overall Health**:
|
||||
```
|
||||
┌─────────────────────────────────────────┐
|
||||
│ SLO Compliance: 99.92% ✅ │
|
||||
│ Error Budget Remaining: 73% 🟢 │
|
||||
│ Burn Rate: 0.8x 🟢 │
|
||||
└─────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
**SLI Performance**:
|
||||
```
|
||||
Latency p95: 420ms (Target: 500ms) ✅
|
||||
Error Rate: 0.08% (Target: < 0.1%) ✅
|
||||
Availability: 99.95% (Target: > 99.9%) ✅
|
||||
```
|
||||
|
||||
**Error Budget Trend**:
|
||||
```
|
||||
Graph showing:
|
||||
- Error budget consumption over time
|
||||
- Burn rate spikes
|
||||
- Incidents marked
|
||||
- Deploy events overlaid
|
||||
```
|
||||
|
||||
### Monthly SLO Report
|
||||
|
||||
**Template**:
|
||||
```markdown
|
||||
# SLO Report: October 2024
|
||||
|
||||
## Executive Summary
|
||||
- ✅ All SLOs met this month
|
||||
- 🟡 Latency SLO came close to violation (99.1% compliance)
|
||||
- 3 incidents consumed 47% of error budget
|
||||
- Error budget remaining: 53%
|
||||
|
||||
## SLO Performance
|
||||
|
||||
### Availability SLO: 99.9%
|
||||
- Actual: 99.92%
|
||||
- Status: ✅ Met
|
||||
- Error budget consumed: 53%
|
||||
- Downtime: 23 minutes (allowed: 43.2 minutes)
|
||||
|
||||
### Latency SLO: p95 < 500ms
|
||||
- Actual p95: 445ms
|
||||
- Status: ✅ Met
|
||||
- Compliance: 99.1% (target: 99%)
|
||||
- 0.9% of requests exceeded threshold
|
||||
|
||||
### Error Rate SLO: < 0.1%
|
||||
- Actual: 0.05%
|
||||
- Status: ✅ Met
|
||||
- Error budget consumed: 50%
|
||||
|
||||
## Incidents
|
||||
|
||||
### Incident #1: Database Overload (Oct 5)
|
||||
- Duration: 15 minutes
|
||||
- Error budget consumed: 35%
|
||||
- Root cause: Slow query after schema change
|
||||
- Prevention: Added query review to deploy checklist
|
||||
|
||||
### Incident #2: API Gateway Timeout (Oct 12)
|
||||
- Duration: 5 minutes
|
||||
- Error budget consumed: 10%
|
||||
- Root cause: Configuration error in load balancer
|
||||
- Prevention: Automated configuration validation
|
||||
|
||||
### Incident #3: Upstream Service Degradation (Oct 20)
|
||||
- Duration: 3 minutes
|
||||
- Error budget consumed: 2%
|
||||
- Root cause: Third-party API outage
|
||||
- Prevention: Implemented circuit breaker
|
||||
|
||||
## Recommendations
|
||||
1. Investigate latency near-miss (Oct 15-17)
|
||||
2. Add automated rollback for database changes
|
||||
3. Increase circuit breaker thresholds for third-party APIs
|
||||
4. Consider tightening availability SLO to 99.95%
|
||||
|
||||
## Next Month's Focus
|
||||
- Reduce p95 latency to 400ms
|
||||
- Implement automated canary deployments
|
||||
- Add synthetic monitoring for critical paths
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## SLA Structure
|
||||
|
||||
### Components
|
||||
|
||||
**Service Description**:
|
||||
```
|
||||
The API Service provides RESTful endpoints for user management,
|
||||
authentication, and data retrieval.
|
||||
```
|
||||
|
||||
**Covered Metrics**:
|
||||
```
|
||||
- Availability: Service is reachable and returns valid responses
|
||||
- Latency: Time from request to response
|
||||
- Error Rate: Percentage of requests returning errors
|
||||
```
|
||||
|
||||
**SLA Targets**:
|
||||
```
|
||||
Service commits to:
|
||||
1. 99.9% monthly uptime
|
||||
2. p95 API response time < 1 second
|
||||
3. Error rate < 0.5%
|
||||
```
|
||||
|
||||
**Measurement**:
|
||||
```
|
||||
Metrics calculated from server-side monitoring:
|
||||
- Uptime: Successful health check probes / total probes
|
||||
- Latency: Server-side request duration (p95)
|
||||
- Errors: HTTP 5xx responses / total responses
|
||||
|
||||
Calculated monthly (first of month for previous month).
|
||||
```
|
||||
|
||||
**Exclusions**:
|
||||
```
|
||||
SLA does not cover:
|
||||
- Scheduled maintenance (with 7 days notice)
|
||||
- Client-side network issues
|
||||
- DDoS attacks or force majeure
|
||||
- Beta/preview features
|
||||
- Issues caused by customer misuse
|
||||
```
|
||||
|
||||
**Service Credits**:
|
||||
```
|
||||
Monthly Uptime | Service Credit
|
||||
---------------- | --------------
|
||||
< 99.9% (SLA) | 10%
|
||||
< 99.0% | 25%
|
||||
< 95.0% | 50%
|
||||
```
|
||||
|
||||
**Claiming Credits**:
|
||||
```
|
||||
Customer must:
|
||||
1. Report violation within 30 days
|
||||
2. Provide ticket numbers for support requests
|
||||
3. Credits applied to next month's invoice
|
||||
4. Credits do not exceed monthly fee
|
||||
```
|
||||
|
||||
### Example SLAs by Industry
|
||||
|
||||
**E-commerce**:
|
||||
```
|
||||
- 99.95% availability
|
||||
- p95 page load < 2s
|
||||
- p99 checkout < 5s
|
||||
- Credits: 5% per 0.1% below target
|
||||
```
|
||||
|
||||
**Financial Services**:
|
||||
```
|
||||
- 99.99% availability
|
||||
- p99 transaction < 500ms
|
||||
- Zero data loss
|
||||
- Penalties: $10,000 per hour of downtime
|
||||
```
|
||||
|
||||
**Media/Content**:
|
||||
```
|
||||
- 99.9% availability
|
||||
- p95 video start < 3s
|
||||
- No credit system (best effort latency)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Best Practices
|
||||
|
||||
### 1. SLOs Should Be User-Centric
|
||||
❌ "Database queries < 100ms"
|
||||
✅ "API response time p95 < 500ms"
|
||||
|
||||
### 2. Start Loose, Tighten Over Time
|
||||
- Begin with achievable targets
|
||||
- Build reliability culture
|
||||
- Gradually raise bar
|
||||
|
||||
### 3. Fewer, Better SLOs
|
||||
- 1-3 SLOs per service
|
||||
- Focus on user impact
|
||||
- Avoid SLO sprawl
|
||||
|
||||
### 4. SLAs More Conservative Than SLOs
|
||||
```
|
||||
Internal SLO: 99.95%
|
||||
Customer SLA: 99.9%
|
||||
Margin: 0.05% buffer
|
||||
```
|
||||
|
||||
### 5. Make Error Budgets Actionable
|
||||
- Define policies at different thresholds
|
||||
- Empower teams to make tradeoffs
|
||||
- Review in planning meetings
|
||||
|
||||
### 6. Document Everything
|
||||
- How SLIs are measured
|
||||
- Why targets were chosen
|
||||
- Who owns each SLO
|
||||
- How to interpret metrics
|
||||
|
||||
### 7. Review Regularly
|
||||
- Monthly SLO reviews
|
||||
- Quarterly SLO adjustments
|
||||
- Annual SLA renegotiation

---

## Common Pitfalls

### 1. Too Many SLOs
❌ 20 different SLOs per service
✅ 2-3 critical SLOs

### 2. Unrealistic Targets
❌ 99.999% for a non-critical service
✅ 99.9% with room to improve

### 3. SLOs Without Error Budgets
❌ "Must always be 99.9%"
✅ "Budget for 0.1% errors"

### 4. No Consequences
❌ Missing an SLO has no impact
✅ Deploy freeze when the budget is exhausted

### 5. SLA Equals SLO
❌ Promise exactly what you target
✅ SLA more conservative than SLO

### 6. Ignoring User Experience
❌ "Our servers are up 99.99%"
✅ "Users can complete actions 99.9% of the time"

### 7. Static Targets
❌ Set once, never revisit
✅ Quarterly reviews and adjustments

---

## Tools and Automation

### SLO Tracking Tools

**Prometheus + Grafana**:
- Use recording rules for SLIs
- Alert on burn rates
- Dashboard for compliance

**Google Cloud SLO Monitoring**:
- Built-in SLO tracking
- Automatic error budget calculation
- Integration with alerting

**Datadog SLOs**:
- UI for SLO definition
- Automatic burn rate alerts
- Status pages

**Custom Tools**:
- sloth: Generate Prometheus rules from SLO definitions
- slo-libsonnet: Jsonnet library for SLO monitoring

### Example: Prometheus Recording Rules

```yaml
groups:
  - name: sli_recording
    interval: 30s
    rules:
      # SLI: Request success rate
      - record: sli:request_success:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      # SLI: Request latency (p95)
      - record: sli:request_latency:p95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          )

      # Error budget burn rate, based on the success-ratio SLI above
      # (0.001 = error budget for a 99.9% SLO)
      - record: slo:error_budget_burn_rate:1h
        expr: |
          (1 - sli:request_success:ratio) / 0.001
```
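
To make the burn-rate arithmetic concrete, a small Python sketch; the SLO target and window are illustrative:

```python
# Error budget math for a 99.9% SLO over a 30-day window (illustrative values).
SLO_TARGET = 0.999
WINDOW_MINUTES = 30 * 24 * 60
ERROR_BUDGET = 1 - SLO_TARGET            # 0.1% of requests may fail

def burn_rate(observed_error_ratio: float) -> float:
    """How many times faster than 'exactly on budget' we are burning."""
    return observed_error_ratio / ERROR_BUDGET

def budget_exhausted_in_minutes(observed_error_ratio: float) -> float:
    """Time to exhaust the whole budget at the current error ratio."""
    return WINDOW_MINUTES / burn_rate(observed_error_ratio)

print(burn_rate(0.014))                    # 14x burn: in the fast-burn alerting range
print(budget_exhausted_in_minutes(0.014))  # ~3086 minutes (~2.1 days)
```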
|
||||
697
references/tool_comparison.md
Normal file
697
references/tool_comparison.md
Normal file
@@ -0,0 +1,697 @@
|
||||
# Monitoring Tools Comparison
|
||||
|
||||
## Overview Matrix
|
||||
|
||||
| Tool | Type | Best For | Complexity | Cost | Cloud/Self-Hosted |
|
||||
|------|------|----------|------------|------|-------------------|
|
||||
| **Prometheus** | Metrics | Kubernetes, time-series | Medium | Free | Self-hosted |
|
||||
| **Grafana** | Visualization | Dashboards, multi-source | Low-Medium | Free | Both |
|
||||
| **Datadog** | Full-stack | Ease of use, APM | Low | High | Cloud |
|
||||
| **New Relic** | Full-stack | APM, traces | Low | High | Cloud |
|
||||
| **Elasticsearch (ELK)** | Logs | Log search, analysis | High | Medium | Both |
|
||||
| **Grafana Loki** | Logs | Cost-effective logs | Medium | Free | Both |
|
||||
| **CloudWatch** | AWS-native | AWS infrastructure | Low | Medium | Cloud |
|
||||
| **Jaeger** | Tracing | Distributed tracing | Medium | Free | Self-hosted |
|
||||
| **Grafana Tempo** | Tracing | Cost-effective tracing | Medium | Free | Self-hosted |
|
||||
|
||||
---
|
||||
|
||||
## Metrics Platforms
|
||||
|
||||
### Prometheus
|
||||
|
||||
**Type**: Open-source time-series database
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Industry standard for Kubernetes
|
||||
- ✅ Powerful query language (PromQL)
|
||||
- ✅ Pull-based model (no agent config)
|
||||
- ✅ Service discovery
|
||||
- ✅ Free and open source
|
||||
- ✅ Huge ecosystem (exporters for everything)
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ No built-in dashboards (need Grafana)
|
||||
- ❌ Single-node only (no HA without federation)
|
||||
- ❌ Limited long-term storage (need Thanos/Cortex)
|
||||
- ❌ Steep learning curve for PromQL
|
||||
|
||||
**Best For**:
|
||||
- Kubernetes monitoring
|
||||
- Infrastructure metrics
|
||||
- Custom application metrics
|
||||
- Organizations that need control
|
||||
|
||||
**Pricing**: Free (open source)
|
||||
|
||||
**Setup Complexity**: Medium
|
||||
|
||||
**Example**:
|
||||
```yaml
|
||||
# prometheus.yml
|
||||
scrape_configs:
|
||||
- job_name: 'app'
|
||||
static_configs:
|
||||
- targets: ['localhost:8080']
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Datadog
|
||||
|
||||
**Type**: SaaS monitoring platform
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Easy to set up (install agent, done)
|
||||
- ✅ Beautiful pre-built dashboards
|
||||
- ✅ APM, logs, metrics, traces in one platform
|
||||
- ✅ Great anomaly detection
|
||||
- ✅ Excellent integrations (500+)
|
||||
- ✅ Good mobile app
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Very expensive at scale
|
||||
- ❌ Vendor lock-in
|
||||
- ❌ Cost can be unpredictable (per-host pricing)
|
||||
- ❌ Limited PromQL support
|
||||
|
||||
**Best For**:
|
||||
- Teams that want quick setup
|
||||
- Companies prioritizing ease of use over cost
|
||||
- Organizations needing full observability
|
||||
|
||||
**Pricing**: $15-$31/host/month + custom metrics fees
|
||||
|
||||
**Setup Complexity**: Low
|
||||
|
||||
**Example**:
|
||||
```bash
|
||||
# Install agent
|
||||
DD_API_KEY=xxx bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### New Relic
|
||||
|
||||
**Type**: SaaS application performance monitoring
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Excellent APM capabilities
|
||||
- ✅ User-friendly interface
|
||||
- ✅ Good transaction tracing
|
||||
- ✅ Comprehensive alerting
|
||||
- ✅ Generous free tier
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Can get expensive at scale
|
||||
- ❌ Vendor lock-in
|
||||
- ❌ Query language less powerful than PromQL
|
||||
- ❌ Limited customization
|
||||
|
||||
**Best For**:
|
||||
- Application performance monitoring
|
||||
- Teams focused on APM over infrastructure
|
||||
- Startups (free tier is generous)
|
||||
|
||||
**Pricing**: Free up to 100GB/month, then $0.30/GB
|
||||
|
||||
**Setup Complexity**: Low
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
import newrelic.agent
|
||||
newrelic.agent.initialize('newrelic.ini')
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### CloudWatch
|
||||
|
||||
**Type**: AWS-native monitoring
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Zero setup for AWS services
|
||||
- ✅ Native integration with AWS
|
||||
- ✅ Automatic dashboards for AWS resources
|
||||
- ✅ Tightly integrated with other AWS services
|
||||
- ✅ Good for cost if already on AWS
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ AWS-only (not multi-cloud)
|
||||
- ❌ Limited query capabilities
|
||||
- ❌ High costs for custom metrics
|
||||
- ❌ Basic visualization
|
||||
- ❌ 1-minute minimum resolution
|
||||
|
||||
**Best For**:
|
||||
- AWS-centric infrastructure
|
||||
- Quick setup for AWS services
|
||||
- Organizations already invested in AWS
|
||||
|
||||
**Pricing**:
|
||||
- First 10 custom metrics: Free
|
||||
- Additional: $0.30/metric/month
|
||||
- API calls: $0.01/1000 requests
|
||||
|
||||
**Setup Complexity**: Low (for AWS), Medium (for custom metrics)
|
||||
|
||||
**Example**:
|
||||
```python
|
||||
import boto3
|
||||
cloudwatch = boto3.client('cloudwatch')
|
||||
cloudwatch.put_metric_data(
|
||||
Namespace='MyApp',
|
||||
MetricData=[{'MetricName': 'RequestCount', 'Value': 1}]
|
||||
)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Grafana Cloud / Mimir
|
||||
|
||||
**Type**: Managed Prometheus-compatible
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Prometheus-compatible (PromQL)
|
||||
- ✅ Managed service (no ops burden)
|
||||
- ✅ Good cost model (pay for what you use)
|
||||
- ✅ Grafana dashboards included
|
||||
- ✅ Long-term storage
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Relatively new (less mature)
|
||||
- ❌ Some Prometheus features missing
|
||||
- ❌ Requires Grafana for visualization
|
||||
|
||||
**Best For**:
|
||||
- Teams wanting Prometheus without ops overhead
|
||||
- Multi-cloud environments
|
||||
- Organizations already using Grafana
|
||||
|
||||
**Pricing**: $8/month + $0.29/1M samples
|
||||
|
||||
**Setup Complexity**: Low-Medium
|
||||
|
||||
---
|
||||
|
||||
## Logging Platforms
|
||||
|
||||
### Elasticsearch (ELK Stack)
|
||||
|
||||
**Type**: Open-source log search and analytics
|
||||
|
||||
**Full Stack**: Elasticsearch + Logstash + Kibana
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Powerful search capabilities
|
||||
- ✅ Rich query language
|
||||
- ✅ Great for log analysis
|
||||
- ✅ Mature ecosystem
|
||||
- ✅ Can handle large volumes
|
||||
- ✅ Flexible data model
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Complex to operate
|
||||
- ❌ Resource intensive (RAM hungry)
|
||||
- ❌ Expensive at scale
|
||||
- ❌ Requires dedicated ops team
|
||||
- ❌ Slow for high-cardinality queries
|
||||
|
||||
**Best For**:
|
||||
- Large organizations with ops teams
|
||||
- Deep log analysis needs
|
||||
- Search-heavy use cases
|
||||
|
||||
**Pricing**: Free (open source) + infrastructure costs
|
||||
|
||||
**Infrastructure**: ~$500-2000/month for medium scale
|
||||
|
||||
**Setup Complexity**: High
|
||||
|
||||
**Example**:
|
||||
```json
|
||||
PUT /logs-2024.10/_doc/1
|
||||
{
|
||||
"timestamp": "2024-10-28T14:32:15Z",
|
||||
"level": "error",
|
||||
"message": "Payment failed"
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Grafana Loki
|
||||
|
||||
**Type**: Log aggregation system
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Cost-effective (labels only, not full-text indexing)
|
||||
- ✅ Easy to operate
|
||||
- ✅ Prometheus-like label model
|
||||
- ✅ Great Grafana integration
|
||||
- ✅ Low resource usage
|
||||
- ✅ Fast time-range queries
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Limited full-text search
|
||||
- ❌ Requires careful label design
|
||||
- ❌ Younger ecosystem than ELK
|
||||
- ❌ Not ideal for complex queries
|
||||
|
||||
**Best For**:
|
||||
- Cost-conscious organizations
|
||||
- Kubernetes environments
|
||||
- Teams already using Prometheus
|
||||
- Time-series log queries
|
||||
|
||||
**Pricing**: Free (open source) + infrastructure costs
|
||||
|
||||
**Infrastructure**: ~$100-500/month for medium scale
|
||||
|
||||
**Setup Complexity**: Medium
|
||||
|
||||
**Example**:
|
||||
```logql
|
||||
{job="api", environment="prod"} |= "error" | json | level="error"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Splunk
|
||||
|
||||
**Type**: Enterprise log management
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Extremely powerful search
|
||||
- ✅ Great for security/compliance
|
||||
- ✅ Mature platform
|
||||
- ✅ Enterprise support
|
||||
- ✅ Machine learning features
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Very expensive
|
||||
- ❌ Complex pricing (per GB ingested)
|
||||
- ❌ Steep learning curve
|
||||
- ❌ Heavy resource usage
|
||||
|
||||
**Best For**:
|
||||
- Large enterprises
|
||||
- Security operations centers (SOCs)
|
||||
- Compliance-heavy industries
|
||||
|
||||
**Pricing**: $150-$1800/GB/month (depending on tier)
|
||||
|
||||
**Setup Complexity**: Medium-High
|
||||
|
||||
---
|
||||
|
||||
### CloudWatch Logs
|
||||
|
||||
**Type**: AWS-native log management
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Zero setup for AWS services
|
||||
- ✅ Integrated with AWS ecosystem
|
||||
- ✅ CloudWatch Insights for queries
|
||||
- ✅ Reasonable cost for low volume
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ AWS-only
|
||||
- ❌ Limited query capabilities
|
||||
- ❌ Expensive at high volume
|
||||
- ❌ Basic visualization
|
||||
|
||||
**Best For**:
|
||||
- AWS-centric applications
|
||||
- Low-volume logging
|
||||
- Simple log aggregation
|
||||
|
||||
**Pricing**: Tiered (as of May 2025)
|
||||
- Vended Logs: $0.50/GB (first 10TB), $0.25/GB (next 20TB), then lower tiers
|
||||
- Standard logs: $0.50/GB flat
|
||||
- Storage: $0.03/GB
|
||||
|
||||
**Setup Complexity**: Low (AWS), Medium (custom)
|
||||
|
||||
---
|
||||
|
||||
### Sumo Logic
|
||||
|
||||
**Type**: SaaS log management
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Easy to use
|
||||
- ✅ Good for cloud-native apps
|
||||
- ✅ Real-time analytics
|
||||
- ✅ Good compliance features
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Expensive at scale
|
||||
- ❌ Vendor lock-in
|
||||
- ❌ Limited customization
|
||||
|
||||
**Best For**:
|
||||
- Cloud-native applications
|
||||
- Teams wanting managed solution
|
||||
- Security and compliance use cases
|
||||
|
||||
**Pricing**: $90-$180/GB/month
|
||||
|
||||
**Setup Complexity**: Low
|
||||
|
||||
---
|
||||
|
||||
## Tracing Platforms
|
||||
|
||||
### Jaeger
|
||||
|
||||
**Type**: Open-source distributed tracing
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Industry standard
|
||||
- ✅ CNCF graduated project
|
||||
- ✅ Supports OpenTelemetry
|
||||
- ✅ Good UI
|
||||
- ✅ Free and open source
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Requires separate storage backend
|
||||
- ❌ Limited query capabilities
|
||||
- ❌ No built-in analytics
|
||||
|
||||
**Best For**:
|
||||
- Microservices architectures
|
||||
- Kubernetes environments
|
||||
- OpenTelemetry users
|
||||
|
||||
**Pricing**: Free (open source) + storage costs
|
||||
|
||||
**Setup Complexity**: Medium
|
||||
|
||||
---
|
||||
|
||||
### Grafana Tempo
|
||||
|
||||
**Type**: Open-source distributed tracing
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Cost-effective (object storage)
|
||||
- ✅ Easy to operate
|
||||
- ✅ Great Grafana integration
|
||||
- ✅ TraceQL query language
|
||||
- ✅ Supports OpenTelemetry
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Younger than Jaeger
|
||||
- ❌ Limited third-party integrations
|
||||
- ❌ Requires Grafana for UI
|
||||
|
||||
**Best For**:
|
||||
- Cost-conscious organizations
|
||||
- Teams using Grafana stack
|
||||
- High trace volumes
|
||||
|
||||
**Pricing**: Free (open source) + storage costs
|
||||
|
||||
**Setup Complexity**: Medium
|
||||
|
||||
---
|
||||
|
||||
### Datadog APM
|
||||
|
||||
**Type**: SaaS application performance monitoring
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Easy to set up
|
||||
- ✅ Excellent trace visualization
|
||||
- ✅ Integrated with metrics/logs
|
||||
- ✅ Automatic service map
|
||||
- ✅ Good profiling features
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Expensive ($31/host/month)
|
||||
- ❌ Vendor lock-in
|
||||
- ❌ Limited sampling control
|
||||
|
||||
**Best For**:
|
||||
- Teams wanting ease of use
|
||||
- Organizations already using Datadog
|
||||
- Complex microservices
|
||||
|
||||
**Pricing**: $31/host/month + $1.70/million spans
|
||||
|
||||
**Setup Complexity**: Low
|
||||
|
||||
---
|
||||
|
||||
### AWS X-Ray
|
||||
|
||||
**Type**: AWS-native distributed tracing
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Native AWS integration
|
||||
- ✅ Automatic instrumentation for AWS services
|
||||
- ✅ Low cost
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ AWS-only
|
||||
- ❌ Basic UI
|
||||
- ❌ Limited query capabilities
|
||||
|
||||
**Best For**:
|
||||
- AWS-centric applications
|
||||
- Serverless architectures (Lambda)
|
||||
- Cost-sensitive projects
|
||||
|
||||
**Pricing**: $5/million traces, first 100k free/month
|
||||
|
||||
**Setup Complexity**: Low (AWS), Medium (custom)
|
||||
|
||||
---
|
||||
|
||||
## Full-Stack Observability
|
||||
|
||||
### Datadog (Full Platform)
|
||||
|
||||
**Components**: Metrics, logs, traces, RUM, synthetics
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Everything in one platform
|
||||
- ✅ Excellent user experience
|
||||
- ✅ Correlation across signals
|
||||
- ✅ Great for teams
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Very expensive ($50-100+/host/month)
|
||||
- ❌ Vendor lock-in
|
||||
- ❌ Unpredictable costs
|
||||
|
||||
**Total Cost** (example 100 hosts):
|
||||
- Infrastructure: $3,100/month
|
||||
- APM: $3,100/month
|
||||
- Logs: ~$2,000/month
|
||||
- **Total: ~$8,000/month**
|
||||
|
||||
---
|
||||
|
||||
### Grafana Stack (LGTM)
|
||||
|
||||
**Components**: Loki (logs), Grafana (viz), Tempo (traces), Mimir/Prometheus (metrics)
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Open source and cost-effective
|
||||
- ✅ Unified visualization
|
||||
- ✅ Prometheus-compatible
|
||||
- ✅ Great for cloud-native
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Requires self-hosting or Grafana Cloud
|
||||
- ❌ More ops burden
|
||||
- ❌ Less polished than commercial tools
|
||||
|
||||
**Total Cost** (self-hosted, 100 hosts):
|
||||
- Infrastructure: ~$1,500/month
|
||||
- Ops time: Variable
|
||||
- **Total: ~$1,500-3,000/month**
|
||||
|
||||
---
|
||||
|
||||
### Elastic Observability
|
||||
|
||||
**Components**: Elasticsearch (logs), Kibana (viz), APM, metrics
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Powerful search
|
||||
- ✅ Mature platform
|
||||
- ✅ Good for log-heavy use cases
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Complex to operate
|
||||
- ❌ Expensive infrastructure
|
||||
- ❌ Resource intensive
|
||||
|
||||
**Total Cost** (self-hosted, 100 hosts):
|
||||
- Infrastructure: ~$3,000-5,000/month
|
||||
- Ops time: High
|
||||
- **Total: ~$4,000-7,000/month**
|
||||
|
||||
---
|
||||
|
||||
### New Relic One
|
||||
|
||||
**Components**: Metrics, logs, traces, synthetics
|
||||
|
||||
**Strengths**:
|
||||
- ✅ Generous free tier (100GB)
|
||||
- ✅ User-friendly
|
||||
- ✅ Good for startups
|
||||
|
||||
**Weaknesses**:
|
||||
- ❌ Costs increase quickly after free tier
|
||||
- ❌ Vendor lock-in
|
||||
|
||||
**Total Cost**:
|
||||
- Free: up to 100GB/month
|
||||
- Paid: $0.30/GB beyond 100GB
|
||||
|
||||
---
|
||||
|
||||
## Cloud Provider Native
|
||||
|
||||
### AWS (CloudWatch + X-Ray)
|
||||
|
||||
**Use When**:
|
||||
- Primarily on AWS
|
||||
- Simple monitoring needs
|
||||
- Want minimal setup
|
||||
|
||||
**Avoid When**:
|
||||
- Multi-cloud environment
|
||||
- Need advanced features
|
||||
- High log volume (expensive)
|
||||
|
||||
**Cost** (example):
|
||||
- 100 EC2 instances with basic metrics: ~$150/month
|
||||
- 1TB logs: ~$500/month ingestion + storage
|
||||
- X-Ray: ~$50/month
|
||||
|
||||
---
|
||||
|
||||
### GCP (Cloud Monitoring + Cloud Trace)
|
||||
|
||||
**Use When**:
|
||||
- Primarily on GCP
|
||||
- Using GKE
|
||||
- Want tight GCP integration
|
||||
|
||||
**Avoid When**:
|
||||
- Multi-cloud environment
|
||||
- Need advanced querying
|
||||
|
||||
**Cost** (example):
|
||||
- First 150MB/month per resource: Free
|
||||
- Additional: $0.2508/MB
|
||||
|
||||
---
|
||||
|
||||
### Azure (Azure Monitor)
|
||||
|
||||
**Use When**:
|
||||
- Primarily on Azure
|
||||
- Using AKS
|
||||
- Need Azure integration
|
||||
|
||||
**Avoid When**:
|
||||
- Multi-cloud
|
||||
- Need advanced features
|
||||
|
||||
**Cost** (example):
|
||||
- First 5GB: Free
|
||||
- Additional: $2.76/GB
|
||||
|
||||
---
|
||||
|
||||
## Decision Matrix
|
||||
|
||||
### Choose Prometheus + Grafana If:
|
||||
- ✅ Using Kubernetes
|
||||
- ✅ Want control and customization
|
||||
- ✅ Have ops capacity
|
||||
- ✅ Budget-conscious
|
||||
- ✅ Need Prometheus ecosystem
|
||||
|
||||
### Choose Datadog If:
|
||||
- ✅ Want ease of use
|
||||
- ✅ Need full observability now
|
||||
- ✅ Budget allows ($8k+/month for 100 hosts)
|
||||
- ✅ Limited ops team
|
||||
- ✅ Need excellent UX
|
||||
|
||||
### Choose ELK If:
|
||||
- ✅ Heavy log analysis needs
|
||||
- ✅ Need powerful search
|
||||
- ✅ Have dedicated ops team
|
||||
- ✅ Compliance requirements
|
||||
- ✅ Willing to invest in infrastructure
|
||||
|
||||
### Choose Grafana Stack (LGTM) If:
|
||||
- ✅ Want open source full stack
|
||||
- ✅ Cost-effective solution
|
||||
- ✅ Cloud-native architecture
|
||||
- ✅ Already using Prometheus
|
||||
- ✅ Have some ops capacity
|
||||
|
||||
### Choose New Relic If:
|
||||
- ✅ Startup with free tier
|
||||
- ✅ APM is priority
|
||||
- ✅ Want easy setup
|
||||
- ✅ Don't need heavy customization
|
||||
|
||||
### Choose Cloud Native (CloudWatch/etc) If:
|
||||
- ✅ Single cloud provider
|
||||
- ✅ Simple needs
|
||||
- ✅ Want minimal setup
|
||||
- ✅ Low to medium scale
|
||||
|
||||
---
|
||||
|
||||
## Cost Comparison

**Example: 100 hosts, 1TB logs/month, 1M spans/day**

| Solution | Monthly Cost | Setup | Ops Burden |
|----------|-------------|--------|------------|
| **Prometheus + Loki + Tempo** | $1,500 | Medium | Medium |
| **Grafana Cloud** | $3,000 | Low | Low |
| **Datadog** | $8,000 | Low | None |
| **New Relic** | $3,500 | Low | None |
| **ELK Stack** | $4,000 | High | High |
| **CloudWatch** | $2,000 | Low | Low |
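
As a rough sanity check on figures like these, a small Python sketch that multiplies out per-unit list prices; the unit prices are assumptions based on the vendor sections above and will drift over time:

```python
# Rough cost model; unit prices are illustrative and change over time.
def datadog_monthly_estimate(hosts: int, log_gb: float, spans_millions: float) -> float:
    infra = hosts * 31          # infrastructure monitoring, $/host
    apm = hosts * 31            # APM, $/host
    logs = log_gb * 2.0         # assumed ~$/GB ingested + indexed (varies by retention)
    spans = spans_millions * 1.70
    return infra + apm + logs + spans

# 100 hosts, 1 TB logs, ~30M spans/month: in the ballpark of the table above
print(round(datadog_monthly_estimate(100, 1000, 30)))  # ~8251
```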
|
||||
|
||||
---
|
||||
|
||||
## Recommendations by Company Size
|
||||
|
||||
### Startup (< 10 engineers)
|
||||
**Recommendation**: New Relic or Grafana Cloud
|
||||
- Minimal ops burden
|
||||
- Good free tiers
|
||||
- Easy to get started
|
||||
|
||||
### Small Company (10-50 engineers)
|
||||
**Recommendation**: Prometheus + Grafana + Loki (self-hosted or cloud)
|
||||
- Cost-effective
|
||||
- Growing ops capacity
|
||||
- Flexibility
|
||||
|
||||
### Medium Company (50-200 engineers)
|
||||
**Recommendation**: Datadog or Grafana Stack
|
||||
- Datadog if budget allows
|
||||
- Grafana Stack if cost-conscious
|
||||
|
||||
### Large Enterprise (200+ engineers)
|
||||
**Recommendation**: Build observability platform
|
||||
- Mix of tools based on needs
|
||||
- Dedicated observability team
|
||||
- Custom integrations
|
||||
663
references/tracing_guide.md
Normal file
663
references/tracing_guide.md
Normal file
@@ -0,0 +1,663 @@
|
||||
# Distributed Tracing Guide
|
||||
|
||||
## What is Distributed Tracing?
|
||||
|
||||
Distributed tracing tracks a request as it flows through multiple services in a distributed system.
|
||||
|
||||
### Key Concepts
|
||||
|
||||
**Trace**: End-to-end journey of a request
|
||||
**Span**: Single operation within a trace
|
||||
**Context**: Metadata propagated between services (trace_id, span_id)
|
||||
|
||||
### Example Flow
|
||||
```
|
||||
User Request → API Gateway → Auth Service → User Service → Database
|
||||
↓ ↓ ↓
|
||||
[Trace ID: abc123]
|
||||
Span 1: gateway (50ms)
|
||||
Span 2: auth (20ms)
|
||||
Span 3: user_service (100ms)
|
||||
Span 4: db_query (80ms)
|
||||
|
||||
Total: 250ms with waterfall view showing dependencies
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## OpenTelemetry (OTel)
|
||||
|
||||
OpenTelemetry is the industry standard for instrumentation.
|
||||
|
||||
### Components
|
||||
|
||||
**API**: Instrument code (create spans, add attributes)
|
||||
**SDK**: Implement API, configure exporters
|
||||
**Collector**: Receive, process, and export telemetry data
|
||||
**Exporters**: Send data to backends (Jaeger, Tempo, Zipkin)
|
||||
|
||||
### Architecture
|
||||
```
|
||||
Application → OTel SDK → OTel Collector → Backend (Jaeger/Tempo)
|
||||
↓
|
||||
Visualization
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Instrumentation Examples
|
||||
|
||||
### Python (using OpenTelemetry)
|
||||
|
||||
**Setup**:
|
||||
```python
|
||||
from opentelemetry import trace
|
||||
from opentelemetry.sdk.trace import TracerProvider
|
||||
from opentelemetry.sdk.trace.export import BatchSpanProcessor
|
||||
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
|
||||
|
||||
# Setup tracer
|
||||
trace.set_tracer_provider(TracerProvider())
|
||||
tracer = trace.get_tracer(__name__)
|
||||
|
||||
# Configure exporter
|
||||
otlp_exporter = OTLPSpanExporter(endpoint="localhost:4317")
|
||||
span_processor = BatchSpanProcessor(otlp_exporter)
|
||||
trace.get_tracer_provider().add_span_processor(span_processor)
|
||||
```
|
||||
|
||||
**Manual instrumentation**:
|
||||
```python
|
||||
from opentelemetry import trace
|
||||
|
||||
tracer = trace.get_tracer(__name__)
|
||||
|
||||
@tracer.start_as_current_span("process_order")
|
||||
def process_order(order_id):
|
||||
span = trace.get_current_span()
|
||||
span.set_attribute("order.id", order_id)
|
||||
span.set_attribute("order.amount", 99.99)
|
||||
|
||||
try:
|
||||
result = payment_service.charge(order_id)
|
||||
span.set_attribute("payment.status", "success")
|
||||
return result
|
||||
except Exception as e:
|
||||
span.set_status(trace.Status(trace.StatusCode.ERROR))
|
||||
span.record_exception(e)
|
||||
raise
|
||||
```
|
||||
|
||||
**Auto-instrumentation** (Flask example):
|
||||
```python
|
||||
from opentelemetry.instrumentation.flask import FlaskInstrumentor
|
||||
from opentelemetry.instrumentation.requests import RequestsInstrumentor
|
||||
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
|
||||
|
||||
# Auto-instrument Flask
|
||||
FlaskInstrumentor().instrument_app(app)
|
||||
|
||||
# Auto-instrument requests library
|
||||
RequestsInstrumentor().instrument()
|
||||
|
||||
# Auto-instrument SQLAlchemy
|
||||
SQLAlchemyInstrumentor().instrument(engine=db.engine)
|
||||
```
|
||||
|
||||
### Node.js (using OpenTelemetry)
|
||||
|
||||
**Setup**:
|
||||
```javascript
|
||||
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
|
||||
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
|
||||
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
|
||||
|
||||
// Setup provider
|
||||
const provider = new NodeTracerProvider();
|
||||
const exporter = new OTLPTraceExporter({ url: 'localhost:4317' });
|
||||
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
|
||||
provider.register();
|
||||
```
|
||||
|
||||
**Manual instrumentation**:
|
||||
```javascript
|
||||
const { SpanStatusCode } = require('@opentelemetry/api');

const tracer = provider.getTracer('my-service');
|
||||
|
||||
async function processOrder(orderId) {
|
||||
const span = tracer.startSpan('process_order');
|
||||
span.setAttribute('order.id', orderId);
|
||||
|
||||
try {
|
||||
const result = await paymentService.charge(orderId);
|
||||
span.setAttribute('payment.status', 'success');
|
||||
return result;
|
||||
} catch (error) {
|
||||
span.setStatus({ code: SpanStatusCode.ERROR });
|
||||
span.recordException(error);
|
||||
throw error;
|
||||
} finally {
|
||||
span.end();
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Auto-instrumentation**:
|
||||
```javascript
|
||||
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
|
||||
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
|
||||
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
|
||||
const { MongoDBInstrumentation } = require('@opentelemetry/instrumentation-mongodb');
|
||||
|
||||
registerInstrumentations({
|
||||
instrumentations: [
|
||||
new HttpInstrumentation(),
|
||||
new ExpressInstrumentation(),
|
||||
new MongoDBInstrumentation()
|
||||
]
|
||||
});
|
||||
```
|
||||
|
||||
### Go (using OpenTelemetry)
|
||||
|
||||
**Setup**:
|
||||
```go
|
||||
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/trace"
)
|
||||
|
||||
func initTracer() {
|
||||
exporter, _ := otlptracegrpc.New(context.Background())
|
||||
tp := trace.NewTracerProvider(
|
||||
trace.WithBatcher(exporter),
|
||||
)
|
||||
otel.SetTracerProvider(tp)
|
||||
}
|
||||
```
|
||||
|
||||
**Manual instrumentation**:
|
||||
```go
|
||||
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)
|
||||
|
||||
func processOrder(ctx context.Context, orderID string) error {
|
||||
tracer := otel.Tracer("my-service")
|
||||
ctx, span := tracer.Start(ctx, "process_order")
|
||||
defer span.End()
|
||||
|
||||
span.SetAttributes(
|
||||
attribute.String("order.id", orderID),
|
||||
attribute.Float64("order.amount", 99.99),
|
||||
)
|
||||
|
||||
err := paymentService.Charge(ctx, orderID)
|
||||
if err != nil {
|
||||
span.RecordError(err)
|
||||
return err
|
||||
}
|
||||
|
||||
span.SetAttributes(attribute.String("payment.status", "success"))
|
||||
return nil
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Span Attributes
|
||||
|
||||
### Semantic Conventions
|
||||
|
||||
Follow OpenTelemetry semantic conventions for consistency:
|
||||
|
||||
**HTTP**:
|
||||
```python
|
||||
span.set_attribute("http.method", "GET")
|
||||
span.set_attribute("http.url", "https://api.example.com/users")
|
||||
span.set_attribute("http.status_code", 200)
|
||||
span.set_attribute("http.user_agent", "Mozilla/5.0...")
|
||||
```
|
||||
|
||||
**Database**:
|
||||
```python
|
||||
span.set_attribute("db.system", "postgresql")
|
||||
span.set_attribute("db.name", "users_db")
|
||||
span.set_attribute("db.statement", "SELECT * FROM users WHERE id = ?")
|
||||
span.set_attribute("db.operation", "SELECT")
|
||||
```
|
||||
|
||||
**RPC/gRPC**:
|
||||
```python
|
||||
span.set_attribute("rpc.system", "grpc")
|
||||
span.set_attribute("rpc.service", "UserService")
|
||||
span.set_attribute("rpc.method", "GetUser")
|
||||
span.set_attribute("rpc.grpc.status_code", 0)
|
||||
```
|
||||
|
||||
**Messaging**:
|
||||
```python
|
||||
span.set_attribute("messaging.system", "kafka")
|
||||
span.set_attribute("messaging.destination", "user-events")
|
||||
span.set_attribute("messaging.operation", "publish")
|
||||
span.set_attribute("messaging.message_id", "msg123")
|
||||
```
|
||||
|
||||
### Custom Attributes
|
||||
|
||||
Add business context:
|
||||
```python
|
||||
span.set_attribute("user.id", "user123")
|
||||
span.set_attribute("order.id", "ORD-456")
|
||||
span.set_attribute("feature.flag.checkout_v2", True)
|
||||
span.set_attribute("cache.hit", False)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Context Propagation

### W3C Trace Context (Standard)

Headers propagated between services:
```
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: vendor1=value1,vendor2=value2
```

**Format**: `version-trace_id-parent_span_id-trace_flags`
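
For intuition, a minimal Python sketch that splits the header into its fields; production code should use the SDK propagators shown below:

```python
# Minimal parser for the traceparent format above (W3C Trace Context).
def parse_traceparent(header: str) -> dict:
    version, trace_id, parent_span_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,              # 32 hex chars (16 bytes)
        "parent_span_id": parent_span_id,  # 16 hex chars (8 bytes)
        "sampled": int(flags, 16) & 0x01 == 1,
    }

ctx = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
print(ctx["sampled"])  # True: trace-flags 01 has the sampled bit set
```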
|
||||
|
||||
### Implementation
|
||||
|
||||
**Python**:
|
||||
```python
|
||||
from opentelemetry.propagate import inject, extract
|
||||
import requests
|
||||
|
||||
# Inject context into outgoing request
|
||||
headers = {}
|
||||
inject(headers)
|
||||
requests.get("https://api.example.com", headers=headers)
|
||||
|
||||
# Extract context from incoming request
|
||||
from flask import request
|
||||
ctx = extract(request.headers)
|
||||
```
|
||||
|
||||
**Node.js**:
|
||||
```javascript
|
||||
const { propagation, context } = require('@opentelemetry/api');
|
||||
|
||||
// Inject
|
||||
const headers = {};
|
||||
propagation.inject(context.active(), headers);
|
||||
axios.get('https://api.example.com', { headers });
|
||||
|
||||
// Extract
|
||||
const ctx = propagation.extract(context.active(), req.headers);
|
||||
```
|
||||
|
||||
**HTTP Example**:
|
||||
```bash
|
||||
curl -H "traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01" \
|
||||
https://api.example.com/users
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Sampling Strategies
|
||||
|
||||
### 1. Always On/Off
|
||||
```python
|
||||
from opentelemetry.sdk.trace import TracerProvider
|
||||
from opentelemetry.sdk.trace.sampling import ALWAYS_ON, ALWAYS_OFF
|
||||
|
||||
# Development: trace everything
|
||||
provider = TracerProvider(sampler=ALWAYS_ON)
|
||||
|
||||
# Production: trace nothing (usually not desired)
|
||||
provider = TracerProvider(sampler=ALWAYS_OFF)
|
||||
```
|
||||
|
||||
### 2. Probability-Based
|
||||
```python
|
||||
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
|
||||
|
||||
# Sample 10% of traces
|
||||
provider = TracerProvider(sampler=TraceIdRatioBased(0.1))
|
||||
```
|
||||
|
||||
### 3. Rate Limiting
The core Python SDK does not ship a rate-limiting sampler; cap trace volume in the
OpenTelemetry Collector (tail sampling), via a vendor/contrib sampler, or with a
custom sampler (see "Custom Sampling" below). Wrapped in `ParentBased`, the limit
applies only to root spans:
```python
from opentelemetry.sdk.trace.sampling import ParentBased

# RateLimitingSampler is assumed to be a custom Sampler you implement
# (see the ErrorSampler pattern below) admitting at most N traces/second.
sampler = ParentBased(root=RateLimitingSampler(100))
provider = TracerProvider(sampler=sampler)
```
|
||||
|
||||
### 4. Parent-Based (Default)
|
||||
```python
|
||||
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
|
||||
|
||||
# If parent span is sampled, sample child spans
|
||||
sampler = ParentBased(root=TraceIdRatioBased(0.1))
|
||||
provider = TracerProvider(sampler=sampler)
|
||||
```
|
||||
|
||||
### 5. Custom Sampling
A custom sampler implements `should_sample()` (returning a `SamplingResult`) and
`get_description()`:
```python
from opentelemetry.sdk.trace.sampling import Sampler, SamplingResult, Decision

class ErrorSampler(Sampler):
    """Always sample errors, sample ~1% of successes."""

    def should_sample(self, parent_context, trace_id, name, kind=None,
                      attributes=None, links=None, trace_state=None):
        attributes = attributes or {}

        # Always sample if the caller marked the span as an error
        if attributes.get("error", False):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)

        # Sample ~1% of successes (3 of 256 possible low bytes of the trace ID)
        if trace_id & 0xFF < 3:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)

        return SamplingResult(Decision.DROP)

    def get_description(self) -> str:
        return "ErrorSampler"

provider = TracerProvider(sampler=ErrorSampler())
```
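
To apply this policy only at trace roots and follow the parent decision elsewhere, it can be wrapped as `sampler = ParentBased(root=ErrorSampler())`.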
|
||||
|
||||
---
|
||||
|
||||
## Backends
|
||||
|
||||
### Jaeger
|
||||
|
||||
**Docker Compose**:
|
||||
```yaml
|
||||
version: '3'
|
||||
services:
|
||||
jaeger:
|
||||
image: jaegertracing/all-in-one:latest
|
||||
ports:
|
||||
- "16686:16686" # UI
|
||||
- "4317:4317" # OTLP gRPC
|
||||
- "4318:4318" # OTLP HTTP
|
||||
environment:
|
||||
- COLLECTOR_OTLP_ENABLED=true
|
||||
```
|
||||
|
||||
**Query traces**:
|
||||
```bash
|
||||
# UI: http://localhost:16686
|
||||
|
||||
# API: Get trace by ID
|
||||
curl http://localhost:16686/api/traces/abc123
|
||||
|
||||
# Search traces
|
||||
curl "http://localhost:16686/api/traces?service=my-service&limit=20"
|
||||
```
|
||||
|
||||
### Grafana Tempo
|
||||
|
||||
**Docker Compose**:
|
||||
```yaml
|
||||
version: '3'
|
||||
services:
|
||||
tempo:
|
||||
image: grafana/tempo:latest
|
||||
ports:
|
||||
- "3200:3200" # Tempo
|
||||
- "4317:4317" # OTLP gRPC
|
||||
volumes:
|
||||
- ./tempo.yaml:/etc/tempo.yaml
|
||||
command: ["-config.file=/etc/tempo.yaml"]
|
||||
```
|
||||
|
||||
**tempo.yaml**:
|
||||
```yaml
|
||||
server:
|
||||
http_listen_port: 3200
|
||||
|
||||
distributor:
|
||||
receivers:
|
||||
otlp:
|
||||
protocols:
|
||||
grpc:
|
||||
endpoint: 0.0.0.0:4317
|
||||
|
||||
storage:
|
||||
trace:
|
||||
backend: local
|
||||
local:
|
||||
path: /tmp/tempo/traces
|
||||
```
|
||||
|
||||
**Query in Grafana**:
|
||||
- Install Tempo data source
|
||||
- Use TraceQL: `{ span.http.status_code = 500 }`
|
||||
|
||||
### AWS X-Ray
|
||||
|
||||
**Configuration**:
|
||||
```python
|
||||
from aws_xray_sdk.core import xray_recorder
|
||||
from aws_xray_sdk.ext.flask.middleware import XRayMiddleware
|
||||
|
||||
xray_recorder.configure(service='my-service')
|
||||
XRayMiddleware(app, xray_recorder)
|
||||
```
|
||||
|
||||
**Query**:
|
||||
```bash
|
||||
aws xray get-trace-summaries \
|
||||
--start-time 2024-10-28T00:00:00 \
|
||||
--end-time 2024-10-28T23:59:59 \
|
||||
--filter-expression 'error = true'
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Analysis Patterns
|
||||
|
||||
### Find Slow Traces
|
||||
```
|
||||
# Jaeger UI
|
||||
- Filter by service
|
||||
- Set min duration: 1000ms
|
||||
- Sort by duration
|
||||
|
||||
# TraceQL (Tempo)
|
||||
{ duration > 1s }
|
||||
```
|
||||
|
||||
### Find Error Traces
|
||||
```
|
||||
# Jaeger UI
|
||||
- Filter by tag: error=true
|
||||
- Or by HTTP status: http.status_code=500
|
||||
|
||||
# TraceQL (Tempo)
|
||||
{ span.http.status_code >= 500 }
|
||||
```
|
||||
|
||||
### Find Traces by User
|
||||
```
|
||||
# Jaeger UI
|
||||
- Filter by tag: user.id=user123
|
||||
|
||||
# TraceQL (Tempo)
|
||||
{ span.user.id = "user123" }
|
||||
```
|
||||
|
||||
### Find N+1 Query Problems
Look for:
- Many sequential database spans
- Same query repeated multiple times
- Pattern: API call → DB query → DB query → DB query... (a detection sketch follows below)
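
A rough sketch of flagging this pattern from exported span data, assuming each span is a dict with `parent_id` and a `db.statement` attribute (field names are illustrative, not a specific backend's export format):

```python
from collections import Counter

def find_n_plus_one(spans, threshold=5):
    """Flag parent spans whose children repeat the same DB statement many times."""
    by_parent = Counter()
    for span in spans:
        stmt = span.get("attributes", {}).get("db.statement")
        if stmt and span.get("parent_id"):
            by_parent[(span["parent_id"], stmt)] += 1
    return [
        {"parent_id": parent, "statement": stmt, "count": count}
        for (parent, stmt), count in by_parent.items()
        if count >= threshold
    ]

# Example: 20 identical SELECTs under one API span -> flagged
spans = [{"parent_id": "api1",
          "attributes": {"db.statement": "SELECT * FROM users WHERE id = ?"}}] * 20
print(find_n_plus_one(spans))
```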
|
||||
|
||||
### Find Service Bottlenecks
- Identify spans with the longest duration
- Check if time is spent in service logic or waiting for dependencies
- Look at span relationships (parallel vs sequential); the self-time sketch below separates own work from time waiting on children
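
A minimal sketch of the self-time idea, assuming spans are dicts with `span_id`, `parent_id`, and start/end timestamps in milliseconds, and that child spans do not overlap:

```python
# Self-time = span duration minus time covered by its (non-overlapping) children.
def self_times(spans):
    children = {}
    for s in spans:
        children.setdefault(s.get("parent_id"), []).append(s)
    result = {}
    for s in spans:
        child_time = sum(c["end"] - c["start"] for c in children.get(s["span_id"], []))
        result[s["span_id"]] = (s["end"] - s["start"]) - child_time
    return result

spans = [
    {"span_id": "api", "parent_id": None, "start": 0, "end": 250},
    {"span_id": "db", "parent_id": "api", "start": 50, "end": 200},
]
print(self_times(spans))  # {'api': 100, 'db': 150}: most of the time is the DB call
```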
|
||||
|
||||
---
|
||||
|
||||
## Integration with Logs
|
||||
|
||||
### Trace ID in Logs
|
||||
|
||||
**Python**:
|
||||
```python
|
||||
from opentelemetry import trace
|
||||
|
||||
def add_trace_context():
|
||||
span = trace.get_current_span()
|
||||
trace_id = span.get_span_context().trace_id
|
||||
span_id = span.get_span_context().span_id
|
||||
|
||||
return {
|
||||
"trace_id": format(trace_id, '032x'),
|
||||
"span_id": format(span_id, '016x')
|
||||
}
|
||||
|
||||
logger.info("Processing order", **add_trace_context(), order_id=order_id)
|
||||
```
|
||||
|
||||
**Query logs for trace**:
|
||||
```
|
||||
# Elasticsearch
|
||||
GET /logs/_search
|
||||
{
|
||||
"query": {
|
||||
"match": { "trace_id": "0af7651916cd43dd8448eb211c80319c" }
|
||||
}
|
||||
}
|
||||
|
||||
# Loki (LogQL)
|
||||
{job="app"} |= "0af7651916cd43dd8448eb211c80319c"
|
||||
```
|
||||
|
||||
### Trace from Log (Grafana)
|
||||
|
||||
Configure derived fields in Grafana:
|
||||
```yaml
|
||||
datasources:
|
||||
- name: Loki
|
||||
type: loki
|
||||
jsonData:
|
||||
derivedFields:
|
||||
- name: TraceID
|
||||
matcherRegex: "trace_id=([\\w]+)"
|
||||
url: "http://tempo:3200/trace/$${__value.raw}"
|
||||
datasourceUid: tempo_uid
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Best Practices
|
||||
|
||||
### 1. Span Naming
|
||||
✅ Use operation names, not IDs
|
||||
- Good: `GET /api/users`, `UserService.GetUser`, `db.query.users`
|
||||
- Bad: `/api/users/123`, `span_abc`, `query_1`
|
||||
|
||||
### 2. Span Granularity
|
||||
✅ One span per logical operation
|
||||
- Too coarse: One span for entire request
|
||||
- Too fine: Span for every variable assignment
|
||||
- Just right: Span per service call, database query, external API
|
||||
|
||||
### 3. Add Context
|
||||
Always include:
|
||||
- Operation name
|
||||
- Service name
|
||||
- Error status
|
||||
- Business identifiers (user_id, order_id)
|
||||
|
||||
### 4. Handle Errors
|
||||
```python
|
||||
try:
|
||||
result = operation()
|
||||
except Exception as e:
|
||||
span.set_status(trace.Status(trace.StatusCode.ERROR))
|
||||
span.record_exception(e)
|
||||
raise
|
||||
```
|
||||
|
||||
### 5. Sampling Strategy
- Development: 100%
- Staging: 50-100%
- Production: 1-10% (or error-based; see the sketch below)
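
One way to wire this up, choosing the sampler from an environment variable (the variable name and ratios are illustrative):

```python
import os

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ALWAYS_ON, ParentBased, TraceIdRatioBased

# Hypothetical per-environment sampling ratios; None means "sample everything".
RATIOS = {"development": None, "staging": 0.5, "production": 0.05}

env = os.getenv("DEPLOY_ENV", "development")
ratio = RATIOS.get(env)
sampler = ALWAYS_ON if ratio is None else ParentBased(root=TraceIdRatioBased(ratio))

provider = TracerProvider(sampler=sampler)
```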
|
||||
|
||||
### 6. Performance Impact
|
||||
- Overhead: ~1-5% CPU
|
||||
- Use async exporters
|
||||
- Batch span exports
|
||||
- Sample appropriately
|
||||
|
||||
### 7. Cardinality
|
||||
Avoid high-cardinality attributes:
|
||||
- ❌ Email addresses
|
||||
- ❌ Full URLs with unique IDs
|
||||
- ❌ Timestamps
|
||||
- ✅ User ID
|
||||
- ✅ Endpoint pattern
|
||||
- ✅ Status code
|
||||
|
||||
---
|
||||
|
||||
## Common Issues
|
||||
|
||||
### Missing Traces
|
||||
**Cause**: Context not propagated
|
||||
**Solution**: Verify headers are injected/extracted
|
||||
|
||||
### Incomplete Traces
|
||||
**Cause**: Spans not closed properly
|
||||
**Solution**: Always use `defer span.End()` or context managers
|
||||
|
||||
### High Overhead
|
||||
**Cause**: Too many spans or synchronous export
|
||||
**Solution**: Reduce span count, use batch processor
|
||||
|
||||
### No Error Traces
|
||||
**Cause**: Errors not recorded on spans
|
||||
**Solution**: Call `span.record_exception()` and set error status
|
||||
|
||||
---
|
||||
|
||||
## Metrics from Traces
|
||||
|
||||
Generate RED metrics from trace data:
|
||||
|
||||
**Rate**: Traces per second
|
||||
**Errors**: Traces with error status
|
||||
**Duration**: Span duration percentiles
|
||||
|
||||
**Example** (using Tempo + Prometheus):
|
||||
```yaml
|
||||
# Generate metrics from spans
|
||||
metrics_generator:
|
||||
processor:
|
||||
span_metrics:
|
||||
dimensions:
|
||||
- http.method
|
||||
- http.status_code
|
||||
```
|
||||
|
||||
**Query**:
|
||||
```promql
|
||||
# Request rate
|
||||
rate(traces_spanmetrics_calls_total[5m])
|
||||
|
||||
# Error rate
|
||||
rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m])
|
||||
/
|
||||
rate(traces_spanmetrics_calls_total[5m])
|
||||
|
||||
# P95 latency
|
||||
histogram_quantile(0.95, traces_spanmetrics_latency_bucket)
|
||||
```
|
||||
315
scripts/alert_quality_checker.py
Normal file
315
scripts/alert_quality_checker.py
Normal file
@@ -0,0 +1,315 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Audit Prometheus alert rules against best practices.
|
||||
Checks for: alert naming, severity labels, runbook links, expression quality.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
import os
|
||||
import re
|
||||
from typing import Dict, List, Any
|
||||
from pathlib import Path
|
||||
|
||||
try:
|
||||
import yaml
|
||||
except ImportError:
|
||||
print("⚠️ Warning: 'PyYAML' library not found. Install with: pip install pyyaml")
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
class AlertQualityChecker:
|
||||
def __init__(self):
|
||||
self.issues = []
|
||||
self.warnings = []
|
||||
self.recommendations = []
|
||||
|
||||
def check_alert_name(self, alert_name: str) -> List[str]:
|
||||
"""Check alert naming conventions."""
|
||||
issues = []
|
||||
|
||||
# Should be PascalCase (leading capital, alphanumeric)
|
||||
if not re.match(r'^[A-Z][a-zA-Z0-9]*$', alert_name):
|
||||
issues.append(f"Alert name '{alert_name}' should use PascalCase (e.g., HighCPUUsage)")
|
||||
|
||||
# Should be descriptive
|
||||
if len(alert_name) < 5:
|
||||
issues.append(f"Alert name '{alert_name}' is too short, use descriptive names")
|
||||
|
||||
# Avoid generic names
|
||||
generic_names = ['Alert', 'Test', 'Warning', 'Error']
|
||||
if alert_name in generic_names:
|
||||
issues.append(f"Alert name '{alert_name}' is too generic")
|
||||
|
||||
return issues
|
||||
|
||||
def check_labels(self, alert: Dict[str, Any]) -> List[str]:
|
||||
"""Check required and recommended labels."""
|
||||
issues = []
|
||||
labels = alert.get('labels', {})
|
||||
|
||||
# Required labels
|
||||
if 'severity' not in labels:
|
||||
issues.append("Missing required 'severity' label (critical/warning/info)")
|
||||
elif labels['severity'] not in ['critical', 'warning', 'info']:
|
||||
issues.append(f"Severity '{labels['severity']}' should be one of: critical, warning, info")
|
||||
|
||||
# Recommended labels
|
||||
if 'team' not in labels:
|
||||
self.recommendations.append("Consider adding 'team' label for routing")
|
||||
|
||||
if 'component' not in labels and 'service' not in labels:
|
||||
self.recommendations.append("Consider adding 'component' or 'service' label")
|
||||
|
||||
return issues
|
||||
|
||||
def check_annotations(self, alert: Dict[str, Any]) -> List[str]:
|
||||
"""Check annotations quality."""
|
||||
issues = []
|
||||
annotations = alert.get('annotations', {})
|
||||
|
||||
# Required annotations
|
||||
if 'summary' not in annotations:
|
||||
issues.append("Missing 'summary' annotation")
|
||||
elif len(annotations['summary']) < 10:
|
||||
issues.append("Summary annotation is too short, provide clear description")
|
||||
|
||||
if 'description' not in annotations:
|
||||
issues.append("Missing 'description' annotation")
|
||||
|
||||
# Runbook
|
||||
if 'runbook_url' not in annotations and 'runbook' not in annotations:
|
||||
self.recommendations.append("Consider adding 'runbook_url' for incident response")
|
||||
|
||||
# Check for templating
|
||||
if 'summary' in annotations:
|
||||
if '{{ $value }}' not in annotations['summary'] and '{{' not in annotations['summary']:
|
||||
self.recommendations.append("Consider using template variables in summary (e.g., {{ $value }})")
|
||||
|
||||
return issues
|
||||
|
||||
def check_expression(self, expr: str, alert_name: str) -> List[str]:
|
||||
"""Check PromQL expression quality."""
|
||||
issues = []
|
||||
|
||||
# Should have a threshold
|
||||
if '>' not in expr and '<' not in expr and '==' not in expr and '!=' not in expr:
|
||||
issues.append("Expression should include a comparison operator")
|
||||
|
||||
# Should use rate() for counters
|
||||
if '_total' in expr and 'rate(' not in expr and 'increase(' not in expr:
|
||||
self.recommendations.append("Consider using rate() or increase() for counter metrics (*_total)")
|
||||
|
||||
# Avoid instant queries without aggregation
|
||||
if not any(agg in expr for agg in ['sum(', 'avg(', 'min(', 'max(', 'count(']):
|
||||
if expr.count('{') > 1: # Multiple metrics without aggregation
|
||||
self.recommendations.append("Consider aggregating metrics with sum(), avg(), etc.")
|
||||
|
||||
# Check for proper time windows
|
||||
if '[' not in expr and 'rate(' in expr:
|
||||
issues.append("rate() requires a time window (e.g., rate(metric[5m]))")
|
||||
|
||||
return issues
|
||||
|
||||
def check_for_duration(self, rule: Dict[str, Any]) -> List[str]:
|
||||
"""Check for 'for' clause to prevent flapping."""
|
||||
issues = []
|
||||
severity = rule.get('labels', {}).get('severity', 'unknown')
|
||||
|
||||
if 'for' not in rule:
|
||||
if severity == 'critical':
|
||||
issues.append("Critical alerts should have 'for' clause to prevent flapping")
|
||||
else:
|
||||
self.warnings.append("Consider adding 'for' clause to prevent alert flapping")
|
||||
else:
|
||||
# Parse duration
|
||||
duration = rule['for']
|
||||
if severity == 'critical' and any(x in duration for x in ['0s', '30s', '1m']):
|
||||
self.warnings.append(f"'for' duration ({duration}) might be too short for critical alerts")
|
||||
|
||||
return issues
|
||||
|
||||
def check_alert_rule(self, rule: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""Check a single alert rule."""
|
||||
alert_name = rule.get('alert', 'Unknown')
|
||||
issues = []
|
||||
|
||||
# Check alert name
|
||||
issues.extend(self.check_alert_name(alert_name))
|
||||
|
||||
# Check expression
|
||||
if 'expr' not in rule:
|
||||
issues.append("Missing 'expr' field")
|
||||
else:
|
||||
issues.extend(self.check_expression(rule['expr'], alert_name))
|
||||
|
||||
# Check labels
|
||||
issues.extend(self.check_labels(rule))
|
||||
|
||||
# Check annotations
|
||||
issues.extend(self.check_annotations(rule))
|
||||
|
||||
# Check for duration
|
||||
issues.extend(self.check_for_duration(rule))
|
||||
|
||||
return {
|
||||
"alert": alert_name,
|
||||
"issues": issues,
|
||||
"severity": rule.get('labels', {}).get('severity', 'unknown')
|
||||
}
|
||||
|
||||
def analyze_file(self, filepath: str) -> Dict[str, Any]:
|
||||
"""Analyze a Prometheus rules file."""
|
||||
try:
|
||||
with open(filepath, 'r') as f:
|
||||
data = yaml.safe_load(f)
|
||||
|
||||
if not data:
|
||||
return {"error": "Empty or invalid YAML file"}
|
||||
|
||||
results = []
|
||||
groups = data.get('groups', [])
|
||||
|
||||
for group in groups:
|
||||
group_name = group.get('name', 'Unknown')
|
||||
rules = group.get('rules', [])
|
||||
|
||||
for rule in rules:
|
||||
# Only check alerting rules, not recording rules
|
||||
if 'alert' in rule:
|
||||
result = self.check_alert_rule(rule)
|
||||
result['group'] = group_name
|
||||
results.append(result)
|
||||
|
||||
return {
|
||||
"file": filepath,
|
||||
"groups": len(groups),
|
||||
"alerts_checked": len(results),
|
||||
"results": results
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
return {"error": f"Failed to parse file: {e}"}
|
||||
|
||||
|
||||
def print_results(analysis: Dict[str, Any], checker: AlertQualityChecker):
|
||||
"""Pretty print analysis results."""
|
||||
print("\n" + "="*60)
|
||||
print("🚨 ALERT QUALITY CHECK RESULTS")
|
||||
print("="*60)
|
||||
|
||||
if "error" in analysis:
|
||||
print(f"\n❌ Error: {analysis['error']}")
|
||||
return
|
||||
|
||||
print(f"\n📁 File: {analysis['file']}")
|
||||
print(f"📊 Groups: {analysis['groups']}")
|
||||
print(f"🔔 Alerts Checked: {analysis['alerts_checked']}")
|
||||
|
||||
# Count issues by severity
|
||||
critical_count = 0
|
||||
warning_count = 0
|
||||
|
||||
for result in analysis['results']:
|
||||
if result['issues']:
|
||||
critical_count += 1
|
||||
|
||||
print(f"\n{'='*60}")
|
||||
print(f"📈 Summary:")
|
||||
print(f" ❌ Alerts with Issues: {critical_count}")
|
||||
print(f" ⚠️ Warnings: {len(checker.warnings)}")
|
||||
print(f" 💡 Recommendations: {len(checker.recommendations)}")
|
||||
|
||||
# Print detailed results
|
||||
if critical_count > 0:
|
||||
print(f"\n{'='*60}")
|
||||
print("❌ ALERTS WITH ISSUES:")
|
||||
print(f"{'='*60}")
|
||||
|
||||
for result in analysis['results']:
|
||||
if result['issues']:
|
||||
print(f"\n🔔 Alert: {result['alert']} (Group: {result['group']})")
|
||||
print(f" Severity: {result['severity']}")
|
||||
print(" Issues:")
|
||||
for issue in result['issues']:
|
||||
print(f" • {issue}")
|
||||
|
||||
# Print warnings
|
||||
if checker.warnings:
|
||||
print(f"\n{'='*60}")
|
||||
print("⚠️ WARNINGS:")
|
||||
print(f"{'='*60}")
|
||||
for warning in set(checker.warnings): # Remove duplicates
|
||||
print(f"• {warning}")
|
||||
|
||||
# Print recommendations
|
||||
if checker.recommendations:
|
||||
print(f"\n{'='*60}")
|
||||
print("💡 RECOMMENDATIONS:")
|
||||
print(f"{'='*60}")
|
||||
for rec in list(set(checker.recommendations))[:10]: # Top 10 unique recommendations
|
||||
print(f"• {rec}")
|
||||
|
||||
# Overall score
|
||||
total_alerts = analysis['alerts_checked']
|
||||
if total_alerts > 0:
|
||||
quality_score = ((total_alerts - critical_count) / total_alerts) * 100
|
||||
print(f"\n{'='*60}")
|
||||
print(f"📊 Quality Score: {quality_score:.1f}% ({total_alerts - critical_count}/{total_alerts} alerts passing)")
|
||||
print(f"{'='*60}\n")
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Audit Prometheus alert rules for quality and best practices",
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Examples:
|
||||
# Check a single file
|
||||
python3 alert_quality_checker.py alerts.yml
|
||||
|
||||
# Check all YAML files in a directory
|
||||
python3 alert_quality_checker.py /path/to/prometheus/rules/
|
||||
|
||||
Best Practices Checked:
|
||||
✓ Alert naming conventions (PascalCase, descriptive)
|
||||
✓ Required labels (severity)
|
||||
✓ Required annotations (summary, description)
|
||||
✓ Runbook URL presence
|
||||
✓ PromQL expression quality
|
||||
✓ 'for' clause to prevent flapping
|
||||
✓ Template variable usage
|
||||
"""
|
||||
)
|
||||
|
||||
parser.add_argument('path', help='Path to alert rules file or directory')
|
||||
parser.add_argument('--verbose', action='store_true', help='Show all recommendations')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
checker = AlertQualityChecker()
|
||||
|
||||
# Check if path is file or directory
|
||||
path = Path(args.path)
|
||||
|
||||
if path.is_file():
|
||||
files = [str(path)]
|
||||
elif path.is_dir():
|
||||
files = [str(f) for f in path.rglob('*.yml')] + [str(f) for f in path.rglob('*.yaml')]
|
||||
else:
|
||||
print(f"❌ Path not found: {args.path}")
|
||||
sys.exit(1)
|
||||
|
||||
if not files:
|
||||
print(f"❌ No YAML files found in: {args.path}")
|
||||
sys.exit(1)
|
||||
|
||||
print(f"🔍 Checking {len(files)} file(s)...")
|
||||
|
||||
for filepath in files:
|
||||
analysis = checker.analyze_file(filepath)
|
||||
print_results(analysis, checker)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
279
scripts/analyze_metrics.py
Normal file
279
scripts/analyze_metrics.py
Normal file
@@ -0,0 +1,279 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Analyze metrics from Prometheus or CloudWatch and detect anomalies.
|
||||
Supports: rate of change analysis, spike detection, trend analysis.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
import json
|
||||
from datetime import datetime, timedelta
|
||||
from typing import Dict, List, Any, Optional
|
||||
import statistics
|
||||
|
||||
try:
|
||||
import requests
|
||||
except ImportError:
|
||||
print("⚠️ Warning: 'requests' library not found. Install with: pip install requests")
|
||||
sys.exit(1)
|
||||
|
||||
try:
|
||||
import boto3
|
||||
except ImportError:
|
||||
boto3 = None
|
||||
|
||||
|
||||
class MetricAnalyzer:
|
||||
def __init__(self, source: str, endpoint: Optional[str] = None, region: str = "us-east-1"):
|
||||
self.source = source
|
||||
self.endpoint = endpoint
|
||||
self.region = region
|
||||
if source == "cloudwatch" and boto3:
|
||||
self.cloudwatch = boto3.client('cloudwatch', region_name=region)
|
||||
elif source == "cloudwatch" and not boto3:
|
||||
print("⚠️ boto3 not installed. Install with: pip install boto3")
|
||||
sys.exit(1)
|
||||
|
||||
def query_prometheus(self, query: str, hours: int = 24) -> List[Dict]:
|
||||
"""Query Prometheus for metric data."""
|
||||
if not self.endpoint:
|
||||
print("❌ Prometheus endpoint required")
|
||||
sys.exit(1)
|
||||
|
||||
try:
|
||||
# Query range for last N hours
|
||||
end_time = datetime.now()
|
||||
start_time = end_time - timedelta(hours=hours)
|
||||
|
||||
params = {
|
||||
'query': query,
|
||||
'start': start_time.timestamp(),
|
||||
'end': end_time.timestamp(),
|
||||
'step': '5m' # 5-minute resolution
|
||||
}
|
||||
|
||||
response = requests.get(f"{self.endpoint}/api/v1/query_range", params=params, timeout=30)
|
||||
response.raise_for_status()
|
||||
|
||||
data = response.json()
|
||||
if data['status'] != 'success':
|
||||
print(f"❌ Prometheus query failed: {data}")
|
||||
return []
|
||||
|
||||
return data['data']['result']
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ Error querying Prometheus: {e}")
|
||||
return []
|
||||
|
||||
def query_cloudwatch(self, namespace: str, metric_name: str, dimensions: Dict[str, str],
|
||||
hours: int = 24, stat: str = "Average") -> List[Dict]:
|
||||
"""Query CloudWatch for metric data."""
|
||||
try:
|
||||
end_time = datetime.now()
|
||||
start_time = end_time - timedelta(hours=hours)
|
||||
|
||||
dimensions_list = [{'Name': k, 'Value': v} for k, v in dimensions.items()]
|
||||
|
||||
response = self.cloudwatch.get_metric_statistics(
|
||||
Namespace=namespace,
|
||||
MetricName=metric_name,
|
||||
Dimensions=dimensions_list,
|
||||
StartTime=start_time,
|
||||
EndTime=end_time,
|
||||
Period=300, # 5-minute intervals
|
||||
Statistics=[stat]
|
||||
)
|
||||
|
||||
return sorted(response['Datapoints'], key=lambda x: x['Timestamp'])
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ Error querying CloudWatch: {e}")
|
||||
return []
|
||||
|
||||
def detect_anomalies(self, values: List[float], sensitivity: float = 2.0) -> Dict[str, Any]:
|
||||
"""Detect anomalies using standard deviation method."""
|
||||
if len(values) < 10:
|
||||
return {
|
||||
"anomalies_detected": False,
|
||||
"message": "Insufficient data points for anomaly detection"
|
||||
}
|
||||
|
||||
mean = statistics.mean(values)
|
||||
stdev = statistics.stdev(values)
|
||||
threshold_upper = mean + (sensitivity * stdev)
|
||||
threshold_lower = mean - (sensitivity * stdev)
|
||||
|
||||
anomalies = []
|
||||
for i, value in enumerate(values):
|
||||
if value > threshold_upper or value < threshold_lower:
|
||||
anomalies.append({
|
||||
"index": i,
|
||||
"value": value,
|
||||
"deviation": abs(value - mean) / stdev if stdev > 0 else 0
|
||||
})
|
||||
|
||||
return {
|
||||
"anomalies_detected": len(anomalies) > 0,
|
||||
"count": len(anomalies),
|
||||
"anomalies": anomalies,
|
||||
"stats": {
|
||||
"mean": mean,
|
||||
"stdev": stdev,
|
||||
"threshold_upper": threshold_upper,
|
||||
"threshold_lower": threshold_lower,
|
||||
"total_points": len(values)
|
||||
}
|
||||
}
|
||||
|
||||
def analyze_trend(self, values: List[float]) -> Dict[str, Any]:
|
||||
"""Analyze trend using simple linear regression."""
|
||||
if len(values) < 2:
|
||||
return {"trend": "unknown", "message": "Insufficient data"}
|
||||
|
||||
n = len(values)
|
||||
x = list(range(n))
|
||||
x_mean = sum(x) / n
|
||||
y_mean = sum(values) / n
|
||||
|
||||
numerator = sum((x[i] - x_mean) * (values[i] - y_mean) for i in range(n))
|
||||
denominator = sum((x[i] - x_mean) ** 2 for i in range(n))
|
||||
|
||||
if denominator == 0:
|
||||
return {"trend": "flat", "slope": 0}
|
||||
|
||||
slope = numerator / denominator
|
||||
|
||||
# Determine trend direction
|
||||
if abs(slope) < 0.01 * y_mean: # Less than 1% change per interval
|
||||
trend = "stable"
|
||||
elif slope > 0:
|
||||
trend = "increasing"
|
||||
else:
|
||||
trend = "decreasing"
|
||||
|
||||
return {
|
||||
"trend": trend,
|
||||
"slope": slope,
|
||||
"rate_of_change": (slope / y_mean * 100) if y_mean != 0 else 0
|
||||
}
|
||||
|
||||
|
||||
def print_results(results: Dict[str, Any]):
|
||||
"""Pretty print analysis results."""
|
||||
print("\n" + "="*60)
|
||||
print("📊 METRIC ANALYSIS RESULTS")
|
||||
print("="*60)
|
||||
|
||||
if "error" in results:
|
||||
print(f"\n❌ Error: {results['error']}")
|
||||
return
|
||||
|
||||
print(f"\n📈 Data Points: {results.get('data_points', 0)}")
|
||||
|
||||
# Trend analysis
|
||||
if "trend" in results:
|
||||
trend_emoji = {"increasing": "📈", "decreasing": "📉", "stable": "➡️"}.get(results["trend"]["trend"], "❓")
|
||||
print(f"\n{trend_emoji} Trend: {results['trend']['trend'].upper()}")
|
||||
if "rate_of_change" in results["trend"]:
|
||||
print(f" Rate of Change: {results['trend']['rate_of_change']:.2f}% per interval")
|
||||
|
||||
# Anomaly detection
|
||||
if "anomalies" in results:
|
||||
anomaly_data = results["anomalies"]
|
||||
if anomaly_data["anomalies_detected"]:
|
||||
print(f"\n⚠️ ANOMALIES DETECTED: {anomaly_data['count']}")
|
||||
print(f" Mean: {anomaly_data['stats']['mean']:.2f}")
|
||||
print(f" Std Dev: {anomaly_data['stats']['stdev']:.2f}")
|
||||
print(f" Threshold: [{anomaly_data['stats']['threshold_lower']:.2f}, {anomaly_data['stats']['threshold_upper']:.2f}]")
|
||||
|
||||
print("\n Top Anomalies:")
|
||||
for anomaly in sorted(anomaly_data['anomalies'], key=lambda x: x['deviation'], reverse=True)[:5]:
|
||||
print(f" • Index {anomaly['index']}: {anomaly['value']:.2f} ({anomaly['deviation']:.2f}σ)")
|
||||
else:
|
||||
print("\n✅ No anomalies detected")
|
||||
|
||||
print("\n" + "="*60)
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Analyze metrics from Prometheus or CloudWatch",
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Examples:
|
||||
# Prometheus: Analyze request rate
|
||||
python3 analyze_metrics.py prometheus \\
|
||||
--endpoint http://localhost:9090 \\
|
||||
--query 'rate(http_requests_total[5m])' \\
|
||||
--hours 24
|
||||
|
||||
# CloudWatch: Analyze CPU utilization
|
||||
python3 analyze_metrics.py cloudwatch \\
|
||||
--namespace AWS/EC2 \\
|
||||
--metric CPUUtilization \\
|
||||
--dimensions InstanceId=i-1234567890abcdef0 \\
|
||||
--hours 48
|
||||
"""
|
||||
)
|
||||
|
||||
parser.add_argument('source', choices=['prometheus', 'cloudwatch'],
|
||||
help='Metric source')
|
||||
parser.add_argument('--endpoint', help='Prometheus endpoint URL')
|
||||
parser.add_argument('--query', help='PromQL query')
|
||||
parser.add_argument('--namespace', help='CloudWatch namespace')
|
||||
parser.add_argument('--metric', help='CloudWatch metric name')
|
||||
parser.add_argument('--dimensions', help='CloudWatch dimensions (key=value,key2=value2)')
|
||||
parser.add_argument('--hours', type=int, default=24, help='Hours of data to analyze (default: 24)')
|
||||
parser.add_argument('--sensitivity', type=float, default=2.0,
|
||||
help='Anomaly detection sensitivity (std deviations, default: 2.0)')
|
||||
parser.add_argument('--region', default='us-east-1', help='AWS region (default: us-east-1)')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
analyzer = MetricAnalyzer(args.source, args.endpoint, args.region)
|
||||
|
||||
# Query metrics
|
||||
if args.source == 'prometheus':
|
||||
if not args.query:
|
||||
print("❌ --query required for Prometheus")
|
||||
sys.exit(1)
|
||||
|
||||
print(f"🔍 Querying Prometheus: {args.query}")
|
||||
results = analyzer.query_prometheus(args.query, args.hours)
|
||||
|
||||
if not results:
|
||||
print("❌ No data returned")
|
||||
sys.exit(1)
|
||||
|
||||
# Extract values from first result series
|
||||
values = [float(v[1]) for v in results[0].get('values', [])]
|
||||
|
||||
elif args.source == 'cloudwatch':
|
||||
if not all([args.namespace, args.metric, args.dimensions]):
|
||||
print("❌ --namespace, --metric, and --dimensions required for CloudWatch")
|
||||
sys.exit(1)
|
||||
|
||||
dims = dict(item.split('=', 1) for item in args.dimensions.split(','))  # split on first '=' so values may contain '='
|
||||
|
||||
print(f"🔍 Querying CloudWatch: {args.namespace}/{args.metric}")
|
||||
results = analyzer.query_cloudwatch(args.namespace, args.metric, dims, args.hours)
|
||||
|
||||
if not results:
|
||||
print("❌ No data returned")
|
||||
sys.exit(1)
|
||||
|
||||
values = [point['Average'] for point in results]
|
||||
|
||||
# Analyze metrics
|
||||
analysis_results = {
|
||||
"data_points": len(values),
|
||||
"trend": analyzer.analyze_trend(values),
|
||||
"anomalies": analyzer.detect_anomalies(values, args.sensitivity)
|
||||
}
|
||||
|
||||
print_results(analysis_results)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
395
scripts/dashboard_generator.py
Normal file
@@ -0,0 +1,395 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Generate Grafana dashboards from templates.
|
||||
Supports: web applications, Kubernetes, databases, Redis, and custom metrics.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
import json
|
||||
from typing import Dict, List, Any, Optional
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
class DashboardGenerator:
|
||||
def __init__(self, title: str, datasource: str = "Prometheus"):
|
||||
self.title = title
|
||||
self.datasource = datasource
|
||||
self.dashboard = self._create_base_dashboard()
|
||||
self.panel_id = 1
|
||||
self.row_y = 0
|
||||
|
||||
def _create_base_dashboard(self) -> Dict[str, Any]:
|
||||
"""Create base dashboard structure."""
|
||||
return {
|
||||
"dashboard": {
|
||||
"title": self.title,
|
||||
"tags": [],
|
||||
"timezone": "browser",
|
||||
"schemaVersion": 16,
|
||||
"version": 0,
|
||||
"refresh": "30s",
|
||||
"panels": [],
|
||||
"templating": {
|
||||
"list": []
|
||||
},
|
||||
"time": {
|
||||
"from": "now-6h",
|
||||
"to": "now"
|
||||
}
|
||||
},
|
||||
"overwrite": True
|
||||
}
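# Note: the {"dashboard": {...}, "overwrite": true} wrapper mirrors the
# payload accepted by Grafana's dashboard import API, so the generated file
# can also be POSTed to /api/dashboards/db (assumption: a reasonably recent
# Grafana; verify against your version's API docs).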
|
||||
|
||||
def add_variable(self, name: str, label: str, query: str):
|
||||
"""Add a template variable."""
|
||||
variable = {
|
||||
"name": name,
|
||||
"label": label,
|
||||
"type": "query",
|
||||
"datasource": self.datasource,
|
||||
"query": query,
|
||||
"refresh": 1,
|
||||
"regex": "",
|
||||
"multi": False,
|
||||
"includeAll": False
|
||||
}
|
||||
self.dashboard["dashboard"]["templating"]["list"].append(variable)
|
||||
|
||||
def add_row(self, title: str):
|
||||
"""Add a row panel."""
|
||||
panel = {
|
||||
"id": self.panel_id,
|
||||
"type": "row",
|
||||
"title": title,
|
||||
"collapsed": False,
|
||||
"gridPos": {"h": 1, "w": 24, "x": 0, "y": self.row_y}
|
||||
}
|
||||
self.dashboard["dashboard"]["panels"].append(panel)
|
||||
self.panel_id += 1
|
||||
self.row_y += 1
|
||||
|
||||
def add_graph(self, title: str, targets: List[Dict[str, str]], unit: str = "short",
|
||||
width: int = 12, height: int = 8):
|
||||
"""Add a graph panel."""
|
||||
panel = {
|
||||
"id": self.panel_id,
|
||||
"type": "graph",
|
||||
"title": title,
|
||||
"datasource": self.datasource,
|
||||
"targets": [
|
||||
{
|
||||
"expr": target["query"],
|
||||
"legendFormat": target.get("legend", ""),
|
||||
"refId": chr(65 + i) # A, B, C, etc.
|
||||
}
|
||||
for i, target in enumerate(targets)
|
||||
],
|
||||
"gridPos": {"h": height, "w": width, "x": 0, "y": self.row_y},
|
||||
"yaxes": [
|
||||
{"format": unit, "label": None, "show": True},
|
||||
{"format": "short", "label": None, "show": True}
|
||||
],
|
||||
"lines": True,
|
||||
"fill": 1,
|
||||
"linewidth": 2,
|
||||
"legend": {
|
||||
"show": True,
|
||||
"alignAsTable": True,
|
||||
"avg": True,
|
||||
"current": True,
|
||||
"max": True,
|
||||
"min": False,
|
||||
"total": False,
|
||||
"values": True
|
||||
}
|
||||
}
|
||||
self.dashboard["dashboard"]["panels"].append(panel)
|
||||
self.panel_id += 1
|
||||
self.row_y += height
|
||||
|
||||
def add_stat(self, title: str, query: str, unit: str = "short",
|
||||
width: int = 6, height: int = 4):
|
||||
"""Add a stat panel (single value)."""
|
||||
panel = {
|
||||
"id": self.panel_id,
|
||||
"type": "stat",
|
||||
"title": title,
|
||||
"datasource": self.datasource,
|
||||
"targets": [
|
||||
{
|
||||
"expr": query,
|
||||
"refId": "A"
|
||||
}
|
||||
],
|
||||
"gridPos": {"h": height, "w": width, "x": 0, "y": self.row_y},
|
||||
"options": {
|
||||
"graphMode": "area",
|
||||
"orientation": "auto",
|
||||
"reduceOptions": {
|
||||
"values": False,
|
||||
"calcs": ["lastNotNull"]
|
||||
}
|
||||
},
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": unit,
|
||||
"thresholds": {
|
||||
"mode": "absolute",
|
||||
"steps": [
|
||||
{"value": None, "color": "green"},
|
||||
{"value": 80, "color": "red"}
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
self.dashboard["dashboard"]["panels"].append(panel)
|
||||
self.panel_id += 1
|
||||
|
||||
def generate_webapp_dashboard(self, service: str):
|
||||
"""Generate dashboard for web application."""
|
||||
self.add_variable("service", "Service", f"label_values({service}_http_requests_total, service)")
|
||||
|
||||
# Request metrics
|
||||
self.add_row("Request Metrics")
|
||||
|
||||
self.add_graph(
|
||||
"Request Rate",
|
||||
[{"query": f'sum(rate({service}_http_requests_total[5m])) by (status)', "legend": "{{status}}"}],
|
||||
unit="reqps",
|
||||
width=12
|
||||
)
|
||||
|
||||
self.add_graph(
|
||||
"Request Latency (p50, p95, p99)",
|
||||
[
|
||||
{"query": f'histogram_quantile(0.50, sum(rate({service}_http_request_duration_seconds_bucket[5m])) by (le))', "legend": "p50"},
|
||||
{"query": f'histogram_quantile(0.95, sum(rate({service}_http_request_duration_seconds_bucket[5m])) by (le))', "legend": "p95"},
|
||||
{"query": f'histogram_quantile(0.99, sum(rate({service}_http_request_duration_seconds_bucket[5m])) by (le))', "legend": "p99"}
|
||||
],
|
||||
unit="s",
|
||||
width=12
|
||||
)
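# Assumption: the service exposes Prometheus histogram buckets named
# <service>_http_request_duration_seconds_bucket; if your instrumentation
# uses different metric names, adjust the queries above accordingly.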
|
||||
|
||||
# Error rate
|
||||
self.add_row("Errors")
|
||||
|
||||
self.add_graph(
|
||||
"Error Rate (%)",
|
||||
[{"query": f'sum(rate({service}_http_requests_total{{status=~"5.."}}[5m])) / sum(rate({service}_http_requests_total[5m])) * 100', "legend": "Error Rate"}],
|
||||
unit="percent",
|
||||
width=12
|
||||
)
|
||||
|
||||
# Resource usage
|
||||
self.add_row("Resource Usage")
|
||||
|
||||
self.add_graph(
|
||||
"CPU Usage",
|
||||
[{"query": f'sum(rate(process_cpu_seconds_total{{job="{service}"}}[5m])) * 100', "legend": "CPU %"}],
|
||||
unit="percent",
|
||||
width=12
|
||||
)
|
||||
|
||||
self.add_graph(
|
||||
"Memory Usage",
|
||||
[{"query": f'process_resident_memory_bytes{{job="{service}"}}', "legend": "Memory"}],
|
||||
unit="bytes",
|
||||
width=12
|
||||
)
|
||||
|
||||
def generate_kubernetes_dashboard(self, namespace: str):
|
||||
"""Generate dashboard for Kubernetes cluster."""
|
||||
self.add_variable("namespace", "Namespace", f"label_values(kube_pod_info, namespace)")
|
||||
|
||||
# Cluster overview
|
||||
self.add_row("Cluster Overview")
|
||||
|
||||
self.add_stat("Total Pods", f'count(kube_pod_info{{namespace="{namespace}"}})', width=6)
|
||||
self.add_stat("Running Pods", f'count(kube_pod_status_phase{{namespace="{namespace}", phase="Running"}})', width=6)
|
||||
self.add_stat("Pending Pods", f'count(kube_pod_status_phase{{namespace="{namespace}", phase="Pending"}})', width=6)
|
||||
self.add_stat("Failed Pods", f'count(kube_pod_status_phase{{namespace="{namespace}", phase="Failed"}})', width=6)
|
||||
|
||||
# Resource usage
|
||||
self.add_row("Resource Usage")
|
||||
|
||||
self.add_graph(
|
||||
"CPU Usage by Pod",
|
||||
[{"query": f'sum(rate(container_cpu_usage_seconds_total{{namespace="{namespace}"}}[5m])) by (pod)', "legend": "{{pod}}"}],
|
||||
unit="percent",
|
||||
width=12
|
||||
)
|
||||
|
||||
self.add_graph(
|
||||
"Memory Usage by Pod",
|
||||
[{"query": f'sum(container_memory_usage_bytes{{namespace="{namespace}"}}) by (pod)', "legend": "{{pod}}"}],
|
||||
unit="bytes",
|
||||
width=12
|
||||
)
|
||||
|
||||
# Network
|
||||
self.add_row("Network")
|
||||
|
||||
self.add_graph(
|
||||
"Network I/O",
|
||||
[
|
||||
{"query": f'sum(rate(container_network_receive_bytes_total{{namespace="{namespace}"}}[5m])) by (pod)', "legend": "Receive - {{pod}}"},
|
||||
{"query": f'sum(rate(container_network_transmit_bytes_total{{namespace="{namespace}"}}[5m])) by (pod)', "legend": "Transmit - {{pod}}"}
|
||||
],
|
||||
unit="Bps",
|
||||
width=12
|
||||
)
|
||||
|
||||
def generate_database_dashboard(self, db_type: str, instance: str):
|
||||
"""Generate dashboard for database (postgres/mysql)."""
|
||||
if db_type == "postgres":
|
||||
self._generate_postgres_dashboard(instance)
|
||||
elif db_type == "mysql":
|
||||
self._generate_mysql_dashboard(instance)
|
||||
|
||||
def _generate_postgres_dashboard(self, instance: str):
|
||||
"""Generate PostgreSQL dashboard."""
|
||||
self.add_row("PostgreSQL Metrics")
|
||||
|
||||
self.add_graph(
|
||||
"Connections",
|
||||
[
|
||||
{"query": f'pg_stat_database_numbackends{{instance="{instance}"}}', "legend": "{{datname}}"}
|
||||
],
|
||||
unit="short",
|
||||
width=12
|
||||
)
|
||||
|
||||
self.add_graph(
|
||||
"Transactions per Second",
|
||||
[
|
||||
{"query": f'rate(pg_stat_database_xact_commit{{instance="{instance}"}}[5m])', "legend": "Commits"},
|
||||
{"query": f'rate(pg_stat_database_xact_rollback{{instance="{instance}"}}[5m])', "legend": "Rollbacks"}
|
||||
],
|
||||
unit="tps",
|
||||
width=12
|
||||
)
|
||||
|
||||
self.add_graph(
|
||||
"Query Duration (p95)",
|
||||
[
|
||||
{"query": f'histogram_quantile(0.95, rate(pg_stat_statements_total_time_bucket{{instance="{instance}"}}[5m]))', "legend": "p95"}
|
||||
],
|
||||
unit="ms",
|
||||
width=12
|
||||
)
|
||||
|
||||
def _generate_mysql_dashboard(self, instance: str):
|
||||
"""Generate MySQL dashboard."""
|
||||
self.add_row("MySQL Metrics")
|
||||
|
||||
self.add_graph(
|
||||
"Connections",
|
||||
[
|
||||
{"query": f'mysql_global_status_threads_connected{{instance="{instance}"}}', "legend": "Connected"},
|
||||
{"query": f'mysql_global_status_threads_running{{instance="{instance}"}}', "legend": "Running"}
|
||||
],
|
||||
unit="short",
|
||||
width=12
|
||||
)
|
||||
|
||||
self.add_graph(
|
||||
"Queries per Second",
|
||||
[
|
||||
{"query": f'rate(mysql_global_status_queries{{instance="{instance}"}}[5m])', "legend": "Queries"}
|
||||
],
|
||||
unit="qps",
|
||||
width=12
|
||||
)
|
||||
|
||||
def save(self, output_file: str):
|
||||
"""Save dashboard to file."""
|
||||
try:
|
||||
with open(output_file, 'w') as f:
|
||||
json.dump(self.dashboard, f, indent=2)
|
||||
return True
|
||||
except Exception as e:
|
||||
print(f"❌ Error saving dashboard: {e}")
|
||||
return False
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Generate Grafana dashboards from templates",
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Examples:
|
||||
# Web application dashboard
|
||||
python3 dashboard_generator.py webapp \\
|
||||
--title "My API Dashboard" \\
|
||||
--service my_api \\
|
||||
--output dashboard.json
|
||||
|
||||
# Kubernetes dashboard
|
||||
python3 dashboard_generator.py kubernetes \\
|
||||
--title "K8s Namespace" \\
|
||||
--namespace production \\
|
||||
--output k8s-dashboard.json
|
||||
|
||||
# Database dashboard
|
||||
python3 dashboard_generator.py database \\
|
||||
--title "PostgreSQL" \\
|
||||
--db-type postgres \\
|
||||
--instance db.example.com:5432 \\
|
||||
--output db-dashboard.json
|
||||
"""
|
||||
)
|
||||
|
||||
parser.add_argument('type', choices=['webapp', 'kubernetes', 'database'],
|
||||
help='Dashboard type')
|
||||
parser.add_argument('--title', required=True, help='Dashboard title')
|
||||
parser.add_argument('--output', required=True, help='Output file path')
|
||||
parser.add_argument('--datasource', default='Prometheus', help='Data source name')
|
||||
|
||||
# Web app specific
|
||||
parser.add_argument('--service', help='Service name (for webapp)')
|
||||
|
||||
# Kubernetes specific
|
||||
parser.add_argument('--namespace', help='Kubernetes namespace')
|
||||
|
||||
# Database specific
|
||||
parser.add_argument('--db-type', choices=['postgres', 'mysql'], help='Database type')
|
||||
parser.add_argument('--instance', help='Database instance')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
print(f"🎨 Generating {args.type} dashboard: {args.title}")
|
||||
|
||||
generator = DashboardGenerator(args.title, args.datasource)
|
||||
|
||||
if args.type == 'webapp':
|
||||
if not args.service:
|
||||
print("❌ --service required for webapp dashboard")
|
||||
sys.exit(1)
|
||||
generator.generate_webapp_dashboard(args.service)
|
||||
|
||||
elif args.type == 'kubernetes':
|
||||
if not args.namespace:
|
||||
print("❌ --namespace required for kubernetes dashboard")
|
||||
sys.exit(1)
|
||||
generator.generate_kubernetes_dashboard(args.namespace)
|
||||
|
||||
elif args.type == 'database':
|
||||
if not args.db_type or not args.instance:
|
||||
print("❌ --db-type and --instance required for database dashboard")
|
||||
sys.exit(1)
|
||||
generator.generate_database_dashboard(args.db_type, args.instance)
|
||||
|
||||
if generator.save(args.output):
|
||||
print(f"✅ Dashboard saved to: {args.output}")
|
||||
print(f"\n📝 Import to Grafana:")
|
||||
print(f" 1. Go to Grafana → Dashboards → Import")
|
||||
print(f" 2. Upload {args.output}")
|
||||
print(f" 3. Select datasource and save")
|
||||
else:
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
477
scripts/datadog_cost_analyzer.py
Normal file
@@ -0,0 +1,477 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Analyze Datadog usage and identify cost optimization opportunities.
|
||||
Helps find waste in custom metrics, logs, APM, and infrastructure monitoring.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
import os
|
||||
from datetime import datetime, timedelta
|
||||
from typing import Dict, List, Any, Optional
|
||||
from collections import defaultdict
|
||||
|
||||
try:
|
||||
import requests
|
||||
except ImportError:
|
||||
print("⚠️ Warning: 'requests' library not found. Install with: pip install requests")
|
||||
sys.exit(1)
|
||||
|
||||
try:
|
||||
from tabulate import tabulate
|
||||
except ImportError:
|
||||
tabulate = None
|
||||
|
||||
|
||||
class DatadogCostAnalyzer:
|
||||
# Pricing (as of 2024-2025)
|
||||
PRICING = {
|
||||
'infrastructure_pro': 15, # per host per month
|
||||
'infrastructure_enterprise': 23,
|
||||
'custom_metric': 0.01, # per metric per month (first 100 free per host)
|
||||
'log_ingestion': 0.10, # per GB ingested per month
|
||||
'apm_host': 31, # APM Pro per host per month
|
||||
'apm_span': 1.70, # per million indexed spans
|
||||
}
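# Illustrative cost check using the assumed list prices above:
# 50 Pro hosts = 50 * $15 = $750/month before custom metrics, logs, and APM.
# Prices change over time; treat these figures as rough estimates only.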
|
||||
|
||||
def __init__(self, api_key: str, app_key: str, site: str = "datadoghq.com"):
|
||||
self.api_key = api_key
|
||||
self.app_key = app_key
|
||||
self.site = site
|
||||
self.base_url = f"https://api.{site}"
|
||||
self.headers = {
|
||||
'DD-API-KEY': api_key,
|
||||
'DD-APPLICATION-KEY': app_key,
|
||||
'Content-Type': 'application/json'
|
||||
}
|
||||
|
||||
def _make_request(self, endpoint: str, params: Optional[Dict] = None) -> Dict:
|
||||
"""Make API request to Datadog."""
|
||||
try:
|
||||
url = f"{self.base_url}{endpoint}"
|
||||
response = requests.get(url, headers=self.headers, params=params, timeout=30)
|
||||
response.raise_for_status()
|
||||
return response.json()
|
||||
except requests.exceptions.RequestException as e:
|
||||
print(f"❌ API Error: {e}")
|
||||
return {}
|
||||
|
||||
def get_usage_metrics(self, start_date: str, end_date: str) -> List[Dict[str, Any]]:
|
||||
"""Get usage metrics for specified date range."""
|
||||
endpoint = "/api/v1/usage/summary"
|
||||
params = {
|
||||
'start_month': start_date,
|
||||
'end_month': end_date,
|
||||
'include_org_details': 'true'
|
||||
}
|
||||
|
||||
data = self._make_request(endpoint, params)
|
||||
return data.get('usage', [])
|
||||
|
||||
def get_custom_metrics(self) -> Dict[str, Any]:
|
||||
"""Get custom metrics usage and identify high-cardinality metrics."""
|
||||
endpoint = "/api/v1/usage/timeseries"
|
||||
|
||||
# Get last 30 days
|
||||
end_date = datetime.now()
|
||||
start_date = end_date - timedelta(days=30)
|
||||
|
||||
params = {
|
||||
'start_hr': int(start_date.timestamp()),
|
||||
'end_hr': int(end_date.timestamp())
|
||||
}
|
||||
|
||||
data = self._make_request(endpoint, params)
|
||||
|
||||
if not data:
|
||||
return {'metrics': [], 'total_count': 0}
|
||||
|
||||
# Extract custom metrics info
|
||||
usage_data = data.get('usage', [])
|
||||
|
||||
metrics_summary = {
|
||||
'total_custom_metrics': 0,
|
||||
'avg_custom_metrics': 0,
|
||||
'billable_metrics': 0
|
||||
}
|
||||
|
||||
for day in usage_data:
|
||||
if 'timeseries' in day:
|
||||
for ts in day['timeseries']:
|
||||
if ts.get('metric_category') == 'custom':
|
||||
metrics_summary['total_custom_metrics'] = max(
|
||||
metrics_summary['total_custom_metrics'],
|
||||
ts.get('num_custom_timeseries', 0)
|
||||
)
|
||||
|
||||
# Calculate billable (first 100 free)
|
||||
metrics_summary['billable_metrics'] = max(0, metrics_summary['total_custom_metrics'] - 100)
|
||||
|
||||
return metrics_summary
|
||||
|
||||
def get_infrastructure_hosts(self) -> Dict[str, Any]:
|
||||
"""Get infrastructure host count and breakdown."""
|
||||
endpoint = "/api/v1/usage/hosts"
|
||||
|
||||
end_date = datetime.now()
|
||||
start_date = end_date - timedelta(days=30)
|
||||
|
||||
params = {
|
||||
'start_hr': int(start_date.timestamp()),
|
||||
'end_hr': int(end_date.timestamp())
|
||||
}
|
||||
|
||||
data = self._make_request(endpoint, params)
|
||||
|
||||
if not data:
|
||||
return {'total_hosts': 0}
|
||||
|
||||
usage = data.get('usage', [])
|
||||
|
||||
host_summary = {
|
||||
'total_hosts': 0,
|
||||
'agent_hosts': 0,
|
||||
'aws_hosts': 0,
|
||||
'azure_hosts': 0,
|
||||
'gcp_hosts': 0,
|
||||
'container_count': 0
|
||||
}
|
||||
|
||||
for day in usage:
|
||||
host_summary['total_hosts'] = max(host_summary['total_hosts'], day.get('host_count', 0))
|
||||
host_summary['agent_hosts'] = max(host_summary['agent_hosts'], day.get('agent_host_count', 0))
|
||||
host_summary['aws_hosts'] = max(host_summary['aws_hosts'], day.get('aws_host_count', 0))
|
||||
host_summary['azure_hosts'] = max(host_summary['azure_hosts'], day.get('azure_host_count', 0))
|
||||
host_summary['gcp_hosts'] = max(host_summary['gcp_hosts'], day.get('gcp_host_count', 0))
|
||||
host_summary['container_count'] = max(host_summary['container_count'], day.get('container_count', 0))
|
||||
|
||||
return host_summary
|
||||
|
||||
def get_log_usage(self) -> Dict[str, Any]:
|
||||
"""Get log ingestion and retention usage."""
|
||||
endpoint = "/api/v1/usage/logs"
|
||||
|
||||
end_date = datetime.now()
|
||||
start_date = end_date - timedelta(days=30)
|
||||
|
||||
params = {
|
||||
'start_hr': int(start_date.timestamp()),
|
||||
'end_hr': int(end_date.timestamp())
|
||||
}
|
||||
|
||||
data = self._make_request(endpoint, params)
|
||||
|
||||
if not data:
|
||||
return {'total_gb': 0, 'daily_avg_gb': 0}
|
||||
|
||||
usage = data.get('usage', [])
|
||||
|
||||
total_ingested = 0
|
||||
days_count = len(usage)
|
||||
|
||||
for day in usage:
|
||||
total_ingested += day.get('ingested_events_bytes', 0)
|
||||
|
||||
total_gb = total_ingested / (1024**3) # Convert to GB
|
||||
daily_avg_gb = total_gb / max(days_count, 1)
|
||||
|
||||
return {
|
||||
'total_gb': total_gb,
|
||||
'daily_avg_gb': daily_avg_gb,
|
||||
'monthly_projected_gb': daily_avg_gb * 30
|
||||
}
|
||||
|
||||
def get_unused_monitors(self) -> List[Dict[str, Any]]:
|
||||
"""Find monitors that haven't alerted in 30+ days."""
|
||||
endpoint = "/api/v1/monitor"
|
||||
|
||||
data = self._make_request(endpoint)
|
||||
|
||||
if not data:
|
||||
return []
|
||||
|
||||
monitors = data if isinstance(data, list) else []
|
||||
|
||||
unused = []
|
||||
now = datetime.now()
|
||||
|
||||
for monitor in monitors:
|
||||
# Check if monitor has triggered recently
|
||||
overall_state = monitor.get('overall_state')
|
||||
modified = monitor.get('modified', '')
|
||||
|
||||
# If monitor has been in OK state and not modified in 30+ days
|
||||
try:
|
||||
if modified:
|
||||
mod_date = datetime.fromisoformat(modified.replace('Z', '+00:00'))
|
||||
days_since_modified = (now - mod_date.replace(tzinfo=None)).days
|
||||
|
||||
if days_since_modified > 30 and overall_state in ['OK', 'No Data']:
|
||||
unused.append({
|
||||
'name': monitor.get('name', 'Unknown'),
|
||||
'id': monitor.get('id'),
|
||||
'days_since_modified': days_since_modified,
|
||||
'state': overall_state
|
||||
})
|
||||
except (ValueError, AttributeError):  # skip monitors with malformed 'modified' timestamps
|
||||
pass
|
||||
|
||||
return unused
|
||||
|
||||
def calculate_costs(self, usage_data: Dict[str, Any]) -> Dict[str, float]:
|
||||
"""Calculate estimated monthly costs."""
|
||||
costs = {
|
||||
'infrastructure': 0,
|
||||
'custom_metrics': 0,
|
||||
'logs': 0,
|
||||
'apm': 0,
|
||||
'total': 0
|
||||
}
|
||||
|
||||
# Infrastructure (assuming Pro tier)
|
||||
if 'hosts' in usage_data:
|
||||
costs['infrastructure'] = usage_data['hosts'].get('total_hosts', 0) * self.PRICING['infrastructure_pro']
|
||||
|
||||
# Custom metrics
|
||||
if 'custom_metrics' in usage_data:
|
||||
billable = usage_data['custom_metrics'].get('billable_metrics', 0)
|
||||
costs['custom_metrics'] = billable * self.PRICING['custom_metric']
|
||||
|
||||
# Logs
|
||||
if 'logs' in usage_data:
|
||||
monthly_gb = usage_data['logs'].get('monthly_projected_gb', 0)
|
||||
costs['logs'] = monthly_gb * self.PRICING['log_ingestion']
|
||||
|
||||
costs['total'] = sum(costs.values())
|
||||
|
||||
return costs
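# Worked example with assumed inputs: 40 hosts, 2,500 billable custom metrics,
# and 200 GB/month of logs => 40*$15 + 2,500*$0.01 + 200*$0.10 = $645/month.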
|
||||
|
||||
def get_recommendations(self, usage_data: Dict[str, Any]) -> List[Dict[str, str]]:
|
||||
"""Generate cost optimization recommendations."""
|
||||
recommendations = []
|
||||
|
||||
# Custom metrics recommendations
|
||||
if 'custom_metrics' in usage_data:
|
||||
billable = usage_data['custom_metrics'].get('billable_metrics', 0)
|
||||
if billable > 500:
|
||||
savings = (billable * 0.3) * self.PRICING['custom_metric'] # Assume 30% reduction possible
|
||||
recommendations.append({
|
||||
'category': 'Custom Metrics',
|
||||
'issue': f'High custom metric count: {billable:,} billable metrics',
|
||||
'action': 'Review metric tags for high cardinality, consider aggregating or dropping unused metrics',
|
||||
'potential_savings': f'${savings:.2f}/month'
|
||||
})
|
||||
|
||||
# Container vs VM recommendations
|
||||
if 'hosts' in usage_data:
|
||||
hosts = usage_data['hosts'].get('total_hosts', 0)
|
||||
containers = usage_data['hosts'].get('container_count', 0)
|
||||
|
||||
if containers > hosts * 10: # Many containers per host
|
||||
savings = hosts * 0.2 * self.PRICING['infrastructure_pro']
|
||||
recommendations.append({
|
||||
'category': 'Infrastructure',
|
||||
'issue': f'{containers:,} containers running on {hosts} hosts',
|
||||
'action': 'Consider using container monitoring instead of host-based (can be 50-70% cheaper)',
|
||||
'potential_savings': f'${savings:.2f}/month'
|
||||
})
|
||||
|
||||
# Unused monitors
|
||||
if 'unused_monitors' in usage_data:
|
||||
count = len(usage_data['unused_monitors'])
|
||||
if count > 10:
|
||||
recommendations.append({
|
||||
'category': 'Monitors',
|
||||
'issue': f'{count} monitors unused for 30+ days',
|
||||
'action': 'Delete or disable unused monitors to reduce noise and improve performance',
|
||||
'potential_savings': 'Operational efficiency'
|
||||
})
|
||||
|
||||
# Log volume recommendations
|
||||
if 'logs' in usage_data:
|
||||
monthly_gb = usage_data['logs'].get('monthly_projected_gb', 0)
|
||||
if monthly_gb > 100:
|
||||
savings = (monthly_gb * 0.4) * self.PRICING['log_ingestion'] # 40% reduction
|
||||
recommendations.append({
|
||||
'category': 'Logs',
|
||||
'issue': f'High log volume: {monthly_gb:.1f} GB/month projected',
|
||||
'action': 'Review log sources, implement sampling for debug logs, exclude health checks',
|
||||
'potential_savings': f'${savings:.2f}/month'
|
||||
})
|
||||
|
||||
# Migration recommendation if costs are high
|
||||
costs = self.calculate_costs(usage_data)
|
||||
if costs['total'] > 5000:
|
||||
oss_cost = usage_data.get('hosts', {}).get('total_hosts', 0) * 15  # Rough estimate for self-hosted
|
||||
savings = costs['total'] - oss_cost
|
||||
recommendations.append({
|
||||
'category': 'Strategic',
|
||||
'issue': f'Total monthly cost: ${costs["total"]:.2f}',
|
||||
'action': 'Consider migrating to open-source stack (Prometheus + Grafana + Loki)',
|
||||
'potential_savings': f'${savings:.2f}/month (~{(savings/costs["total"]*100):.0f}% reduction)'
|
||||
})
|
||||
|
||||
return recommendations
|
||||
|
||||
|
||||
def print_usage_summary(usage_data: Dict[str, Any]):
|
||||
"""Print usage summary."""
|
||||
print("\n" + "="*70)
|
||||
print("📊 DATADOG USAGE SUMMARY")
|
||||
print("="*70)
|
||||
|
||||
# Infrastructure
|
||||
if 'hosts' in usage_data:
|
||||
hosts = usage_data['hosts']
|
||||
print(f"\n🖥️ Infrastructure:")
|
||||
print(f" Total Hosts: {hosts.get('total_hosts', 0):,}")
|
||||
print(f" Agent Hosts: {hosts.get('agent_hosts', 0):,}")
|
||||
print(f" AWS Hosts: {hosts.get('aws_hosts', 0):,}")
|
||||
print(f" Azure Hosts: {hosts.get('azure_hosts', 0):,}")
|
||||
print(f" GCP Hosts: {hosts.get('gcp_hosts', 0):,}")
|
||||
print(f" Containers: {hosts.get('container_count', 0):,}")
|
||||
|
||||
# Custom Metrics
|
||||
if 'custom_metrics' in usage_data:
|
||||
metrics = usage_data['custom_metrics']
|
||||
print(f"\n📈 Custom Metrics:")
|
||||
print(f" Total: {metrics.get('total_custom_metrics', 0):,}")
|
||||
print(f" Billable: {metrics.get('billable_metrics', 0):,} (first 100 free)")
|
||||
|
||||
# Logs
|
||||
if 'logs' in usage_data:
|
||||
logs = usage_data['logs']
|
||||
print(f"\n📝 Logs:")
|
||||
print(f" Daily Average: {logs.get('daily_avg_gb', 0):.2f} GB")
|
||||
print(f" Monthly Projected: {logs.get('monthly_projected_gb', 0):.2f} GB")
|
||||
|
||||
# Unused Monitors
|
||||
if 'unused_monitors' in usage_data:
|
||||
print(f"\n🔔 Unused Monitors:")
|
||||
print(f" Count: {len(usage_data['unused_monitors'])}")
|
||||
|
||||
|
||||
def print_cost_breakdown(costs: Dict[str, float]):
|
||||
"""Print cost breakdown."""
|
||||
print("\n" + "="*70)
|
||||
print("💰 ESTIMATED MONTHLY COSTS")
|
||||
print("="*70)
|
||||
|
||||
print(f"\n Infrastructure Monitoring: ${costs['infrastructure']:,.2f}")
|
||||
print(f" Custom Metrics: ${costs['custom_metrics']:,.2f}")
|
||||
print(f" Log Management: ${costs['logs']:,.2f}")
|
||||
print(f" APM: ${costs['apm']:,.2f}")
|
||||
print(f" " + "-"*40)
|
||||
print(f" TOTAL: ${costs['total']:,.2f}/month")
|
||||
print(f" ${costs['total']*12:,.2f}/year")
|
||||
|
||||
|
||||
def print_recommendations(recommendations: List[Dict]):
|
||||
"""Print recommendations."""
|
||||
print("\n" + "="*70)
|
||||
print("💡 COST OPTIMIZATION RECOMMENDATIONS")
|
||||
print("="*70)
|
||||
|
||||
total_savings = 0
|
||||
|
||||
for i, rec in enumerate(recommendations, 1):
|
||||
print(f"\n{i}. {rec['category']}")
|
||||
print(f" Issue: {rec['issue']}")
|
||||
print(f" Action: {rec['action']}")
|
||||
print(f" Potential Savings: {rec['potential_savings']}")
|
||||
|
||||
# Extract savings amount if it's a dollar value
|
||||
if '$' in rec['potential_savings']:
|
||||
try:
|
||||
amount = float(rec['potential_savings'].replace('$', '').replace('/month', '').replace(',', ''))
|
||||
total_savings += amount
|
||||
except ValueError:  # savings string is not a plain dollar amount
|
||||
pass
|
||||
|
||||
if total_savings > 0:
|
||||
print(f"\n{'='*70}")
|
||||
print(f"💵 Total Potential Monthly Savings: ${total_savings:,.2f}")
|
||||
print(f"💵 Total Potential Annual Savings: ${total_savings*12:,.2f}")
|
||||
print(f"{'='*70}")
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Analyze Datadog usage and identify cost optimization opportunities",
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Examples:
|
||||
# Analyze current usage
|
||||
python3 datadog_cost_analyzer.py \\
|
||||
--api-key DD_API_KEY \\
|
||||
--app-key DD_APP_KEY
|
||||
|
||||
# Use environment variables
|
||||
export DD_API_KEY=your_api_key
|
||||
export DD_APP_KEY=your_app_key
|
||||
python3 datadog_cost_analyzer.py
|
||||
|
||||
# Specify site (for EU)
|
||||
python3 datadog_cost_analyzer.py --site datadoghq.eu
|
||||
|
||||
Required Datadog Permissions:
|
||||
- usage_read
|
||||
- monitors_read
|
||||
"""
|
||||
)
|
||||
|
||||
parser.add_argument('--api-key',
|
||||
default=os.environ.get('DD_API_KEY'),
|
||||
help='Datadog API key (or set DD_API_KEY env var)')
|
||||
parser.add_argument('--app-key',
|
||||
default=os.environ.get('DD_APP_KEY'),
|
||||
help='Datadog Application key (or set DD_APP_KEY env var)')
|
||||
parser.add_argument('--site',
|
||||
default='datadoghq.com',
|
||||
help='Datadog site (default: datadoghq.com, EU: datadoghq.eu)')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if not args.api_key or not args.app_key:
|
||||
print("❌ Error: API key and Application key required")
|
||||
print(" Set via --api-key and --app-key flags or DD_API_KEY and DD_APP_KEY env vars")
|
||||
sys.exit(1)
|
||||
|
||||
print("🔍 Analyzing Datadog usage...")
|
||||
print(" This may take 30-60 seconds...\n")
|
||||
|
||||
analyzer = DatadogCostAnalyzer(args.api_key, args.app_key, args.site)
|
||||
|
||||
# Gather usage data
|
||||
usage_data = {}
|
||||
|
||||
print(" ⏳ Fetching infrastructure usage...")
|
||||
usage_data['hosts'] = analyzer.get_infrastructure_hosts()
|
||||
|
||||
print(" ⏳ Fetching custom metrics...")
|
||||
usage_data['custom_metrics'] = analyzer.get_custom_metrics()
|
||||
|
||||
print(" ⏳ Fetching log usage...")
|
||||
usage_data['logs'] = analyzer.get_log_usage()
|
||||
|
||||
print(" ⏳ Finding unused monitors...")
|
||||
usage_data['unused_monitors'] = analyzer.get_unused_monitors()
|
||||
|
||||
# Calculate costs
|
||||
costs = analyzer.calculate_costs(usage_data)
|
||||
|
||||
# Generate recommendations
|
||||
recommendations = analyzer.get_recommendations(usage_data)
|
||||
|
||||
# Print results
|
||||
print_usage_summary(usage_data)
|
||||
print_cost_breakdown(costs)
|
||||
print_recommendations(recommendations)
|
||||
|
||||
print("\n" + "="*70)
|
||||
print("✅ Analysis complete!")
|
||||
print("="*70)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
297
scripts/health_check_validator.py
Normal file
@@ -0,0 +1,297 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Validate health check endpoints and analyze response quality.
|
||||
Checks: response time, status code, response format, dependencies.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
import time
|
||||
import json
|
||||
from typing import Dict, List, Any, Optional
|
||||
from urllib.parse import urlparse
|
||||
|
||||
try:
|
||||
import requests
|
||||
except ImportError:
|
||||
print("⚠️ Warning: 'requests' library not found. Install with: pip install requests")
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
class HealthCheckValidator:
|
||||
def __init__(self, timeout: int = 5):
|
||||
self.timeout = timeout
|
||||
self.results = []
|
||||
|
||||
def validate_endpoint(self, url: str) -> Dict[str, Any]:
|
||||
"""Validate a health check endpoint."""
|
||||
result = {
|
||||
"url": url,
|
||||
"timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
|
||||
"checks": [],
|
||||
"warnings": [],
|
||||
"errors": []
|
||||
}
|
||||
|
||||
try:
|
||||
# Make request
|
||||
start_time = time.time()
|
||||
response = requests.get(url, timeout=self.timeout, verify=True)
|
||||
response_time = time.time() - start_time
|
||||
|
||||
result["status_code"] = response.status_code
|
||||
result["response_time"] = response_time
|
||||
|
||||
# Check 1: Status code
|
||||
if response.status_code == 200:
|
||||
result["checks"].append("✅ Status code is 200")
|
||||
else:
|
||||
result["errors"].append(f"❌ Unexpected status code: {response.status_code} (expected 200)")
|
||||
|
||||
# Check 2: Response time
|
||||
if response_time < 1.0:
|
||||
result["checks"].append(f"✅ Response time: {response_time:.3f}s (< 1s)")
|
||||
elif response_time < 3.0:
|
||||
result["warnings"].append(f"⚠️ Slow response time: {response_time:.3f}s (should be < 1s)")
|
||||
else:
|
||||
result["errors"].append(f"❌ Very slow response time: {response_time:.3f}s (should be < 1s)")
|
||||
|
||||
# Check 3: Content type
|
||||
content_type = response.headers.get('Content-Type', '')
|
||||
if 'application/json' in content_type:
|
||||
result["checks"].append("✅ Content-Type is application/json")
|
||||
|
||||
# Try to parse JSON
|
||||
try:
|
||||
data = response.json()
|
||||
result["response_data"] = data
|
||||
|
||||
# Check for common health check fields
|
||||
self._validate_json_structure(data, result)
|
||||
|
||||
except json.JSONDecodeError:
|
||||
result["errors"].append("❌ Invalid JSON response")
|
||||
elif 'text/plain' in content_type:
|
||||
result["warnings"].append("⚠️ Content-Type is text/plain (JSON recommended)")
|
||||
result["response_data"] = response.text
|
||||
else:
|
||||
result["warnings"].append(f"⚠️ Unexpected Content-Type: {content_type}")
|
||||
|
||||
# Check 4: Response headers
|
||||
self._validate_headers(response.headers, result)
|
||||
|
||||
except requests.exceptions.Timeout:
|
||||
result["errors"].append(f"❌ Request timeout (> {self.timeout}s)")
|
||||
result["status_code"] = None
|
||||
result["response_time"] = None
|
||||
|
||||
except requests.exceptions.ConnectionError:
|
||||
result["errors"].append("❌ Connection error (endpoint unreachable)")
|
||||
result["status_code"] = None
|
||||
result["response_time"] = None
|
||||
|
||||
except requests.exceptions.SSLError:
|
||||
result["errors"].append("❌ SSL certificate validation failed")
|
||||
result["status_code"] = None
|
||||
result["response_time"] = None
|
||||
|
||||
except Exception as e:
|
||||
result["errors"].append(f"❌ Unexpected error: {str(e)}")
|
||||
result["status_code"] = None
|
||||
result["response_time"] = None
|
||||
|
||||
# Overall status
|
||||
if result["errors"]:
|
||||
result["overall_status"] = "UNHEALTHY"
|
||||
elif result["warnings"]:
|
||||
result["overall_status"] = "DEGRADED"
|
||||
else:
|
||||
result["overall_status"] = "HEALTHY"
|
||||
|
||||
return result
|
||||
|
||||
def _validate_json_structure(self, data: Dict[str, Any], result: Dict[str, Any]):
|
||||
"""Validate JSON health check structure."""
|
||||
# Check for status field
|
||||
if "status" in data:
|
||||
status = data["status"]
|
||||
if status in ["ok", "healthy", "up", "pass"]:
|
||||
result["checks"].append(f"✅ Status field present: '{status}'")
|
||||
else:
|
||||
result["warnings"].append(f"⚠️ Status field has unexpected value: '{status}'")
|
||||
else:
|
||||
result["warnings"].append("⚠️ Missing 'status' field (recommended)")
|
||||
|
||||
# Check for version/build info
|
||||
if any(key in data for key in ["version", "build", "commit", "timestamp"]):
|
||||
result["checks"].append("✅ Version/build information present")
|
||||
else:
|
||||
result["warnings"].append("⚠️ No version/build information (recommended)")
|
||||
|
||||
# Check for dependencies
|
||||
if "dependencies" in data or "checks" in data or "components" in data:
|
||||
result["checks"].append("✅ Dependency checks present")
|
||||
|
||||
# Validate dependency structure
|
||||
deps = data.get("dependencies") or data.get("checks") or data.get("components")
|
||||
if isinstance(deps, dict):
|
||||
unhealthy_deps = []
|
||||
for name, info in deps.items():
|
||||
if isinstance(info, dict):
|
||||
dep_status = info.get("status", "unknown")
|
||||
if dep_status not in ["ok", "healthy", "up", "pass"]:
|
||||
unhealthy_deps.append(name)
|
||||
elif isinstance(info, str):
|
||||
if info not in ["ok", "healthy", "up", "pass"]:
|
||||
unhealthy_deps.append(name)
|
||||
|
||||
if unhealthy_deps:
|
||||
result["warnings"].append(f"⚠️ Unhealthy dependencies: {', '.join(unhealthy_deps)}")
|
||||
else:
|
||||
result["checks"].append(f"✅ All dependencies healthy ({len(deps)} checked)")
|
||||
else:
|
||||
result["warnings"].append("⚠️ No dependency checks (recommended for production services)")
|
||||
|
||||
# Check for uptime/metrics
|
||||
if any(key in data for key in ["uptime", "metrics", "stats"]):
|
||||
result["checks"].append("✅ Metrics/stats present")
|
||||
|
||||
def _validate_headers(self, headers: Dict[str, str], result: Dict[str, Any]):
|
||||
"""Validate response headers."""
|
||||
# Check for caching headers
|
||||
cache_control = headers.get('Cache-Control', '')
|
||||
if 'no-cache' in cache_control or 'no-store' in cache_control:
|
||||
result["checks"].append("✅ Caching disabled (Cache-Control: no-cache)")
|
||||
else:
|
||||
result["warnings"].append("⚠️ Caching not explicitly disabled (add Cache-Control: no-cache)")
|
||||
|
||||
def validate_multiple(self, urls: List[str]) -> List[Dict[str, Any]]:
|
||||
"""Validate multiple health check endpoints."""
|
||||
results = []
|
||||
for url in urls:
|
||||
print(f"🔍 Checking: {url}")
|
||||
result = self.validate_endpoint(url)
|
||||
results.append(result)
|
||||
return results
|
||||
|
||||
|
||||
def print_result(result: Dict[str, Any], verbose: bool = False):
|
||||
"""Print validation result."""
|
||||
status_emoji = {
|
||||
"HEALTHY": "✅",
|
||||
"DEGRADED": "⚠️",
|
||||
"UNHEALTHY": "❌"
|
||||
}
|
||||
|
||||
print("\n" + "="*60)
|
||||
emoji = status_emoji.get(result["overall_status"], "❓")
|
||||
print(f"{emoji} {result['overall_status']}: {result['url']}")
|
||||
print("="*60)
|
||||
|
||||
if result.get("status_code"):
|
||||
print(f"\n📊 Status Code: {result['status_code']}")
|
||||
print(f"⏱️ Response Time: {result['response_time']:.3f}s")
|
||||
|
||||
# Print checks
|
||||
if result["checks"]:
|
||||
print(f"\n✅ Passed Checks:")
|
||||
for check in result["checks"]:
|
||||
print(f" {check}")
|
||||
|
||||
# Print warnings
|
||||
if result["warnings"]:
|
||||
print(f"\n⚠️ Warnings:")
|
||||
for warning in result["warnings"]:
|
||||
print(f" {warning}")
|
||||
|
||||
# Print errors
|
||||
if result["errors"]:
|
||||
print(f"\n❌ Errors:")
|
||||
for error in result["errors"]:
|
||||
print(f" {error}")
|
||||
|
||||
# Print response data if verbose
|
||||
if verbose and "response_data" in result:
|
||||
print(f"\n📄 Response Data:")
|
||||
if isinstance(result["response_data"], dict):
|
||||
print(json.dumps(result["response_data"], indent=2))
|
||||
else:
|
||||
print(result["response_data"])
|
||||
|
||||
print("="*60)
|
||||
|
||||
|
||||
def print_summary(results: List[Dict[str, Any]]):
|
||||
"""Print summary of multiple validations."""
|
||||
print("\n" + "="*60)
|
||||
print("📊 HEALTH CHECK VALIDATION SUMMARY")
|
||||
print("="*60)
|
||||
|
||||
healthy = sum(1 for r in results if r["overall_status"] == "HEALTHY")
|
||||
degraded = sum(1 for r in results if r["overall_status"] == "DEGRADED")
|
||||
unhealthy = sum(1 for r in results if r["overall_status"] == "UNHEALTHY")
|
||||
|
||||
print(f"\n✅ Healthy: {healthy}/{len(results)}")
|
||||
print(f"⚠️ Degraded: {degraded}/{len(results)}")
|
||||
print(f"❌ Unhealthy: {unhealthy}/{len(results)}")
|
||||
|
||||
if results:
|
||||
timed = [r["response_time"] for r in results if r.get("response_time")]
avg_response_time = sum(timed) / len(timed) if timed else 0.0
|
||||
print(f"\n⏱️ Average Response Time: {avg_response_time:.3f}s")
|
||||
|
||||
print("="*60)
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Validate health check endpoints",
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Examples:
|
||||
# Check a single endpoint
|
||||
python3 health_check_validator.py https://api.example.com/health
|
||||
|
||||
# Check multiple endpoints
|
||||
python3 health_check_validator.py \\
|
||||
https://api.example.com/health \\
|
||||
https://api.example.com/readiness
|
||||
|
||||
# Verbose output with response data
|
||||
python3 health_check_validator.py https://api.example.com/health --verbose
|
||||
|
||||
# Custom timeout
|
||||
python3 health_check_validator.py https://api.example.com/health --timeout 10
|
||||
|
||||
Best Practices Checked:
|
||||
✓ Returns 200 status code
|
||||
✓ Response time < 1 second
|
||||
✓ Returns JSON format
|
||||
✓ Contains 'status' field
|
||||
✓ Includes version/build info
|
||||
✓ Checks dependencies
|
||||
✓ Includes metrics
|
||||
✓ Disables caching
|
||||
"""
|
||||
)
|
||||
|
||||
parser.add_argument('urls', nargs='+', help='Health check endpoint URL(s)')
|
||||
parser.add_argument('--timeout', type=int, default=5, help='Request timeout in seconds (default: 5)')
|
||||
parser.add_argument('--verbose', action='store_true', help='Show detailed response data')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
validator = HealthCheckValidator(timeout=args.timeout)
|
||||
|
||||
results = validator.validate_multiple(args.urls)
|
||||
|
||||
# Print individual results
|
||||
for result in results:
|
||||
print_result(result, args.verbose)
|
||||
|
||||
# Print summary if multiple endpoints
|
||||
if len(results) > 1:
|
||||
print_summary(results)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
321
scripts/log_analyzer.py
Normal file
@@ -0,0 +1,321 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Parse and analyze logs for patterns, errors, and anomalies.
|
||||
Supports: error detection, frequency analysis, pattern matching.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
import re
|
||||
import json
|
||||
from collections import Counter, defaultdict
|
||||
from datetime import datetime
|
||||
from typing import Dict, List, Any, Optional
|
||||
from pathlib import Path
|
||||
|
||||
try:
|
||||
from tabulate import tabulate
|
||||
except ImportError:
|
||||
tabulate = None
|
||||
|
||||
|
||||
class LogAnalyzer:
|
||||
# Common log level patterns
|
||||
LOG_LEVELS = {
|
||||
'ERROR': r'\b(ERROR|Error|error)\b',
|
||||
'WARN': r'\b(WARN|Warning|warn|warning)\b',
|
||||
'INFO': r'\b(INFO|Info|info)\b',
|
||||
'DEBUG': r'\b(DEBUG|Debug|debug)\b',
|
||||
'FATAL': r'\b(FATAL|Fatal|fatal|CRITICAL|Critical)\b'
|
||||
}
|
||||
|
||||
# Common error patterns
|
||||
ERROR_PATTERNS = {
|
||||
'exception': r'Exception|exception|EXCEPTION',
|
||||
'stack_trace': r'\s+at\s+.*\(.*:\d+\)',
|
||||
'http_error': r'\b[45]\d{2}\b', # 4xx and 5xx HTTP codes
|
||||
'timeout': r'timeout|timed out|TIMEOUT',
|
||||
'connection_refused': r'connection refused|ECONNREFUSED',
|
||||
'out_of_memory': r'OutOfMemoryError|OOM|out of memory',
|
||||
'null_pointer': r'NullPointerException|null pointer|NPE',
|
||||
'database_error': r'SQLException|database error|DB error'
|
||||
}
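# Example: a line such as "ERROR 504 GatewayTimeout: upstream timed out"
# would be counted under both 'http_error' and 'timeout'; the patterns are
# intentionally not mutually exclusive.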
|
||||
|
||||
def __init__(self, log_file: str):
|
||||
self.log_file = log_file
|
||||
self.lines = []
|
||||
self.log_levels = Counter()
|
||||
self.error_patterns = Counter()
|
||||
self.timestamps = []
|
||||
|
||||
def parse_file(self) -> bool:
|
||||
"""Parse log file."""
|
||||
try:
|
||||
with open(self.log_file, 'r', encoding='utf-8', errors='ignore') as f:
|
||||
self.lines = f.readlines()
|
||||
return True
|
||||
except Exception as e:
|
||||
print(f"❌ Error reading file: {e}")
|
||||
return False
|
||||
|
||||
def analyze_log_levels(self):
|
||||
"""Count log levels."""
|
||||
for line in self.lines:
|
||||
for level, pattern in self.LOG_LEVELS.items():
|
||||
if re.search(pattern, line):
|
||||
self.log_levels[level] += 1
|
||||
break # Count each line only once
|
||||
|
||||
def analyze_error_patterns(self):
|
||||
"""Detect common error patterns."""
|
||||
for line in self.lines:
|
||||
for pattern_name, pattern in self.ERROR_PATTERNS.items():
|
||||
if re.search(pattern, line, re.IGNORECASE):
|
||||
self.error_patterns[pattern_name] += 1
|
||||
|
||||
def extract_timestamps(self, timestamp_pattern: Optional[str] = None):
|
||||
"""Extract timestamps from logs."""
|
||||
if not timestamp_pattern:
|
||||
# Common timestamp patterns
|
||||
patterns = [
|
||||
r'\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}', # ISO format
|
||||
r'\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}', # Apache format
|
||||
r'\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}', # Syslog format
|
||||
]
|
||||
else:
|
||||
patterns = [timestamp_pattern]
|
||||
|
||||
for line in self.lines:
|
||||
for pattern in patterns:
|
||||
match = re.search(pattern, line)
|
||||
if match:
|
||||
self.timestamps.append(match.group())
|
||||
break
|
||||
|
||||
def find_error_lines(self, context: int = 2) -> List[Dict[str, Any]]:
|
||||
"""Find error lines with context."""
|
||||
errors = []
|
||||
|
||||
for i, line in enumerate(self.lines):
|
||||
# Check if line contains error keywords
|
||||
is_error = any(re.search(pattern, line, re.IGNORECASE)
|
||||
for pattern in [self.LOG_LEVELS['ERROR'], self.LOG_LEVELS['FATAL']])
|
||||
|
||||
if is_error:
|
||||
# Get context lines
|
||||
start = max(0, i - context)
|
||||
end = min(len(self.lines), i + context + 1)
|
||||
context_lines = self.lines[start:end]
|
||||
|
||||
errors.append({
|
||||
'line_number': i + 1,
|
||||
'line': line.strip(),
|
||||
'context': ''.join(context_lines)
|
||||
})
|
||||
|
||||
return errors
|
||||
|
||||
def analyze_frequency(self, time_window_minutes: int = 5) -> Dict[str, Any]:
|
||||
"""Analyze log frequency over time."""
|
||||
if not self.timestamps:
|
||||
return {"error": "No timestamps found"}
|
||||
|
||||
# This is a simplified version - in production you'd parse actual timestamps
|
||||
total_lines = len(self.lines)
|
||||
if self.timestamps:
|
||||
time_span = len(self.timestamps)
|
||||
avg_per_window = total_lines / max(1, time_span / time_window_minutes)
|
||||
else:
|
||||
avg_per_window = 0
|
||||
|
||||
return {
|
||||
"total_lines": total_lines,
|
||||
"timestamps_found": len(self.timestamps),
|
||||
"avg_per_window": avg_per_window
|
||||
}
|
||||
|
||||
def extract_unique_messages(self, pattern: str) -> List[str]:
|
||||
"""Extract unique messages matching a pattern."""
|
||||
matches = []
|
||||
seen = set()
|
||||
|
||||
for line in self.lines:
|
||||
match = re.search(pattern, line, re.IGNORECASE)
|
||||
if match:
|
||||
msg = match.group() if match.lastindex is None else match.group(1)
|
||||
if msg not in seen:
|
||||
matches.append(msg)
|
||||
seen.add(msg)
|
||||
|
||||
return matches
|
||||
|
||||
def find_stack_traces(self) -> List[Dict[str, Any]]:
|
||||
"""Extract complete stack traces."""
|
||||
stack_traces = []
|
||||
current_trace = []
|
||||
in_trace = False
|
||||
|
||||
for i, line in enumerate(self.lines):
|
||||
# Start of stack trace
|
||||
if re.search(r'Exception|Error.*:', line):
|
||||
if current_trace:
|
||||
stack_traces.append({
|
||||
'line_start': i - len(current_trace) + 1,
|
||||
'trace': '\n'.join(current_trace)
|
||||
})
|
||||
current_trace = [line.strip()]
|
||||
in_trace = True
|
||||
# Stack trace continuation
|
||||
elif in_trace and re.search(r'^\s+at\s+', line):
|
||||
current_trace.append(line.strip())
|
||||
# End of stack trace
|
||||
elif in_trace:
|
||||
if current_trace:
|
||||
stack_traces.append({
|
||||
'line_start': i - len(current_trace) + 1,
|
||||
'trace': '\n'.join(current_trace)
|
||||
})
|
||||
current_trace = []
|
||||
in_trace = False
|
||||
|
||||
# Add last trace if exists
|
||||
if current_trace:
|
||||
stack_traces.append({
|
||||
'line_start': len(self.lines) - len(current_trace) + 1,
|
||||
'trace': '\n'.join(current_trace)
|
||||
})
|
||||
|
||||
return stack_traces
|
||||
|
||||
|
||||
def print_analysis_results(analyzer: LogAnalyzer, show_errors: bool = False,
|
||||
show_traces: bool = False):
|
||||
"""Print analysis results."""
|
||||
print("\n" + "="*60)
|
||||
print("📝 LOG ANALYSIS RESULTS")
|
||||
print("="*60)
|
||||
|
||||
print(f"\n📁 File: {analyzer.log_file}")
|
||||
print(f"📊 Total Lines: {len(analyzer.lines):,}")
|
||||
|
||||
# Log levels
|
||||
if analyzer.log_levels:
|
||||
print(f"\n{'='*60}")
|
||||
print("📊 LOG LEVEL DISTRIBUTION:")
|
||||
print(f"{'='*60}")
|
||||
|
||||
level_emoji = {
|
||||
'FATAL': '🔴',
|
||||
'ERROR': '❌',
|
||||
'WARN': '⚠️',
|
||||
'INFO': 'ℹ️',
|
||||
'DEBUG': '🐛'
|
||||
}
|
||||
|
||||
for level, count in analyzer.log_levels.most_common():
|
||||
emoji = level_emoji.get(level, '•')
|
||||
percentage = (count / len(analyzer.lines)) * 100
|
||||
print(f"{emoji} {level:10s}: {count:6,} ({percentage:5.1f}%)")
|
||||
|
||||
# Error patterns
|
||||
if analyzer.error_patterns:
|
||||
print(f"\n{'='*60}")
|
||||
print("🔍 ERROR PATTERNS DETECTED:")
|
||||
print(f"{'='*60}")
|
||||
|
||||
for pattern, count in analyzer.error_patterns.most_common(10):
|
||||
print(f"• {pattern:20s}: {count:,} occurrences")
|
||||
|
||||
# Timestamps
|
||||
if analyzer.timestamps:
|
||||
print(f"\n{'='*60}")
|
||||
print(f"⏰ Timestamps Found: {len(analyzer.timestamps):,}")
|
||||
print(f" First: {analyzer.timestamps[0]}")
|
||||
print(f" Last: {analyzer.timestamps[-1]}")
|
||||
|
||||
# Error lines
|
||||
if show_errors:
|
||||
errors = analyzer.find_error_lines(context=1)
|
||||
if errors:
|
||||
print(f"\n{'='*60}")
|
||||
print(f"❌ ERROR LINES (showing first 10 of {len(errors)}):")
|
||||
print(f"{'='*60}")
|
||||
|
||||
for error in errors[:10]:
|
||||
print(f"\nLine {error['line_number']}:")
|
||||
print(f" {error['line']}")
|
||||
|
||||
# Stack traces
|
||||
if show_traces:
|
||||
traces = analyzer.find_stack_traces()
|
||||
if traces:
|
||||
print(f"\n{'='*60}")
|
||||
print(f"📚 STACK TRACES FOUND: {len(traces)}")
|
||||
print(f"{'='*60}")
|
||||
|
||||
for i, trace in enumerate(traces[:5], 1):
|
||||
print(f"\nTrace {i} (starting at line {trace['line_start']}):")
|
||||
print(trace['trace'])
|
||||
if i < len(traces):
|
||||
print("\n" + "-"*60)
|
||||
|
||||
print("\n" + "="*60)
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Analyze log files for errors, patterns, and anomalies",
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Examples:
|
||||
# Basic analysis
|
||||
python3 log_analyzer.py application.log
|
||||
|
||||
# Show error lines with context
|
||||
python3 log_analyzer.py application.log --show-errors
|
||||
|
||||
# Show stack traces
|
||||
python3 log_analyzer.py application.log --show-traces
|
||||
|
||||
# Full analysis
|
||||
python3 log_analyzer.py application.log --show-errors --show-traces
|
||||
|
||||
Features:
|
||||
• Log level distribution (ERROR, WARN, INFO, DEBUG, FATAL)
|
||||
• Common error pattern detection
|
||||
• Timestamp extraction
|
||||
• Error line identification with context
|
||||
• Stack trace extraction
|
||||
• Frequency analysis
|
||||
"""
|
||||
)
|
||||
|
||||
parser.add_argument('log_file', help='Path to log file')
|
||||
parser.add_argument('--show-errors', action='store_true', help='Show error lines')
|
||||
parser.add_argument('--show-traces', action='store_true', help='Show stack traces')
|
||||
parser.add_argument('--timestamp-pattern', help='Custom regex for timestamp extraction')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if not Path(args.log_file).exists():
|
||||
print(f"❌ File not found: {args.log_file}")
|
||||
sys.exit(1)
|
||||
|
||||
print(f"🔍 Analyzing log file: {args.log_file}")
|
||||
|
||||
analyzer = LogAnalyzer(args.log_file)
|
||||
|
||||
if not analyzer.parse_file():
|
||||
sys.exit(1)
|
||||
|
||||
# Perform analysis
|
||||
analyzer.analyze_log_levels()
|
||||
analyzer.analyze_error_patterns()
|
||||
analyzer.extract_timestamps(args.timestamp_pattern)
|
||||
|
||||
# Print results
|
||||
print_analysis_results(analyzer, args.show_errors, args.show_traces)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
365
scripts/slo_calculator.py
Normal file
@@ -0,0 +1,365 @@
#!/usr/bin/env python3
"""
Calculate SLO compliance, error budgets, and burn rates.
Supports availability SLOs and latency SLOs.
"""

import argparse
import sys
from datetime import datetime, timedelta
from typing import Dict, Any, Optional

try:
    from tabulate import tabulate
except ImportError:
    print("⚠️ Warning: 'tabulate' library not found. Install with: pip install tabulate")
    tabulate = None


class SLOCalculator:
    # SLO targets and allowed downtime per period
    SLO_TARGETS = {
        "90.0": {"year": 36.5, "month": 3.0, "week": 0.7, "day": 0.1},  # days
        "95.0": {"year": 18.25, "month": 1.5, "week": 0.35, "day": 0.05},
        "99.0": {"year": 3.65, "month": 0.3, "week": 0.07, "day": 0.01},
        "99.5": {"year": 1.83, "month": 0.15, "week": 0.035, "day": 0.005},
        "99.9": {"year": 0.365, "month": 0.03, "week": 0.007, "day": 0.001},
        "99.95": {"year": 0.183, "month": 0.015, "week": 0.0035, "day": 0.0005},
        "99.99": {"year": 0.0365, "month": 0.003, "week": 0.0007, "day": 0.0001},
    }

    def __init__(self, slo_target: float, period_days: int = 30):
        """
        Initialize SLO calculator.

        Args:
            slo_target: SLO target percentage (e.g., 99.9)
            period_days: Time period in days (default: 30)
        """
        self.slo_target = slo_target
        self.period_days = period_days
        self.error_budget_minutes = self.calculate_error_budget_minutes()

    def calculate_error_budget_minutes(self) -> float:
        """Calculate error budget in minutes for the period."""
        total_minutes = self.period_days * 24 * 60
        allowed_error_rate = (100 - self.slo_target) / 100
        return total_minutes * allowed_error_rate
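    # Worked example (illustrative): with slo_target=99.9 and period_days=30,
    # total_minutes = 30 * 24 * 60 = 43,200 and allowed_error_rate = 0.001,
    # so the error budget is 43,200 * 0.001 = 43.2 minutes of unavailability.
    # The same arithmetic produces the SLO_TARGETS table above, e.g. 99.9% over a
    # year allows 0.001 * 365 = 0.365 days (about 8.76 hours).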
    def calculate_availability_slo(self, total_requests: int, failed_requests: int) -> Dict[str, Any]:
        """
        Calculate availability SLO compliance.

        Args:
            total_requests: Total number of requests
            failed_requests: Number of failed requests

        Returns:
            Dict with SLO compliance metrics
        """
        if total_requests == 0:
            return {
                "error": "No requests in the period",
                "slo_met": False
            }

        success_rate = ((total_requests - failed_requests) / total_requests) * 100
        error_rate = (failed_requests / total_requests) * 100

        # Calculate error budget consumption
        allowed_failures = total_requests * ((100 - self.slo_target) / 100)
        error_budget_consumed = (failed_requests / allowed_failures) * 100 if allowed_failures > 0 else float('inf')
        error_budget_remaining = max(0, 100 - error_budget_consumed)

        # Determine if SLO is met
        slo_met = success_rate >= self.slo_target

        return {
            "slo_target": self.slo_target,
            "period_days": self.period_days,
            "total_requests": total_requests,
            "failed_requests": failed_requests,
            "success_requests": total_requests - failed_requests,
            "success_rate": success_rate,
            "error_rate": error_rate,
            "slo_met": slo_met,
            "error_budget_total": allowed_failures,
            "error_budget_consumed": error_budget_consumed,
            "error_budget_remaining": error_budget_remaining,
            "margin": success_rate - self.slo_target
        }
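    # Worked example (illustrative), matching the CLI example in main() below:
    # total_requests=1_000_000 and failed_requests=1_500 against a 99.9% target give
    # success_rate = 99.85% (margin -0.05%, SLO violated), allowed_failures = 1,000,
    # so error_budget_consumed = 150% and error_budget_remaining = 0%.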
    def calculate_latency_slo(self, total_requests: int, requests_exceeding_threshold: int) -> Dict[str, Any]:
        """
        Calculate latency SLO compliance.

        Args:
            total_requests: Total number of requests
            requests_exceeding_threshold: Number of requests exceeding latency threshold

        Returns:
            Dict with SLO compliance metrics
        """
        if total_requests == 0:
            return {
                "error": "No requests in the period",
                "slo_met": False
            }

        within_threshold_rate = ((total_requests - requests_exceeding_threshold) / total_requests) * 100

        # Calculate error budget consumption
        allowed_slow_requests = total_requests * ((100 - self.slo_target) / 100)
        error_budget_consumed = (requests_exceeding_threshold / allowed_slow_requests) * 100 if allowed_slow_requests > 0 else float('inf')
        error_budget_remaining = max(0, 100 - error_budget_consumed)

        slo_met = within_threshold_rate >= self.slo_target

        return {
            "slo_target": self.slo_target,
            "period_days": self.period_days,
            "total_requests": total_requests,
            "requests_exceeding_threshold": requests_exceeding_threshold,
            "requests_within_threshold": total_requests - requests_exceeding_threshold,
            "within_threshold_rate": within_threshold_rate,
            # Aliases so print_availability_results() can render latency results with
            # the same layout (main() reuses that printer for latency mode).
            "success_rate": within_threshold_rate,
            "error_rate": 100 - within_threshold_rate,
            "success_requests": total_requests - requests_exceeding_threshold,
            "failed_requests": requests_exceeding_threshold,
            "slo_met": slo_met,
            "error_budget_total": allowed_slow_requests,
            "error_budget_consumed": error_budget_consumed,
            "error_budget_remaining": error_budget_remaining,
            "margin": within_threshold_rate - self.slo_target
        }
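    # Worked example (illustrative), matching the CLI example in main() below:
    # total_requests=500_000 with 3_000 requests over the latency threshold against
    # a 99.5% target gives within_threshold_rate = 99.4% (SLO violated),
    # allowed_slow_requests = 2,500, so error_budget_consumed = 120%.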
    def calculate_burn_rate(self, errors_in_window: int, requests_in_window: int, window_hours: float) -> Dict[str, Any]:
        """
        Calculate error budget burn rate.

        Args:
            errors_in_window: Number of errors in the time window
            requests_in_window: Total requests in the time window
            window_hours: Size of the time window in hours

        Returns:
            Dict with burn rate metrics
        """
        if requests_in_window == 0:
            return {"error": "No requests in window"}

        # Calculate actual error rate in this window
        actual_error_rate = (errors_in_window / requests_in_window) * 100

        # Calculate allowed error rate for SLO
        allowed_error_rate = 100 - self.slo_target

        # Burn rate = actual error rate / allowed error rate
        burn_rate = actual_error_rate / allowed_error_rate if allowed_error_rate > 0 else float('inf')

        # Calculate time to exhaustion: at a burn rate of 1x the budget lasts the
        # full period, so exhaustion time scales as period / burn_rate.
        if burn_rate > 0:
            period_hours = self.period_days * 24
            hours_to_exhaustion = period_hours / burn_rate
        else:
            hours_to_exhaustion = float('inf')

        # Determine severity
        if burn_rate >= 14.4:  # 1 hour window, burns budget in 2 days
            severity = "critical"
        elif burn_rate >= 6:  # 6 hour window, burns budget in 5 days
            severity = "warning"
        elif burn_rate >= 1:
            severity = "elevated"
        else:
            severity = "normal"

        return {
            "window_hours": window_hours,
            "requests_in_window": requests_in_window,
            "errors_in_window": errors_in_window,
            "actual_error_rate": actual_error_rate,
            "allowed_error_rate": allowed_error_rate,
            "burn_rate": burn_rate,
            "hours_to_exhaustion": hours_to_exhaustion,
            "severity": severity
        }
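    # Worked example (illustrative), matching the CLI example in main() below:
    # a 1-hour window with 50 errors out of 10,000 requests against a 99.9% / 30-day
    # SLO gives actual_error_rate = 0.5%, allowed_error_rate = 0.1%, burn_rate = 5x,
    # hours_to_exhaustion = 720 / 5 = 144 hours (6 days), severity "elevated".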
    @staticmethod
    def print_slo_table():
        """Print table of common SLO targets and allowed downtime."""
        if not tabulate:
            print("Install tabulate for formatted output: pip install tabulate")
            return

        print("\n📊 SLO TARGETS AND ALLOWED DOWNTIME")
        print("="*60)

        headers = ["SLO", "Year", "Month", "Week", "Day"]
        rows = []

        for slo, downtimes in sorted(SLOCalculator.SLO_TARGETS.items(), reverse=True):
            row = [
                f"{slo}%",
                f"{downtimes['year']:.2f} days",
                f"{downtimes['month']:.2f} days",
                f"{downtimes['week']:.2f} days",
                f"{downtimes['day']:.2f} days"
            ]
            rows.append(row)

        print(tabulate(rows, headers=headers, tablefmt="grid"))


def print_availability_results(results: Dict[str, Any]):
    """Print availability SLO results."""
    print("\n" + "="*60)
    print("📊 AVAILABILITY SLO COMPLIANCE")
    print("="*60)

    if "error" in results:
        print(f"\n❌ Error: {results['error']}")
        return

    status_emoji = "✅" if results['slo_met'] else "❌"
    print(f"\n{status_emoji} SLO Status: {'MET' if results['slo_met'] else 'VIOLATED'}")
    print(f"   Target: {results['slo_target']}%")
    print(f"   Actual: {results['success_rate']:.3f}%")
    print(f"   Margin: {results['margin']:+.3f}%")

    print(f"\n📈 Request Statistics:")
    print(f"   Total Requests: {results['total_requests']:,}")
    print(f"   Successful: {results['success_requests']:,}")
    print(f"   Failed: {results['failed_requests']:,}")
    print(f"   Error Rate: {results['error_rate']:.3f}%")

    print(f"\n💰 Error Budget:")
    budget_emoji = "✅" if results['error_budget_remaining'] > 20 else "⚠️" if results['error_budget_remaining'] > 0 else "❌"
    print(f"   {budget_emoji} Remaining: {results['error_budget_remaining']:.1f}%")
    print(f"   Consumed: {results['error_budget_consumed']:.1f}%")
    print(f"   Allowed Failures: {results['error_budget_total']:.0f}")

    print("\n" + "="*60)


def print_burn_rate_results(results: Dict[str, Any]):
    """Print burn rate results."""
    print("\n" + "="*60)
    print("🔥 ERROR BUDGET BURN RATE")
    print("="*60)

    if "error" in results:
        print(f"\n❌ Error: {results['error']}")
        return

    severity_emoji = {
        "critical": "🔴",
        "warning": "🟡",
        "elevated": "🟠",
        "normal": "🟢"
    }

    print(f"\n{severity_emoji.get(results['severity'], '❓')} Severity: {results['severity'].upper()}")
    print(f"   Burn Rate: {results['burn_rate']:.2f}x")
    print(f"   Time to Exhaustion: {results['hours_to_exhaustion']:.1f} hours ({results['hours_to_exhaustion']/24:.1f} days)")

    print(f"\n📊 Window Statistics:")
    print(f"   Window: {results['window_hours']} hours")
    print(f"   Requests: {results['requests_in_window']:,}")
    print(f"   Errors: {results['errors_in_window']:,}")
    print(f"   Actual Error Rate: {results['actual_error_rate']:.3f}%")
    print(f"   Allowed Error Rate: {results['allowed_error_rate']:.3f}%")

    print("\n" + "="*60)


def main():
    parser = argparse.ArgumentParser(
        description="Calculate SLO compliance and error budgets",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Show SLO reference table
  python3 slo_calculator.py --table

  # Calculate availability SLO
  python3 slo_calculator.py availability \\
      --slo 99.9 \\
      --total-requests 1000000 \\
      --failed-requests 1500 \\
      --period-days 30

  # Calculate latency SLO
  python3 slo_calculator.py latency \\
      --slo 99.5 \\
      --total-requests 500000 \\
      --slow-requests 3000 \\
      --period-days 7

  # Calculate burn rate
  python3 slo_calculator.py burn-rate \\
      --slo 99.9 \\
      --errors 50 \\
      --requests 10000 \\
      --window-hours 1
        """
    )

    parser.add_argument('mode', nargs='?', choices=['availability', 'latency', 'burn-rate'],
                        help='Calculation mode')
    parser.add_argument('--table', action='store_true', help='Show SLO reference table')
    parser.add_argument('--slo', type=float, help='SLO target percentage (e.g., 99.9)')
    parser.add_argument('--period-days', type=int, default=30, help='Period in days (default: 30)')

    # Availability SLO arguments
    parser.add_argument('--total-requests', type=int, help='Total number of requests')
    parser.add_argument('--failed-requests', type=int, help='Number of failed requests')

    # Latency SLO arguments
    parser.add_argument('--slow-requests', type=int, help='Number of requests exceeding threshold')

    # Burn rate arguments
    parser.add_argument('--errors', type=int, help='Number of errors in window')
    parser.add_argument('--requests', type=int, help='Number of requests in window')
    parser.add_argument('--window-hours', type=float, help='Window size in hours')

    args = parser.parse_args()

    # Show table if requested
    if args.table:
        SLOCalculator.print_slo_table()
        return

    if not args.mode:
        parser.print_help()
        return

    if not args.slo:
        print("❌ --slo required")
        sys.exit(1)

    calculator = SLOCalculator(args.slo, args.period_days)

    if args.mode == 'availability':
        if not args.total_requests or args.failed_requests is None:
            print("❌ --total-requests and --failed-requests required")
            sys.exit(1)

        results = calculator.calculate_availability_slo(args.total_requests, args.failed_requests)
        print_availability_results(results)

    elif args.mode == 'latency':
        if not args.total_requests or args.slow_requests is None:
            print("❌ --total-requests and --slow-requests required")
            sys.exit(1)

        results = calculator.calculate_latency_slo(args.total_requests, args.slow_requests)
        print_availability_results(results)  # Same format

    elif args.mode == 'burn-rate':
        if not all([args.errors is not None, args.requests, args.window_hours]):
            print("❌ --errors, --requests, and --window-hours required")
            sys.exit(1)

        results = calculator.calculate_burn_rate(args.errors, args.requests, args.window_hours)
        print_burn_rate_results(results)


if __name__ == "__main__":
    main()
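The calculator can also be used as a library from other tooling, for example a periodic job that pages on fast burn. The sketch below is illustrative only and not part of the committed script; it assumes `scripts/` is on the import path so the file loads as `slo_calculator`, and it uses only the methods and result keys defined above.

```python
# Minimal programmatic use of SLOCalculator — illustrative, mirrors the CLI examples above.
from slo_calculator import SLOCalculator

calc = SLOCalculator(slo_target=99.9, period_days=30)

availability = calc.calculate_availability_slo(total_requests=1_000_000, failed_requests=1_500)
burn = calc.calculate_burn_rate(errors_in_window=50, requests_in_window=10_000, window_hours=1)

if not availability["slo_met"] or burn["severity"] in ("critical", "warning"):
    print(f"SLO at risk: burn rate {burn['burn_rate']:.1f}x, "
          f"~{burn['hours_to_exhaustion']:.0f}h of budget left")
```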