---
name: monitoring-observability
description: Monitoring and observability strategy, implementation, and troubleshooting. Use for designing metrics/logs/traces systems, setting up Prometheus/Grafana/Loki, creating alerts and dashboards, calculating SLOs and error budgets, analyzing performance issues, and comparing monitoring tools (Datadog, ELK, CloudWatch). Covers the Four Golden Signals, RED/USE methods, OpenTelemetry instrumentation, log aggregation patterns, and distributed tracing.
---

# Monitoring & Observability

## Overview

This skill provides comprehensive guidance for monitoring and observability workflows, including metrics design, log aggregation, distributed tracing, alerting strategies, SLO/SLA management, and tool selection.

**When to use this skill**:
- Setting up monitoring for new services
- Designing alerts and dashboards
- Troubleshooting performance issues
- Implementing SLO tracking and error budgets
- Choosing between monitoring tools
- Integrating OpenTelemetry instrumentation
- Analyzing metrics, logs, and traces
- Optimizing Datadog costs and finding waste
- Migrating from Datadog to an open-source stack

---

## Core Workflow: Observability Implementation

Use this decision tree to determine your starting point:

```
Are you setting up monitoring from scratch?
├─ YES → Start with "1. Design Metrics Strategy"
└─ NO → Do you have an existing issue?
    ├─ YES → Go to "9. Troubleshooting & Analysis"
    └─ NO → Are you improving existing monitoring?
        ├─ Alerts → Go to "3. Alert Design"
        ├─ Dashboards → Go to "4. Dashboard & Visualization"
        ├─ SLOs → Go to "5. SLO & Error Budgets"
        ├─ Tool selection → Read references/tool_comparison.md
        └─ Using Datadog? High costs? → Go to "7. Datadog Cost Optimization & Migration"
```

---

## 1. Design Metrics Strategy

### Start with the Four Golden Signals

Every service should monitor:

1. **Latency**: Response time (p50, p95, p99)
2. **Traffic**: Requests per second
3. **Errors**: Failure rate
4. **Saturation**: Resource utilization

**For request-driven services**, use the **RED Method**:
- **R**ate: Requests/sec
- **E**rrors: Error rate
- **D**uration: Response time

**For infrastructure resources**, use the **USE Method**:
- **U**tilization: % time busy
- **S**aturation: Queue depth
- **E**rrors: Error count

**Quick Start - Web Application Example**:
```promql
# Rate (requests/sec)
sum(rate(http_requests_total[5m]))

# Errors (error rate %)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100

# Duration (p95 latency)
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
```
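
The queries above assume the service already exports `http_requests_total` and `http_request_duration_seconds`. A minimal instrumentation sketch using the Python `prometheus_client` library (metric and label names chosen to match the queries; the handler and traffic loop are purely illustrative):

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# RED instrumentation: Rate and Errors come from the counter,
# Duration comes from the histogram buckets used by histogram_quantile().
REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests", ["method", "path", "status"]
)
LATENCY = Histogram(
    "http_request_duration_seconds", "HTTP request latency in seconds", ["method", "path"]
)

def handle_request(method, path):
    """Illustrative handler: simulates work and records one observation per request."""
    start = time.perf_counter()
    time.sleep(random.uniform(0.01, 0.2))                 # simulated work
    status = "500" if random.random() < 0.01 else "200"   # ~1% simulated errors
    LATENCY.labels(method, path).observe(time.perf_counter() - start)
    REQUESTS.labels(method, path, status).inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_request("GET", "/api/orders")
```

Point Prometheus at port 8000 and the RED queries above work unchanged.
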

### Deep Dive: Metric Design

For comprehensive metric design guidance including:
- Metric types (counter, gauge, histogram, summary)
- Cardinality best practices
- Naming conventions
- Dashboard design principles

**→ Read**: [references/metrics_design.md](references/metrics_design.md)

### Automated Metric Analysis

Detect anomalies and trends in your metrics:

```bash
# Analyze Prometheus metrics for anomalies
python3 scripts/analyze_metrics.py prometheus \
  --endpoint http://localhost:9090 \
  --query 'rate(http_requests_total[5m])' \
  --hours 24

# Analyze CloudWatch metrics
python3 scripts/analyze_metrics.py cloudwatch \
  --namespace AWS/EC2 \
  --metric CPUUtilization \
  --dimensions InstanceId=i-1234567890abcdef0 \
  --hours 48
```

**→ Script**: [scripts/analyze_metrics.py](scripts/analyze_metrics.py)

---

## 2. Log Aggregation & Analysis

### Structured Logging Checklist

Every log entry should include:
- ✅ Timestamp (ISO 8601 format)
- ✅ Log level (DEBUG, INFO, WARN, ERROR, FATAL)
- ✅ Message (human-readable)
- ✅ Service name
- ✅ Request ID (for tracing)

**Example structured log (JSON)**:
```json
{
  "timestamp": "2024-10-28T14:32:15Z",
  "level": "error",
  "message": "Payment processing failed",
  "service": "payment-service",
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "user_id": "user123",
  "order_id": "ORD-456",
  "error_type": "GatewayTimeout",
  "duration_ms": 5000
}
```
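
An entry like this can be emitted from Python's standard `logging` module with a small JSON formatter. This is only a sketch of the idea (field names follow the checklist above; see references/logging_guide.md for full per-language examples):

```python
import json
import logging
import time
import uuid

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line with the checklist fields."""
    converter = time.gmtime  # UTC timestamps, so the trailing "Z" is accurate

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": "payment-service",                        # service name
            "request_id": getattr(record, "request_id", None),   # for tracing
        }
        entry.update(getattr(record, "fields", {}))  # extra structured fields
        return json.dumps(entry)

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment processing failed",
    extra={
        "request_id": str(uuid.uuid4()),
        "fields": {"order_id": "ORD-456", "error_type": "GatewayTimeout", "duration_ms": 5000},
    },
)
```
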

### Log Aggregation Patterns

**ELK Stack** (Elasticsearch, Logstash, Kibana):
- Best for: Deep log analysis, complex queries
- Cost: High (infrastructure + operations)
- Complexity: High

**Grafana Loki**:
- Best for: Cost-effective logging, Kubernetes
- Cost: Low
- Complexity: Medium

**CloudWatch Logs**:
- Best for: AWS-centric applications
- Cost: Medium
- Complexity: Low

### Log Analysis

Analyze logs for errors, patterns, and anomalies:

```bash
# Analyze log file for patterns
python3 scripts/log_analyzer.py application.log

# Show error lines with context
python3 scripts/log_analyzer.py application.log --show-errors

# Extract stack traces
python3 scripts/log_analyzer.py application.log --show-traces
```

**→ Script**: [scripts/log_analyzer.py](scripts/log_analyzer.py)

### Deep Dive: Logging

For comprehensive logging guidance including:
- Structured logging implementation examples (Python, Node.js, Go, Java)
- Log aggregation patterns (ELK, Loki, CloudWatch, Fluentd)
- Query patterns and best practices
- PII redaction and security
- Sampling and rate limiting

**→ Read**: [references/logging_guide.md](references/logging_guide.md)

---

## 3. Alert Design

### Alert Design Principles

1. **Every alert must be actionable** - If you can't do something, don't alert
2. **Alert on symptoms, not causes** - Alert on user experience, not components
3. **Tie alerts to SLOs** - Connect to business impact
4. **Reduce noise** - Only page for critical issues

### Alert Severity Levels

| Severity | Response Time | Example |
|----------|--------------|---------|
| **Critical** | Page immediately | Service down, SLO violation |
| **Warning** | Ticket, review in hours | Elevated error rate, resource warning |
| **Info** | Log for awareness | Deployment completed, scaling event |

### Multi-Window Burn Rate Alerting

Alert when error budget is consumed too quickly:

```yaml
# Fast burn (1h window) - Critical
- alert: ErrorBudgetFastBurn
  expr: |
    (error_rate / 0.001) > 14.4  # 99.9% SLO
  for: 2m
  labels:
    severity: critical

# Slow burn (6h window) - Warning
- alert: ErrorBudgetSlowBurn
  expr: |
    (error_rate / 0.001) > 6  # 99.9% SLO
  for: 30m
  labels:
    severity: warning
```
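
The 14.4 and 6 multipliers come from the usual multi-window burn-rate arithmetic: burn rate is the observed error rate divided by the error budget, and each threshold is chosen so a sustained burn consumes a fixed slice of the budget. A quick sketch of that math, assuming a 30-day SLO window:

```python
def budget_consumed(burn_rate, window_hours, slo_period_days=30):
    """Fraction of the total error budget consumed if `burn_rate` holds for `window_hours`."""
    return burn_rate * window_hours / (slo_period_days * 24)

# Fast burn: 14.4x for 1 hour consumes 2% of a 30-day budget -> page someone.
print(f"{budget_consumed(14.4, 1):.1%}")  # 2.0%

# Slow burn: 6x for 6 hours consumes 5% of the budget -> open a ticket.
print(f"{budget_consumed(6, 6):.1%}")     # 5.0%
```
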

### Alert Quality Checker

Audit your alert rules against best practices:

```bash
# Check single file
python3 scripts/alert_quality_checker.py alerts.yml

# Check all rules in directory
python3 scripts/alert_quality_checker.py /path/to/prometheus/rules/
```

**Checks for**:
- Alert naming conventions
- Required labels (severity, team)
- Required annotations (summary, description, runbook_url)
- PromQL expression quality
- 'for' clause to prevent flapping

**→ Script**: [scripts/alert_quality_checker.py](scripts/alert_quality_checker.py)

### Alert Templates

Production-ready alert rule templates:

**→ Templates**:
- [assets/templates/prometheus-alerts/webapp-alerts.yml](assets/templates/prometheus-alerts/webapp-alerts.yml) - Web application alerts
- [assets/templates/prometheus-alerts/kubernetes-alerts.yml](assets/templates/prometheus-alerts/kubernetes-alerts.yml) - Kubernetes alerts

### Deep Dive: Alerting

For comprehensive alerting guidance including:
- Alert design patterns (multi-window, rate of change, threshold with hysteresis)
- Alert annotation best practices
- Alert routing (severity-based, team-based, time-based)
- Inhibition rules
- Runbook structure
- On-call best practices

**→ Read**: [references/alerting_best_practices.md](references/alerting_best_practices.md)

### Runbook Template

Create comprehensive runbooks for your alerts:

**→ Template**: [assets/templates/runbooks/incident-runbook-template.md](assets/templates/runbooks/incident-runbook-template.md)

---

## 4. Dashboard & Visualization

### Dashboard Design Principles

1. **Top-down layout**: Most important metrics first
2. **Color coding**: Red (critical), yellow (warning), green (healthy)
3. **Consistent time windows**: All panels use same time range
4. **Limit panels**: 8-12 panels per dashboard maximum
5. **Include context**: Show related metrics together

### Recommended Dashboard Structure

```
┌─────────────────────────────────────┐
│ Overall Health (Single Stats)       │
│ [Requests/s] [Error%] [P95 Latency] │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Request Rate & Errors (Graphs)      │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Latency Distribution (Graphs)       │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Resource Usage (Graphs)             │
└─────────────────────────────────────┘
```

### Generate Grafana Dashboards

Automatically generate dashboards from templates:

```bash
# Web application dashboard
python3 scripts/dashboard_generator.py webapp \
  --title "My API Dashboard" \
  --service my_api \
  --output dashboard.json

# Kubernetes dashboard
python3 scripts/dashboard_generator.py kubernetes \
  --title "K8s Production" \
  --namespace production \
  --output k8s-dashboard.json

# Database dashboard
python3 scripts/dashboard_generator.py database \
  --title "PostgreSQL" \
  --db-type postgres \
  --instance db.example.com:5432 \
  --output db-dashboard.json
```

**Supports**:
- Web applications (requests, errors, latency, resources)
- Kubernetes (pods, nodes, resources, network)
- Databases (PostgreSQL, MySQL)

**→ Script**: [scripts/dashboard_generator.py](scripts/dashboard_generator.py)

---

## 5. SLO & Error Budgets

### SLO Fundamentals

**SLI** (Service Level Indicator): Measurement of service quality
- Example: Request latency, error rate, availability

**SLO** (Service Level Objective): Target value for an SLI
- Example: "99.9% of requests return in < 500ms"

**Error Budget**: Allowed failure amount = (100% - SLO)
- Example: 99.9% SLO = 0.1% error budget = 43.2 minutes/month

### Common SLO Targets

| Availability | Downtime/Month | Use Case |
|--------------|----------------|----------|
| **99%** | 7.2 hours | Internal tools |
| **99.9%** | 43.2 minutes | Standard production |
| **99.95%** | 21.6 minutes | Critical services |
| **99.99%** | 4.3 minutes | High availability |
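
The downtime column follows directly from the error budget definition above (budget = 100% − SLO, applied to a 30-day month); a quick check:

```python
def downtime_per_month(slo_percent, days=30):
    """Allowed downtime in minutes for a given availability SLO over a 30-day month."""
    error_budget = 1 - slo_percent / 100
    return error_budget * days * 24 * 60

for slo in (99.0, 99.9, 99.95, 99.99):
    print(f"{slo}% -> {downtime_per_month(slo):.1f} minutes/month")
# 99.0% -> 432.0 (7.2 hours), 99.9% -> 43.2, 99.95% -> 21.6, 99.99% -> 4.3
```
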

### SLO Calculator

Calculate compliance, error budgets, and burn rates:

```bash
# Show SLO reference table
python3 scripts/slo_calculator.py --table

# Calculate availability SLO
python3 scripts/slo_calculator.py availability \
  --slo 99.9 \
  --total-requests 1000000 \
  --failed-requests 1500 \
  --period-days 30

# Calculate burn rate
python3 scripts/slo_calculator.py burn-rate \
  --slo 99.9 \
  --errors 50 \
  --requests 10000 \
  --window-hours 1
```

**→ Script**: [scripts/slo_calculator.py](scripts/slo_calculator.py)

### Deep Dive: SLO/SLA

For comprehensive SLO/SLA guidance including:
- Choosing appropriate SLIs
- Setting realistic SLO targets
- Error budget policies
- Burn rate alerting
- SLA structure and contracts
- Monthly reporting templates

**→ Read**: [references/slo_sla_guide.md](references/slo_sla_guide.md)

---

## 6. Distributed Tracing

### When to Use Tracing

Use distributed tracing when you need to:
- Debug performance issues across services
- Understand request flow through microservices
- Identify bottlenecks in distributed systems
- Find N+1 query problems

### OpenTelemetry Implementation

**Python example**:
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("process_order")
def process_order(order_id):
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)

    try:
        result = payment_service.charge(order_id)
        span.set_attribute("payment.status", "success")
        return result
    except Exception as e:
        span.set_status(trace.Status(trace.StatusCode.ERROR))
        span.record_exception(e)
        raise
```

### Sampling Strategies

- **Development**: 100% (ALWAYS_ON)
- **Staging**: 50-100%
- **Production**: 1-10% (or error-based sampling)

**Error-based sampling** (always sample errors, ~1% of successes):
```python
from opentelemetry.sdk.trace.sampling import Decision, Sampler, SamplingResult

class ErrorSampler(Sampler):
    """Custom sampler: keep every error span, plus roughly 1% of everything else."""

    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        attributes = attributes or {}

        # Always keep spans already flagged as errors
        if attributes.get("error", False):
            return SamplingResult(Decision.RECORD_AND_SAMPLE)

        # Keep ~1% of the rest (3/256 ≈ 1.2%), based on the trace ID
        if trace_id & 0xFF < 3:
            return SamplingResult(Decision.RECORD_AND_SAMPLE)

        return SamplingResult(Decision.DROP)

    def get_description(self):
        return "ErrorSampler"
```

### OTel Collector Configuration

Production-ready OpenTelemetry Collector configuration:

**→ Template**: [assets/templates/otel-config/collector-config.yaml](assets/templates/otel-config/collector-config.yaml)

**Features**:
- Receives OTLP, Prometheus, and host metrics
- Batching and memory limiting
- Tail sampling (error-based, latency-based, probabilistic)
- Multiple exporters (Tempo, Jaeger, Loki, Prometheus, CloudWatch, Datadog)

### Deep Dive: Tracing

For comprehensive tracing guidance including:
- OpenTelemetry instrumentation (Python, Node.js, Go, Java)
- Span attributes and semantic conventions
- Context propagation (W3C Trace Context)
- Backend comparison (Jaeger, Tempo, X-Ray, Datadog APM)
- Analysis patterns (finding slow traces, N+1 queries)
- Integration with logs

**→ Read**: [references/tracing_guide.md](references/tracing_guide.md)

---

## 7. Datadog Cost Optimization & Migration

### Scenario 1: I'm Using Datadog and Costs Are Too High

If your Datadog bill is growing out of control, start by identifying waste:

#### Cost Analysis Script

Automatically analyze your Datadog usage and find cost optimization opportunities:

```bash
# Analyze Datadog usage (requires API key and APP key)
python3 scripts/datadog_cost_analyzer.py \
  --api-key $DD_API_KEY \
  --app-key $DD_APP_KEY

# Show detailed breakdown by category
python3 scripts/datadog_cost_analyzer.py \
  --api-key $DD_API_KEY \
  --app-key $DD_APP_KEY \
  --show-details
```

**What it checks**:
- Infrastructure host count and cost
- Custom metrics usage and high-cardinality metrics
- Log ingestion volume and trends
- APM host usage
- Unused or noisy monitors
- Container vs VM optimization opportunities

**→ Script**: [scripts/datadog_cost_analyzer.py](scripts/datadog_cost_analyzer.py)

#### Common Cost Optimization Strategies

**1. Custom Metrics Optimization** (typical savings: 20-40%):
- Remove high-cardinality tags (user IDs, request IDs)
- Delete unused custom metrics
- Aggregate metrics before sending
- Use metric prefixes to identify teams/services

**2. Log Management** (typical savings: 30-50%):
- Implement log sampling for high-volume services
- Use exclusion filters for debug/trace logs in production
- Archive cold logs to S3/GCS after 7 days
- Set log retention policies (15 days instead of 30)

**3. APM Optimization** (typical savings: 15-25%):
- Reduce trace sampling rates (10% → 5% in prod)
- Use head-based sampling instead of complete sampling
- Remove APM from non-critical services
- Use trace search with lower retention

**4. Infrastructure Monitoring** (typical savings: 10-20%):
- Switch from VM-based to container-based pricing where possible
- Remove agents from ephemeral instances
- Use Datadog's host reduction strategies
- Consolidate staging environments

### Scenario 2: Migrating Away from Datadog

If you're considering migrating to a more cost-effective open-source stack:

#### Migration Overview

**From Datadog** → **To Open Source Stack**:
- Metrics: Datadog → **Prometheus + Grafana**
- Logs: Datadog Logs → **Grafana Loki**
- Traces: Datadog APM → **Tempo or Jaeger**
- Dashboards: Datadog → **Grafana**
- Alerts: Datadog Monitors → **Prometheus Alertmanager**

**Estimated Cost Savings**: 60-77% ($49.8k-61.8k/year for a 100-host environment)

#### Migration Strategy

**Phase 1: Run Parallel** (Month 1-2):
- Deploy open-source stack alongside Datadog
- Migrate metrics first (lowest risk)
- Validate data accuracy

**Phase 2: Migrate Dashboards & Alerts** (Month 2-3):
- Convert Datadog dashboards to Grafana
- Translate alert rules (use the DQL → PromQL guide below)
- Train team on new tools

**Phase 3: Migrate Logs & Traces** (Month 3-4):
- Set up Loki for log aggregation
- Deploy Tempo/Jaeger for tracing
- Update application instrumentation

**Phase 4: Decommission Datadog** (Month 4-5):
- Confirm all functionality migrated
- Cancel Datadog subscription

#### Query Translation: DQL → PromQL

When migrating dashboards and alerts, you'll need to translate Datadog queries to PromQL:

**Quick examples**:
```
# Average CPU (user mode, %)
Datadog: avg:system.cpu.user{*}
Prometheus: avg(rate(node_cpu_seconds_total{mode="user"}[5m])) * 100

# Request rate
Datadog: sum:requests.count{*}.as_rate()
Prometheus: sum(rate(http_requests_total[5m]))

# P95 latency
Datadog: p95:request.duration{*}
Prometheus: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate percentage
Datadog: (sum:requests.errors{*}.as_rate() / sum:requests.count{*}.as_rate()) * 100
Prometheus: (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
```

**→ Full Translation Guide**: [references/dql_promql_translation.md](references/dql_promql_translation.md)

#### Cost Comparison

**Example: 100-host infrastructure**

| Component | Datadog (Annual) | Open Source (Annual) | Savings |
|-----------|-----------------|---------------------|---------|
| Infrastructure | $18,000 | $10,000 (self-hosted infra) | $8,000 |
| Custom Metrics | $600 | Included | $600 |
| Logs | $24,000 | $3,000 (storage) | $21,000 |
| APM/Traces | $37,200 | $5,000 (storage) | $32,200 |
| **Total** | **$79,800** | **$18,000** | **$61,800 (77%)** |
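
The headline savings figure is just the column totals from this table; for reference, the arithmetic:

```python
# Annual figures from the table above (100-host example).
datadog = {"infrastructure": 18_000, "custom_metrics": 600, "logs": 24_000, "apm": 37_200}
open_source = {"infrastructure": 10_000, "custom_metrics": 0, "logs": 3_000, "apm": 5_000}

dd_total = sum(datadog.values())       # 79,800
oss_total = sum(open_source.values())  # 18,000
savings = dd_total - oss_total         # 61,800
print(f"${savings:,}/year ({savings / dd_total:.0%})")  # $61,800/year (77%)
```
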

### Deep Dive: Datadog Migration

For comprehensive migration guidance including:
- Detailed cost comparison and ROI calculations
- Step-by-step migration instructions
- Infrastructure sizing recommendations (CPU, RAM, storage)
- Dashboard conversion tools and examples
- Alert rule translation patterns
- Application instrumentation changes (DogStatsD → Prometheus client)
- Python scripts for exporting Datadog dashboards and monitors
- Common challenges and solutions

**→ Read**: [references/datadog_migration.md](references/datadog_migration.md)

---

## 8. Tool Selection & Comparison

### Decision Matrix

**Choose Prometheus + Grafana if**:
- ✅ Using Kubernetes
- ✅ Want control and customization
- ✅ Have ops capacity
- ✅ Budget-conscious

**Choose Datadog if**:
- ✅ Want ease of use
- ✅ Need full observability now
- ✅ Budget allows ($8k+/month for 100 hosts)

**Choose Grafana Stack (LGTM) if**:
- ✅ Want open source full stack
- ✅ Cost-effective solution
- ✅ Cloud-native architecture

**Choose ELK Stack if**:
- ✅ Heavy log analysis needs
- ✅ Need powerful search
- ✅ Have dedicated ops team

**Choose Cloud Native (CloudWatch/etc) if**:
- ✅ Single cloud provider
- ✅ Simple needs
- ✅ Want minimal setup

### Cost Comparison (100 hosts, 1TB logs/month)

| Solution | Monthly Cost | Setup | Ops Burden |
|----------|-------------|-------|------------|
| Prometheus + Loki + Tempo | $1,500 | Medium | Medium |
| Grafana Cloud | $3,000 | Low | Low |
| Datadog | $8,000 | Low | None |
| ELK Stack | $4,000 | High | High |
| CloudWatch | $2,000 | Low | Low |

### Deep Dive: Tool Comparison

For comprehensive tool comparison including:
- Metrics platforms (Prometheus, Datadog, New Relic, CloudWatch, Grafana Cloud)
- Logging platforms (ELK, Loki, Splunk, CloudWatch Logs, Sumo Logic)
- Tracing platforms (Jaeger, Tempo, Datadog APM, X-Ray)
- Full-stack observability comparison
- Recommendations by company size

**→ Read**: [references/tool_comparison.md](references/tool_comparison.md)

---

## 9. Troubleshooting & Analysis

### Health Check Validation

Validate health check endpoints against best practices:

```bash
# Check single endpoint
python3 scripts/health_check_validator.py https://api.example.com/health

# Check multiple endpoints
python3 scripts/health_check_validator.py \
  https://api.example.com/health \
  https://api.example.com/readiness \
  --verbose
```

**Checks for**:
- ✓ Returns 200 status code
- ✓ Response time < 1 second
- ✓ Returns JSON format
- ✓ Contains 'status' field
- ✓ Includes version/build info
- ✓ Checks dependencies
- ✓ Disables caching

**→ Script**: [scripts/health_check_validator.py](scripts/health_check_validator.py)
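
For contrast, a health endpoint that would pass every check in the list above might look like this sketch (Flask, the version string, and the dependency probe are assumed here purely for illustration; the response shape is what matters):

```python
from flask import Flask, jsonify

app = Flask(__name__)

def check_database():
    """Placeholder dependency check; replace with a real connectivity probe."""
    return "ok"

@app.route("/health")
def health():
    body = {
        "status": "ok",                             # required 'status' field
        "version": "1.4.2",                         # version/build info
        "checks": {"database": check_database()},   # dependency checks
    }
    resp = jsonify(body)                            # JSON response, 200 by default
    resp.headers["Cache-Control"] = "no-store"      # disable caching
    return resp

if __name__ == "__main__":
    app.run(port=8080)
```
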

### Common Troubleshooting Workflows

**High Latency Investigation**:
1. Check dashboards for latency spike
2. Query traces for slow operations
3. Check database slow query log
4. Check external API response times
5. Review recent deployments
6. Check resource utilization (CPU, memory)

**High Error Rate Investigation**:
1. Check error logs for patterns
2. Identify affected endpoints
3. Check dependency health
4. Review recent deployments
5. Check resource limits
6. Verify configuration

**Service Down Investigation**:
1. Check if pods/instances are running
2. Check health check endpoint
3. Review recent deployments
4. Check resource availability
5. Check network connectivity
6. Review logs for startup errors

---

## Quick Reference Commands

### Prometheus Queries

```promql
# Request rate
sum(rate(http_requests_total[5m]))

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100

# P95 latency
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# CPU usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
```

### Kubernetes Commands

```bash
# Check pod status
kubectl get pods -n <namespace>

# View pod logs
kubectl logs -f <pod-name> -n <namespace>

# Check pod resources
kubectl top pods -n <namespace>

# Describe pod for events
kubectl describe pod <pod-name> -n <namespace>

# Check recent deployments
kubectl rollout history deployment/<name> -n <namespace>
```

### Log Queries

**Elasticsearch**:
```json
GET /logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "error" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}
```

**Loki (LogQL)**:
```logql
{job="app", level="error"} |= "error" | json
```

**CloudWatch Insights**:
```
fields @timestamp, level, message
| filter level = "error"
| filter @timestamp > ago(1h)
```

---

## Resources Summary

### Scripts (automation and analysis)
- `analyze_metrics.py` - Detect anomalies in Prometheus/CloudWatch metrics
- `alert_quality_checker.py` - Audit alert rules against best practices
- `slo_calculator.py` - Calculate SLO compliance and error budgets
- `log_analyzer.py` - Parse logs for errors and patterns
- `dashboard_generator.py` - Generate Grafana dashboards from templates
- `health_check_validator.py` - Validate health check endpoints
- `datadog_cost_analyzer.py` - Analyze Datadog usage and find cost waste

### References (deep-dive documentation)
- `metrics_design.md` - Four Golden Signals, RED/USE methods, metric types
- `alerting_best_practices.md` - Alert design, runbooks, on-call practices
- `logging_guide.md` - Structured logging, aggregation patterns
- `tracing_guide.md` - OpenTelemetry, distributed tracing
- `slo_sla_guide.md` - SLI/SLO/SLA definitions, error budgets
- `tool_comparison.md` - Comprehensive comparison of monitoring tools
- `datadog_migration.md` - Complete guide for migrating from Datadog to OSS stack
- `dql_promql_translation.md` - Datadog Query Language to PromQL translation reference

### Templates (ready-to-use configurations)
- `prometheus-alerts/webapp-alerts.yml` - Production-ready web app alerts
- `prometheus-alerts/kubernetes-alerts.yml` - Kubernetes monitoring alerts
- `otel-config/collector-config.yaml` - OpenTelemetry Collector configuration
- `runbooks/incident-runbook-template.md` - Incident response template

---

## Best Practices

### Metrics
- Start with the Four Golden Signals
- Use appropriate metric types (counter, gauge, histogram)
- Keep cardinality low (avoid high-cardinality labels)
- Follow naming conventions

### Logging
- Use structured logging (JSON)
- Include request IDs for tracing
- Set appropriate log levels
- Redact PII before logging

### Alerting
- Make every alert actionable
- Alert on symptoms, not causes
- Use multi-window burn rate alerts
- Include runbook links

### Tracing
- Sample appropriately (1-10% in production)
- Always record errors
- Use semantic conventions
- Propagate context between services

### SLOs
- Start with current performance
- Set realistic targets
- Define error budget policies
- Review and adjust quarterly