---
name: prometheus-configuration
description: Set up Prometheus for comprehensive metric collection, storage, and monitoring of infrastructure and applications. Use when implementing metrics collection, setting up monitoring infrastructure, or configuring alerting systems.
---

# Prometheus Configuration

Guide to Prometheus setup, metric collection, scrape configuration, recording rules, and alert rules.

## Purpose

Configure Prometheus for comprehensive metric collection, alerting, and monitoring of infrastructure and applications.

## When to Use

- Set up Prometheus monitoring
- Configure metric scraping
- Create recording rules
- Design alert rules
- Implement service discovery

## Prometheus Architecture

```
┌──────────────┐
│ Applications │ ← Instrumented with client libraries
└──────┬───────┘
       │ /metrics endpoint
       ↓
┌──────────────┐
│  Prometheus  │ ← Scrapes metrics periodically
│    Server    │
└──────┬───────┘
       │
       ├─→ Alertmanager (alerts)
       ├─→ Grafana (visualization)
       └─→ Long-term storage (Thanos/Cortex)
```

## Installation

### Kubernetes with Helm

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Persistent storage is requested through storageSpec.volumeClaimTemplate
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
```

### Docker Compose

```yaml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'

volumes:
  prometheus-data:
```

## Configuration File

**prometheus.yml:**

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-west-2'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules files
rule_files:
  - /etc/prometheus/rules/*.yml

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporters
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'node1:9100'
          - 'node2:9100'
          - 'node3:9100'
    relabel_configs:
      # Strip the port so the instance label is just the hostname
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+)(:[0-9]+)?'
        replacement: '${1}'

  # Kubernetes pods with annotations
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Scrape only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Override the metrics path from prometheus.io/path, if set
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Override the scrape port from prometheus.io/port, if set
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

  # Application metrics
  - job_name: 'my-app'
    static_configs:
      - targets:
          - 'app1.example.com:9090'
          - 'app2.example.com:9090'
    metrics_path: '/metrics'
    scheme: 'https'
    tls_config:
      ca_file: /etc/prometheus/ca.crt
      cert_file: /etc/prometheus/client.crt
      key_file: /etc/prometheus/client.key
```

**Reference:** See `assets/prometheus.yml.template`

## Scrape Configurations

### Static Targets

```yaml
scrape_configs:
  - job_name: 'static-targets'
    static_configs:
      - targets: ['host1:9100', 'host2:9100']
        labels:
          env: 'production'
          region: 'us-west-2'
```

### File-based Service Discovery

```yaml
scrape_configs:
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json
          - /etc/prometheus/targets/*.yml
        refresh_interval: 5m
```

**targets/production.json:**

```json
[
  {
    "targets": ["app1:9090", "app2:9090"],
    "labels": {
      "env": "production",
      "service": "api"
    }
  }
]
```

### Kubernetes Service Discovery

```yaml
scrape_configs:
  - job_name: 'kubernetes-services'
    kubernetes_sd_configs:
      - role: service
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```

**Reference:** See `references/scrape-configs.md`

## Recording Rules

Create pre-computed metrics for frequently queried expressions:

```yaml
# /etc/prometheus/rules/recording_rules.yml
groups:
  - name: api_metrics
    interval: 15s
    rules:
      # HTTP request rate per job
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # 5xx error rate per job
      - record: job:http_requests_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

      # Error rate percentage (built from the two rules above)
      - record: job:http_requests_error_rate:percentage
        expr: |
          (job:http_requests_errors:rate5m / job:http_requests:rate5m) * 100

      # P95 latency
      - record: job:http_request_duration:p95
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

  - name: resource_metrics
    interval: 30s
    rules:
      # CPU utilization percentage
      - record: instance:node_cpu:utilization
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      # Memory utilization percentage
      - record: instance:node_memory:utilization
        expr: |
          100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

      # Disk usage percentage
      - record: instance:node_disk:utilization
        expr: |
          100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)
```

**Reference:** See `references/recording-rules.md`

## Alert Rules

```yaml
# /etc/prometheus/rules/alert_rules.yml
groups:
  - name: availability
    interval: 30s
    rules:
      - alert: ServiceDown
        expr: up{job="my-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} has been down for more than 1 minute"

      - alert: HighErrorRate
        expr: job:http_requests_error_rate:percentage > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate for {{ $labels.job }}"
          description: "Error rate is {{ $value }}% (threshold: 5%)"

      - alert: HighLatency
        expr: job:http_request_duration:p95 > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency for {{ $labels.job }}"
          description: "P95 latency is {{ $value }}s (threshold: 1s)"

  - name: resources
    interval: 1m
    rules:
      - alert: HighCPUUsage
        expr: instance:node_cpu:utilization > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

      - alert: HighMemoryUsage
        expr: instance:node_memory:utilization > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"

      - alert: DiskSpaceLow
        expr: instance:node_disk:utilization > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage is {{ $value }}%"
```

## Validation

```bash
# Validate configuration
promtool check config prometheus.yml

# Validate rules
promtool check rules /etc/prometheus/rules/*.yml

# Test query
promtool query instant http://localhost:9090 'up'
```

**Reference:** See `scripts/validate-prometheus.sh`

## Best Practices

1. **Use consistent naming** for metrics (prefix_name_unit)
2. **Set appropriate scrape intervals** (15-60s typical)
3. **Use recording rules** for expensive queries
4. **Implement high availability** (multiple Prometheus instances)
5. **Configure retention** based on storage capacity
6. **Use relabeling** for metric cleanup
7. **Monitor Prometheus itself**
8. **Implement federation** for large deployments
9. **Use Thanos/Cortex** for long-term storage
10. **Document custom metrics**

## Troubleshooting

**Check scrape targets:**

```bash
curl http://localhost:9090/api/v1/targets
```

**Check configuration:**

```bash
curl http://localhost:9090/api/v1/status/config
```

**Test query:**

```bash
curl 'http://localhost:9090/api/v1/query?query=up'
```

## Reference Files

- `assets/prometheus.yml.template` - Complete configuration template
- `references/scrape-configs.md` - Scrape configuration patterns
- `references/recording-rules.md` - Recording rule examples
- `scripts/validate-prometheus.sh` - Validation script

## Related Skills

- `grafana-dashboards` - For visualization
- `slo-implementation` - For SLO monitoring
- `distributed-tracing` - For request tracing
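
## Example: Relabeling for Metric Cleanup

The best practice "Use relabeling for metric cleanup" is stated above without a configuration. The fragment below is a minimal sketch of one way to apply it with `metric_relabel_configs`, which runs after a scrape and before samples are stored. It reuses the `my-app` job name from this guide in trimmed form. The metric pattern `go_gc_.*` and the label name `request_id` are illustrative assumptions, not names taken from this guide; substitute whatever your own targets expose.

```yaml
scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['app1.example.com:9090']
    metric_relabel_configs:
      # Drop entire metric families that are never queried.
      # 'go_gc_.*' is a hypothetical pattern chosen for illustration.
      - source_labels: [__name__]
        regex: 'go_gc_.*'
        action: drop
      # Remove a high-cardinality label from all remaining series.
      # 'request_id' is a hypothetical label name.
      - regex: 'request_id'
        action: labeldrop
```

When using `labeldrop`, check that the remaining labels still identify each series uniquely, since colliding series are rejected at ingestion. After editing, the result can be checked with `promtool check config prometheus.yml` as shown under Validation.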