393 lines
10 KiB
Markdown
393 lines
10 KiB
Markdown
---
|
|
name: prometheus-configuration
|
|
description: Set up Prometheus for comprehensive metric collection, storage, and monitoring of infrastructure and applications. Use when implementing metrics collection, setting up monitoring infrastructure, or configuring alerting systems.
|
|
---
|
|
|
|
# Prometheus Configuration
|
|
|
|
Complete guide to Prometheus setup, metric collection, scrape configuration, and recording rules.
|
|
|
|
## Purpose
|
|
|
|
Configure Prometheus for comprehensive metric collection, alerting, and monitoring of infrastructure and applications.
|
|
|
|
## When to Use
|
|
|
|
- Set up Prometheus monitoring
|
|
- Configure metric scraping
|
|
- Create recording rules
|
|
- Design alert rules
|
|
- Implement service discovery
|
|
|
|
## Prometheus Architecture
|
|
|
|
```
|
|
┌──────────────┐
|
|
│ Applications │ ← Instrumented with client libraries
|
|
└──────┬───────┘
|
|
│ /metrics endpoint
|
|
↓
|
|
┌──────────────┐
|
|
│ Prometheus │ ← Scrapes metrics periodically
|
|
│ Server │
|
|
└──────┬───────┘
|
|
│
|
|
├─→ AlertManager (alerts)
|
|
├─→ Grafana (visualization)
|
|
└─→ Long-term storage (Thanos/Cortex)
|
|
```
|
|
|
|
## Installation
|
|
|
|
### Kubernetes with Helm
|
|
|
|
```bash
|
|
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
|
|
helm repo update
|
|
|
|
helm install prometheus prometheus-community/kube-prometheus-stack \
|
|
--namespace monitoring \
|
|
--create-namespace \
|
|
--set prometheus.prometheusSpec.retention=30d \
|
|
--set prometheus.prometheusSpec.storageVolumeSize=50Gi
|
|
```
|
|
|
|
### Docker Compose
|
|
|
|
```yaml
|
|
version: '3.8'
|
|
services:
|
|
prometheus:
|
|
image: prom/prometheus:latest
|
|
ports:
|
|
- "9090:9090"
|
|
volumes:
|
|
- ./prometheus.yml:/etc/prometheus/prometheus.yml
|
|
- prometheus-data:/prometheus
|
|
command:
|
|
- '--config.file=/etc/prometheus/prometheus.yml'
|
|
- '--storage.tsdb.path=/prometheus'
|
|
- '--storage.tsdb.retention.time=30d'
|
|
|
|
volumes:
|
|
prometheus-data:
|
|
```
|
|
|
|
## Configuration File
|
|
|
|
**prometheus.yml:**
|
|
```yaml
|
|
global:
|
|
scrape_interval: 15s
|
|
evaluation_interval: 15s
|
|
external_labels:
|
|
cluster: 'production'
|
|
region: 'us-west-2'
|
|
|
|
# Alertmanager configuration
|
|
alerting:
|
|
alertmanagers:
|
|
- static_configs:
|
|
- targets:
|
|
- alertmanager:9093
|
|
|
|
# Load rules files
|
|
rule_files:
|
|
- /etc/prometheus/rules/*.yml
|
|
|
|
# Scrape configurations
|
|
scrape_configs:
|
|
# Prometheus itself
|
|
- job_name: 'prometheus'
|
|
static_configs:
|
|
- targets: ['localhost:9090']
|
|
|
|
# Node exporters
|
|
- job_name: 'node-exporter'
|
|
static_configs:
|
|
- targets:
|
|
- 'node1:9100'
|
|
- 'node2:9100'
|
|
- 'node3:9100'
|
|
relabel_configs:
|
|
- source_labels: [__address__]
|
|
target_label: instance
|
|
regex: '([^:]+)(:[0-9]+)?'
|
|
replacement: '${1}'
|
|
|
|
# Kubernetes pods with annotations
|
|
- job_name: 'kubernetes-pods'
|
|
kubernetes_sd_configs:
|
|
- role: pod
|
|
relabel_configs:
|
|
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
|
|
action: keep
|
|
regex: true
|
|
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
|
|
action: replace
|
|
target_label: __metrics_path__
|
|
regex: (.+)
|
|
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
|
|
action: replace
|
|
regex: ([^:]+)(?::\d+)?;(\d+)
|
|
replacement: $1:$2
|
|
target_label: __address__
|
|
- source_labels: [__meta_kubernetes_namespace]
|
|
action: replace
|
|
target_label: namespace
|
|
- source_labels: [__meta_kubernetes_pod_name]
|
|
action: replace
|
|
target_label: pod
|
|
|
|
# Application metrics
|
|
- job_name: 'my-app'
|
|
static_configs:
|
|
- targets:
|
|
- 'app1.example.com:9090'
|
|
- 'app2.example.com:9090'
|
|
metrics_path: '/metrics'
|
|
scheme: 'https'
|
|
tls_config:
|
|
ca_file: /etc/prometheus/ca.crt
|
|
cert_file: /etc/prometheus/client.crt
|
|
key_file: /etc/prometheus/client.key
|
|
```
|
|
|
|
**Reference:** See `assets/prometheus.yml.template`
|
|
|
|
## Scrape Configurations
|
|
|
|
### Static Targets
|
|
|
|
```yaml
|
|
scrape_configs:
|
|
- job_name: 'static-targets'
|
|
static_configs:
|
|
- targets: ['host1:9100', 'host2:9100']
|
|
labels:
|
|
env: 'production'
|
|
region: 'us-west-2'
|
|
```
|
|
|
|
### File-based Service Discovery
|
|
|
|
```yaml
|
|
scrape_configs:
|
|
- job_name: 'file-sd'
|
|
file_sd_configs:
|
|
- files:
|
|
- /etc/prometheus/targets/*.json
|
|
- /etc/prometheus/targets/*.yml
|
|
refresh_interval: 5m
|
|
```
|
|
|
|
**targets/production.json:**
|
|
```json
|
|
[
|
|
{
|
|
"targets": ["app1:9090", "app2:9090"],
|
|
"labels": {
|
|
"env": "production",
|
|
"service": "api"
|
|
}
|
|
}
|
|
]
|
|
```
|
|
|
|
### Kubernetes Service Discovery
|
|
|
|
```yaml
|
|
scrape_configs:
|
|
- job_name: 'kubernetes-services'
|
|
kubernetes_sd_configs:
|
|
- role: service
|
|
relabel_configs:
|
|
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
|
|
action: keep
|
|
regex: true
|
|
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
|
|
action: replace
|
|
target_label: __scheme__
|
|
regex: (https?)
|
|
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
|
|
action: replace
|
|
target_label: __metrics_path__
|
|
regex: (.+)
|
|
```
|
|
|
|
**Reference:** See `references/scrape-configs.md`
|
|
|
|
## Recording Rules
|
|
|
|
Create pre-computed metrics for frequently queried expressions:
|
|
|
|
```yaml
|
|
# /etc/prometheus/rules/recording_rules.yml
|
|
groups:
|
|
- name: api_metrics
|
|
interval: 15s
|
|
rules:
|
|
# HTTP request rate per service
|
|
- record: job:http_requests:rate5m
|
|
expr: sum by (job) (rate(http_requests_total[5m]))
|
|
|
|
# Error rate percentage
|
|
- record: job:http_requests_errors:rate5m
|
|
expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
|
|
|
|
- record: job:http_requests_error_rate:percentage
|
|
expr: |
|
|
(job:http_requests_errors:rate5m / job:http_requests:rate5m) * 100
|
|
|
|
# P95 latency
|
|
- record: job:http_request_duration:p95
|
|
expr: |
|
|
histogram_quantile(0.95,
|
|
sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
|
|
)
|
|
|
|
- name: resource_metrics
|
|
interval: 30s
|
|
rules:
|
|
# CPU utilization percentage
|
|
- record: instance:node_cpu:utilization
|
|
expr: |
|
|
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
|
|
|
|
# Memory utilization percentage
|
|
- record: instance:node_memory:utilization
|
|
expr: |
|
|
100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)
|
|
|
|
# Disk usage percentage
|
|
- record: instance:node_disk:utilization
|
|
expr: |
|
|
100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)
|
|
```
|
|
|
|
**Reference:** See `references/recording-rules.md`
|
|
|
|
## Alert Rules
|
|
|
|
```yaml
|
|
# /etc/prometheus/rules/alert_rules.yml
|
|
groups:
|
|
- name: availability
|
|
interval: 30s
|
|
rules:
|
|
- alert: ServiceDown
|
|
expr: up{job="my-app"} == 0
|
|
for: 1m
|
|
labels:
|
|
severity: critical
|
|
annotations:
|
|
summary: "Service {{ $labels.instance }} is down"
|
|
description: "{{ $labels.job }} has been down for more than 1 minute"
|
|
|
|
- alert: HighErrorRate
|
|
expr: job:http_requests_error_rate:percentage > 5
|
|
for: 5m
|
|
labels:
|
|
severity: warning
|
|
annotations:
|
|
summary: "High error rate for {{ $labels.job }}"
|
|
description: "Error rate is {{ $value }}% (threshold: 5%)"
|
|
|
|
- alert: HighLatency
|
|
expr: job:http_request_duration:p95 > 1
|
|
for: 5m
|
|
labels:
|
|
severity: warning
|
|
annotations:
|
|
summary: "High latency for {{ $labels.job }}"
|
|
description: "P95 latency is {{ $value }}s (threshold: 1s)"
|
|
|
|
- name: resources
|
|
interval: 1m
|
|
rules:
|
|
- alert: HighCPUUsage
|
|
expr: instance:node_cpu:utilization > 80
|
|
for: 5m
|
|
labels:
|
|
severity: warning
|
|
annotations:
|
|
summary: "High CPU usage on {{ $labels.instance }}"
|
|
description: "CPU usage is {{ $value }}%"
|
|
|
|
- alert: HighMemoryUsage
|
|
expr: instance:node_memory:utilization > 85
|
|
for: 5m
|
|
labels:
|
|
severity: warning
|
|
annotations:
|
|
summary: "High memory usage on {{ $labels.instance }}"
|
|
description: "Memory usage is {{ $value }}%"
|
|
|
|
- alert: DiskSpaceLow
|
|
expr: instance:node_disk:utilization > 90
|
|
for: 5m
|
|
labels:
|
|
severity: critical
|
|
annotations:
|
|
summary: "Low disk space on {{ $labels.instance }}"
|
|
description: "Disk usage is {{ $value }}%"
|
|
```
|
|
|
|
## Validation
|
|
|
|
```bash
|
|
# Validate configuration
|
|
promtool check config prometheus.yml
|
|
|
|
# Validate rules
|
|
promtool check rules /etc/prometheus/rules/*.yml
|
|
|
|
# Test query
|
|
promtool query instant http://localhost:9090 'up'
|
|
```
|
|
|
|
**Reference:** See `scripts/validate-prometheus.sh`
|
|
|
|
## Best Practices
|
|
|
|
1. **Use consistent naming** for metrics (prefix_name_unit)
|
|
2. **Set appropriate scrape intervals** (15-60s typical)
|
|
3. **Use recording rules** for expensive queries
|
|
4. **Implement high availability** (multiple Prometheus instances)
|
|
5. **Configure retention** based on storage capacity
|
|
6. **Use relabeling** for metric cleanup
|
|
7. **Monitor Prometheus itself**
|
|
8. **Implement federation** for large deployments
|
|
9. **Use Thanos/Cortex** for long-term storage
|
|
10. **Document custom metrics**
|
|
|
|
## Troubleshooting
|
|
|
|
**Check scrape targets:**
|
|
```bash
|
|
curl http://localhost:9090/api/v1/targets
|
|
```
|
|
|
|
**Check configuration:**
|
|
```bash
|
|
curl http://localhost:9090/api/v1/status/config
|
|
```
|
|
|
|
**Test query:**
|
|
```bash
|
|
curl 'http://localhost:9090/api/v1/query?query=up'
|
|
```
|
|
|
|
## Reference Files
|
|
|
|
- `assets/prometheus.yml.template` - Complete configuration template
|
|
- `references/scrape-configs.md` - Scrape configuration patterns
|
|
- `references/recording-rules.md` - Recording rule examples
|
|
- `scripts/validate-prometheus.sh` - Validation script
|
|
|
|
## Related Skills
|
|
|
|
- `grafana-dashboards` - For visualization
|
|
- `slo-implementation` - For SLO monitoring
|
|
- `distributed-tracing` - For request tracing
|