Initial commit
skills/prometheus-configuration/SKILL.md
---
name: prometheus-configuration
description: Set up Prometheus for comprehensive metric collection, storage, and monitoring of infrastructure and applications. Use when implementing metrics collection, setting up monitoring infrastructure, or configuring alerting systems.
---

# Prometheus Configuration

Complete guide to Prometheus setup, metric collection, scrape configuration, and recording rules.

## Purpose

Configure Prometheus for comprehensive metric collection, alerting, and monitoring of infrastructure and applications.

## When to Use

- Set up Prometheus monitoring
- Configure metric scraping
- Create recording rules
- Design alert rules
- Implement service discovery

## Prometheus Architecture

```
┌──────────────┐
│ Applications │ ← Instrumented with client libraries
└──────┬───────┘
       │ /metrics endpoint
       ↓
┌──────────────┐
│  Prometheus  │ ← Scrapes metrics periodically
│    Server    │
└──────┬───────┘
       │
       ├─→ Alertmanager (alerts)
       ├─→ Grafana (visualization)
       └─→ Long-term storage (Thanos/Cortex)
```
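To make the `/metrics` step in the diagram concrete, here is a minimal sketch of what an instrumented application exposes, written with only the Python standard library. The metric name `http_requests_total`, the in-memory `REQUEST_COUNTS` dictionary, and port 8000 are assumptions for illustration; a real service would normally use an official client library rather than rendering the text format by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter keyed by (method, status).
REQUEST_COUNTS = {("GET", "200"): 0}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counts.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current counters on GET /metrics and 404s everything else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the endpoint; this call blocks until the process is stopped."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a scrape job at `localhost:8000` would let Prometheus collect the counter on each scrape interval.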

## Installation

### Kubernetes with Helm

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Retention and volume size are set through the Prometheus custom resource spec
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
```

### Docker Compose

```yaml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'

volumes:
  prometheus-data:
```

## Configuration File

**prometheus.yml:**
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-west-2'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules files
rule_files:
  - /etc/prometheus/rules/*.yml

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporters
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'node1:9100'
          - 'node2:9100'
          - 'node3:9100'
    relabel_configs:
      # Strip the port so the instance label is just the host name
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+)(:[0-9]+)?'
        replacement: '${1}'

  # Kubernetes pods with annotations
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use the path from prometheus.io/path when the annotation is set
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Rewrite the scrape address to use the port from prometheus.io/port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Copy the namespace and pod name onto every scraped series
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

  # Application metrics
  - job_name: 'my-app'
    static_configs:
      - targets:
          - 'app1.example.com:9090'
          - 'app2.example.com:9090'
    metrics_path: '/metrics'
    scheme: 'https'
    tls_config:
      ca_file: /etc/prometheus/ca.crt
      cert_file: /etc/prometheus/client.crt
      key_file: /etc/prometheus/client.key
```

**Reference:** See `assets/prometheus.yml.template`
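The relabel rules in the configuration are easier to reason about once you see how the joined source labels meet the regex. The sketch below approximates the `replace` action in plain Python: source label values are joined with `;`, the regex must match the whole joined string, and `$1`/`${1}` refer to capture groups. The helper name `relabel_replace` is hypothetical, and the sketch ignores details of the real implementation (for example, an empty result removing the label).

```python
import re


def relabel_replace(labels, source_labels, regex, replacement, target_label, separator=";"):
    """Approximate Prometheus' `replace` relabel action on a dict of labels."""
    # Source label values are joined with the separator (";" by default).
    joined = separator.join(labels.get(name, "") for name in source_labels)
    # Prometheus anchors the regex at both ends, hence fullmatch.
    match = re.fullmatch(regex, joined)
    if match is None:
        return dict(labels)  # no match: labels are left unchanged
    # Translate $1 / ${1} references into Python's \g<1> syntax.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


# A pod discovered on port 9102 whose annotation asks for port 8080.
pod = {
    "__address__": "10.0.0.5:9102",
    "__meta_kubernetes_pod_annotation_prometheus_io_port": "8080",
}
rewritten = relabel_replace(
    pod,
    ["__address__", "__meta_kubernetes_pod_annotation_prometheus_io_port"],
    r"([^:]+)(?::\d+)?;(\d+)",
    "$1:$2",
    "__address__",
)
print(rewritten["__address__"])
```

When the port annotation is absent, the joined string ends in `;` with no digits, the regex does not match, and the discovered address is kept as-is.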

## Scrape Configurations

### Static Targets

```yaml
scrape_configs:
  - job_name: 'static-targets'
    static_configs:
      - targets: ['host1:9100', 'host2:9100']
        labels:
          env: 'production'
          region: 'us-west-2'
```

### File-based Service Discovery

```yaml
scrape_configs:
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json
          - /etc/prometheus/targets/*.yml
        refresh_interval: 5m
```

**targets/production.json:**
```json
[
  {
    "targets": ["app1:9090", "app2:9090"],
    "labels": {
      "env": "production",
      "service": "api"
    }
  }
]
```
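Target files like the one above are usually generated by a script or deployment tool rather than edited by hand. As a sketch, assuming a hypothetical helper named `write_file_sd_targets`, the file can be written to a temporary name and then renamed into place so that Prometheus never reads a half-written document. The `.tmp` suffix is chosen so the temporary file does not match the `*.json` glob.

```python
import json
import os
import tempfile


def write_file_sd_targets(path, groups):
    """Write target groups as file_sd JSON, replacing the file atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # Temp file lives in the same directory so the final rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(groups, handle, indent=2)
            handle.write("\n")
        # mkstemp creates the file as 0600; relax it so a Prometheus process
        # running as a different user can read the result.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


groups = [
    {
        "targets": ["app1:9090", "app2:9090"],
        "labels": {"env": "production", "service": "api"},
    }
]
```

Calling `write_file_sd_targets("/etc/prometheus/targets/production.json", groups)` would produce the JSON shown above; Prometheus picks up the change without a restart.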

### Kubernetes Service Discovery

```yaml
scrape_configs:
  - job_name: 'kubernetes-services'
    kubernetes_sd_configs:
      - role: service
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```

**Reference:** See `references/scrape-configs.md`

## Recording Rules

Create pre-computed metrics for frequently queried expressions:

```yaml
# /etc/prometheus/rules/recording_rules.yml
groups:
  - name: api_metrics
    interval: 15s
    rules:
      # HTTP request rate per service
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Error rate percentage
      - record: job:http_requests_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

      - record: job:http_requests_error_rate:percentage
        expr: |
          (job:http_requests_errors:rate5m / job:http_requests:rate5m) * 100

      # P95 latency
      - record: job:http_request_duration:p95
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

  - name: resource_metrics
    interval: 30s
    rules:
      # CPU utilization percentage
      - record: instance:node_cpu:utilization
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      # Memory utilization percentage
      - record: instance:node_memory:utilization
        expr: |
          100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

      # Disk usage percentage
      - record: instance:node_disk:utilization
        expr: |
          100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)
```

**Reference:** See `references/recording-rules.md`
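The P95 rule relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts rather than from raw observations. The following is a simplified Python sketch of that estimate for classic histograms, assuming positive bucket bounds and a quantile in the range (0, 1]; it is meant to show the linear interpolation idea, not to reproduce every edge case of the PromQL function. The bucket values in the example are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with an le=+Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # how many observations lie at or below the quantile
    for index, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # Quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    # Assume observations are spread evenly inside the selected bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Made-up cumulative counts for le=0.1, 0.5, 1.0 and +Inf.
example = [(0.1, 50), (0.5, 90), (1.0, 98), (float("inf"), 100)]
p95 = histogram_quantile(0.95, example)
```

Because the result is interpolated within a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.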

## Alert Rules

```yaml
# /etc/prometheus/rules/alert_rules.yml
groups:
  - name: availability
    interval: 30s
    rules:
      - alert: ServiceDown
        expr: up{job="my-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} has been down for more than 1 minute"

      - alert: HighErrorRate
        expr: job:http_requests_error_rate:percentage > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate for {{ $labels.job }}"
          description: "Error rate is {{ $value }}% (threshold: 5%)"

      - alert: HighLatency
        expr: job:http_request_duration:p95 > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency for {{ $labels.job }}"
          description: "P95 latency is {{ $value }}s (threshold: 1s)"

  - name: resources
    interval: 1m
    rules:
      - alert: HighCPUUsage
        expr: instance:node_cpu:utilization > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

      - alert: HighMemoryUsage
        expr: instance:node_memory:utilization > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"

      - alert: DiskSpaceLow
        expr: instance:node_disk:utilization > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage is {{ $value }}%"
```
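The `for` clause in these rules means an alert first becomes pending and only fires once its expression has stayed true across evaluations for the whole duration. The Python sketch below models that state progression for a single series; the function name `alert_states` and the sample timeline are illustrative assumptions, and features such as `keep_firing_for` are not modeled.

```python
def alert_states(samples, for_seconds):
    """Return the alert state after each rule evaluation.

    samples: list of (timestamp_seconds, condition_is_true), one per evaluation.
    """
    states = []
    active_since = None  # time the condition first became true
    for timestamp, condition in samples:
        if not condition:
            # Any false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held = timestamp - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# ServiceDown-style timeline: evaluated every 30s with `for: 1m`.
timeline = [(0, True), (30, True), (60, True), (90, False), (120, True)]
states = alert_states(timeline, for_seconds=60)
```

A single recovered evaluation resets the timer, which is why short `for` values make alerts more sensitive to brief blips and long ones delay notification.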

## Validation

```bash
# Validate configuration
promtool check config prometheus.yml

# Validate rules
promtool check rules /etc/prometheus/rules/*.yml

# Test query
promtool query instant http://localhost:9090 'up'
```

**Reference:** See `scripts/validate-prometheus.sh`

## Best Practices

1. **Use consistent naming** for metrics (prefix_name_unit)
2. **Set appropriate scrape intervals** (15-60s typical)
3. **Use recording rules** for expensive queries
4. **Implement high availability** (multiple Prometheus instances)
5. **Configure retention** based on storage capacity
6. **Use relabeling** for metric cleanup
7. **Monitor Prometheus itself**
8. **Implement federation** for large deployments
9. **Use Thanos/Cortex** for long-term storage
10. **Document custom metrics**

## Troubleshooting

**Check scrape targets:**
```bash
curl http://localhost:9090/api/v1/targets
```

**Check configuration:**
```bash
curl http://localhost:9090/api/v1/status/config
```

**Test query:**
```bash
curl 'http://localhost:9090/api/v1/query?query=up'
```
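The raw JSON from the targets endpoint can be long, so it often helps to filter it down to the targets that are failing. This is a minimal standard-library sketch; the helper names `unhealthy_targets` and `fetch_targets` are hypothetical, and the default URL assumes a local server on port 9090 as in the commands above.

```python
import json
import urllib.request


def unhealthy_targets(payload):
    """Return (job, scrape_url, last_error) for active targets that are not up."""
    problems = []
    for target in payload.get("data", {}).get("activeTargets", []):
        if target.get("health") != "up":
            job = target.get("labels", {}).get("job", "")
            problems.append(
                (job, target.get("scrapeUrl", ""), target.get("lastError", ""))
            )
    return problems


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the targets API response from a running Prometheus server."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as response:
        return json.load(response)
```

With a server running, `unhealthy_targets(fetch_targets())` returns one tuple per failing target, including the last scrape error reported by Prometheus.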

## Reference Files

- `assets/prometheus.yml.template` - Complete configuration template
- `references/scrape-configs.md` - Scrape configuration patterns
- `references/recording-rules.md` - Recording rule examples
- `scripts/validate-prometheus.sh` - Validation script

## Related Skills

- `grafana-dashboards` - For visualization
- `slo-implementation` - For SLO monitoring
- `distributed-tracing` - For request tracing