Initial commit
369
skills/grafana-dashboards/SKILL.md
Normal file
@@ -0,0 +1,369 @@
---
name: grafana-dashboards
description: Create and manage production Grafana dashboards for real-time visualization of system and application metrics. Use when building monitoring dashboards, visualizing metrics, or creating operational observability interfaces.
---

# Grafana Dashboards

Create and manage production-ready Grafana dashboards for comprehensive system observability.

## Purpose

Design effective Grafana dashboards for monitoring applications, infrastructure, and business metrics.

## When to Use

- Visualize Prometheus metrics
- Create custom dashboards
- Implement SLO dashboards
- Monitor infrastructure
- Track business KPIs

## Dashboard Design Principles

### 1. Hierarchy of Information
```
┌─────────────────────────────────────┐
│  Critical Metrics (Big Numbers)     │
├─────────────────────────────────────┤
│  Key Trends (Time Series)           │
├─────────────────────────────────────┤
│  Detailed Metrics (Tables/Heatmaps) │
└─────────────────────────────────────┘
```
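
The three tiers above map naturally onto Grafana's 24-column dashboard grid, where each panel carries a `gridPos` with `x`, `y`, `w`, and `h`. The following is a minimal sketch of that mapping; `layout_rows` and the panel titles are hypothetical names chosen for illustration, not part of Grafana.

```python
from typing import Dict, List

GRID_COLUMNS = 24  # Grafana lays panels out on a 24-column grid


def layout_rows(rows: List[List[str]], height: int = 8) -> Dict[str, Dict[str, int]]:
    """Assign a gridPos to every panel title, one list of titles per tier.

    Panels in the same tier share the width equally; tiers stack downward.
    If the panel count does not divide 24, the leftover columns stay empty.
    """
    positions: Dict[str, Dict[str, int]] = {}
    y = 0
    for row in rows:
        if not row:
            continue  # skip empty tiers instead of dividing by zero
        width = GRID_COLUMNS // len(row)
        for index, title in enumerate(row):
            positions[title] = {"x": index * width, "y": y, "w": width, "h": height}
        y += height
    return positions


layout = layout_rows([
    ["Requests/s", "Error %", "P95 Latency"],  # critical metrics (big numbers)
    ["Request Rate", "Error Rate"],            # key trends (time series)
    ["Latency Heatmap"],                       # detailed metrics
])
print(layout["Latency Heatmap"])
```

Each returned dictionary can be dropped into a panel's `gridPos` field as-is.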

### 2. RED Method (Services)
- **Rate** - Requests per second
- **Errors** - Error rate
- **Duration** - Latency/response time

### 3. USE Method (Resources)
- **Utilization** - % time resource is busy
- **Saturation** - Queue length/wait time
- **Errors** - Error count
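
As a rough illustration, each RED and USE signal can become its own panel. The RED expressions below are the same PromQL used in the API monitoring dashboard in this document; the USE expressions assume node_exporter metrics (`node_load1`, `node_network_receive_errs_total`), and `signal_panels` is a hypothetical helper rather than anything Grafana ships.

```python
from typing import Dict, List

# RED signals for a service (same PromQL as the API monitoring dashboard).
RED_QUERIES: Dict[str, str] = {
    "Rate": 'sum(rate(http_requests_total[5m])) by (service)',
    "Errors": '(sum(rate(http_requests_total{status=~"5.."}[5m])) '
              '/ sum(rate(http_requests_total[5m]))) * 100',
    "Duration": 'histogram_quantile(0.95, '
                'sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))',
}

# USE signals for a host. Assumption: metrics are exposed by node_exporter.
USE_QUERIES: Dict[str, str] = {
    "Utilization": '100 - (avg by (instance) '
                   '(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)',
    "Saturation": 'node_load1',
    "Errors": 'rate(node_network_receive_errs_total[5m])',
}


def signal_panels(method: str, queries: Dict[str, str]) -> List[dict]:
    """Build one time series panel per signal (hypothetical helper)."""
    return [
        {
            "title": f"{method}: {signal}",
            "type": "timeseries",
            "targets": [{"refId": "A", "expr": expr}],
        }
        for signal, expr in queries.items()
    ]


panels = signal_panels("RED", RED_QUERIES) + signal_panels("USE", USE_QUERIES)
print([panel["title"] for panel in panels])
```

Keeping one signal per panel avoids mixing units (requests per second, percent, seconds) on a single axis.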

## Dashboard Structure

### API Monitoring Dashboard

```json
{
  "dashboard": {
    "title": "API Monitoring",
    "tags": ["api", "production"],
    "timezone": "browser",
    "refresh": "30s",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8}
      },
      {
        "title": "Error Rate %",
        "type": "graph",
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
            "legendFormat": "Error Rate"
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": {"params": [5], "type": "gt"},
              "operator": {"type": "and"},
              "query": {"params": ["A", "5m", "now"]},
              "type": "query"
            }
          ]
        },
        "gridPos": {"x": 12, "y": 0, "w": 12, "h": 8}
      },
      {
        "title": "P95 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": {"x": 0, "y": 8, "w": 24, "h": 8}
      }
    ]
  }
}
```

**Reference:** See `assets/api-dashboard.json`

## Panel Types

### 1. Stat Panel (Single Value)
```json
{
  "type": "stat",
  "title": "Total Requests",
  "targets": [{
    "expr": "sum(http_requests_total)"
  }],
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": ["lastNotNull"]
    },
    "orientation": "auto",
    "textMode": "auto",
    "colorMode": "value"
  },
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {"value": 0, "color": "green"},
          {"value": 80, "color": "yellow"},
          {"value": 90, "color": "red"}
        ]
      }
    }
  }
}
```
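
To make the threshold steps in the stat panel concrete, here is a small sketch of how a value could be mapped to a color. It assumes the steps are listed in ascending order and treats each step's `value` as an inclusive lower bound; `threshold_color` is a hypothetical helper, not a Grafana API.

```python
from typing import List, Optional


def threshold_color(value: float, steps: List[dict]) -> Optional[str]:
    """Return the color of the last step whose lower bound the value reaches.

    A step without a "value" (or with None) is treated as the base step
    that matches everything.
    """
    color = None
    for step in steps:
        lower_bound = step.get("value")
        if lower_bound is None or value >= lower_bound:
            color = step["color"]
    return color


# Same steps as the stat panel example.
steps = [
    {"value": 0, "color": "green"},
    {"value": 80, "color": "yellow"},
    {"value": 90, "color": "red"},
]

for sample in (50, 85, 95):
    print(sample, threshold_color(sample, steps))
```

With `"mode": "absolute"` the comparison is made against the raw value, so thresholds of 80 and 90 only make sense when the query returns a number on that scale, such as a percentage.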

### 2. Time Series Graph
```json
{
  "type": "graph",
  "title": "CPU Usage",
  "targets": [{
    "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
  }],
  "yaxes": [
    {"format": "percent", "max": 100, "min": 0},
    {"format": "short"}
  ]
}
```
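
Recent Grafana releases render time series with the `timeseries` panel type, where the unit and axis limits live under `fieldConfig.defaults` instead of `yaxes`. The sketch below shows one way the panel above could be translated. It is a simplified assumption that only carries over the left axis; Grafana performs its own, more complete migration when it opens a legacy panel, and `graph_to_timeseries` is a hypothetical name.

```python
import copy


def graph_to_timeseries(panel: dict) -> dict:
    """Return a copy of a legacy graph panel expressed as a timeseries panel.

    Only the left Y axis (yaxes[0]) is translated: format -> unit,
    min -> min, max -> max. Everything else is copied unchanged.
    """
    converted = copy.deepcopy(panel)
    converted["type"] = "timeseries"

    yaxes = converted.pop("yaxes", [])
    left_axis = yaxes[0] if yaxes else {}

    field_config = converted.setdefault("fieldConfig", {})
    defaults = field_config.setdefault("defaults", {})
    field_config.setdefault("overrides", [])

    for legacy_key, new_key in (("format", "unit"), ("min", "min"), ("max", "max")):
        if left_axis.get(legacy_key) is not None:
            defaults[new_key] = left_axis[legacy_key]
    return converted


legacy_panel = {
    "type": "graph",
    "title": "CPU Usage",
    "targets": [{
        "expr": '100 - (avg by (instance) '
                '(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
    }],
    "yaxes": [
        {"format": "percent", "max": 100, "min": 0},
        {"format": "short"},
    ],
}

modern_panel = graph_to_timeseries(legacy_panel)
print(modern_panel["fieldConfig"]["defaults"])
```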

### 3. Table Panel
```json
{
  "type": "table",
  "title": "Service Status",
  "targets": [{
    "expr": "up",
    "format": "table",
    "instant": true
  }],
  "transformations": [
    {
      "id": "organize",
      "options": {
        "excludeByName": {"Time": true},
        "indexByName": {},
        "renameByName": {
          "instance": "Instance",
          "job": "Service",
          "Value": "Status"
        }
      }
    }
  ]
}
```
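
The effect of the `organize` transformation on the query result can be pictured with a short simulation. This is only a sketch of the hiding and renaming behavior on rows represented as dictionaries; it ignores `indexByName` (column ordering), and the sample row is made-up data.

```python
from typing import Dict, List


def organize(rows: List[dict], exclude_by_name: Dict[str, bool],
             rename_by_name: Dict[str, str]) -> List[dict]:
    """Drop excluded columns and rename the remaining ones, row by row."""
    organized = []
    for row in rows:
        organized.append({
            rename_by_name.get(name, name): value
            for name, value in row.items()
            if not exclude_by_name.get(name, False)
        })
    return organized


# Hypothetical instant-query result for `up`, formatted as a table.
rows = [
    {"Time": 1700000000000, "instance": "10.0.0.1:9100", "job": "node", "Value": 1},
]

result = organize(
    rows,
    exclude_by_name={"Time": True},
    rename_by_name={"instance": "Instance", "job": "Service", "Value": "Status"},
)
print(result)
```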

### 4. Heatmap
```json
{
  "type": "heatmap",
  "title": "Latency Heatmap",
  "targets": [{
    "expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)",
    "format": "heatmap"
  }],
  "dataFormat": "tsbuckets",
  "yAxis": {
    "format": "s"
  }
}
```

## Variables

### Query Variables
```json
{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_pod_info, namespace)",
        "refresh": 1,
        "multi": false
      },
      {
        "name": "service",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_service_info{namespace=\"$namespace\"}, service)",
        "refresh": 1,
        "multi": true
      }
    ]
  }
}
```

### Use Variables in Queries
```
sum(rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m]))
```
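
The query above uses `=` for the single-value `$namespace` and `=~` for the multi-value `$service`, because a multi-value selection is expanded into a regex alternation. The sketch below approximates that expansion; the exact escaping rules are Grafana's, and `interpolate` plus the sample values are hypothetical.

```python
import re
from typing import Dict, List, Union

VariableValue = Union[str, List[str]]


def interpolate(query: str, variables: Dict[str, VariableValue]) -> str:
    """Replace $name placeholders; lists become a (a|b) regex alternation."""

    def render(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in variables:
            return match.group(0)  # leave unknown variables untouched
        value = variables[name]
        if isinstance(value, list):
            return "(" + "|".join(re.escape(item) for item in value) + ")"
        return value

    return re.sub(r"\$(\w+)", render, query)


query = 'sum(rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m]))'
expanded = interpolate(query, {"namespace": "prod", "service": ["checkout", "payments"]})
print(expanded)
```

Using `=` with a multi-value variable would compare the label against the literal alternation string and match nothing, which is why the regex matcher is needed.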

## Alerts in Dashboards

The panel-level `alert` block below is the legacy dashboard alerting format. Grafana 11 removed legacy alerting in favor of Grafana-managed alert rules, so this format applies only to older versions.

```json
{
  "alert": {
    "name": "High Error Rate",
    "conditions": [
      {
        "evaluator": {
          "params": [5],
          "type": "gt"
        },
        "operator": {"type": "and"},
        "query": {
          "params": ["A", "5m", "now"]
        },
        "reducer": {"type": "avg"},
        "type": "query"
      }
    ],
    "executionErrorState": "alerting",
    "for": "5m",
    "frequency": "1m",
    "message": "Error rate is above 5%",
    "noDataState": "no_data",
    "notifications": [
      {"uid": "slack-channel"}
    ]
  }
}
```
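
Read literally, the condition above says: take query `A` over the last 5 minutes, reduce the samples with `avg`, and fire when the result is greater than 5. A minimal sketch of that evaluation follows. It covers only a few reducer and evaluator types, does not model the `for` duration, and `condition_fires` is a hypothetical helper.

```python
from statistics import mean
from typing import Callable, Dict, List

# Subset of reducer and evaluator types, enough for the example condition.
REDUCERS: Dict[str, Callable[[List[float]], float]] = {
    "avg": mean,
    "min": min,
    "max": max,
    "sum": sum,
    "last": lambda values: values[-1],
}
EVALUATORS: Dict[str, Callable[[float, List[float]], bool]] = {
    "gt": lambda reduced, params: reduced > params[0],
    "lt": lambda reduced, params: reduced < params[0],
}


def condition_fires(condition: dict, samples: List[float]) -> bool:
    """Reduce the samples, then compare the result against the threshold."""
    if not samples:
        return False  # no data is handled separately (noDataState)
    reduced = REDUCERS[condition["reducer"]["type"]](samples)
    evaluator = condition["evaluator"]
    return EVALUATORS[evaluator["type"]](reduced, evaluator["params"])


condition = {
    "evaluator": {"params": [5], "type": "gt"},
    "reducer": {"type": "avg"},
    "type": "query",
}

print(condition_fires(condition, [4.0, 6.5, 7.0]))  # average is above 5
print(condition_fires(condition, [1.0, 2.0, 3.0]))  # average is below 5
```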

## Dashboard Provisioning

**dashboards.yml:**
```yaml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: 'General'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/dashboards
```
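
One detail worth checking before copying files into the provider path: the API monitoring example in this document wraps the model in a top-level `"dashboard"` key, which is the shape Grafana's HTTP API accepts, whereas a file provider reads the dashboard model itself. The sketch below unwraps such a payload; `to_provisioning_model` is a hypothetical helper.

```python
import json


def to_provisioning_model(payload: dict) -> dict:
    """Return the bare dashboard model, unwrapping {"dashboard": {...}} if present."""
    inner = payload.get("dashboard")
    if isinstance(inner, dict):
        return inner
    return payload  # already a bare model


api_payload = {
    "dashboard": {
        "title": "API Monitoring",
        "tags": ["api", "production"],
        "panels": [],
    }
}

model = to_provisioning_model(api_payload)
# This JSON text is what would be saved under the provider's `path`.
print(json.dumps(model, indent=2))
```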

## Common Dashboard Patterns

### Infrastructure Dashboard

**Key Panels:**
- CPU utilization per node
- Memory usage per node
- Disk I/O
- Network traffic
- Pod count by namespace
- Node status

**Reference:** See `assets/infrastructure-dashboard.json`

### Database Dashboard

**Key Panels:**
- Queries per second
- Connection pool usage
- Query latency (P50, P95, P99)
- Active connections
- Database size
- Replication lag
- Slow queries

**Reference:** See `assets/database-dashboard.json`

### Application Dashboard

**Key Panels:**
- Request rate
- Error rate
- Response time (percentiles)
- Active users/sessions
- Cache hit rate
- Queue length

## Best Practices

1. **Start with templates** (Grafana community dashboards)
2. **Use consistent naming** for panels and variables
3. **Group related metrics** in rows
4. **Set appropriate time ranges** (default: Last 6 hours)
5. **Use variables** for flexibility
6. **Add panel descriptions** for context
7. **Configure units** correctly
8. **Set meaningful thresholds** for colors
9. **Use consistent colors** across dashboards
10. **Test with different time ranges**
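
Some of these practices can be checked mechanically before a dashboard is committed. The following sketch flags a missing default time range, missing panel descriptions, and missing units, reading the `time`, `description`, and `fieldConfig.defaults.unit` fields of the dashboard model. `lint_dashboard` and the sample dashboard are hypothetical.

```python
from typing import List


def lint_dashboard(dashboard: dict) -> List[str]:
    """Return human-readable findings for a bare dashboard model."""
    problems: List[str] = []
    if "time" not in dashboard:
        problems.append("dashboard: no default time range set")
    for panel in dashboard.get("panels", []):
        title = panel.get("title", "<untitled>")
        if not panel.get("description"):
            problems.append(f"{title}: missing description")
        unit = panel.get("fieldConfig", {}).get("defaults", {}).get("unit")
        if not unit:
            problems.append(f"{title}: no unit configured")
    return problems


dashboard = {
    "title": "API Monitoring",
    "time": {"from": "now-6h", "to": "now"},  # practice 4: Last 6 hours
    "panels": [
        {
            "title": "P95 Latency",
            "description": "95th percentile request latency per service.",
            "fieldConfig": {"defaults": {"unit": "s"}},
        },
        {"title": "Request Rate"},  # no description, no unit
    ],
}

for problem in lint_dashboard(dashboard):
    print(problem)
```

A check like this fits well in CI next to the Terraform or Ansible deployment steps, so that incomplete panels are caught before they reach production.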

## Dashboard as Code

### Terraform Provisioning

```hcl
resource "grafana_dashboard" "api_monitoring" {
  config_json = file("${path.module}/dashboards/api-monitoring.json")
  folder      = grafana_folder.monitoring.id
}

resource "grafana_folder" "monitoring" {
  title = "Production Monitoring"
}
```

### Ansible Provisioning

```yaml
- name: Deploy Grafana dashboards
  copy:
    src: "{{ item }}"
    dest: /etc/grafana/dashboards/
  with_fileglob:
    - "dashboards/*.json"
  notify: restart grafana
```

## Reference Files

- `assets/api-dashboard.json` - API monitoring dashboard
- `assets/infrastructure-dashboard.json` - Infrastructure dashboard
- `assets/database-dashboard.json` - Database monitoring dashboard
- `references/dashboard-design.md` - Dashboard design guide

## Related Skills

- `prometheus-configuration` - For metric collection
- `slo-implementation` - For SLO dashboards