---
name: grafana-dashboards
description: Create and manage production Grafana dashboards for real-time visualization of system and application metrics. Use when building monitoring dashboards, visualizing metrics, or creating operational observability interfaces.
---

# Grafana Dashboards

Create and manage production-ready Grafana dashboards for comprehensive system observability.

## Purpose

Design effective Grafana dashboards for monitoring applications, infrastructure, and business metrics.

## When to Use

- Visualize Prometheus metrics
- Create custom dashboards
- Implement SLO dashboards
- Monitor infrastructure
- Track business KPIs

## Dashboard Design Principles

### 1. Hierarchy of Information
```
┌─────────────────────────────────────┐
│  Critical Metrics (Big Numbers)     │
├─────────────────────────────────────┤
│  Key Trends (Time Series)           │
├─────────────────────────────────────┤
│  Detailed Metrics (Tables/Heatmaps) │
└─────────────────────────────────────┘
```
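
The top-to-bottom hierarchy maps onto each panel's `gridPos` on Grafana's 24-column grid. The following is a minimal sketch of that mapping, assuming a hypothetical helper `layout_rows` that splits each row evenly; the row heights and panel titles are illustrative only.

```python
GRID_WIDTH = 24  # Grafana lays panels out on a 24-column grid


def layout_rows(rows):
    """Assign a gridPos to every panel, one (height, panels) tuple per row.

    Rows are ordered from the most critical (top) to the most detailed
    (bottom). Each row's panels share the full grid width evenly.
    """
    placed = []
    y = 0
    for height, panels in rows:
        if not panels:
            continue  # skip empty rows instead of dividing by zero
        width = GRID_WIDTH // len(panels)
        for index, panel in enumerate(panels):
            grid_pos = {"x": index * width, "y": y, "w": width, "h": height}
            placed.append({**panel, "gridPos": grid_pos})
        y += height
    return placed


panels = layout_rows([
    # Critical metrics: big numbers across the top
    (4, [{"title": "Error Rate %", "type": "stat"},
         {"title": "P95 Latency", "type": "stat"}]),
    # Key trends: a full-width time series
    (8, [{"title": "Request Rate", "type": "graph"}]),
    # Detailed metrics: a full-width table
    (8, [{"title": "Service Status", "type": "table"}]),
])
```

The returned list can be used as the `panels` array of a dashboard model.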

### 2. RED Method (Services)
- **Rate** - Requests per second
- **Errors** - Error rate
- **Duration** - Latency/response time

### 3. USE Method (Resources)
- **Utilization** - % time resource is busy
- **Saturation** - Queue length/wait time
- **Errors** - Error count
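
For a service, the three RED signals translate fairly directly into PromQL. The sketch below builds the expressions and matching panel targets in Python. It assumes the service exports `http_requests_total` (with a `status` label) and `http_request_duration_seconds_bucket`, the metric names used elsewhere in this document; `red_queries` and `red_targets` are hypothetical helper names. The error signal here is a per-service ratio between 0 and 1.

```python
def red_queries(window="5m", by="service"):
    """Return PromQL expressions for Rate, Errors, and Duration (P95)."""
    requests = f"rate(http_requests_total[{window}])"
    errors = f'rate(http_requests_total{{status=~"5.."}}[{window}])'
    buckets = f"rate(http_request_duration_seconds_bucket[{window}])"
    return {
        "rate": f"sum({requests}) by ({by})",
        "errors": f"sum({errors}) by ({by}) / sum({requests}) by ({by})",
        "duration": f"histogram_quantile(0.95, sum({buckets}) by (le, {by}))",
    }


def red_targets(window="5m", by="service"):
    """Wrap each RED expression in a Grafana query target."""
    legend = "{{" + by + "}}"  # Grafana legend template, e.g. {{service}}
    return [
        {"expr": expr, "legendFormat": legend}
        for expr in red_queries(window, by).values()
    ]
```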

## Dashboard Structure

### API Monitoring Dashboard

```json
{
  "dashboard": {
    "title": "API Monitoring",
    "tags": ["api", "production"],
    "timezone": "browser",
    "refresh": "30s",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8}
      },
      {
        "title": "Error Rate %",
        "type": "graph",
        "targets": [
          {
            "expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
            "legendFormat": "Error Rate"
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": {"params": [5], "type": "gt"},
              "operator": {"type": "and"},
              "query": {"params": ["A", "5m", "now"]},
              "type": "query"
            }
          ]
        },
        "gridPos": {"x": 12, "y": 0, "w": 12, "h": 8}
      },
      {
        "title": "P95 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": {"x": 0, "y": 8, "w": 24, "h": 8}
      }
    ]
  }
}
```

**Reference:** See `assets/api-dashboard.json`
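
The P95 Latency panel relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The Python sketch below is a simplified model of that behaviour for classic histograms, not the Prometheus implementation; it assumes `0 < q <= 1`, buckets sorted by upper bound, a final `+Inf` bucket, and a lower bound of 0 for the first bucket. The bucket values are made up for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs."""
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# 100 requests: 50 under 0.1s, 90 under 0.5s, all under 1s
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, latency_buckets)
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.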

## Panel Types

### 1. Stat Panel (Single Value)
```json
{
  "type": "stat",
  "title": "Total Requests",
  "targets": [{
    "expr": "sum(http_requests_total)"
  }],
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": ["lastNotNull"]
    },
    "orientation": "auto",
    "textMode": "auto",
    "colorMode": "value"
  },
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {"value": 0, "color": "green"},
          {"value": 80, "color": "yellow"},
          {"value": 90, "color": "red"}
        ]
      }
    }
  }
}
```
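
With `"mode": "absolute"`, each threshold step acts as a lower bound, and the panel takes the color of the highest step the value has reached. A minimal Python sketch of that selection, assuming the steps are listed in ascending order as above (`threshold_color` is a hypothetical helper, not a Grafana API):

```python
def threshold_color(steps, value):
    """Return the color of the highest step whose bound the value reaches."""
    color = steps[0]["color"]  # the first step is the base color
    for step in steps:
        if value >= step["value"]:
            color = step["color"]
    return color


steps = [
    {"value": 0, "color": "green"},
    {"value": 80, "color": "yellow"},
    {"value": 90, "color": "red"},
]
```

Here a value of 85 would render yellow and a value of 90 or more would render red.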

### 2. Time Series Graph
```json
{
  "type": "graph",
  "title": "CPU Usage",
  "targets": [{
    "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
  }],
  "yaxes": [
    {"format": "percent", "max": 100, "min": 0},
    {"format": "short"}
  ]
}
```

### 3. Table Panel
```json
{
  "type": "table",
  "title": "Service Status",
  "targets": [{
    "expr": "up",
    "format": "table",
    "instant": true
  }],
  "transformations": [
    {
      "id": "organize",
      "options": {
        "excludeByName": {"Time": true},
        "indexByName": {},
        "renameByName": {
          "instance": "Instance",
          "job": "Service",
          "Value": "Status"
        }
      }
    }
  ]
}
```

### 4. Heatmap
```json
{
  "type": "heatmap",
  "title": "Latency Heatmap",
  "targets": [{
    "expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)",
    "format": "heatmap"
  }],
  "dataFormat": "tsbuckets",
  "yAxis": {
    "format": "s"
  }
}
```
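
Prometheus `_bucket` series are cumulative: each `le` bucket counts every observation at or below its bound. A heatmap needs the count that falls into each individual bucket, which is the conversion the `"format": "heatmap"` query option asks Grafana to perform. The Python sketch below illustrates the arithmetic only; `per_bucket_counts` is a hypothetical name and the sample numbers are invented.

```python
def per_bucket_counts(cumulative):
    """Convert cumulative (le, count) pairs, sorted by le, to per-bucket counts."""
    result = []
    previous = 0
    for le, count in cumulative:
        # Observations in this bucket alone = this count minus the one below it
        result.append((le, count - previous))
        previous = count
    return result


cells = per_bucket_counts([(0.1, 50), (0.5, 90), (1.0, 100)])
```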

## Variables

### Query Variables
```json
{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_pod_info, namespace)",
        "refresh": 1,
        "multi": false
      },
      {
        "name": "service",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_service_info{namespace=\"$namespace\"}, service)",
        "refresh": 1,
        "multi": true
      }
    ]
  }
}
```

### Use Variables in Queries
```
sum(rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m]))
```
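
The query uses `=` for `$namespace` but `=~` for `$service` because a multi-value variable expands to a regex alternation such as `(checkout|cart)` when sent to Prometheus. The Python sketch below approximates that substitution to show why the regex matcher is needed. It is an illustration under assumptions, not Grafana's interpolation code: `interpolate` is a hypothetical helper, the service names are invented, and Python's `re.escape` only stands in for Grafana's own escaping rules.

```python
import re


def interpolate(expr, variables):
    """Replace $name tokens; lists become a regex alternation."""
    def render(match):
        value = variables[match.group(1)]
        if isinstance(value, list):
            # Multi-value selection: join escaped values as (a|b|c)
            return "(" + "|".join(re.escape(v) for v in value) + ")"
        return value

    return re.sub(r"\$(\w+)", render, expr)


query = interpolate(
    'sum(rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m]))',
    {"namespace": "prod", "service": ["checkout", "cart"]},
)
```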

## Alerts in Dashboards

```json
{
  "alert": {
    "name": "High Error Rate",
    "conditions": [
      {
        "evaluator": {
          "params": [5],
          "type": "gt"
        },
        "operator": {"type": "and"},
        "query": {
          "params": ["A", "5m", "now"]
        },
        "reducer": {"type": "avg"},
        "type": "query"
      }
    ],
    "executionErrorState": "alerting",
    "for": "5m",
    "frequency": "1m",
    "message": "Error rate is above 5%",
    "noDataState": "no_data",
    "notifications": [
      {"uid": "slack-channel"}
    ]
  }
}
```
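
Read together, the fields say: every `frequency` (1m), reduce the last 5 minutes of query `A` with `avg`, compare the result against the `gt` threshold of 5, and only notify once the condition has held for the `for` duration. The Python sketch below is a simplified model of that evaluation, assuming one boolean result per tick; the function names are hypothetical and the exact pending-to-alerting timing in Grafana may differ.

```python
def condition_fires(values, threshold=5.0):
    """avg reducer followed by a gt evaluator; None means no data."""
    if not values:
        return None
    return sum(values) / len(values) > threshold


def alert_state(results, frequency_s=60, for_s=300):
    """Walk per-tick condition results (oldest first) and return the state."""
    firing_for = None  # seconds the condition has been continuously true
    state = "ok"
    for fired in results:
        if not fired:
            firing_for = None
            state = "ok"
            continue
        firing_for = 0 if firing_for is None else firing_for + frequency_s
        # Stay pending until the condition has held for the full `for` window
        state = "alerting" if firing_for >= for_s else "pending"
    return state
```

In this model a single spike above 5% leaves the alert in `pending`, which is the point of setting `for`: short blips do not page anyone.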

## Dashboard Provisioning

**dashboards.yml:**
```yaml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: 'General'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/dashboards
```
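
Files placed under `options.path` should contain the dashboard model itself. The `{"dashboard": {...}}` envelope shown in the API Monitoring example is the shape used when posting to the HTTP API, so it needs unwrapping before the file is written. A minimal Python sketch, assuming hypothetical helper names and a caller-supplied target directory:

```python
import json
from pathlib import Path


def unwrap_dashboard(document):
    """Return the bare dashboard model, dropping an API-style envelope."""
    inner = document.get("dashboard")
    if isinstance(inner, dict):
        return inner
    return document


def write_for_provisioning(document, directory, filename):
    """Write the bare dashboard model as JSON into the provider's path."""
    path = Path(directory) / filename
    path.write_text(json.dumps(unwrap_dashboard(document), indent=2))
    return path
```

For the provider above, `directory` would be `/etc/grafana/dashboards`.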

## Common Dashboard Patterns

### Infrastructure Dashboard

**Key Panels:**
- CPU utilization per node
- Memory usage per node
- Disk I/O
- Network traffic
- Pod count by namespace
- Node status

**Reference:** See `assets/infrastructure-dashboard.json`

### Database Dashboard

**Key Panels:**
- Queries per second
- Connection pool usage
- Query latency (P50, P95, P99)
- Active connections
- Database size
- Replication lag
- Slow queries

**Reference:** See `assets/database-dashboard.json`

### Application Dashboard

**Key Panels:**
- Request rate
- Error rate
- Response time (percentiles)
- Active users/sessions
- Cache hit rate
- Queue length

## Best Practices

1. **Start with templates** (Grafana community dashboards)
2. **Use consistent naming** for panels and variables
3. **Group related metrics** in rows
4. **Set appropriate time ranges** (default: Last 6 hours)
5. **Use variables** for flexibility
6. **Add panel descriptions** for context
7. **Configure units** correctly
8. **Set meaningful thresholds** for colors
9. **Use consistent colors** across dashboards
10. **Test with different time ranges**
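
Some of these practices can be checked mechanically before a dashboard is committed. The Python sketch below flags panels without a description or a unit. It is a hypothetical lint, not a Grafana tool, and it only looks at `fieldConfig.defaults.unit`, so legacy graph panels that set units through `yaxes` would be reported as well.

```python
def lint_panels(panels):
    """Return human-readable findings for panels missing context or units."""
    findings = []
    for panel in panels:
        title = panel.get("title", "<untitled>")
        if not panel.get("description"):
            findings.append(f"{title}: missing description")
        unit = panel.get("fieldConfig", {}).get("defaults", {}).get("unit")
        if not unit:
            findings.append(f"{title}: no unit configured")
    return findings


findings = lint_panels([
    {"title": "Request Rate", "description": "Requests per second by service",
     "fieldConfig": {"defaults": {"unit": "reqps"}}},
    {"title": "P95 Latency"},
])
```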

## Dashboard as Code

### Terraform Provisioning

```hcl
resource "grafana_dashboard" "api_monitoring" {
  config_json = file("${path.module}/dashboards/api-monitoring.json")
  folder      = grafana_folder.monitoring.id
}

resource "grafana_folder" "monitoring" {
  title = "Production Monitoring"
}
```

### Ansible Provisioning

```yaml
- name: Deploy Grafana dashboards
  copy:
    src: "{{ item }}"
    dest: /etc/grafana/dashboards/
  with_fileglob:
    - "dashboards/*.json"
  notify: restart grafana
```

## Reference Files

- `assets/api-dashboard.json` - API monitoring dashboard
- `assets/infrastructure-dashboard.json` - Infrastructure dashboard
- `assets/database-dashboard.json` - Database monitoring dashboard
- `references/dashboard-design.md` - Dashboard design guide

## Related Skills

- `prometheus-configuration` - For metric collection
- `slo-implementation` - For SLO dashboards