Files
2025-11-29 17:56:41 +08:00

370 lines
8.1 KiB
Markdown

---
name: grafana-dashboards
description: Create and manage production Grafana dashboards for real-time visualization of system and application metrics. Use when building monitoring dashboards, visualizing metrics, or creating operational observability interfaces.
---
# Grafana Dashboards
Create and manage production-ready Grafana dashboards for comprehensive system observability.
## Purpose
Design effective Grafana dashboards for monitoring applications, infrastructure, and business metrics.
## When to Use
- Visualize Prometheus metrics
- Create custom dashboards
- Implement SLO dashboards
- Monitor infrastructure
- Track business KPIs
## Dashboard Design Principles
### 1. Hierarchy of Information
```
┌─────────────────────────────────────┐
│ Critical Metrics (Big Numbers) │
├─────────────────────────────────────┤
│ Key Trends (Time Series) │
├─────────────────────────────────────┤
│ Detailed Metrics (Tables/Heatmaps) │
└─────────────────────────────────────┘
```
### 2. RED Method (Services)
- **Rate** - Requests per second
- **Errors** - Error rate
- **Duration** - Latency/response time
### 3. USE Method (Resources)
- **Utilization** - % time resource is busy
- **Saturation** - Queue length/wait time
- **Errors** - Error count
## Dashboard Structure
### API Monitoring Dashboard
```json
{
"dashboard": {
"title": "API Monitoring",
"tags": ["api", "production"],
"timezone": "browser",
"refresh": "30s",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total[5m])) by (service)",
"legendFormat": "{{service}}"
}
],
"gridPos": {"x": 0, "y": 0, "w": 12, "h": 8}
},
{
"title": "Error Rate %",
"type": "graph",
"targets": [
{
"expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
"legendFormat": "Error Rate"
}
],
"alert": {
"conditions": [
{
"evaluator": {"params": [5], "type": "gt"},
"operator": {"type": "and"},
"query": {"params": ["A", "5m", "now"]},
"type": "query"
}
]
},
"gridPos": {"x": 12, "y": 0, "w": 12, "h": 8}
},
{
"title": "P95 Latency",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
"legendFormat": "{{service}}"
}
],
"gridPos": {"x": 0, "y": 8, "w": 24, "h": 8}
}
]
}
}
```
**Reference:** See `assets/api-dashboard.json`
## Panel Types
### 1. Stat Panel (Single Value)
```json
{
"type": "stat",
"title": "Total Requests",
"targets": [{
"expr": "sum(http_requests_total)"
}],
"options": {
"reduceOptions": {
"values": false,
"calcs": ["lastNotNull"]
},
"orientation": "auto",
"textMode": "auto",
"colorMode": "value"
},
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"value": 0, "color": "green"},
{"value": 80, "color": "yellow"},
{"value": 90, "color": "red"}
]
}
}
}
}
```
### 2. Time Series Graph
```json
{
"type": "graph",
"title": "CPU Usage",
"targets": [{
"expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
}],
"yaxes": [
{"format": "percent", "max": 100, "min": 0},
{"format": "short"}
]
}
```
### 3. Table Panel
```json
{
"type": "table",
"title": "Service Status",
"targets": [{
"expr": "up",
"format": "table",
"instant": true
}],
"transformations": [
{
"id": "organize",
"options": {
"excludeByName": {"Time": true},
"indexByName": {},
"renameByName": {
"instance": "Instance",
"job": "Service",
"Value": "Status"
}
}
}
]
}
```
### 4. Heatmap
```json
{
"type": "heatmap",
"title": "Latency Heatmap",
"targets": [{
"expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)",
"format": "heatmap"
}],
"dataFormat": "tsbuckets",
"yAxis": {
"format": "s"
}
}
```
## Variables
### Query Variables
```json
{
"templating": {
"list": [
{
"name": "namespace",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(kube_pod_info, namespace)",
"refresh": 1,
"multi": false
},
{
"name": "service",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(kube_service_info{namespace=\"$namespace\"}, service)",
"refresh": 1,
"multi": true
}
]
}
}
```
### Use Variables in Queries
```
sum(rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m]))
```
## Alerts in Dashboards
```json
{
"alert": {
"name": "High Error Rate",
"conditions": [
{
"evaluator": {
"params": [5],
"type": "gt"
},
"operator": {"type": "and"},
"query": {
"params": ["A", "5m", "now"]
},
"reducer": {"type": "avg"},
"type": "query"
}
],
"executionErrorState": "alerting",
"for": "5m",
"frequency": "1m",
"message": "Error rate is above 5%",
"noDataState": "no_data",
"notifications": [
{"uid": "slack-channel"}
]
}
}
```
## Dashboard Provisioning
**dashboards.yml:**
```yaml
apiVersion: 1
providers:
- name: 'default'
orgId: 1
folder: 'General'
type: file
disableDeletion: false
updateIntervalSeconds: 10
allowUiUpdates: true
options:
path: /etc/grafana/dashboards
```
## Common Dashboard Patterns
### Infrastructure Dashboard
**Key Panels:**
- CPU utilization per node
- Memory usage per node
- Disk I/O
- Network traffic
- Pod count by namespace
- Node status
**Reference:** See `assets/infrastructure-dashboard.json`
### Database Dashboard
**Key Panels:**
- Queries per second
- Connection pool usage
- Query latency (P50, P95, P99)
- Active connections
- Database size
- Replication lag
- Slow queries
**Reference:** See `assets/database-dashboard.json`
### Application Dashboard
**Key Panels:**
- Request rate
- Error rate
- Response time (percentiles)
- Active users/sessions
- Cache hit rate
- Queue length
## Best Practices
1. **Start with templates** (Grafana community dashboards)
2. **Use consistent naming** for panels and variables
3. **Group related metrics** in rows
4. **Set appropriate time ranges** (default: Last 6 hours)
5. **Use variables** for flexibility
6. **Add panel descriptions** for context
7. **Configure units** correctly
8. **Set meaningful thresholds** for colors
9. **Use consistent colors** across dashboards
10. **Test with different time ranges**
## Dashboard as Code
### Terraform Provisioning
```hcl
resource "grafana_dashboard" "api_monitoring" {
config_json = file("${path.module}/dashboards/api-monitoring.json")
folder = grafana_folder.monitoring.id
}
resource "grafana_folder" "monitoring" {
title = "Production Monitoring"
}
```
### Ansible Provisioning
```yaml
- name: Deploy Grafana dashboards
copy:
src: "{{ item }}"
dest: /etc/grafana/dashboards/
with_fileglob:
- "dashboards/*.json"
notify: restart grafana
```
## Reference Files
- `assets/api-dashboard.json` - API monitoring dashboard
- `assets/infrastructure-dashboard.json` - Infrastructure dashboard
- `assets/database-dashboard.json` - Database monitoring dashboard
- `references/dashboard-design.md` - Dashboard design guide
## Related Skills
- `prometheus-configuration` - For metric collection
- `slo-implementation` - For SLO dashboards