Initial commit

2025-11-29 17:56:41 +08:00
commit 9427ed1eea
40 changed files with 15189 additions and 0 deletions
--- a/skills/grafana-dashboards/SKILL.md
+++ b/skills/grafana-dashboards/SKILL.md
@@ -0,0 +1,369 @@
+---
+name: grafana-dashboards
+description: Create and manage production Grafana dashboards for real-time visualization of system and application metrics. Use when building monitoring dashboards, visualizing metrics, or creating operational observability interfaces.
+---
+
+# Grafana Dashboards
+
+Create and manage production-ready Grafana dashboards for comprehensive system observability.
+
+## Purpose
+
+Design effective Grafana dashboards for monitoring applications, infrastructure, and business metrics.
+
+## When to Use
+
+- Visualize Prometheus metrics
+- Create custom dashboards
+- Implement SLO dashboards
+- Monitor infrastructure
+- Track business KPIs
+
+## Dashboard Design Principles
+
+### 1. Hierarchy of Information
+```
+┌─────────────────────────────────────┐
+│  Critical Metrics (Big Numbers)     │
+├─────────────────────────────────────┤
+│  Key Trends (Time Series)           │
+├─────────────────────────────────────┤
+│  Detailed Metrics (Tables/Heatmaps) │
+└─────────────────────────────────────┘
+```
+
+### 2. RED Method (Services)
+- **Rate** - Requests per second
+- **Errors** - Failed requests per second, or as a share of all requests
+- **Duration** - Distribution of response times (for example P50, P95, P99)
+
+### 3. USE Method (Resources)
+- **Utilization** - Percentage of time the resource is busy
+- **Saturation** - Work the resource cannot serve yet (queue length, wait time)
+- **Errors** - Count of error events
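+
+A minimal sketch of how the USE signals could map to panels for CPU and network on a Linux host. It assumes node_exporter metrics are scraped and a Grafana version that ships the `timeseries` panel; the panel titles are illustrative, not part of any bundled asset.
+
+```json
+{
+  "panels": [
+    {
+      "title": "CPU Utilization %",
+      "type": "timeseries",
+      "targets": [{
+        "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
+        "legendFormat": "{{instance}}"
+      }]
+    },
+    {
+      "title": "CPU Saturation (1m load per core)",
+      "type": "timeseries",
+      "targets": [{
+        "expr": "node_load1 / on (instance) count by (instance) (node_cpu_seconds_total{mode=\"idle\"})",
+        "legendFormat": "{{instance}}"
+      }]
+    },
+    {
+      "title": "Network Receive Errors/s",
+      "type": "timeseries",
+      "targets": [{
+        "expr": "sum by (instance) (rate(node_network_receive_errs_total[5m]))",
+        "legendFormat": "{{instance}}"
+      }]
+    }
+  ]
+}
+```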
+
+## Dashboard Structure
+
+### API Monitoring Dashboard
+
+```json
+{
+  "dashboard": {
+    "title": "API Monitoring",
+    "tags": ["api", "production"],
+    "timezone": "browser",
+    "refresh": "30s",
+    "panels": [
+      {
+        "title": "Request Rate",
+        "type": "graph",
+        "targets": [
+          {
+            "expr": "sum(rate(http_requests_total[5m])) by (service)",
+            "legendFormat": "{{service}}"
+          }
+        ],
+        "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8}
+      },
+      {
+        "title": "Error Rate %",
+        "type": "graph",
+        "targets": [
+          {
+            "expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100",
+            "legendFormat": "Error Rate"
+          }
+        ],
+        "alert": {
+          "conditions": [
+            {
+              "evaluator": {"params": [5], "type": "gt"},
+              "operator": {"type": "and"},
+              "query": {"params": ["A", "5m", "now"]},
+              "reducer": {"type": "avg"},
+              "type": "query"
+            }
+          ]
+        },
+        "gridPos": {"x": 12, "y": 0, "w": 12, "h": 8}
+      },
+      {
+        "title": "P95 Latency",
+        "type": "graph",
+        "targets": [
+          {
+            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))",
+            "legendFormat": "{{service}}"
+          }
+        ],
+        "gridPos": {"x": 0, "y": 8, "w": 24, "h": 8}
+      }
+    ]
+  }
+}
+```
+
+**Reference:** See `assets/api-dashboard.json`
+
+## Panel Types
+
+### 1. Stat Panel (Single Value)
+```json
+{
+  "type": "stat",
+  "title": "Total Requests",
+  "targets": [{
+    "expr": "sum(http_requests_total)"
+  }],
+  "options": {
+    "reduceOptions": {
+      "values": false,
+      "calcs": ["lastNotNull"]
+    },
+    "orientation": "auto",
+    "textMode": "auto",
+    "colorMode": "value"
+  },
+  "fieldConfig": {
+    "defaults": {
+      "thresholds": {
+        "mode": "absolute",
+        "steps": [
+          {"value": 0, "color": "green"},
+          {"value": 80, "color": "yellow"},
+          {"value": 90, "color": "red"}
+        ]
+      }
+    }
+  }
+}
+```
+
+### 2. Time Series Graph
+```json
+{
+  "type": "timeseries",
+  "title": "CPU Usage",
+  "targets": [{
+    "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
+    "legendFormat": "{{instance}}"
+  }],
+  "fieldConfig": {
+    "defaults": {"unit": "percent", "min": 0, "max": 100}
+  }
+}
+```
+
+### 3. Table Panel
+```json
+{
+  "type": "table",
+  "title": "Service Status",
+  "targets": [{
+    "expr": "up",
+    "format": "table",
+    "instant": true
+  }],
+  "transformations": [
+    {
+      "id": "organize",
+      "options": {
+        "excludeByName": {"Time": true},
+        "indexByName": {},
+        "renameByName": {
+          "instance": "Instance",
+          "job": "Service",
+          "Value": "Status"
+        }
+      }
+    }
+  ]
+}
+```
+
+### 4. Heatmap
+```json
+{
+  "type": "heatmap",
+  "title": "Latency Heatmap",
+  "targets": [{
+    "expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)",
+    "format": "heatmap"
+  }],
+  "dataFormat": "tsbuckets",
+  "yAxis": {
+    "format": "s"
+  }
+}
+```
+
+## Variables
+
+### Query Variables
+```json
+{
+  "templating": {
+    "list": [
+      {
+        "name": "namespace",
+        "type": "query",
+        "datasource": "Prometheus",
+        "query": "label_values(kube_pod_info, namespace)",
+        "refresh": 1,
+        "multi": false
+      },
+      {
+        "name": "service",
+        "type": "query",
+        "datasource": "Prometheus",
+        "query": "label_values(kube_service_info{namespace=\"$namespace\"}, service)",
+        "refresh": 1,
+        "multi": true
+      }
+    ]
+  }
+}
+```
+
+### Use Variables in Queries
+```promql
+sum(rate(http_requests_total{namespace="$namespace", service=~"$service"}[5m]))
+```
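+
+The `=~` matcher matters for `service` because a multi-value variable expands to a regex alternation when several values are selected, while the single-value `namespace` can use `=`. A sketch of a panel target built on the query above (the panel title is illustrative, and `timeseries` assumes a recent Grafana version):
+
+```json
+{
+  "type": "timeseries",
+  "title": "Request Rate ($namespace)",
+  "targets": [{
+    "expr": "sum by (service) (rate(http_requests_total{namespace=\"$namespace\", service=~\"$service\"}[5m]))",
+    "legendFormat": "{{service}}"
+  }]
+}
+```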
+
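+## Alerts in Dashboards
+
+The panel-level `alert` block shown here is Grafana's legacy dashboard alerting format, which only applies to `graph` panels and was removed in Grafana 11. On newer versions, define the same condition as a Grafana Alerting rule instead.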
+
+```json
+{
+  "alert": {
+    "name": "High Error Rate",
+    "conditions": [
+      {
+        "evaluator": {
+          "params": [5],
+          "type": "gt"
+        },
+        "operator": {"type": "and"},
+        "query": {
+          "params": ["A", "5m", "now"]
+        },
+        "reducer": {"type": "avg"},
+        "type": "query"
+      }
+    ],
+    "executionErrorState": "alerting",
+    "for": "5m",
+    "frequency": "1m",
+    "message": "Error rate is above 5%",
+    "noDataState": "no_data",
+    "notifications": [
+      {"uid": "slack-channel"}
+    ]
+  }
+}
+```
+
+## Dashboard Provisioning
+
+**dashboards.yml:**
+```yaml
+apiVersion: 1
+
+providers:
+  - name: 'default'
+    orgId: 1
+    folder: 'General'
+    type: file
+    disableDeletion: false
+    updateIntervalSeconds: 10
+    allowUiUpdates: true
+    options:
+      path: /etc/grafana/dashboards
+```
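+
+Files under `path` are expected to contain the bare dashboard model. The `{"dashboard": {...}}` wrapper used in the API Monitoring example is the HTTP API payload shape, so it would need to be unwrapped before the file is placed in the provisioning directory. A minimal sketch of such a file, with a hypothetical `uid` and the panels omitted:
+
+```json
+{
+  "uid": "api-monitoring",
+  "title": "API Monitoring",
+  "tags": ["api", "production"],
+  "timezone": "browser",
+  "refresh": "30s",
+  "panels": []
+}
+```
+
+Setting a stable `uid` keeps dashboard URLs and links unchanged when the file is re-provisioned.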
+
+## Common Dashboard Patterns
+
+### Infrastructure Dashboard
+
+**Key Panels:**
+- CPU utilization per node
+- Memory usage per node
+- Disk I/O
+- Network traffic
+- Pod count by namespace
+- Node status
+
+**Reference:** See `assets/infrastructure-dashboard.json`
+
+### Database Dashboard
+
+**Key Panels:**
+- Queries per second
+- Connection pool usage
+- Query latency (P50, P95, P99)
+- Active connections
+- Database size
+- Replication lag
+- Slow queries
+
+**Reference:** See `assets/database-dashboard.json`
+
+### Application Dashboard
+
+**Key Panels:**
+- Request rate
+- Error rate
+- Response time (percentiles)
+- Active users/sessions
+- Cache hit rate
+- Queue length
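+
+Most of these panels can reuse the RED queries from the API Monitoring dashboard. Cache hit rate is the usual exception. A sketch, assuming the application exports counters named `cache_hits_total` and `cache_misses_total` (hypothetical names; substitute whatever your instrumentation emits):
+
+```json
+{
+  "type": "stat",
+  "title": "Cache Hit Rate",
+  "targets": [{
+    "expr": "sum(rate(cache_hits_total[5m])) / (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m])))"
+  }],
+  "fieldConfig": {
+    "defaults": {"unit": "percentunit", "min": 0, "max": 1}
+  }
+}
+```
+
+The `percentunit` unit renders a 0 to 1 ratio as a percentage, so the query does not need to multiply by 100.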
+
+## Best Practices
+
+1. **Start with templates** (Grafana community dashboards)
+2. **Use consistent naming** for panels and variables
+3. **Group related metrics** in rows
+4. **Set appropriate time ranges** (default: Last 6 hours)
+5. **Use variables** for flexibility
+6. **Add panel descriptions** for context
+7. **Configure units** correctly
+8. **Set meaningful thresholds** for colors
+9. **Use consistent colors** across dashboards
+10. **Test with different time ranges**
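+
+A sketch of how practices 4, 6, 7, and 8 could look in dashboard JSON: a default time range, a panel description, an explicit unit, and color thresholds. The threshold values are illustrative assumptions, not recommendations.
+
+```json
+{
+  "title": "API Monitoring",
+  "time": {"from": "now-6h", "to": "now"},
+  "panels": [
+    {
+      "type": "stat",
+      "title": "Error Rate %",
+      "description": "Share of HTTP requests answered with a 5xx status over the last 5 minutes.",
+      "targets": [{
+        "expr": "(sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))) * 100"
+      }],
+      "fieldConfig": {
+        "defaults": {
+          "unit": "percent",
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              {"value": null, "color": "green"},
+              {"value": 1, "color": "yellow"},
+              {"value": 5, "color": "red"}
+            ]
+          }
+        }
+      }
+    }
+  ]
+}
+```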
+
+## Dashboard as Code
+
+### Terraform Provisioning
+
+```hcl
+resource "grafana_dashboard" "api_monitoring" {
+  config_json = file("${path.module}/dashboards/api-monitoring.json")
+  folder      = grafana_folder.monitoring.id
+}
+
+resource "grafana_folder" "monitoring" {
+  title = "Production Monitoring"
+}
+```
+
+### Ansible Provisioning
+
+```yaml
+- name: Deploy Grafana dashboards
+  copy:
+    src: "{{ item }}"
+    dest: /etc/grafana/dashboards/
+  with_fileglob:
+    - "dashboards/*.json"
+  notify: restart grafana
+```
+
+## Reference Files
+
+- `assets/api-dashboard.json` - API monitoring dashboard
+- `assets/infrastructure-dashboard.json` - Infrastructure dashboard
+- `assets/database-dashboard.json` - Database monitoring dashboard
+- `references/dashboard-design.md` - Dashboard design guide
+
+## Related Skills
+
+- `prometheus-configuration` - For metric collection
+- `slo-implementation` - For SLO dashboards