# Runbook: [Alert Name]

## Overview

**Alert Name**: [e.g., HighLatency, ServiceDown, ErrorBudgetBurn]

**Severity**: [Critical | Warning | Info]

**Team**: [e.g., Backend, Platform, Database]

**Component**: [e.g., API Gateway, User Service, PostgreSQL]

**What it means**: [One-line description of what this alert indicates]

**User impact**: [How does this affect users? High/Medium/Low]

**Urgency**: [How quickly must this be addressed? Immediate/Hours/Days]

---

## Alert Details

### When This Alert Fires

This alert fires when:
- [Specific condition, e.g., "P95 latency exceeds 500ms for 10 minutes"]
- [Any additional conditions]
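
For reference, the first example condition above typically maps to an alert expression along these lines. This is a sketch only; the metric name, threshold, and window are assumptions that must match your actual alerting rule (the "for 10 minutes" part belongs in the rule's `for:` clause, not the query):

```promql
# P95 latency above 500ms over 5-minute windows
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
```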

### Symptoms

Users will experience:
- [ ] Slow response times
- [ ] Errors or failures
- [ ] Service unavailable
- [ ] [Other symptoms]

### Probable Causes

Common causes include:
1. **[Cause 1]**: [Description]
   - Example: Database overload due to slow queries
2. **[Cause 2]**: [Description]
   - Example: Memory leak causing OOM errors
3. **[Cause 3]**: [Description]
   - Example: Upstream service degradation

---

## Investigation Steps

### 1. Check Service Health

**Dashboard**: [Link to primary dashboard]

**Key metrics to check**:
```promql
# Request rate
sum(rate(http_requests_total[5m]))

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Latency (p95, p99)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```

**What to look for**:
- [ ] Has traffic spiked recently?
- [ ] Is error rate elevated?
- [ ] Are any endpoints particularly slow?

### 2. Check Recent Changes

**Deployments**:
```bash
# Kubernetes
kubectl rollout history deployment/[service-name] -n [namespace]

# Check when last deployed
kubectl get pods -n [namespace] -o wide | grep [service-name]
```

**What to look for**:
- [ ] Was there a recent deployment?
- [ ] Did the alert start after a deployment?
- [ ] Any configuration changes?
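
If a configuration change rather than a code deploy is suspected, recent cluster events can help confirm the timeline. A minimal sketch, assuming Kubernetes (events expire after a while, so this only covers the recent past):

```bash
# Recent events in the namespace, newest last
kubectl get events -n [namespace] --sort-by=.lastTimestamp | tail -n 20

# Deployment annotations and rollout metadata, if recorded
kubectl describe deployment [service-name] -n [namespace] | head -n 40
```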

### 3. Check Logs

**Log query** (adjust for your log system):
```bash
# Kubernetes
kubectl logs deployment/[service-name] -n [namespace] --tail=100 | grep ERROR

# Elasticsearch/Kibana
GET /logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "service": "[service-name]" } },
        { "match": { "level": "error" } },
        { "range": { "@timestamp": { "gte": "now-30m" } } }
      ]
    }
  }
}

# Loki/LogQL
{job="[service-name]"} |= "error" | json | level="error"
```

**What to look for**:
- [ ] Repeated error messages
- [ ] Stack traces
- [ ] Connection errors
- [ ] Timeout errors

### 4. Check Dependencies

**Database**:
```sql
-- Check active connections
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';

-- Check slow queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '5 seconds';
```

**External APIs**:
- [ ] Check status pages: [Link to status pages]
- [ ] Check API error rates in dashboard
- [ ] Test API endpoints manually
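
For the manual check, a single curl with a timing breakdown shows whether latency sits in connection setup or in the provider's processing. A sketch; the host and path are placeholders for the provider's documented endpoint:

```bash
# Where is the time going? Connect time vs. time-to-first-byte vs. total
curl -s -o /dev/null --max-time 10 \
  -w "HTTP %{http_code}  connect=%{time_connect}s  ttfb=%{time_starttransfer}s  total=%{time_total}s\n" \
  https://[external-api-host]/[health-path]
```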

**Cache** (Redis/Memcached):
```bash
# Redis info
redis-cli -h [host] INFO stats

# Check memory usage
redis-cli -h [host] INFO memory
```

### 5. Check Resource Usage

**CPU and Memory**:
```bash
# Kubernetes
kubectl top pods -n [namespace] | grep [service-name]

# Node metrics
kubectl top nodes
```

**Prometheus queries**:
```promql
# CPU usage by pod
sum(rate(container_cpu_usage_seconds_total{pod=~"[service-name].*"}[5m])) by (pod)

# Memory usage by pod
sum(container_memory_usage_bytes{pod=~"[service-name].*"}) by (pod)
```

**What to look for**:
- [ ] CPU throttling
- [ ] Memory approaching limits
- [ ] Disk space issues
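
Disk pressure does not always show up in the CPU/memory views above. A quick cross-check from inside a pod and at the node level (Kubernetes sketch; assumes the container image includes `df`):

```bash
# Filesystem usage inside one pod of the service
kubectl exec -n [namespace] deploy/[service-name] -- df -h

# Node-level DiskPressure condition
kubectl describe nodes | grep -i diskpressure
```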

### 6. Check Traces (if available)

**Trace query**:
```bash
# Jaeger
# Search for slow traces (> 1s) in last 30 minutes

# Tempo/TraceQL
{ duration > 1s && resource.service.name = "[service-name]" }
```

**What to look for**:
- [ ] Which operation is slow?
- [ ] Where is time spent? (DB, external API, service logic)
- [ ] Any N+1 query patterns?
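
One way to surface a possible N+1 pattern is to look for traces that contain an unusually large number of database spans. A TraceQL sketch, assuming Tempo and that DB client spans carry a `db.system` attribute; the span-count threshold is an arbitrary starting point:

```
# Traces from this service containing many database spans (possible N+1)
{ resource.service.name = "[service-name]" && span.db.system = "postgresql" } | count() > 20
```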

---

## Common Scenarios and Solutions

### Scenario 1: Recent Deployment Caused Issue

**Symptoms**:
- Alert started immediately after deployment
- Error logs correlate with new code

**Solution**:
```bash
# Roll back the deployment
kubectl rollout undo deployment/[service-name] -n [namespace]

# Verify the rollback succeeded
kubectl rollout status deployment/[service-name] -n [namespace]

# Monitor for alert resolution
```

**Follow-up**:
- [ ] Create incident report
- [ ] Review deployment process
- [ ] Add pre-deployment checks

### Scenario 2: Database Performance Issue

**Symptoms**:
- Slow query logs show problematic queries
- Database CPU saturated or connection pool exhausted

**Solution**:
```sql
-- Identify the slow query (see "Check Dependencies" above)

-- Cancel the long-running query (use with caution)
SELECT pg_cancel_backend([pid]);

-- Or terminate the backend if cancelling doesn't work
SELECT pg_terminate_backend([pid]);

-- Add an index if one is missing (in a maintenance window;
-- CONCURRENTLY avoids blocking writes)
CREATE INDEX CONCURRENTLY idx_name ON table_name (column_name);
```
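
If the offending statement is not obvious from `pg_stat_activity`, aggregate statistics can point to it. A sketch assuming the `pg_stat_statements` extension is enabled (column names vary by PostgreSQL version; `mean_exec_time` is the name from PostgreSQL 13 onward):

```sql
-- Statements with the highest average execution time
SELECT query, calls, mean_exec_time, rows
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```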

**Follow-up**:
- [ ] Add query performance test
- [ ] Review and optimize query
- [ ] Consider read replicas

### Scenario 3: Memory Leak

**Symptoms**:
- Memory usage gradually increasing
- Eventually OOMKilled
- Restarts temporarily fix the issue

**Solution**:
```bash
# Immediate: restart pods
kubectl rollout restart deployment/[service-name] -n [namespace]

# Increase memory limits (temporary measure)
kubectl set resources deployment/[service-name] -n [namespace] \
  --limits=memory=2Gi
```
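
To confirm that pods are actually being OOM-killed rather than crashing for another reason, check recent terminations. A Kubernetes sketch; the `app=[service-name]` label selector is an assumption about how the pods are labelled:

```bash
# Last termination state of the service's containers
kubectl describe pods -n [namespace] -l app=[service-name] | grep -B 2 -i "OOMKilled"

# Cluster events mentioning OOM
kubectl get events -n [namespace] | grep -i oom
```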

**Follow-up**:
- [ ] Profile application for memory leaks
- [ ] Add memory usage alerts
- [ ] Fix root cause

### Scenario 4: Traffic Spike / DDoS

**Symptoms**:
- Sudden traffic increase
- Traffic from unusual sources
- High CPU/memory across all instances

**Solution**:
```bash
# Scale up immediately
kubectl scale deployment/[service-name] -n [namespace] --replicas=10

# Enable rate limiting at the load balancer level
# (specific steps depend on the LB)

# Block suspicious IPs if a DDoS is confirmed
# (use WAF or network policies)
```

**Follow-up**:
- [ ] Implement rate limiting
- [ ] Add DDoS protection (Cloudflare, WAF)
- [ ] Set up auto-scaling
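
As a starting point for the auto-scaling follow-up, a Horizontal Pod Autoscaler can be created straight from the CLI. A sketch only; the replica bounds and CPU target are assumptions to tune per service:

```bash
# Create an HPA targeting average CPU utilization
kubectl autoscale deployment/[service-name] -n [namespace] \
  --min=3 --max=20 --cpu-percent=70

# Verify it is tracking the deployment
kubectl get hpa -n [namespace]
```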

### Scenario 5: Upstream Service Degradation

**Symptoms**:
- Errors calling external API
- Timeouts to upstream service
- Upstream status page shows issues

**Solution**:
```bash
# Enable circuit breaker (if available)
# Adjust timeout configuration
# Switch to backup service/cached data

# Monitor the external service
# Check status page: [Link]
```
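
To confirm the degradation independently of your own service, probe the upstream endpoint directly a few times. A sketch; the host and path are placeholders:

```bash
# Probe the upstream repeatedly to see whether failures are intermittent
for i in 1 2 3 4 5; do
  curl -s -o /dev/null --max-time 5 \
    -w "HTTP %{http_code}, total %{time_total}s\n" \
    https://[upstream-host]/[health-path]
done
```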

**Follow-up**:
- [ ] Implement circuit breaker pattern
- [ ] Add fallback mechanisms
- [ ] Set up external service monitoring

---

## Immediate Actions (< 5 minutes)

These should be done first to mitigate impact:

1. **[Action 1]**: [e.g., "Scale up service"]
   ```bash
   kubectl scale deployment/[service] --replicas=10
   ```

2. **[Action 2]**: [e.g., "Rollback deployment"]
   ```bash
   kubectl rollout undo deployment/[service]
   ```

3. **[Action 3]**: [e.g., "Enable circuit breaker"]

---

## Short-term Actions (< 30 minutes)

After immediate mitigation:

1. **[Action 1]**: [e.g., "Investigate root cause"]
2. **[Action 2]**: [e.g., "Optimize slow query"]
3. **[Action 3]**: [e.g., "Clear cache if stale"]

---

## Long-term Actions (Post-Incident)

Preventive measures:

1. **[Action 1]**: [e.g., "Add circuit breaker"]
2. **[Action 2]**: [e.g., "Implement auto-scaling"]
3. **[Action 3]**: [e.g., "Add query performance tests"]
4. **[Action 4]**: [e.g., "Update alert thresholds"]

---

## Escalation

If the issue persists after 30 minutes:

**Escalation Path**:
1. **Primary oncall**: @[username] ([slack/email])
2. **Team lead**: @[username] ([slack/email])
3. **Engineering manager**: @[username] ([slack/email])
4. **Incident commander**: @[username] ([slack/email])

**Communication**:
- **Slack channel**: #[incidents-channel]
- **Status page**: [Link]
- **Incident tracking**: [Link to incident management tool]

---

## Related Runbooks

- [Related Runbook 1]
- [Related Runbook 2]
- [Related Runbook 3]

## Related Dashboards

- [Main Service Dashboard]
- [Resource Usage Dashboard]
- [Dependency Dashboard]

## Related Documentation

- [Architecture Diagram]
- [Service Documentation]
- [API Documentation]

---

## Recent Incidents

| Date | Duration | Root Cause | Resolution | Ticket |
|------|----------|------------|------------|--------|
| 2024-10-15 | 23 min | Database pool exhausted | Increased pool size | INC-123 |
| 2024-09-30 | 45 min | Memory leak | Fixed code, restarted | INC-120 |

---

## Runbook Metadata

**Last Updated**: [Date]

**Owner**: [Team name]

**Reviewers**: [Names]

**Next Review**: [Date]

---

## Notes

- This runbook should be reviewed quarterly
- Update after each incident to capture new learnings
- Keep investigation steps concise and actionable
- Include actual commands that can be copy-pasted