Initial commit

2025-11-29 17:56:41 +08:00
commit 9427ed1eea
40 changed files with 15189 additions and 0 deletions
--- a/agents/sre/modules/monitoring.md
+++ b/agents/sre/modules/monitoring.md
@@ -0,0 +1,439 @@
+# Monitoring & Observability
+
+**Purpose**: Set up monitoring, alerting, and observability to detect incidents early.
+
+## Observability Pillars
+
+### 1. Metrics
+
+**What to Monitor**:
+- **Application**: Response time, error rate, throughput
+- **Infrastructure**: CPU, memory, disk, network
+- **Database**: Query time, connections, deadlocks
+- **Business**: User signups, revenue, conversions
+
+**Tools**:
+- Prometheus + Grafana
+- DataDog
+- New Relic
+- CloudWatch (AWS)
+- Azure Monitor
+
+---
+
+#### Key Metrics by Layer
+
+**Application Metrics**:
+```
+http_requests_total               # Total requests
+http_request_duration_seconds     # Response time (histogram)
+http_requests_errors_total        # Error count
+http_requests_in_flight           # Concurrent requests
+```
+
+**Infrastructure Metrics**:
+```
+node_cpu_seconds_total            # CPU time per mode (counter; use rate())
+node_memory_MemAvailable_bytes    # Available memory
+node_filesystem_avail_bytes       # Free disk space per filesystem
+node_network_receive_bytes_total  # Network in
+```
+
+**Database Metrics**:
+```
+pg_stat_database_tup_returned     # Rows returned
+pg_stat_database_tup_fetched      # Rows fetched
+pg_stat_database_deadlocks        # Deadlock count
+pg_stat_activity_count            # Connections by state
+```
+
+---
+
+### 2. Logs
+
+**What to Log**:
+- **Application logs**: Errors, warnings, info
+- **Access logs**: HTTP requests (nginx, apache)
+- **System logs**: Kernel, systemd, auth
+- **Audit logs**: Security events, data access
+
+**Log Levels**:
+- **ERROR**: Application errors, exceptions
+- **WARN**: Potential issues (deprecated API, high latency)
+- **INFO**: Normal operations (user login, job completed)
+- **DEBUG**: Detailed troubleshooting (only in dev)
+
+**Tools**:
+- ELK Stack (Elasticsearch, Logstash, Kibana)
+- Splunk
+- CloudWatch Logs
+- Azure Log Analytics
+
+---
+
+#### Structured Logging
+
+**BAD** (unstructured):
+```javascript
+console.log("User logged in: " + userId);
+```
+
+**GOOD** (structured JSON):
+```javascript
+logger.info("User logged in", {
+  userId: 123,
+  ip: "192.168.1.1",
+  timestamp: "2025-10-26T12:00:00Z",
+  userAgent: "Mozilla/5.0...",
+});
+
+// Output:
+// {"level":"info","message":"User logged in","userId":123,"ip":"192.168.1.1",...}
+```
+
+**Benefits**:
+- Queryable (filter by userId)
+- Machine-readable
+- Consistent format
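+
+A minimal sketch of why structured output is queryable. `logLine` and `filterLogs` are hypothetical helpers that stand in for a real logger and a real log store; the point is only that one JSON object per line turns "filter by userId" into a parse and a comparison instead of a regex.
+
+```javascript
+// Hypothetical helper: serialize one log entry as a single JSON line.
+function logLine(level, message, fields = {}) {
+  return JSON.stringify({ level, message, ...fields });
+}
+
+// Hypothetical helper: filter raw JSON lines by an exact field match.
+function filterLogs(lines, field, value) {
+  return lines
+    .map((line) => JSON.parse(line))
+    .filter((entry) => entry[field] === value);
+}
+
+const lines = [
+  logLine('info', 'User logged in', { userId: 123, ip: '192.168.1.1' }),
+  logLine('info', 'User logged in', { userId: 456, ip: '192.168.1.2' }),
+  logLine('error', 'Payment failed', { userId: 123, orderId: 'A-1' }),
+];
+
+const forUser123 = filterLogs(lines, 'userId', 123);
+console.log(forUser123.map((entry) => entry.message));
+// Prints: [ 'User logged in', 'Payment failed' ]
+```
+
+The same query against the unstructured form would need a pattern per message template.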
+
+---
+
+### 3. Traces
+
+**Purpose**: Track request flow through distributed systems
+
+**Example**:
+```
+User Request → API Gateway → Auth Service → Payment Service → Database
+     1ms           2ms            50ms            100ms          30ms
+                                                  ↑ SLOW SPAN
+```
+
+**Tools**:
+- Jaeger
+- Zipkin
+- AWS X-Ray
+- DataDog APM
+- New Relic
+
+**When to Use**:
+- Microservices architecture
+- Slow requests (which service is slow?)
+- Debugging distributed systems
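+
+The "which service is slow?" question can be sketched as a search over span durations. This uses the numbers from the example above and assumes the spans are sequential and non-overlapping, which real traces often are not (spans usually nest). The `spans` shape and `slowestSpan` are illustrative, not the data model of any tracing tool.
+
+```javascript
+// Span durations from the example above (milliseconds).
+const spans = [
+  { service: 'User Request', durationMs: 1 },
+  { service: 'API Gateway', durationMs: 2 },
+  { service: 'Auth Service', durationMs: 50 },
+  { service: 'Payment Service', durationMs: 100 },
+  { service: 'Database', durationMs: 30 },
+];
+
+// Return the span with the largest duration.
+function slowestSpan(list) {
+  return list.reduce((worst, span) =>
+    span.durationMs > worst.durationMs ? span : worst
+  );
+}
+
+const slow = slowestSpan(spans);
+const total = spans.reduce((sum, span) => sum + span.durationMs, 0);
+console.log(`${slow.service}: ${slow.durationMs}ms of ${total}ms total`);
+// Prints: Payment Service: 100ms of 183ms total
+```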
+
+---
+
+## Alerting Best Practices
+
+### Alert on Symptoms, Not Causes
+
+**BAD** (cause-based):
+- Alert: "CPU usage >80%"
+- Problem: CPU can be high without user impact
+
+**GOOD** (symptom-based):
+- Alert: "API response time >1s"
+- Why: Users are actually experiencing slowness
+
+---
+
+### Alert Severity Levels
+
+**P1 (SEV1) - Page On-Call**:
+- Service down (availability <99%)
+- Data loss
+- Security breach
+- Response time >5s (unusable)
+
+**P2 (SEV2) - Notify During Business Hours**:
+- Degraded performance (response time >1s)
+- Error rate >1%
+- Disk >90% full
+
+**P3 (SEV3) - Email/Slack**:
+- Warning signs (disk >80%, memory >80%)
+- Non-critical errors
+- Monitoring gaps
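+
+The thresholds above can be sketched as a single classification function. The input field names are assumptions, and `latencySec` stands in for whichever response-time aggregate you alert on (the list above does not say whether that is a mean or a percentile). Data loss and security breaches are not covered because they are events rather than thresholds.
+
+```javascript
+// Map current readings to a severity using the thresholds listed above.
+function classifySeverity(m) {
+  if (m.availabilityPct < 99 || m.latencySec > 5) return 'P1';
+  if (m.latencySec > 1 || m.errorRatePct > 1 || m.diskUsedPct > 90) return 'P2';
+  if (m.diskUsedPct > 80 || m.memoryUsedPct > 80) return 'P3';
+  return 'OK';
+}
+
+const healthy = {
+  availabilityPct: 99.95,
+  latencySec: 0.3,
+  errorRatePct: 0.2,
+  diskUsedPct: 55,
+  memoryUsedPct: 40,
+};
+
+console.log(classifySeverity(healthy));                            // OK
+console.log(classifySeverity({ ...healthy, diskUsedPct: 85 }));    // P3
+console.log(classifySeverity({ ...healthy, latencySec: 1.4 }));    // P2
+console.log(classifySeverity({ ...healthy, availabilityPct: 98 })); // P1
+```
+
+Checking the most severe conditions first means a reading that matches several levels is always reported at the highest one.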
+
+---
+
+### Alert Fatigue Prevention
+
+**Rules**:
+1. **Actionable**: Every alert must have a clear action
+2. **Meaningful**: Alert only on real problems
+3. **Context**: Include relevant info (which server, which metric)
+4. **Deduplicate**: Don't alert 100 times for the same issue
+5. **Escalate**: Auto-escalate if not acknowledged
+
+**Example Bad Alert**:
+```
+Subject: Alert
+Body: Server is down
+```
+
+**Example Good Alert**:
+```
+Subject: [P1] API Server Down - Production
+Body:
+- Service: api.example.com
+- Issue: Health check failing for 5 minutes
+- Impact: All users affected (100%)
+- Runbook: https://wiki.example.com/runbook/api-down
+- Dashboard: https://grafana.example.com/d/api
+```
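+
+The deduplication rule can be sketched as a suppression window keyed by alert identity. `AlertDeduplicator` is a hypothetical class, not the API of any alerting tool, and the key format and five-minute window are assumptions; most alerting systems provide grouping or repeat intervals that do this for you.
+
+```javascript
+// Suppress repeats of the same alert key inside a time window.
+class AlertDeduplicator {
+  constructor(windowMs) {
+    this.windowMs = windowMs;
+    this.lastSent = new Map(); // key -> timestamp of the last notification
+  }
+
+  // Returns true when the alert should be sent, false when suppressed.
+  shouldSend(key, nowMs) {
+    const last = this.lastSent.get(key);
+    if (last !== undefined && nowMs - last < this.windowMs) return false;
+    this.lastSent.set(key, nowMs);
+    return true;
+  }
+}
+
+const dedupe = new AlertDeduplicator(5 * 60 * 1000);
+const key = 'P1:api.example.com:health-check';
+console.log(dedupe.shouldSend(key, 0));      // true  (first occurrence)
+console.log(dedupe.shouldSend(key, 60000));  // false (1 minute later)
+console.log(dedupe.shouldSend(key, 360000)); // true  (6 minutes later)
+```
+
+Timestamps are passed in rather than read from the clock so the behavior is easy to test.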
+
+---
+
+## Monitoring Setup
+
+### Application Monitoring
+
+#### Prometheus + Grafana
+
+**Install Prometheus Client** (Node.js):
+```javascript
+const express = require('express');
+const client = require('prom-client');
+
+const app = express();
+
+// Enable default metrics (CPU, memory, etc.)
+client.collectDefaultMetrics();
+
+// Custom metrics
+const httpRequestDuration = new client.Histogram({
+  name: 'http_request_duration_seconds',
+  help: 'HTTP request duration in seconds',
+  labelNames: ['method', 'route', 'status'],
+});
+
+// Instrument code
+app.use((req, res, next) => {
+  const end = httpRequestDuration.startTimer();
+  res.on('finish', () => {
+    // req.route is undefined when no route matched (e.g. 404); use a fixed
+    // label instead of the raw URL to keep label cardinality bounded.
+    const route = req.route ? req.route.path : 'unmatched';
+    end({ method: req.method, route, status: res.statusCode });
+  });
+  next();
+});
+
+// Expose metrics endpoint (register.metrics() returns a Promise in prom-client v13+)
+app.get('/metrics', async (req, res) => {
+  res.set('Content-Type', client.register.contentType);
+  res.end(await client.register.metrics());
+});
+```
+
+**Prometheus Config** (prometheus.yml):
+```yaml
+scrape_configs:
+  - job_name: 'api-server'
+    static_configs:
+      - targets: ['localhost:3000']
+    scrape_interval: 15s
+```
+
+---
+
+### Log Aggregation
+
+#### ELK Stack
+
+**Application** (send logs to Logstash):
+```javascript
+const winston = require('winston');
+const LogstashTransport = require('winston-logstash-transport').LogstashTransport;
+
+const logger = winston.createLogger({
+  transports: [
+    new LogstashTransport({
+      host: 'logstash.example.com',
+      port: 5000,
+    }),
+  ],
+});
+
+logger.info('User logged in', { userId: 123, ip: '192.168.1.1' });
+```
+
+**Logstash Config**:
+```
+input {
+  tcp {
+    port => 5000
+    codec => json
+  }
+}
+
+output {
+  elasticsearch {
+    hosts => ["elasticsearch:9200"]
+    index => "application-logs-%{+YYYY.MM.dd}"
+  }
+}
+```
+
+---
+
+### Health Checks
+
+**Purpose**: Check if service is healthy and ready to serve traffic
+
+**Types**:
+1. **Liveness**: Is the service running? (restart if fails)
+2. **Readiness**: Is the service ready to serve traffic? (remove from load balancer if fails)
+
+**Example** (Express.js):
+```javascript
+// Liveness probe (simple check)
+app.get('/healthz', (req, res) => {
+  res.status(200).send('OK');
+});
+
+// Readiness probe (check dependencies)
+app.get('/ready', async (req, res) => {
+  try {
+    // Check database
+    await db.query('SELECT 1');
+
+    // Check Redis
+    await redis.ping();
+
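+    // Check external API. fetch() resolves on HTTP error statuses, so check
+    // `ok`; the timeout keeps a hung dependency from stalling the probe.
+    // (Global fetch and AbortSignal.timeout need Node.js 18+.)
+    const upstream = await fetch('https://api.external.com/health', {
+      signal: AbortSignal.timeout(2000),
+    });
+    if (!upstream.ok) throw new Error(`External API returned ${upstream.status}`);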
+
+    res.status(200).send('Ready');
+  } catch (error) {
+    res.status(503).send('Not ready');
+  }
+});
+```
+
+**Kubernetes**:
+```yaml
+livenessProbe:
+  httpGet:
+    path: /healthz
+    port: 3000
+  initialDelaySeconds: 30
+  periodSeconds: 10
+
+readinessProbe:
+  httpGet:
+    path: /ready
+    port: 3000
+  initialDelaySeconds: 10
+  periodSeconds: 5
+```
+
+---
+
+### SLI, SLO, SLA
+
+**SLI** (Service Level Indicator):
+- Metrics that measure service quality
+- Examples: Response time, error rate, availability
+
+**SLO** (Service Level Objective):
+- Target for SLI
+- Examples: "99.9% availability", "p95 response time <500ms"
+
+**SLA** (Service Level Agreement):
+- Contract with users (with penalties)
+- Examples: "99.9% uptime or refund"
+
+**Example**:
+```
+SLI: Availability = (successful requests / total requests) * 100
+SLO: Availability must be ≥99.9% per month
+SLA: If availability <99.9%, users get 10% refund
+```
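+
+The SLI formula above, and the error budget an SLO implies, can be worked through in a few lines. The function names and the request counts are illustrative; the 30-day window is an assumption standing in for "per month".
+
+```javascript
+// SLI: percentage of successful requests.
+function availabilityPct(successful, total) {
+  return total === 0 ? 100 : (successful / total) * 100;
+}
+
+// Error budget: downtime allowed by an SLO over a window, in minutes.
+function errorBudgetMinutes(sloPct, days) {
+  return ((100 - sloPct) / 100) * days * 24 * 60;
+}
+
+const sloPct = 99.9;
+const sli = availabilityPct(999500, 1000000);
+const budget = errorBudgetMinutes(sloPct, 30);
+
+console.log(`SLI ${sli.toFixed(2)}%, SLO met: ${sli >= sloPct}`);
+// Prints: SLI 99.95%, SLO met: true
+console.log(`Error budget: ${budget.toFixed(1)} min per 30 days`);
+// Prints: Error budget: 43.2 min per 30 days
+```
+
+A 99.9% target therefore leaves roughly 43 minutes of full downtime per 30 days before the SLO is missed.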
+
+---
+
+## Monitoring Checklist
+
+**Application**:
+- [ ] Response time metrics (p50, p95, p99)
+- [ ] Error rate metrics (4xx, 5xx)
+- [ ] Throughput metrics (requests per second)
+- [ ] Health check endpoint (/healthz, /ready)
+- [ ] Structured logging (JSON format)
+- [ ] Distributed tracing (if microservices)
+
+**Infrastructure**:
+- [ ] CPU, memory, disk, network metrics
+- [ ] System logs (syslog, journald)
+- [ ] Cloud metrics (CloudWatch, Azure Monitor)
+- [ ] Disk I/O metrics (iostat)
+
+**Database**:
+- [ ] Query performance metrics
+- [ ] Connection pool metrics
+- [ ] Slow query log enabled
+- [ ] Deadlock monitoring
+
+**Alerts**:
+- [ ] P1 alerts for critical issues (page on-call)
+- [ ] P2 alerts for degraded performance
+- [ ] Runbook linked in alerts
+- [ ] Dashboard linked in alerts
+- [ ] Escalation policy configured
+
+**Dashboards**:
+- [ ] Overview dashboard (RED metrics: Rate, Errors, Duration)
+- [ ] Infrastructure dashboard (CPU, memory, disk)
+- [ ] Database dashboard (queries, connections)
+- [ ] Business metrics dashboard (signups, revenue)
+
+---
+
+## Common Monitoring Patterns
+
+### RED Method (for services)
+
+- **Rate**: Requests per second
+- **Errors**: Error rate (%)
+- **Duration**: Response time (p50, p95, p99)
+
+**Dashboard**:
+```
++-----------------+  +-----------------+  +-----------------+
+|      Rate       |  |     Errors      |  |    Duration     |
+|   1000 req/s    |  |      0.5%       |  |   p95: 250ms    |
++-----------------+  +-----------------+  +-----------------+
+```
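+
+A sketch of how the three RED numbers could be derived from raw request records. The record shape (`status`, `durationMs`), the choice to count only 5xx as errors, and the nearest-rank percentile are all assumptions; in a Prometheus setup these values would normally come from queries over counters and histograms rather than application code.
+
+```javascript
+// Nearest-rank percentile over a list of numbers (p in 0..100).
+function percentile(values, p) {
+  const sorted = [...values].sort((a, b) => a - b);
+  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
+  return sorted[rank - 1];
+}
+
+// Summarize a window of requests as Rate, Errors, Duration.
+function redSummary(requests, windowSec) {
+  if (requests.length === 0) return { ratePerSec: 0, errorPct: 0, p95Ms: 0 };
+  const errors = requests.filter((r) => r.status >= 500).length;
+  return {
+    ratePerSec: requests.length / windowSec,
+    errorPct: (errors * 100) / requests.length,
+    p95Ms: percentile(requests.map((r) => r.durationMs), 95),
+  };
+}
+
+// Synthetic sample: 20 requests over 10 seconds, one of them a 5xx.
+const sample = Array.from({ length: 20 }, (_, i) => ({
+  status: i === 19 ? 500 : 200,
+  durationMs: (i + 1) * 10, // 10, 20, ..., 200
+}));
+
+console.log(redSummary(sample, 10));
+// Prints: { ratePerSec: 2, errorPct: 5, p95Ms: 190 }
+```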
+
+### USE Method (for resources)
+
+- **Utilization**: % of resource used (CPU, memory, disk)
+- **Saturation**: Queue depth, backlog
+- **Errors**: Error count
+
+**Dashboard**:
+```
+CPU: 70% utilization, 0.5 load average, 0 errors
+Memory: 80% utilization, 0 swap, 0 OOM kills
+Disk: 60% utilization, 5ms latency, 0 I/O errors
+```
+
+---
+
+## Tools Comparison
+
+| Tool | Type | Best For | Cost (approximate; check current vendor pricing) |
+|------|------|----------|------|
+| Prometheus + Grafana | Metrics | Self-hosted, cost-effective | Free |
+| DataDog | Metrics, Logs, APM | All-in-one, easy setup | $15/host/month |
+| New Relic | APM | Application performance | $99/user/month |
+| ELK Stack | Logs | Log aggregation | Free (self-hosted) |
+| Splunk | Logs | Enterprise log analysis | $1800/GB/year |
+| Jaeger | Traces | Distributed tracing | Free |
+| CloudWatch | Metrics, Logs | AWS-native | $0.30/metric/month |
+| Azure Monitor | Metrics, Logs | Azure-native | $0.25/metric/month |
+
+---
+
+## Related Documentation
+
+- [SKILL.md](../SKILL.md) - Main SRE agent
+- [backend-diagnostics.md](backend-diagnostics.md) - Application troubleshooting
+- [database-diagnostics.md](database-diagnostics.md) - Database monitoring
+- [infrastructure.md](infrastructure.md) - Infrastructure monitoring