# Monitoring & Observability
**Purpose**: Set up monitoring, alerting, and observability to detect incidents early.
## Observability Pillars
### 1. Metrics
**What to Monitor**:
- **Application**: Response time, error rate, throughput
- **Infrastructure**: CPU, memory, disk, network
- **Database**: Query time, connections, deadlocks
- **Business**: User signups, revenue, conversions
**Tools**:
- Prometheus + Grafana
- DataDog
- New Relic
- CloudWatch (AWS)
- Azure Monitor
---
#### Key Metrics by Layer
**Application Metrics**:
```
http_requests_total                 # Total requests
http_request_duration_seconds       # Response time (histogram)
http_requests_errors_total          # Error count
http_requests_in_flight             # Concurrent requests
```
**Infrastructure Metrics**:
```
node_cpu_seconds_total              # CPU time per mode (derive usage via rate())
node_memory_MemAvailable_bytes      # Available memory
node_filesystem_avail_bytes         # Free filesystem space
node_network_receive_bytes_total    # Network in
```
**Database Metrics**:
```
pg_stat_database_tup_returned       # Rows returned
pg_stat_database_tup_fetched        # Rows fetched
pg_stat_database_deadlocks          # Deadlock count
pg_stat_activity_count              # Active connections (by state)
```
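These raw counters and histograms are usually turned into rates and percentiles at query time. A PromQL sketch using the application metric names listed above:
```
# Error ratio over the last 5 minutes
sum(rate(http_requests_errors_total[5m])) / sum(rate(http_requests_total[5m]))

# p95 response time from the histogram buckets
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```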
---
### 2. Logs
**What to Log**:
- **Application logs**: Errors, warnings, info
- **Access logs**: HTTP requests (nginx, apache)
- **System logs**: Kernel, systemd, auth
- **Audit logs**: Security events, data access
**Log Levels**:
- **ERROR**: Application errors, exceptions
- **WARN**: Potential issues (deprecated API, high latency)
- **INFO**: Normal operations (user login, job completed)
- **DEBUG**: Detailed troubleshooting (only in dev)
**Tools**:
- ELK Stack (Elasticsearch, Logstash, Kibana)
- Splunk
- CloudWatch Logs
- Azure Log Analytics
---
#### Structured Logging
**BAD** (unstructured):
```javascript
console.log("User logged in: " + userId);
```
**GOOD** (structured JSON):
```javascript
logger.info("User logged in", {
  userId: 123,
  ip: "192.168.1.1",
  timestamp: "2025-10-26T12:00:00Z",
  userAgent: "Mozilla/5.0...",
});
// Output:
// {"level":"info","message":"User logged in","userId":123,"ip":"192.168.1.1",...}
```
**Benefits**:
- Queryable (filter by userId)
- Machine-readable
- Consistent format
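One way to produce that output in Node.js is winston's JSON format; a minimal sketch (any structured logger that emits JSON works the same way):
```javascript
const winston = require('winston');

// JSON-formatted logger; level comes from the environment so DEBUG stays out of production
const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [new winston.transports.Console()],
});

logger.info('User logged in', { userId: 123, ip: '192.168.1.1' });
// {"level":"info","message":"User logged in","userId":123,"ip":"192.168.1.1","timestamp":"..."}
```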
---
### 3. Traces
**Purpose**: Track request flow through distributed systems
**Example**:
```
User Request → API Gateway → Auth Service → Payment Service → Database
     1ms          2ms            50ms            100ms           30ms
                                                  ↑ SLOW SPAN
```
**Tools**:
- Jaeger
- Zipkin
- AWS X-Ray
- DataDog APM
- New Relic
**When to Use**:
- Microservices architecture
- Slow requests (which service is slow?)
- Debugging distributed systems
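For manual instrumentation, a minimal sketch with the OpenTelemetry JavaScript API, assuming the Node SDK has already been initialized with an exporter such as Jaeger (`paymentGateway` and the span/service names are illustrative):
```javascript
const { trace } = require('@opentelemetry/api');

const tracer = trace.getTracer('payment-service');

async function chargeCustomer(orderId) {
  // Creates a span in the current request's context, so the trace shows
  // API Gateway → Auth → Payment as nested spans with their durations.
  return tracer.startActiveSpan('charge-customer', async (span) => {
    try {
      span.setAttribute('order.id', orderId);
      return await paymentGateway.charge(orderId); // hypothetical downstream call
    } finally {
      span.end(); // the configured exporter ships the span to Jaeger/Zipkin/etc.
    }
  });
}
```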
---
## Alerting Best Practices
### Alert on Symptoms, Not Causes
**BAD** (cause-based):
- Alert: "CPU usage >80%"
- Problem: CPU can be high without user impact
**GOOD** (symptom-based):
- Alert: "API response time >1s"
- Why: Users actually experiencing slowness
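The same symptom expressed as a Prometheus alerting rule, assuming the `http_request_duration_seconds` histogram defined later in this document (the threshold, labels, and runbook URL are illustrative):
```yaml
groups:
  - name: api-symptoms
    rules:
      - alert: HighRequestLatency
        # p95 latency above 1s for 5 minutes: users are actually experiencing slowness
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "API p95 latency above 1s for 5 minutes"
          runbook: "https://wiki.example.com/runbook/api-latency"
```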
---
### Alert Severity Levels
**P1 (SEV1) - Page On-Call**:
- Service down (availability <99%)
- Data loss
- Security breach
- Response time >5s (unusable)
**P2 (SEV2) - Notify During Business Hours**:
- Degraded performance (response time >1s)
- Error rate >1%
- Disk >90% full
**P3 (SEV3) - Email/Slack**:
- Warning signs (disk >80%, memory >80%)
- Non-critical errors
- Monitoring gaps
---
### Alert Fatigue Prevention
**Rules**:
1. **Actionable**: Every alert must have a clear action
2. **Meaningful**: Alert only on real problems
3. **Context**: Include relevant info (which server, which metric)
4. **Deduplicate**: Don't alert 100 times for the same issue
5. **Escalate**: Auto-escalate if not acknowledged (see the routing sketch below)
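Deduplication, grouping, and escalation are usually handled in the alert router rather than in each rule. A sketch of an Alertmanager route (receiver names are hypothetical; the notifier configs are omitted):
```yaml
route:
  receiver: team-slack                 # default channel (hypothetical receiver)
  group_by: ['alertname', 'service']   # deduplicate: one notification per alert group
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h                  # re-notify unresolved alerts instead of spamming
  routes:
    - matchers:
        - severity="page"
      receiver: pagerduty-oncall       # paging/escalation handled by the on-call tool
receivers:
  - name: team-slack                   # slack_configs omitted
  - name: pagerduty-oncall             # pagerduty_configs omitted
```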
**Example Bad Alert**:
```
Subject: Alert
Body: Server is down
```
**Example Good Alert**:
```
Subject: [P1] API Server Down - Production
Body:
- Service: api.example.com
- Issue: Health check failing for 5 minutes
- Impact: All users affected (100%)
- Runbook: https://wiki.example.com/runbook/api-down
- Dashboard: https://grafana.example.com/d/api
```
---
## Monitoring Setup
### Application Monitoring
#### Prometheus + Grafana
**Install Prometheus Client** (Node.js):
```javascript
const express = require('express');
const client = require('prom-client');

const app = express();

// Enable default metrics (CPU, memory, event loop lag, etc.)
client.collectDefaultMetrics();

// Custom metrics
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
});

// Instrument every request
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    // req.route is undefined for unmatched routes (404s), so fall back to the raw path
    const route = req.route ? req.route.path : req.path;
    end({ method: req.method, route, status: res.statusCode });
  });
  next();
});

// Expose metrics endpoint (register.metrics() returns a Promise in prom-client v13+)
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(3000);
```
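Business metrics (signups, conversions) use the same client. Continuing the snippet above with a hypothetical counter and route:
```javascript
// Counter for a business metric; the metric name and route are illustrative.
const userSignups = new client.Counter({
  name: 'user_signups_total',
  help: 'Total number of completed user signups',
});

app.post('/signup', async (req, res) => {
  // ... create the user ...
  userSignups.inc();   // increment only after the signup actually succeeds
  res.status(201).end();
});
```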
**Prometheus Config** (prometheus.yml):
```yaml
scrape_configs:
  - job_name: 'api-server'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:3000']
```
---
### Log Aggregation
#### ELK Stack
**Application** (send logs to Logstash):
```javascript
const winston = require('winston');
const { LogstashTransport } = require('winston-logstash-transport');

const logger = winston.createLogger({
  transports: [
    new LogstashTransport({
      host: 'logstash.example.com',
      port: 5000,
    }),
  ],
});

logger.info('User logged in', { userId: 123, ip: '192.168.1.1' });
```
**Logstash Config**:
```
input {
  tcp {
    port => 5000
    codec => json
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "application-logs-%{+YYYY.MM.dd}"
  }
}
```
---
### Health Checks
**Purpose**: Check if service is healthy and ready to serve traffic
**Types**:
1. **Liveness**: Is the service running? (restart if fails)
2. **Readiness**: Is the service ready to serve traffic? (remove from load balancer if fails)
**Example** (Express.js):
```javascript
// Liveness probe (simple check)
app.get('/healthz', (req, res) => {
  res.status(200).send('OK');
});

// Readiness probe (check dependencies)
app.get('/ready', async (req, res) => {
  try {
    // Check database
    await db.query('SELECT 1');
    // Check Redis
    await redis.ping();
    // Check external API (fetch does not reject on HTTP error statuses, so check response.ok)
    const response = await fetch('https://api.external.com/health');
    if (!response.ok) throw new Error(`External API unhealthy: ${response.status}`);
    res.status(200).send('Ready');
  } catch (error) {
    res.status(503).send('Not ready');
  }
});
```
**Kubernetes**:
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 5
```
---
### SLI, SLO, SLA
**SLI** (Service Level Indicator):
- Metrics that measure service quality
- Examples: Response time, error rate, availability
**SLO** (Service Level Objective):
- Target for SLI
- Examples: "99.9% availability", "p95 response time <500ms"
**SLA** (Service Level Agreement):
- Contract with users (with penalties)
- Examples: "99.9% uptime or refund"
**Example**:
```
SLI: Availability = (successful requests / total requests) * 100
SLO: Availability must be ≥99.9% per month
SLA: If availability <99.9%, users get 10% refund
```
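The SLO above implies a concrete error budget; a quick calculation for a 30-day month:
```javascript
// Downtime allowed by a 99.9% monthly availability SLO (30-day month).
const slo = 0.999;
const monthMinutes = 30 * 24 * 60;                    // 43,200 minutes in the month
const errorBudgetMinutes = (1 - slo) * monthMinutes;  // ~43.2 minutes of allowed downtime
console.log(errorBudgetMinutes.toFixed(1));           // "43.2"
```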
---
## Monitoring Checklist
**Application**:
- [ ] Response time metrics (p50, p95, p99)
- [ ] Error rate metrics (4xx, 5xx)
- [ ] Throughput metrics (requests per second)
- [ ] Health check endpoint (/healthz, /ready)
- [ ] Structured logging (JSON format)
- [ ] Distributed tracing (if microservices)
**Infrastructure**:
- [ ] CPU, memory, disk, network metrics
- [ ] System logs (syslog, journalctl)
- [ ] Cloud metrics (CloudWatch, Azure Monitor)
- [ ] Disk I/O metrics (iostat)
**Database**:
- [ ] Query performance metrics
- [ ] Connection pool metrics
- [ ] Slow query log enabled
- [ ] Deadlock monitoring
**Alerts**:
- [ ] P1 alerts for critical issues (page on-call)
- [ ] P2 alerts for degraded performance
- [ ] Runbook linked in alerts
- [ ] Dashboard linked in alerts
- [ ] Escalation policy configured
**Dashboards**:
- [ ] Overview dashboard (RED metrics: Rate, Errors, Duration)
- [ ] Infrastructure dashboard (CPU, memory, disk)
- [ ] Database dashboard (queries, connections)
- [ ] Business metrics dashboard (signups, revenue)
---
## Common Monitoring Patterns
### RED Method (for services)
**Rate**: Requests per second
**Errors**: Error rate (%)
**Duration**: Response time (p50, p95, p99)
**Dashboard**:
```
+-----------------+  +-----------------+  +-----------------+
|      Rate       |  |     Errors      |  |    Duration     |
|   1000 req/s    |  |      0.5%       |  |   p95: 250ms    |
+-----------------+  +-----------------+  +-----------------+
```
### USE Method (for resources)
**Utilization**: % of resource used (CPU, memory, disk)
**Saturation**: Queue depth, backlog
**Errors**: Error count
**Dashboard**:
```
CPU: 70% utilization, 0.5 load average, 0 errors
Memory: 80% utilization, 0 swap, 0 OOM kills
Disk: 60% utilization, 5ms latency, 0 I/O errors
```
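With the Linux node_exporter, each USE column maps onto existing metrics (a rough mapping; exact names depend on the exporter version and enabled collectors):
```
Utilization: rate(node_cpu_seconds_total), node_memory_MemAvailable_bytes, node_filesystem_avail_bytes
Saturation:  node_load1 (run queue), node_vmstat_pswpin / node_vmstat_pswpout (swap activity)
Errors:      node_network_receive_errs_total, node_network_transmit_errs_total
```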
---
## Tools Comparison
| Tool | Type | Best For | Cost |
|------|------|----------|------|
| Prometheus + Grafana | Metrics | Self-hosted, cost-effective | Free |
| DataDog | Metrics, Logs, APM | All-in-one, easy setup | $15/host/month |
| New Relic | APM | Application performance | $99/user/month |
| ELK Stack | Logs | Log aggregation | Free (self-hosted) |
| Splunk | Logs | Enterprise log analysis | $1800/GB/year |
| Jaeger | Traces | Distributed tracing | Free |
| CloudWatch | Metrics, Logs | AWS-native | $0.30/metric/month |
| Azure Monitor | Metrics, Logs | Azure-native | $0.25/metric/month |
---
## Related Documentation
- [SKILL.md](../SKILL.md) - Main SRE agent
- [backend-diagnostics.md](backend-diagnostics.md) - Application troubleshooting
- [database-diagnostics.md](database-diagnostics.md) - Database monitoring
- [infrastructure.md](infrastructure.md) - Infrastructure monitoring