# Monitoring & Observability

Purpose: Set up monitoring, alerting, and observability to detect incidents early.

## Observability Pillars

### 1. Metrics
What to Monitor:
- Application: Response time, error rate, throughput
- Infrastructure: CPU, memory, disk, network
- Database: Query time, connections, deadlocks
- Business: User signups, revenue, conversions
Tools:
- Prometheus + Grafana
- DataDog
- New Relic
- CloudWatch (AWS)
- Azure Monitor
### Key Metrics by Layer

Application Metrics:

```text
http_requests_total                # Total requests
http_request_duration_seconds      # Response time (histogram)
http_requests_errors_total         # Error count
http_requests_in_flight            # Concurrent requests
```

Infrastructure Metrics (node_exporter):

```text
node_cpu_seconds_total             # CPU time by mode
node_memory_MemAvailable_bytes     # Available memory
node_filesystem_avail_bytes        # Free disk space
node_network_receive_bytes_total   # Network in
```

Database Metrics (postgres_exporter):

```text
pg_stat_database_tup_returned      # Rows returned
pg_stat_database_tup_fetched       # Rows fetched
pg_stat_database_deadlocks         # Deadlock count
pg_stat_activity_count             # Connections by state
```
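As a hedged sketch, connection saturation can be derived from these metrics with PromQL; `pg_settings_max_connections` and the per-instance matching assume postgres_exporter's default collectors:

```promql
# Share of max_connections in use (assumes one Postgres server per instance label)
sum by (instance) (pg_stat_activity_count)
  / on (instance) pg_settings_max_connections
```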
### 2. Logs
What to Log:
- Application logs: Errors, warnings, info
- Access logs: HTTP requests (nginx, apache)
- System logs: Kernel, systemd, auth
- Audit logs: Security events, data access
Log Levels:
- ERROR: Application errors, exceptions
- WARN: Potential issues (deprecated API, high latency)
- INFO: Normal operations (user login, job completed)
- DEBUG: Detailed troubleshooting (only in dev)
Tools:
- ELK Stack (Elasticsearch, Logstash, Kibana)
- Splunk
- CloudWatch Logs
- Azure Log Analytics
### Structured Logging

BAD (unstructured):

```js
console.log("User logged in: " + userId);
```

GOOD (structured JSON):

```js
logger.info("User logged in", {
  userId: 123,
  ip: "192.168.1.1",
  timestamp: "2025-10-26T12:00:00Z",
  userAgent: "Mozilla/5.0...",
});
// Output:
// {"level":"info","message":"User logged in","userId":123,"ip":"192.168.1.1",...}
```

Benefits:
- Queryable (filter by userId)
- Machine-readable
- Consistent format
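A minimal winston setup (a sketch, assuming the winston package) that produces JSON output like the above; the formatter adds the timestamp so callers don't have to pass it by hand, and the level is configurable per environment:

```js
const winston = require('winston');

// JSON logs with a timestamp on every entry
const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info', // e.g. 'debug' in dev, 'info' in prod
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [new winston.transports.Console()],
});

logger.info('User logged in', { userId: 123, ip: '192.168.1.1' });
```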
### 3. Traces

Purpose: Track request flow through distributed systems.

Example:

```text
User Request → API Gateway → Auth Service → Payment Service → Database
    1ms            2ms           50ms            100ms           30ms
                                                   ↑ SLOW SPAN
```
Tools:
- Jaeger
- Zipkin
- AWS X-Ray
- DataDog APM
- New Relic
When to Use:
- Microservices architecture
- Slow requests (which service is slow?)
- Debugging distributed systems
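A minimal tracing setup sketch using the OpenTelemetry Node SDK with auto-instrumentation, exporting to Jaeger over OTLP; the package choices, service name, and collector URL are assumptions, not prescribed here:

```js
// tracing.js — load this before the application code
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

// Auto-instruments http, express, pg, redis, etc., so each hop becomes a span
const sdk = new NodeSDK({
  serviceName: 'api-server', // hypothetical service name
  traceExporter: new OTLPTraceExporter({
    url: 'http://jaeger:4318/v1/traces', // Jaeger's OTLP/HTTP endpoint (assumed host)
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
```

Run with `node -r ./tracing.js server.js` so instrumentation registers before application modules load.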
## Alerting Best Practices

### Alert on Symptoms, Not Causes

BAD (cause-based):
- Alert: "CPU usage >80%"
- Problem: CPU can be high without any user impact

GOOD (symptom-based):
- Alert: "API response time >1s"
- Why: Users are actually experiencing slowness
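As a sketch, a symptom-based Prometheus alerting rule on p95 latency, using the http_request_duration_seconds histogram instrumented under Monitoring Setup below; the threshold, severity label, and runbook URL are illustrative:

```yaml
groups:
  - name: api-symptoms
    rules:
      - alert: HighRequestLatency
        # p95 over the last 5 minutes, computed from the histogram buckets
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 5m                # must be sustained, not a single spike
        labels:
          severity: P2
        annotations:
          summary: "p95 API latency above 1s for 5 minutes"
          runbook: "https://wiki.example.com/runbook/high-latency"
```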
### Alert Severity Levels
P1 (SEV1) - Page On-Call:
- Service down (availability <99%)
- Data loss
- Security breach
- Response time >5s (unusable)
P2 (SEV2) - Notify During Business Hours:
- Degraded performance (response time >1s)
- Error rate >1%
- Disk >90% full
P3 (SEV3) - Email/Slack:
- Warning signs (disk >80%, memory >80%)
- Non-critical errors
- Monitoring gaps
### Alert Fatigue Prevention

Rules (see the Alertmanager sketch below):
- Actionable: Every alert must have a clear action
- Meaningful: Alert only on real problems
- Context: Include relevant info (which server, which metric)
- Deduplicate: Don't alert 100 times for the same issue
- Escalate: Auto-escalate if not acknowledged
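A minimal Alertmanager routing sketch for the deduplication and grouping rules above; the receiver name, channel, and timings are illustrative assumptions:

```yaml
route:
  receiver: slack-oncall               # hypothetical receiver
  group_by: ['alertname', 'service']   # one notification per issue, not per instance
  group_wait: 30s                      # wait briefly to batch related alerts
  group_interval: 5m                   # batch further updates for an open group
  repeat_interval: 4h                  # re-notify if still firing (unacknowledged)
receivers:
  - name: slack-oncall
    slack_configs:
      - channel: '#oncall'
        api_url: 'https://hooks.slack.com/services/...' # placeholder webhook
```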
Example Bad Alert:

```text
Subject: Alert
Body: Server is down
```

Example Good Alert:

```text
Subject: [P1] API Server Down - Production
Body:
- Service: api.example.com
- Issue: Health check failing for 5 minutes
- Impact: All users affected (100%)
- Runbook: https://wiki.example.com/runbook/api-down
- Dashboard: https://grafana.example.com/d/api
```
## Monitoring Setup

### Application Monitoring

#### Prometheus + Grafana

Instrument the app with the Prometheus client (prom-client, Node.js):
```js
const client = require('prom-client');

// Enable default metrics (CPU, memory, event loop lag, etc.)
client.collectDefaultMetrics();

// Custom metrics
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
});

// Instrument every request
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    // req.route is only set once a route has matched; fall back to the raw path
    end({
      method: req.method,
      route: req.route ? req.route.path : req.path,
      status: res.statusCode,
    });
  });
  next();
});

// Expose metrics endpoint for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics()); // metrics() returns a Promise in prom-client v13+
});
```
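The request and error counters listed under "Key Metrics by Layer" can be registered the same way; a sketch continuing the snippet above with the same middleware pattern:

```js
const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});
const httpErrorsTotal = new client.Counter({
  name: 'http_requests_errors_total',
  help: 'Total HTTP 5xx responses',
  labelNames: ['method', 'route', 'status'],
});

app.use((req, res, next) => {
  res.on('finish', () => {
    const labels = {
      method: req.method,
      route: req.route ? req.route.path : req.path,
      status: res.statusCode,
    };
    httpRequestsTotal.inc(labels);
    if (res.statusCode >= 500) httpErrorsTotal.inc(labels); // errors counted separately
  });
  next();
});
```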
Prometheus config (prometheus.yml):

```yaml
scrape_configs:
  - job_name: 'api-server'
    static_configs:
      - targets: ['localhost:3000']
    scrape_interval: 15s
```
### Log Aggregation

#### ELK Stack

Application (send logs to Logstash):

```js
const winston = require('winston');
const { LogstashTransport } = require('winston-logstash-transport');

const logger = winston.createLogger({
  transports: [
    new LogstashTransport({
      host: 'logstash.example.com',
      port: 5000,
    }),
  ],
});

logger.info('User logged in', { userId: 123, ip: '192.168.1.1' });
```
Logstash config:

```conf
input {
  tcp {
    port => 5000
    codec => json
  }
}
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "application-logs-%{+YYYY.MM.dd}"
  }
}
```
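Once indexed, the structured fields are queryable; for example, a sketch of an Elasticsearch search against the index pattern above, e.g. `POST /application-logs-*/_search` with body (the @timestamp field is added by Logstash):

```json
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "userId": 123 } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}
```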
## Health Checks

Purpose: Check whether a service is healthy and ready to serve traffic.

Types:
- Liveness: Is the service running? (restart if it fails)
- Readiness: Is the service ready to serve traffic? (remove from load balancer if it fails)
Example (Express.js):

```js
// Liveness probe (simple check: the process is up and responding)
app.get('/healthz', (req, res) => {
  res.status(200).send('OK');
});

// Readiness probe (check dependencies)
app.get('/ready', async (req, res) => {
  try {
    // Check database
    await db.query('SELECT 1');
    // Check Redis
    await redis.ping();
    // Check external API (fetch resolves even on HTTP errors, so check .ok)
    const response = await fetch('https://api.external.com/health');
    if (!response.ok) throw new Error(`External API unhealthy: ${response.status}`);
    res.status(200).send('Ready');
  } catch (error) {
    res.status(503).send('Not ready');
  }
});
```
Kubernetes:

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 5
```
## SLI, SLO, SLA

SLI (Service Level Indicator):
- A metric that measures service quality
- Examples: Response time, error rate, availability

SLO (Service Level Objective):
- An internal target for an SLI
- Examples: "99.9% availability", "p95 response time <500ms"

SLA (Service Level Agreement):
- A contract with users, with penalties for missing it
- Example: "99.9% uptime or refund"

Example:

```text
SLI: Availability = (successful requests / total requests) * 100
SLO: Availability must be ≥99.9% per month
SLA: If availability <99.9%, users get a 10% refund
```
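A 99.9% monthly SLO implies a concrete error budget; a quick back-of-the-envelope calculation:

```js
// Error budget for a 99.9% availability SLO over a 30-day month
const slo = 0.999;
const minutesPerMonth = 30 * 24 * 60;                   // 43,200 minutes
const errorBudgetMinutes = (1 - slo) * minutesPerMonth;
console.log(errorBudgetMinutes.toFixed(1));             // ~43.2 minutes of downtime allowed
```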
## Monitoring Checklist
Application:
- Response time metrics (p50, p95, p99)
- Error rate metrics (4xx, 5xx)
- Throughput metrics (requests per second)
- Health check endpoint (/healthz, /ready)
- Structured logging (JSON format)
- Distributed tracing (if microservices)
Infrastructure:
- CPU, memory, disk, network metrics
- System logs (syslog, journalctl)
- Cloud metrics (CloudWatch, Azure Monitor)
- Disk I/O metrics (iostat)
Database:
- Query performance metrics
- Connection pool metrics
- Slow query log enabled
- Deadlock monitoring
Alerts:
- P1 alerts for critical issues (page on-call)
- P2 alerts for degraded performance
- Runbook linked in alerts
- Dashboard linked in alerts
- Escalation policy configured
Dashboards:
- Overview dashboard (RED metrics: Rate, Errors, Duration)
- Infrastructure dashboard (CPU, memory, disk)
- Database dashboard (queries, connections)
- Business metrics dashboard (signups, revenue)
## Common Monitoring Patterns

### RED Method (for services)

- Rate: Requests per second
- Errors: Error rate (%)
- Duration: Response time (p50, p95, p99)

Dashboard:

```text
+-----------------+  +-----------------+  +-----------------+
|      Rate       |  |     Errors      |  |    Duration     |
|   1000 req/s    |  |      0.5%       |  |   p95: 250ms    |
+-----------------+  +-----------------+  +-----------------+
```
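Sketch PromQL for the three RED panels, assuming the counters and histogram instrumented earlier (the `status` label is an assumption about that instrumentation):

```promql
# Rate: requests per second over the last 5 minutes
sum(rate(http_requests_total[5m]))

# Errors: share of 5xx responses
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Duration: p95 latency from the histogram buckets
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```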
### USE Method (for resources)

- Utilization: % of the resource in use (CPU, memory, disk)
- Saturation: Queue depth, backlog
- Errors: Error count

Dashboard:

```text
CPU:    70% utilization, 0.5 load average, 0 errors
Memory: 80% utilization, 0 swap, 0 OOM kills
Disk:   60% utilization, 5ms latency, 0 I/O errors
```
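Equivalent node_exporter queries for the CPU row, as a sketch (single node assumed):

```promql
# Utilization: share of non-idle CPU time across all cores
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Saturation: 1-minute load average per core (>1 means work is queuing)
node_load1 / count(node_cpu_seconds_total{mode="idle"})

# Errors: network receive errors per second
rate(node_network_receive_errs_total[5m])
```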
## Tools Comparison

| Tool | Type | Best For | Approx. Cost |
|---|---|---|---|
| Prometheus + Grafana | Metrics | Self-hosted, cost-effective | Free |
| DataDog | Metrics, Logs, APM | All-in-one, easy setup | $15/host/month |
| New Relic | APM | Application performance | $99/user/month |
| ELK Stack | Logs | Log aggregation | Free (self-hosted) |
| Splunk | Logs | Enterprise log analysis | $1800/GB/year |
| Jaeger | Traces | Distributed tracing | Free |
| CloudWatch | Metrics, Logs | AWS-native | $0.30/metric/month |
| Azure Monitor | Metrics, Logs | Azure-native | $0.25/metric/month |
## Related Documentation
- SKILL.md - Main SRE agent
- backend-diagnostics.md - Application troubleshooting
- database-diagnostics.md - Database monitoring
- infrastructure.md - Infrastructure monitoring