# Monitoring & Observability

**Purpose**: Set up monitoring, alerting, and observability to detect incidents early.

## Observability Pillars

### 1. Metrics

**What to Monitor**:
- **Application**: Response time, error rate, throughput
- **Infrastructure**: CPU, memory, disk, network
- **Database**: Query time, connections, deadlocks
- **Business**: User signups, revenue, conversions

**Tools**:
- Prometheus + Grafana
- DataDog
- New Relic
- CloudWatch (AWS)
- Azure Monitor

---

#### Key Metrics by Layer

**Application Metrics**:
```
http_requests_total               # Total requests
http_request_duration_seconds     # Response time (histogram)
http_requests_errors_total        # Error count
http_requests_in_flight           # Concurrent requests
```

**Infrastructure Metrics** (node_exporter names):
```
node_cpu_seconds_total            # CPU time by mode
node_memory_MemAvailable_bytes    # Available memory
node_filesystem_avail_bytes       # Free disk space
node_network_receive_bytes_total  # Network in
```

**Database Metrics** (postgres_exporter names):
```
pg_stat_database_tup_returned     # Rows returned
pg_stat_database_tup_fetched      # Rows fetched
pg_stat_database_deadlocks        # Deadlock count
pg_stat_activity_count            # Active connections
```

---

### 2. Logs

**What to Log**:
- **Application logs**: Errors, warnings, info
- **Access logs**: HTTP requests (nginx, apache)
- **System logs**: Kernel, systemd, auth
- **Audit logs**: Security events, data access

**Log Levels**:
- **ERROR**: Application errors, exceptions
- **WARN**: Potential issues (deprecated API, high latency)
- **INFO**: Normal operations (user login, job completed)
- **DEBUG**: Detailed troubleshooting (enable only in development)

**Tools**:
- ELK Stack (Elasticsearch, Logstash, Kibana)
- Splunk
- CloudWatch Logs
- Azure Log Analytics

---

#### Structured Logging

**BAD** (unstructured):
```javascript
console.log("User logged in: " + userId);
```

**GOOD** (structured JSON):
```javascript
logger.info("User logged in", {
  userId: 123,
  ip: "192.168.1.1",
  timestamp: "2025-10-26T12:00:00Z",
  userAgent: "Mozilla/5.0...",
});

// Output:
// {"level":"info","message":"User logged in","userId":123,"ip":"192.168.1.1",...}
```

**Benefits**:
- Queryable (filter by userId)
- Machine-readable
- Consistent format

---

### 3. Traces

**Purpose**: Track request flow through distributed systems.

**Example**:
```
User Request → API Gateway → Auth Service → Payment Service → Database
     1ms           2ms           50ms            100ms           30ms
                                                   ↑ SLOW SPAN
```

**Tools**:
- Jaeger
- Zipkin
- AWS X-Ray
- DataDog APM
- New Relic

**When to Use**:
- Microservices architecture
- Slow requests (which service is the bottleneck?)
- Debugging distributed systems

---

## Alerting Best Practices

### Alert on Symptoms, Not Causes

**BAD** (cause-based):
- Alert: "CPU usage >80%"
- Problem: CPU can be high without user impact

**GOOD** (symptom-based):
- Alert: "API response time >1s"
- Why: Users are actually experiencing slowness

---

### Alert Severity Levels

**P1 (SEV1) - Page On-Call**:
- Service down (availability <99%)
- Data loss
- Security breach
- Response time >5s (unusable)

**P2 (SEV2) - Notify During Business Hours**:
- Degraded performance (response time >1s)
- Error rate >1%
- Disk >90% full

**P3 (SEV3) - Email/Slack**:
- Warning signs (disk >80%, memory >80%)
- Non-critical errors
- Monitoring gaps

---

### Alert Fatigue Prevention

**Rules**:
1. **Actionable**: Every alert must have a clear action
2. **Meaningful**: Alert only on real problems
3. **Context**: Include relevant info (which server, which metric)
4. **Deduplicate**: Don't alert 100 times for the same issue
5. **Escalate**: Auto-escalate if not acknowledged

**Example Bad Alert**:
```
Subject: Alert
Body: Server is down
```

**Example Good Alert**:
```
Subject: [P1] API Server Down - Production
Body:
- Service: api.example.com
- Issue: Health check failing for 5 minutes
- Impact: All users affected (100%)
- Runbook: https://wiki.example.com/runbook/api-down
- Dashboard: https://grafana.example.com/d/api
```
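A symptom-based alert with runbook and dashboard links can be encoded directly in a Prometheus alerting rule. Below is a minimal sketch, assuming Prometheus with Alertmanager; the rule name, threshold, and URLs are illustrative, not a prescribed standard:

```yaml
groups:
  - name: api-alerts
    rules:
      - alert: APIHighLatency
        # Symptom-based: fires on what users feel (p95 latency), not on CPU
        expr: histogram_quantile(0.95, sum by (le, instance) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 5m                 # condition must hold for 5 minutes, which avoids flapping
        labels:
          severity: P2          # maps to the severity levels above
        annotations:
          summary: "p95 latency above 1s on {{ $labels.instance }}"
          runbook: "https://wiki.example.com/runbook/api-latency"
          dashboard: "https://grafana.example.com/d/api"
```

Deduplication (grouping), repeat intervals, and auto-escalation to a paging receiver would then live in the Alertmanager configuration rather than in the rule itself.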
---

## Monitoring Setup

### Application Monitoring

#### Prometheus + Grafana

**Install Prometheus Client** (Node.js):
```javascript
const client = require('prom-client');

// Enable default metrics (CPU, memory, event loop lag, etc.)
client.collectDefaultMetrics();

// Custom metric: request duration histogram
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
});

// Instrument all requests
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    end({
      method: req.method,
      // req.route is only set inside route handlers; fall back to req.path
      route: req.route ? req.route.path : req.path,
      status: res.statusCode,
    });
  });
  next();
});

// Expose metrics endpoint (register.metrics() returns a Promise in prom-client v13+)
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
```

**Prometheus Config** (prometheus.yml):
```yaml
scrape_configs:
  - job_name: 'api-server'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:3000']
```
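Once Prometheus is scraping this endpoint, the Rate, Errors, and Duration numbers used by the RED method later in this document can all be derived from this one histogram. A sketch of the PromQL, assuming the `http_request_duration_seconds` metric defined above:

```
# Rate: requests per second over the last 5 minutes
sum(rate(http_request_duration_seconds_count[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_request_duration_seconds_count{status=~"5.."}[5m]))
  / sum(rate(http_request_duration_seconds_count[5m]))

# Duration: p95 response time
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```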
---

### Log Aggregation

#### ELK Stack

**Application** (send logs to Logstash):
```javascript
const winston = require('winston');
const LogstashTransport = require('winston-logstash-transport').LogstashTransport;

const logger = winston.createLogger({
  transports: [
    new LogstashTransport({
      host: 'logstash.example.com',
      port: 5000,
    }),
  ],
});

logger.info('User logged in', { userId: 123, ip: '192.168.1.1' });
```

**Logstash Config**:
```
input {
  tcp {
    port => 5000
    codec => json
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "application-logs-%{+YYYY.MM.dd}"
  }
}
```

---

### Health Checks

**Purpose**: Check whether a service is healthy and ready to serve traffic.

**Types**:
1. **Liveness**: Is the service running? (restart it if this fails)
2. **Readiness**: Is the service ready to serve traffic? (remove it from the load balancer if this fails)

**Example** (Express.js):
```javascript
// Liveness probe (simple check)
app.get('/healthz', (req, res) => {
  res.status(200).send('OK');
});

// Readiness probe (check dependencies)
app.get('/ready', async (req, res) => {
  try {
    // Check database
    await db.query('SELECT 1');

    // Check Redis
    await redis.ping();

    // Check external API (global fetch requires Node 18+)
    await fetch('https://api.external.com/health');

    res.status(200).send('Ready');
  } catch (error) {
    res.status(503).send('Not ready');
  }
});
```

**Kubernetes**:
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 5
```

---

### SLI, SLO, SLA

**SLI** (Service Level Indicator):
- A metric that measures service quality
- Examples: Response time, error rate, availability

**SLO** (Service Level Objective):
- A target for an SLI
- Examples: "99.9% availability", "p95 response time <500ms"

**SLA** (Service Level Agreement):
- A contract with users (with penalties)
- Example: "99.9% uptime or a refund"

**Example**:
```
SLI: Availability = (successful requests / total requests) * 100
SLO: Availability must be ≥99.9% per month
SLA: If availability <99.9%, users get a 10% refund
```
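An SLO also implies an error budget: at 99.9% monthly availability you may fail 0.1% of requests, which in time terms is roughly 43 minutes of downtime per 30-day month (30 × 24 × 60 × 0.001 ≈ 43.2). A small Node.js sketch of the arithmetic, with illustrative numbers:

```javascript
// Error budget for a 99.9% monthly availability SLO (numbers are illustrative).
const slo = 0.999;
const totalRequests = 10_000_000; // requests served this month
const failedRequests = 4_200;     // 5xx responses this month

const availability = (totalRequests - failedRequests) / totalRequests;
const errorBudget = totalRequests * (1 - slo);        // 10,000 failures allowed
const budgetRemaining = errorBudget - failedRequests; // 5,800 failures left

console.log(`Availability: ${(availability * 100).toFixed(3)}%`); // 99.958%
console.log(`Error budget remaining: ${budgetRemaining} failed requests`);
```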
---

## Monitoring Checklist

**Application**:
- [ ] Response time metrics (p50, p95, p99)
- [ ] Error rate metrics (4xx, 5xx)
- [ ] Throughput metrics (requests per second)
- [ ] Health check endpoints (/healthz, /ready)
- [ ] Structured logging (JSON format)
- [ ] Distributed tracing (if microservices)

**Infrastructure**:
- [ ] CPU, memory, disk, network metrics
- [ ] System logs (syslog, journalctl)
- [ ] Cloud metrics (CloudWatch, Azure Monitor)
- [ ] Disk I/O metrics (iostat)

**Database**:
- [ ] Query performance metrics
- [ ] Connection pool metrics
- [ ] Slow query log enabled
- [ ] Deadlock monitoring

**Alerts**:
- [ ] P1 alerts for critical issues (page on-call)
- [ ] P2 alerts for degraded performance
- [ ] Runbook linked in alerts
- [ ] Dashboard linked in alerts
- [ ] Escalation policy configured

**Dashboards**:
- [ ] Overview dashboard (RED metrics: Rate, Errors, Duration)
- [ ] Infrastructure dashboard (CPU, memory, disk)
- [ ] Database dashboard (queries, connections)
- [ ] Business metrics dashboard (signups, revenue)

---

## Common Monitoring Patterns

### RED Method (for services)

**Rate**: Requests per second
**Errors**: Error rate (%)
**Duration**: Response time (p50, p95, p99)

**Dashboard**:
```
+-----------------+  +-----------------+  +-----------------+
|      Rate       |  |     Errors      |  |    Duration     |
|   1000 req/s    |  |      0.5%       |  |   p95: 250ms    |
+-----------------+  +-----------------+  +-----------------+
```

### USE Method (for resources)

**Utilization**: % of resource used (CPU, memory, disk)
**Saturation**: Queue depth, backlog
**Errors**: Error count

**Dashboard**:
```
CPU:    70% utilization, 0.5 load average, 0 errors
Memory: 80% utilization, 0 swap, 0 OOM kills
Disk:   60% utilization, 5ms latency, 0 I/O errors
```

---

## Tools Comparison

| Tool | Type | Best For | Cost |
|------|------|----------|------|
| Prometheus + Grafana | Metrics | Self-hosted, cost-effective | Free |
| DataDog | Metrics, Logs, APM | All-in-one, easy setup | $15/host/month |
| New Relic | APM | Application performance | $99/user/month |
| ELK Stack | Logs | Log aggregation | Free (self-hosted) |
| Splunk | Logs | Enterprise log analysis | $1800/GB/year |
| Jaeger | Traces | Distributed tracing | Free |
| CloudWatch | Metrics, Logs | AWS-native | $0.30/metric/month |
| Azure Monitor | Metrics, Logs | Azure-native | $0.25/metric/month |

---

## Related Documentation

- [SKILL.md](../SKILL.md) - Main SRE agent
- [backend-diagnostics.md](backend-diagnostics.md) - Application troubleshooting
- [database-diagnostics.md](database-diagnostics.md) - Database monitoring
- [infrastructure.md](infrastructure.md) - Infrastructure monitoring