Monitoring & Observability

Purpose: Set up monitoring, alerting, and observability to detect incidents early.

Observability Pillars

1. Metrics

What to Monitor:

  • Application: Response time, error rate, throughput
  • Infrastructure: CPU, memory, disk, network
  • Database: Query time, connections, deadlocks
  • Business: User signups, revenue, conversions

Tools:

  • Prometheus + Grafana
  • DataDog
  • New Relic
  • CloudWatch (AWS)
  • Azure Monitor

Key Metrics by Layer

Application Metrics:

http_requests_total               # Total requests
http_request_duration_seconds     # Response time (histogram)
http_requests_errors_total        # Error count
http_requests_in_flight           # Concurrent requests

Infrastructure Metrics:

node_cpu_seconds_total            # CPU usage
node_memory_MemAvailable_bytes    # Available memory
node_filesystem_avail_bytes       # Free disk space
node_network_receive_bytes_total  # Network in

Database Metrics:

pg_stat_database_tup_returned     # Rows returned
pg_stat_database_tup_fetched      # Rows fetched
pg_stat_database_deadlocks        # Deadlock count
pg_stat_activity_count            # Active connections (per state)

2. Logs

What to Log:

  • Application logs: Errors, warnings, info
  • Access logs: HTTP requests (nginx, apache)
  • System logs: Kernel, systemd, auth
  • Audit logs: Security events, data access

Log Levels:

  • ERROR: Application errors, exceptions
  • WARN: Potential issues (deprecated API, high latency)
  • INFO: Normal operations (user login, job completed)
  • DEBUG: Detailed troubleshooting (only in dev)

Tools:

  • ELK Stack (Elasticsearch, Logstash, Kibana)
  • Splunk
  • CloudWatch Logs
  • Azure Log Analytics

Structured Logging

BAD (unstructured):

console.log("User logged in: " + userId);

GOOD (structured JSON):

logger.info("User logged in", {
  userId: 123,
  ip: "192.168.1.1",
  timestamp: "2025-10-26T12:00:00Z",
  userAgent: "Mozilla/5.0...",
});

// Output:
// {"level":"info","message":"User logged in","userId":123,"ip":"192.168.1.1",...}

Benefits:

  • Queryable (filter by userId)
  • Machine-readable
  • Consistent format
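
The `logger` used above can be any structured logger; a minimal sketch with winston (assumed here, and also used in the ELK example later) that produces the JSON output shown and keeps DEBUG noise out of production:

const winston = require('winston');

// JSON logs to stdout; level is environment-driven so DEBUG stays dev-only
// (LOG_LEVEL is an assumed environment variable)
const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [new winston.transports.Console()],
});

logger.info('User logged in', { userId: 123, ip: '192.168.1.1' });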

3. Traces

Purpose: Track request flow through distributed systems

Example:

User Request → API Gateway → Auth Service → Payment Service → Database
     1ms           2ms            50ms            100ms          30ms
                                                  ↑ SLOW SPAN

Tools:

  • Jaeger
  • Zipkin
  • AWS X-Ray
  • DataDog APM
  • New Relic

When to Use:

  • Microservices architecture
  • Slow requests (which service is slow?)
  • Debugging distributed systems
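
For Node.js services, one common way to produce these traces is OpenTelemetry auto-instrumentation exporting over OTLP to a tracing backend such as Jaeger. A minimal sketch; the endpoint URL is a placeholder and the service name is taken from the OTEL_SERVICE_NAME environment variable:

// tracing.js - load before the app, e.g. node -r ./tracing.js server.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://jaeger.example.com:4318/v1/traces', // placeholder OTLP/HTTP endpoint
  }),
  instrumentations: [getNodeAutoInstrumentations()], // auto-traces http, express, pg, redis, ...
});

sdk.start();

Recent Jaeger releases accept OTLP directly, so per-request spans show up in the Jaeger UI and a slow span like the one in the example above is easy to spot.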

Alerting Best Practices

Alert on Symptoms, Not Causes

BAD (cause-based):

  • Alert: "CPU usage >80%"
  • Problem: CPU can be high without user impact

GOOD (symptom-based):

  • Alert: "API response time >1s"
  • Why: Users actually experiencing slowness
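
For example, the symptom-based response-time alert can be written as a Prometheus alerting rule against the request-duration histogram instrumented later in this doc; the threshold, severity label, and rule names here are illustrative:

groups:
  - name: api-symptom-alerts
    rules:
      - alert: APIHighLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 5m
        labels:
          severity: P2
        annotations:
          summary: "p95 API response time above 1s for 5 minutes"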

Alert Severity Levels

P1 (SEV1) - Page On-Call:

  • Service down (availability <99%)
  • Data loss
  • Security breach
  • Response time >5s (unusable)

P2 (SEV2) - Notify During Business Hours:

  • Degraded performance (response time >1s)
  • Error rate >1%
  • Disk >90% full

P3 (SEV3) - Email/Slack:

  • Warning signs (disk >80%, memory >80%)
  • Non-critical errors
  • Monitoring gaps

Alert Fatigue Prevention

Rules:

  1. Actionable: Every alert must have clear action
  2. Meaningful: Alert only on real problems
  3. Context: Include relevant info (which server, which metric)
  4. Deduplicate: Don't alert 100 times for same issue
  5. Escalate: Auto-escalate if not acknowledged
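
Deduplication (rule 4) and escalation (rule 5) usually live in the alert router rather than in each alert. A sketch of how Prometheus Alertmanager could enforce them, with placeholder receivers; acknowledgement-based escalation itself is normally handled by the paging tool (PagerDuty, Opsgenie):

route:
  receiver: team-slack
  group_by: ['alertname', 'service']  # one grouped notification per issue, not 100 copies
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h                 # re-notify while the alert keeps firing
  routes:
    - matchers: ['severity="P1"']
      receiver: pagerduty-oncall      # P1 goes straight to the on-call pager

receivers:
  - name: team-slack                  # assumes a global slack_api_url is configured
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: '<pagerduty-integration-key>'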

Example Bad Alert:

Subject: Alert
Body: Server is down

Example Good Alert:

Subject: [P1] API Server Down - Production
Body:
- Service: api.example.com
- Issue: Health check failing for 5 minutes
- Impact: All users affected (100%)
- Runbook: https://wiki.example.com/runbook/api-down
- Dashboard: https://grafana.example.com/d/api

Monitoring Setup

Application Monitoring

Prometheus + Grafana

Install Prometheus Client (Node.js):

const client = require('prom-client');

// Enable default metrics (CPU, memory, etc.)
client.collectDefaultMetrics();

// Custom metrics
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
});

// Instrument code
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    // req.route is only set once a route has matched; fall back to the raw path (e.g. 404s)
    const route = req.route ? req.route.path : req.path;
    end({ method: req.method, route, status: res.statusCode });
  });
  next();
});

// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics()); // metrics() returns a Promise in prom-client v13+
});

Prometheus Config (prometheus.yml):

scrape_configs:
  - job_name: 'api-server'
    static_configs:
      - targets: ['localhost:3000']
    scrape_interval: 15s

Log Aggregation

ELK Stack

Application (send logs to Logstash):

const winston = require('winston');
const LogstashTransport = require('winston-logstash-transport').LogstashTransport;

const logger = winston.createLogger({
  transports: [
    new LogstashTransport({
      host: 'logstash.example.com',
      port: 5000,
    }),
  ],
});

logger.info('User logged in', { userId: 123, ip: '192.168.1.1' });

Logstash Config:

input {
  tcp {
    port => 5000
    codec => json
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "application-logs-%{+YYYY.MM.dd}"
  }
}

Health Checks

Purpose: Check if service is healthy and ready to serve traffic

Types:

  1. Liveness: Is the service running? (restart if fails)
  2. Readiness: Is the service ready to serve traffic? (remove from load balancer if fails)

Example (Express.js):

// Liveness probe (simple check)
app.get('/healthz', (req, res) => {
  res.status(200).send('OK');
});

// Readiness probe (check dependencies)
app.get('/ready', async (req, res) => {
  try {
    // Check database
    await db.query('SELECT 1');

    // Check Redis
    await redis.ping();

    // Check external API (fetch only rejects on network errors, so check the status too)
    const externalCheck = await fetch('https://api.external.com/health');
    if (!externalCheck.ok) throw new Error('External API unhealthy');

    res.status(200).send('Ready');
  } catch (error) {
    res.status(503).send('Not ready');
  }
});

Kubernetes:

livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 5

SLI, SLO, SLA

SLI (Service Level Indicator):

  • Metrics that measure service quality
  • Examples: Response time, error rate, availability

SLO (Service Level Objective):

  • Target for SLI
  • Examples: "99.9% availability", "p95 response time <500ms"

SLA (Service Level Agreement):

  • Contract with users (with penalties)
  • Examples: "99.9% uptime or refund"

Example:

SLI: Availability = (successful requests / total requests) * 100
SLO: Availability must be ≥99.9% per month
SLA: If availability <99.9%, users get 10% refund
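
A useful companion to the SLO is its error budget: how much unavailability the target still permits. A quick sketch of the arithmetic for the 99.9% monthly objective above, assuming a 30-day month:

// Error budget for a 99.9% monthly availability SLO (30-day month assumed)
const slo = 0.999;
const minutesPerMonth = 30 * 24 * 60;              // 43,200 minutes
const budgetMinutes = (1 - slo) * minutesPerMonth; // 43.2 minutes of downtime allowed
console.log(`Error budget: ${budgetMinutes.toFixed(1)} minutes/month`);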

Monitoring Checklist

Application:

  • Response time metrics (p50, p95, p99)
  • Error rate metrics (4xx, 5xx)
  • Throughput metrics (requests per second)
  • Health check endpoint (/healthz, /ready)
  • Structured logging (JSON format)
  • Distributed tracing (if microservices)

Infrastructure:

  • CPU, memory, disk, network metrics
  • System logs (syslog, journalctl)
  • Cloud metrics (CloudWatch, Azure Monitor)
  • Disk I/O metrics (iostat)

Database:

  • Query performance metrics
  • Connection pool metrics
  • Slow query log enabled
  • Deadlock monitoring

Alerts:

  • P1 alerts for critical issues (page on-call)
  • P2 alerts for degraded performance
  • Runbook linked in alerts
  • Dashboard linked in alerts
  • Escalation policy configured

Dashboards:

  • Overview dashboard (RED metrics: Rate, Errors, Duration)
  • Infrastructure dashboard (CPU, memory, disk)
  • Database dashboard (queries, connections)
  • Business metrics dashboard (signups, revenue)

Common Monitoring Patterns

RED Method (for services)

Rate: Requests per second
Errors: Error rate (%)
Duration: Response time (p50, p95, p99)

Dashboard:

+-----------------+  +-----------------+  +-----------------+
|      Rate       |  |     Errors      |  |    Duration     |
|  1000 req/s     |  |      0.5%       |  | p95: 250ms      |
+-----------------+  +-----------------+  +-----------------+
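
Assuming Prometheus and the application metric names from earlier in this doc, the three panels map to PromQL queries roughly like these:

# Rate: requests per second
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests that failed
sum(rate(http_requests_errors_total[5m])) / sum(rate(http_requests_total[5m]))

# Duration: p95 latency from the request-duration histogram
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))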

USE Method (for resources)

Utilization: % of resource used (CPU, memory, disk)
Saturation: Queue depth, backlog
Errors: Error count

Dashboard:

CPU: 70% utilization, 0.5 load average, 0 errors
Memory: 80% utilization, 0 swap, 0 OOM kills
Disk: 60% utilization, 5ms latency, 0 I/O errors

Tools Comparison

Tool                  | Type                | Best For                     | Cost
Prometheus + Grafana  | Metrics             | Self-hosted, cost-effective  | Free
DataDog               | Metrics, Logs, APM  | All-in-one, easy setup       | $15/host/month
New Relic             | APM                 | Application performance      | $99/user/month
ELK Stack             | Logs                | Log aggregation              | Free (self-hosted)
Splunk                | Logs                | Enterprise log analysis      | $1800/GB/year
Jaeger                | Traces              | Distributed tracing          | Free
CloudWatch            | Metrics, Logs       | AWS-native                   | $0.30/metric/month
Azure Monitor         | Metrics, Logs       | Azure-native                 | $0.25/metric/month