# Monitoring & Observability

Purpose: Set up monitoring, alerting, and observability to detect incidents early.

## Observability Pillars

### 1. Metrics
What to Monitor:
- Application: Response time, error rate, throughput
- Infrastructure: CPU, memory, disk, network
- Database: Query time, connections, deadlocks
- Business: User signups, revenue, conversions
Tools:
- Prometheus + Grafana
- DataDog
- New Relic
- CloudWatch (AWS)
- Azure Monitor
### Key Metrics by Layer

Application Metrics:

```text
http_requests_total                # Total requests
http_request_duration_seconds      # Response time (histogram)
http_requests_errors_total         # Error count
http_requests_in_flight            # Concurrent requests
```

Infrastructure Metrics (node_exporter):

```text
node_cpu_seconds_total             # CPU time by mode
node_memory_MemAvailable_bytes     # Available memory
node_filesystem_avail_bytes        # Free disk space
node_network_receive_bytes_total   # Network in
```

Database Metrics (postgres_exporter):

```text
pg_stat_database_tup_returned      # Rows returned
pg_stat_database_tup_fetched       # Rows fetched
pg_stat_database_deadlocks         # Deadlock count
pg_stat_activity_count             # Connections by state
```
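As a hedged sketch, connection saturation can be derived from these metrics with PromQL; `pg_settings_max_connections` and the per-instance matching assume postgres_exporter's default collectors:

```promql
# Share of max_connections in use (assumes one Postgres server per instance label)
sum by (instance) (pg_stat_activity_count)
  / on (instance) pg_settings_max_connections
```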
### 2. Logs
What to Log:
- Application logs: Errors, warnings, info
- Access logs: HTTP requests (nginx, apache)
- System logs: Kernel, systemd, auth
- Audit logs: Security events, data access
Log Levels:
- ERROR: Application errors, exceptions
- WARN: Potential issues (deprecated API, high latency)
- INFO: Normal operations (user login, job completed)
- DEBUG: Detailed troubleshooting (only in dev)
Tools:
- ELK Stack (Elasticsearch, Logstash, Kibana)
- Splunk
- CloudWatch Logs
- Azure Log Analytics
### Structured Logging

BAD (unstructured):

```js
console.log("User logged in: " + userId);
```

GOOD (structured JSON):

```js
logger.info("User logged in", {
  userId: 123,
  ip: "192.168.1.1",
  timestamp: "2025-10-26T12:00:00Z",
  userAgent: "Mozilla/5.0...",
});
// Output:
// {"level":"info","message":"User logged in","userId":123,"ip":"192.168.1.1",...}
```

Benefits:
- Queryable (filter by userId)
- Machine-readable
- Consistent format
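A minimal winston setup (a sketch, assuming the winston package) that produces JSON output like the above; the formatter adds the timestamp so callers don't have to pass it by hand, and the level is configurable per environment:

```js
const winston = require('winston');

// JSON logs with a timestamp on every entry
const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info', // e.g. 'debug' in dev, 'info' in prod
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [new winston.transports.Console()],
});

logger.info('User logged in', { userId: 123, ip: '192.168.1.1' });
```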
### 3. Traces

Purpose: Track request flow through distributed systems.

Example:

```text
User Request → API Gateway → Auth Service → Payment Service → Database
    1ms            2ms           50ms            100ms           30ms
                                                   ↑ SLOW SPAN
```
Tools:
- Jaeger
- Zipkin
- AWS X-Ray
- DataDog APM
- New Relic
When to Use:
- Microservices architecture
- Slow requests (which service is slow?)
- Debugging distributed systems
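A minimal tracing setup sketch using the OpenTelemetry Node SDK with auto-instrumentation, exporting to Jaeger over OTLP; the package choices, service name, and collector URL are assumptions, not prescribed here:

```js
// tracing.js — load this before the application code
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

// Auto-instruments http, express, pg, redis, etc., so each hop becomes a span
const sdk = new NodeSDK({
  serviceName: 'api-server', // hypothetical service name
  traceExporter: new OTLPTraceExporter({
    url: 'http://jaeger:4318/v1/traces', // Jaeger's OTLP/HTTP endpoint (assumed host)
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
```

Run with `node -r ./tracing.js server.js` so instrumentation registers before application modules load.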
## Alerting Best Practices

### Alert on Symptoms, Not Causes

BAD (cause-based):
- Alert: "CPU usage >80%"
- Problem: CPU can be high without any user impact

GOOD (symptom-based):
- Alert: "API response time >1s"
- Why: Users are actually experiencing slowness
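As a sketch, a symptom-based Prometheus alerting rule on p95 latency, using the http_request_duration_seconds histogram instrumented under Monitoring Setup below; the threshold, severity label, and runbook URL are illustrative:

```yaml
groups:
  - name: api-symptoms
    rules:
      - alert: HighRequestLatency
        # p95 over the last 5 minutes, computed from the histogram buckets
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 5m                # must be sustained, not a single spike
        labels:
          severity: P2
        annotations:
          summary: "p95 API latency above 1s for 5 minutes"
          runbook: "https://wiki.example.com/runbook/high-latency"
```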
### Alert Severity Levels
P1 (SEV1) - Page On-Call:
- Service down (availability <99%)
- Data loss
- Security breach
- Response time >5s (unusable)
P2 (SEV2) - Notify During Business Hours:
- Degraded performance (response time >1s)
- Error rate >1%
- Disk >90% full
P3 (SEV3) - Email/Slack:
- Warning signs (disk >80%, memory >80%)
- Non-critical errors
- Monitoring gaps
### Alert Fatigue Prevention

Rules (see the Alertmanager sketch below):
- Actionable: Every alert must have a clear action
- Meaningful: Alert only on real problems
- Context: Include relevant info (which server, which metric)
- Deduplicate: Don't alert 100 times for the same issue
- Escalate: Auto-escalate if not acknowledged
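A minimal Alertmanager routing sketch for the deduplication and grouping rules above; the receiver name, channel, and timings are illustrative assumptions:

```yaml
route:
  receiver: slack-oncall               # hypothetical receiver
  group_by: ['alertname', 'service']   # one notification per issue, not per instance
  group_wait: 30s                      # wait briefly to batch related alerts
  group_interval: 5m                   # batch further updates for an open group
  repeat_interval: 4h                  # re-notify if still firing (unacknowledged)
receivers:
  - name: slack-oncall
    slack_configs:
      - channel: '#oncall'
        api_url: 'https://hooks.slack.com/services/...' # placeholder webhook
```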
Example Bad Alert:

```text
Subject: Alert
Body: Server is down
```

Example Good Alert:

```text
Subject: [P1] API Server Down - Production
Body:
- Service: api.example.com
- Issue: Health check failing for 5 minutes
- Impact: All users affected (100%)
- Runbook: https://wiki.example.com/runbook/api-down
- Dashboard: https://grafana.example.com/d/api
```
## Monitoring Setup

### Application Monitoring

#### Prometheus + Grafana

Instrument the app with the Prometheus client (prom-client, Node.js):
```js
const client = require('prom-client');

// Enable default metrics (CPU, memory, event loop lag, etc.)
client.collectDefaultMetrics();

// Custom metrics
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
});

// Instrument every request
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    // req.route is only set once a route has matched; fall back to the raw path
    end({
      method: req.method,
      route: req.route ? req.route.path : req.path,
      status: res.statusCode,
    });
  });
  next();
});

// Expose metrics endpoint for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics()); // metrics() returns a Promise in prom-client v13+
});
```
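The request and error counters listed under "Key Metrics by Layer" can be registered the same way; a sketch continuing the snippet above with the same middleware pattern:

```js
const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});
const httpErrorsTotal = new client.Counter({
  name: 'http_requests_errors_total',
  help: 'Total HTTP 5xx responses',
  labelNames: ['method', 'route', 'status'],
});

app.use((req, res, next) => {
  res.on('finish', () => {
    const labels = {
      method: req.method,
      route: req.route ? req.route.path : req.path,
      status: res.statusCode,
    };
    httpRequestsTotal.inc(labels);
    if (res.statusCode >= 500) httpErrorsTotal.inc(labels); // errors counted separately
  });
  next();
});
```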
Prometheus config (prometheus.yml):

```yaml
scrape_configs:
  - job_name: 'api-server'
    static_configs:
      - targets: ['localhost:3000']
    scrape_interval: 15s
```
### Log Aggregation

#### ELK Stack

Application (send logs to Logstash):

```js
const winston = require('winston');
const { LogstashTransport } = require('winston-logstash-transport');

const logger = winston.createLogger({
  transports: [
    new LogstashTransport({
      host: 'logstash.example.com',
      port: 5000,
    }),
  ],
});

logger.info('User logged in', { userId: 123, ip: '192.168.1.1' });
```
Logstash config:

```conf
input {
  tcp {
    port => 5000
    codec => json
  }
}
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "application-logs-%{+YYYY.MM.dd}"
  }
}
```
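Once indexed, the structured fields are queryable; for example, a sketch of an Elasticsearch search against the index pattern above, e.g. `POST /application-logs-*/_search` with body (the @timestamp field is added by Logstash):

```json
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "userId": 123 } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}
```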
## Health Checks

Purpose: Check whether a service is healthy and ready to serve traffic.

Types:
- Liveness: Is the service running? (restart if it fails)
- Readiness: Is the service ready to serve traffic? (remove from load balancer if it fails)
Example (Express.js):

```js
// Liveness probe (simple check: the process is up and responding)
app.get('/healthz', (req, res) => {
  res.status(200).send('OK');
});

// Readiness probe (check dependencies)
app.get('/ready', async (req, res) => {
  try {
    // Check database
    await db.query('SELECT 1');
    // Check Redis
    await redis.ping();
    // Check external API (fetch resolves even on HTTP errors, so check .ok)
    const response = await fetch('https://api.external.com/health');
    if (!response.ok) throw new Error(`External API unhealthy: ${response.status}`);
    res.status(200).send('Ready');
  } catch (error) {
    res.status(503).send('Not ready');
  }
});
```
Kubernetes:

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 5
```
## SLI, SLO, SLA

SLI (Service Level Indicator):
- A metric that measures service quality
- Examples: Response time, error rate, availability

SLO (Service Level Objective):
- An internal target for an SLI
- Examples: "99.9% availability", "p95 response time <500ms"

SLA (Service Level Agreement):
- A contract with users, with penalties for missing it
- Example: "99.9% uptime or refund"

Example:

```text
SLI: Availability = (successful requests / total requests) * 100
SLO: Availability must be ≥99.9% per month
SLA: If availability <99.9%, users get a 10% refund
```
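A 99.9% monthly SLO implies a concrete error budget; a quick back-of-the-envelope calculation:

```js
// Error budget for a 99.9% availability SLO over a 30-day month
const slo = 0.999;
const minutesPerMonth = 30 * 24 * 60;                   // 43,200 minutes
const errorBudgetMinutes = (1 - slo) * minutesPerMonth;
console.log(errorBudgetMinutes.toFixed(1));             // ~43.2 minutes of downtime allowed
```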
## Monitoring Checklist
Application:
- Response time metrics (p50, p95, p99)
- Error rate metrics (4xx, 5xx)
- Throughput metrics (requests per second)
- Health check endpoint (/healthz, /ready)
- Structured logging (JSON format)
- Distributed tracing (if microservices)
Infrastructure:
- CPU, memory, disk, network metrics
- System logs (syslog, journalctl)
- Cloud metrics (CloudWatch, Azure Monitor)
- Disk I/O metrics (iostat)
Database:
- Query performance metrics
- Connection pool metrics
- Slow query log enabled
- Deadlock monitoring
Alerts:
- P1 alerts for critical issues (page on-call)
- P2 alerts for degraded performance
- Runbook linked in alerts
- Dashboard linked in alerts
- Escalation policy configured
Dashboards:
- Overview dashboard (RED metrics: Rate, Errors, Duration)
- Infrastructure dashboard (CPU, memory, disk)
- Database dashboard (queries, connections)
- Business metrics dashboard (signups, revenue)
## Common Monitoring Patterns

### RED Method (for services)

- Rate: Requests per second
- Errors: Error rate (%)
- Duration: Response time (p50, p95, p99)

Dashboard:

```text
+-----------------+  +-----------------+  +-----------------+
|      Rate       |  |     Errors      |  |    Duration     |
|   1000 req/s    |  |      0.5%       |  |   p95: 250ms    |
+-----------------+  +-----------------+  +-----------------+
```
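Sketch PromQL for the three RED panels, assuming the counters and histogram instrumented earlier (the `status` label is an assumption about that instrumentation):

```promql
# Rate: requests per second over the last 5 minutes
sum(rate(http_requests_total[5m]))

# Errors: share of 5xx responses
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Duration: p95 latency from the histogram buckets
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```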
### USE Method (for resources)

- Utilization: % of the resource in use (CPU, memory, disk)
- Saturation: Queue depth, backlog
- Errors: Error count

Dashboard:

```text
CPU:    70% utilization, 0.5 load average, 0 errors
Memory: 80% utilization, 0 swap, 0 OOM kills
Disk:   60% utilization, 5ms latency, 0 I/O errors
```
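Equivalent node_exporter queries for the CPU row, as a sketch (single node assumed):

```promql
# Utilization: share of non-idle CPU time across all cores
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Saturation: 1-minute load average per core (>1 means work is queuing)
node_load1 / count(node_cpu_seconds_total{mode="idle"})

# Errors: network receive errors per second
rate(node_network_receive_errs_total[5m])
```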
## Tools Comparison

| Tool | Type | Best For | Approx. Cost |
|---|---|---|---|
| Prometheus + Grafana | Metrics | Self-hosted, cost-effective | Free |
| DataDog | Metrics, Logs, APM | All-in-one, easy setup | $15/host/month |
| New Relic | APM | Application performance | $99/user/month |
| ELK Stack | Logs | Log aggregation | Free (self-hosted) |
| Splunk | Logs | Enterprise log analysis | $1800/GB/year |
| Jaeger | Traces | Distributed tracing | Free |
| CloudWatch | Metrics, Logs | AWS-native | $0.30/metric/month |
| Azure Monitor | Metrics, Logs | Azure-native | $0.25/metric/month |
## Related Documentation
- SKILL.md - Main SRE agent
- backend-diagnostics.md - Application troubleshooting
- database-diagnostics.md - Database monitoring
- infrastructure.md - Infrastructure monitoring