Observability Engineering Setup Checklist

Comprehensive checklist for implementing production-grade observability with logs, metrics, traces, and alerts.

Pre-Implementation Planning

  • Define observability goals (debug production issues, monitor SLAs, detect anomalies)

  • Choose observability stack:

    • Logging: Pino (Node.js), structlog (Python), CloudWatch, Datadog
    • Metrics: Prometheus, Datadog, CloudWatch
    • Tracing: OpenTelemetry, Datadog APM, Jaeger
    • Visualization: Grafana, Datadog, Honeycomb
  • Set up observability infrastructure (collectors, storage, dashboards)

  • Define data retention policies (logs: 30 days, metrics: 1 year, traces: 7 days)

  • Plan for scale (log volume, metric cardinality, trace sampling)

Structured Logging

Logger Configuration

  • Structured logger installed:

    • Node.js: pino with pino-pretty for dev
    • Python: structlog with JSON formatter
    • Browser: Custom JSON logger or service integration
  • Log levels defined:

    • TRACE: Very detailed debugging
    • DEBUG: Detailed debugging info
    • INFO: General informational messages
    • WARN: Warning messages (recoverable issues)
    • ERROR: Error messages (failures)
    • FATAL: Critical failures (application crash)
  • Environment-based configuration (see the config sketch after this list):

    • Development: Pretty-printed logs, DEBUG level
    • Production: JSON logs, INFO level
    • Test: Silent or minimal logs
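
A minimal sketch of this environment-based setup with Pino (the service name and level defaults are illustrative; pino and pino-pretty are assumed to be installed):

    import pino from 'pino';

    const isDev = process.env.NODE_ENV === 'development';

    export const logger = pino({
      // DEBUG in dev, INFO in production, overridable via LOG_LEVEL
      level: process.env.LOG_LEVEL ?? (isDev ? 'debug' : 'info'),
      // pino-pretty for human-readable dev output; plain JSON in production
      transport: isDev ? { target: 'pino-pretty' } : undefined,
      // Fields attached to every log line
      base: { service: 'api-server', environment: process.env.NODE_ENV },
    });

For tests, setting level to 'silent' disables output entirely.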

Log Structure

  • Standard log format across services:

    {
      "level": "info",
      "timestamp": "2025-11-10T10:30:00.000Z",
      "service": "api-server",
      "environment": "production",
      "tenant_id": "uuid",
      "user_id": "uuid",
      "request_id": "uuid",
      "message": "User logged in",
      "duration_ms": 150,
      "http": {
        "method": "POST",
        "path": "/api/login",
        "status": 200,
        "user_agent": "Mozilla/5.0..."
      }
    }
    
  • Correlation IDs added:

    • request_id: Unique per request
    • session_id: Unique per session
    • trace_id: Unique per distributed trace
    • tenant_id: Multi-tenant context
  • Context propagation through request lifecycle

  • Sensitive data redacted (passwords, tokens, credit cards)
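
A sketch of the redaction and correlation items above, using Pino's built-in redact option plus a per-request child logger (the paths and helper are illustrative):

    import pino from 'pino';
    import { randomUUID } from 'node:crypto';

    const logger = pino({
      // Matching paths are replaced with '[Redacted]' before output
      redact: ['password', 'token', 'card_number', '*.authorization'],
    });

    // A child logger stamps correlation IDs onto every line it emits
    function requestLogger(tenantId: string, userId: string) {
      return logger.child({
        request_id: randomUUID(),
        tenant_id: tenantId,
        user_id: userId,
      });
    }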

What to Log

  • Request/response logging:

    • HTTP method, path, status code
    • Request duration
    • User agent, IP address (hashed or anonymized)
    • Query parameters (non-sensitive)
  • Authentication events:

    • Login success/failure
    • Logout
    • Token refresh
    • Permission checks
  • Business events:

    • User registration
    • Payment processing
    • Data exports
    • Admin actions
  • Errors and exceptions:

    • Error message
    • Stack trace
    • Error context (what user was doing)
    • Affected resources (user_id, tenant_id, entity_id)
  • Performance metrics:

    • Database query times
    • External API call times
    • Cache hit/miss rates
    • Background job durations

Log Aggregation

  • Logs shipped to central location:

    • CloudWatch Logs
    • Datadog Logs
    • Elasticsearch (ELK stack)
    • Splunk
  • Log retention configured (30-90 days typical)

  • Log volume monitored (cost management)

  • Log sampling for high-volume services (if needed)

Application Metrics

Metric Types

  • Counters for events that only increase:

    • Total requests
    • Total errors
    • Total registrations
  • Gauges for values that go up and down:

    • Active connections
    • Memory usage
    • Queue depth
  • Histograms for distributions:

    • Request duration
    • Response size
    • Database query time
  • Summaries for quantiles (p50, p95, p99)
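
A sketch of the first three metric types with prom-client (names follow the conventions later in this section; labels and values are illustrative):

    import { Counter, Gauge, Histogram } from 'prom-client';

    // Counter: only ever increases
    const requestsTotal = new Counter({
      name: 'myapp_http_requests_total',
      help: 'Total HTTP requests',
      labelNames: ['method', 'path', 'status'],
    });

    // Gauge: goes up and down
    const activeConnections = new Gauge({
      name: 'myapp_active_connections',
      help: 'Currently open connections',
    });

    // Histogram: observations bucketed into a distribution
    const requestDuration = new Histogram({
      name: 'myapp_http_request_duration_seconds',
      help: 'Request duration in seconds',
      buckets: [0.1, 0.5, 1, 2, 5, 10],
    });

    requestsTotal.inc({ method: 'POST', path: '/api/login', status: '200' });
    activeConnections.set(42);
    requestDuration.observe(0.15);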

Standard Metrics

HTTP Metrics

  • http_requests_total (counter):

    • Labels: method, path (route template, not raw URL, to limit cardinality), status, tenant_id
    • Track total requests per endpoint
  • http_request_duration_seconds (histogram):

    • Labels: method, path, status
    • Buckets: 0.1, 0.5, 1, 2, 5, 10 seconds
  • http_request_size_bytes (histogram)

  • http_response_size_bytes (histogram)

Database Metrics

  • db_queries_total (counter):

    • Labels: operation (SELECT, INSERT, UPDATE, DELETE), table
  • db_query_duration_seconds (histogram):

    • Labels: operation, table
    • Track slow queries (p95, p99)
  • db_connection_pool_size (gauge)

  • db_connection_pool_available (gauge)

Application Metrics

  • active_sessions (gauge)

  • background_jobs_total (counter):

    • Labels: job_name, status (success, failure)
  • background_job_duration_seconds (histogram):

    • Labels: job_name
  • cache_operations_total (counter):

    • Labels: operation (hit, miss, set, delete)
  • external_api_calls_total (counter):

    • Labels: service, status
  • external_api_duration_seconds (histogram):

    • Labels: service

System Metrics

  • process_cpu_usage_percent (gauge)
  • process_memory_usage_bytes (gauge)
  • process_heap_usage_bytes (gauge) - Node.js specific
  • process_open_file_descriptors (gauge)

Metric Collection

  • Prometheus client library installed:

    • Node.js: prom-client
    • Python: prometheus-client
    • Custom: OpenTelemetry SDK
  • Metrics endpoint exposed (/metrics; see the sketch at the end of this section)

  • Prometheus scrapes endpoint (or push to gateway)

  • Metric naming follows conventions:

    • Lowercase with underscores
    • Unit suffixes (_seconds, _bytes, _total)
    • Namespace prefix (myapp_http_requests_total)
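
A sketch of the /metrics endpoint above, using prom-client with Express (both assumed as dependencies):

    import express from 'express';
    import { register, collectDefaultMetrics } from 'prom-client';

    // Standard process metrics: CPU, memory, event loop lag, GC
    collectDefaultMetrics({ prefix: 'myapp_' });

    const app = express();

    app.get('/metrics', async (_req, res) => {
      res.set('Content-Type', register.contentType);
      res.end(await register.metrics());
    });

    app.listen(3000);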

Multi-Tenant Metrics

  • tenant_id label on all relevant metrics

  • Per-tenant dashboards (filter by tenant_id)

  • Tenant resource usage tracked:

    • API calls per tenant
    • Database storage per tenant
    • Data transfer per tenant
  • Tenant quotas monitored (alert on approaching limit)

Distributed Tracing

Tracing Setup

  • OpenTelemetry SDK installed:

    • Node.js: @opentelemetry/sdk-node
    • Python: opentelemetry-sdk
  • Tracing backend configured:

    • Jaeger (self-hosted)
    • Datadog APM
    • Honeycomb
    • AWS X-Ray
  • Auto-instrumentation enabled:

    • HTTP client/server
    • Database queries
    • Redis operations
    • Message queues
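
A minimal boot sketch for the setup above with the OpenTelemetry Node SDK (package names are the official ones; the OTLP endpoint shown is the default local collector address):

    import { NodeSDK } from '@opentelemetry/sdk-node';
    import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
    import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

    const sdk = new NodeSDK({
      serviceName: 'api-server',
      traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }),
      // Auto-instruments HTTP, common database drivers, Redis, and more
      instrumentations: [getNodeAutoInstrumentations()],
    });

    sdk.start();

This file must load before application code imports the libraries it patches, so the auto-instrumentation can hook them first.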

Span Creation

  • Custom spans for business logic:

    // Assumes the SDK is configured and this runs inside an async function
    const { trace, SpanStatusCode } = require('@opentelemetry/api');
    const tracer = trace.getTracer('payments');

    const span = tracer.startSpan('process-payment');
    span.setAttribute('tenant_id', tenantId);
    span.setAttribute('amount', amount);
    try {
      await processPayment();
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
    
  • Span attributes include context:

    • tenant_id, user_id, request_id
    • Input parameters (non-sensitive)
    • Result status
  • Span events for key moments:

    • "Payment started"
    • "Database query executed"
    • "External API called"

Trace Context Propagation

  • W3C Trace Context headers propagated:

    • traceparent: version, trace-id, parent-id, trace-flags
    • tracestate: vendor-specific data
  • Context propagated across:

    • HTTP requests (frontend ↔ backend)
    • Background jobs
    • Message queues
    • Microservices
  • Trace ID included in logs (correlate logs + traces)
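
A sketch of pulling the active trace ID into log lines via @opentelemetry/api (the logger is assumed to be the structured logger from earlier):

    import { trace } from '@opentelemetry/api';

    // Returns IDs for the current span, or an empty object outside a trace
    function traceContext() {
      const span = trace.getActiveSpan();
      if (!span) return {};
      const { traceId, spanId } = span.spanContext();
      return { trace_id: traceId, span_id: spanId };
    }

    logger.info({ ...traceContext() }, 'Payment processed');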

Sampling

  • Sampling strategy defined:

    • Head-based: Sample at trace start (1%, 10%, 100%)
    • Tail-based: Sample after trace completes (error traces, slow traces)
    • Adaptive: Sample based on load
  • Always sample errors and slow requests

  • Sample rate appropriate for volume (start high, reduce if needed)
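
A head-based sampling sketch for the Node SDK (tail-based sampling usually lives in the collector, not the SDK; the 10% ratio is illustrative):

    import { NodeSDK } from '@opentelemetry/sdk-node';
    import {
      ParentBasedSampler,
      TraceIdRatioBasedSampler,
    } from '@opentelemetry/sdk-trace-base';

    const sdk = new NodeSDK({
      // Respect the caller's sampling decision; sample 10% of new root traces
      sampler: new ParentBasedSampler({
        root: new TraceIdRatioBasedSampler(0.1),
      }),
      // ...exporter and instrumentations as in the setup sketch above
    });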

Alerting

Alert Definitions

  • Error rate alerts:

    • Condition: Error rate > 5% for 5 minutes
    • Severity: Critical
    • Action: Page on-call engineer
  • Latency alerts:

    • Condition: p95 latency > 1s for 10 minutes
    • Severity: Warning
    • Action: Slack notification
  • Availability alerts:

    • Condition: Health check fails 3 consecutive times
    • Severity: Critical
    • Action: Page on-call + auto-restart
  • Resource alerts:

    • Memory usage > 80%
    • CPU usage > 80%
    • Disk usage > 85%
    • Database connections > 90% of pool
  • Business metric alerts:

    • Registration rate drops > 50%
    • Payment failures increase > 10%
    • Active users drop significantly

Alert Channels

  • PagerDuty (or equivalent) for critical alerts
  • Slack for warnings and notifications
  • Email for non-urgent alerts
  • SMS for highest priority (only use sparingly)

Alert Management

  • Alert fatigue prevented:

    • Appropriate thresholds (not too sensitive)
    • Proper severity levels (not everything is critical)
    • Alert aggregation (deduplicate similar alerts)
  • Runbooks for each alert:

    • What the alert means
    • How to investigate
    • How to resolve
    • Escalation path
  • Alert suppression during deployments (planned downtime)

  • Alert escalation if not acknowledged

Dashboards & Visualization

Standard Dashboards

  • Service Overview dashboard:

    • Request rate (requests/sec)
    • Error rate (errors/sec, %)
    • Latency (p50, p95, p99)
    • Availability (uptime %)
  • Database dashboard:

    • Query rate
    • Slow queries (p95, p99)
    • Connection pool usage
    • Table sizes
  • System Resources dashboard:

    • CPU usage
    • Memory usage
    • Disk I/O
    • Network I/O
  • Business Metrics dashboard:

    • Active users
    • Registrations
    • Revenue
    • Feature usage

Dashboard Best Practices

  • Auto-refresh enabled (every 30-60 seconds)
  • Time range configurable (last hour, 24h, 7 days)
  • Drill-down to detailed views
  • Annotations for deployments/incidents
  • Shared dashboards accessible to team

Per-Tenant Dashboards

  • Tenant filter on all relevant dashboards
  • Tenant resource usage visualized
  • Tenant-specific alerts (if large customer)
  • Tenant comparison view (compare usage across tenants)

Health Checks

Endpoint Implementation

  • Health check endpoint (/health or /healthz):
    • Returns 200 OK when healthy
    • Returns 503 Service Unavailable when unhealthy
    • Includes subsystem status

    {
      "status": "healthy",
      "version": "1.2.3",
      "uptime_seconds": 86400,
      "checks": {
        "database": "healthy",
        "redis": "healthy",
        "external_api": "degraded"
      }
    }
  • Liveness probe (/health/live):

    • Checks if application is running
    • Fails → restart container
  • Readiness probe (/health/ready):

    • Checks if application is ready to serve traffic
    • Fails → remove from load balancer
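
A sketch of the probes above with Express (the db client and its query call are stand-ins for your real dependencies):

    import express from 'express';

    // Stand-in for a real database client
    declare const db: { query(sql: string): Promise<unknown> };

    const app = express();

    // Liveness: the process is up; a failure should restart the container
    app.get('/health/live', (_req, res) => res.status(200).json({ status: 'ok' }));

    // Readiness: dependencies are reachable; a failure pulls us from the LB
    app.get('/health/ready', async (_req, res) => {
      const checks: Record<string, 'healthy' | 'unhealthy'> = {
        database: await db
          .query('SELECT 1')
          .then(() => 'healthy' as const, () => 'unhealthy' as const),
      };
      const ok = Object.values(checks).every((s) => s === 'healthy');
      res.status(ok ? 200 : 503).json({ status: ok ? 'healthy' : 'unhealthy', checks });
    });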

Health Check Coverage

  • Database connectivity checked
  • Cache connectivity checked (Redis, Memcached)
  • External APIs checked (optional, can cause false positives)
  • Disk space checked
  • Critical dependencies checked

Monitoring Health Checks

  • Uptime monitoring service (Pingdom, UptimeRobot, Datadog Synthetics)
  • Check frequency appropriate (every 1-5 minutes)
  • Alerting on failed health checks
  • Geographic monitoring (check from multiple regions)

Error Tracking

Error Capture

  • Error tracking service integrated:

    • Sentry
    • Datadog Error Tracking
    • Rollbar
    • Custom solution
  • Unhandled exceptions captured automatically

  • Handled errors reported when appropriate

  • Error context included:

    • User ID, tenant ID
    • Request ID, trace ID
    • User actions (breadcrumbs)
    • Environment variables (non-sensitive)
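
A sketch of wiring the context above into Sentry's Node SDK (the environment variables and helper are illustrative):

    import * as Sentry from '@sentry/node';

    Sentry.init({
      dsn: process.env.SENTRY_DSN,
      environment: process.env.NODE_ENV,
      release: process.env.GIT_SHA, // associates errors with a deployment
    });

    // Attach per-request context so every captured error carries it
    export function reportError(error: unknown, userId: string, tenantId: string) {
      Sentry.setUser({ id: userId });
      Sentry.setTag('tenant_id', tenantId);
      Sentry.captureException(error);
    }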

Error Grouping

  • Errors grouped by fingerprint (same error, different occurrences)
  • Error rate tracked per group
  • Alerting on new error types or spike in existing
  • Error assignment to team members
  • Resolution tracking (mark errors as resolved)

Privacy & Security

  • PII redacted from error reports:

    • Passwords, tokens, API keys
    • Credit card numbers
    • Email addresses (unless necessary)
    • SSNs, tax IDs
  • Source maps uploaded for frontend (de-minify stack traces)

  • Release tagging (associate errors with deployments)

Performance Monitoring

Real User Monitoring (RUM)

  • RUM tool integrated (Datadog RUM, New Relic Browser, Google Analytics):

    • Page load times
    • Core Web Vitals (LCP, INP, CLS; INP replaced FID in 2024)
    • JavaScript errors
    • User sessions
  • Performance budgets defined:

    • First Contentful Paint < 1.8s
    • Largest Contentful Paint < 2.5s
    • Time to Interactive < 3.8s
    • Cumulative Layout Shift < 0.1
  • Alerting on performance regressions
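
A sketch of collecting Core Web Vitals in the browser with the open-source web-vitals package (the /rum endpoint is hypothetical):

    import { onCLS, onINP, onLCP, type Metric } from 'web-vitals';

    // Ship each final metric value to a RUM ingestion endpoint
    function report(metric: Metric) {
      navigator.sendBeacon(
        '/rum',
        JSON.stringify({ name: metric.name, value: metric.value }),
      );
    }

    onCLS(report);
    onINP(report);
    onLCP(report);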

Application Performance Monitoring (APM)

  • APM tool integrated (Datadog APM, New Relic APM):

    • Trace every request
    • Identify slow endpoints
    • Database query analysis
    • External API profiling
  • Performance profiling for critical paths:

    • Authentication flow
    • Payment processing
    • Data exports
    • Complex queries

Cost Management

  • Observability costs tracked:

    • Log ingestion costs
    • Metric cardinality costs
    • Trace sampling costs
    • Dashboard/seat costs
  • Cost optimization:

    • Log sampling for high-volume services
    • Metric aggregation (reduce cardinality)
    • Trace sampling (not 100% in production)
    • Data retention policies
  • Budget alerts configured

Security & Compliance

  • Access control on observability tools (role-based)
  • Audit logging for observability access
  • Data retention complies with regulations (GDPR, HIPAA)
  • Data encryption in transit and at rest
  • PII handling compliant (redaction, anonymization)

Testing Observability

  • Log output tested in unit tests:

    test('logs user login', async () => {
      const logs = captureLogs();   // hypothetical helper that buffers log output
      await loginUser();            // hypothetical action under test
      expect(logs).toContainEqual(
        expect.objectContaining({
          level: 'info',
          message: 'User logged in',
          user_id: expect.any(String)
        })
      );
    });
    
  • Metrics incremented in tests (see the sketch after this list)

  • Traces created in integration tests

  • Health checks tested

  • Alert thresholds tested (inject failures, verify alert fires)
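
For the metrics item, a sketch of asserting against prom-client's registry in a Jest-style test (handleRequest is a hypothetical code path that increments the counter):

    import { register } from 'prom-client';

    test('increments the request counter', async () => {
      await handleRequest(); // hypothetical handler under test

      const metric = register.getSingleMetric('myapp_http_requests_total');
      const { values } = await metric!.get();
      expect(values.some((v) => v.value > 0)).toBe(true);
    });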

Documentation

  • Observability runbook created:

    • How to access logs, metrics, traces
    • How to create dashboards
    • How to set up alerts
    • Common troubleshooting queries
  • Alert runbooks for each alert

  • Dashboard documentation (what each panel shows)

  • Metric dictionary (what each metric means)

  • On-call procedures documented

Scoring

  • 85+ items checked: Excellent - Production-grade observability
  • 65-84 items: Good - Most observability covered ⚠️
  • 45-64 items: Fair - Significant gaps exist 🔴
  • <45 items: Poor - Not ready for production

Priority Items

Address these first:

  1. Structured logging - Foundation for debugging
  2. Error tracking - Catch and fix bugs quickly
  3. Health checks - Know when service is down
  4. Alerting - Get notified of issues
  5. Key metrics - Request rate, error rate, latency

Common Pitfalls

Don't:

  • Log sensitive data (passwords, tokens, PII)
  • Create high-cardinality metrics (user_id as label)
  • Trace 100% of requests in production (sample instead)
  • Alert on every anomaly (alert fatigue)
  • Ignore observability until there's a problem

Do:

  • Log at appropriate levels (use DEBUG for verbose)
  • Use correlation IDs throughout request lifecycle
  • Set up alerts with clear runbooks
  • Review dashboards regularly (detect issues early)
  • Iterate on observability (improve over time)

Total Items: 140+ observability checks
Critical Items: Logging, Metrics, Alerting, Health checks
Coverage: Logs, Metrics, Traces, Alerts, Dashboards
Last Updated: 2025-11-10