# Observability Engineering Setup Checklist Comprehensive checklist for implementing production-grade observability with logs, metrics, traces, and alerts. ## Pre-Implementation Planning - [ ] **Define observability goals** (debug production issues, monitor SLAs, detect anomalies) - [ ] **Choose observability stack**: - [ ] Logging: Pino (Node.js), structlog (Python), CloudWatch, Datadog - [ ] Metrics: Prometheus, Datadog, CloudWatch - [ ] Tracing: OpenTelemetry, Datadog APM, Jaeger - [ ] Visualization: Grafana, Datadog, Honeycomb - [ ] **Set up observability infrastructure** (collectors, storage, dashboards) - [ ] **Define data retention** policies (logs: 30 days, metrics: 1 year, traces: 7 days) - [ ] **Plan for scale** (log volume, metric cardinality, trace sampling) ## Structured Logging ### Logger Configuration - [ ] **Structured logger installed**: - Node.js: `pino` with `pino-pretty` for dev - Python: `structlog` with JSON formatter - Browser: Custom JSON logger or service integration - [ ] **Log levels defined**: - [ ] TRACE: Very detailed debugging - [ ] DEBUG: Detailed debugging info - [ ] INFO: General informational messages - [ ] WARN: Warning messages (recoverable issues) - [ ] ERROR: Error messages (failures) - [ ] FATAL: Critical failures (application crash) - [ ] **Environment-based configuration**: - [ ] Development: Pretty-printed logs, DEBUG level - [ ] Production: JSON logs, INFO level - [ ] Test: Silent or minimal logs ### Log Structure - [ ] **Standard log format** across services: ```json { "level": "info", "timestamp": "2025-11-10T10:30:00.000Z", "service": "api-server", "environment": "production", "tenant_id": "uuid", "user_id": "uuid", "request_id": "uuid", "message": "User logged in", "duration_ms": 150, "http": { "method": "POST", "path": "/api/login", "status": 200, "user_agent": "Mozilla/5.0..." } } ``` - [ ] **Correlation IDs** added: - [ ] request_id: Unique per request - [ ] session_id: Unique per session - [ ] trace_id: Unique per distributed trace - [ ] tenant_id: Multi-tenant context - [ ] **Context propagation** through request lifecycle - [ ] **Sensitive data redacted** (passwords, tokens, credit cards) ### What to Log - [ ] **Request/response logging**: - [ ] HTTP method, path, status code - [ ] Request duration - [ ] User agent, IP address (hashed or anonymized) - [ ] Query parameters (non-sensitive) - [ ] **Authentication events**: - [ ] Login success/failure - [ ] Logout - [ ] Token refresh - [ ] Permission checks - [ ] **Business events**: - [ ] User registration - [ ] Payment processing - [ ] Data exports - [ ] Admin actions - [ ] **Errors and exceptions**: - [ ] Error message - [ ] Stack trace - [ ] Error context (what user was doing) - [ ] Affected resources (user_id, tenant_id, entity_id) - [ ] **Performance metrics**: - [ ] Database query times - [ ] External API call times - [ ] Cache hit/miss rates - [ ] Background job durations ### Log Aggregation - [ ] **Logs shipped** to central location: - [ ] CloudWatch Logs - [ ] Datadog Logs - [ ] Elasticsearch (ELK stack) - [ ] Splunk - [ ] **Log retention** configured (30-90 days typical) - [ ] **Log volume** monitored (cost management) - [ ] **Log sampling** for high-volume services (if needed) ## Application Metrics ### Metric Types - [ ] **Counters** for events that only increase: - [ ] Total requests - [ ] Total errors - [ ] Total registrations - [ ] **Gauges** for values that go up and down: - [ ] Active connections - [ ] Memory usage - [ ] Queue depth - [ ] **Histograms** for distributions: - [ ] Request duration - [ ] Response size - [ ] Database query time - [ ] **Summaries** for quantiles (p50, p95, p99) ### Standard Metrics #### HTTP Metrics - [ ] **http_requests_total** (counter): - Labels: method, path, status, tenant_id - Track total requests per endpoint - [ ] **http_request_duration_seconds** (histogram): - Labels: method, path, status - Buckets: 0.1, 0.5, 1, 2, 5, 10 seconds - [ ] **http_request_size_bytes** (histogram) - [ ] **http_response_size_bytes** (histogram) #### Database Metrics - [ ] **db_queries_total** (counter): - Labels: operation (SELECT, INSERT, UPDATE, DELETE), table - [ ] **db_query_duration_seconds** (histogram): - Labels: operation, table - Track slow queries (p95, p99) - [ ] **db_connection_pool_size** (gauge) - [ ] **db_connection_pool_available** (gauge) #### Application Metrics - [ ] **active_sessions** (gauge) - [ ] **background_jobs_total** (counter): - Labels: job_name, status (success, failure) - [ ] **background_job_duration_seconds** (histogram): - Labels: job_name - [ ] **cache_operations_total** (counter): - Labels: operation (hit, miss, set, delete) - [ ] **external_api_calls_total** (counter): - Labels: service, status - [ ] **external_api_duration_seconds** (histogram): - Labels: service #### System Metrics - [ ] **process_cpu_usage_percent** (gauge) - [ ] **process_memory_usage_bytes** (gauge) - [ ] **process_heap_usage_bytes** (gauge) - JavaScript specific - [ ] **process_open_file_descriptors** (gauge) ### Metric Collection - [ ] **Prometheus client library** installed: - Node.js: `prom-client` - Python: `prometheus-client` - Custom: OpenTelemetry SDK - [ ] **Metrics endpoint** exposed (`/metrics`) - [ ] **Prometheus scrapes** endpoint (or push to gateway) - [ ] **Metric naming** follows conventions: - Lowercase with underscores - Unit suffixes (_seconds, _bytes, _total) - Namespace prefix (myapp_http_requests_total) ### Multi-Tenant Metrics - [ ] **tenant_id label** on all relevant metrics - [ ] **Per-tenant dashboards** (filter by tenant_id) - [ ] **Tenant resource usage** tracked: - [ ] API calls per tenant - [ ] Database storage per tenant - [ ] Data transfer per tenant - [ ] **Tenant quotas** monitored (alert on approaching limit) ## Distributed Tracing ### Tracing Setup - [ ] **OpenTelemetry SDK** installed: - Node.js: `@opentelemetry/sdk-node` - Python: `opentelemetry-sdk` - [ ] **Tracing backend** configured: - [ ] Jaeger (self-hosted) - [ ] Datadog APM - [ ] Honeycomb - [ ] AWS X-Ray - [ ] **Auto-instrumentation** enabled: - [ ] HTTP client/server - [ ] Database queries - [ ] Redis operations - [ ] Message queues ### Span Creation - [ ] **Custom spans** for business logic: ```typescript const span = tracer.startSpan('process-payment'); span.setAttribute('tenant_id', tenantId); span.setAttribute('amount', amount); try { await processPayment(); span.setStatus({ code: SpanStatusCode.OK }); } catch (error) { span.recordException(error); span.setStatus({ code: SpanStatusCode.ERROR }); throw error; } finally { span.end(); } ``` - [ ] **Span attributes** include context: - [ ] tenant_id, user_id, request_id - [ ] Input parameters (non-sensitive) - [ ] Result status - [ ] **Span events** for key moments: - [ ] "Payment started" - [ ] "Database query executed" - [ ] "External API called" ### Trace Context Propagation - [ ] **W3C Trace Context** headers propagated: - traceparent: trace-id, parent-span-id, flags - tracestate: vendor-specific data - [ ] **Context propagated** across: - [ ] HTTP requests (frontend ↔ backend) - [ ] Background jobs - [ ] Message queues - [ ] Microservices - [ ] **Trace ID** included in logs (correlate logs + traces) ### Sampling - [ ] **Sampling strategy** defined: - [ ] Head-based: Sample at trace start (1%, 10%, 100%) - [ ] Tail-based: Sample after trace completes (error traces, slow traces) - [ ] Adaptive: Sample based on load - [ ] **Always sample** errors and slow requests - [ ] **Sample rate** appropriate for volume (start high, reduce if needed) ## Alerting ### Alert Definitions - [ ] **Error rate alerts**: - [ ] Condition: Error rate > 5% for 5 minutes - [ ] Severity: Critical - [ ] Action: Page on-call engineer - [ ] **Latency alerts**: - [ ] Condition: p95 latency > 1s for 10 minutes - [ ] Severity: Warning - [ ] Action: Slack notification - [ ] **Availability alerts**: - [ ] Condition: Health check fails 3 consecutive times - [ ] Severity: Critical - [ ] Action: Page on-call + auto-restart - [ ] **Resource alerts**: - [ ] Memory usage > 80% - [ ] CPU usage > 80% - [ ] Disk usage > 85% - [ ] Database connections > 90% of pool - [ ] **Business metric alerts**: - [ ] Registration rate drops > 50% - [ ] Payment failures increase > 10% - [ ] Active users drop significantly ### Alert Channels - [ ] **PagerDuty** (or equivalent) for critical alerts - [ ] **Slack** for warnings and notifications - [ ] **Email** for non-urgent alerts - [ ] **SMS** for highest priority (only use sparingly) ### Alert Management - [ ] **Alert fatigue** prevented: - [ ] Appropriate thresholds (not too sensitive) - [ ] Proper severity levels (not everything is critical) - [ ] Alert aggregation (deduplicate similar alerts) - [ ] **Runbooks** for each alert: - [ ] What the alert means - [ ] How to investigate - [ ] How to resolve - [ ] Escalation path - [ ] **Alert suppression** during deployments (planned downtime) - [ ] **Alert escalation** if not acknowledged ## Dashboards & Visualization ### Standard Dashboards - [ ] **Service Overview** dashboard: - [ ] Request rate (requests/sec) - [ ] Error rate (errors/sec, %) - [ ] Latency (p50, p95, p99) - [ ] Availability (uptime %) - [ ] **Database** dashboard: - [ ] Query rate - [ ] Slow queries (p95, p99) - [ ] Connection pool usage - [ ] Table sizes - [ ] **System Resources** dashboard: - [ ] CPU usage - [ ] Memory usage - [ ] Disk I/O - [ ] Network I/O - [ ] **Business Metrics** dashboard: - [ ] Active users - [ ] Registrations - [ ] Revenue - [ ] Feature usage ### Dashboard Best Practices - [ ] **Auto-refresh** enabled (every 30-60 seconds) - [ ] **Time range** configurable (last hour, 24h, 7 days) - [ ] **Drill-down** to detailed views - [ ] **Annotations** for deployments/incidents - [ ] **Shared dashboards** accessible to team ### Per-Tenant Dashboards - [ ] **Tenant filter** on all relevant dashboards - [ ] **Tenant resource usage** visualized - [ ] **Tenant-specific alerts** (if large customer) - [ ] **Tenant comparison** view (compare usage across tenants) ## Health Checks ### Endpoint Implementation - [ ] **Health check endpoint** (`/health` or `/healthz`): - [ ] Returns 200 OK when healthy - [ ] Returns 503 Service Unavailable when unhealthy - [ ] Includes subsystem status ```json { "status": "healthy", "version": "1.2.3", "uptime_seconds": 86400, "checks": { "database": "healthy", "redis": "healthy", "external_api": "degraded" } } ``` - [ ] **Liveness probe** (`/health/live`): - [ ] Checks if application is running - [ ] Fails → restart container - [ ] **Readiness probe** (`/health/ready`): - [ ] Checks if application is ready to serve traffic - [ ] Fails → remove from load balancer ### Health Check Coverage - [ ] **Database connectivity** checked - [ ] **Cache connectivity** checked (Redis, Memcached) - [ ] **External APIs** checked (optional, can cause false positives) - [ ] **Disk space** checked - [ ] **Critical dependencies** checked ### Monitoring Health Checks - [ ] **Uptime monitoring** service (Pingdom, UptimeRobot, Datadog Synthetics) - [ ] **Check frequency** appropriate (every 1-5 minutes) - [ ] **Alerting** on failed health checks - [ ] **Geographic monitoring** (check from multiple regions) ## Error Tracking ### Error Capture - [ ] **Error tracking service** integrated: - [ ] Sentry - [ ] Datadog Error Tracking - [ ] Rollbar - [ ] Custom solution - [ ] **Unhandled exceptions** captured automatically - [ ] **Handled errors** reported when appropriate - [ ] **Error context** included: - [ ] User ID, tenant ID - [ ] Request ID, trace ID - [ ] User actions (breadcrumbs) - [ ] Environment variables (non-sensitive) ### Error Grouping - [ ] **Errors grouped** by fingerprint (same error, different occurrences) - [ ] **Error rate** tracked per group - [ ] **Alerting** on new error types or spike in existing - [ ] **Error assignment** to team members - [ ] **Resolution tracking** (mark errors as resolved) ### Privacy & Security - [ ] **PII redacted** from error reports: - [ ] Passwords, tokens, API keys - [ ] Credit card numbers - [ ] Email addresses (unless necessary) - [ ] SSNs, tax IDs - [ ] **Source maps** uploaded for frontend (de-minify stack traces) - [ ] **Release tagging** (associate errors with deployments) ## Performance Monitoring ### Real User Monitoring (RUM) - [ ] **RUM tool integrated** (Datadog RUM, New Relic Browser, Google Analytics): - [ ] Page load times - [ ] Core Web Vitals (LCP, FID, CLS) - [ ] JavaScript errors - [ ] User sessions - [ ] **Performance budgets** defined: - [ ] First Contentful Paint < 1.8s - [ ] Largest Contentful Paint < 2.5s - [ ] Time to Interactive < 3.8s - [ ] Cumulative Layout Shift < 0.1 - [ ] **Alerting** on performance regressions ### Application Performance Monitoring (APM) - [ ] **APM tool** integrated (Datadog APM, New Relic APM): - [ ] Trace every request - [ ] Identify slow endpoints - [ ] Database query analysis - [ ] External API profiling - [ ] **Performance profiling** for critical paths: - [ ] Authentication flow - [ ] Payment processing - [ ] Data exports - [ ] Complex queries ## Cost Management - [ ] **Observability costs** tracked: - [ ] Log ingestion costs - [ ] Metric cardinality costs - [ ] Trace sampling costs - [ ] Dashboard/seat costs - [ ] **Cost optimization**: - [ ] Log sampling for high-volume services - [ ] Metric aggregation (reduce cardinality) - [ ] Trace sampling (not 100% in production) - [ ] Data retention policies - [ ] **Budget alerts** configured ## Security & Compliance - [ ] **Access control** on observability tools (role-based) - [ ] **Audit logging** for observability access - [ ] **Data retention** complies with regulations (GDPR, HIPAA) - [ ] **Data encryption** in transit and at rest - [ ] **PII handling** compliant (redaction, anonymization) ## Testing Observability - [ ] **Log output** tested in unit tests: ```typescript test('logs user login', () => { const logs = captureLogs(); await loginUser(); expect(logs).toContainEqual( expect.objectContaining({ level: 'info', message: 'User logged in', user_id: expect.any(String) }) ); }); ``` - [ ] **Metrics** incremented in tests - [ ] **Traces** created in integration tests - [ ] **Health checks** tested - [ ] **Alert thresholds** tested (inject failures, verify alert fires) ## Documentation - [ ] **Observability runbook** created: - [ ] How to access logs, metrics, traces - [ ] How to create dashboards - [ ] How to set up alerts - [ ] Common troubleshooting queries - [ ] **Alert runbooks** for each alert - [ ] **Dashboard documentation** (what each panel shows) - [ ] **Metric dictionary** (what each metric means) - [ ] **On-call procedures** documented ## Scoring - **85+ items checked**: Excellent - Production-grade observability ✅ - **65-84 items**: Good - Most observability covered ⚠️ - **45-64 items**: Fair - Significant gaps exist 🔴 - **<45 items**: Poor - Not ready for production ❌ ## Priority Items Address these first: 1. **Structured logging** - Foundation for debugging 2. **Error tracking** - Catch and fix bugs quickly 3. **Health checks** - Know when service is down 4. **Alerting** - Get notified of issues 5. **Key metrics** - Request rate, error rate, latency ## Common Pitfalls ❌ **Don't:** - Log sensitive data (passwords, tokens, PII) - Create high-cardinality metrics (user_id as label) - Trace 100% of requests in production (sample instead) - Alert on every anomaly (alert fatigue) - Ignore observability until there's a problem ✅ **Do:** - Log at appropriate levels (use DEBUG for verbose) - Use correlation IDs throughout request lifecycle - Set up alerts with clear runbooks - Review dashboards regularly (detect issues early) - Iterate on observability (improve over time) ## Related Resources - [OpenTelemetry Documentation](https://opentelemetry.io/docs/) - [Pino Logger](https://getpino.io) - [Prometheus Best Practices](https://prometheus.io/docs/practices/) - [observability-engineering skill](../SKILL.md) --- **Total Items**: 140+ observability checks **Critical Items**: Logging, Metrics, Alerting, Health checks **Coverage**: Logs, Metrics, Traces, Alerts, Dashboards **Last Updated**: 2025-11-10