# Observability Engineering Setup Checklist

Comprehensive checklist for implementing production-grade observability with logs, metrics, traces, and alerts.

## Pre-Implementation Planning
- [ ] Define observability goals (debug production issues, monitor SLAs, detect anomalies)
- [ ] Choose observability stack:
  - Logging: Pino (Node.js), structlog (Python), CloudWatch, Datadog
  - Metrics: Prometheus, Datadog, CloudWatch
  - Tracing: OpenTelemetry, Datadog APM, Jaeger
  - Visualization: Grafana, Datadog, Honeycomb
- [ ] Set up observability infrastructure (collectors, storage, dashboards)
- [ ] Define data retention policies (logs: 30 days, metrics: 1 year, traces: 7 days)
- [ ] Plan for scale (log volume, metric cardinality, trace sampling)
## Structured Logging

### Logger Configuration
- [ ] Structured logger installed:
  - Node.js: `pino` with `pino-pretty` for dev
  - Python: `structlog` with a JSON formatter
  - Browser: custom JSON logger or service integration
- [ ] Log levels defined:
  - TRACE: very detailed debugging
  - DEBUG: detailed debugging info
  - INFO: general informational messages
  - WARN: warning messages (recoverable issues)
  - ERROR: error messages (failures)
  - FATAL: critical failures (application crash)
- [ ] Environment-based configuration (see the Pino sketch after this list):
  - Development: pretty-printed logs, DEBUG level
  - Production: JSON logs, INFO level
  - Test: silent or minimal logs
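A minimal Pino configuration along these lines covers the environment split above; a sketch, assuming Pino v7+ (the service name and environment variables are illustrative):

```js
// logger.js - sketch: environment-aware Pino setup with redaction
const pino = require('pino');

const isDev = process.env.NODE_ENV !== 'production';

const logger = pino({
  level: process.env.LOG_LEVEL || (isDev ? 'debug' : 'info'),
  // Redact sensitive fields before they reach the log stream
  redact: ['password', 'token', '*.authorization'],
  // Pretty-print in development only; production emits raw JSON
  transport: isDev ? { target: 'pino-pretty' } : undefined,
  base: { service: 'api-server', environment: process.env.NODE_ENV },
});

module.exports = logger;
```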
### Log Structure

- [ ] Standard log format across services:

  ```json
  {
    "level": "info",
    "timestamp": "2025-11-10T10:30:00.000Z",
    "service": "api-server",
    "environment": "production",
    "tenant_id": "uuid",
    "user_id": "uuid",
    "request_id": "uuid",
    "message": "User logged in",
    "duration_ms": 150,
    "http": {
      "method": "POST",
      "path": "/api/login",
      "status": 200,
      "user_agent": "Mozilla/5.0..."
    }
  }
  ```

- [ ] Correlation IDs added:
  - request_id: unique per request
  - session_id: unique per session
  - trace_id: unique per distributed trace
  - tenant_id: multi-tenant context
- [ ] Context propagation through the request lifecycle (see the sketch below)
- [ ] Sensitive data redacted (passwords, tokens, credit cards)
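One way to propagate a correlation ID through the request lifecycle without threading it through every function is Node's built-in `AsyncLocalStorage`; a minimal sketch (the middleware shape and the `x-request-id` header fallback are assumptions):

```js
// request-context.js - sketch: propagate request_id via AsyncLocalStorage
const { AsyncLocalStorage } = require('node:async_hooks');
const { randomUUID } = require('node:crypto');

const als = new AsyncLocalStorage();

// Express-style middleware: everything downstream of this call,
// including async work, can read the same request_id
function requestContext(req, res, next) {
  const requestId = req.headers['x-request-id'] || randomUUID();
  als.run({ request_id: requestId }, next);
}

// Call from the logger (e.g. a Pino mixin) to stamp every log line
function currentRequestId() {
  return als.getStore()?.request_id;
}

module.exports = { requestContext, currentRequestId };
```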
### What to Log

- [ ] Request/response logging:
  - HTTP method, path, status code
  - Request duration
  - User agent, IP address (hashed or anonymized)
  - Query parameters (non-sensitive)
- [ ] Authentication events:
  - Login success/failure
  - Logout
  - Token refresh
  - Permission checks
- [ ] Business events:
  - User registration
  - Payment processing
  - Data exports
  - Admin actions
- [ ] Errors and exceptions:
  - Error message
  - Stack trace
  - Error context (what the user was doing)
  - Affected resources (user_id, tenant_id, entity_id)
- [ ] Performance metrics:
  - Database query times
  - External API call times
  - Cache hit/miss rates
  - Background job durations
### Log Aggregation

- [ ] Logs shipped to a central location:
  - CloudWatch Logs
  - Datadog Logs
  - Elasticsearch (ELK stack)
  - Splunk
- [ ] Log retention configured (30-90 days typical)
- [ ] Log volume monitored (cost management)
- [ ] Log sampling for high-volume services (if needed)
## Application Metrics

### Metric Types

The first three types are shown in the `prom-client` sketch after this list.

- [ ] Counters for events that only increase:
  - Total requests
  - Total errors
  - Total registrations
- [ ] Gauges for values that go up and down:
  - Active connections
  - Memory usage
  - Queue depth
- [ ] Histograms for distributions:
  - Request duration
  - Response size
  - Database query time
- [ ] Summaries for quantiles (p50, p95, p99)
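A short `prom-client` sketch covering counters, gauges, and histograms (the metric names and label sets are illustrative):

```js
// metrics.js - sketch: counter, gauge, and histogram with prom-client
const client = require('prom-client');

const httpRequestsTotal = new client.Counter({
  name: 'myapp_http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status'],
});

const activeSessions = new client.Gauge({
  name: 'myapp_active_sessions',
  help: 'Currently active sessions',
});

const httpRequestDuration = new client.Histogram({
  name: 'myapp_http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path', 'status'],
  buckets: [0.1, 0.5, 1, 2, 5, 10], // matches the buckets suggested below
});

// Usage inside a request handler:
httpRequestsTotal.inc({ method: 'POST', path: '/api/login', status: '200' });
httpRequestDuration.observe({ method: 'POST', path: '/api/login', status: '200' }, 0.15);
activeSessions.set(42);
```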
### Standard Metrics

#### HTTP Metrics

- [ ] http_requests_total (counter):
  - Labels: method, path, status, tenant_id
  - Track total requests per endpoint
- [ ] http_request_duration_seconds (histogram):
  - Labels: method, path, status
  - Buckets: 0.1, 0.5, 1, 2, 5, 10 seconds
- [ ] http_request_size_bytes (histogram)
- [ ] http_response_size_bytes (histogram)
#### Database Metrics

- [ ] db_queries_total (counter):
  - Labels: operation (SELECT, INSERT, UPDATE, DELETE), table
- [ ] db_query_duration_seconds (histogram):
  - Labels: operation, table
  - Track slow queries (p95, p99)
- [ ] db_connection_pool_size (gauge)
- [ ] db_connection_pool_available (gauge)
#### Application Metrics

- [ ] active_sessions (gauge)
- [ ] background_jobs_total (counter):
  - Labels: job_name, status (success, failure)
- [ ] background_job_duration_seconds (histogram):
  - Labels: job_name
- [ ] cache_operations_total (counter):
  - Labels: operation (hit, miss, set, delete)
- [ ] external_api_calls_total (counter):
  - Labels: service, status
- [ ] external_api_duration_seconds (histogram):
  - Labels: service
#### System Metrics

- [ ] process_cpu_usage_percent (gauge)
- [ ] process_memory_usage_bytes (gauge)
- [ ] process_heap_usage_bytes (gauge, JavaScript-specific)
- [ ] process_open_file_descriptors (gauge)
### Metric Collection

- [ ] Prometheus client library installed:
  - Node.js: `prom-client`
  - Python: `prometheus-client`
  - Custom: OpenTelemetry SDK
- [ ] Metrics endpoint exposed (`/metrics`); see the sketch after this list
- [ ] Prometheus scrapes the endpoint (or metrics are pushed to a gateway)
- [ ] Metric naming follows conventions:
  - Lowercase with underscores
  - Unit suffixes (_seconds, _bytes, _total)
  - Namespace prefix (myapp_http_requests_total)
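Exposing the endpoint with Express and `prom-client`'s default registry might look like this (a sketch; the port is an assumption):

```js
// metrics-endpoint.js - sketch: expose Prometheus metrics over HTTP
const express = require('express');
const client = require('prom-client');

// Collect default process metrics (CPU, memory, event loop lag, ...)
client.collectDefaultMetrics();

const app = express();

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(9100);
```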
### Multi-Tenant Metrics

- [ ] tenant_id label on all relevant metrics
- [ ] Per-tenant dashboards (filter by tenant_id)
- [ ] Tenant resource usage tracked:
  - API calls per tenant
  - Database storage per tenant
  - Data transfer per tenant
- [ ] Tenant quotas monitored (alert when a tenant approaches its limit)
## Distributed Tracing

### Tracing Setup

- [ ] OpenTelemetry SDK installed:
  - Node.js: `@opentelemetry/sdk-node`
  - Python: `opentelemetry-sdk`
- [ ] Tracing backend configured:
  - Jaeger (self-hosted)
  - Datadog APM
  - Honeycomb
  - AWS X-Ray
- [ ] Auto-instrumentation enabled (see the sketch after this list):
  - HTTP client/server
  - Database queries
  - Redis operations
  - Message queues
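A minimal Node.js setup, assuming an OTLP-compatible backend such as Jaeger or the OpenTelemetry Collector (the service name and endpoint URL are placeholders):

```js
// tracing.js - sketch: OpenTelemetry NodeSDK with auto-instrumentation
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  serviceName: 'api-server',
  traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }),
  // Patches HTTP, Express, common DB drivers, Redis, and more
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

Load this file before the application code (e.g. `node -r ./tracing.js server.js`) so the auto-instrumentation can patch modules before they are imported.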
### Span Creation

- [ ] Custom spans for business logic:

  ```js
  const { trace, SpanStatusCode } = require('@opentelemetry/api');

  const tracer = trace.getTracer('payments');

  const span = tracer.startSpan('process-payment');
  span.setAttribute('tenant_id', tenantId);
  span.setAttribute('amount', amount);
  try {
    await processPayment();
    span.setStatus({ code: SpanStatusCode.OK });
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  } finally {
    span.end();
  }
  ```

- [ ] Span attributes include context:
  - tenant_id, user_id, request_id
  - Input parameters (non-sensitive)
  - Result status
- [ ] Span events for key moments:
  - "Payment started"
  - "Database query executed"
  - "External API called"
### Trace Context Propagation

- [ ] W3C Trace Context headers propagated:
  - traceparent: trace-id, parent-span-id, flags
  - tracestate: vendor-specific data
- [ ] Context propagated across:
  - HTTP requests (frontend ↔ backend)
  - Background jobs
  - Message queues
  - Microservices
- [ ] Trace ID included in logs to correlate logs and traces (see the sketch below)
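Pulling the active trace and span IDs into every log line can be done with a Pino mixin, for example (a sketch; assumes `@opentelemetry/api` is installed and tracing is active):

```js
// Sketch: stamp every Pino log line with the active trace/span IDs
const pino = require('pino');
const { trace } = require('@opentelemetry/api');

const logger = pino({
  // mixin() runs on every log call; its result is merged into the line
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    return { trace_id: traceId, span_id: spanId };
  },
});
```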
### Sampling

- [ ] Sampling strategy defined:
  - Head-based: sample at trace start (1%, 10%, 100%)
  - Tail-based: sample after the trace completes (error traces, slow traces)
  - Adaptive: sample based on load
- [ ] Always sample errors and slow requests
- [ ] Sample rate appropriate for volume (start high, reduce if needed)
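Head-based sampling is configured on the SDK; for example, a parent-based 10% ratio sampler (the rate is illustrative; tail-based sampling typically runs in the OpenTelemetry Collector instead):

```js
// Sketch: head-based sampling at 10%, honoring the parent's decision
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');

const sampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.1),
});

// Pass to the NodeSDK from the tracing setup sketch above:
// const sdk = new NodeSDK({ sampler, /* ...exporter, instrumentations */ });
```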
## Alerting

### Alert Definitions

- [ ] Error rate alerts:
  - Condition: error rate > 5% for 5 minutes
  - Severity: critical
  - Action: page the on-call engineer
- [ ] Latency alerts:
  - Condition: p95 latency > 1s for 10 minutes
  - Severity: warning
  - Action: Slack notification
- [ ] Availability alerts:
  - Condition: health check fails 3 consecutive times
  - Severity: critical
  - Action: page on-call + auto-restart
- [ ] Resource alerts:
  - Memory usage > 80%
  - CPU usage > 80%
  - Disk usage > 85%
  - Database connections > 90% of pool
- [ ] Business metric alerts:
  - Registration rate drops > 50%
  - Payment failures increase > 10%
  - Active users drop significantly
### Alert Channels

- [ ] PagerDuty (or equivalent) for critical alerts
- [ ] Slack for warnings and notifications
- [ ] Email for non-urgent alerts
- [ ] SMS for highest priority (use sparingly)
### Alert Management

- [ ] Alert fatigue prevented:
  - Appropriate thresholds (not too sensitive)
  - Proper severity levels (not everything is critical)
  - Alert aggregation (deduplicate similar alerts)
- [ ] Runbooks for each alert:
  - What the alert means
  - How to investigate
  - How to resolve
  - Escalation path
- [ ] Alert suppression during deployments (planned downtime)
- [ ] Alert escalation if not acknowledged
## Dashboards & Visualization

### Standard Dashboards

- [ ] Service Overview dashboard:
  - Request rate (requests/sec)
  - Error rate (errors/sec, %)
  - Latency (p50, p95, p99)
  - Availability (uptime %)
- [ ] Database dashboard:
  - Query rate
  - Slow queries (p95, p99)
  - Connection pool usage
  - Table sizes
- [ ] System Resources dashboard:
  - CPU usage
  - Memory usage
  - Disk I/O
  - Network I/O
- [ ] Business Metrics dashboard:
  - Active users
  - Registrations
  - Revenue
  - Feature usage
### Dashboard Best Practices

- [ ] Auto-refresh enabled (every 30-60 seconds)
- [ ] Time range configurable (last hour, 24 hours, 7 days)
- [ ] Drill-down to detailed views
- [ ] Annotations for deployments/incidents
- [ ] Shared dashboards accessible to the team
### Per-Tenant Dashboards

- [ ] Tenant filter on all relevant dashboards
- [ ] Tenant resource usage visualized
- [ ] Tenant-specific alerts (for large customers)
- [ ] Tenant comparison view (compare usage across tenants)
## Health Checks

### Endpoint Implementation

- [ ] Health check endpoint (`/health` or `/healthz`):
  - Returns 200 OK when healthy
  - Returns 503 Service Unavailable when unhealthy
  - Includes subsystem status:

  ```json
  {
    "status": "healthy",
    "version": "1.2.3",
    "uptime_seconds": 86400,
    "checks": {
      "database": "healthy",
      "redis": "healthy",
      "external_api": "degraded"
    }
  }
  ```

- [ ] Liveness probe (`/health/live`):
  - Checks if the application is running
  - Fails → restart the container
- [ ] Readiness probe (`/health/ready`):
  - Checks if the application is ready to serve traffic
  - Fails → remove from the load balancer
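An Express sketch of the aggregate endpoint above (the probe functions are hypothetical placeholders for real connectivity checks):

```js
// health.js - sketch: aggregate subsystem checks into one endpoint
const express = require('express');
const app = express();

// Hypothetical probes: each resolves to 'healthy' or throws
async function checkDatabase() { /* e.g. SELECT 1 */ return 'healthy'; }
async function checkRedis() { /* e.g. PING */ return 'healthy'; }

app.get('/health', async (req, res) => {
  const probes = { database: checkDatabase, redis: checkRedis };
  const checks = {};
  let healthy = true;
  for (const [name, probe] of Object.entries(probes)) {
    try {
      checks[name] = await probe();
    } catch {
      checks[name] = 'unhealthy';
      healthy = false;
    }
  }
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'healthy' : 'unhealthy',
    uptime_seconds: Math.floor(process.uptime()),
    checks,
  });
});
```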
### Health Check Coverage

- [ ] Database connectivity checked
- [ ] Cache connectivity checked (Redis, Memcached)
- [ ] External APIs checked (optional; can cause false positives)
- [ ] Disk space checked
- [ ] Critical dependencies checked
### Monitoring Health Checks

- [ ] Uptime monitoring service (Pingdom, UptimeRobot, Datadog Synthetics)
- [ ] Check frequency appropriate (every 1-5 minutes)
- [ ] Alerting on failed health checks
- [ ] Geographic monitoring (check from multiple regions)
## Error Tracking

### Error Capture

- [ ] Error tracking service integrated:
  - Sentry
  - Datadog Error Tracking
  - Rollbar
  - Custom solution
- [ ] Unhandled exceptions captured automatically
- [ ] Handled errors reported when appropriate
- [ ] Error context included:
  - User ID, tenant ID
  - Request ID, trace ID
  - User actions (breadcrumbs)
  - Environment variables (non-sensitive)
### Error Grouping

- [ ] Errors grouped by fingerprint (same error, different occurrences)
- [ ] Error rate tracked per group
- [ ] Alerting on new error types or spikes in existing ones
- [ ] Errors assigned to team members
- [ ] Resolution tracking (mark errors as resolved)
### Privacy & Security

- [ ] PII redacted from error reports:
  - Passwords, tokens, API keys
  - Credit card numbers
  - Email addresses (unless necessary)
  - SSNs, tax IDs
- [ ] Source maps uploaded for frontend (de-minify stack traces)
- [ ] Release tagging (associate errors with deployments); see the Sentry sketch below
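With Sentry, release tagging and a last line of PII scrubbing can both be handled in `Sentry.init` (a sketch; the environment variables and scrubbed fields are illustrative):

```js
// sentry.js - sketch: release tagging plus PII scrubbing via beforeSend
const Sentry = require('@sentry/node');

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  release: process.env.GIT_SHA, // associate errors with a deployment
  beforeSend(event) {
    // Strip sensitive request data before the event is uploaded
    if (event.request) {
      delete event.request.cookies;
      if (event.request.headers) delete event.request.headers.authorization;
    }
    return event;
  },
});
```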
## Performance Monitoring

### Real User Monitoring (RUM)

- [ ] RUM tool integrated (Datadog RUM, New Relic Browser, Google Analytics):
  - Page load times
  - Core Web Vitals (LCP, FID, CLS)
  - JavaScript errors
  - User sessions
- [ ] Performance budgets defined:
  - First Contentful Paint < 1.8s
  - Largest Contentful Paint < 2.5s
  - Time to Interactive < 3.8s
  - Cumulative Layout Shift < 0.1
- [ ] Alerting on performance regressions
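If no vendor RUM snippet is in place, Google's `web-vitals` library can report Core Web Vitals to your own collector; a sketch (the `/rum` endpoint is an assumption, and the exported hooks vary by library version):

```js
// rum.js - sketch: report Core Web Vitals with the web-vitals library
import { onLCP, onCLS, onFID } from 'web-vitals';

function report(metric) {
  // sendBeacon survives page unload more reliably than fetch
  navigator.sendBeacon('/rum', JSON.stringify({
    name: metric.name,   // 'LCP', 'CLS', or 'FID'
    value: metric.value,
    id: metric.id,
  }));
}

onLCP(report);
onCLS(report);
onFID(report);
```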
### Application Performance Monitoring (APM)

- [ ] APM tool integrated (Datadog APM, New Relic APM):
  - Trace every request
  - Identify slow endpoints
  - Database query analysis
  - External API profiling
- [ ] Performance profiling for critical paths:
  - Authentication flow
  - Payment processing
  - Data exports
  - Complex queries
## Cost Management

- [ ] Observability costs tracked:
  - Log ingestion costs
  - Metric cardinality costs
  - Trace sampling costs
  - Dashboard/seat costs
- [ ] Cost optimization:
  - Log sampling for high-volume services
  - Metric aggregation (reduce cardinality)
  - Trace sampling (not 100% in production)
  - Data retention policies
- [ ] Budget alerts configured
## Security & Compliance

- [ ] Access control on observability tools (role-based)
- [ ] Audit logging for observability access
- [ ] Data retention complies with regulations (GDPR, HIPAA)
- [ ] Data encryption in transit and at rest
- [ ] PII handling compliant (redaction, anonymization)
## Testing Observability

- [ ] Log output tested in unit tests:

  ```js
  test('logs user login', async () => {
    const logs = captureLogs();
    await loginUser();
    expect(logs).toContainEqual(
      expect.objectContaining({
        level: 'info',
        message: 'User logged in',
        user_id: expect.any(String),
      })
    );
  });
  ```

- [ ] Metrics incremented in tests
- [ ] Traces created in integration tests
- [ ] Health checks tested
- [ ] Alert thresholds tested (inject failures, verify the alert fires)
## Documentation

- [ ] Observability runbook created:
  - How to access logs, metrics, and traces
  - How to create dashboards
  - How to set up alerts
  - Common troubleshooting queries
- [ ] Alert runbooks for each alert
- [ ] Dashboard documentation (what each panel shows)
- [ ] Metric dictionary (what each metric means)
- [ ] On-call procedures documented
## Scoring

- 85+ items checked: Excellent - production-grade observability ✅
- 65-84 items: Good - most observability covered ⚠️
- 45-64 items: Fair - significant gaps exist 🔴
- <45 items: Poor - not ready for production ❌
## Priority Items

Address these first:

- Structured logging - the foundation for debugging
- Error tracking - catch and fix bugs quickly
- Health checks - know when the service is down
- Alerting - get notified of issues
- Key metrics - request rate, error rate, latency
## Common Pitfalls

❌ Don't:

- Log sensitive data (passwords, tokens, PII)
- Create high-cardinality metrics (e.g., user_id as a label)
- Trace 100% of requests in production (sample instead)
- Alert on every anomaly (alert fatigue)
- Ignore observability until there's a problem

✅ Do:

- Log at appropriate levels (use DEBUG for verbose detail)
- Use correlation IDs throughout the request lifecycle
- Set up alerts with clear runbooks
- Review dashboards regularly (detect issues early)
- Iterate on observability (improve over time)
- Total Items: 140+ observability checks
- Critical Items: Logging, Metrics, Alerting, Health checks
- Coverage: Logs, Metrics, Traces, Alerts, Dashboards
- Last Updated: 2025-11-10