# Observability Engineering Setup Checklist
Comprehensive checklist for implementing production-grade observability with logs, metrics, traces, and alerts.
## Pre-Implementation Planning
- [ ] **Define observability goals** (debug production issues, monitor SLAs, detect anomalies)
- [ ] **Choose observability stack**:
- [ ] Logging: Pino (Node.js), structlog (Python), CloudWatch, Datadog
- [ ] Metrics: Prometheus, Datadog, CloudWatch
- [ ] Tracing: OpenTelemetry, Datadog APM, Jaeger
- [ ] Visualization: Grafana, Datadog, Honeycomb
- [ ] **Set up observability infrastructure** (collectors, storage, dashboards)
- [ ] **Define data retention** policies (e.g., logs: 30 days, metrics: 1 year, traces: 7 days)
- [ ] **Plan for scale** (log volume, metric cardinality, trace sampling)
## Structured Logging
### Logger Configuration
- [ ] **Structured logger installed**:
- Node.js: `pino` with `pino-pretty` for dev
- Python: `structlog` with JSON formatter
- Browser: Custom JSON logger or service integration
- [ ] **Log levels defined**:
- [ ] TRACE: Very detailed debugging
- [ ] DEBUG: Detailed debugging info
- [ ] INFO: General informational messages
- [ ] WARN: Warning messages (recoverable issues)
- [ ] ERROR: Error messages (failures)
- [ ] FATAL: Critical failures (application crash)
- [ ] **Environment-based configuration** (see the sketch after this list):
- [ ] Development: Pretty-printed logs, DEBUG level
- [ ] Production: JSON logs, INFO level
- [ ] Test: Silent or minimal logs
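A minimal sketch of environment-based Pino configuration; the redact paths and `LOG_LEVEL` variable are illustrative choices, not requirements:
```typescript
import pino from 'pino';

const isDev = process.env.NODE_ENV !== 'production';

export const logger = pino({
  // DEBUG and pretty-printing in development, JSON at INFO in production
  level: process.env.LOG_LEVEL ?? (isDev ? 'debug' : 'info'),
  transport: isDev ? { target: 'pino-pretty' } : undefined,
  // drop sensitive fields before they are serialized
  redact: ['password', 'token', 'req.headers.authorization'],
});
```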
### Log Structure
- [ ] **Standard log format** across services:
```json
{
  "level": "info",
  "timestamp": "2025-11-10T10:30:00.000Z",
  "service": "api-server",
  "environment": "production",
  "tenant_id": "uuid",
  "user_id": "uuid",
  "request_id": "uuid",
  "message": "User logged in",
  "duration_ms": 150,
  "http": {
    "method": "POST",
    "path": "/api/login",
    "status": 200,
    "user_agent": "Mozilla/5.0..."
  }
}
```
- [ ] **Correlation IDs** added:
- [ ] request_id: Unique per request
- [ ] session_id: Unique per session
- [ ] trace_id: Unique per distributed trace
- [ ] tenant_id: Multi-tenant context
- [ ] **Context propagation** through the request lifecycle (see the middleware sketch after this list)
- [ ] **Sensitive data redacted** (passwords, tokens, credit cards)
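One way to propagate correlation IDs, sketched with Node's `AsyncLocalStorage` and an Express middleware; the `x-request-id` header and the `log()` helper are assumptions to adapt:
```typescript
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';
import type { Request, Response, NextFunction } from 'express';
import pino from 'pino';

const baseLogger = pino();
const als = new AsyncLocalStorage<pino.Logger>();

// Bind a child logger carrying request_id for the rest of the request lifecycle.
export function requestContext(req: Request, _res: Response, next: NextFunction) {
  const requestId = (req.headers['x-request-id'] as string | undefined) ?? randomUUID();
  als.run(baseLogger.child({ request_id: requestId }), next);
}

// Anywhere downstream: log() returns the request-scoped logger when one exists.
export const log = () => als.getStore() ?? baseLogger;
```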
### What to Log
- [ ] **Request/response logging**:
- [ ] HTTP method, path, status code
- [ ] Request duration
- [ ] User agent, IP address (hashed or anonymized)
- [ ] Query parameters (non-sensitive)
- [ ] **Authentication events**:
- [ ] Login success/failure
- [ ] Logout
- [ ] Token refresh
- [ ] Permission checks
- [ ] **Business events**:
- [ ] User registration
- [ ] Payment processing
- [ ] Data exports
- [ ] Admin actions
- [ ] **Errors and exceptions**:
- [ ] Error message
- [ ] Stack trace
- [ ] Error context (what user was doing)
- [ ] Affected resources (user_id, tenant_id, entity_id)
- [ ] **Performance metrics** (see the timing sketch after this list):
- [ ] Database query times
- [ ] External API call times
- [ ] Cache hit/miss rates
- [ ] Background job durations
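A small wrapper for the timing items above, assuming Pino; `timedQuery` is a hypothetical helper name:
```typescript
import pino from 'pino';

const logger = pino();

// Run an async operation and log its duration as a structured field.
// usage sketch: await timedQuery('users.findById', () => usersRepo.findById(id));
async function timedQuery<T>(name: string, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    logger.info({ query: name, duration_ms: Date.now() - start }, 'db query completed');
  }
}
```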
### Log Aggregation
- [ ] **Logs shipped** to central location:
- [ ] CloudWatch Logs
- [ ] Datadog Logs
- [ ] Elasticsearch (ELK stack)
- [ ] Splunk
- [ ] **Log retention** configured (30-90 days typical)
- [ ] **Log volume** monitored (cost management)
- [ ] **Log sampling** for high-volume services (if needed)
## Application Metrics
### Metric Types
- [ ] **Counters** for events that only increase:
- [ ] Total requests
- [ ] Total errors
- [ ] Total registrations
- [ ] **Gauges** for values that go up and down:
- [ ] Active connections
- [ ] Memory usage
- [ ] Queue depth
- [ ] **Histograms** for distributions:
- [ ] Request duration
- [ ] Response size
- [ ] Database query time
- [ ] **Summaries** for quantiles (p50, p95, p99)
### Standard Metrics
#### HTTP Metrics
- [ ] **http_requests_total** (counter):
- Labels: method, path (route template, not raw URL, to bound cardinality), status, tenant_id
- Track total requests per endpoint
- [ ] **http_request_duration_seconds** (histogram):
- Labels: method, path, status
- Buckets: 0.1, 0.5, 1, 2, 5, 10 seconds
- [ ] **http_request_size_bytes** (histogram)
- [ ] **http_response_size_bytes** (histogram)
#### Database Metrics
- [ ] **db_queries_total** (counter):
- Labels: operation (SELECT, INSERT, UPDATE, DELETE), table
- [ ] **db_query_duration_seconds** (histogram):
- Labels: operation, table
- Track slow queries (p95, p99)
- [ ] **db_connection_pool_size** (gauge)
- [ ] **db_connection_pool_available** (gauge)
#### Application Metrics
- [ ] **active_sessions** (gauge)
- [ ] **background_jobs_total** (counter):
- Labels: job_name, status (success, failure)
- [ ] **background_job_duration_seconds** (histogram):
- Labels: job_name
- [ ] **cache_operations_total** (counter):
- Labels: operation (hit, miss, set, delete)
- [ ] **external_api_calls_total** (counter):
- Labels: service, status
- [ ] **external_api_duration_seconds** (histogram):
- Labels: service
#### System Metrics
- [ ] **process_cpu_usage_percent** (gauge)
- [ ] **process_memory_usage_bytes** (gauge)
- [ ] **process_heap_usage_bytes** (gauge) - JavaScript-specific
- [ ] **process_open_file_descriptors** (gauge)
### Metric Collection
- [ ] **Prometheus client library** installed:
- Node.js: `prom-client`
- Python: `prometheus-client`
- Custom: OpenTelemetry SDK
- [ ] **Metrics endpoint** exposed (`/metrics`; see the sketch after this list)
- [ ] **Prometheus scrapes** endpoint (or push to gateway)
- [ ] **Metric naming** follows conventions:
- Lowercase with underscores
- Unit suffixes (_seconds, _bytes, _total)
- Namespace prefix (myapp_http_requests_total)
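A prom-client sketch tying the items above together; the `myapp_` prefix, buckets, and Express app are illustrative:
```typescript
import express from 'express';
import client from 'prom-client';

client.collectDefaultMetrics(); // process CPU, memory, event loop lag, etc.

export const httpRequestDuration = new client.Histogram({
  name: 'myapp_http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path', 'status'],
  buckets: [0.1, 0.5, 1, 2, 5, 10],
});

const app = express();

// Prometheus scrapes this endpoint on its own schedule.
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
```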
### Multi-Tenant Metrics
- [ ] **tenant_id label** on all relevant metrics
- [ ] **Per-tenant dashboards** (filter by tenant_id)
- [ ] **Tenant resource usage** tracked:
- [ ] API calls per tenant
- [ ] Database storage per tenant
- [ ] Data transfer per tenant
- [ ] **Tenant quotas** monitored (alert on approaching limit)
## Distributed Tracing
### Tracing Setup
- [ ] **OpenTelemetry SDK** installed:
- Node.js: `@opentelemetry/sdk-node`
- Python: `opentelemetry-sdk`
- [ ] **Tracing backend** configured:
- [ ] Jaeger (self-hosted)
- [ ] Datadog APM
- [ ] Honeycomb
- [ ] AWS X-Ray
- [ ] **Auto-instrumentation** enabled (see the bootstrap sketch after this list):
- [ ] HTTP client/server
- [ ] Database queries
- [ ] Redis operations
- [ ] Message queues
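A bootstrap sketch using the NodeSDK with auto-instrumentations, assuming an OTLP/HTTP collector on the default port; load this file before the rest of the app so module patching works:
```typescript
// tracing.ts -- import this before any other application module
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: 'api-server',
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces', // point at your collector or vendor endpoint
  }),
  // patches http, pg, redis, and many other libraries automatically
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```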
### Span Creation
- [ ] **Custom spans** for business logic:
```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payments');

async function handlePayment(tenantId: string, amount: number) {
  const span = tracer.startSpan('process-payment');
  span.setAttribute('tenant_id', tenantId);
  span.setAttribute('amount', amount);
  try {
    await processPayment();
    span.setStatus({ code: SpanStatusCode.OK });
  } catch (error) {
    span.recordException(error as Error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  } finally {
    span.end();
  }
}
```
- [ ] **Span attributes** include context:
- [ ] tenant_id, user_id, request_id
- [ ] Input parameters (non-sensitive)
- [ ] Result status
- [ ] **Span events** for key moments:
- [ ] "Payment started"
- [ ] "Database query executed"
- [ ] "External API called"
### Trace Context Propagation
- [ ] **W3C Trace Context** headers propagated:
- traceparent: version, trace-id, parent-id, trace-flags
- tracestate: vendor-specific data
- [ ] **Context propagated** across:
- [ ] HTTP requests (frontend ↔ backend)
- [ ] Background jobs
- [ ] Message queues
- [ ] Microservices
- [ ] **Trace ID** included in logs to correlate logs and traces (sketch below)
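A log/trace correlation sketch using a Pino `mixin` that reads the active OpenTelemetry span (assumes the tracing SDK is already running):
```typescript
import pino from 'pino';
import { trace } from '@opentelemetry/api';

// The mixin runs on every log call, so each line carries the active trace/span IDs.
const logger = pino({
  mixin() {
    const ctx = trace.getActiveSpan()?.spanContext();
    return ctx ? { trace_id: ctx.traceId, span_id: ctx.spanId } : {};
  },
});
```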
### Sampling
- [ ] **Sampling strategy** defined:
- [ ] Head-based: Sample at trace start (1%, 10%, 100%)
- [ ] Tail-based: Sample after trace completes (error traces, slow traces)
- [ ] Adaptive: Sample based on load
- [ ] **Always sample** errors and slow requests
- [ ] **Sample rate** appropriate for volume (start high, reduce if needed; see the sampler sketch below)
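A head-based sampling sketch with the SDK's built-in samplers; the 10% ratio is illustrative, and tail-based sampling usually lives in the collector rather than the application:
```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  // respect the parent's decision; sample ~10% of new root traces
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.1) }),
});

sdk.start();
```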
## Alerting
### Alert Definitions
- [ ] **Error rate alerts**:
- [ ] Condition: Error rate > 5% for 5 minutes
- [ ] Severity: Critical
- [ ] Action: Page on-call engineer
- [ ] **Latency alerts**:
- [ ] Condition: p95 latency > 1s for 10 minutes
- [ ] Severity: Warning
- [ ] Action: Slack notification
- [ ] **Availability alerts**:
- [ ] Condition: Health check fails 3 consecutive times
- [ ] Severity: Critical
- [ ] Action: Page on-call + auto-restart
- [ ] **Resource alerts**:
- [ ] Memory usage > 80%
- [ ] CPU usage > 80%
- [ ] Disk usage > 85%
- [ ] Database connections > 90% of pool
- [ ] **Business metric alerts**:
- [ ] Registration rate drops > 50%
- [ ] Payment failures increase > 10%
- [ ] Active users drop significantly
### Alert Channels
- [ ] **PagerDuty** (or equivalent) for critical alerts
- [ ] **Slack** for warnings and notifications
- [ ] **Email** for non-urgent alerts
- [ ] **SMS** for highest-priority alerts (use sparingly)
### Alert Management
- [ ] **Alert fatigue** prevented:
- [ ] Appropriate thresholds (not too sensitive)
- [ ] Proper severity levels (not everything is critical)
- [ ] Alert aggregation (deduplicate similar alerts)
- [ ] **Runbooks** for each alert:
- [ ] What the alert means
- [ ] How to investigate
- [ ] How to resolve
- [ ] Escalation path
- [ ] **Alert suppression** during deployments (planned downtime)
- [ ] **Alert escalation** if not acknowledged
## Dashboards & Visualization
### Standard Dashboards
- [ ] **Service Overview** dashboard:
- [ ] Request rate (requests/sec)
- [ ] Error rate (errors/sec, %)
- [ ] Latency (p50, p95, p99)
- [ ] Availability (uptime %)
- [ ] **Database** dashboard:
- [ ] Query rate
- [ ] Slow queries (p95, p99)
- [ ] Connection pool usage
- [ ] Table sizes
- [ ] **System Resources** dashboard:
- [ ] CPU usage
- [ ] Memory usage
- [ ] Disk I/O
- [ ] Network I/O
- [ ] **Business Metrics** dashboard:
- [ ] Active users
- [ ] Registrations
- [ ] Revenue
- [ ] Feature usage
### Dashboard Best Practices
- [ ] **Auto-refresh** enabled (every 30-60 seconds)
- [ ] **Time range** configurable (last hour, 24h, 7 days)
- [ ] **Drill-down** to detailed views
- [ ] **Annotations** for deployments/incidents
- [ ] **Shared dashboards** accessible to team
### Per-Tenant Dashboards
- [ ] **Tenant filter** on all relevant dashboards
- [ ] **Tenant resource usage** visualized
- [ ] **Tenant-specific alerts** (for large customers)
- [ ] **Tenant comparison** view (compare usage across tenants)
## Health Checks
### Endpoint Implementation
- [ ] **Health check endpoint** (`/health` or `/healthz`; see the Express sketch after this list):
- [ ] Returns 200 OK when healthy
- [ ] Returns 503 Service Unavailable when unhealthy
- [ ] Includes subsystem status
```json
{
  "status": "healthy",
  "version": "1.2.3",
  "uptime_seconds": 86400,
  "checks": {
    "database": "healthy",
    "redis": "healthy",
    "external_api": "degraded"
  }
}
```
- [ ] **Liveness probe** (`/health/live`):
- [ ] Checks if application is running
- [ ] Fails → restart container
- [ ] **Readiness probe** (`/health/ready`):
- [ ] Checks if application is ready to serve traffic
- [ ] Fails → remove from load balancer
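An Express sketch of the endpoint above; `checkDatabase` and `checkRedis` are hypothetical probes to replace with real connectivity tests:
```typescript
import express from 'express';

// Hypothetical probes -- wire these to your actual database/cache clients.
declare function checkDatabase(): Promise<void>;
declare function checkRedis(): Promise<void>;

const app = express();

app.get('/health', async (_req, res) => {
  const checks = {
    database: await checkDatabase().then(() => 'healthy', () => 'unhealthy'),
    redis: await checkRedis().then(() => 'healthy', () => 'unhealthy'),
  };
  const healthy = Object.values(checks).every((s) => s === 'healthy');
  res.status(healthy ? 200 : 503).json({ status: healthy ? 'healthy' : 'unhealthy', checks });
});
```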
### Health Check Coverage
- [ ] **Database connectivity** checked
- [ ] **Cache connectivity** checked (Redis, Memcached)
- [ ] **External APIs** checked (optional, can cause false positives)
- [ ] **Disk space** checked
- [ ] **Critical dependencies** checked
### Monitoring Health Checks
- [ ] **Uptime monitoring** service (Pingdom, UptimeRobot, Datadog Synthetics)
- [ ] **Check frequency** appropriate (every 1-5 minutes)
- [ ] **Alerting** on failed health checks
- [ ] **Geographic monitoring** (check from multiple regions)
## Error Tracking
### Error Capture
- [ ] **Error tracking service** integrated (a Sentry sketch follows this list):
- [ ] Sentry
- [ ] Datadog Error Tracking
- [ ] Rollbar
- [ ] Custom solution
- [ ] **Unhandled exceptions** captured automatically
- [ ] **Handled errors** reported when appropriate
- [ ] **Error context** included:
- [ ] User ID, tenant ID
- [ ] Request ID, trace ID
- [ ] User actions (breadcrumbs)
- [ ] Environment variables (non-sensitive)
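A Sentry initialization sketch; the environment variables and the `reportError` helper are assumptions, and `beforeSend` is just one place to scrub sensitive fields:
```typescript
import * as Sentry from '@sentry/node';

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  release: process.env.GIT_SHA, // tie errors to a specific deployment
  beforeSend(event) {
    delete event.request?.cookies; // scrub before the event leaves the process
    return event;
  },
});

// Attach context so grouped errors carry tenant/user IDs.
export function reportError(error: Error, userId: string, tenantId: string) {
  Sentry.setUser({ id: userId });
  Sentry.setTag('tenant_id', tenantId);
  Sentry.captureException(error);
}
```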
### Error Grouping
- [ ] **Errors grouped** by fingerprint (same error, different occurrences)
- [ ] **Error rate** tracked per group
- [ ] **Alerting** on new error types or spikes in existing ones
- [ ] **Error assignment** to team members
- [ ] **Resolution tracking** (mark errors as resolved)
### Privacy & Security
- [ ] **PII redacted** from error reports:
- [ ] Passwords, tokens, API keys
- [ ] Credit card numbers
- [ ] Email addresses (unless necessary)
- [ ] SSNs, tax IDs
- [ ] **Source maps** uploaded for frontend (de-minify stack traces)
- [ ] **Release tagging** (associate errors with deployments)
## Performance Monitoring
### Real User Monitoring (RUM)
- [ ] **RUM tool integrated** (Datadog RUM, New Relic Browser, Google Analytics; see the web-vitals sketch after this list):
- [ ] Page load times
- [ ] Core Web Vitals (LCP, INP, CLS)
- [ ] JavaScript errors
- [ ] User sessions
- [ ] **Performance budgets** defined:
- [ ] First Contentful Paint < 1.8s
- [ ] Largest Contentful Paint < 2.5s
- [ ] Time to Interactive < 3.8s
- [ ] Cumulative Layout Shift < 0.1
- [ ] **Alerting** on performance regressions
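A browser-side sketch using the `web-vitals` library; the `/rum` endpoint is hypothetical, and hosted RUM tools collect these automatically:
```typescript
import { onLCP, onCLS, onINP } from 'web-vitals';

// Ship each vital to a collection endpoint as it finalizes.
function report(metric: { name: string; value: number }) {
  navigator.sendBeacon('/rum', JSON.stringify({ name: metric.name, value: metric.value }));
}

onLCP(report);
onCLS(report);
onINP(report);
```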
### Application Performance Monitoring (APM)
- [ ] **APM tool** integrated (Datadog APM, New Relic APM):
- [ ] Trace every request
- [ ] Identify slow endpoints
- [ ] Database query analysis
- [ ] External API profiling
- [ ] **Performance profiling** for critical paths:
- [ ] Authentication flow
- [ ] Payment processing
- [ ] Data exports
- [ ] Complex queries
## Cost Management
- [ ] **Observability costs** tracked:
- [ ] Log ingestion costs
- [ ] Metric cardinality costs
- [ ] Trace sampling costs
- [ ] Dashboard/seat costs
- [ ] **Cost optimization**:
- [ ] Log sampling for high-volume services
- [ ] Metric aggregation (reduce cardinality)
- [ ] Trace sampling (not 100% in production)
- [ ] Data retention policies
- [ ] **Budget alerts** configured
## Security & Compliance
- [ ] **Access control** on observability tools (role-based)
- [ ] **Audit logging** for observability access
- [ ] **Data retention** complies with regulations (GDPR, HIPAA)
- [ ] **Data encryption** in transit and at rest
- [ ] **PII handling** compliant (redaction, anonymization)
## Testing Observability
- [ ] **Log output** tested in unit tests:
```typescript
test('logs user login', async () => {
  const logs = captureLogs();
  await loginUser();
  expect(logs).toContainEqual(
    expect.objectContaining({
      level: 'info',
      message: 'User logged in',
      user_id: expect.any(String)
    })
  );
});
```
- [ ] **Metrics** incremented in tests (see the sketch after this list)
- [ ] **Traces** created in integration tests
- [ ] **Health checks** tested
- [ ] **Alert thresholds** tested (inject failures, verify alert fires)
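A sketch of asserting a counter increment with prom-client's async `get()`; the `logins_total` metric and Jest-style globals are illustrative:
```typescript
import { Counter } from 'prom-client';

const loginCounter = new Counter({ name: 'logins_total', help: 'Total successful logins' });

test('increments login counter', async () => {
  const before = (await loginCounter.get()).values[0]?.value ?? 0;
  loginCounter.inc();
  const after = (await loginCounter.get()).values[0].value;
  expect(after).toBe(before + 1);
});
```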
## Documentation
- [ ] **Observability runbook** created:
- [ ] How to access logs, metrics, traces
- [ ] How to create dashboards
- [ ] How to set up alerts
- [ ] Common troubleshooting queries
- [ ] **Alert runbooks** for each alert
- [ ] **Dashboard documentation** (what each panel shows)
- [ ] **Metric dictionary** (what each metric means)
- [ ] **On-call procedures** documented
## Scoring
- **85+ items checked**: Excellent - Production-grade observability ✅
- **65-84 items**: Good - Most observability covered ⚠️
- **45-64 items**: Fair - Significant gaps exist 🔴
- **<45 items**: Poor - Not ready for production ❌
## Priority Items
Address these first:
1. **Structured logging** - Foundation for debugging
2. **Error tracking** - Catch and fix bugs quickly
3. **Health checks** - Know when service is down
4. **Alerting** - Get notified of issues
5. **Key metrics** - Request rate, error rate, latency
## Common Pitfalls
**Don't:**
- Log sensitive data (passwords, tokens, PII)
- Create high-cardinality metrics (user_id as label)
- Trace 100% of requests in production (sample instead)
- Alert on every anomaly (alert fatigue)
- Ignore observability until there's a problem
**Do:**
- Log at appropriate levels (use DEBUG for verbose)
- Use correlation IDs throughout request lifecycle
- Set up alerts with clear runbooks
- Review dashboards regularly (detect issues early)
- Iterate on observability (improve over time)
## Related Resources
- [OpenTelemetry Documentation](https://opentelemetry.io/docs/)
- [Pino Logger](https://getpino.io)
- [Prometheus Best Practices](https://prometheus.io/docs/practices/)
- [observability-engineering skill](../SKILL.md)
---
**Total Items**: 140+ observability checks
**Critical Items**: Logging, Metrics, Alerting, Health checks
**Coverage**: Logs, Metrics, Traces, Alerts, Dashboards
**Last Updated**: 2025-11-10