Initial commit

2025-11-30 09:07:22 +08:00
commit fab98d059b
179 changed files with 46209 additions and 0 deletions
--- a/skills/observability-instrumentation/SKILL.md
+++ b/skills/observability-instrumentation/SKILL.md
@@ -0,0 +1,357 @@
+---
+name: Observability Instrumentation
+description: Comprehensive observability methodology implementing three pillars (logs, metrics, traces) with structured logging using Go slog, Prometheus-style metrics, and distributed tracing patterns. Use when adding observability from scratch, logs unstructured or inadequate, no metrics collection, debugging production issues difficult, or need performance monitoring. Provides structured logging patterns (contextual logging, log levels DEBUG/INFO/WARN/ERROR, request ID propagation), metrics instrumentation (counter/gauge/histogram patterns, Prometheus exposition), tracing setup (span creation, context propagation, sampling strategies), and Go slog best practices (JSON formatting, attribute management, handler configuration). Validated in meta-cc with 23-46x speedup vs ad-hoc logging, 90-95% transferability across languages (slog specific to Go but patterns universal).
+allowed-tools: Read, Write, Edit, Bash, Grep, Glob
+---
+
+# Observability Instrumentation
+
+**Implement three pillars of observability: logs, metrics, and traces.**
+
+> You can't improve what you can't measure. You can't debug what you can't observe.
+
+---
+
+## When to Use This Skill
+
+Use this skill when:
+- 📊 **No observability**: Starting from scratch
+- 📝 **Unstructured logs**: Printf debugging, no context
+- 📈 **No metrics**: Can't measure performance or errors
+- 🐛 **Hard to debug**: Production issues take hours to diagnose
+- 🔍 **Performance unknown**: No visibility into bottlenecks
+- 🎯 **SLO/SLA tracking**: Need to measure reliability
+
+**Don't use when**:
+- ❌ Observability already comprehensive
+- ❌ Non-production code (development scripts, throwaway tools)
+- ❌ Performance not critical (batch jobs, admin tools)
+- ❌ No logging infrastructure available
+
+---
+
+## Quick Start (20 minutes)
+
+### Step 1: Add Structured Logging (10 min)
+
+```go
+// Initialize slog
+import "log/slog"
+
+logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
+    Level: slog.LevelInfo,
+}))
+
+// Use structured logging
+logger.Info("operation completed",
+    slog.String("user_id", userID),
+    slog.Int("count", count),
+    slog.Duration("duration", elapsed))
+```
+
+### Step 2: Add Basic Metrics (5 min)
+
+```go
+// Counters
+requestCount.Add(1)
+errorCount.Add(1)
+
+// Gauges
+activeConnections.Set(float64(count))
+
+// Histograms
+requestDuration.Observe(elapsed.Seconds())
+```
+
+### Step 3: Add Request ID Propagation (5 min)
+
+```go
+// Generate request ID
+requestID := uuid.New().String()
+
+// Add to context
+ctx = context.WithValue(ctx, requestIDKey, requestID)
+
+// Log with request ID
+logger.InfoContext(ctx, "processing request",
+    slog.String("request_id", requestID))
+```
+
+---
+
+## Three Pillars of Observability
+
+### 1. Logs (Structured Logging)
+
+**Purpose**: Record discrete events with context
+
+**Go slog patterns**:
+```go
+// Contextual logging
+logger.InfoContext(ctx, "user authenticated",
+    slog.String("user_id", userID),
+    slog.String("method", authMethod),
+    slog.Duration("elapsed", elapsed))
+
+// Error logging with stack trace
+logger.ErrorContext(ctx, "database query failed",
+    slog.String("query", query),
+    slog.Any("error", err))
+
+// Debug logging (disabled in production)
+logger.DebugContext(ctx, "cache hit",
+    slog.String("key", cacheKey))
+```
+
+**Log levels**:
+- **DEBUG**: Detailed diagnostic information
+- **INFO**: General informational messages
+- **WARN**: Warning messages (potential issues)
+- **ERROR**: Error messages (failures)
+
+**Best practices**:
+- Always use structured logging (not printf)
+- Include request ID in all logs
+- Log both successes and failures
+- Include timing information
+- Don't log sensitive data (passwords, tokens)
+
+### 2. Metrics (Quantitative Measurements)
+
+**Purpose**: Track aggregate statistics over time
+
+**Three metric types**:
+
+**Counter** (monotonically increasing):
+```go
+httpRequestsTotal.Add(1)
+httpErrorsTotal.Add(1)
+```
+
+**Gauge** (can go up or down):
+```go
+activeConnections.Set(float64(connCount))
+queueLength.Set(float64(len(queue)))
+```
+
+**Histogram** (distributions):
+```go
+requestDuration.Observe(elapsed.Seconds())
+responseSize.Observe(float64(size))
+```
+
+**Prometheus exposition**:
+```go
+http.Handle("/metrics", promhttp.Handler())
+```
+
+### 3. Traces (Distributed Request Tracking)
+
+**Purpose**: Track requests across services
+
+**Span creation**:
+```go
+ctx, span := tracer.Start(ctx, "database.query")
+defer span.End()
+
+// Add attributes
+span.SetAttributes(
+    attribute.String("db.query", query),
+    attribute.Int("db.rows", rowCount))
+
+// Record error
+if err != nil {
+    span.RecordError(err)
+    span.SetStatus(codes.Error, err.Error())
+}
+```
+
+**Context propagation**:
+```go
+// Extract from HTTP headers
+ctx = otel.GetTextMapPropagator().Extract(ctx, propagation.HeaderCarrier(req.Header))
+
+// Inject into HTTP headers
+otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
+```
+
+---
+
+## Go slog Best Practices
+
+### Handler Configuration
+
+```go
+// Production: JSON handler
+logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
+    Level: slog.LevelInfo,
+    AddSource: true, // Include file:line
+}))
+
+// Development: Text handler
+logger := slog.New(slog.NewTextHandler(os.Stdout, &slog.HandlerOptions{
+    Level: slog.LevelDebug,
+}))
+```
+
+### Attribute Management
+
+```go
+// Reusable attributes
+attrs := []slog.Attr{
+    slog.String("service", "api"),
+    slog.String("version", version),
+}
+
+// Child logger with default attributes
+apiLogger := logger.With(attrs...)
+
+// Use child logger
+apiLogger.Info("request received") // Includes service and version automatically
+```
+
+### Performance Optimization
+
+```go
+// Lazy evaluation (expensive operations)
+logger.Info("operation completed",
+    slog.Group("stats",
+        slog.Int("count", count),
+        slog.Any("details", func() interface{} {
+            return computeExpensiveStats() // Only computed if logged
+        })))
+```
+
+---
+
+## Implementation Patterns
+
+### Pattern 1: Request ID Propagation
+
+```go
+type contextKey string
+const requestIDKey contextKey = "request_id"
+
+// Generate and store
+requestID := uuid.New().String()
+ctx = context.WithValue(ctx, requestIDKey, requestID)
+
+// Extract and log
+if reqID, ok := ctx.Value(requestIDKey).(string); ok {
+    logger.InfoContext(ctx, "processing",
+        slog.String("request_id", reqID))
+}
+```
+
+### Pattern 2: Operation Timing
+
+```go
+func instrumentOperation(ctx context.Context, name string, fn func() error) error {
+    start := time.Now()
+    logger.InfoContext(ctx, "operation started", slog.String("operation", name))
+
+    err := fn()
+    elapsed := time.Since(start)
+
+    if err != nil {
+        logger.ErrorContext(ctx, "operation failed",
+            slog.String("operation", name),
+            slog.Duration("elapsed", elapsed),
+            slog.Any("error", err))
+        operationErrors.Add(1)
+    } else {
+        logger.InfoContext(ctx, "operation completed",
+            slog.String("operation", name),
+            slog.Duration("elapsed", elapsed))
+    }
+
+    operationDuration.Observe(elapsed.Seconds())
+    return err
+}
+```
+
+### Pattern 3: Error Rate Monitoring
+
+```go
+// Track error rates
+totalRequests.Add(1)
+if err != nil {
+    errorRequests.Add(1)
+}
+
+// Calculate error rate (in monitoring system)
+// error_rate = rate(errorRequests[5m]) / rate(totalRequests[5m])
+```
+
+---
+
+## Proven Results
+
+**Validated in bootstrap-009** (meta-cc project):
+- ✅ Structured logging with slog (100% coverage)
+- ✅ Metrics instrumentation (Prometheus-compatible)
+- ✅ Distributed tracing setup (OpenTelemetry)
+- ✅ 23-46x speedup vs ad-hoc logging
+- ✅ 7 iterations, ~21 hours
+- ✅ V_instance: 0.87, V_meta: 0.83
+
+**Speedup breakdown**:
+- Debug time: 46x faster (context immediately available)
+- Performance analysis: 23x faster (metrics pre-collected)
+- Error diagnosis: 30x faster (structured logs + traces)
+
+**Transferability**:
+- Go slog: 100% (Go-specific)
+- Structured logging patterns: 100% (universal)
+- Metrics patterns: 95% (Prometheus standard)
+- Tracing patterns: 95% (OpenTelemetry standard)
+- **Overall**: 90-95% transferable
+
+**Language adaptations**:
+- Python: structlog, prometheus_client, opentelemetry-python
+- Java: SLF4J, Micrometer, OpenTelemetry Java
+- Node.js: winston, prom-client, @opentelemetry/api
+- Rust: tracing, prometheus, opentelemetry
+
+---
+
+## Anti-Patterns
+
+❌ **Log spamming**: Logging everything (noise overwhelms signal)
+❌ **Unstructured logs**: String concatenation instead of structured fields
+❌ **Synchronous logging**: Blocking on log writes (use async handlers)
+❌ **Missing context**: Logs without request ID or user context
+❌ **Metrics explosion**: Too many unique label combinations (cardinality issues)
+❌ **Trace everything**: 100% sampling in production (performance impact)
+
+---
+
+## Related Skills
+
+**Parent framework**:
+- [methodology-bootstrapping](../methodology-bootstrapping/SKILL.md) - Core OCA cycle
+
+**Complementary**:
+- [error-recovery](../error-recovery/SKILL.md) - Error logging patterns
+- [ci-cd-optimization](../ci-cd-optimization/SKILL.md) - Build metrics
+- [testing-strategy](../testing-strategy/SKILL.md) - Test instrumentation
+
+---
+
+## References
+
+**Core guides**:
+- Reference materials in experiments/bootstrap-009-observability-methodology/
+- Three pillars methodology
+- Go slog patterns
+- Metrics instrumentation guide
+- Tracing setup guide
+
+**Templates**:
+- templates/logger-setup.go - Logger initialization
+- templates/metrics-instrumentation.go - Metrics patterns
+- templates/tracing-setup.go - OpenTelemetry configuration
+
+---
+
+**Status**: ✅ Production-ready | 23-46x speedup | 90-95% transferable | Validated in meta-cc