---
name: distributed-tracing
description: Implement distributed tracing with Jaeger and Tempo to track requests across microservices and identify performance bottlenecks. Use when debugging microservices, analyzing request flows, or implementing observability for distributed systems.
---
# Distributed Tracing
Implement distributed tracing with Jaeger and Tempo for request flow visibility across microservices.
## Purpose
Track requests across distributed systems to understand latency, dependencies, and failure points.
## When to Use
- Debug latency issues
- Understand service dependencies
- Identify bottlenecks
- Trace error propagation
- Analyze request paths
## Distributed Tracing Concepts
### Trace Structure
```
Trace (Request ID: abc123)
└→ Span (frontend) [100ms]
   └→ Span (api-gateway) [80ms]
      ├→ Span (auth-service) [10ms]
      └→ Span (user-service) [60ms]
         └→ Span (database) [40ms]
```
### Key Components
- **Trace** - End-to-end request journey
- **Span** - Single operation within a trace
- **Context** - Metadata propagated between services
- **Tags** - Key-value pairs for filtering
- **Logs** - Timestamped events within a span
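These concepts map directly onto the OpenTelemetry API. A minimal sketch in Python (the span name, attribute keys, and event name are illustrative):
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# One span = one operation. Attributes are the filterable tags;
# events are the timestamped logs attached to the span.
with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("user.id", "u-42")  # tag for filtering
    span.add_event("cart.validated")       # timestamped event
    # Spans started inside this block become children automatically.
```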
## Jaeger Setup
### Kubernetes Deployment
```bash
# Deploy Jaeger Operator
kubectl create namespace observability
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n observability

# Deploy Jaeger instance
kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
  ingress:
    enabled: true
EOF
```
### Docker Compose
```yaml
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "5775:5775/udp"
      - "6831:6831/udp"
      - "6832:6832/udp"
      - "5778:5778"
      - "16686:16686"   # UI
      - "14268:14268"   # Collector
      - "14250:14250"   # gRPC
      - "9411:9411"     # Zipkin
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411
```
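With this stack running (`docker compose up -d`), the Jaeger UI is served at `http://localhost:16686`.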
**Reference:** See `references/jaeger-setup.md`
## Application Instrumentation
### OpenTelemetry (Recommended)
#### Python (Flask)
```python
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from flask import Flask

# Initialize tracer
resource = Resource(attributes={SERVICE_NAME: "my-service"})
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Instrument Flask
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

@app.route('/api/users')
def get_users():
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("get_users") as span:
        span.set_attribute("user.count", 100)
        # Business logic
        users = fetch_users_from_db()
        return {"users": users}

def fetch_users_from_db():
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("database_query") as span:
        span.set_attribute("db.system", "postgresql")
        span.set_attribute("db.statement", "SELECT * FROM users")
        # Database query
        return query_database()
```
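This assumes the `opentelemetry-sdk`, `opentelemetry-exporter-jaeger`, and `opentelemetry-instrumentation-flask` packages are installed; note that recent OpenTelemetry Python releases deprecate the Jaeger exporter in favor of OTLP, which Jaeger ingests natively.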
#### Node.js (Express)
```javascript
const { trace } = require('@opentelemetry/api');
const { Resource } = require('@opentelemetry/resources');
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');

// Initialize tracer
const provider = new NodeTracerProvider({
  resource: new Resource({ 'service.name': 'my-service' }),
});
const exporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces',
});
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

// Instrument libraries
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});

const express = require('express');
const app = express();

app.get('/api/users', async (req, res) => {
  const tracer = trace.getTracer('my-service');
  const span = tracer.startSpan('get_users');
  try {
    const users = await fetchUsers();
    span.setAttributes({ 'user.count': users.length });
    res.json({ users });
  } finally {
    span.end();
  }
});
```
#### Go
```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/jaeger"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

func initTracer() (*sdktrace.TracerProvider, error) {
	exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
		jaeger.WithEndpoint("http://jaeger:14268/api/traces"),
	))
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("my-service"),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}

func getUsers(ctx context.Context) ([]User, error) {
	tracer := otel.Tracer("my-service")
	ctx, span := tracer.Start(ctx, "get_users")
	defer span.End()

	span.SetAttributes(attribute.String("user.filter", "active"))
	users, err := fetchUsersFromDB(ctx)
	if err != nil {
		span.RecordError(err)
		return nil, err
	}
	span.SetAttributes(attribute.Int("user.count", len(users)))
	return users, nil
}
```
**Reference:** See `references/instrumentation.md`
## Context Propagation
### HTTP Headers
```
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
```
The `traceparent` value follows the W3C Trace Context format `version-trace_id-parent_span_id-trace_flags`; the trailing `01` marks the trace as sampled.
### Propagation in HTTP Requests
#### Python
```python
import requests
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # Injects trace context (traceparent/tracestate)
response = requests.get('http://downstream-service/api', headers=headers)
```
#### Node.js
```javascript
const axios = require('axios');
const { context, propagation } = require('@opentelemetry/api');

const headers = {};
propagation.inject(context.active(), headers);
axios.get('http://downstream-service/api', { headers });
```
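On the receiving side, the downstream service rebuilds the caller's context from the incoming headers so its spans join the same trace. A sketch using the OpenTelemetry Python API (`request` stands in for your framework's request object):
```python
from opentelemetry import trace
from opentelemetry.propagate import extract

def handle_request(request):
    # Rebuild the upstream context from the incoming HTTP headers
    ctx = extract(request.headers)
    tracer = trace.get_tracer(__name__)
    # Spans started with this context continue the caller's trace
    with tracer.start_as_current_span("handle_request", context=ctx) as span:
        span.set_attribute("http.route", "/api")
        # ... handle the request ...
```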
## Tempo Setup (Grafana)
### Kubernetes Deployment
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tempo-config
data:
  tempo.yaml: |
    server:
      http_listen_port: 3200
    distributor:
      receivers:
        jaeger:
          protocols:
            thrift_http:
            grpc:
        otlp:
          protocols:
            http:
            grpc:
    storage:
      trace:
        backend: s3
        s3:
          bucket: tempo-traces
          endpoint: s3.amazonaws.com
    querier:
      frontend_worker:
        frontend_address: tempo-query-frontend:9095
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tempo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tempo
  template:
    metadata:
      labels:
        app: tempo
    spec:
      containers:
        - name: tempo
          image: grafana/tempo:latest
          args:
            - -config.file=/etc/tempo/tempo.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/tempo
      volumes:
        - name: config
          configMap:
            name: tempo-config
```
**Reference:** See `assets/jaeger-config.yaml.template`
## Sampling Strategies
### Probabilistic Sampling
```yaml
# Sample 1% of traces
sampler:
  type: probabilistic
  param: 0.01
```
### Rate Limiting Sampling
```yaml
# Sample max 100 traces per second
sampler:
  type: ratelimiting
  param: 100
```
### Parent-Based Ratio Sampling (OpenTelemetry)
```python
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 1% of traces, deterministically by trace ID, while always
# honoring the sampling decision already made by the parent span
sampler = ParentBased(root=TraceIdRatioBased(0.01))
```
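To put the sampler into effect, pass it to the tracer provider at construction time (continuing the Flask setup above, with `resource` and `sampler` from the earlier snippets):
```python
from opentelemetry.sdk.trace import TracerProvider

provider = TracerProvider(resource=resource, sampler=sampler)
```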
## Trace Analysis
### Finding Slow Requests
**Jaeger Query:**
```
service=my-service
duration > 1s
```
### Finding Errors
**Jaeger Query:**
```
service=my-service
error=true
tags.http.status_code >= 500
```
### Service Dependency Graph
Jaeger automatically generates service dependency graphs showing:
- Service relationships
- Request rates
- Error rates
- Average latencies
## Best Practices
1. **Sample appropriately** (1-10% in production)
2. **Add meaningful tags** (user_id, request_id)
3. **Propagate context** across all service boundaries
4. **Log exceptions** in spans (see the sketch after this list)
5. **Use consistent naming** for operations
6. **Monitor tracing overhead** (<1% CPU impact)
7. **Set up alerts** for trace errors
8. **Implement distributed context** (baggage)
9. **Use span events** for important milestones
10. **Document instrumentation** standards
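Practices 4 and 9 in code, as a minimal sketch (the operation name and `gateway.charge` call are illustrative):
```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

def charge_card(order):
    with tracer.start_as_current_span("charge_card") as span:
        span.add_event("payment.started")   # span event for a milestone
        try:
            return gateway.charge(order)    # hypothetical payment call
        except Exception as exc:
            span.record_exception(exc)      # records the exception on the span
            span.set_status(Status(StatusCode.ERROR))
            raise
```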
## Integration with Logging
### Correlated Logs
```python
import logging
from opentelemetry import trace

logger = logging.getLogger(__name__)

def process_request():
    span = trace.get_current_span()
    trace_id = span.get_span_context().trace_id
    logger.info(
        "Processing request",
        extra={"trace_id": format(trace_id, '032x')}
    )
```
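To stamp every log line automatically instead of passing `extra` by hand, a `logging.Filter` can inject the trace ID; a sketch (the handler wiring is illustrative):
```python
import logging
from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every record (all zeros outside a span)."""
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x")
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(logging.Formatter("%(asctime)s trace=%(trace_id)s %(message)s"))
logging.getLogger().addHandler(handler)
```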
## Troubleshooting
**No traces appearing:**
- Check collector endpoint
- Verify network connectivity
- Check sampling configuration
- Review application logs (or dump spans locally, as sketched below)
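A quick way to confirm instrumentation works before suspecting the network: temporarily print spans to stdout with a console exporter (OpenTelemetry Python shown, reusing the `provider` from the setup above):
```python
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# If spans print here but never reach Jaeger, the problem is the
# exporter endpoint or the network, not the instrumentation.
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
```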
**High latency overhead:**
- Reduce sampling rate
- Use batch span processor
- Check exporter configuration
## Reference Files
- `references/jaeger-setup.md` - Jaeger installation
- `references/instrumentation.md` - Instrumentation patterns
- `assets/jaeger-config.yaml.template` - Jaeger configuration
## Related Skills
- `prometheus-configuration` - For metrics
- `grafana-dashboards` - For visualization
- `slo-implementation` - For latency SLOs