---
name: distributed-tracing
description: Implement distributed tracing with Jaeger and Tempo to track requests across microservices and identify performance bottlenecks. Use when debugging microservices, analyzing request flows, or implementing observability for distributed systems.
---

# Distributed Tracing

Implement distributed tracing with Jaeger and Tempo for request flow visibility across microservices.

## Purpose

Track requests across distributed systems to understand latency, dependencies, and failure points.

## When to Use

- Debug latency issues
- Understand service dependencies
- Identify bottlenecks
- Trace error propagation
- Analyze request paths

## Distributed Tracing Concepts

### Trace Structure

```
Trace (Request ID: abc123)
  ↓
Span (frontend) [100ms]
  ↓
Span (api-gateway) [80ms]
  ├→ Span (auth-service) [10ms]
  └→ Span (user-service) [60ms]
      └→ Span (database) [40ms]
```

### Key Components

- **Trace** - End-to-end request journey
- **Span** - Single operation within a trace
- **Context** - Metadata propagated between services
- **Tags** - Key-value pairs for filtering
- **Logs** - Timestamped events within a span

## Jaeger Setup

### Kubernetes Deployment

```bash
# Deploy Jaeger Operator
kubectl create namespace observability
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n observability

# Deploy Jaeger instance
kubectl apply -f - < {
  const tracer = trace.getTracer('my-service');
  const span = tracer.startSpan('get_users');
  try {
    const users = await fetchUsers();
    span.setAttributes({ 'user.count': users.length });
    res.json({ users });
  } finally {
    span.end();
  }
});
```

#### Go

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute" // needed for attribute.String / attribute.Int below
	"go.opentelemetry.io/otel/exporters/jaeger"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

// initTracer configures a batching tracer provider that exports to the Jaeger collector.
func initTracer() (*sdktrace.TracerProvider, error) {
	exporter, err :=
		jaeger.New(jaeger.WithCollectorEndpoint(
			jaeger.WithEndpoint("http://jaeger:14268/api/traces"),
		))
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("my-service"),
		)),
	)

	otel.SetTracerProvider(tp)
	return tp, nil
}

// getUsers wraps the database call in a span.
// User and fetchUsersFromDB are assumed to be defined elsewhere in the service.
func getUsers(ctx context.Context) ([]User, error) {
	tracer := otel.Tracer("my-service")
	ctx, span := tracer.Start(ctx, "get_users")
	defer span.End()

	span.SetAttributes(attribute.String("user.filter", "active"))

	users, err := fetchUsersFromDB(ctx)
	if err != nil {
		span.RecordError(err)
		return nil, err
	}

	span.SetAttributes(attribute.Int("user.count", len(users)))
	return users, nil
}
```

**Reference:** See `references/instrumentation.md`

## Context Propagation

### HTTP Headers

```
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
```

### Propagation in HTTP Requests

#### Python

```python
import requests
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # Injects the current trace context into the headers dict

response = requests.get('http://downstream-service/api', headers=headers)
```

#### Node.js

```javascript
const axios = require('axios');
const { context, propagation } = require('@opentelemetry/api');

const headers = {};
// Inject the active trace context into the outgoing headers
propagation.inject(context.active(), headers);

axios.get('http://downstream-service/api', { headers });
```

## Tempo Setup (Grafana)

### Kubernetes Deployment

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tempo-config
data:
  tempo.yaml: |
    server:
      http_listen_port: 3200

    distributor:
      receivers:
        jaeger:
          protocols:
            thrift_http:
            grpc:
        otlp:
          protocols:
            http:
            grpc:

    storage:
      trace:
        backend: s3
        s3:
          bucket: tempo-traces
          endpoint: s3.amazonaws.com

    querier:
      frontend_worker:
        frontend_address: tempo-query-frontend:9095
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tempo
spec:
  replicas: 1
  # apps/v1 requires a selector that matches the pod template labels;
  # the `app: tempo` label is an illustrative choice
  selector:
    matchLabels:
      app: tempo
  template:
    metadata:
      labels:
        app: tempo
    spec:
      containers:
        - name: tempo
          image: grafana/tempo:latest
          args:
            - -config.file=/etc/tempo/tempo.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/tempo
      volumes:
        - name: config
          configMap:
            name: tempo-config
```

**Reference:** See `assets/jaeger-config.yaml.template`

## Sampling Strategies

### Probabilistic Sampling

```yaml
# Sample 1% of traces
sampler:
  type: probabilistic
  param: 0.01
```

### Rate Limiting Sampling

```yaml
# Sample max 100 traces per second
sampler:
  type: ratelimiting
  param: 100
```

### Parent-Based Ratio Sampling

```python
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 1% of root traces, chosen deterministically from the trace ID;
# child spans follow their parent's sampling decision
sampler = ParentBased(root=TraceIdRatioBased(0.01))
```

## Trace Analysis

### Finding Slow Requests

**Jaeger Query:**
```
service=my-service
duration > 1s
```

### Finding Errors

**Jaeger Query:**
```
service=my-service
error=true
tags.http.status_code >= 500
```

### Service Dependency Graph

Jaeger automatically generates service dependency graphs showing:
- Service relationships
- Request rates
- Error rates
- Average latencies

## Best Practices

1. **Sample appropriately** (1-10% in production)
2. **Add meaningful tags** (user_id, request_id)
3. **Propagate context** across all service boundaries
4. **Log exceptions** in spans
5. **Use consistent naming** for operations
6. **Monitor tracing overhead** (<1% CPU impact)
7. **Set up alerts** for trace errors
8. **Implement distributed context** (baggage)
9. **Use span events** for important milestones
10. **Document instrumentation** standards

## Integration with Logging

### Correlated Logs

```python
import logging
from opentelemetry import trace

logger = logging.getLogger(__name__)

def process_request():
    span = trace.get_current_span()
    # trace_id is a 128-bit integer; render it as 32 lowercase hex characters
    trace_id = span.get_span_context().trace_id

    logger.info(
        "Processing request",
        extra={"trace_id": format(trace_id, '032x')}
    )
```

## Troubleshooting

**No traces appearing:**
- Check collector endpoint
- Verify network connectivity
- Check sampling configuration
- Review application logs

**High latency overhead:**
- Reduce sampling rate
- Use batch span processor
- Check exporter configuration

## Reference Files

- `references/jaeger-setup.md` - Jaeger installation
- `references/instrumentation.md` - Instrumentation patterns
- `assets/jaeger-config.yaml.template` - Jaeger configuration

## Related Skills

- `prometheus-configuration` - For metrics
- `grafana-dashboards` - For visualization
- `slo-implementation` - For latency SLOs
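
## Worked Example: Matching Header and Log Trace IDs

The `traceparent` header under Context Propagation and the `format(trace_id, '032x')` call under Correlated Logs carry the same trace ID in the same 32-character hex form. The sketch below shows that link using only the standard library. The names `TraceParent`, `parse_traceparent`, and `log_fields` are hypothetical helpers and not part of any OpenTelemetry API. The parser assumes the four-field `version-traceid-parentid-flags` layout and is not a full implementation of the W3C Trace Context specification.

```python
import re
from typing import NamedTuple

# Assumed layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    """Hypothetical container for the fields of a traceparent header."""
    version: str
    trace_id: int
    span_id: int
    sampled: bool


def parse_traceparent(header: str) -> TraceParent:
    """Split a traceparent header value into its four fields."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_hex, span_hex, flags_hex = match.groups()
    trace_id = int(trace_hex, 16)
    span_id = int(span_hex, 16)
    # All-zero IDs are invalid in the W3C format.
    if trace_id == 0 or span_id == 0:
        raise ValueError(f"all-zero ID in traceparent: {header!r}")
    # The lowest bit of the flags byte is the "sampled" flag.
    sampled = bool(int(flags_hex, 16) & 0x01)
    return TraceParent(version, trace_id, span_id, sampled)


def log_fields(parent: TraceParent) -> dict:
    """Render the IDs the same way the Correlated Logs example does."""
    return {
        "trace_id": format(parent.trace_id, "032x"),
        "span_id": format(parent.span_id, "016x"),
    }


if __name__ == "__main__":
    parent = parse_traceparent(
        "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
    )
    print(log_fields(parent), "sampled:", parent.sampled)
```

Because the zero-padded hex rendering round-trips exactly, a trace ID copied from a log line can be pasted straight into a trace search, and vice versa.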