Distributed Tracing Guide
What is Distributed Tracing?
Distributed tracing tracks a request as it flows through multiple services in a distributed system.
Key Concepts
- Trace: End-to-end journey of a request
- Span: Single operation within a trace
- Context: Metadata propagated between services (trace_id, span_id)
Example Flow
User Request → API Gateway → Auth Service → User Service → Database
[Trace ID: abc123]
  Span 1: gateway (50ms)
  Span 2: auth (20ms)
  Span 3: user_service (100ms)
  Span 4: db_query (80ms)
Total: 250ms, with a waterfall view showing dependencies
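To make the span hierarchy concrete, here is a minimal Python sketch (using the OpenTelemetry API covered later in this guide) in which nested spans produce this kind of parent/child waterfall; the span names are illustrative:

from opentelemetry import trace

tracer = trace.get_tracer("example")

# Each nested start_as_current_span call becomes a child of the span above it
with tracer.start_as_current_span("gateway"):
    with tracer.start_as_current_span("auth"):
        pass  # verify credentials
    with tracer.start_as_current_span("user_service"):
        with tracer.start_as_current_span("db_query"):
            pass  # run the SQL query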
OpenTelemetry (OTel)
OpenTelemetry is the industry standard for instrumentation.
Components
- API: Instrument code (create spans, add attributes)
- SDK: Implements the API, configures exporters
- Collector: Receives, processes, and exports telemetry data
- Exporters: Send data to backends (Jaeger, Tempo, Zipkin)
Architecture
Application → OTel SDK → OTel Collector → Backend (Jaeger/Tempo) → Visualization
Instrumentation Examples
Python (using OpenTelemetry)
Setup:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
# Setup tracer
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
# Configure exporter (assumes a local collector listening on 4317 without TLS)
otlp_exporter = OTLPSpanExporter(endpoint="localhost:4317", insecure=True)
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)
Manual instrumentation:
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
@tracer.start_as_current_span("process_order")
def process_order(order_id):
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)
    span.set_attribute("order.amount", 99.99)
    try:
        result = payment_service.charge(order_id)
        span.set_attribute("payment.status", "success")
        return result
    except Exception as e:
        span.set_status(trace.Status(trace.StatusCode.ERROR))
        span.record_exception(e)
        raise
Auto-instrumentation (Flask example):
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
# Auto-instrument Flask
FlaskInstrumentor().instrument_app(app)
# Auto-instrument requests library
RequestsInstrumentor().instrument()
# Auto-instrument SQLAlchemy
SQLAlchemyInstrumentor().instrument(engine=db.engine)
Node.js (using OpenTelemetry)
Setup:
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
// Setup provider
const provider = new NodeTracerProvider();
const exporter = new OTLPTraceExporter({ url: 'localhost:4317' });
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();
Manual instrumentation:
const { SpanStatusCode } = require('@opentelemetry/api');

const tracer = provider.getTracer('my-service');

async function processOrder(orderId) {
  const span = tracer.startSpan('process_order');
  span.setAttribute('order.id', orderId);
  try {
    const result = await paymentService.charge(orderId);
    span.setAttribute('payment.status', 'success');
    return result;
  } catch (error) {
    span.setStatus({ code: SpanStatusCode.ERROR });
    span.recordException(error);
    throw error;
  } finally {
    span.end();
  }
}
Auto-instrumentation:
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const { MongoDBInstrumentation } = require('@opentelemetry/instrumentation-mongodb');
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
    new MongoDBInstrumentation()
  ]
});
Go (using OpenTelemetry)
Setup:
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/trace"
)

func initTracer() {
    // Error handling omitted for brevity
    exporter, _ := otlptracegrpc.New(context.Background())
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
    )
    otel.SetTracerProvider(tp)
}
Manual instrumentation:
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

func processOrder(ctx context.Context, orderID string) error {
    tracer := otel.Tracer("my-service")
    ctx, span := tracer.Start(ctx, "process_order")
    defer span.End()

    span.SetAttributes(
        attribute.String("order.id", orderID),
        attribute.Float64("order.amount", 99.99),
    )

    err := paymentService.Charge(ctx, orderID)
    if err != nil {
        span.RecordError(err)
        return err
    }
    span.SetAttributes(attribute.String("payment.status", "success"))
    return nil
}
Span Attributes
Semantic Conventions
Follow OpenTelemetry semantic conventions for consistency:
HTTP:
span.set_attribute("http.method", "GET")
span.set_attribute("http.url", "https://api.example.com/users")
span.set_attribute("http.status_code", 200)
span.set_attribute("http.user_agent", "Mozilla/5.0...")
Database:
span.set_attribute("db.system", "postgresql")
span.set_attribute("db.name", "users_db")
span.set_attribute("db.statement", "SELECT * FROM users WHERE id = ?")
span.set_attribute("db.operation", "SELECT")
RPC/gRPC:
span.set_attribute("rpc.system", "grpc")
span.set_attribute("rpc.service", "UserService")
span.set_attribute("rpc.method", "GetUser")
span.set_attribute("rpc.grpc.status_code", 0)
Messaging:
span.set_attribute("messaging.system", "kafka")
span.set_attribute("messaging.destination", "user-events")
span.set_attribute("messaging.operation", "publish")
span.set_attribute("messaging.message_id", "msg123")
Custom Attributes
Add business context:
span.set_attribute("user.id", "user123")
span.set_attribute("order.id", "ORD-456")
span.set_attribute("feature.flag.checkout_v2", True)
span.set_attribute("cache.hit", False)
Context Propagation
W3C Trace Context (Standard)
Headers propagated between services:
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: vendor1=value1,vendor2=value2
Format: version-trace_id-parent_span_id-trace_flags
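A traceparent header splits cleanly on "-"; this small sketch simply pulls the example header above apart into its four fields:

# The four dash-separated fields of the example traceparent header
traceparent = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
version, trace_id, parent_span_id, trace_flags = traceparent.split("-")
print(trace_id)        # 0af7651916cd43dd8448eb211c80319c (16-byte trace ID, hex)
print(parent_span_id)  # b7ad6b7169203331 (8-byte span ID, hex)
print(trace_flags)     # 01 -> the "sampled" flag is set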
Implementation
Python:
from opentelemetry.propagate import inject, extract
import requests
# Inject context into outgoing request
headers = {}
inject(headers)
requests.get("https://api.example.com", headers=headers)
# Extract context from incoming request
from flask import request
ctx = extract(request.headers)
Node.js:
const { context, propagation } = require('@opentelemetry/api');

// Inject
const headers = {};
propagation.inject(context.active(), headers);
axios.get('https://api.example.com', { headers });

// Extract
const ctx = propagation.extract(context.active(), req.headers);
HTTP Example:
curl -H "traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01" \
https://api.example.com/users
Sampling Strategies
1. Always On/Off
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ALWAYS_ON, ALWAYS_OFF
# Development: trace everything
provider = TracerProvider(sampler=ALWAYS_ON)
# Production: trace nothing (usually not desired)
provider = TracerProvider(sampler=ALWAYS_OFF)
2. Probability-Based
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
# Sample 10% of traces
provider = TracerProvider(sampler=TraceIdRatioBased(0.1))
3. Rate Limiting
The OpenTelemetry Python SDK does not ship a rate-limiting sampler. Rate limiting is typically applied outside the application, for example via the Collector's tail_sampling processor (rate_limiting policy) or a vendor/remote sampler such as Jaeger's.
4. Parent-Based (Default)
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
# If parent span is sampled, sample child spans
sampler = ParentBased(root=TraceIdRatioBased(0.1))
provider = TracerProvider(sampler=sampler)
5. Custom Sampling
from opentelemetry.sdk.trace.sampling import Decision, Sampler, SamplingResult

class ErrorSampler(Sampler):
    """Always sample errors, sample ~1% of everything else"""
    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        attributes = attributes or {}
        # Always sample if the span is marked as an error
        if attributes.get("error", False):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        # Sample ~1% of the rest (3/256 ≈ 1.2%)
        if trace_id & 0xFF < 3:
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        return SamplingResult(Decision.DROP)

    def get_description(self):
        return "ErrorSampler"
provider = TracerProvider(sampler=ErrorSampler())
Backends
Jaeger
Docker Compose:
version: '3'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"   # UI
      - "4317:4317"     # OTLP gRPC
      - "4318:4318"     # OTLP HTTP
    environment:
      - COLLECTOR_OTLP_ENABLED=true
Query traces:
# UI: http://localhost:16686
# API: Get trace by ID
curl http://localhost:16686/api/traces/abc123
# Search traces
curl "http://localhost:16686/api/traces?service=my-service&limit=20"
Grafana Tempo
Docker Compose:
version: '3'
services:
  tempo:
    image: grafana/tempo:latest
    ports:
      - "3200:3200"   # Tempo
      - "4317:4317"   # OTLP gRPC
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
    command: ["-config.file=/etc/tempo.yaml"]
tempo.yaml:
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317

storage:
  trace:
    backend: local
    local:
      path: /tmp/tempo/traces
Query in Grafana:
- Install Tempo data source
- Use TraceQL:
{ span.http.status_code = 500 }
AWS X-Ray
Configuration:
from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.ext.flask.middleware import XRayMiddleware
xray_recorder.configure(service='my-service')
XRayMiddleware(app, xray_recorder)
Query:
aws xray get-trace-summaries \
--start-time 2024-10-28T00:00:00 \
--end-time 2024-10-28T23:59:59 \
--filter-expression 'error = true'
Analysis Patterns
Find Slow Traces
# Jaeger UI
- Filter by service
- Set min duration: 1000ms
- Sort by duration
# TraceQL (Tempo)
{ duration > 1s }
Find Error Traces
# Jaeger UI
- Filter by tag: error=true
- Or by HTTP status: http.status_code=500
# TraceQL (Tempo)
{ span.http.status_code >= 500 }
Find Traces by User
# Jaeger UI
- Filter by tag: user.id=user123
# TraceQL (Tempo)
{ span.user.id = "user123" }
Find N+1 Query Problems
Look for (a toy detection sketch follows this list):
- Many sequential database spans
- Same query repeated multiple times
- Pattern: API call → DB query → DB query → DB query...
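A toy version of that check, assuming spans have already been exported as plain dicts (the parent_span_id and attributes keys below are a simplified stand-in for whatever your tracing backend's API actually returns):

from collections import Counter

def find_n_plus_one(spans, threshold=5):
    """Flag (parent span, statement) pairs repeated many times in one trace."""
    repeats = Counter(
        (span["parent_span_id"], span["attributes"].get("db.statement"))
        for span in spans
        if "db.system" in span["attributes"]   # only consider database spans
    )
    return [pair for pair, count in repeats.items() if count >= threshold]

Many identical statements under one parent span is the classic N+1 signature; the usual fix is a single batched query or an eager load.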
Find Service Bottlenecks
- Identify spans with longest duration
- Check if time is spent in service logic or waiting for dependencies
- Look at span relationships (parallel vs sequential)
Integration with Logs
Trace ID in Logs
Python:
from opentelemetry import trace
def add_trace_context():
    span = trace.get_current_span()
    ctx = span.get_span_context()
    return {
        "trace_id": format(ctx.trace_id, "032x"),
        "span_id": format(ctx.span_id, "016x"),
    }

# Assumes a structured logger (e.g. structlog) that accepts keyword fields
logger.info("Processing order", **add_trace_context(), order_id=order_id)
Query logs for trace:
# Elasticsearch
GET /logs/_search
{
  "query": {
    "match": { "trace_id": "0af7651916cd43dd8448eb211c80319c" }
  }
}
# Loki (LogQL)
{job="app"} |= "0af7651916cd43dd8448eb211c80319c"
Trace from Log (Grafana)
Configure derived fields in Grafana:
datasources:
  - name: Loki
    type: loki
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: "trace_id=([\\w]+)"
          url: "http://tempo:3200/trace/$${__value.raw}"
          datasourceUid: tempo_uid
Best Practices
1. Span Naming
✅ Use operation names, not IDs
- Good: GET /api/users, UserService.GetUser, db.query.users
- Bad: /api/users/123, span_abc, query_1
2. Span Granularity
✅ One span per logical operation
- Too coarse: One span for entire request
- Too fine: Span for every variable assignment
- Just right: Span per service call, database query, external API
3. Add Context
Always include (see the sketch after this list):
- Operation name
- Service name
- Error status
- Business identifiers (user_id, order_id)
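A minimal sketch of attaching that context to a span; the service name and IDs are illustrative:

from opentelemetry import trace

tracer = trace.get_tracer("order-service")   # service / library name

with tracer.start_as_current_span("process_order") as span:   # operation name
    span.set_attribute("user.id", "user123")                  # business identifiers
    span.set_attribute("order.id", "ORD-456")
    # On failure: span.set_status(trace.Status(trace.StatusCode.ERROR))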
4. Handle Errors
try:
    result = operation()
except Exception as e:
    span.set_status(trace.Status(trace.StatusCode.ERROR))
    span.record_exception(e)
    raise
5. Sampling Strategy
- Development: 100%
- Staging: 50-100%
- Production: 1-10% (or error-based)
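One way to wire those percentages up is to pick the sampler from an environment variable at startup; this sketch assumes a variable named ENVIRONMENT and uses illustrative ratios:

import os
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ALWAYS_ON, ParentBased, TraceIdRatioBased

env = os.getenv("ENVIRONMENT", "development")   # assumed variable name
if env == "production":
    sampler = ParentBased(root=TraceIdRatioBased(0.05))   # ~5% of new traces
elif env == "staging":
    sampler = ParentBased(root=TraceIdRatioBased(0.5))    # ~50%
else:
    sampler = ALWAYS_ON                                    # development: everything

provider = TracerProvider(sampler=sampler)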
6. Performance Impact
- Overhead: ~1-5% CPU
- Use async exporters
- Batch span exports
- Sample appropriately
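In the Python SDK, batching and asynchronous export are handled by BatchSpanProcessor; the endpoint matches the earlier local-collector examples and the values below are illustrative tuning knobs, not recommendations:

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor

processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="localhost:4317", insecure=True),
    max_queue_size=2048,           # spans buffered before new spans are dropped
    schedule_delay_millis=5000,    # export at most every 5 seconds
    max_export_batch_size=512,     # spans per export request
)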
7. Cardinality
Avoid high-cardinality attributes:
- ❌ Email addresses
- ❌ Full URLs with unique IDs
- ❌ Timestamps
- ✅ User ID
- ✅ Endpoint pattern
- ✅ Status code
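For example, recording the route template rather than the raw URL keeps the attribute's value set small, while a bounded ID remains useful for lookups (snippet in the same style as the attribute examples above):

# High cardinality: one distinct value per request, poor as a grouping key
span.set_attribute("http.url", "https://api.example.com/users/123?token=abc")
# Lower cardinality: the route template groups all user lookups together
span.set_attribute("http.route", "/api/users/{id}")
span.set_attribute("user.id", "user123")   # bounded and still searchable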
Common Issues
Missing Traces
Cause: Context not propagated
Solution: Verify headers are injected/extracted
Incomplete Traces
Cause: Spans not closed properly
Solution: Always use defer span.End() or context managers
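In Python, using the span as a context manager guarantees it is ended even when the body raises; a small sketch:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# The with-block ends the span on exit, including on exceptions,
# so the trace is never left with an unfinished span
with tracer.start_as_current_span("load_user") as span:
    span.set_attribute("user.id", "user123")
    # ... do the work here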
High Overhead
Cause: Too many spans or synchronous export
Solution: Reduce span count, use a batch processor
No Error Traces
Cause: Errors not recorded on spans
Solution: Call span.record_exception() and set error status
Metrics from Traces
Generate RED metrics from trace data:
- Rate: Traces per second
- Errors: Traces with error status
- Duration: Span duration percentiles
Example (using Tempo + Prometheus):
# Generate metrics from spans
metrics_generator:
  processor:
    span_metrics:
      dimensions:
        - http.method
        - http.status_code
Query:
# Request rate
rate(traces_spanmetrics_calls_total[5m])
# Error rate
rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m])
/
rate(traces_spanmetrics_calls_total[5m])
# P95 latency
histogram_quantile(0.95, traces_spanmetrics_latency_bucket)