---
name: distributed-tracing
description: Implement distributed tracing with Jaeger and Tempo to track requests across microservices and identify performance bottlenecks. Use when debugging microservices, analyzing request flows, or implementing observability for distributed systems.
---

# Distributed Tracing

Implement distributed tracing with Jaeger and Tempo for request flow visibility across microservices.

## Purpose

Track requests across distributed systems to understand latency, dependencies, and failure points.

## When to Use

- Debug latency issues
- Understand service dependencies
- Identify bottlenecks
- Trace error propagation
- Analyze request paths

## Distributed Tracing Concepts

### Trace Structure
```
Trace (Request ID: abc123)
  ↓
Span (frontend) [100ms]
  ↓
Span (api-gateway) [80ms]
  ├→ Span (auth-service) [10ms]
  └→ Span (user-service) [60ms]
      └→ Span (database) [40ms]
```
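
The diagram above can be turned into a small calculation that shows where time is actually spent. The sketch below is illustrative only: `SpanNode` and `self_times` are hypothetical names, and it assumes child spans run one after another without overlapping, which real traces do not guarantee.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SpanNode:
    """One span in the tree: a name, a total duration, and child spans."""
    name: str
    duration_ms: int
    children: List["SpanNode"] = field(default_factory=list)


def self_times(span: SpanNode, out: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Map each span name to its duration minus the summed duration of its direct children."""
    if out is None:
        out = {}
    child_total = sum(child.duration_ms for child in span.children)
    out[span.name] = span.duration_ms - child_total
    for child in span.children:
        self_times(child, out)
    return out


# The same tree as in the diagram above.
trace_tree = SpanNode("frontend", 100, [
    SpanNode("api-gateway", 80, [
        SpanNode("auth-service", 10),
        SpanNode("user-service", 60, [SpanNode("database", 40)]),
    ]),
])

result = self_times(trace_tree)
slowest = max(result, key=result.get)
```

Under these assumptions the database span accounts for the largest share of self time, which is the kind of conclusion a trace waterfall view is meant to make obvious.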
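
### Key Components
- **Trace** - End-to-end journey of one request, identified by a trace ID shared by all of its spans
- **Span** - Single timed operation within a trace, with a start time, a duration, and an optional parent span
- **Context** - Trace ID, span ID, and sampling flags propagated between services
- **Tags** - Key-value pairs attached to a span for filtering (called attributes in OpenTelemetry)
- **Logs** - Timestamped events within a span (called span events in OpenTelemetry)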
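
To make the relationship between these components concrete, here is a minimal data-model sketch. It is not the Jaeger or OpenTelemetry data model; the `Span` class, its field names, and `find_by_tag` are hypothetical and chosen only to mirror the list above.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Span:
    trace_id: str                      # shared by every span in the trace
    span_id: str                       # unique per operation
    name: str
    parent_id: Optional[str] = None    # None marks the root span
    tags: Dict[str, str] = field(default_factory=dict)
    logs: List[Tuple[float, str]] = field(default_factory=list)

    def log(self, message: str) -> None:
        """Record a timestamped event on this span."""
        self.logs.append((time.time(), message))


def find_by_tag(spans: List[Span], key: str, value: str) -> List[Span]:
    """Return the spans whose tag `key` equals `value` exactly."""
    return [span for span in spans if span.tags.get(key) == value]


root = Span("abc123", "s1", "GET /api/users", tags={"http.method": "GET"})
child = Span("abc123", "s2", "database_query", parent_id="s1",
             tags={"db.system": "postgresql"})
child.log("query started")

matches = find_by_tag([root, child], "db.system", "postgresql")
```

The context propagated between services corresponds to the `trace_id` and the calling span's `span_id`, which becomes the `parent_id` of the next span.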

## Jaeger Setup

### Kubernetes Deployment

```bash
# Deploy Jaeger Operator (recent operator releases expect cert-manager to be installed first)
kubectl create namespace observability
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n observability

# Deploy Jaeger instance
kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
  ingress:
    enabled: true
EOF
```

### Docker Compose

```yaml
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "5775:5775/udp"   # Agent: zipkin.thrift compact (legacy)
      - "6831:6831/udp"   # Agent: jaeger.thrift compact
      - "6832:6832/udp"   # Agent: jaeger.thrift binary
      - "5778:5778"       # Agent: config and sampling strategies
      - "16686:16686"     # UI
      - "14268:14268"     # Collector (HTTP)
      - "14250:14250"     # Collector (gRPC)
      - "9411:9411"       # Zipkin
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411
```

**Reference:** See `references/jaeger-setup.md`

## Application Instrumentation

### OpenTelemetry (Recommended)

#### Python (Flask)
```python
from flask import Flask
from opentelemetry import trace
# Note: the Jaeger Thrift exporter is deprecated upstream; Jaeger also accepts OTLP.
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Initialize tracer
resource = Resource(attributes={SERVICE_NAME: "my-service"})
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Instrument Flask
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

@app.route('/api/users')
def get_users():
    with tracer.start_as_current_span("get_users") as span:
        # Business logic
        users = fetch_users_from_db()
        span.set_attribute("user.count", len(users))
        return {"users": users}

def fetch_users_from_db():
    # Child span: nests under get_users because that span is current
    with tracer.start_as_current_span("database_query") as span:
        span.set_attribute("db.system", "postgresql")
        span.set_attribute("db.statement", "SELECT * FROM users")
        # Database query
        return query_database()

def query_database():
    # Placeholder for the real database call
    return []
```

#### Node.js (Express)
```javascript
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const { Resource } = require('@opentelemetry/resources');
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');

// Initialize tracer (OpenTelemetry JS SDK 1.x API)
const provider = new NodeTracerProvider({
  resource: new Resource({ 'service.name': 'my-service' }),
});

const exporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces'
});

provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

// Instrument libraries (must run before express is required)
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});

const express = require('express');
const app = express();

app.get('/api/users', async (req, res) => {
  const tracer = trace.getTracer('my-service');
  const span = tracer.startSpan('get_users');

  try {
    // fetchUsers() is a placeholder for your data access code
    const users = await fetchUsers();
    span.setAttributes({ 'user.count': users.length });
    res.json({ users });
  } catch (err) {
    // Record the failure on the span before responding
    span.recordException(err);
    span.setStatus({ code: SpanStatusCode.ERROR });
    res.status(500).json({ error: 'internal error' });
  } finally {
    span.end();
  }
});
```

#### Go
```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	// Note: the Jaeger exporter is deprecated upstream; Jaeger also accepts OTLP.
	"go.opentelemetry.io/otel/exporters/jaeger"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

func initTracer() (*sdktrace.TracerProvider, error) {
	exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
		jaeger.WithEndpoint("http://jaeger:14268/api/traces"),
	))
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("my-service"),
		)),
	)

	otel.SetTracerProvider(tp)
	return tp, nil
}

// User and fetchUsersFromDB are placeholders for your own types and data access code.
func getUsers(ctx context.Context) ([]User, error) {
	tracer := otel.Tracer("my-service")
	ctx, span := tracer.Start(ctx, "get_users")
	defer span.End()

	span.SetAttributes(attribute.String("user.filter", "active"))

	// Pass ctx down so spans started in fetchUsersFromDB become children of get_users
	users, err := fetchUsersFromDB(ctx)
	if err != nil {
		span.RecordError(err)
		return nil, err
	}

	span.SetAttributes(attribute.Int("user.count", len(users)))
	return users, nil
}
```

**Reference:** See `references/instrumentation.md`

## Context Propagation

### HTTP Headers
```
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
```
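
The `traceparent` value above follows the W3C Trace Context layout `version-traceid-parentid-flags`. The sketch below parses it with the standard library only, which can help when inspecting headers by hand. It handles the version `00` layout shown here; `TraceParent` and `parse_traceparent` are hypothetical helper names, not part of any tracing library.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a traceparent header value; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid in the W3C format.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # The lowest bit of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


parsed = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
```

In the example header the trailing `01` means the caller sampled this trace, so downstream services are expected to record it as well.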

### Propagation in HTTP Requests

#### Python
```python
import requests
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # Injects the current trace context into the headers dict

response = requests.get('http://downstream-service/api', headers=headers)
```

#### Node.js
```javascript
const axios = require('axios');
const { context, propagation } = require('@opentelemetry/api');

const headers = {};
propagation.inject(context.active(), headers);  // Injects the active trace context

axios.get('http://downstream-service/api', { headers });
```
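
Conceptually, injection keeps the trace ID, swaps in the calling span's ID as the new parent ID, and carries the flags forward. The following stdlib-only sketch illustrates that idea for a service that received a `traceparent` header and calls another service. It is a simplification under stated assumptions, not how the OpenTelemetry propagator is implemented; `child_traceparent` and `outgoing_headers` are hypothetical names.

```python
import secrets
from typing import Dict


def child_traceparent(incoming: str) -> str:
    """Build the header for an outgoing call: same trace ID, new span ID, same flags."""
    version, trace_id, _parent_id, flags = incoming.split("-")
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"{version}-{trace_id}-{new_span_id}-{flags}"


def outgoing_headers(incoming_headers: Dict[str, str]) -> Dict[str, str]:
    """Derive propagation headers for a downstream request from the incoming ones."""
    headers: Dict[str, str] = {}
    traceparent = incoming_headers.get("traceparent")
    if traceparent:
        headers["traceparent"] = child_traceparent(traceparent)
        # tracestate is forwarded unchanged in this sketch.
        if "tracestate" in incoming_headers:
            headers["tracestate"] = incoming_headers["tracestate"]
    return headers


incoming = {
    "traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01",
    "tracestate": "congo=t61rcWkgMzE",
}
out = outgoing_headers(incoming)
out_parts = out["traceparent"].split("-")
```

If a service forgets this step, the downstream service starts a fresh trace and the request appears as two unrelated traces in the backend.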

## Tempo Setup (Grafana)

### Kubernetes Deployment

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tempo-config
data:
  tempo.yaml: |
    server:
      http_listen_port: 3200

    distributor:
      receivers:
        jaeger:
          protocols:
            thrift_http:
            grpc:
        otlp:
          protocols:
            http:
            grpc:

    storage:
      trace:
        backend: s3
        s3:
          bucket: tempo-traces
          endpoint: s3.amazonaws.com

    querier:
      frontend_worker:
        frontend_address: tempo-query-frontend:9095
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tempo
spec:
  replicas: 1
  selector:              # required by apps/v1; must match the pod template labels
    matchLabels:
      app: tempo         # label name is an assumption, pick your own
  template:
    metadata:
      labels:
        app: tempo
    spec:
      containers:
        - name: tempo
          image: grafana/tempo:latest
          args:
            - -config.file=/etc/tempo/tempo.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/tempo
      volumes:
        - name: config
          configMap:
            name: tempo-config
```

**Reference:** See `assets/jaeger-config.yaml.template`

## Sampling Strategies

### Probabilistic Sampling
```yaml
# Sample 1% of traces
sampler:
  type: probabilistic
  param: 0.01
```
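
A probabilistic sampler is usually implemented as a pure function of the trace ID, so every service reaches the same decision for the same trace without coordination. The sketch below shows one common way to do that, comparing the low 64 bits of the ID against a threshold. It is an illustration of the idea, not the exact algorithm of Jaeger or any specific SDK; `should_sample` is a hypothetical name.

```python
import random


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep a trace when the low 64 bits of its ID fall below ratio * 2**64."""
    bound = int(ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound


# Simulate 100,000 random 128-bit trace IDs with a fixed seed for repeatability.
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
kept = sum(should_sample(trace_id, 0.01) for trace_id in trace_ids)
rate = kept / len(trace_ids)
```

With `param: 0.01` the observed `rate` should land close to 1%, and calling `should_sample` twice on the same ID always gives the same answer.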

### Rate Limiting Sampling
```yaml
# Sample max 100 traces per second
sampler:
  type: ratelimiting
  param: 100
```
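
Rate-limiting samplers are commonly built on a token bucket: credits refill at the configured rate and each sampled trace spends one. The sketch below is a minimal version of that technique, not Jaeger's implementation. `RateLimitingSampler` is a hypothetical class, and the clock is passed in explicitly so the behavior is easy to verify; a real service would pass `time.monotonic()`.

```python
class RateLimitingSampler:
    """Token bucket that allows at most `max_traces_per_second` sampled traces."""

    def __init__(self, max_traces_per_second: float):
        self.rate = max_traces_per_second
        self.balance = max_traces_per_second  # start with a full bucket
        self.last = None                      # timestamp of the previous call

    def should_sample(self, now: float) -> bool:
        if self.last is not None:
            # Refill credits for the elapsed time, capped at one second's worth.
            elapsed = now - self.last
            self.balance = min(self.rate, self.balance + elapsed * self.rate)
        self.last = now
        if self.balance >= 1.0:
            self.balance -= 1.0
            return True
        return False


sampler = RateLimitingSampler(100)
# 1000 requests arrive at t=0.0, then another 1000 half a second later.
burst = sum(sampler.should_sample(now=0.0) for _ in range(1000))
later = sum(sampler.should_sample(now=0.5) for _ in range(1000))
```

Unlike probabilistic sampling, the number of kept traces stays bounded during traffic spikes, at the cost of a sampling probability that varies with load.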

### Parent-Based Ratio Sampling (OpenTelemetry)
```python
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Root spans: keep about 1% of traces, decided deterministically from the trace ID.
# Child spans: follow the parent's decision so sampled traces stay complete.
sampler = ParentBased(root=TraceIdRatioBased(0.01))
```

## Trace Analysis

### Finding Slow Requests

**Jaeger Query:**
```
service=my-service
duration > 1s
```

### Finding Errors

**Jaeger Query:**
```
service=my-service
error=true
tags.http.status_code >= 500
```
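
The filters above are written informally. When scripting the same searches, one option is the HTTP endpoint that the Jaeger UI itself calls. The sketch below only builds the request URL. The `/api/traces` path and the `service`, `minDuration`, `tags`, and `limit` parameters reflect how the UI queries the backend, but that endpoint is not a formally stable API, so treat them as assumptions to verify against your Jaeger version. Tag filters there match exact values, so a range such as `>= 500` would need client-side filtering. `build_trace_query` and the `http://jaeger:16686` base URL are hypothetical.

```python
import json
from typing import Dict, Optional
from urllib.parse import parse_qs, urlencode, urlsplit


def build_trace_query(base_url: str, service: str,
                      min_duration: Optional[str] = None,
                      tags: Optional[Dict[str, str]] = None,
                      limit: int = 20) -> str:
    """Build a trace search URL; tags are sent as a JSON object of exact matches."""
    params = {"service": service, "limit": str(limit)}
    if min_duration:
        params["minDuration"] = min_duration
    if tags:
        params["tags"] = json.dumps(tags, sort_keys=True)
    return f"{base_url.rstrip('/')}/api/traces?{urlencode(params)}"


slow_url = build_trace_query("http://jaeger:16686", "my-service", min_duration="1s")
error_url = build_trace_query("http://jaeger:16686", "my-service", tags={"error": "true"})

# Decode the query strings again to show what was encoded.
slow_params = parse_qs(urlsplit(slow_url).query)
error_tags = json.loads(parse_qs(urlsplit(error_url).query)["tags"][0])
```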

### Service Dependency Graph

Jaeger automatically generates service dependency graphs showing:
- Service relationships
- Request rates
- Error rates
- Average latencies

## Best Practices

1. **Sample appropriately** (1-10% in production)
2. **Add meaningful tags** (user_id, request_id)
3. **Propagate context** across all service boundaries
4. **Log exceptions** in spans
5. **Use consistent naming** for operations
6. **Monitor tracing overhead** (<1% CPU impact)
7. **Set up alerts** for trace errors
8. **Implement distributed context** (baggage)
9. **Use span events** for important milestones
10. **Document instrumentation** standards

## Integration with Logging

### Correlated Logs
```python
import logging
from opentelemetry import trace

logger = logging.getLogger(__name__)

def process_request():
    span = trace.get_current_span()
    trace_id = span.get_span_context().trace_id

    logger.info(
        "Processing request",
        extra={"trace_id": format(trace_id, '032x')}
    )
```
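
Passing `extra=` on every call is easy to forget, so one alternative is a `logging.Filter` that stamps the IDs on each record. The sketch below uses only the standard library. `TraceContextFilter` and `fake_ids` are hypothetical, and in a real service the callable would wrap `trace.get_current_span().get_span_context()` as in the snippet above.

```python
import io
import logging


class TraceContextFilter(logging.Filter):
    """Attach hex-formatted trace_id and span_id attributes to every log record."""

    def __init__(self, get_ids):
        super().__init__()
        self._get_ids = get_ids  # callable returning (trace_id, span_id) as integers

    def filter(self, record):
        trace_id, span_id = self._get_ids()
        record.trace_id = format(trace_id, "032x")
        record.span_id = format(span_id, "016x")
        return True  # never drop the record, only enrich it


def fake_ids():
    # Stand-in for the current span context; fixed values keep the demo repeatable.
    return 0x0AF7651916CD43DD8448EB211C80319C, 0xB7AD6B7169203331


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))
handler.addFilter(TraceContextFilter(fake_ids))

demo_logger = logging.getLogger("tracing-demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False
demo_logger.addHandler(handler)

demo_logger.info("Processing request")
output = stream.getvalue()
```

With the trace ID on every log line, a log search for that ID returns all messages belonging to one request across services.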

## Troubleshooting

**No traces appearing:**
- Check collector endpoint
- Verify network connectivity
- Check sampling configuration
- Review application logs

**High latency overhead:**
- Reduce sampling rate
- Use batch span processor
- Check exporter configuration
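
For the "No traces appearing" checks, a quick TCP probe from the application's host or pod can rule out basic connectivity problems. The sketch below is a generic helper, with `can_connect` as a hypothetical name. It applies to TCP endpoints such as the collector's HTTP port (14268 in the Docker Compose example); UDP agent ports such as 6831 cannot be verified this way.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Demo against a throwaway local listener so the sketch runs without a collector.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))          # port 0 lets the OS pick a free port
listener.listen(1)
open_port = listener.getsockname()[1]

reachable = can_connect("127.0.0.1", open_port)
listener.close()
unreachable = can_connect("127.0.0.1", open_port)
```

In a real check you would call something like `can_connect("jaeger", 14268)` from inside the application's network namespace, since a probe from your workstation may follow a different network path.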

## Reference Files

- `references/jaeger-setup.md` - Jaeger installation
- `references/instrumentation.md` - Instrumentation patterns
- `assets/jaeger-config.yaml.template` - Jaeger configuration

## Related Skills

- `prometheus-configuration` - For metrics
- `grafana-dashboards` - For visualization
- `slo-implementation` - For latency SLOs