Initial commit
This commit is contained in:
438
skills/distributed-tracing/SKILL.md
Normal file
438
skills/distributed-tracing/SKILL.md
Normal file
@@ -0,0 +1,438 @@
|
||||
---
|
||||
name: distributed-tracing
|
||||
description: Implement distributed tracing with Jaeger and Tempo to track requests across microservices and identify performance bottlenecks. Use when debugging microservices, analyzing request flows, or implementing observability for distributed systems.
|
||||
---
|
||||
|
||||
# Distributed Tracing
|
||||
|
||||
Implement distributed tracing with Jaeger and Tempo for request flow visibility across microservices.
|
||||
|
||||
## Purpose
|
||||
|
||||
Track requests across distributed systems to understand latency, dependencies, and failure points.
|
||||
|
||||
## When to Use
|
||||
|
||||
- Debug latency issues
|
||||
- Understand service dependencies
|
||||
- Identify bottlenecks
|
||||
- Trace error propagation
|
||||
- Analyze request paths
|
||||
|
||||
## Distributed Tracing Concepts
|
||||
|
||||
### Trace Structure
|
||||
```
|
||||
Trace (Request ID: abc123)
|
||||
↓
|
||||
Span (frontend) [100ms]
|
||||
↓
|
||||
Span (api-gateway) [80ms]
|
||||
├→ Span (auth-service) [10ms]
|
||||
└→ Span (user-service) [60ms]
|
||||
└→ Span (database) [40ms]
|
||||
```
|
||||
|
||||
### Key Components
|
||||
- **Trace** - End-to-end request journey
|
||||
- **Span** - Single operation within a trace
|
||||
- **Context** - Metadata propagated between services
|
||||
- **Tags** - Key-value pairs for filtering
|
||||
- **Logs** - Timestamped events within a span
|
||||
|
||||
## Jaeger Setup
|
||||
|
||||
### Kubernetes Deployment
|
||||
|
||||
```bash
|
||||
# Deploy Jaeger Operator
|
||||
kubectl create namespace observability
|
||||
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n observability
|
||||
|
||||
# Deploy Jaeger instance
|
||||
kubectl apply -f - <<EOF
|
||||
apiVersion: jaegertracing.io/v1
|
||||
kind: Jaeger
|
||||
metadata:
|
||||
name: jaeger
|
||||
namespace: observability
|
||||
spec:
|
||||
strategy: production
|
||||
storage:
|
||||
type: elasticsearch
|
||||
options:
|
||||
es:
|
||||
server-urls: http://elasticsearch:9200
|
||||
ingress:
|
||||
enabled: true
|
||||
EOF
|
||||
```
|
||||
|
||||
### Docker Compose
|
||||
|
||||
```yaml
|
||||
version: '3.8'
|
||||
services:
|
||||
jaeger:
|
||||
image: jaegertracing/all-in-one:latest
|
||||
ports:
|
||||
- "5775:5775/udp"
|
||||
- "6831:6831/udp"
|
||||
- "6832:6832/udp"
|
||||
- "5778:5778"
|
||||
- "16686:16686" # UI
|
||||
- "14268:14268" # Collector
|
||||
- "14250:14250" # gRPC
|
||||
- "9411:9411" # Zipkin
|
||||
environment:
|
||||
- COLLECTOR_ZIPKIN_HOST_PORT=:9411
|
||||
```
|
||||
|
||||
**Reference:** See `references/jaeger-setup.md`
|
||||
|
||||
## Application Instrumentation
|
||||
|
||||
### OpenTelemetry (Recommended)
|
||||
|
||||
#### Python (Flask)
|
||||
```python
|
||||
from opentelemetry import trace
|
||||
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
|
||||
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
|
||||
from opentelemetry.sdk.trace import TracerProvider
|
||||
from opentelemetry.sdk.trace.export import BatchSpanProcessor
|
||||
from opentelemetry.instrumentation.flask import FlaskInstrumentor
|
||||
from flask import Flask
|
||||
|
||||
# Initialize tracer
|
||||
resource = Resource(attributes={SERVICE_NAME: "my-service"})
|
||||
provider = TracerProvider(resource=resource)
|
||||
processor = BatchSpanProcessor(JaegerExporter(
|
||||
agent_host_name="jaeger",
|
||||
agent_port=6831,
|
||||
))
|
||||
provider.add_span_processor(processor)
|
||||
trace.set_tracer_provider(provider)
|
||||
|
||||
# Instrument Flask
|
||||
app = Flask(__name__)
|
||||
FlaskInstrumentor().instrument_app(app)
|
||||
|
||||
@app.route('/api/users')
|
||||
def get_users():
|
||||
tracer = trace.get_tracer(__name__)
|
||||
|
||||
with tracer.start_as_current_span("get_users") as span:
|
||||
span.set_attribute("user.count", 100)
|
||||
# Business logic
|
||||
users = fetch_users_from_db()
|
||||
return {"users": users}
|
||||
|
||||
def fetch_users_from_db():
|
||||
tracer = trace.get_tracer(__name__)
|
||||
|
||||
with tracer.start_as_current_span("database_query") as span:
|
||||
span.set_attribute("db.system", "postgresql")
|
||||
span.set_attribute("db.statement", "SELECT * FROM users")
|
||||
# Database query
|
||||
return query_database()
|
||||
```
|
||||
|
||||
#### Node.js (Express)
|
||||
```javascript
|
||||
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
|
||||
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
|
||||
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
|
||||
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
|
||||
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
|
||||
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
|
||||
|
||||
// Initialize tracer
|
||||
const provider = new NodeTracerProvider({
|
||||
resource: { attributes: { 'service.name': 'my-service' } }
|
||||
});
|
||||
|
||||
const exporter = new JaegerExporter({
|
||||
endpoint: 'http://jaeger:14268/api/traces'
|
||||
});
|
||||
|
||||
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
|
||||
provider.register();
|
||||
|
||||
// Instrument libraries
|
||||
registerInstrumentations({
|
||||
instrumentations: [
|
||||
new HttpInstrumentation(),
|
||||
new ExpressInstrumentation(),
|
||||
],
|
||||
});
|
||||
|
||||
const express = require('express');
|
||||
const app = express();
|
||||
|
||||
app.get('/api/users', async (req, res) => {
|
||||
const tracer = trace.getTracer('my-service');
|
||||
const span = tracer.startSpan('get_users');
|
||||
|
||||
try {
|
||||
const users = await fetchUsers();
|
||||
span.setAttributes({ 'user.count': users.length });
|
||||
res.json({ users });
|
||||
} finally {
|
||||
span.end();
|
||||
}
|
||||
});
|
||||
```
|
||||
|
||||
#### Go
|
||||
```go
|
||||
package main
|
||||
|
||||
import (
|
||||
"context"
|
||||
"go.opentelemetry.io/otel"
|
||||
"go.opentelemetry.io/otel/exporters/jaeger"
|
||||
"go.opentelemetry.io/otel/sdk/resource"
|
||||
sdktrace "go.opentelemetry.io/otel/sdk/trace"
|
||||
semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
|
||||
)
|
||||
|
||||
func initTracer() (*sdktrace.TracerProvider, error) {
|
||||
exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
|
||||
jaeger.WithEndpoint("http://jaeger:14268/api/traces"),
|
||||
))
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
|
||||
tp := sdktrace.NewTracerProvider(
|
||||
sdktrace.WithBatcher(exporter),
|
||||
sdktrace.WithResource(resource.NewWithAttributes(
|
||||
semconv.SchemaURL,
|
||||
semconv.ServiceNameKey.String("my-service"),
|
||||
)),
|
||||
)
|
||||
|
||||
otel.SetTracerProvider(tp)
|
||||
return tp, nil
|
||||
}
|
||||
|
||||
func getUsers(ctx context.Context) ([]User, error) {
|
||||
tracer := otel.Tracer("my-service")
|
||||
ctx, span := tracer.Start(ctx, "get_users")
|
||||
defer span.End()
|
||||
|
||||
span.SetAttributes(attribute.String("user.filter", "active"))
|
||||
|
||||
users, err := fetchUsersFromDB(ctx)
|
||||
if err != nil {
|
||||
span.RecordError(err)
|
||||
return nil, err
|
||||
}
|
||||
|
||||
span.SetAttributes(attribute.Int("user.count", len(users)))
|
||||
return users, nil
|
||||
}
|
||||
```
|
||||
|
||||
**Reference:** See `references/instrumentation.md`
|
||||
|
||||
## Context Propagation
|
||||
|
||||
### HTTP Headers
|
||||
```
|
||||
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
|
||||
tracestate: congo=t61rcWkgMzE
|
||||
```
|
||||
|
||||
### Propagation in HTTP Requests
|
||||
|
||||
#### Python
|
||||
```python
|
||||
from opentelemetry.propagate import inject
|
||||
|
||||
headers = {}
|
||||
inject(headers) # Injects trace context
|
||||
|
||||
response = requests.get('http://downstream-service/api', headers=headers)
|
||||
```
|
||||
|
||||
#### Node.js
|
||||
```javascript
|
||||
const { propagation } = require('@opentelemetry/api');
|
||||
|
||||
const headers = {};
|
||||
propagation.inject(context.active(), headers);
|
||||
|
||||
axios.get('http://downstream-service/api', { headers });
|
||||
```
|
||||
|
||||
## Tempo Setup (Grafana)
|
||||
|
||||
### Kubernetes Deployment
|
||||
|
||||
```yaml
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: tempo-config
|
||||
data:
|
||||
tempo.yaml: |
|
||||
server:
|
||||
http_listen_port: 3200
|
||||
|
||||
distributor:
|
||||
receivers:
|
||||
jaeger:
|
||||
protocols:
|
||||
thrift_http:
|
||||
grpc:
|
||||
otlp:
|
||||
protocols:
|
||||
http:
|
||||
grpc:
|
||||
|
||||
storage:
|
||||
trace:
|
||||
backend: s3
|
||||
s3:
|
||||
bucket: tempo-traces
|
||||
endpoint: s3.amazonaws.com
|
||||
|
||||
querier:
|
||||
frontend_worker:
|
||||
frontend_address: tempo-query-frontend:9095
|
||||
---
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: tempo
|
||||
spec:
|
||||
replicas: 1
|
||||
template:
|
||||
spec:
|
||||
containers:
|
||||
- name: tempo
|
||||
image: grafana/tempo:latest
|
||||
args:
|
||||
- -config.file=/etc/tempo/tempo.yaml
|
||||
volumeMounts:
|
||||
- name: config
|
||||
mountPath: /etc/tempo
|
||||
volumes:
|
||||
- name: config
|
||||
configMap:
|
||||
name: tempo-config
|
||||
```
|
||||
|
||||
**Reference:** See `assets/jaeger-config.yaml.template`
|
||||
|
||||
## Sampling Strategies
|
||||
|
||||
### Probabilistic Sampling
|
||||
```yaml
|
||||
# Sample 1% of traces
|
||||
sampler:
|
||||
type: probabilistic
|
||||
param: 0.01
|
||||
```
|
||||
|
||||
### Rate Limiting Sampling
|
||||
```yaml
|
||||
# Sample max 100 traces per second
|
||||
sampler:
|
||||
type: ratelimiting
|
||||
param: 100
|
||||
```
|
||||
|
||||
### Adaptive Sampling
|
||||
```python
|
||||
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
|
||||
|
||||
# Sample based on trace ID (deterministic)
|
||||
sampler = ParentBased(root=TraceIdRatioBased(0.01))
|
||||
```
|
||||
|
||||
## Trace Analysis
|
||||
|
||||
### Finding Slow Requests
|
||||
|
||||
**Jaeger Query:**
|
||||
```
|
||||
service=my-service
|
||||
duration > 1s
|
||||
```
|
||||
|
||||
### Finding Errors
|
||||
|
||||
**Jaeger Query:**
|
||||
```
|
||||
service=my-service
|
||||
error=true
|
||||
tags.http.status_code >= 500
|
||||
```
|
||||
|
||||
### Service Dependency Graph
|
||||
|
||||
Jaeger automatically generates service dependency graphs showing:
|
||||
- Service relationships
|
||||
- Request rates
|
||||
- Error rates
|
||||
- Average latencies
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Sample appropriately** (1-10% in production)
|
||||
2. **Add meaningful tags** (user_id, request_id)
|
||||
3. **Propagate context** across all service boundaries
|
||||
4. **Log exceptions** in spans
|
||||
5. **Use consistent naming** for operations
|
||||
6. **Monitor tracing overhead** (<1% CPU impact)
|
||||
7. **Set up alerts** for trace errors
|
||||
8. **Implement distributed context** (baggage)
|
||||
9. **Use span events** for important milestones
|
||||
10. **Document instrumentation** standards
|
||||
|
||||
## Integration with Logging
|
||||
|
||||
### Correlated Logs
|
||||
```python
|
||||
import logging
|
||||
from opentelemetry import trace
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
def process_request():
|
||||
span = trace.get_current_span()
|
||||
trace_id = span.get_span_context().trace_id
|
||||
|
||||
logger.info(
|
||||
"Processing request",
|
||||
extra={"trace_id": format(trace_id, '032x')}
|
||||
)
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
**No traces appearing:**
|
||||
- Check collector endpoint
|
||||
- Verify network connectivity
|
||||
- Check sampling configuration
|
||||
- Review application logs
|
||||
|
||||
**High latency overhead:**
|
||||
- Reduce sampling rate
|
||||
- Use batch span processor
|
||||
- Check exporter configuration
|
||||
|
||||
## Reference Files
|
||||
|
||||
- `references/jaeger-setup.md` - Jaeger installation
|
||||
- `references/instrumentation.md` - Instrumentation patterns
|
||||
- `assets/jaeger-config.yaml.template` - Jaeger configuration
|
||||
|
||||
## Related Skills
|
||||
|
||||
- `prometheus-configuration` - For metrics
|
||||
- `grafana-dashboards` - For visualization
|
||||
- `slo-implementation` - For latency SLOs
|
||||
Reference in New Issue
Block a user