Troubleshooting Guide
Common issues and solutions for Claude Code OpenTelemetry setup.
Container Issues
Docker Not Running
Symptom: Cannot connect to the Docker daemon
Diagnosis:
docker info
Solutions:
- Start Docker Desktop application
- Wait for Docker to fully initialize
- Check system tray for Docker icon
- Verify Docker daemon is running:
ps aux | grep docker
Containers Won't Start
Symptom: Containers exit immediately after docker compose up
Diagnosis:
# Check container logs
docker compose logs
# Check specific service
docker compose logs otel-collector
docker compose logs prometheus
Common Causes:
1. OTEL Collector Configuration Error
# Check for errors
docker compose logs otel-collector | grep -i error
# Common issues:
# - Deprecated logging exporter
# - Deprecated 'address' field in telemetry.metrics
Solution A - Deprecated logging exporter:
Update otel-collector-config.yml:
exporters:
  debug:
    verbosity: normal
# NOT:
#  logging:
#    loglevel: info
Solution B - Deprecated 'address' field (v0.123.0+):
If logs show: 'address' has invalid keys or similar error:
Update otel-collector-config.yml:
service:
  telemetry:
    metrics:
      level: detailed
      # REMOVE this line (deprecated in v0.123.0+):
      # address: ":8888"
The address field in service.telemetry.metrics is deprecated in newer OTEL Collector versions. Remove it, and the collector falls back to its default internal metrics endpoint.
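Both deprecations can be checked for before restarting the collector. The following is a minimal sketch, not part of the original setup: check_collector_config is a hypothetical helper name, and the check is a line-based grep rather than a YAML parse, so an uncommented address key anywhere in the file is flagged.

```shell
# Hypothetical helper: flag the two deprecated settings described above.
# Usage: check_collector_config path/to/otel-collector-config.yml
check_collector_config() {
  config=$1
  found=0
  # An uncommented "logging:" key means the old logging exporter is still configured.
  if grep -Eq '^[[:space:]]*logging:' "$config"; then
    echo "deprecated: logging exporter (use debug instead)"
    found=1
  fi
  # An uncommented "address:" key is treated as the deprecated telemetry.metrics field.
  if grep -Eq '^[[:space:]]*address:' "$config"; then
    echo "deprecated: address field under service.telemetry.metrics"
    found=1
  fi
  if [ "$found" -eq 0 ]; then
    echo "no deprecated settings found"
  fi
  # Exit status 0 means clean, 1 means at least one deprecated key was seen.
  return "$found"
}
```

Commented-out lines are ignored, so a config that keeps the old keys as "# NOT:" notes still passes.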
2. Port Already in Use
# Check which ports are in use
lsof -i :3000 # Grafana
lsof -i :4317 # OTEL gRPC
lsof -i :4318 # OTEL HTTP
lsof -i :8889 # OTEL Prometheus exporter
lsof -i :9090 # Prometheus
lsof -i :3100 # Loki
Solution:
- Stop conflicting service
- Or change port in docker-compose.yml
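The per-port lsof checks can be wrapped in one loop. This is a sketch only: check_ports is a hypothetical name, the port list is copied from the commands above, and it assumes lsof exits non-zero when nothing is using the port.

```shell
# Hypothetical helper: report which of the stack's ports are already in use.
check_ports() {
  for entry in 3000:Grafana 4317:OTEL-gRPC 4318:OTEL-HTTP \
               8889:OTEL-Prometheus-exporter 9090:Prometheus 3100:Loki; do
    port=${entry%%:*}   # text before the first colon
    name=${entry#*:}    # text after the first colon
    if lsof -i ":$port" >/dev/null 2>&1; then
      echo "port $port ($name): IN USE"
    else
      echo "port $port ($name): free"
    fi
  done
}
```

Run it while the stack is stopped; once the containers are up, their own ports are expected to show as in use.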
3. Volume Permission Issues
# Check volume permissions
docker volume ls
docker volume inspect claude-telemetry_prometheus-data
Solution:
# Remove and recreate volumes
docker compose down -v
docker compose up -d
Containers Keep Restarting
Symptom: Container status shows "Restarting"
Diagnosis:
docker compose ps
docker compose logs --tail=50 <service-name>
Solutions:
- Check memory limits: Increase memory_limiter in OTEL config
- Check disk space:
df -h
- Check for configuration errors in logs
- Restart Docker Desktop
Claude Code Settings Issues
🚨 CRITICAL: Telemetry Not Sending (Most Common Issue)
Symptom: No metrics appearing in Prometheus after Claude Code restart
ROOT CAUSE (90% of cases): Missing required exporter environment variables
Even when CLAUDE_CODE_ENABLE_TELEMETRY=1 is set, telemetry will not send without explicit exporter configuration. This is the #1 most common issue.
Diagnosis Checklist:
1. Check REQUIRED exporters (MOST IMPORTANT):
jq '.env.OTEL_METRICS_EXPORTER' ~/.claude/settings.json
# Must return: "otlp" (NOT null, NOT missing)
jq '.env.OTEL_LOGS_EXPORTER' ~/.claude/settings.json
# Should return: "otlp" (recommended for event tracking)
If either returns null or is missing, this is your problem!
2. Verify telemetry is enabled:
jq '.env.CLAUDE_CODE_ENABLE_TELEMETRY' ~/.claude/settings.json
# Should return: "1"
3. Check OTEL endpoint:
jq '.env.OTEL_EXPORTER_OTLP_ENDPOINT' ~/.claude/settings.json
# Should return: "http://localhost:4317" (for local setup)
4. Verify JSON is valid:
jq empty ~/.claude/settings.json
# No output = valid JSON
5. Check if Claude Code was restarted:
# Telemetry config only loads at startup!
# Must quit and restart Claude Code completely
6. Test OTEL endpoint connectivity:
nc -zv localhost 4317
# Should show: Connection to localhost port 4317 [tcp/*] succeeded!
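The settings checks in this list can be run in one pass. The sketch below is an illustration under assumptions: check_telemetry_settings is a hypothetical name, it covers only the three values whose expected result is fixed, and it leaves out the endpoint because that value differs outside a local setup.

```shell
# Hypothetical helper: compare the env values this checklist inspects by hand.
# Usage: check_telemetry_settings [path/to/settings.json]
check_telemetry_settings() {
  file=${1:-"$HOME/.claude/settings.json"}
  failures=0
  for pair in \
    CLAUDE_CODE_ENABLE_TELEMETRY=1 \
    OTEL_METRICS_EXPORTER=otlp \
    OTEL_LOGS_EXPORTER=otlp; do
    key=${pair%%=*}
    expected=${pair#*=}
    # The // operator substitutes MISSING when the key is absent or null.
    actual=$(jq -r --arg k "$key" '.env[$k] // "MISSING"' "$file")
    if [ "$actual" = "$expected" ]; then
      echo "OK   $key=$actual"
    else
      echo "FAIL $key: expected $expected, got $actual"
      failures=$((failures + 1))
    fi
  done
  # Exit status is the number of failed checks (0 means all passed).
  return "$failures"
}
```

A FAIL line for either exporter variable points at the most common cause described above.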
Solutions:
If exporters are missing (MOST COMMON):
Add these REQUIRED settings to ~/.claude/settings.json:
{
  "env": {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "OTEL_METRICS_EXPORTER": "otlp",
    "OTEL_LOGS_EXPORTER": "otlp",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc"
  }
}
Then you MUST restart Claude Code (settings load only at startup).
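Hand-editing JSON is a frequent source of syntax errors, so the same keys can be merged in with jq instead. This is a sketch under assumptions: add_telemetry_env is a hypothetical name, the values are the local-setup values from this guide, and the backup is written next to the original as settings.json.backup.

```shell
# Hypothetical helper: merge the required telemetry keys into .env,
# keeping every other setting in the file untouched.
add_telemetry_env() {
  file=${1:-"$HOME/.claude/settings.json"}
  # Keep a copy so the original can be restored if anything goes wrong.
  cp "$file" "$file.backup" || return 1
  # Merge the five keys into .env, creating .env if it does not exist yet.
  if jq '.env = ((.env // {}) + {
      "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
      "OTEL_METRICS_EXPORTER": "otlp",
      "OTEL_LOGS_EXPORTER": "otlp",
      "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
      "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc"
    })' "$file.backup" > "$file.tmp"; then
    mv "$file.tmp" "$file"
  else
    # Invalid JSON or a jq error: discard the partial output.
    rm -f "$file.tmp"
    echo "jq failed; $file left unchanged" >&2
    return 1
  fi
}
```

Writing to a temporary file first means a jq failure never truncates the real settings file.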
If endpoint unreachable:
- Verify OTEL Collector container is running
- Check firewall settings
- Try the HTTP endpoint instead (and set OTEL_EXPORTER_OTLP_PROTOCOL to "http/protobuf" to match):
http://localhost:4318
If still no data:
- Check OTEL Collector logs for incoming connections
- Verify Claude Code is running (not just idle)
- Wait 60 seconds (default export interval)
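Rather than waiting a fixed 60 seconds, Prometheus can be polled until the first claude_code metric shows up. A minimal sketch, assuming the local Prometheus URL used elsewhere in this guide; wait_for_metrics and its two arguments (timeout and poll interval, in seconds) are hypothetical.

```shell
# Hypothetical helper: poll Prometheus until a claude_code metric name appears.
# Usage: wait_for_metrics [timeout_seconds] [interval_seconds]
wait_for_metrics() {
  timeout=${1:-90}
  interval=${2:-5}
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # The label-values endpoint lists every metric name Prometheus knows about.
    if curl -s 'http://localhost:9090/api/v1/label/__name__/values' | grep -q 'claude_code'; then
      echo "claude_code metrics found after ${elapsed}s"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "no claude_code metrics after ${timeout}s"
  return 1
}
```

The default timeout of 90 seconds is an arbitrary margin above the 60-second export interval.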
Settings.json Syntax Errors
Symptom: Claude Code won't start or shows errors
Diagnosis:
# Validate JSON
jq empty ~/.claude/settings.json
# Pretty-print to find issues
jq . ~/.claude/settings.json
Common Issues:
- Missing commas between properties
- Trailing commas before closing braces
- Unescaped quotes in strings
- Incorrect nesting
Solution:
# Restore backup
cp ~/.claude/settings.json.backup ~/.claude/settings.json
# Or fix JSON manually with editor
Grafana Issues
Can't Access Grafana
Symptom: localhost:3000 doesn't load
Diagnosis:
# Check if Grafana is running
docker ps | grep grafana
# Check Grafana logs
docker compose logs grafana
# Check port availability
lsof -i :3000
Solutions:
- Verify container is running:
docker compose up -d grafana
- Wait 30 seconds for Grafana to initialize
- Try http://127.0.0.1:3000 instead
- Check Docker network:
docker network inspect claude-telemetry
Dashboard Shows "Datasource Not Found"
Symptom: Dashboard panels show "datasource prometheus not found"
Cause: Dashboard has hardcoded datasource UID that doesn't match your Grafana instance
Diagnosis:
- Go to: http://localhost:3000/connections/datasources
- Click on Prometheus datasource
- Note the UID from the URL (e.g., PBFA97CFB590B2093)
Solution:
# Get your datasource UID
DATASOURCE_UID=$(curl -s -u admin:admin http://localhost:3000/api/datasources | jq -r '.[] | select(.type=="prometheus") | .uid')
echo "Your Prometheus datasource UID: $DATASOURCE_UID"
# Update dashboard JSON
cd ~/.claude/telemetry/dashboards
sed "s/PBFA97CFB590B2093/$DATASOURCE_UID/g" claude-code-overview.json > claude-code-overview-fixed.json
# Re-import the fixed dashboard
Dashboard Shows "No Data"
Symptom: Dashboard loads but all panels show "No data"
Diagnosis Steps:
1. Check Prometheus has data:
# Query Prometheus directly
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code
# Should see metrics like:
# "claude_code_claude_code_session_count_total"
# "claude_code_claude_code_cost_usage_USD_total"
2. Check datasource connection:
- Go to: http://localhost:3000/connections/datasources
- Click Prometheus
- Click "Save & Test"
- Should show: "Successfully queried the Prometheus API"
3. Verify metric names in queries:
# Check if metrics use double prefix
curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_session_count_total' | jq .
Solutions:
If metrics don't exist:
- Claude Code hasn't sent data yet (wait 60 seconds)
- OTEL Collector isn't receiving data (check container logs)
- Settings.json wasn't configured correctly
If metrics exist but dashboard shows no data:
- Dashboard queries use wrong metric names
- Update queries to use double prefix: claude_code_claude_code_*
- Check time range (top-right corner of Grafana)
If single prefix metrics exist (claude_code_*):
Your setup uses old naming. Update dashboard:
# Replace double prefix with single
sed 's/claude_code_claude_code_/claude_code_/g' dashboard.json > dashboard-fixed.json
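Before editing any dashboard, it helps to confirm which naming scheme Prometheus actually holds. The following sketch assumes the local Prometheus URL from this guide; detect_metric_prefix is a hypothetical name.

```shell
# Hypothetical helper: report whether metric names use the double or single prefix.
detect_metric_prefix() {
  names=$(curl -s 'http://localhost:9090/api/v1/label/__name__/values')
  case $names in
    # Test the longer pattern first, because it also contains the shorter one.
    *claude_code_claude_code_*) echo "double" ;;
    *claude_code_*)             echo "single" ;;
    *)                          echo "none" ;;
  esac
}
```

A result of "none" means no Claude Code metrics have arrived at all, which is a collection problem rather than a naming problem.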
Prometheus Issues
Prometheus Shows No Targets
Symptom: Prometheus UI (localhost:9090) → Status → Targets shows no targets or DOWN status
Diagnosis:
# Check Prometheus config
cat ~/.claude/telemetry/prometheus.yml
# Check if OTEL Collector is reachable from Prometheus
docker exec claude-prometheus ping -c 3 otel-collector
Solutions:
- Verify prometheus.yml has correct scrape_configs
- Ensure OTEL Collector is running
- Check Docker network connectivity
- Restart Prometheus:
docker compose restart prometheus
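Target state is also available from the Prometheus HTTP API, which is easier to paste into a bug report than a screenshot of the UI. A sketch, assuming jq is installed and Prometheus is on its default local port; list_targets is a hypothetical name.

```shell
# Hypothetical helper: print job, health, scrape URL and last error for each target.
# Usage: list_targets [prometheus_base_url]
list_targets() {
  base=${1:-http://localhost:9090}
  curl -s "$base/api/v1/targets" |
    jq -r '.data.activeTargets[] | [.labels.job, .health, .scrapeUrl, .lastError] | @tsv'
}
```

No output at all means Prometheus has no active targets, which points at scrape_configs in prometheus.yml rather than at the collector.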
Prometheus Can't Scrape OTEL Collector
Symptom: Target shows as DOWN with error "context deadline exceeded"
Diagnosis:
# Check if OTEL Collector is exposing metrics
curl http://localhost:8889/metrics
# Check OTEL Collector logs
docker compose logs otel-collector
Solutions:
- Verify OTEL Collector prometheus exporter is configured
- Check port 8889 is exposed in docker-compose.yml
- Restart OTEL Collector:
docker compose restart otel-collector
Metric Issues
Metrics Have Double Prefix
Symptom: Metrics are named claude_code_claude_code_* instead of claude_code_*
Explanation: This is expected behavior with the current OTEL Collector configuration:
- First claude_code = Prometheus exporter namespace
- Second claude_code = Original metric name
Solutions:
Option 1: Accept it (Recommended)
- Update dashboard queries to use double prefix
- This is the standard configuration
Option 2: Remove namespace prefix
Update otel-collector-config.yml:
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: "" # Remove namespace
Then restart: docker compose restart otel-collector
Old Metrics Still Showing
Symptom: After changing configuration, old metrics still appear
Cause: Prometheus retains metrics until retention period expires
Solutions:
Quick fix: Delete Prometheus data:
docker compose down
docker volume rm claude-telemetry_prometheus-data
docker compose up -d
Proper fix: Wait for retention:
- Default retention is 15 days
- Old metrics will automatically disappear
- New metrics will coexist temporarily
Network Issues
Can't Reach OTEL Endpoint from Claude Code
Symptom: Claude Code can't connect to localhost:4317
Diagnosis:
# Test gRPC endpoint
nc -zv localhost 4317
# Test HTTP endpoint
curl -v http://localhost:4318/v1/metrics -d '{}'
Solutions:
If connection refused:
- Check OTEL Collector is running
- Verify ports are exposed in docker-compose.yml
- Check firewall/antivirus blocking localhost connections
If timeout:
- Increase export timeout in settings.json
- Try HTTP protocol instead of gRPC
macOS-specific:
- Use http://host.docker.internal:4317 instead of localhost:4317
- Or use bridge network mode
Enterprise Endpoint Unreachable
Symptom: Can't connect to company OTEL endpoint
Diagnosis:
# Test connectivity
ping otel.company.com
# Test port
nc -zv otel.company.com 4317
# Test with VPN
# (Ensure corporate VPN is connected)
Solutions:
- Connect to corporate VPN
- Check firewall allows outbound connections
- Verify endpoint URL is correct
- Try HTTP endpoint (port 4318) instead of gRPC
- Contact platform team to verify endpoint is accessible
Performance Issues
High Memory Usage
Symptom: OTEL Collector or Prometheus using excessive memory
Diagnosis:
# Check container resource usage
docker stats
# Check Prometheus TSDB size
du -sh ~/.claude/telemetry/prometheus-data
Solutions:
OTEL Collector:
Reduce memory_limiter in otel-collector-config.yml:
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 256 # Reduce from 512
Prometheus: Reduce retention:
command:
  - '--storage.tsdb.retention.time=7d' # Reduce from 15d
  - '--storage.tsdb.retention.size=1GB'
Slow Grafana Dashboards
Symptom: Dashboards take a long time to load or time out
Diagnosis:
# Check query performance in Prometheus
# Go to: http://localhost:9090/graph
# Run expensive queries like: sum by (account_uuid, model, type) (...)
Solutions:
- Reduce dashboard time range (use 6h instead of 7d)
- Increase dashboard refresh interval (1m → 5m)
- Use recording rules for complex queries
- Reduce number of panels
- Use simpler aggregations
Data Quality Issues
Unexpected Cost Values
Symptom: Cost metrics seem incorrect
Diagnosis:
# Check raw cost values
curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_cost_usage_USD_total' | jq .
# Check token usage
curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_token_usage_tokens_total' | jq .
Causes:
- Cost is a cumulative counter (not reset between sessions)
- Dashboard may be using wrong time range
- Model pricing may have changed
Solutions:
- Use increase(claude_code_claude_code_cost_usage_USD_total[24h]), not raw counter values
- Verify pricing in metrics reference
- Check Claude Code version (pricing may vary)
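To see the difference between the raw counter and the cost over a window, the increase() form can be sent to Prometheus directly. A sketch, assuming the double-prefix metric name and the local Prometheus URL used in this guide; cost_increase and its window argument are hypothetical.

```shell
# Hypothetical helper: query the total cost increase over a time window.
# Usage: cost_increase [window]   e.g. cost_increase 7d
cost_increase() {
  window=${1:-24h}
  # -G sends the data as a URL query string; --data-urlencode escapes the PromQL.
  curl -s -G 'http://localhost:9090/api/v1/query' \
    --data-urlencode "query=sum(increase(claude_code_claude_code_cost_usage_USD_total[$window]))"
}
```

Pipe the result through jq . to read it; the value should be far smaller than the raw counter once the collector has been running for a while.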
Missing Sessions
Symptom: Some Claude Code sessions not recorded
Causes:
- Claude Code wasn't restarted after settings update
- OTEL Collector was down during session
- Export interval hadn't elapsed yet (60 seconds default)
- Network issue prevented export
Solutions:
- Always restart Claude Code after settings changes
- Monitor OTEL Collector uptime
- Check OTEL Collector logs for export errors
- Reduce export interval if real-time data needed
Getting Help
Collect Debug Information
When asking for help, provide:
# 1. Container status
docker compose ps
# 2. Container logs (last 50 lines)
docker compose logs --tail=50
# 3. Configuration files
cat ~/.claude/telemetry/otel-collector-config.yml
cat ~/.claude/telemetry/prometheus.yml
# 4. Claude Code settings (redact sensitive info!)
jq '.env | with_entries(select(.key | startswith("OTEL_")))' ~/.claude/settings.json
# 5. Prometheus metrics list
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code
# 6. System info
docker --version
docker compose version
uname -a
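The commands above can be gathered into a single file to attach to a help request. This is a sketch only: collect_debug_info and the default file name are hypothetical, and it assumes it is run from the directory that holds docker-compose.yml.

```shell
# Hypothetical helper: write the debug information listed above to one file.
# Usage: collect_debug_info [output_file]
collect_debug_info() {
  out=${1:-claude-telemetry-debug.txt}
  {
    echo "== container status =="
    docker compose ps
    echo "== container logs (last 50 lines) =="
    docker compose logs --tail=50
    echo "== configuration files =="
    cat "$HOME/.claude/telemetry/otel-collector-config.yml"
    cat "$HOME/.claude/telemetry/prometheus.yml"
    echo "== OTEL env settings =="
    jq '.env | with_entries(select(.key | startswith("OTEL_")))' "$HOME/.claude/settings.json"
    echo "== claude_code metrics =="
    curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code
    echo "== versions =="
    docker --version
    docker compose version
    uname -a
  } > "$out" 2>&1   # errors are captured too, since a failing step is itself useful
  echo "wrote $out"
}
```

Nothing is redacted automatically, so read the file before sharing it; values such as OTEL_EXPORTER_OTLP_HEADERS can carry credentials.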
Enable Debug Logging
OTEL Collector:
exporters:
  debug:
    verbosity: detailed # Change from 'normal'
service:
  telemetry:
    logs:
      level: debug # Change from 'info'
Claude Code: Add to settings.json:
"env": {
"OTEL_LOG_LEVEL": "debug"
}
Then check logs:
docker compose logs -f otel-collector
Additional Resources
- OTEL Collector Docs: https://opentelemetry.io/docs/collector/
- Prometheus Troubleshooting: https://prometheus.io/docs/prometheus/latest/troubleshooting/
- Grafana Troubleshooting: https://grafana.com/docs/grafana/latest/troubleshooting/
- Docker Compose Docs: https://docs.docker.com/compose/