2025-11-29 18:16:51 +08:00

Troubleshooting Guide

Common issues and solutions for Claude Code OpenTelemetry setup.


Container Issues

Docker Not Running

Symptom: Cannot connect to the Docker daemon

Diagnosis:

docker info

Solutions:

  1. Start Docker Desktop application
  2. Wait for Docker to fully initialize
  3. Check system tray for Docker icon
  4. Verify Docker daemon is running: ps aux | grep docker

Containers Won't Start

Symptom: Containers exit immediately after docker compose up

Diagnosis:

# Check container logs
docker compose logs

# Check specific service
docker compose logs otel-collector
docker compose logs prometheus

Common Causes:

1. OTEL Collector Configuration Error

# Check for errors
docker compose logs otel-collector | grep -i error

# Common issues:
# - Deprecated logging exporter
# - Deprecated 'address' field in telemetry.metrics

Solution A - Deprecated logging exporter:

Update otel-collector-config.yml:

exporters:
  debug:
    verbosity: normal
  # NOT:
  # logging:
  #   loglevel: info

Solution B - Deprecated 'address' field (v0.123.0+):

If the logs show "'address' has invalid keys" or a similar error, update otel-collector-config.yml:

service:
  telemetry:
    metrics:
      level: detailed
      # REMOVE this line (deprecated in v0.123.0+):
      # address: ":8888"

The address field in service.telemetry.metrics is deprecated in newer OTEL Collector versions. Remove it; the collector will fall back to its default internal metrics endpoint.
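Both deprecated settings can be spotted before restarting the collector. The sketch below is a hypothetical helper (check_deprecated_otel_keys is not part of any tool) that greps a config file for an uncommented logging: or address: key. It is a plain text match, not a YAML parser, so treat any hit as a hint to inspect the file rather than a definitive diagnosis.

```shell
# Hypothetical helper: flag deprecated keys in an OTEL Collector config.
# Text-based matching only; it does not parse YAML.
# Returns 0 when nothing suspicious is found, 1 otherwise.
check_deprecated_otel_keys() {
  config_file=$1
  found=0

  # Uncommented "logging:" key (deprecated exporter; replaced by "debug").
  if grep -Eq '^[[:space:]]+logging:' "$config_file"; then
    echo "deprecated: logging exporter (use debug)"
    found=1
  fi

  # Uncommented "address:" key (deprecated under service.telemetry.metrics).
  if grep -Eq '^[[:space:]]+address:' "$config_file"; then
    echo "deprecated: address field (remove it)"
    found=1
  fi

  return "$found"
}

# Usage:
#   check_deprecated_otel_keys ~/.claude/telemetry/otel-collector-config.yml
```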

2. Port Already in Use

# Check which ports are in use
lsof -i :3000  # Grafana
lsof -i :4317  # OTEL gRPC
lsof -i :4318  # OTEL HTTP
lsof -i :8889  # OTEL Prometheus exporter
lsof -i :9090  # Prometheus
lsof -i :3100  # Loki

Solution:

  • Stop conflicting service
  • Or change port in docker-compose.yml
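To check all of the stack's ports in one pass, the individual lsof calls can be wrapped in a loop. This is a minimal sketch: the function names are hypothetical, and the port list is simply the one shown above. Run it while the stack is stopped, since Docker itself holds these ports when the containers are up.

```shell
# Ports used by the telemetry stack, as listed above ("port label" per line).
stack_ports() {
  cat <<'EOF'
3000 Grafana
4317 OTEL gRPC
4318 OTEL HTTP
8889 OTEL Prometheus exporter
9090 Prometheus
3100 Loki
EOF
}

# Report, for each port, whether something is already listening on it.
# Relies on lsof, the same tool used in the checks above.
check_stack_ports() {
  stack_ports | while read -r port label; do
    if lsof -nP -iTCP:"$port" -sTCP:LISTEN >/dev/null 2>&1; then
      echo "IN USE  $port ($label)"
    else
      echo "free    $port ($label)"
    fi
  done
}

# Usage:
#   check_stack_ports
```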

3. Volume Permission Issues

# Check volume permissions
docker volume ls
docker volume inspect claude-telemetry_prometheus-data

Solution:

# Remove and recreate volumes
docker compose down -v
docker compose up -d

Containers Keep Restarting

Symptom: Container status shows "Restarting"

Diagnosis:

docker compose ps
docker compose logs --tail=50 <service-name>

Solutions:

  1. Check memory limits: Increase limit_mib under memory_limiter in the OTEL config
  2. Check disk space: df -h
  3. Check for configuration errors in logs
  4. Restart Docker Desktop

Claude Code Settings Issues

🚨 CRITICAL: Telemetry Not Sending (Most Common Issue)

Symptom: No metrics appearing in Prometheus after Claude Code restart

ROOT CAUSE (90% of cases): Missing required exporter environment variables

Even when CLAUDE_CODE_ENABLE_TELEMETRY=1 is set, telemetry will not be sent without explicit exporter configuration. This is the most common issue.

Diagnosis Checklist:

1. Check REQUIRED exporters (MOST IMPORTANT):

jq '.env.OTEL_METRICS_EXPORTER' ~/.claude/settings.json
# Must return: "otlp" (NOT null, NOT missing)

jq '.env.OTEL_LOGS_EXPORTER' ~/.claude/settings.json
# Should return: "otlp" (recommended for event tracking)

If either returns null or is missing, this is your problem!

2. Verify telemetry is enabled:

jq '.env.CLAUDE_CODE_ENABLE_TELEMETRY' ~/.claude/settings.json
# Should return: "1"

3. Check OTEL endpoint:

jq '.env.OTEL_EXPORTER_OTLP_ENDPOINT' ~/.claude/settings.json
# Should return: "http://localhost:4317" (for local setup)

4. Verify JSON is valid:

jq empty ~/.claude/settings.json
# No output = valid JSON

5. Check if Claude Code was restarted:

# Telemetry config only loads at startup!
# Must quit and restart Claude Code completely

6. Test OTEL endpoint connectivity:

nc -zv localhost 4317
# Should show: Connection to localhost port 4317 [tcp/*] succeeded!
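The settings checks in this list can be run in one pass. The following is a sketch under stated assumptions: check_telemetry_settings is a hypothetical helper, it requires jq, and the expected values are the ones given in the checklist above. The endpoint differs between setups, so the helper only checks that one is set.

```shell
# Hypothetical helper: run the settings checks above in one pass.
# Requires jq. Prints one line per problem; returns 0 if none were found.
check_telemetry_settings() {
  settings_file=${1:-"$HOME/.claude/settings.json"}
  problems=0

  if ! jq empty "$settings_file" >/dev/null 2>&1; then
    echo "missing file or invalid JSON: $settings_file"
    return 1
  fi

  # Each entry is KEY=expected value, taken from the checklist above.
  for check in \
    "CLAUDE_CODE_ENABLE_TELEMETRY=1" \
    "OTEL_METRICS_EXPORTER=otlp" \
    "OTEL_LOGS_EXPORTER=otlp"
  do
    key=${check%%=*}
    expected=${check#*=}
    actual=$(jq -r --arg k "$key" '.env[$k] // "" | tostring' "$settings_file")
    if [ "$actual" != "$expected" ]; then
      echo "$key is '${actual:-unset}', expected '$expected'"
      problems=1
    fi
  done

  # The endpoint varies by setup, so only check that it is present.
  endpoint=$(jq -r '.env.OTEL_EXPORTER_OTLP_ENDPOINT // ""' "$settings_file")
  if [ -z "$endpoint" ]; then
    echo "OTEL_EXPORTER_OTLP_ENDPOINT is unset"
    problems=1
  fi

  return "$problems"
}

# Usage:
#   check_telemetry_settings && echo "settings look OK"
```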

Solutions:

If exporters are missing (MOST COMMON):

Add these REQUIRED settings to ~/.claude/settings.json:

{
  "env": {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "OTEL_METRICS_EXPORTER": "otlp",
    "OTEL_LOGS_EXPORTER": "otlp",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc"
  }
}

Then you MUST restart Claude Code (settings only load at startup).

If endpoint unreachable:

  • Verify OTEL Collector container is running
  • Check firewall settings
  • Try HTTP endpoint instead: http://localhost:4318

If still no data:

  • Check OTEL Collector logs for incoming connections
  • Verify Claude Code is running (not just idle)
  • Wait 60 seconds (default export interval)

Settings.json Syntax Errors

Symptom: Claude Code won't start or shows errors

Diagnosis:

# Validate JSON
jq empty ~/.claude/settings.json

# Pretty-print to find issues
jq . ~/.claude/settings.json

Common Issues:

  • Missing commas between properties
  • Trailing commas before closing braces
  • Unescaped quotes in strings
  • Incorrect nesting

Solution:

# Restore backup
cp ~/.claude/settings.json.backup ~/.claude/settings.json

# Or fix JSON manually with editor

Grafana Issues

Can't Access Grafana

Symptom: localhost:3000 doesn't load

Diagnosis:

# Check if Grafana is running
docker ps | grep grafana

# Check Grafana logs
docker compose logs grafana

# Check port availability
lsof -i :3000

Solutions:

  1. Verify container is running: docker compose up -d grafana
  2. Wait 30 seconds for Grafana to initialize
  3. Try http://127.0.0.1:3000 instead
  4. Check Docker network: docker network inspect claude-telemetry

Dashboard Shows "Datasource Not Found"

Symptom: Dashboard panels show "datasource prometheus not found"

Cause: The dashboard has a hardcoded datasource UID that doesn't match the one in your Grafana instance

Diagnosis:

  1. Go to: http://localhost:3000/connections/datasources
  2. Click on Prometheus datasource
  3. Note the UID from URL (e.g., PBFA97CFB590B2093)

Solution:

# Get your datasource UID
DATASOURCE_UID=$(curl -s -u admin:admin http://localhost:3000/api/datasources | jq -r '.[] | select(.type=="prometheus") | .uid')

echo "Your Prometheus datasource UID: $DATASOURCE_UID"

# Update dashboard JSON
cd ~/.claude/telemetry/dashboards
sed "s/PBFA97CFB590B2093/$DATASOURCE_UID/g" claude-code-overview.json > claude-code-overview-fixed.json

# Re-import the fixed dashboard
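Re-importing can be done through the Grafana UI or scripted. The sketch below assumes the file is a plain dashboard model (not an export containing __inputs placeholders) and that Grafana's POST /api/dashboards/db endpoint is reachable with the admin:admin credentials used above. build_import_payload is a hypothetical helper name, and it requires jq.

```shell
# Hypothetical helper: wrap a dashboard JSON file in the payload shape
# expected by Grafana's POST /api/dashboards/db endpoint.
# "id" is cleared so Grafana matches on the dashboard uid instead, and
# "overwrite" replaces an existing dashboard with the same uid.
build_import_payload() {
  jq '{dashboard: (. + {id: null}), overwrite: true}' "$1"
}

# Usage (credentials and URL as in the examples above):
#   build_import_payload claude-code-overview-fixed.json |
#     curl -s -u admin:admin -X POST \
#       -H 'Content-Type: application/json' \
#       -d @- http://localhost:3000/api/dashboards/db
```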

Dashboard Shows "No Data"

Symptom: Dashboard loads but all panels show "No data"

Diagnosis Steps:

1. Check Prometheus has data:

# Query Prometheus directly
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code

# Should see metrics like:
# "claude_code_claude_code_session_count_total"
# "claude_code_claude_code_cost_usage_USD_total"

2. Check datasource connection:

3. Verify metric names in queries:

# Check if metrics use double prefix
curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_session_count_total' | jq .
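For step 2, one way to confirm that Grafana has a Prometheus datasource, and where it points, is to list the datasources through the same /api/datasources endpoint used earlier in this guide. This is a sketch: summarize_datasources is a hypothetical helper, it requires jq, and admin:admin is assumed to still be the login.

```shell
# Hypothetical helper: read the JSON array returned by Grafana's
# /api/datasources endpoint on stdin and print one tab-separated line
# per datasource: name, type, uid, url.
summarize_datasources() {
  jq -r '.[] | "\(.name)\t\(.type)\t\(.uid)\t\(.url)"'
}

# Usage:
#   curl -s -u admin:admin http://localhost:3000/api/datasources | summarize_datasources
```

The url column should be an address that the Grafana container can reach, which is typically the Prometheus service name on the Docker network rather than localhost.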

Solutions:

If metrics don't exist:

  • Claude Code hasn't sent data yet (wait 60 seconds)
  • OTEL Collector isn't receiving data (check container logs)
  • Settings.json wasn't configured correctly

If metrics exist but dashboard shows no data:

  • Dashboard queries use wrong metric names
  • Update queries to use double prefix: claude_code_claude_code_*
  • Check time range (top-right corner of Grafana)

If only single-prefix metrics exist (claude_code_*), your setup uses the old naming. Update the dashboard:

# Replace double prefix with single
sed 's/claude_code_claude_code_/claude_code_/g' dashboard.json > dashboard-fixed.json
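Before editing a dashboard it helps to know which naming the running setup actually produces. The helper below is a hypothetical sketch that reads metric names on stdin and reports whether the double or single prefix is present.

```shell
# Hypothetical helper: read metric names on stdin (one per line; quotes
# and commas are tolerated) and report which prefix style is present.
# Prints "double", "single", or "none".
detect_metric_prefix() {
  names=$(cat)
  if printf '%s\n' "$names" | grep -q 'claude_code_claude_code_'; then
    echo "double"
  elif printf '%s\n' "$names" | grep -q 'claude_code_'; then
    echo "single"
  else
    echo "none"
  fi
}

# Usage (Prometheus URL as in the examples above):
#   curl -s 'http://localhost:9090/api/v1/label/__name__/values' | detect_metric_prefix
```

A result of none means Prometheus has no Claude Code metrics yet, so the problem lies upstream of the dashboard.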

Prometheus Issues

Prometheus Shows No Targets

Symptom: Prometheus UI (localhost:9090) → Status → Targets shows no targets or DOWN status

Diagnosis:

# Check Prometheus config
cat ~/.claude/telemetry/prometheus.yml

# Check if OTEL Collector is reachable from Prometheus (-c 3 stops after three probes)
docker exec claude-prometheus ping -c 3 otel-collector

Solutions:

  1. Verify prometheus.yml has correct scrape_configs
  2. Ensure OTEL Collector is running
  3. Check Docker network connectivity
  4. Restart Prometheus: docker compose restart prometheus

Prometheus Can't Scrape OTEL Collector

Symptom: Target shows as DOWN with error "context deadline exceeded"

Diagnosis:

# Check if OTEL Collector is exposing metrics
curl http://localhost:8889/metrics

# Check OTEL Collector logs
docker compose logs otel-collector

Solutions:

  1. Verify OTEL Collector prometheus exporter is configured
  2. Check port 8889 is exposed in docker-compose.yml
  3. Restart OTEL Collector: docker compose restart otel-collector
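The target state shown in the Prometheus UI is also available from the HTTP API at /api/v1/targets, which is convenient when collecting logs for a bug report. A minimal sketch, assuming jq is installed; summarize_targets is a hypothetical helper name.

```shell
# Hypothetical helper: read the JSON from Prometheus's /api/v1/targets
# endpoint on stdin and print one tab-separated line per active target:
# job, health (up/down/unknown), last scrape error (empty if none).
summarize_targets() {
  jq -r '.data.activeTargets[] | "\(.labels.job)\t\(.health)\t\(.lastError)"'
}

# Usage:
#   curl -s http://localhost:9090/api/v1/targets | summarize_targets
```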

Metric Issues

Metrics Have Double Prefix

Symptom: Metrics are named claude_code_claude_code_* instead of claude_code_*

Explanation: This is expected behavior with the current OTEL Collector configuration:

  • First claude_code = Prometheus exporter namespace
  • Second claude_code = Original metric name

Solutions:

Option 1: Accept it (Recommended)

  • Update dashboard queries to use double prefix
  • This is the standard configuration

Option 2: Remove namespace prefix

Update otel-collector-config.yml:

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: ""  # Remove namespace

Then restart: docker compose restart otel-collector


Old Metrics Still Showing

Symptom: After changing configuration, old metrics still appear

Cause: Prometheus retains metrics until the retention period expires

Solutions:

Quick fix: Delete Prometheus data:

docker compose down
docker volume rm claude-telemetry_prometheus-data
docker compose up -d

Proper fix: Wait for retention:

  • Default retention is 15 days
  • Old metrics will automatically disappear
  • New metrics will coexist temporarily

Network Issues

Can't Reach OTEL Endpoint from Claude Code

Symptom: Claude Code can't connect to localhost:4317

Diagnosis:

# Test gRPC endpoint
nc -zv localhost 4317

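# Test HTTP endpoint (the OTLP/HTTP receiver expects a JSON or protobuf content type)
curl -v http://localhost:4318/v1/metrics -H 'Content-Type: application/json' -d '{}'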

Solutions:

If connection refused:

  1. Check OTEL Collector is running
  2. Verify ports are exposed in docker-compose.yml
  3. Check firewall/antivirus blocking localhost connections

If timeout:

  1. Increase export timeout in settings.json
  2. Try HTTP protocol instead of gRPC

macOS-specific:

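  • If Claude Code itself runs inside a container (for example a dev container), use http://host.docker.internal:4317 instead of localhost:4317; when Claude Code runs directly on the host, localhost is correct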
  • Or use bridge network mode

Enterprise Endpoint Unreachable

Symptom: Can't connect to company OTEL endpoint

Diagnosis:

# Test connectivity
ping otel.company.com

# Test port
nc -zv otel.company.com 4317

# Test with VPN
# (Ensure corporate VPN is connected)

Solutions:

  1. Connect to corporate VPN
  2. Check firewall allows outbound connections
  3. Verify endpoint URL is correct
  4. Try HTTP endpoint (port 4318) instead of gRPC
  5. Contact platform team to verify endpoint is accessible
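To test the endpoint that is actually configured rather than a hand-typed one, the host and port can be split out of the URL in settings.json. endpoint_host_port below is a hypothetical helper; it handles the scheme://host:port[/path] form only and does not support IPv6 literals.

```shell
# Hypothetical helper: split an OTLP endpoint URL into "host port".
# Handles scheme://host:port[/path]; IPv6 literals are not supported.
endpoint_host_port() {
  hostport=${1#*://}        # drop the scheme, if any
  hostport=${hostport%%/*}  # drop any path
  case $hostport in
    *:*) echo "${hostport%%:*} ${hostport##*:}" ;;
    *)   echo "$hostport" ;;  # no explicit port in the URL
  esac
}

# Usage (assumes jq and nc are installed):
#   set -- $(endpoint_host_port "$(jq -r '.env.OTEL_EXPORTER_OTLP_ENDPOINT' ~/.claude/settings.json)")
#   nc -zv "$1" "$2"
```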

Performance Issues

High Memory Usage

Symptom: OTEL Collector or Prometheus using excessive memory

Diagnosis:

# Check container resource usage
docker stats

# Check Prometheus TSDB size
du -sh ~/.claude/telemetry/prometheus-data

Solutions:

OTEL Collector: Lower limit_mib under memory_limiter in otel-collector-config.yml:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 256  # Reduce from 512

Prometheus: Reduce retention:

command:
  - '--storage.tsdb.retention.time=7d'  # Reduce from 15d
  - '--storage.tsdb.retention.size=1GB'

Slow Grafana Dashboards

Symptom: Dashboards take a long time to load or time out

Diagnosis:

# Check query performance in Prometheus
# Go to: http://localhost:9090/graph
# Run expensive queries like: sum by (account_uuid, model, type) (...)

Solutions:

  1. Reduce dashboard time range (use 6h instead of 7d)
  2. Increase dashboard refresh interval (1m → 5m)
  3. Use recording rules for complex queries
  4. Reduce number of panels
  5. Use simpler aggregations

Data Quality Issues

Unexpected Cost Values

Symptom: Cost metrics seem incorrect

Diagnosis:

# Check raw cost values
curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_cost_usage_USD_total' | jq .

# Check token usage
curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_token_usage_tokens_total' | jq .

Causes:

  • Cost is a cumulative counter (it is not reset between sessions)
  • Dashboard may be using wrong time range
  • Model pricing may have changed

Solutions:

  • Use increase(<metric>[24h]) rather than raw counter values, for example increase(claude_code_claude_code_cost_usage_USD_total[24h])
  • Verify pricing in metrics reference
  • Check Claude Code version (pricing may vary)
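The difference between the raw counter and its increase is easiest to see by querying Prometheus directly. The sketch below builds an increase() expression and sends it with curl; the function names are hypothetical, and the sum() wrapper is an assumption that collapses all label combinations into one total.

```shell
# Build a PromQL expression for the total increase of a counter over a
# time window (default 24h). sum() merges all label combinations.
increase_query() {
  metric=$1
  window=${2:-24h}
  printf 'sum(increase(%s[%s]))\n' "$metric" "$window"
}

# Run the cost query against the local Prometheus (URL from the examples
# above). --data-urlencode takes care of escaping the brackets.
query_cost_increase() {
  curl -s -G 'http://localhost:9090/api/v1/query' \
    --data-urlencode "query=$(increase_query claude_code_claude_code_cost_usage_USD_total "${1:-24h}")"
}

# Usage:
#   query_cost_increase 7d | jq .
```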

Missing Sessions

Symptom: Some Claude Code sessions not recorded

Causes:

  1. Claude Code wasn't restarted after settings update
  2. OTEL Collector was down during session
  3. Export interval hadn't elapsed yet (60 seconds default)
  4. Network issue prevented export

Solutions:

  • Always restart Claude Code after settings changes
  • Monitor OTEL Collector uptime
  • Check OTEL Collector logs for export errors
  • Reduce the export interval if near-real-time data is needed
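One way to shorten the export interval is to set the standard OpenTelemetry variable OTEL_METRIC_EXPORT_INTERVAL (in milliseconds) in the env block of settings.json. This is a sketch under two assumptions: that your Claude Code version honors this variable, and that jq is installed. set_settings_env is a hypothetical helper; it writes a .backup copy first, matching the backup file name used in the restore step earlier in this guide.

```shell
# Hypothetical helper: set one key in the "env" block of a settings file.
# Writes <file>.backup first and only replaces the file if jq succeeds.
set_settings_env() {
  settings_file=$1
  key=$2
  value=$3

  cp "$settings_file" "$settings_file.backup" || return 1
  tmp=$(mktemp) || return 1

  if jq --arg k "$key" --arg v "$value" '.env[$k] = $v' "$settings_file" > "$tmp"; then
    # cat instead of mv keeps the original file's permissions.
    cat "$tmp" > "$settings_file"
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}

# Usage (10000 ms = 10 seconds; restart Claude Code afterwards):
#   set_settings_env ~/.claude/settings.json OTEL_METRIC_EXPORT_INTERVAL 10000
```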

Getting Help

Collect Debug Information

When asking for help, provide:

# 1. Container status
docker compose ps

# 2. Container logs (last 50 lines)
docker compose logs --tail=50

# 3. Configuration files
cat ~/.claude/telemetry/otel-collector-config.yml
cat ~/.claude/telemetry/prometheus.yml

# 4. Claude Code settings (redact sensitive info!)
jq '.env | with_entries(select(.key | startswith("OTEL_")))' ~/.claude/settings.json

# 5. Prometheus metrics list
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code

# 6. System info
docker --version
docker compose version
uname -a
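The commands above can be gathered into a single file to attach to a help request. The sketch below uses hypothetical function names and covers only the commands that need no redaction; settings and config files are left out on purpose so that nothing sensitive is captured by accident.

```shell
# Run one diagnostic command under a heading; keep going if it fails.
run_section() {
  title=$1
  shift
  echo "===== $title ====="
  "$@" 2>&1 || echo "(command failed: $*)"
  echo
}

# Collect the non-sensitive diagnostics listed above.
collect_debug_info() {
  run_section "Container status" docker compose ps
  run_section "Container logs (last 50 lines)" docker compose logs --tail=50
  run_section "Docker version" docker --version
  run_section "Compose version" docker compose version
  run_section "System info" uname -a
}

# Usage (run from the directory containing docker-compose.yml):
#   collect_debug_info > claude-telemetry-debug.txt
```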

Enable Debug Logging

OTEL Collector:

exporters:
  debug:
    verbosity: detailed  # Change from 'normal'

service:
  telemetry:
    logs:
      level: debug  # Change from 'info'

Claude Code: Add to settings.json:

"env": {
  "OTEL_LOG_LEVEL": "debug"
}

Then check logs:

docker compose logs -f otel-collector

Additional Resources