zhongwei/gh-cskiro-claudex-claude-code-tools

Zhongwei Li 4e8a12140c Initial commit

2025-11-29 18:16:51 +08:00

Troubleshooting Guide

Common issues and solutions for Claude Code OpenTelemetry setup.

Container Issues

Docker Not Running

Symptom: Cannot connect to the Docker daemon

Diagnosis:

docker info

Solutions:

Start Docker Desktop application
Wait for Docker to fully initialize
Check system tray for Docker icon
Verify Docker daemon is running: ps aux | grep docker

Containers Won't Start

Symptom: Containers exit immediately after docker compose up

Diagnosis:

# Check container logs
docker compose logs

# Check specific service
docker compose logs otel-collector
docker compose logs prometheus

Common Causes:

1. OTEL Collector Configuration Error

# Check for errors
docker compose logs otel-collector | grep -i error

# Common issues:
# - Deprecated logging exporter
# - Deprecated 'address' field in telemetry.metrics

Solution A - Deprecated logging exporter: Update otel-collector-config.yml:

exporters:
  debug:
    verbosity: normal
  # NOT:
  # logging:
  #   loglevel: info

Solution B - Deprecated 'address' field (v0.123.0+):

If logs show: 'address' has invalid keys or similar error:

Update otel-collector-config.yml:

service:
  telemetry:
    metrics:
      level: detailed
      # REMOVE this line (deprecated in v0.123.0+):
      # address: ":8888"

The address field in service.telemetry.metrics is deprecated in newer OTEL Collector versions (v0.123.0+). Remove it; the collector then uses its default internal metrics endpoint.
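
Both deprecations can be spotted before starting the stack. The following is a minimal sketch, assuming the config is plain YAML with one key per line; check_collector_config is a hypothetical helper name, and the match is a simple text search rather than a YAML-aware check, so an unrelated address key elsewhere in the file would also be flagged.

```shell
# Sketch: flag the two deprecated settings in a collector config file.
# check_collector_config is a hypothetical helper, not part of any tool.
check_collector_config() {
  local config="$1" status=0

  # An uncommented "logging:" key indicates the deprecated logging exporter.
  if grep -Eq '^[[:space:]]*logging:' "$config"; then
    echo "deprecated: 'logging' exporter found; use 'debug' instead"
    status=1
  fi

  # An uncommented "address:" key indicates the deprecated telemetry field.
  if grep -Eq '^[[:space:]]*address:' "$config"; then
    echo "deprecated: 'address' field found; remove it"
    status=1
  fi

  [ "$status" -eq 0 ] && echo "no deprecated settings found"
  return "$status"
}

# Usage: check_collector_config otel-collector-config.yml
```

Commented-out lines are ignored, so a config that keeps the old keys as comments still passes.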

2. Port Already in Use

# Check which ports are in use
lsof -i :3000  # Grafana
lsof -i :4317  # OTEL gRPC
lsof -i :4318  # OTEL HTTP
lsof -i :8889  # OTEL Prometheus exporter
lsof -i :9090  # Prometheus
lsof -i :3100  # Loki

Solution:

Stop conflicting service
Or change port in docker-compose.yml
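
The port checks can be run in one pass. This is a sketch only: port_service and check_ports are hypothetical helper names, and a missing lsof binary is reported the same way as a free port.

```shell
# Map each port used by the stack to its service, per the list above.
port_service() {
  case "$1" in
    3000) echo "Grafana" ;;
    4317) echo "OTEL gRPC" ;;
    4318) echo "OTEL HTTP" ;;
    8889) echo "OTEL Prometheus exporter" ;;
    9090) echo "Prometheus" ;;
    3100) echo "Loki" ;;
    *)    echo "unknown" ;;
  esac
}

# Report which of the stack's ports already have a listener.
check_ports() {
  local port
  for port in 3000 4317 4318 8889 9090 3100; do
    # -sTCP:LISTEN restricts lsof to listening sockets; -t prints PIDs only.
    if lsof -i ":$port" -sTCP:LISTEN -t >/dev/null 2>&1; then
      echo "port $port ($(port_service "$port")) is in use"
    else
      echo "port $port ($(port_service "$port")) is free"
    fi
  done
}

# Usage: check_ports
```

Run it with the stack stopped; while the containers are up, Docker itself holds these ports and every line reads "in use".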

3. Volume Permission Issues

# Check volume permissions
docker volume ls
docker volume inspect claude-telemetry_prometheus-data

Solution:

# Remove and recreate volumes
docker compose down -v
docker compose up -d

Containers Keep Restarting

Symptom: Container status shows "Restarting"

Diagnosis:

docker compose ps
docker compose logs --tail=50 <service-name>

Solutions:

Check memory limits: Increase limit_mib under memory_limiter in the OTEL config
Check disk space: df -h
Check for configuration errors in logs
Restart Docker Desktop

Claude Code Settings Issues

🚨 CRITICAL: Telemetry Not Sending (Most Common Issue)

Symptom: No metrics appearing in Prometheus after Claude Code restart

ROOT CAUSE (90% of cases): Missing required exporter environment variables

Even when CLAUDE_CODE_ENABLE_TELEMETRY=1 is set, telemetry is not sent unless the exporters are configured explicitly. This is the most common issue.

Diagnosis Checklist:

1. Check REQUIRED exporters (MOST IMPORTANT):

jq '.env.OTEL_METRICS_EXPORTER' ~/.claude/settings.json
# Must return: "otlp" (NOT null, NOT missing)

jq '.env.OTEL_LOGS_EXPORTER' ~/.claude/settings.json
# Should return: "otlp" (recommended for event tracking)

If either returns null or is missing, this is your problem!

2. Verify telemetry is enabled:

jq '.env.CLAUDE_CODE_ENABLE_TELEMETRY' ~/.claude/settings.json
# Should return: "1"

3. Check OTEL endpoint:

jq '.env.OTEL_EXPORTER_OTLP_ENDPOINT' ~/.claude/settings.json
# Should return: "http://localhost:4317" (for local setup)

4. Verify JSON is valid:

jq empty ~/.claude/settings.json
# No output = valid JSON

5. Check if Claude Code was restarted:

# Telemetry config only loads at startup!
# Must quit and restart Claude Code completely

6. Test OTEL endpoint connectivity:

nc -zv localhost 4317
# Should show: Connection to localhost port 4317 [tcp/*] succeeded!
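
The settings checks in this list can be combined into one pass. A minimal sketch, assuming jq is installed and the local gRPC endpoint is the intended target; check_telemetry_settings is a hypothetical helper name, and it treats OTEL_LOGS_EXPORTER as required even though the checklist only recommends it.

```shell
# Sketch: run the settings checks against a settings file (requires jq).
# check_telemetry_settings is a hypothetical helper.
check_telemetry_settings() {
  local settings="${1:-$HOME/.claude/settings.json}" failed=0 key want got

  if ! jq empty "$settings" >/dev/null 2>&1; then
    echo "FAIL: $settings is missing or not valid JSON"
    return 1
  fi

  # Expected values taken from the checklist (local gRPC setup).
  while IFS='=' read -r key want; do
    got=$(jq -r --arg k "$key" '.env[$k] // "<unset>"' "$settings")
    if [ "$got" = "$want" ]; then
      echo "ok:   $key=$got"
    else
      echo "FAIL: $key is $got, expected $want"
      failed=1
    fi
  done <<'EOF'
OTEL_METRICS_EXPORTER=otlp
OTEL_LOGS_EXPORTER=otlp
CLAUDE_CODE_ENABLE_TELEMETRY=1
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
EOF

  return "$failed"
}

# Usage: check_telemetry_settings            # checks ~/.claude/settings.json
```

For an enterprise endpoint the last expected value would differ, so a FAIL on that line is informational rather than an error.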

Solutions:

If exporters are missing (MOST COMMON):

Add these REQUIRED settings to ~/.claude/settings.json:

{
  "env": {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "OTEL_METRICS_EXPORTER": "otlp",
    "OTEL_LOGS_EXPORTER": "otlp",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc"
  }
}

Then you MUST restart Claude Code (settings only load at startup).
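
If settings.json already holds other configuration, the keys can be merged in with jq instead of edited by hand. This is a sketch under assumptions: enable_telemetry is a hypothetical helper name, jq must be installed, and the values are the local gRPC ones shown above.

```shell
# Sketch: merge the required telemetry keys into an existing settings file
# without discarding other settings. enable_telemetry is a hypothetical helper.
enable_telemetry() {
  local settings="${1:-$HOME/.claude/settings.json}" tmp

  # Start from an empty object if the file does not exist yet.
  [ -f "$settings" ] || echo '{}' > "$settings"

  # Keep a backup so a bad edit can be rolled back.
  cp "$settings" "$settings.backup"

  tmp=$(mktemp)
  # The right-hand object wins on key conflicts; other env keys are preserved.
  if jq '.env = ((.env // {}) + {
        CLAUDE_CODE_ENABLE_TELEMETRY: "1",
        OTEL_METRICS_EXPORTER: "otlp",
        OTEL_LOGS_EXPORTER: "otlp",
        OTEL_EXPORTER_OTLP_ENDPOINT: "http://localhost:4317",
        OTEL_EXPORTER_OTLP_PROTOCOL: "grpc"
      })' "$settings" > "$tmp"; then
    mv "$tmp" "$settings"
  else
    rm -f "$tmp"
    echo "settings file is not valid JSON; left unchanged" >&2
    return 1
  fi
}

# Usage: enable_telemetry                    # edits ~/.claude/settings.json
```

Writing to a temporary file first means an invalid settings.json is never overwritten with partial output.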

If endpoint unreachable:

Verify OTEL Collector container is running
Check firewall settings
Try HTTP endpoint instead: http://localhost:4318

If still no data:

Check OTEL Collector logs for incoming connections
Verify Claude Code is running (not just idle)
Wait 60 seconds (default export interval)

Settings.json Syntax Errors

Symptom: Claude Code won't start or shows errors

Diagnosis:

# Validate JSON
jq empty ~/.claude/settings.json

# Pretty-print to find issues
jq . ~/.claude/settings.json

Common Issues:

Missing commas between properties
Trailing commas before closing braces
Unescaped quotes in strings
Incorrect nesting

Solution:

# Restore backup
cp ~/.claude/settings.json.backup ~/.claude/settings.json

# Or fix JSON manually with editor

Grafana Issues

Can't Access Grafana

Symptom: localhost:3000 doesn't load

Diagnosis:

# Check if Grafana is running
docker ps | grep grafana

# Check Grafana logs
docker compose logs grafana

# Check port availability
lsof -i :3000

Solutions:

Verify container is running: docker compose up -d grafana
Wait 30 seconds for Grafana to initialize
Try http://127.0.0.1:3000 instead
Check Docker network: docker network inspect claude-telemetry

Dashboard Shows "Datasource Not Found"

Symptom: Dashboard panels show "datasource prometheus not found"

Cause: The dashboard has a hardcoded datasource UID that doesn't match your Grafana instance

Diagnosis:

Go to: http://localhost:3000/connections/datasources
Click on Prometheus datasource
Note the UID from URL (e.g., PBFA97CFB590B2093)

Solution:

# Get your datasource UID
DATASOURCE_UID=$(curl -s -u admin:admin http://localhost:3000/api/datasources | jq -r '.[] | select(.type=="prometheus") | .uid')

echo "Your Prometheus datasource UID: $DATASOURCE_UID"

# Update dashboard JSON
cd ~/.claude/telemetry/dashboards
sed "s/PBFA97CFB590B2093/$DATASOURCE_UID/g" claude-code-overview.json > claude-code-overview-fixed.json

# Re-import the fixed dashboard
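
As an alternative to importing through the Grafana UI, the fixed file can be pushed through Grafana's dashboard HTTP API. A hedged sketch: dashboard_payload and import_dashboard are hypothetical helper names, the admin:admin login is assumed from the command above, and jq is required.

```shell
# Wrap a dashboard JSON file in the payload shape that Grafana's
# POST /api/dashboards/db endpoint expects.
dashboard_payload() {
  # id is nulled so Grafana matches on uid instead of a foreign numeric id.
  jq '{dashboard: (. + {id: null}), overwrite: true}' "$1"
}

# Upload the dashboard to the local Grafana instance.
import_dashboard() {
  dashboard_payload "$1" |
    curl -s -u admin:admin -X POST \
      -H 'Content-Type: application/json' \
      --data-binary @- \
      http://localhost:3000/api/dashboards/db
}

# Usage: import_dashboard claude-code-overview-fixed.json
```

With overwrite set to true, an existing dashboard with the same uid is replaced rather than duplicated.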

Dashboard Shows "No Data"

Symptom: Dashboard loads but all panels show "No data"

Diagnosis Steps:

1. Check Prometheus has data:

# Query Prometheus directly
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code

# Should see metrics like:
# "claude_code_claude_code_session_count_total"
# "claude_code_claude_code_cost_usage_USD_total"

2. Check datasource connection:

Go to: http://localhost:3000/connections/datasources
Click Prometheus
Click "Save & Test"
Should show: "Successfully queried the Prometheus API"

3. Verify metric names in queries:

# Check if metrics use double prefix
curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_session_count_total' | jq .

Solutions:

If metrics don't exist:

Claude Code hasn't sent data yet (wait 60 seconds)
OTEL Collector isn't receiving data (check container logs)
Settings.json wasn't configured correctly

If metrics exist but dashboard shows no data:

Dashboard queries use wrong metric names
Update queries to use double prefix: claude_code_claude_code_*
Check time range (top-right corner of Grafana)

If single prefix metrics exist (claude_code_*): Your setup uses old naming. Update dashboard:

# Replace double prefix with single
sed 's/claude_code_claude_code_/claude_code_/g' dashboard.json > dashboard-fixed.json
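
To decide which of the fixes above applies, the naming scheme can be detected from the metric list. A small sketch; detect_prefix is a hypothetical helper name that reads metric names on stdin.

```shell
# Read metric names on stdin and report which naming scheme is present.
# detect_prefix is a hypothetical helper.
detect_prefix() {
  local names
  names=$(cat)
  if printf '%s\n' "$names" | grep -q 'claude_code_claude_code_'; then
    echo "double"
  elif printf '%s\n' "$names" | grep -q 'claude_code_'; then
    echo "single"
  else
    echo "none"
  fi
}

# Usage against a running Prometheus:
#   curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq -r '.data[]' | detect_prefix
```

A result of "none" means no Claude Code metrics have reached Prometheus at all, which points back to the settings and collector checks rather than the dashboard.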

Prometheus Issues

Prometheus Shows No Targets

Symptom: Prometheus UI (localhost:9090) → Status → Targets shows no targets or DOWN status

Diagnosis:

# Check Prometheus config
cat ~/.claude/telemetry/prometheus.yml

# Check if OTEL Collector is reachable from Prometheus
docker exec claude-prometheus ping -c 3 otel-collector

Solutions:

Verify prometheus.yml has correct scrape_configs
Ensure OTEL Collector is running
Check Docker network connectivity
Restart Prometheus: docker compose restart prometheus
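
Target health can also be read from the command line instead of the UI. A minimal sketch using the Prometheus targets API; summarize_targets is a hypothetical helper name and jq is assumed to be installed.

```shell
# Summarize scrape targets from the JSON returned by /api/v1/targets (stdin).
# Prints one tab-separated line per target: job, health, last error.
summarize_targets() {
  jq -r '.data.activeTargets[] | "\(.labels.job)\t\(.health)\t\(.lastError)"'
}

# Usage: curl -s http://localhost:9090/api/v1/targets | summarize_targets
```

No output at all means Prometheus has no active targets, which usually points at scrape_configs in prometheus.yml.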

Prometheus Can't Scrape OTEL Collector

Symptom: Target shows as DOWN with error "context deadline exceeded"

Diagnosis:

# Check if OTEL Collector is exposing metrics
curl http://localhost:8889/metrics

# Check OTEL Collector logs
docker compose logs otel-collector

Solutions:

Verify OTEL Collector prometheus exporter is configured
Check port 8889 is exposed in docker-compose.yml
Restart OTEL Collector: docker compose restart otel-collector

Metric Issues

Metrics Have Double Prefix

Symptom: Metrics are named claude_code_claude_code_* instead of claude_code_*

Explanation: This is expected behavior with the current OTEL Collector configuration:

First claude_code = Prometheus exporter namespace
Second claude_code = Original metric name

Solutions:

Option 1: Accept it (Recommended)

Update dashboard queries to use double prefix
This is the standard configuration

Option 2: Remove namespace prefix Update otel-collector-config.yml:

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: ""  # Remove namespace

Then restart: docker compose restart otel-collector

Old Metrics Still Showing

Symptom: After changing configuration, old metrics still appear

Cause: Prometheus retains metrics until the retention period expires

Solutions:

Quick fix: Delete Prometheus data:

docker compose down
docker volume rm claude-telemetry_prometheus-data
docker compose up -d

Proper fix: Wait for retention:

Default retention is 15 days
Old metrics will automatically disappear
New metrics will coexist temporarily

Network Issues

Can't Reach OTEL Endpoint from Claude Code

Symptom: Claude Code can't connect to localhost:4317

Diagnosis:

# Test gRPC endpoint
nc -zv localhost 4317

# Test HTTP endpoint
curl -v -H 'Content-Type: application/json' http://localhost:4318/v1/metrics -d '{}'

Solutions:

If connection refused:

Check OTEL Collector is running
Verify ports are exposed in docker-compose.yml
Check firewall/antivirus blocking localhost connections

If timeout:

Increase export timeout in settings.json
Try HTTP protocol instead of gRPC

macOS-specific:

Use http://host.docker.internal:4317 instead of localhost:4317
Or use bridge network mode

Enterprise Endpoint Unreachable

Symptom: Can't connect to company OTEL endpoint

Diagnosis:

# Test connectivity
ping otel.company.com

# Test port
nc -zv otel.company.com 4317

# Test with VPN
# (Ensure corporate VPN is connected)

Solutions:

Connect to corporate VPN
Check firewall allows outbound connections
Verify endpoint URL is correct
Try HTTP endpoint (port 4318) instead of gRPC
Contact platform team to verify endpoint is accessible
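
The port test can be driven from whatever endpoint is actually configured, local or enterprise. This is a sketch only: parse_endpoint and probe_endpoint are hypothetical helper names, the URL is assumed to carry an explicit port and no IPv6 literal, and jq and nc must be installed.

```shell
# Split an OTLP endpoint URL into "host port" using parameter expansion.
# parse_endpoint is a hypothetical helper.
parse_endpoint() {
  local url="$1" hostport
  hostport="${url#*://}"      # drop the scheme
  hostport="${hostport%%/*}"  # drop any path
  echo "${hostport%:*} ${hostport##*:}"
}

# Probe the endpoint configured in settings.json.
probe_endpoint() {
  local settings="${1:-$HOME/.claude/settings.json}" url hostport host port
  url=$(jq -r '.env.OTEL_EXPORTER_OTLP_ENDPOINT // empty' "$settings")
  [ -n "$url" ] || { echo "no endpoint configured"; return 1; }
  hostport=$(parse_endpoint "$url")
  host="${hostport% *}"
  port="${hostport#* }"
  # -z only checks that the port accepts connections; -w 5 caps the wait.
  nc -z -w 5 "$host" "$port" && echo "reachable: $host:$port"
}

# Usage: probe_endpoint
```

A successful probe only shows that the TCP port is open; it does not confirm that the collector accepts the data.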

Performance Issues

High Memory Usage

Symptom: OTEL Collector or Prometheus using excessive memory

Diagnosis:

# Check container resource usage
docker stats

# Check Prometheus TSDB size
du -sh ~/.claude/telemetry/prometheus-data

Solutions:

OTEL Collector: Reduce limit_mib under memory_limiter in otel-collector-config.yml:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 256  # Reduce from 512

Prometheus: Reduce retention:

command:
  - '--storage.tsdb.retention.time=7d'  # Reduce from 15d
  - '--storage.tsdb.retention.size=1GB'

Slow Grafana Dashboards

Symptom: Dashboards take a long time to load or time out

Diagnosis:

# Check query performance in Prometheus
# Go to: http://localhost:9090/graph
# Run expensive queries like: sum by (account_uuid, model, type) (...)

Solutions:

Reduce dashboard time range (use 6h instead of 7d)
Increase dashboard refresh interval (1m → 5m)
Use recording rules for complex queries
Reduce number of panels
Use simpler aggregations

Data Quality Issues

Unexpected Cost Values

Symptom: Cost metrics seem incorrect

Diagnosis:

# Check raw cost values
curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_cost_usage_USD_total' | jq .

# Check token usage
curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_token_usage_tokens_total' | jq .

Causes:

Cost is a cumulative counter (not reset between sessions)
Dashboard may be using wrong time range
Model pricing may have changed

Solutions:

Use increase(claude_code_claude_code_cost_usage_USD_total[24h]) rather than raw counter values
Verify pricing in metrics reference
Check Claude Code version (pricing may vary)
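
Cost over a window, rather than the raw cumulative counter, can be queried directly. A hedged sketch: cost_increase_query and query_cost are hypothetical helper names, the double-prefix metric name from the diagnosis above is assumed, and summing across all label combinations is a choice made here for illustration.

```shell
# Build the PromQL expression for cost accrued over a window (default 24h).
cost_increase_query() {
  echo "sum(increase(claude_code_claude_code_cost_usage_USD_total[${1:-24h}]))"
}

# Run it against the local Prometheus. -G with --data-urlencode handles
# the brackets and parentheses in the expression.
query_cost() {
  curl -s -G 'http://localhost:9090/api/v1/query' \
    --data-urlencode "query=$(cost_increase_query "${1:-24h}")" |
    jq -r '.data.result[0].value[1]'
}

# Usage: query_cost 7d
```

Because increase() extrapolates between samples, the result can be fractional even when the underlying cost changed in discrete steps.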

Missing Sessions

Symptom: Some Claude Code sessions not recorded

Causes:

Claude Code wasn't restarted after settings update
OTEL Collector was down during session
Export interval hadn't elapsed yet (60 seconds default)
Network issue prevented export

Solutions:

Always restart Claude Code after settings changes
Monitor OTEL Collector uptime
Check OTEL Collector logs for export errors
Reduce the export interval if near-real-time data is needed

Getting Help

Collect Debug Information

When asking for help, provide:

# 1. Container status
docker compose ps

# 2. Container logs (last 50 lines)
docker compose logs --tail=50

# 3. Configuration files
cat ~/.claude/telemetry/otel-collector-config.yml
cat ~/.claude/telemetry/prometheus.yml

# 4. Claude Code settings (redact sensitive info!)
jq '.env | with_entries(select(.key | startswith("OTEL_")))' ~/.claude/settings.json

# 5. Prometheus metrics list
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code

# 6. System info
docker --version
docker compose version
uname -a
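
The commands above can be wrapped so that one failing step does not stop the rest of the collection. A minimal sketch; section and collect_debug are hypothetical helper names, and the output file name in the usage line is only an example.

```shell
# Run one diagnostic command under a labelled header, continuing on failure.
# section and collect_debug are hypothetical helpers.
section() {
  local title="$1"; shift
  echo "=== $title ==="
  "$@" 2>&1 || echo "(command failed: $*)"
  echo
}

collect_debug() {
  section "Container status" docker compose ps
  section "Container logs" docker compose logs --tail=50
  section "Collector config" cat "$HOME/.claude/telemetry/otel-collector-config.yml"
  section "Prometheus config" cat "$HOME/.claude/telemetry/prometheus.yml"
  section "Claude Code OTEL settings" \
    jq '.env | with_entries(select(.key | startswith("OTEL_")))' "$HOME/.claude/settings.json"
  section "Docker version" docker --version
  section "Compose version" docker compose version
  section "System" uname -a
}

# Usage (run from the directory containing docker-compose.yml):
#   collect_debug > claude-telemetry-debug.txt
```

Review the resulting file for tokens or internal hostnames before sharing it.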

Enable Debug Logging

OTEL Collector:

exporters:
  debug:
    verbosity: detailed  # Change from 'normal'

service:
  telemetry:
    logs:
      level: debug  # Change from 'info'

Claude Code: Add to settings.json:

"env": {
  "OTEL_LOG_LEVEL": "debug"
}

Then check logs:

docker compose logs -f otel-collector

Additional Resources

OTEL Collector Docs: https://opentelemetry.io/docs/collector/
Prometheus Troubleshooting: https://prometheus.io/docs/prometheus/latest/troubleshooting/
Grafana Troubleshooting: https://grafana.com/docs/grafana/latest/troubleshooting/
Docker Compose Docs: https://docs.docker.com/compose/
