659 lines
15 KiB
Markdown
659 lines
15 KiB
Markdown
# Troubleshooting Guide
|
|
|
|
Common issues and solutions for Claude Code OpenTelemetry setup.
|
|
|
|
---
|
|
|
|
## Container Issues
|
|
|
|
### Docker Not Running
|
|
|
|
**Symptom:** `Cannot connect to the Docker daemon`
|
|
|
|
**Diagnosis:**
|
|
```bash
|
|
docker info
|
|
```
|
|
|
|
**Solutions:**
|
|
1. Start Docker Desktop application
|
|
2. Wait for Docker to fully initialize
|
|
3. Check system tray for Docker icon
|
|
4. Verify Docker daemon is running: `ps aux | grep docker`
|
|
|
|
---
|
|
|
|
### Containers Won't Start
|
|
|
|
**Symptom:** Containers exit immediately after `docker compose up`
|
|
|
|
**Diagnosis:**
|
|
```bash
|
|
# Check container logs
|
|
docker compose logs
|
|
|
|
# Check specific service
|
|
docker compose logs otel-collector
|
|
docker compose logs prometheus
|
|
```
|
|
|
|
**Common Causes:**
|
|
|
|
**1. OTEL Collector Configuration Error**
|
|
```bash
|
|
# Check for errors
|
|
docker compose logs otel-collector | grep -i error
|
|
|
|
# Common issues:
|
|
# - Deprecated logging exporter
|
|
# - Deprecated 'address' field in telemetry.metrics
|
|
```
|
|
|
|
**Solution A - Deprecated logging exporter:**
|
|
Update `otel-collector-config.yml`:
|
|
```yaml
|
|
exporters:
|
|
debug:
|
|
verbosity: normal
|
|
# NOT:
|
|
# logging:
|
|
# loglevel: info
|
|
```
|
|
|
|
**Solution B - Deprecated 'address' field (v0.123.0+):**
|
|
|
|
If logs show: `'address' has invalid keys` or similar error:
|
|
|
|
Update `otel-collector-config.yml`:
|
|
```yaml
|
|
service:
|
|
telemetry:
|
|
metrics:
|
|
level: detailed
|
|
# REMOVE this line (deprecated in v0.123.0+):
|
|
# address: ":8888"
|
|
```
|
|
|
|
The `address` field in `service.telemetry.metrics` is deprecated in newer OTEL Collector versions. Simply remove it - the collector will use default internal metrics endpoint.
|
|
|
|
**2. Port Already in Use**
|
|
```bash
|
|
# Check which ports are in use
|
|
lsof -i :3000 # Grafana
|
|
lsof -i :4317 # OTEL gRPC
|
|
lsof -i :4318 # OTEL HTTP
|
|
lsof -i :8889 # OTEL Prometheus exporter
|
|
lsof -i :9090 # Prometheus
|
|
lsof -i :3100 # Loki
|
|
```
|
|
|
|
**Solution:**
|
|
- Stop conflicting service
|
|
- Or change port in docker-compose.yml
|
|
|
|
**3. Volume Permission Issues**
|
|
```bash
|
|
# Check volume permissions
|
|
docker volume ls
|
|
docker volume inspect claude-telemetry_prometheus-data
|
|
```
|
|
|
|
**Solution:**
|
|
```bash
|
|
# Remove and recreate volumes
|
|
docker compose down -v
|
|
docker compose up -d
|
|
```
|
|
|
|
---
|
|
|
|
### Containers Keep Restarting
|
|
|
|
**Symptom:** Container status shows "Restarting"
|
|
|
|
**Diagnosis:**
|
|
```bash
|
|
docker compose ps
|
|
docker compose logs --tail=50 <service-name>
|
|
```
|
|
|
|
**Solutions:**
|
|
1. Check memory limits: Increase memory_limiter in OTEL config
|
|
2. Check disk space: `df -h`
|
|
3. Check for configuration errors in logs
|
|
4. Restart Docker Desktop
|
|
|
|
---
|
|
|
|
## Claude Code Settings Issues
|
|
|
|
### 🚨 CRITICAL: Telemetry Not Sending (Most Common Issue)
|
|
|
|
**Symptom:** No metrics appearing in Prometheus after Claude Code restart
|
|
|
|
**ROOT CAUSE (90% of cases):** Missing required exporter environment variables
|
|
|
|
Even when `CLAUDE_CODE_ENABLE_TELEMETRY=1` is set, telemetry **will not send** without explicit exporter configuration. This is the #1 most common issue.
|
|
|
|
**Diagnosis Checklist:**
|
|
|
|
**1. Check REQUIRED exporters (MOST IMPORTANT):**
|
|
```bash
|
|
jq '.env.OTEL_METRICS_EXPORTER' ~/.claude/settings.json
|
|
# Must return: "otlp" (NOT null, NOT missing)
|
|
|
|
jq '.env.OTEL_LOGS_EXPORTER' ~/.claude/settings.json
|
|
# Should return: "otlp" (recommended for event tracking)
|
|
```
|
|
|
|
**If either returns `null` or is missing, this is your problem!**
|
|
|
|
**2. Verify telemetry is enabled:**
|
|
```bash
|
|
jq '.env.CLAUDE_CODE_ENABLE_TELEMETRY' ~/.claude/settings.json
|
|
# Should return: "1"
|
|
```
|
|
|
|
**3. Check OTEL endpoint:**
|
|
```bash
|
|
jq '.env.OTEL_EXPORTER_OTLP_ENDPOINT' ~/.claude/settings.json
|
|
# Should return: "http://localhost:4317" (for local setup)
|
|
```
|
|
|
|
**3. Verify JSON is valid:**
|
|
```bash
|
|
jq empty ~/.claude/settings.json
|
|
# No output = valid JSON
|
|
```
|
|
|
|
**4. Check if Claude Code was restarted:**
|
|
```bash
|
|
# Telemetry config only loads at startup!
|
|
# Must quit and restart Claude Code completely
|
|
```
|
|
|
|
**5. Test OTEL endpoint connectivity:**
|
|
```bash
|
|
nc -zv localhost 4317
|
|
# Should show: Connection to localhost port 4317 [tcp/*] succeeded!
|
|
```
|
|
|
|
**Solutions:**
|
|
|
|
**If exporters are missing (MOST COMMON):**
|
|
|
|
Add these REQUIRED settings to ~/.claude/settings.json:
|
|
```json
|
|
{
|
|
"env": {
|
|
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
|
|
"OTEL_METRICS_EXPORTER": "otlp",
|
|
"OTEL_LOGS_EXPORTER": "otlp",
|
|
"OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
|
|
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc"
|
|
}
|
|
}
|
|
```
|
|
|
|
Then **MUST restart Claude Code** (settings only load at startup).
|
|
|
|
**If endpoint unreachable:**
|
|
- Verify OTEL Collector container is running
|
|
- Check firewall settings
|
|
- Try HTTP endpoint instead: `http://localhost:4318`
|
|
|
|
**If still no data:**
|
|
- Check OTEL Collector logs for incoming connections
|
|
- Verify Claude Code is running (not just idle)
|
|
- Wait 60 seconds (default export interval)
|
|
|
|
---
|
|
|
|
### Settings.json Syntax Errors
|
|
|
|
**Symptom:** Claude Code won't start or shows errors
|
|
|
|
**Diagnosis:**
|
|
```bash
|
|
# Validate JSON
|
|
jq empty ~/.claude/settings.json
|
|
|
|
# Pretty-print to find issues
|
|
jq . ~/.claude/settings.json
|
|
```
|
|
|
|
**Common Issues:**
|
|
- Missing commas between properties
|
|
- Trailing commas before closing braces
|
|
- Unescaped quotes in strings
|
|
- Incorrect nesting
|
|
|
|
**Solution:**
|
|
```bash
|
|
# Restore backup
|
|
cp ~/.claude/settings.json.backup ~/.claude/settings.json
|
|
|
|
# Or fix JSON manually with editor
|
|
```
|
|
|
|
---
|
|
|
|
## Grafana Issues
|
|
|
|
### Can't Access Grafana
|
|
|
|
**Symptom:** `localhost:3000` doesn't load
|
|
|
|
**Diagnosis:**
|
|
```bash
|
|
# Check if Grafana is running
|
|
docker ps | grep grafana
|
|
|
|
# Check Grafana logs
|
|
docker compose logs grafana
|
|
|
|
# Check port availability
|
|
lsof -i :3000
|
|
```
|
|
|
|
**Solutions:**
|
|
1. Verify container is running: `docker compose up -d grafana`
|
|
2. Wait 30 seconds for Grafana to initialize
|
|
3. Try `http://127.0.0.1:3000` instead
|
|
4. Check Docker network: `docker network inspect claude-telemetry`
|
|
|
|
---
|
|
|
|
### Dashboard Shows "Datasource Not Found"
|
|
|
|
**Symptom:** Dashboard panels show "datasource prometheus not found"
|
|
|
|
**Cause:** Dashboard has hardcoded datasource UID that doesn't match your Grafana instance
|
|
|
|
**Diagnosis:**
|
|
1. Go to: http://localhost:3000/connections/datasources
|
|
2. Click on Prometheus datasource
|
|
3. Note the UID from URL (e.g., `PBFA97CFB590B2093`)
|
|
|
|
**Solution:**
|
|
```bash
|
|
# Get your datasource UID
|
|
DATASOURCE_UID=$(curl -s -u admin:admin http://localhost:3000/api/datasources | jq -r '.[] | select(.type=="prometheus") | .uid')
|
|
|
|
echo "Your Prometheus datasource UID: $DATASOURCE_UID"
|
|
|
|
# Update dashboard JSON
|
|
cd ~/.claude/telemetry/dashboards
|
|
cat claude-code-overview.json | sed "s/PBFA97CFB590B2093/$DATASOURCE_UID/g" > claude-code-overview-fixed.json
|
|
|
|
# Re-import the fixed dashboard
|
|
```
|
|
|
|
---
|
|
|
|
### Dashboard Shows "No Data"
|
|
|
|
**Symptom:** Dashboard loads but all panels show "No data"
|
|
|
|
**Diagnosis Steps:**
|
|
|
|
**1. Check Prometheus has data:**
|
|
```bash
|
|
# Query Prometheus directly
|
|
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code
|
|
|
|
# Should see metrics like:
|
|
# "claude_code_claude_code_session_count_total"
|
|
# "claude_code_claude_code_cost_usage_USD_total"
|
|
```
|
|
|
|
**2. Check datasource connection:**
|
|
- Go to: http://localhost:3000/connections/datasources
|
|
- Click Prometheus
|
|
- Click "Save & Test"
|
|
- Should show: "Successfully queried the Prometheus API"
|
|
|
|
**3. Verify metric names in queries:**
|
|
```bash
|
|
# Check if metrics use double prefix
|
|
curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_session_count_total' | jq .
|
|
```
|
|
|
|
**Solutions:**
|
|
|
|
**If metrics don't exist:**
|
|
- Claude Code hasn't sent data yet (wait 60 seconds)
|
|
- OTEL Collector isn't receiving data (check container logs)
|
|
- Settings.json wasn't configured correctly
|
|
|
|
**If metrics exist but dashboard shows no data:**
|
|
- Dashboard queries use wrong metric names
|
|
- Update queries to use double prefix: `claude_code_claude_code_*`
|
|
- Check time range (top-right corner of Grafana)
|
|
|
|
**If single prefix metrics exist (`claude_code_*`):**
|
|
Your setup uses old naming. Update dashboard:
|
|
```bash
|
|
# Replace double prefix with single
|
|
sed 's/claude_code_claude_code_/claude_code_/g' dashboard.json > dashboard-fixed.json
|
|
```
|
|
|
|
---
|
|
|
|
## Prometheus Issues
|
|
|
|
### Prometheus Shows No Targets
|
|
|
|
**Symptom:** Prometheus UI (localhost:9090) → Status → Targets shows no targets or DOWN status
|
|
|
|
**Diagnosis:**
|
|
```bash
|
|
# Check Prometheus config
|
|
cat ~/.claude/telemetry/prometheus.yml
|
|
|
|
# Check if OTEL Collector is reachable from Prometheus
|
|
docker exec -it claude-prometheus ping otel-collector
|
|
```
|
|
|
|
**Solutions:**
|
|
1. Verify `prometheus.yml` has correct scrape_configs
|
|
2. Ensure OTEL Collector is running
|
|
3. Check Docker network connectivity
|
|
4. Restart Prometheus: `docker compose restart prometheus`
|
|
|
|
---
|
|
|
|
### Prometheus Can't Scrape OTEL Collector
|
|
|
|
**Symptom:** Target shows as DOWN with error "context deadline exceeded"
|
|
|
|
**Diagnosis:**
|
|
```bash
|
|
# Check if OTEL Collector is exposing metrics
|
|
curl http://localhost:8889/metrics
|
|
|
|
# Check OTEL Collector logs
|
|
docker compose logs otel-collector
|
|
```
|
|
|
|
**Solutions:**
|
|
1. Verify OTEL Collector prometheus exporter is configured
|
|
2. Check port 8889 is exposed in docker-compose.yml
|
|
3. Restart OTEL Collector: `docker compose restart otel-collector`
|
|
|
|
---
|
|
|
|
## Metric Issues
|
|
|
|
### Metrics Have Double Prefix
|
|
|
|
**Symptom:** Metrics are named `claude_code_claude_code_*` instead of `claude_code_*`
|
|
|
|
**Explanation:** This is expected behavior with the current OTEL Collector configuration:
|
|
- First `claude_code` = Prometheus exporter namespace
|
|
- Second `claude_code` = Original metric name
|
|
|
|
**Solutions:**
|
|
|
|
**Option 1: Accept it (Recommended)**
|
|
- Update dashboard queries to use double prefix
|
|
- This is the standard configuration
|
|
|
|
**Option 2: Remove namespace prefix**
|
|
Update `otel-collector-config.yml`:
|
|
```yaml
|
|
exporters:
|
|
prometheus:
|
|
endpoint: "0.0.0.0:8889"
|
|
namespace: "" # Remove namespace
|
|
```
|
|
|
|
Then restart: `docker compose restart otel-collector`
|
|
|
|
---
|
|
|
|
### Old Metrics Still Showing
|
|
|
|
**Symptom:** After changing configuration, old metrics still appear
|
|
|
|
**Cause:** Prometheus retains metrics until retention period expires
|
|
|
|
**Solutions:**
|
|
|
|
**Quick fix: Delete Prometheus data:**
|
|
```bash
|
|
docker compose down
|
|
docker volume rm claude-telemetry_prometheus-data
|
|
docker compose up -d
|
|
```
|
|
|
|
**Proper fix: Wait for retention:**
|
|
- Default retention is 15 days
|
|
- Old metrics will automatically disappear
|
|
- New metrics will coexist temporarily
|
|
|
|
---
|
|
|
|
## Network Issues
|
|
|
|
### Can't Reach OTEL Endpoint from Claude Code
|
|
|
|
**Symptom:** Claude Code can't connect to `localhost:4317`
|
|
|
|
**Diagnosis:**
|
|
```bash
|
|
# Test gRPC endpoint
|
|
nc -zv localhost 4317
|
|
|
|
# Test HTTP endpoint
|
|
curl -v http://localhost:4318/v1/metrics -d '{}'
|
|
```
|
|
|
|
**Solutions:**
|
|
|
|
**If connection refused:**
|
|
1. Check OTEL Collector is running
|
|
2. Verify ports are exposed in docker-compose.yml
|
|
3. Check firewall/antivirus blocking localhost connections
|
|
|
|
**If timeout:**
|
|
1. Increase export timeout in settings.json
|
|
2. Try HTTP protocol instead of gRPC
|
|
|
|
**macOS-specific:**
|
|
- Use `http://host.docker.internal:4317` instead of `localhost:4317`
|
|
- Or use bridge network mode
|
|
|
|
---
|
|
|
|
### Enterprise Endpoint Unreachable
|
|
|
|
**Symptom:** Can't connect to company OTEL endpoint
|
|
|
|
**Diagnosis:**
|
|
```bash
|
|
# Test connectivity
|
|
ping otel.company.com
|
|
|
|
# Test port
|
|
nc -zv otel.company.com 4317
|
|
|
|
# Test with VPN
|
|
# (Ensure corporate VPN is connected)
|
|
```
|
|
|
|
**Solutions:**
|
|
1. Connect to corporate VPN
|
|
2. Check firewall allows outbound connections
|
|
3. Verify endpoint URL is correct
|
|
4. Try HTTP endpoint (port 4318) instead of gRPC
|
|
5. Contact platform team to verify endpoint is accessible
|
|
|
|
---
|
|
|
|
## Performance Issues
|
|
|
|
### High Memory Usage
|
|
|
|
**Symptom:** OTEL Collector or Prometheus using excessive memory
|
|
|
|
**Diagnosis:**
|
|
```bash
|
|
# Check container resource usage
|
|
docker stats
|
|
|
|
# Check Prometheus TSDB size
|
|
du -sh ~/.claude/telemetry/prometheus-data
|
|
```
|
|
|
|
**Solutions:**
|
|
|
|
**OTEL Collector:**
|
|
Reduce memory_limiter in `otel-collector-config.yml`:
|
|
```yaml
|
|
processors:
|
|
memory_limiter:
|
|
check_interval: 1s
|
|
limit_mib: 256 # Reduce from 512
|
|
```
|
|
|
|
**Prometheus:**
|
|
Reduce retention:
|
|
```yaml
|
|
command:
|
|
- '--storage.tsdb.retention.time=7d' # Reduce from 15d
|
|
- '--storage.tsdb.retention.size=1GB'
|
|
```
|
|
|
|
---
|
|
|
|
### Slow Grafana Dashboards
|
|
|
|
**Symptom:** Dashboards take long time to load or timeout
|
|
|
|
**Diagnosis:**
|
|
```bash
|
|
# Check query performance in Prometheus
|
|
# Go to: http://localhost:9090/graph
|
|
# Run expensive queries like: sum by (account_uuid, model, type) (...)
|
|
```
|
|
|
|
**Solutions:**
|
|
1. Reduce dashboard time range (use 6h instead of 7d)
|
|
2. Increase dashboard refresh interval (1m → 5m)
|
|
3. Use recording rules for complex queries
|
|
4. Reduce number of panels
|
|
5. Use simpler aggregations
|
|
|
|
---
|
|
|
|
## Data Quality Issues
|
|
|
|
### Unexpected Cost Values
|
|
|
|
**Symptom:** Cost metrics seem incorrect
|
|
|
|
**Diagnosis:**
|
|
```bash
|
|
# Check raw cost values
|
|
curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_cost_usage_USD_total' | jq .
|
|
|
|
# Check token usage
|
|
curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_token_usage_tokens_total' | jq .
|
|
```
|
|
|
|
**Causes:**
|
|
- Cost is cumulative counter (not reset between sessions)
|
|
- Dashboard may be using wrong time range
|
|
- Model pricing may have changed
|
|
|
|
**Solutions:**
|
|
- Use `increase([24h])` not raw counter values
|
|
- Verify pricing in metrics reference
|
|
- Check Claude Code version (pricing may vary)
|
|
|
|
---
|
|
|
|
### Missing Sessions
|
|
|
|
**Symptom:** Some Claude Code sessions not recorded
|
|
|
|
**Causes:**
|
|
1. Claude Code wasn't restarted after settings update
|
|
2. OTEL Collector was down during session
|
|
3. Export interval hadn't elapsed yet (60 seconds default)
|
|
4. Network issue prevented export
|
|
|
|
**Solutions:**
|
|
- Always restart Claude Code after settings changes
|
|
- Monitor OTEL Collector uptime
|
|
- Check OTEL Collector logs for export errors
|
|
- Reduce export interval if real-time data needed
|
|
|
|
---
|
|
|
|
## Getting Help
|
|
|
|
### Collect Debug Information
|
|
|
|
When asking for help, provide:
|
|
|
|
```bash
|
|
# 1. Container status
|
|
docker compose ps
|
|
|
|
# 2. Container logs (last 50 lines)
|
|
docker compose logs --tail=50
|
|
|
|
# 3. Configuration files
|
|
cat ~/.claude/telemetry/otel-collector-config.yml
|
|
cat ~/.claude/telemetry/prometheus.yml
|
|
|
|
# 4. Claude Code settings (redact sensitive info!)
|
|
jq '.env | with_entries(select(.key | startswith("OTEL_")))' ~/.claude/settings.json
|
|
|
|
# 5. Prometheus metrics list
|
|
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code
|
|
|
|
# 6. System info
|
|
docker --version
|
|
docker compose version
|
|
uname -a
|
|
```
|
|
|
|
### Enable Debug Logging
|
|
|
|
**OTEL Collector:**
|
|
```yaml
|
|
exporters:
|
|
debug:
|
|
verbosity: detailed # Change from 'normal'
|
|
|
|
service:
|
|
telemetry:
|
|
logs:
|
|
level: debug # Change from 'info'
|
|
```
|
|
|
|
**Claude Code:**
|
|
Add to settings.json:
|
|
```json
|
|
"env": {
|
|
"OTEL_LOG_LEVEL": "debug"
|
|
}
|
|
```
|
|
|
|
Then check logs:
|
|
```bash
|
|
docker compose logs -f otel-collector
|
|
```
|
|
|
|
---
|
|
|
|
## Additional Resources
|
|
|
|
- **OTEL Collector Docs:** https://opentelemetry.io/docs/collector/
|
|
- **Prometheus Troubleshooting:** https://prometheus.io/docs/prometheus/latest/troubleshooting/
|
|
- **Grafana Troubleshooting:** https://grafana.com/docs/grafana/latest/troubleshooting/
|
|
- **Docker Compose Docs:** https://docs.docker.com/compose/
|