Initial commit

Zhongwei Li
2025-11-29 18:16:51 +08:00
commit 4e8a12140c
88 changed files with 17078 additions and 0 deletions


@@ -0,0 +1,381 @@
# Claude Code Metrics Reference
Complete reference for all Claude Code OpenTelemetry metrics.
**Important:** All metrics use a double prefix: `claude_code_claude_code_*`
---
## Metric Categories
1. **Usage Metrics** - Session counts, active time
2. **Token Metrics** - Input, output, cached tokens
3. **Cost Metrics** - API costs by model
4. **Productivity Metrics** - LOC, commits, PRs
5. **Error Metrics** - Failures, retries
---
## Usage Metrics
### claude_code_claude_code_session_count_total
**Type:** Counter
**Description:** Total number of Claude Code sessions started
**Labels:**
- `account_uuid` - Anonymous user identifier
- `version` - Claude Code version (e.g., "1.2.3")
**Example Query:**
```promql
# Total sessions across all users
sum(claude_code_claude_code_session_count_total)
# Sessions by version
sum by (version) (claude_code_claude_code_session_count_total)
# New sessions in last 24h
increase(claude_code_claude_code_session_count_total[24h])
```
---
### claude_code_claude_code_active_time_seconds_total
**Type:** Counter
**Description:** Total active time spent in Claude Code sessions (in seconds)
**Labels:**
- `account_uuid` - Anonymous user identifier
- `version` - Claude Code version
**Example Query:**
```promql
# Total active hours
sum(claude_code_claude_code_active_time_seconds_total) / 3600
# Active hours per day
increase(claude_code_claude_code_active_time_seconds_total[24h]) / 3600
# Average session duration
increase(claude_code_claude_code_active_time_seconds_total[24h])
/
increase(claude_code_claude_code_session_count_total[24h])
```
**Note:** "Active time" means time when Claude Code is actively processing or responding to user input.
---
## Token Metrics
### claude_code_claude_code_token_usage_tokens_total
**Type:** Counter
**Description:** Total tokens consumed by Claude Code API calls
**Labels:**
- `type` - Token type: `input`, `output`, `cache_creation`, `cache_read`
- `model` - Model name (e.g., "claude-sonnet-4-5-20250929", "claude-opus-4-20250514")
- `account_uuid` - Anonymous user identifier
- `version` - Claude Code version
**Token Types Explained:**
- **input:** User messages and tool results sent to Claude
- **output:** Claude's responses (text and tool calls)
- **cache_creation:** Tokens written to prompt cache (billed above the base input rate: 1.25x for the default 5-minute cache)
- **cache_read:** Tokens read from prompt cache (billed at 10% of input rate)
**Example Query:**
```promql
# Total tokens by type (24h)
sum by (type) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
# Tokens by model (24h)
sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
# Cache hit rate
sum(increase(claude_code_claude_code_token_usage_tokens_total{type="cache_read"}[24h]))
/
sum(increase(claude_code_claude_code_token_usage_tokens_total{type=~"input|cache_creation|cache_read"}[24h]))
# Token usage rate (per minute)
rate(claude_code_claude_code_token_usage_tokens_total[5m]) * 60
```
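The cache hit rate query above divides cache reads by all input-side tokens. Here is a minimal Python sketch of the same arithmetic, assuming you already have per-type token increases in a dict. The function name `cache_hit_rate` and the sample numbers are illustrative and not part of Claude Code.

```python
def cache_hit_rate(tokens_by_type: dict[str, float]) -> float:
    """Fraction of input-side tokens served from the prompt cache.

    Mirrors the PromQL: cache_read / (input + cache_creation + cache_read).
    Output tokens are deliberately left out of the denominator.
    """
    input_side = ("input", "cache_creation", "cache_read")
    total = sum(tokens_by_type.get(t, 0.0) for t in input_side)
    if total == 0:
        return 0.0  # no input-side traffic in the window; avoid dividing by zero
    return tokens_by_type.get("cache_read", 0.0) / total


# Hypothetical 24h increases per token type
sample = {"input": 2_000, "output": 5_000, "cache_creation": 3_000, "cache_read": 15_000}
print(f"{cache_hit_rate(sample):.0%}")
```

Unlike PromQL, which returns `NaN` when the denominator is zero, this sketch returns `0.0` for an empty window.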
---
## Cost Metrics
### claude_code_claude_code_cost_usage_USD_total
**Type:** Counter
**Description:** Total API costs in USD
**Labels:**
- `model` - Model name
- `account_uuid` - Anonymous user identifier
- `version` - Claude Code version
**Pricing Reference (as of Jan 2025):**
- **Claude Sonnet 4.5:** $3/MTok input, $15/MTok output
- **Claude Opus 4:** $15/MTok input, $75/MTok output
- **Cache read:** 10% of input price
- **Cache write:** 1.25x the input price for the default 5-minute cache (2x for the 1-hour cache)
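To sanity-check the cost metric against token counts, the pricing list can be turned into a small estimator. This is a sketch under stated assumptions: the dictionary keys (`claude-sonnet-4-5`, `claude-opus-4`) are hypothetical shorthand rather than the exact `model` label values, and the cache write multiplier is passed in explicitly because it depends on the caching option in use.

```python
# Prices in USD per million tokens, copied from the pricing list above.
# The keys are hypothetical shorthand, not the exact `model` label values.
PRICES_PER_MTOK = {
    "claude-sonnet-4-5": {"input": 3.0, "output": 15.0},
    "claude-opus-4": {"input": 15.0, "output": 75.0},
}
CACHE_READ_MULTIPLIER = 0.10  # cache reads cost 10% of the input price


def estimate_cost_usd(model_family: str,
                      tokens_by_type: dict[str, float],
                      cache_write_multiplier: float) -> float:
    """Estimate cost in USD from token counts keyed by the `type` label.

    cache_write_multiplier is a required argument (no default) so the
    caller has to state which cache pricing they are assuming.
    """
    price = PRICES_PER_MTOK[model_family]
    per_token_in = price["input"] / 1_000_000
    per_token_out = price["output"] / 1_000_000
    return (
        tokens_by_type.get("input", 0) * per_token_in
        + tokens_by_type.get("output", 0) * per_token_out
        + tokens_by_type.get("cache_read", 0) * per_token_in * CACHE_READ_MULTIPLIER
        + tokens_by_type.get("cache_creation", 0) * per_token_in * cache_write_multiplier
    )
```

For example, `estimate_cost_usd("claude-sonnet-4-5", {"input": 1_000_000, "output": 1_000_000}, cache_write_multiplier=1.25)` comes to roughly 18 USD. Treat the result as a cross-check only; the exported cost metric remains the source of truth.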
**Example Query:**
```promql
# Total cost (24h)
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
# Cost by model (24h)
sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
# Cost per hour
rate(claude_code_claude_code_cost_usage_USD_total[5m]) * 3600
# Average cost per session
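# (sum() both sides: cost carries a `model` label that session_count lacks,
# so an unaggregated division would match no series)
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
/
sum(increase(claude_code_claude_code_session_count_total[24h]))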
# Cumulative cost over time
sum(claude_code_claude_code_cost_usage_USD_total)
```
---
## Productivity Metrics
### claude_code_claude_code_lines_of_code_count_total
**Type:** Counter
**Description:** Total lines of code modified (added + changed + deleted)
**Labels:**
- `type` - Modification type: `added`, `changed`, `deleted`
- `account_uuid` - Anonymous user identifier
- `version` - Claude Code version
**Example Query:**
```promql
# Total LOC modified
sum(claude_code_claude_code_lines_of_code_count_total)
# LOC by type (24h)
sum by (type) (increase(claude_code_claude_code_lines_of_code_count_total[24h]))
# LOC per hour
rate(claude_code_claude_code_lines_of_code_count_total[5m]) * 3600
# Lines per dollar
sum(increase(claude_code_claude_code_lines_of_code_count_total[24h]))
/
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
```
---
### claude_code_claude_code_commit_count_total
**Type:** Counter
**Description:** Total git commits created by Claude Code
**Labels:**
- `account_uuid` - Anonymous user identifier
- `version` - Claude Code version
**Example Query:**
```promql
# Total commits
sum(claude_code_claude_code_commit_count_total)
# Commits per day
increase(claude_code_claude_code_commit_count_total[24h])
# Commits per session
increase(claude_code_claude_code_commit_count_total[24h])
/
increase(claude_code_claude_code_session_count_total[24h])
```
---
### claude_code_claude_code_pr_count_total
**Type:** Counter
**Description:** Total pull requests created by Claude Code
**Labels:**
- `account_uuid` - Anonymous user identifier
- `version` - Claude Code version
**Example Query:**
```promql
# Total PRs
sum(claude_code_claude_code_pr_count_total)
# PRs per week
increase(claude_code_claude_code_pr_count_total[7d])
```
---
## Cardinality and Resource Attributes
### Resource Attributes
All metrics include these resource attributes (configured in settings.json):
```json
"OTEL_RESOURCE_ATTRIBUTES": "environment=local,deployment=poc,team=platform"
```
**Common Attributes:**
- `service.name` = "claude-code" (set by OTEL Collector)
- `environment` - Deployment environment (local, dev, staging, prod)
- `deployment` - Deployment type (poc, enterprise)
- `team` - Team identifier
- `department` - Department identifier
- `project` - Project identifier
**Querying with Resource Attributes:**
```promql
# Filter by environment
sum(claude_code_claude_code_cost_usage_USD_total{environment="prod"})
# Aggregate by team
sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
```
---
## Metric Naming Convention
**Format:** `claude_code_claude_code_<metric_name>_<unit>_<type>`
**Why double prefix?**
- First `claude_code` comes from Prometheus exporter namespace in OTEL Collector config
- Second `claude_code` comes from the original metric name in Claude Code
- This is expected behavior with the current configuration
**Components:**
- `<metric_name>`: Descriptive name (e.g., `token_usage`, `cost_usage`)
- `<unit>`: Unit of measurement (e.g., `tokens`, `USD`, `seconds`, `count`)
- `<type>`: Metric type (always `total` for counters)
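The naming format can be checked mechanically. The sketch below splits an exported counter name into its `<metric_name>` and `<unit>` components, assuming the double prefix and the `_total` suffix described above; `split_metric_name` is an illustrative helper, not an existing API.

```python
PREFIX = "claude_code_claude_code_"  # exporter namespace + original metric prefix
SUFFIX = "_total"                    # type suffix used by every counter


def split_metric_name(name: str) -> tuple[str, str]:
    """Split an exported counter name into (metric_name, unit)."""
    if not (name.startswith(PREFIX) and name.endswith(SUFFIX)):
        raise ValueError(f"not a double-prefixed counter: {name!r}")
    core = name[len(PREFIX):-len(SUFFIX)]
    # The unit is the last underscore-separated component of what remains.
    metric_name, _, unit = core.rpartition("_")
    return metric_name, unit


print(split_metric_name("claude_code_claude_code_token_usage_tokens_total"))
print(split_metric_name("claude_code_claude_code_cost_usage_USD_total"))
```

This relies on every unit being a single word (`tokens`, `USD`, `seconds`, `count`), which holds for the metrics listed in this reference.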
---
## Querying Best Practices
### Use increase() for Counters
Counters are cumulative, so use `increase()` for time windows:
```promql
# ✅ Correct - Shows cost in last 24h
increase(claude_code_claude_code_cost_usage_USD_total[24h])
# ❌ Wrong - Shows cumulative cost since start
claude_code_claude_code_cost_usage_USD_total
```
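To see why a raw counter value is misleading, it helps to look at what `increase()` computes. The following is a simplified Python sketch of the idea only: it sums the growth between consecutive samples and treats any drop as a counter reset. Prometheus additionally extrapolates to the window boundaries, which this sketch does not attempt.

```python
def counter_increase(samples: list[float]) -> float:
    """Total growth of a counter across consecutive samples.

    A drop between two samples is treated as a counter reset (for example
    a process restart), so the post-reset value counts as growth from zero.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


# Hypothetical cost samples in USD; the counter resets after the second sample.
print(counter_increase([10.0, 15.0, 3.0, 8.0]))
```

The last raw sample here is 8.0, yet the counter grew by 13.0 over the window, which is the figure a dashboard should show.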
### Use rate() for Rates
Calculate per-second rate, then multiply for desired unit:
```promql
# Cost per hour
rate(claude_code_claude_code_cost_usage_USD_total[5m]) * 3600
# Tokens per minute
rate(claude_code_claude_code_token_usage_tokens_total[5m]) * 60
```
### Aggregate with sum()
Combine metrics across labels:
```promql
# Total tokens (all types)
sum(claude_code_claude_code_token_usage_tokens_total)
# Total tokens by type
sum by (type) (claude_code_claude_code_token_usage_tokens_total)
# Total cost across all models
sum(claude_code_claude_code_cost_usage_USD_total)
```
---
## Example Dashboards
### Executive Summary (single values)
```promql
# Total cost this month
sum(increase(claude_code_claude_code_cost_usage_USD_total[30d]))
# Total LOC this month
sum(increase(claude_code_claude_code_lines_of_code_count_total[30d]))
# Active users (unique account_uuids)
count(count by (account_uuid) (claude_code_claude_code_session_count_total))
# Average session cost
sum(increase(claude_code_claude_code_cost_usage_USD_total[30d]))
/
sum(increase(claude_code_claude_code_session_count_total[30d]))
```
### Cost Tracking
```promql
# Daily cost trend
sum(increase(claude_code_claude_code_cost_usage_USD_total[1d]))
# Cost by model (pie chart)
sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[7d]))
# Cost by team (bar chart)
sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[7d]))
```
### Productivity Tracking
```promql
# LOC per day
sum(increase(claude_code_claude_code_lines_of_code_count_total[1d]))
# Commits per week
sum(increase(claude_code_claude_code_commit_count_total[7d]))
# Efficiency: LOC per dollar
sum(increase(claude_code_claude_code_lines_of_code_count_total[30d]))
/
sum(increase(claude_code_claude_code_cost_usage_USD_total[30d]))
```
---
## Retention and Storage
**Default Prometheus Retention:** 15 days
**Adjust retention:**
```yaml
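# In docker-compose.yml (retention is set with command-line flags, not in prometheus.yml)
# Keep any flags you already pass (e.g. --config.file) in the same list
command:
  - '--storage.tsdb.retention.time=90d'
  - '--storage.tsdb.retention.size=50GB'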
```
**Disk usage estimation:**
- ~1-2 MB per day per active user
- ~30-60 MB per month per active user
- ~360-720 MB per year per active user
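The per-user figures scale linearly, so a rough sizing estimate is a one-line multiplication. A minimal sketch, assuming the 1-2 MB per user per day range above holds for your workload (`estimate_storage_mb` is an illustrative name):

```python
def estimate_storage_mb(active_users: int,
                        retention_days: int,
                        mb_per_user_day: tuple[float, float] = (1.0, 2.0)) -> tuple[float, float]:
    """Return a (low, high) disk usage estimate in MB for the retention window."""
    low, high = mb_per_user_day
    user_days = active_users * retention_days
    return (user_days * low, user_days * high)


# 10 active users kept for 90 days
low_mb, high_mb = estimate_storage_mb(active_users=10, retention_days=90)
print(f"{low_mb:.0f}-{high_mb:.0f} MB")
```

Comparing the high estimate against `--storage.tsdb.retention.size` shows whether the size limit or the time limit will trigger first.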
**For long-term storage:** Consider using Prometheus remote write to send data to a time-series database like VictoriaMetrics, Cortex, or Thanos.
---
## Additional Resources
- **Official OTEL Docs:** https://opentelemetry.io/docs/
- **Prometheus Query Docs:** https://prometheus.io/docs/prometheus/latest/querying/basics/
- **PromQL Examples:** See `prometheus-queries.md`


@@ -0,0 +1,405 @@
# Useful Prometheus Queries (PromQL)
Collection of useful PromQL queries for Claude Code telemetry analysis.
**Note:** All queries use the double prefix: `claude_code_claude_code_*`
---
## Cost Analysis
### Daily Cost Trend
```promql
sum(increase(claude_code_claude_code_cost_usage_USD_total[1d]))
```
### Cost by Model
```promql
sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
```
### Cost per Hour (Rate)
```promql
rate(claude_code_claude_code_cost_usage_USD_total[5m]) * 3600
```
### Average Cost per Session
```promql
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
/
sum(increase(claude_code_claude_code_session_count_total[24h]))
```
### Cumulative Monthly Cost
```promql
sum(increase(claude_code_claude_code_cost_usage_USD_total[30d]))
```
### Cost by Team
```promql
sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
```
### Projected Monthly Cost (based on last 7 days)
```promql
(sum(increase(claude_code_claude_code_cost_usage_USD_total[7d])) / 7) * 30
```
---
## Token Usage
### Total Tokens by Type
```promql
sum by (type) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
```
### Tokens by Model
```promql
sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
```
### Cache Hit Rate
```promql
sum(increase(claude_code_claude_code_token_usage_tokens_total{type="cache_read"}[24h]))
/
sum(increase(claude_code_claude_code_token_usage_tokens_total{type=~"input|cache_creation|cache_read"}[24h]))
* 100
```
### Input vs Output Token Ratio
```promql
sum(increase(claude_code_claude_code_token_usage_tokens_total{type="input"}[24h]))
/
sum(increase(claude_code_claude_code_token_usage_tokens_total{type="output"}[24h]))
```
### Token Usage Rate (per minute)
```promql
sum by (type) (rate(claude_code_claude_code_token_usage_tokens_total[5m]) * 60)
```
### Total Tokens (All Time)
```promql
sum(claude_code_claude_code_token_usage_tokens_total)
```
---
## Productivity Metrics
### Total Lines of Code Modified
```promql
sum(claude_code_claude_code_lines_of_code_count_total)
```
### LOC by Type (Added, Changed, Deleted)
```promql
sum by (type) (increase(claude_code_claude_code_lines_of_code_count_total[24h]))
```
### LOC per Hour
```promql
rate(claude_code_claude_code_lines_of_code_count_total[5m]) * 3600
```
### Lines per Dollar (Efficiency)
```promql
sum(increase(claude_code_claude_code_lines_of_code_count_total[24h]))
/
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
```
### Commits per Day
```promql
increase(claude_code_claude_code_commit_count_total[24h])
```
### PRs per Week
```promql
increase(claude_code_claude_code_pr_count_total[7d])
```
### LOC per Commit
```promql
sum(increase(claude_code_claude_code_lines_of_code_count_total[24h]))
/
sum(increase(claude_code_claude_code_commit_count_total[24h]))
```
---
## Session Analytics
### Total Sessions
```promql
sum(claude_code_claude_code_session_count_total)
```
### New Sessions (24h)
```promql
increase(claude_code_claude_code_session_count_total[24h])
```
### Active Users (Unique account_uuids)
```promql
count(count by (account_uuid) (claude_code_claude_code_session_count_total))
```
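The same count can be derived client-side from the response of an instant query on the Prometheus HTTP API (`/api/v1/query`). A minimal sketch, assuming the standard vector response shape; the sample payload below is hand-written for illustration and is not real data.

```python
import json


def count_active_users(response_text: str) -> int:
    """Count distinct account_uuid label values in an instant-query response."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    series = body["data"]["result"]
    return len({s["metric"]["account_uuid"]
                for s in series
                if "account_uuid" in s["metric"]})


# Hand-written example: two users, one of them seen on two versions.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"account_uuid": "user-a", "version": "1.2.3"}, "value": [0, "4"]},
            {"metric": {"account_uuid": "user-a", "version": "1.2.4"}, "value": [0, "2"]},
            {"metric": {"account_uuid": "user-b", "version": "1.2.4"}, "value": [0, "7"]},
        ],
    },
})
print(count_active_users(SAMPLE_RESPONSE))
```

As with the PromQL version, this counts every user that has a series at query time, not only users active in a recent window.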
### Average Session Duration
```promql
sum(increase(claude_code_claude_code_active_time_seconds_total[24h]))
/
sum(increase(claude_code_claude_code_session_count_total[24h]))
/ 60
```
*Result in minutes*
### Total Active Hours (24h)
```promql
sum(increase(claude_code_claude_code_active_time_seconds_total[24h])) / 3600
```
### Sessions by Version
```promql
sum by (version) (increase(claude_code_claude_code_session_count_total[24h]))
```
---
## Team Aggregation
### Cost by Team (Last 24h)
```promql
sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
```
### LOC by Team (Last 24h)
```promql
sum by (team) (increase(claude_code_claude_code_lines_of_code_count_total[24h]))
```
### Active Users per Team
```promql
count by (team) (count by (team, account_uuid) (claude_code_claude_code_session_count_total))
```
### Team Efficiency (LOC per Dollar)
```promql
sum by (team) (increase(claude_code_claude_code_lines_of_code_count_total[24h]))
/
sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
```
### Top Spending Teams (Last 7 days)
```promql
topk(5, sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[7d])))
```
---
## Model Comparison
### Cost by Model (Pie Chart)
```promql
sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[7d]))
```
### Token Efficiency by Model (Tokens per Dollar)
```promql
sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
/
sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
```
### Most Used Model
```promql
topk(1, sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h])))
```
### Model Usage Distribution (%)
```promql
sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
/
sum(increase(claude_code_claude_code_token_usage_tokens_total[24h]))
* 100
```
---
## Alerting Queries
### High Daily Cost Alert (> $50)
```promql
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h])) > 50
```
### Cost Spike Alert (50% increase compared to yesterday)
```promql
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
/
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h] offset 24h))
> 1.5
```
### No Activity Alert (no sessions in last hour)
```promql
sum(increase(claude_code_claude_code_session_count_total[1h])) == 0
```
### Low Cache Hit Rate Alert (< 20%)
```promql
(
sum(increase(claude_code_claude_code_token_usage_tokens_total{type="cache_read"}[1h]))
/
sum(increase(claude_code_claude_code_token_usage_tokens_total{type=~"input|cache_creation|cache_read"}[1h]))
* 100
) < 20
```
---
## Forecasting
### Projected Monthly Cost (based on last 7 days)
```promql
(sum(increase(claude_code_claude_code_cost_usage_USD_total[7d])) / 7) * 30
```
### Projected Annual Cost (based on last 30 days)
```promql
(sum(increase(claude_code_claude_code_cost_usage_USD_total[30d])) / 30) * 365
```
### Average Daily Cost (Last 30 days)
```promql
sum(increase(claude_code_claude_code_cost_usage_USD_total[30d])) / 30
```
### Growth Rate (Week over Week)
```promql
(
sum(increase(claude_code_claude_code_cost_usage_USD_total[7d]))
-
sum(increase(claude_code_claude_code_cost_usage_USD_total[7d] offset 7d))
)
/
sum(increase(claude_code_claude_code_cost_usage_USD_total[7d] offset 7d))
* 100
```
*Result as percentage*
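The forecasting queries are plain linear extrapolations. The sketch below restates two of them in Python so the arithmetic is easy to verify by hand; the function names are illustrative.

```python
from typing import Optional


def projected_monthly_cost(cost_last_7d: float) -> float:
    """Average daily cost over the last 7 days, scaled to a 30-day month."""
    return cost_last_7d / 7 * 30


def week_over_week_growth_pct(this_week: float, last_week: float) -> Optional[float]:
    """Percentage change from last week to this week.

    Returns None when last week's cost is zero, because the ratio is undefined.
    """
    if last_week == 0:
        return None
    return (this_week - last_week) / last_week * 100


print(projected_monthly_cost(70.0))
print(week_over_week_growth_pct(150.0, 100.0))
```

Because the projection assumes the last 7 days are representative, a single unusually heavy day shifts the monthly figure noticeably.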
---
## Debugging Queries
### Check if Metrics Exist
```promql
claude_code_claude_code_session_count_total
```
### List All Claude Code Metrics
```bash
# Use Prometheus UI or API
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code
```
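The same listing can be scripted without `jq` and `grep`. A minimal Python sketch using only the standard library, assuming Prometheus is reachable at `http://localhost:9090` as in the `curl` command above; the function names are illustrative.

```python
import json
import urllib.request


def fetch_metric_names(base_url: str = "http://localhost:9090") -> str:
    """Fetch the raw JSON listing every metric name Prometheus knows about."""
    url = f"{base_url}/api/v1/label/__name__/values"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def claude_code_metric_names(response_text: str, prefix: str = "claude_code_") -> list[str]:
    """Return the sorted metric names that start with the given prefix."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError("Prometheus API call did not succeed")
    return sorted(name for name in body["data"] if name.startswith(prefix))


# Offline example with a hand-written response body
SAMPLE_NAMES = json.dumps({
    "status": "success",
    "data": ["up", "claude_code_claude_code_session_count_total", "claude_code_claude_code_cost_usage_USD_total"],
})
print(claude_code_metric_names(SAMPLE_NAMES))
```

Against a live instance, call `claude_code_metric_names(fetch_metric_names())`. An empty list means no Claude Code metrics have reached Prometheus yet.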
### Check Metric Labels
```promql
# Returns all label combinations
count by (account_uuid, version, team, environment) (claude_code_claude_code_session_count_total)
```
### Latest Value for All Metrics
```promql
# Session count
claude_code_claude_code_session_count_total
# Cost
claude_code_claude_code_cost_usage_USD_total
# Tokens
claude_code_claude_code_token_usage_tokens_total
# LOC
claude_code_claude_code_lines_of_code_count_total
```
### Metrics Cardinality (Number of Time Series)
```promql
count(claude_code_claude_code_token_usage_tokens_total)
```
---
## Recording Rules
Save these as Prometheus recording rules for faster dashboard queries:
```yaml
groups:
  - name: claude_code_aggregations
    interval: 1m
    rules:
      # Daily cost
      - record: claude_code:cost_usd:daily
        expr: sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
      # Cost by team
      - record: claude_code:cost_usd:daily:by_team
        expr: sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
      # Cache hit rate
      - record: claude_code:cache_hit_rate:daily
        expr: |
          sum(increase(claude_code_claude_code_token_usage_tokens_total{type="cache_read"}[24h]))
          /
          sum(increase(claude_code_claude_code_token_usage_tokens_total{type=~"input|cache_creation|cache_read"}[24h]))
          * 100
      # LOC efficiency
      - record: claude_code:loc_per_dollar:daily
        expr: |
          sum(increase(claude_code_claude_code_lines_of_code_count_total[24h]))
          /
          sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
```
Then use simplified queries:
```promql
# Instead of complex query, just use:
claude_code:cost_usd:daily
claude_code:cost_usd:daily:by_team
```
---
## Visualization Tips
### Time Series Panel
- Use `rate()` for smooth trends
- Set legend to `{{label_name}}` for clarity
- Enable "Lines" draw style with opacity
### Stat Panel
- Use `lastNotNull` for counters
- Use `increase([24h])` for daily totals
- Add thresholds for color coding
### Bar Chart
- Use `sum by (label)` for grouping
- Sort by value descending
- Limit to top 10 with `topk(10, ...)`
### Pie Chart
- Calculate percentages with division
- Use `sum by (label)` for segments
- Limit to top categories
---
## Additional Resources
- **Prometheus Query Docs:** https://prometheus.io/docs/prometheus/latest/querying/basics/
- **PromQL Examples:** https://prometheus.io/docs/prometheus/latest/querying/examples/
- **Grafana Query Editor:** https://grafana.com/docs/grafana/latest/datasources/prometheus/


@@ -0,0 +1,658 @@
# Troubleshooting Guide
Common issues and solutions for Claude Code OpenTelemetry setup.
---
## Container Issues
### Docker Not Running
**Symptom:** `Cannot connect to the Docker daemon`
**Diagnosis:**
```bash
docker info
```
**Solutions:**
1. Start Docker Desktop application
2. Wait for Docker to fully initialize
3. Check system tray for Docker icon
4. Verify Docker daemon is running: `ps aux | grep docker`
---
### Containers Won't Start
**Symptom:** Containers exit immediately after `docker compose up`
**Diagnosis:**
```bash
# Check container logs
docker compose logs
# Check specific service
docker compose logs otel-collector
docker compose logs prometheus
```
**Common Causes:**
**1. OTEL Collector Configuration Error**
```bash
# Check for errors
docker compose logs otel-collector | grep -i error
# Common issues:
# - Deprecated logging exporter
# - Deprecated 'address' field in telemetry.metrics
```
**Solution A - Deprecated logging exporter:**
Update `otel-collector-config.yml`:
```yaml
exporters:
  debug:
    verbosity: normal
# NOT:
# logging:
#   loglevel: info
```
**Solution B - Deprecated 'address' field (v0.123.0+):**
If logs show: `'address' has invalid keys` or similar error:
Update `otel-collector-config.yml`:
```yaml
service:
  telemetry:
    metrics:
      level: detailed
      # REMOVE this line (deprecated in v0.123.0+):
      # address: ":8888"
```
The `address` field in `service.telemetry.metrics` is deprecated in newer OTEL Collector versions. Remove it; the collector then uses its default internal metrics endpoint.
**2. Port Already in Use**
```bash
# Check which ports are in use
lsof -i :3000 # Grafana
lsof -i :4317 # OTEL gRPC
lsof -i :4318 # OTEL HTTP
lsof -i :8889 # OTEL Prometheus exporter
lsof -i :9090 # Prometheus
lsof -i :3100 # Loki
```
**Solution:**
- Stop conflicting service
- Or change port in docker-compose.yml
**3. Volume Permission Issues**
```bash
# Check volume permissions
docker volume ls
docker volume inspect claude-telemetry_prometheus-data
```
**Solution:**
```bash
# Remove and recreate volumes
docker compose down -v
docker compose up -d
```
---
### Containers Keep Restarting
**Symptom:** Container status shows "Restarting"
**Diagnosis:**
```bash
docker compose ps
docker compose logs --tail=50 <service-name>
```
**Solutions:**
1. Check memory limits: Increase memory_limiter in OTEL config
2. Check disk space: `df -h`
3. Check for configuration errors in logs
4. Restart Docker Desktop
---
## Claude Code Settings Issues
### 🚨 CRITICAL: Telemetry Not Sending (Most Common Issue)
**Symptom:** No metrics appearing in Prometheus after Claude Code restart
**ROOT CAUSE (90% of cases):** Missing required exporter environment variables
Even when `CLAUDE_CODE_ENABLE_TELEMETRY=1` is set, telemetry **will not send** without explicit exporter configuration. This is the #1 most common issue.
**Diagnosis Checklist:**
**1. Check REQUIRED exporters (MOST IMPORTANT):**
```bash
jq '.env.OTEL_METRICS_EXPORTER' ~/.claude/settings.json
# Must return: "otlp" (NOT null, NOT missing)
jq '.env.OTEL_LOGS_EXPORTER' ~/.claude/settings.json
# Should return: "otlp" (recommended for event tracking)
```
**If either returns `null` or is missing, this is your problem!**
**2. Verify telemetry is enabled:**
```bash
jq '.env.CLAUDE_CODE_ENABLE_TELEMETRY' ~/.claude/settings.json
# Should return: "1"
```
**3. Check OTEL endpoint:**
```bash
jq '.env.OTEL_EXPORTER_OTLP_ENDPOINT' ~/.claude/settings.json
# Should return: "http://localhost:4317" (for local setup)
```
**4. Verify JSON is valid:**
```bash
jq empty ~/.claude/settings.json
# No output = valid JSON
```
**5. Check if Claude Code was restarted:**
```bash
# Telemetry config only loads at startup!
# Must quit and restart Claude Code completely
```
**6. Test OTEL endpoint connectivity:**
```bash
nc -zv localhost 4317
# Should show: Connection to localhost port 4317 [tcp/*] succeeded!
```
**Solutions:**
**If exporters are missing (MOST COMMON):**
Add these REQUIRED settings to ~/.claude/settings.json:
```json
{
  "env": {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "OTEL_METRICS_EXPORTER": "otlp",
    "OTEL_LOGS_EXPORTER": "otlp",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc"
  }
}
```
Then you **must restart Claude Code** (settings load only at startup).
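The checklist above can be automated. The following Python sketch reports which of the settings described in this section are missing or different from the expected values. It encodes only what this guide states; `telemetry_problems` is an illustrative helper and not part of Claude Code.

```python
# Values this guide treats as required for telemetry to be sent.
REQUIRED_ENV = {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "OTEL_METRICS_EXPORTER": "otlp",
}


def telemetry_problems(settings: dict) -> list[str]:
    """Return human-readable problems found in a parsed settings.json."""
    env = settings.get("env", {})
    problems = []
    for key, expected in REQUIRED_ENV.items():
        actual = env.get(key)
        if actual != expected:
            problems.append(f"{key} is {actual!r}, expected {expected!r}")
    if not env.get("OTEL_EXPORTER_OTLP_ENDPOINT"):
        problems.append("OTEL_EXPORTER_OTLP_ENDPOINT is not set")
    if env.get("OTEL_LOGS_EXPORTER") != "otlp":
        problems.append("OTEL_LOGS_EXPORTER is not 'otlp' (recommended, not required)")
    return problems


# Example: telemetry enabled but no exporter configured
for problem in telemetry_problems({"env": {"CLAUDE_CODE_ENABLE_TELEMETRY": "1"}}):
    print(problem)
```

To check a real file, load it with `json.load()` from `~/.claude/settings.json` and pass the resulting dict to the function. An empty list means all the checks in this section pass.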
**If endpoint unreachable:**
- Verify OTEL Collector container is running
- Check firewall settings
- Try the HTTP endpoint instead: `http://localhost:4318` (and set `OTEL_EXPORTER_OTLP_PROTOCOL` to `http/protobuf` to match)
**If still no data:**
- Check OTEL Collector logs for incoming connections
- Verify Claude Code is running (not just idle)
- Wait 60 seconds (default export interval)
---
### Settings.json Syntax Errors
**Symptom:** Claude Code won't start or shows errors
**Diagnosis:**
```bash
# Validate JSON
jq empty ~/.claude/settings.json
# Pretty-print to find issues
jq . ~/.claude/settings.json
```
**Common Issues:**
- Missing commas between properties
- Trailing commas before closing braces
- Unescaped quotes in strings
- Incorrect nesting
**Solution:**
```bash
# Restore backup
cp ~/.claude/settings.json.backup ~/.claude/settings.json
# Or fix JSON manually with editor
```
---
## Grafana Issues
### Can't Access Grafana
**Symptom:** `localhost:3000` doesn't load
**Diagnosis:**
```bash
# Check if Grafana is running
docker ps | grep grafana
# Check Grafana logs
docker compose logs grafana
# Check port availability
lsof -i :3000
```
**Solutions:**
1. Verify container is running: `docker compose up -d grafana`
2. Wait 30 seconds for Grafana to initialize
3. Try `http://127.0.0.1:3000` instead
4. Check Docker network: `docker network inspect claude-telemetry`
---
### Dashboard Shows "Datasource Not Found"
**Symptom:** Dashboard panels show "datasource prometheus not found"
**Cause:** Dashboard has hardcoded datasource UID that doesn't match your Grafana instance
**Diagnosis:**
1. Go to: http://localhost:3000/connections/datasources
2. Click on Prometheus datasource
3. Note the UID from URL (e.g., `PBFA97CFB590B2093`)
**Solution:**
```bash
# Get your datasource UID
DATASOURCE_UID=$(curl -s -u admin:admin http://localhost:3000/api/datasources | jq -r '.[] | select(.type=="prometheus") | .uid')
echo "Your Prometheus datasource UID: $DATASOURCE_UID"
# Update dashboard JSON
cd ~/.claude/telemetry/dashboards
cat claude-code-overview.json | sed "s/PBFA97CFB590B2093/$DATASOURCE_UID/g" > claude-code-overview-fixed.json
# Re-import the fixed dashboard
```
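A plain text substitution also rewrites the UID anywhere else it happens to appear. A more targeted alternative is to walk the dashboard JSON and change only `datasource.uid` fields. The sketch below assumes the dashboard stores datasources as objects of the form `{"type": ..., "uid": ...}`; dashboards that reference datasources by name are left unchanged.

```python
def replace_datasource_uid(node, old_uid: str, new_uid: str):
    """Return a copy of a dashboard structure with matching datasource UIDs swapped."""
    if isinstance(node, dict):
        out = {key: replace_datasource_uid(value, old_uid, new_uid)
               for key, value in node.items()}
        datasource = out.get("datasource")
        if isinstance(datasource, dict) and datasource.get("uid") == old_uid:
            out["datasource"] = {**datasource, "uid": new_uid}
        return out
    if isinstance(node, list):
        return [replace_datasource_uid(item, old_uid, new_uid) for item in node]
    return node  # strings, numbers, booleans, None


# Hand-written miniature dashboard for illustration
dashboard = {
    "title": "Example",
    "panels": [
        {"datasource": {"type": "prometheus", "uid": "OLD"},
         "targets": [{"datasource": {"type": "prometheus", "uid": "OLD"}, "expr": "up"}]},
        {"datasource": {"type": "loki", "uid": "OTHER"}},
    ],
}
fixed = replace_datasource_uid(dashboard, "OLD", "NEW")
print(fixed["panels"][0]["datasource"])
```

Load the real dashboard with `json.load()`, pass it through the function, and write the result with `json.dump()` before re-importing it.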
---
### Dashboard Shows "No Data"
**Symptom:** Dashboard loads but all panels show "No data"
**Diagnosis Steps:**
**1. Check Prometheus has data:**
```bash
# Query Prometheus directly
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code
# Should see metrics like:
# "claude_code_claude_code_session_count_total"
# "claude_code_claude_code_cost_usage_USD_total"
```
**2. Check datasource connection:**
- Go to: http://localhost:3000/connections/datasources
- Click Prometheus
- Click "Save & Test"
- Should show: "Successfully queried the Prometheus API"
**3. Verify metric names in queries:**
```bash
# Check if metrics use double prefix
curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_session_count_total' | jq .
```
**Solutions:**
**If metrics don't exist:**
- Claude Code hasn't sent data yet (wait 60 seconds)
- OTEL Collector isn't receiving data (check container logs)
- Settings.json wasn't configured correctly
**If metrics exist but dashboard shows no data:**
- Dashboard queries use wrong metric names
- Update queries to use double prefix: `claude_code_claude_code_*`
- Check time range (top-right corner of Grafana)
**If single prefix metrics exist (`claude_code_*`):**
Your setup uses old naming. Update dashboard:
```bash
# Replace double prefix with single
sed 's/claude_code_claude_code_/claude_code_/g' dashboard.json > dashboard-fixed.json
```
---
## Prometheus Issues
### Prometheus Shows No Targets
**Symptom:** Prometheus UI (localhost:9090) → Status → Targets shows no targets or DOWN status
**Diagnosis:**
```bash
# Check Prometheus config
cat ~/.claude/telemetry/prometheus.yml
# Check if OTEL Collector is reachable from Prometheus
docker exec -it claude-prometheus ping otel-collector
```
**Solutions:**
1. Verify `prometheus.yml` has correct scrape_configs
2. Ensure OTEL Collector is running
3. Check Docker network connectivity
4. Restart Prometheus: `docker compose restart prometheus`
---
### Prometheus Can't Scrape OTEL Collector
**Symptom:** Target shows as DOWN with error "context deadline exceeded"
**Diagnosis:**
```bash
# Check if OTEL Collector is exposing metrics
curl http://localhost:8889/metrics
# Check OTEL Collector logs
docker compose logs otel-collector
```
**Solutions:**
1. Verify OTEL Collector prometheus exporter is configured
2. Check port 8889 is exposed in docker-compose.yml
3. Restart OTEL Collector: `docker compose restart otel-collector`
---
## Metric Issues
### Metrics Have Double Prefix
**Symptom:** Metrics are named `claude_code_claude_code_*` instead of `claude_code_*`
**Explanation:** This is expected behavior with the current OTEL Collector configuration:
- First `claude_code` = Prometheus exporter namespace
- Second `claude_code` = Original metric name
**Solutions:**
**Option 1: Accept it (Recommended)**
- Update dashboard queries to use double prefix
- This is the standard configuration
**Option 2: Remove namespace prefix**
Update `otel-collector-config.yml`:
```yaml
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: ""  # Remove namespace
```
Then restart: `docker compose restart otel-collector`
---
### Old Metrics Still Showing
**Symptom:** After changing configuration, old metrics still appear
**Cause:** Prometheus retains metrics until retention period expires
**Solutions:**
**Quick fix: Delete Prometheus data:**
```bash
docker compose down
docker volume rm claude-telemetry_prometheus-data
docker compose up -d
```
**Proper fix: Wait for retention:**
- Default retention is 15 days
- Old metrics will automatically disappear
- New metrics will coexist temporarily
---
## Network Issues
### Can't Reach OTEL Endpoint from Claude Code
**Symptom:** Claude Code can't connect to `localhost:4317`
**Diagnosis:**
```bash
# Test gRPC endpoint
nc -zv localhost 4317
# Test HTTP endpoint
curl -v http://localhost:4318/v1/metrics -H 'Content-Type: application/json' -d '{}'
```
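Where `nc` is not installed, the same TCP reachability check can be done from Python's standard library. A minimal sketch: it confirms only that something accepts connections on the port, not that it speaks OTLP.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and DNS failures
        return False


for name, port in (("OTLP gRPC", 4317), ("OTLP HTTP", 4318)):
    state = "reachable" if port_open("localhost", port) else "not reachable"
    print(f"{name} on port {port}: {state}")
```

A refused connection usually returns immediately, while a firewall that drops packets takes the full timeout.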
**Solutions:**
**If connection refused:**
1. Check OTEL Collector is running
2. Verify ports are exposed in docker-compose.yml
3. Check firewall/antivirus blocking localhost connections
**If timeout:**
1. Increase export timeout in settings.json
2. Try HTTP protocol instead of gRPC
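**If Claude Code itself runs inside a container (e.g., with Docker Desktop on macOS):**
- Use `http://host.docker.internal:4317` instead of `localhost:4317`, because `localhost` inside a container refers to the container itself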
- Or use bridge network mode
---
### Enterprise Endpoint Unreachable
**Symptom:** Can't connect to company OTEL endpoint
**Diagnosis:**
```bash
# Test connectivity
ping otel.company.com
# Test port
nc -zv otel.company.com 4317
# Test with VPN
# (Ensure corporate VPN is connected)
```
**Solutions:**
1. Connect to corporate VPN
2. Check firewall allows outbound connections
3. Verify endpoint URL is correct
4. Try HTTP endpoint (port 4318) instead of gRPC
5. Contact platform team to verify endpoint is accessible
---
## Performance Issues
### High Memory Usage
**Symptom:** OTEL Collector or Prometheus using excessive memory
**Diagnosis:**
```bash
# Check container resource usage
docker stats
# Check Prometheus TSDB size
du -sh ~/.claude/telemetry/prometheus-data
```
**Solutions:**
**OTEL Collector:**
Reduce memory_limiter in `otel-collector-config.yml`:
```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 256  # Reduce from 512
```
**Prometheus:**
Reduce retention:
```yaml
command:
  - '--storage.tsdb.retention.time=7d'  # Reduce from 15d
  - '--storage.tsdb.retention.size=1GB'
```
---
### Slow Grafana Dashboards
**Symptom:** Dashboards take a long time to load or time out
**Diagnosis:**
```bash
# Check query performance in Prometheus
# Go to: http://localhost:9090/graph
# Run expensive queries like: sum by (account_uuid, model, type) (...)
```
**Solutions:**
1. Reduce dashboard time range (use 6h instead of 7d)
2. Increase dashboard refresh interval (1m → 5m)
3. Use recording rules for complex queries
4. Reduce number of panels
5. Use simpler aggregations
---
## Data Quality Issues
### Unexpected Cost Values
**Symptom:** Cost metrics seem incorrect
**Diagnosis:**
```bash
# Check raw cost values
curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_cost_usage_USD_total' | jq .
# Check token usage
curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_token_usage_tokens_total' | jq .
```
**Causes:**
- Cost is a cumulative counter (not reset between sessions)
- Dashboard may be using wrong time range
- Model pricing may have changed
**Solutions:**
- Use `increase([24h])` not raw counter values
- Verify pricing in metrics reference
- Check Claude Code version (pricing may vary)
---
### Missing Sessions
**Symptom:** Some Claude Code sessions not recorded
**Causes:**
1. Claude Code wasn't restarted after settings update
2. OTEL Collector was down during session
3. Export interval hadn't elapsed yet (60 seconds default)
4. Network issue prevented export
**Solutions:**
- Always restart Claude Code after settings changes
- Monitor OTEL Collector uptime
- Check OTEL Collector logs for export errors
- Reduce export interval if real-time data needed
---
## Getting Help
### Collect Debug Information
When asking for help, provide:
```bash
# 1. Container status
docker compose ps
# 2. Container logs (last 50 lines)
docker compose logs --tail=50
# 3. Configuration files
cat ~/.claude/telemetry/otel-collector-config.yml
cat ~/.claude/telemetry/prometheus.yml
# 4. Claude Code settings (redact sensitive info!)
jq '.env | with_entries(select(.key | startswith("OTEL_")))' ~/.claude/settings.json
# 5. Prometheus metrics list
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code
# 6. System info
docker --version
docker compose version
uname -a
```
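Step 4 asks you to redact sensitive values by hand. The sketch below does that programmatically for the `env` block, masking any variable whose name suggests a credential. The marker list is an assumption and may need extending for your environment; `OTEL_EXPORTER_OTLP_HEADERS` is included because it commonly carries authorization headers.

```python
# Substrings that suggest a variable holds a secret (an assumption; extend as needed).
SENSITIVE_MARKERS = ("HEADERS", "TOKEN", "KEY", "SECRET", "PASSWORD")


def redact_env(env: dict[str, str]) -> dict[str, str]:
    """Return a copy of env with likely-sensitive values masked."""
    return {
        key: "<redacted>" if any(marker in key.upper() for marker in SENSITIVE_MARKERS) else value
        for key, value in env.items()
    }


# Hypothetical settings for illustration
example_env = {
    "OTEL_METRICS_EXPORTER": "otlp",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
    "OTEL_EXPORTER_OTLP_HEADERS": "Authorization=Bearer abc123",
}
for key, value in redact_env(example_env).items():
    print(f"{key}={value}")
```

Review the output before sharing it, since endpoint URLs can also reveal internal hostnames.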
### Enable Debug Logging
**OTEL Collector:**
```yaml
exporters:
  debug:
    verbosity: detailed  # Change from 'normal'
service:
  telemetry:
    logs:
      level: debug  # Change from 'info'
```
**Claude Code:**
Add to settings.json:
```json
"env": {
  "OTEL_LOG_LEVEL": "debug"
}
```
Then check logs:
```bash
docker compose logs -f otel-collector
```
---
## Additional Resources
- **OTEL Collector Docs:** https://opentelemetry.io/docs/collector/
- **Prometheus Troubleshooting:** https://prometheus.io/docs/prometheus/latest/troubleshooting/
- **Grafana Troubleshooting:** https://grafana.com/docs/grafana/latest/troubleshooting/
- **Docker Compose Docs:** https://docs.docker.com/compose/