Initial commit
381
skills/otel-monitoring-setup/data/metrics-reference.md
Normal file
@@ -0,0 +1,381 @@
|
||||
# Claude Code Metrics Reference
|
||||
|
||||
Complete reference for all Claude Code OpenTelemetry metrics.
|
||||
|
||||
**Important:** All metrics use a double prefix: `claude_code_claude_code_*`
|
||||
|
||||
---
|
||||
|
||||
## Metric Categories
|
||||
|
||||
1. **Usage Metrics** - Session counts, active time
|
||||
2. **Token Metrics** - Input, output, cached tokens
|
||||
3. **Cost Metrics** - API costs by model
|
||||
4. **Productivity Metrics** - LOC, commits, PRs
|
||||
5. **Error Metrics** - Failures, retries
|
||||
|
||||
---
|
||||
|
||||
## Usage Metrics
|
||||
|
||||
### claude_code_claude_code_session_count_total
|
||||
|
||||
**Type:** Counter
|
||||
**Description:** Total number of Claude Code sessions started
|
||||
**Labels:**
|
||||
- `account_uuid` - Anonymous user identifier
|
||||
- `version` - Claude Code version (e.g., "1.2.3")
|
||||
|
||||
**Example Query:**
|
||||
```promql
|
||||
# Total sessions across all users
|
||||
sum(claude_code_claude_code_session_count_total)
|
||||
|
||||
# Sessions by version
|
||||
sum by (version) (claude_code_claude_code_session_count_total)
|
||||
|
||||
# New sessions in last 24h
|
||||
increase(claude_code_claude_code_session_count_total[24h])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### claude_code_claude_code_active_time_seconds_total
|
||||
|
||||
**Type:** Counter
|
||||
**Description:** Total active time spent in Claude Code sessions (in seconds)
|
||||
**Labels:**
|
||||
- `account_uuid` - Anonymous user identifier
|
||||
- `version` - Claude Code version
|
||||
|
||||
**Example Query:**
|
||||
```promql
|
||||
# Total active hours
|
||||
sum(claude_code_claude_code_active_time_seconds_total) / 3600
|
||||
|
||||
# Active hours per day (all users)
sum(increase(claude_code_claude_code_active_time_seconds_total[24h])) / 3600

# Average active time per session, in seconds
sum(increase(claude_code_claude_code_active_time_seconds_total[24h]))
/
sum(increase(claude_code_claude_code_session_count_total[24h]))
|
||||
```
|
||||
|
||||
**Note:** "Active time" means time when Claude Code is actively processing or responding to user input.
|
||||
|
||||
---
|
||||
|
||||
## Token Metrics
|
||||
|
||||
### claude_code_claude_code_token_usage_tokens_total
|
||||
|
||||
**Type:** Counter
|
||||
**Description:** Total tokens consumed by Claude Code API calls
|
||||
**Labels:**
|
||||
- `type` - Token type: `input`, `output`, `cache_creation`, `cache_read`
|
||||
- `model` - Model name (e.g., "claude-sonnet-4-5-20250929", "claude-opus-4-20250514")
|
||||
- `account_uuid` - Anonymous user identifier
|
||||
- `version` - Claude Code version
|
||||
|
||||
**Token Types Explained:**
|
||||
- **input:** User messages and tool results sent to Claude
|
||||
- **output:** Claude's responses (text and tool calls)
|
||||
- **cache_creation:** Tokens written to prompt cache (billed above the base input rate: 1.25x for the default 5-minute cache)
- **cache_read:** Tokens read from prompt cache (billed at 10% of input rate)
|
||||
|
||||
**Example Query:**
|
||||
```promql
|
||||
# Total tokens by type (24h)
|
||||
sum by (type) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
|
||||
|
||||
# Tokens by model (24h)
|
||||
sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
|
||||
|
||||
# Cache hit rate
|
||||
sum(increase(claude_code_claude_code_token_usage_tokens_total{type="cache_read"}[24h]))
|
||||
/
|
||||
sum(increase(claude_code_claude_code_token_usage_tokens_total{type=~"input|cache_creation|cache_read"}[24h]))
|
||||
|
||||
# Token usage rate (per minute)
|
||||
rate(claude_code_claude_code_token_usage_tokens_total[5m]) * 60
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Cost Metrics
|
||||
|
||||
### claude_code_claude_code_cost_usage_USD_total
|
||||
|
||||
**Type:** Counter
|
||||
**Description:** Total API costs in USD
|
||||
**Labels:**
|
||||
- `model` - Model name
|
||||
- `account_uuid` - Anonymous user identifier
|
||||
- `version` - Claude Code version
|
||||
|
||||
**Pricing Reference (as of Jan 2025):**
|
||||
- **Claude Sonnet 4.5:** $3/MTok input, $15/MTok output
|
||||
- **Claude Opus 4:** $15/MTok input, $75/MTok output
|
||||
- **Cache read:** 10% of input price
|
||||
- **Cache write:** 1.25x the input price (default 5-minute cache)
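
As a worked example of how these prices combine, the sketch below estimates the cost of a batch of tokens. The names `ModelPrice` and `estimate_cost_usd` are hypothetical helpers, not part of Claude Code, and the cache-write factor is an assumption you should adjust to your own billing terms. The `claude_code_claude_code_cost_usage_USD_total` metric remains the authoritative figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Per-million-token (MTok) prices in USD."""
    input_per_mtok: float
    output_per_mtok: float
    cache_read_factor: float = 0.10   # cache reads: 10% of the input price
    cache_write_factor: float = 1.25  # assumption: adjust to your billing terms


def estimate_cost_usd(price, input_tokens=0, output_tokens=0,
                      cache_creation_tokens=0, cache_read_tokens=0):
    """Estimate the USD cost of the given token counts for one model."""
    per_input_token = price.input_per_mtok / 1_000_000
    per_output_token = price.output_per_mtok / 1_000_000
    return (
        input_tokens * per_input_token
        + output_tokens * per_output_token
        + cache_creation_tokens * per_input_token * price.cache_write_factor
        + cache_read_tokens * per_input_token * price.cache_read_factor
    )


# Prices taken from the reference list above.
SONNET_4_5 = ModelPrice(input_per_mtok=3.0, output_per_mtok=15.0)

if __name__ == "__main__":
    cost = estimate_cost_usd(
        SONNET_4_5,
        input_tokens=200_000,
        output_tokens=50_000,
        cache_read_tokens=1_000_000,
    )
    print(f"Estimated cost: ${cost:.2f}")
```

In this example the one million cached tokens cost less than the 200,000 uncached input tokens, which is why the cache hit rate is worth tracking.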
|
||||
|
||||
**Example Query:**
|
||||
```promql
|
||||
# Total cost (24h)
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
|
||||
# Cost by model (24h)
|
||||
sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
|
||||
# Cost per hour
|
||||
rate(claude_code_claude_code_cost_usage_USD_total[5m]) * 3600
|
||||
|
||||
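# Average cost per session
# Aggregate both sides first: the cost series carry a `model` label that the
# session series lack, so an unaggregated division matches no series.
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
/
sum(increase(claude_code_claude_code_session_count_total[24h]))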
|
||||
|
||||
# Cumulative cost over time
|
||||
sum(claude_code_claude_code_cost_usage_USD_total)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Productivity Metrics
|
||||
|
||||
### claude_code_claude_code_lines_of_code_count_total
|
||||
|
||||
**Type:** Counter
|
||||
**Description:** Total lines of code modified (added + changed + deleted)
|
||||
**Labels:**
|
||||
- `type` - Modification type: `added`, `changed`, `deleted`
|
||||
- `account_uuid` - Anonymous user identifier
|
||||
- `version` - Claude Code version
|
||||
|
||||
**Example Query:**
|
||||
```promql
|
||||
# Total LOC modified
|
||||
sum(claude_code_claude_code_lines_of_code_count_total)
|
||||
|
||||
# LOC by type (24h)
|
||||
sum by (type) (increase(claude_code_claude_code_lines_of_code_count_total[24h]))
|
||||
|
||||
# LOC per hour
|
||||
rate(claude_code_claude_code_lines_of_code_count_total[5m]) * 3600
|
||||
|
||||
# Lines per dollar
|
||||
sum(increase(claude_code_claude_code_lines_of_code_count_total[24h]))
|
||||
/
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### claude_code_claude_code_commit_count_total
|
||||
|
||||
**Type:** Counter
|
||||
**Description:** Total git commits created by Claude Code
|
||||
**Labels:**
|
||||
- `account_uuid` - Anonymous user identifier
|
||||
- `version` - Claude Code version
|
||||
|
||||
**Example Query:**
|
||||
```promql
|
||||
# Total commits
|
||||
sum(claude_code_claude_code_commit_count_total)
|
||||
|
||||
# Commits per day
|
||||
increase(claude_code_claude_code_commit_count_total[24h])
|
||||
|
||||
# Commits per session
|
||||
increase(claude_code_claude_code_commit_count_total[24h])
|
||||
/
|
||||
increase(claude_code_claude_code_session_count_total[24h])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### claude_code_claude_code_pr_count_total
|
||||
|
||||
**Type:** Counter
|
||||
**Description:** Total pull requests created by Claude Code
|
||||
**Labels:**
|
||||
- `account_uuid` - Anonymous user identifier
|
||||
- `version` - Claude Code version
|
||||
|
||||
**Example Query:**
|
||||
```promql
|
||||
# Total PRs
|
||||
sum(claude_code_claude_code_pr_count_total)
|
||||
|
||||
# PRs per week
|
||||
increase(claude_code_claude_code_pr_count_total[7d])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Cardinality and Resource Attributes
|
||||
|
||||
### Resource Attributes
|
||||
|
||||
All metrics include these resource attributes (configured in settings.json):
|
||||
|
||||
```json
|
||||
"OTEL_RESOURCE_ATTRIBUTES": "environment=local,deployment=poc,team=platform"
|
||||
```
|
||||
|
||||
**Common Attributes:**
|
||||
- `service.name` = "claude-code" (set by OTEL Collector)
|
||||
- `environment` - Deployment environment (local, dev, staging, prod)
|
||||
- `deployment` - Deployment type (poc, enterprise)
|
||||
- `team` - Team identifier
|
||||
- `department` - Department identifier
|
||||
- `project` - Project identifier
|
||||
|
||||
**Querying with Resource Attributes:**
|
||||
```promql
|
||||
# Filter by environment
|
||||
sum(claude_code_claude_code_cost_usage_USD_total{environment="prod"})
|
||||
|
||||
# Aggregate by team
|
||||
sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Metric Naming Convention
|
||||
|
||||
**Format:** `claude_code_claude_code_<metric_name>_<unit>_<type>`
|
||||
|
||||
**Why double prefix?**
|
||||
- First `claude_code` comes from Prometheus exporter namespace in OTEL Collector config
|
||||
- Second `claude_code` comes from the original metric name in Claude Code
|
||||
- This is expected behavior with the current configuration
|
||||
|
||||
**Components:**
|
||||
- `<metric_name>`: Descriptive name (e.g., `token_usage`, `cost_usage`)
|
||||
- `<unit>`: Unit of measurement (e.g., `tokens`, `USD`, `seconds`, `count`)
|
||||
- `<type>`: Metric type (always `total` for counters)
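
To make the convention concrete, the sketch below rebuilds a Prometheus series name from its components. The function `prometheus_metric_name` and the dotted input names (such as `claude_code.token.usage`) are assumptions for illustration; the real translation is done by the Prometheus exporter in the OTEL Collector, and its exact rules may differ.

```python
def prometheus_metric_name(otel_name, unit, namespace="claude_code", is_counter=True):
    """Build <namespace>_<metric_name>_<unit>_<type> from its parts.

    otel_name: dotted source metric name (assumed form), e.g. "claude_code.token.usage"
    unit:      unit suffix such as "tokens", "USD", or "seconds" (may be empty)
    namespace: exporter namespace; this is where the first "claude_code" comes from
    """
    parts = [namespace, otel_name.replace(".", "_")]
    if unit:
        parts.append(unit)
    if is_counter:
        parts.append("total")  # counters always end in "total"
    return "_".join(parts)


if __name__ == "__main__":
    print(prometheus_metric_name("claude_code.token.usage", "tokens"))
    print(prometheus_metric_name("claude_code.cost.usage", "USD"))
```

Passing a different `namespace` shows how changing the exporter namespace would change the first prefix while the second one stays.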
|
||||
|
||||
---
|
||||
|
||||
## Querying Best Practices
|
||||
|
||||
### Use increase() for Counters
|
||||
|
||||
Counters are cumulative, so use `increase()` for time windows:
|
||||
|
||||
```promql
|
||||
# ✅ Correct - Shows cost in last 24h
|
||||
increase(claude_code_claude_code_cost_usage_USD_total[24h])
|
||||
|
||||
# ❌ Wrong - Shows cumulative cost since start
|
||||
claude_code_claude_code_cost_usage_USD_total
|
||||
```
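
To see what `increase()` adds over reading the raw counter, here is a simplified model of the calculation: it sums the deltas between consecutive samples and treats any drop as a counter reset (for example, after a restart). This is only a sketch. `simple_increase` is a hypothetical helper, and real Prometheus also extrapolates to the window boundaries, which is omitted here.

```python
def simple_increase(samples):
    """Approximate PromQL increase() over an ordered list of counter samples."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter dropped, so it restarted from zero:
            # everything counted since the reset is new.
            total += current
    return total


if __name__ == "__main__":
    # The counter resets between 15 and 2 (e.g., the process restarted).
    window = [10.0, 12.0, 15.0, 2.0, 5.0]
    print(simple_increase(window))      # growth inside the window
    print(window[-1])                   # raw value: misleading after a reset
```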
|
||||
|
||||
### Use rate() for Rates
|
||||
|
||||
Calculate per-second rate, then multiply for desired unit:
|
||||
|
||||
```promql
|
||||
# Cost per hour
|
||||
rate(claude_code_claude_code_cost_usage_USD_total[5m]) * 3600
|
||||
|
||||
# Tokens per minute
|
||||
rate(claude_code_claude_code_token_usage_tokens_total[5m]) * 60
|
||||
```
|
||||
|
||||
### Aggregate with sum()
|
||||
|
||||
Combine metrics across labels:
|
||||
|
||||
```promql
|
||||
# Total tokens (all types)
|
||||
sum(claude_code_claude_code_token_usage_tokens_total)
|
||||
|
||||
# Total tokens by type
|
||||
sum by (type) (claude_code_claude_code_token_usage_tokens_total)
|
||||
|
||||
# Total cost across all models
|
||||
sum(claude_code_claude_code_cost_usage_USD_total)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Example Dashboards
|
||||
|
||||
### Executive Summary (single values)
|
||||
|
||||
```promql
|
||||
# Total cost over the last 30 days (needs retention longer than the 15-day default)
sum(increase(claude_code_claude_code_cost_usage_USD_total[30d]))

# Total LOC over the last 30 days
sum(increase(claude_code_claude_code_lines_of_code_count_total[30d]))
|
||||
|
||||
# Active users (unique account_uuids)
|
||||
count(count by (account_uuid) (claude_code_claude_code_session_count_total))
|
||||
|
||||
# Average session cost
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[30d]))
|
||||
/
|
||||
sum(increase(claude_code_claude_code_session_count_total[30d]))
|
||||
```
|
||||
|
||||
### Cost Tracking
|
||||
|
||||
```promql
|
||||
# Daily cost trend
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[1d]))
|
||||
|
||||
# Cost by model (pie chart)
|
||||
sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[7d]))
|
||||
|
||||
# Cost by team (bar chart)
|
||||
sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[7d]))
|
||||
```
|
||||
|
||||
### Productivity Tracking
|
||||
|
||||
```promql
|
||||
# LOC per day
|
||||
sum(increase(claude_code_claude_code_lines_of_code_count_total[1d]))
|
||||
|
||||
# Commits per week
|
||||
sum(increase(claude_code_claude_code_commit_count_total[7d]))
|
||||
|
||||
# Efficiency: LOC per dollar
|
||||
sum(increase(claude_code_claude_code_lines_of_code_count_total[30d]))
|
||||
/
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[30d]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Retention and Storage
|
||||
|
||||
**Default Prometheus Retention:** 15 days
|
||||
|
||||
**Adjust retention:**
|
||||
```yaml
|
||||
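# Command-line flags for the Prometheus service in docker-compose.yml.
# Overriding `command:` replaces the image defaults, so keep any existing
# flags (such as --config.file) in this list.
command:
  - '--storage.tsdb.retention.time=90d'
  - '--storage.tsdb.retention.size=50GB'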
|
||||
```
|
||||
|
||||
**Disk usage estimation:**
|
||||
- ~1-2 MB per day per active user
|
||||
- ~30-60 MB per month per active user
|
||||
- ~360-720 MB per year per active user
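
The per-user figures above scale linearly, so a rough sizing for a team can be sketched as follows. `estimate_disk_mb` is a hypothetical helper, and the 1-2 MB per user per day range is the estimate from this section, not a measured value for your deployment.

```python
def estimate_disk_mb(active_users, retention_days, mb_per_user_day=(1.0, 2.0)):
    """Return a (low, high) disk estimate in MB for the given retention window."""
    low_rate, high_rate = mb_per_user_day
    user_days = active_users * retention_days
    return (user_days * low_rate, user_days * high_rate)


if __name__ == "__main__":
    low, high = estimate_disk_mb(active_users=25, retention_days=90)
    print(f"25 users, 90 days: {low / 1024:.1f}-{high / 1024:.1f} GB")
```

Comparing the result with `--storage.tsdb.retention.size` helps confirm that the size limit will not trim data before the time limit is reached.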
|
||||
|
||||
**For long-term storage:** Consider using Prometheus remote write to send data to a time-series database like VictoriaMetrics, Cortex, or Thanos.
|
||||
|
||||
---
|
||||
|
||||
## Additional Resources
|
||||
|
||||
- **Official OTEL Docs:** https://opentelemetry.io/docs/
|
||||
- **Prometheus Query Docs:** https://prometheus.io/docs/prometheus/latest/querying/basics/
|
||||
- **PromQL Examples:** See `prometheus-queries.md`
|
||||
405
skills/otel-monitoring-setup/data/prometheus-queries.md
Normal file
@@ -0,0 +1,405 @@
|
||||
# Useful Prometheus Queries (PromQL)
|
||||
|
||||
Collection of useful PromQL queries for Claude Code telemetry analysis.
|
||||
|
||||
**Note:** All queries use the double prefix: `claude_code_claude_code_*`
|
||||
|
||||
---
|
||||
|
||||
## Cost Analysis
|
||||
|
||||
### Daily Cost Trend
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[1d]))
|
||||
```
|
||||
|
||||
### Cost by Model
|
||||
```promql
|
||||
sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
```
|
||||
|
||||
### Cost per Hour (Rate)
|
||||
```promql
|
||||
rate(claude_code_claude_code_cost_usage_USD_total[5m]) * 3600
|
||||
```
|
||||
|
||||
### Average Cost per Session
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
/
|
||||
sum(increase(claude_code_claude_code_session_count_total[24h]))
|
||||
```
|
||||
|
||||
### Cumulative Monthly Cost
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[30d]))
|
||||
```
|
||||
|
||||
### Cost by Team
|
||||
```promql
|
||||
sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
```
|
||||
|
||||
### Projected Monthly Cost (based on last 7 days)
|
||||
```promql
|
||||
(sum(increase(claude_code_claude_code_cost_usage_USD_total[7d])) / 7) * 30
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Token Usage
|
||||
|
||||
### Total Tokens by Type
|
||||
```promql
|
||||
sum by (type) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
|
||||
```
|
||||
|
||||
### Tokens by Model
|
||||
```promql
|
||||
sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
|
||||
```
|
||||
|
||||
### Cache Hit Rate
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_token_usage_tokens_total{type="cache_read"}[24h]))
|
||||
/
|
||||
sum(increase(claude_code_claude_code_token_usage_tokens_total{type=~"input|cache_creation|cache_read"}[24h]))
|
||||
* 100
|
||||
```
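
The same ratio, written out as plain arithmetic, makes the denominator explicit: every token that could have been served from cache counts, whether it was sent uncached, written to the cache, or read from it. `cache_hit_rate_pct` is a hypothetical helper for checking dashboard values by hand.

```python
def cache_hit_rate_pct(input_tokens, cache_creation_tokens, cache_read_tokens):
    """Percentage of input-side tokens that were served from the prompt cache."""
    denominator = input_tokens + cache_creation_tokens + cache_read_tokens
    if denominator == 0:
        return 0.0  # no traffic in the window; PromQL would return NaN here
    return cache_read_tokens / denominator * 100


if __name__ == "__main__":
    rate = cache_hit_rate_pct(
        input_tokens=1_000,
        cache_creation_tokens=1_000,
        cache_read_tokens=2_000,
    )
    print(f"Cache hit rate: {rate:.1f}%")
```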
|
||||
|
||||
### Input vs Output Token Ratio
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_token_usage_tokens_total{type="input"}[24h]))
|
||||
/
|
||||
sum(increase(claude_code_claude_code_token_usage_tokens_total{type="output"}[24h]))
|
||||
```
|
||||
|
||||
### Token Usage Rate (per minute)
|
||||
```promql
|
||||
sum by (type) (rate(claude_code_claude_code_token_usage_tokens_total[5m]) * 60)
|
||||
```
|
||||
|
||||
### Total Tokens (All Time)
|
||||
```promql
|
||||
sum(claude_code_claude_code_token_usage_tokens_total)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Productivity Metrics
|
||||
|
||||
### Total Lines of Code Modified
|
||||
```promql
|
||||
sum(claude_code_claude_code_lines_of_code_count_total)
|
||||
```
|
||||
|
||||
### LOC by Type (Added, Changed, Deleted)
|
||||
```promql
|
||||
sum by (type) (increase(claude_code_claude_code_lines_of_code_count_total[24h]))
|
||||
```
|
||||
|
||||
### LOC per Hour
|
||||
```promql
|
||||
rate(claude_code_claude_code_lines_of_code_count_total[5m]) * 3600
|
||||
```
|
||||
|
||||
### Lines per Dollar (Efficiency)
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_lines_of_code_count_total[24h]))
|
||||
/
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
```
|
||||
|
||||
### Commits per Day
|
||||
```promql
|
||||
increase(claude_code_claude_code_commit_count_total[24h])
|
||||
```
|
||||
|
||||
### PRs per Week
|
||||
```promql
|
||||
increase(claude_code_claude_code_pr_count_total[7d])
|
||||
```
|
||||
|
||||
### LOC per Commit
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_lines_of_code_count_total[24h]))
|
||||
/
|
||||
sum(increase(claude_code_claude_code_commit_count_total[24h]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Session Analytics
|
||||
|
||||
### Total Sessions
|
||||
```promql
|
||||
sum(claude_code_claude_code_session_count_total)
|
||||
```
|
||||
|
||||
### New Sessions (24h)
|
||||
```promql
|
||||
increase(claude_code_claude_code_session_count_total[24h])
|
||||
```
|
||||
|
||||
### Active Users (Unique account_uuids)
|
||||
```promql
|
||||
count(count by (account_uuid) (claude_code_claude_code_session_count_total))
|
||||
```
|
||||
|
||||
### Average Session Duration
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_active_time_seconds_total[24h]))
|
||||
/
|
||||
sum(increase(claude_code_claude_code_session_count_total[24h]))
|
||||
/ 60
|
||||
```
|
||||
*Result in minutes*
|
||||
|
||||
### Total Active Hours (24h)
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_active_time_seconds_total[24h])) / 3600
|
||||
```
|
||||
|
||||
### Sessions by Version
|
||||
```promql
|
||||
sum by (version) (increase(claude_code_claude_code_session_count_total[24h]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Team Aggregation
|
||||
|
||||
### Cost by Team (Last 24h)
|
||||
```promql
|
||||
sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
```
|
||||
|
||||
### LOC by Team (Last 24h)
|
||||
```promql
|
||||
sum by (team) (increase(claude_code_claude_code_lines_of_code_count_total[24h]))
|
||||
```
|
||||
|
||||
### Active Users per Team
|
||||
```promql
|
||||
count by (team) (count by (team, account_uuid) (claude_code_claude_code_session_count_total))
|
||||
```
|
||||
|
||||
### Team Efficiency (LOC per Dollar)
|
||||
```promql
|
||||
sum by (team) (increase(claude_code_claude_code_lines_of_code_count_total[24h]))
|
||||
/
|
||||
sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
```
|
||||
|
||||
### Top Spending Teams (Last 7 days)
|
||||
```promql
|
||||
topk(5, sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[7d])))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Model Comparison
|
||||
|
||||
### Cost by Model (Pie Chart)
|
||||
```promql
|
||||
sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[7d]))
|
||||
```
|
||||
|
||||
### Token Efficiency by Model (Tokens per Dollar)
|
||||
```promql
|
||||
sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
|
||||
/
|
||||
sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
```
|
||||
|
||||
### Most Used Model
|
||||
```promql
|
||||
topk(1, sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h])))
|
||||
```
|
||||
|
||||
### Model Usage Distribution (%)
|
||||
```promql
|
||||
sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
|
||||
/
|
||||
sum(increase(claude_code_claude_code_token_usage_tokens_total[24h]))
|
||||
* 100
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Alerting Queries
|
||||
|
||||
### High Daily Cost Alert (> $50)
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h])) > 50
|
||||
```
|
||||
|
||||
### Cost Spike Alert (50% increase compared to yesterday)
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
/
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h] offset 24h))
|
||||
> 1.5
|
||||
```
|
||||
|
||||
### No Activity Alert (no sessions in last hour)
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_session_count_total[1h])) == 0
|
||||
```
|
||||
|
||||
### Low Cache Hit Rate Alert (< 20%)
|
||||
```promql
|
||||
(
|
||||
sum(increase(claude_code_claude_code_token_usage_tokens_total{type="cache_read"}[1h]))
|
||||
/
|
||||
sum(increase(claude_code_claude_code_token_usage_tokens_total{type=~"input|cache_creation|cache_read"}[1h]))
|
||||
* 100
|
||||
) < 20
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Forecasting
|
||||
|
||||
### Projected Monthly Cost (based on last 7 days)
|
||||
```promql
|
||||
(sum(increase(claude_code_claude_code_cost_usage_USD_total[7d])) / 7) * 30
|
||||
```
|
||||
|
||||
### Projected Annual Cost (based on last 30 days)
|
||||
```promql
|
||||
(sum(increase(claude_code_claude_code_cost_usage_USD_total[30d])) / 30) * 365
|
||||
```
|
||||
|
||||
### Average Daily Cost (Last 30 days)
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[30d])) / 30
|
||||
```
|
||||
|
||||
### Growth Rate (Week over Week)
|
||||
```promql
|
||||
(
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[7d]))
|
||||
-
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[7d] offset 7d))
|
||||
)
|
||||
/
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[7d] offset 7d))
|
||||
* 100
|
||||
```
|
||||
*Result as percentage*
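
The forecasting queries in this section are simple linear extrapolations. The sketch below restates two of them as plain functions so the arithmetic can be checked by hand; the function names are hypothetical, and the inputs stand for the values returned by the corresponding `sum(increase(...))` expressions.

```python
def projected_monthly_cost(cost_last_7d):
    """Average daily cost over the last 7 days, extended to 30 days."""
    return cost_last_7d / 7 * 30


def week_over_week_growth_pct(cost_this_week, cost_last_week):
    """Percentage change from last week to this week, or None if undefined."""
    if cost_last_week == 0:
        return None  # PromQL would yield +Inf or NaN instead of failing
    return (cost_this_week - cost_last_week) / cost_last_week * 100


if __name__ == "__main__":
    print(projected_monthly_cost(70.0))
    print(week_over_week_growth_pct(120.0, 100.0))
```

Both projections assume that recent usage is representative; a single unusually heavy week will skew the monthly figure.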
|
||||
|
||||
---
|
||||
|
||||
## Debugging Queries
|
||||
|
||||
### Check if Metrics Exist
|
||||
```promql
|
||||
claude_code_claude_code_session_count_total
|
||||
```
|
||||
|
||||
### List All Claude Code Metrics
|
||||
```bash
|
||||
# Use Prometheus UI or API
|
||||
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code
|
||||
```
|
||||
|
||||
### Check Metric Labels
|
||||
```promql
|
||||
# Returns all label combinations
|
||||
count by (account_uuid, version, team, environment) (claude_code_claude_code_session_count_total)
|
||||
```
|
||||
|
||||
### Latest Value for All Metrics
|
||||
```promql
|
||||
# Session count
|
||||
claude_code_claude_code_session_count_total
|
||||
|
||||
# Cost
|
||||
claude_code_claude_code_cost_usage_USD_total
|
||||
|
||||
# Tokens
|
||||
claude_code_claude_code_token_usage_tokens_total
|
||||
|
||||
# LOC
|
||||
claude_code_claude_code_lines_of_code_count_total
|
||||
```
|
||||
|
||||
### Metrics Cardinality (Number of Time Series)
|
||||
```promql
|
||||
count(claude_code_claude_code_token_usage_tokens_total)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Recording Rules
|
||||
|
||||
Save these as Prometheus recording rules for faster dashboard queries:
|
||||
|
||||
```yaml
|
||||
groups:
  - name: claude_code_aggregations
    interval: 1m
    rules:
      # Daily cost
      - record: claude_code:cost_usd:daily
        expr: sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))

      # Cost by team
      - record: claude_code:cost_usd:daily:by_team
        expr: sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))

      # Cache hit rate
      - record: claude_code:cache_hit_rate:daily
        expr: |
          sum(increase(claude_code_claude_code_token_usage_tokens_total{type="cache_read"}[24h]))
          /
          sum(increase(claude_code_claude_code_token_usage_tokens_total{type=~"input|cache_creation|cache_read"}[24h]))
          * 100

      # LOC efficiency
      - record: claude_code:loc_per_dollar:daily
        expr: |
          sum(increase(claude_code_claude_code_lines_of_code_count_total[24h]))
          /
          sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
```
|
||||
|
||||
Then use simplified queries:
|
||||
```promql
|
||||
# Instead of complex query, just use:
|
||||
claude_code:cost_usd:daily
|
||||
claude_code:cost_usd:daily:by_team
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Visualization Tips
|
||||
|
||||
### Time Series Panel
|
||||
- Use `rate()` for smooth trends
|
||||
- Set legend to `{{label_name}}` for clarity
|
||||
- Enable "Lines" draw style with opacity
|
||||
|
||||
### Stat Panel
|
||||
- Use `lastNotNull` for counters
|
||||
- Use `increase(<metric>[24h])` for daily totals
|
||||
- Add thresholds for color coding
|
||||
|
||||
### Bar Chart
|
||||
- Use `sum by (label)` for grouping
|
||||
- Sort by value descending
|
||||
- Limit to top 10 with `topk(10, ...)`
|
||||
|
||||
### Pie Chart
|
||||
- Calculate percentages with division
|
||||
- Use `sum by (label)` for segments
|
||||
- Limit to top categories
|
||||
|
||||
---
|
||||
|
||||
## Additional Resources
|
||||
|
||||
- **Prometheus Query Docs:** https://prometheus.io/docs/prometheus/latest/querying/basics/
|
||||
- **PromQL Examples:** https://prometheus.io/docs/prometheus/latest/querying/examples/
|
||||
- **Grafana Query Editor:** https://grafana.com/docs/grafana/latest/datasources/prometheus/
|
||||
658
skills/otel-monitoring-setup/data/troubleshooting.md
Normal file
658
skills/otel-monitoring-setup/data/troubleshooting.md
Normal file
@@ -0,0 +1,658 @@
|
||||
# Troubleshooting Guide
|
||||
|
||||
Common issues and solutions for Claude Code OpenTelemetry setup.
|
||||
|
||||
---
|
||||
|
||||
## Container Issues
|
||||
|
||||
### Docker Not Running
|
||||
|
||||
**Symptom:** `Cannot connect to the Docker daemon`
|
||||
|
||||
**Diagnosis:**
|
||||
```bash
|
||||
docker info
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
1. Start Docker Desktop application
|
||||
2. Wait for Docker to fully initialize
|
||||
3. Check system tray for Docker icon
|
||||
4. Verify Docker daemon is running: `ps aux | grep docker`
|
||||
|
||||
---
|
||||
|
||||
### Containers Won't Start
|
||||
|
||||
**Symptom:** Containers exit immediately after `docker compose up`
|
||||
|
||||
**Diagnosis:**
|
||||
```bash
|
||||
# Check container logs
|
||||
docker compose logs
|
||||
|
||||
# Check specific service
|
||||
docker compose logs otel-collector
|
||||
docker compose logs prometheus
|
||||
```
|
||||
|
||||
**Common Causes:**
|
||||
|
||||
**1. OTEL Collector Configuration Error**
|
||||
```bash
|
||||
# Check for errors
|
||||
docker compose logs otel-collector | grep -i error
|
||||
|
||||
# Common issues:
|
||||
# - Deprecated logging exporter
|
||||
# - Deprecated 'address' field in telemetry.metrics
|
||||
```
|
||||
|
||||
**Solution A - Deprecated logging exporter:**
|
||||
Update `otel-collector-config.yml`:
|
||||
```yaml
|
||||
exporters:
|
||||
debug:
|
||||
verbosity: normal
|
||||
# NOT:
|
||||
# logging:
|
||||
# loglevel: info
|
||||
```
|
||||
|
||||
**Solution B - Deprecated 'address' field (v0.123.0+):**
|
||||
|
||||
If logs show: `'address' has invalid keys` or similar error:
|
||||
|
||||
Update `otel-collector-config.yml`:
|
||||
```yaml
|
||||
service:
|
||||
telemetry:
|
||||
metrics:
|
||||
level: detailed
|
||||
# REMOVE this line (deprecated in v0.123.0+):
|
||||
# address: ":8888"
|
||||
```
|
||||
|
||||
The `address` field in `service.telemetry.metrics` is deprecated in newer OTEL Collector versions. Remove it, and the collector will fall back to its default internal metrics endpoint.
|
||||
|
||||
**2. Port Already in Use**
|
||||
```bash
|
||||
# Check which ports are in use
|
||||
lsof -i :3000 # Grafana
|
||||
lsof -i :4317 # OTEL gRPC
|
||||
lsof -i :4318 # OTEL HTTP
|
||||
lsof -i :8889 # OTEL Prometheus exporter
|
||||
lsof -i :9090 # Prometheus
|
||||
lsof -i :3100 # Loki
|
||||
```
|
||||
|
||||
**Solution:**
|
||||
- Stop conflicting service
|
||||
- Or change port in docker-compose.yml
|
||||
|
||||
**3. Volume Permission Issues**
|
||||
```bash
|
||||
# Check volume permissions
|
||||
docker volume ls
|
||||
docker volume inspect claude-telemetry_prometheus-data
|
||||
```
|
||||
|
||||
**Solution:**
|
||||
```bash
|
||||
# Remove and recreate volumes
|
||||
docker compose down -v
|
||||
docker compose up -d
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Containers Keep Restarting
|
||||
|
||||
**Symptom:** Container status shows "Restarting"
|
||||
|
||||
**Diagnosis:**
|
||||
```bash
|
||||
docker compose ps
|
||||
docker compose logs --tail=50 <service-name>
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
1. Check memory limits: Increase memory_limiter in OTEL config
|
||||
2. Check disk space: `df -h`
|
||||
3. Check for configuration errors in logs
|
||||
4. Restart Docker Desktop
|
||||
|
||||
---
|
||||
|
||||
## Claude Code Settings Issues
|
||||
|
||||
### 🚨 CRITICAL: Telemetry Not Sending (Most Common Issue)
|
||||
|
||||
**Symptom:** No metrics appearing in Prometheus after Claude Code restart
|
||||
|
||||
**ROOT CAUSE (90% of cases):** Missing required exporter environment variables
|
||||
|
||||
Even when `CLAUDE_CODE_ENABLE_TELEMETRY=1` is set, telemetry **will not send** without explicit exporter configuration. This is the #1 most common issue.
|
||||
|
||||
**Diagnosis Checklist:**
|
||||
|
||||
**1. Check REQUIRED exporters (MOST IMPORTANT):**
|
||||
```bash
|
||||
jq '.env.OTEL_METRICS_EXPORTER' ~/.claude/settings.json
|
||||
# Must return: "otlp" (NOT null, NOT missing)
|
||||
|
||||
jq '.env.OTEL_LOGS_EXPORTER' ~/.claude/settings.json
|
||||
# Should return: "otlp" (recommended for event tracking)
|
||||
```
|
||||
|
||||
**If either returns `null` or is missing, this is your problem!**
|
||||
|
||||
**2. Verify telemetry is enabled:**
|
||||
```bash
|
||||
jq '.env.CLAUDE_CODE_ENABLE_TELEMETRY' ~/.claude/settings.json
|
||||
# Should return: "1"
|
||||
```
|
||||
|
||||
**3. Check OTEL endpoint:**
|
||||
```bash
|
||||
jq '.env.OTEL_EXPORTER_OTLP_ENDPOINT' ~/.claude/settings.json
|
||||
# Should return: "http://localhost:4317" (for local setup)
|
||||
```
|
||||
|
||||
**4. Verify JSON is valid:**
```bash
jq empty ~/.claude/settings.json
# No output = valid JSON
```

**5. Check if Claude Code was restarted:**
```bash
# Telemetry config only loads at startup!
# Must quit and restart Claude Code completely
```

**6. Test OTEL endpoint connectivity:**
```bash
nc -zv localhost 4317
# Should show: Connection to localhost port 4317 [tcp/*] succeeded!
```
|
||||
|
||||
**Solutions:**
|
||||
|
||||
**If exporters are missing (MOST COMMON):**
|
||||
|
||||
Add these REQUIRED settings to ~/.claude/settings.json:
|
||||
```json
|
||||
{
|
||||
"env": {
|
||||
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
|
||||
"OTEL_METRICS_EXPORTER": "otlp",
|
||||
"OTEL_LOGS_EXPORTER": "otlp",
|
||||
"OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
|
||||
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Then **MUST restart Claude Code** (settings only load at startup).
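
The checklist above can also be run as a single script. The sketch below loads `~/.claude/settings.json` and reports any missing or unexpected values. The function names are hypothetical, and treating the last three keys as "recommended" rather than "required" is an assumption based on the checklist wording.

```python
import json
from pathlib import Path

# Keys that must have exactly these values for metrics to be exported.
REQUIRED_ENV = {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "OTEL_METRICS_EXPORTER": "otlp",
}
# Keys that should be present; their values depend on your setup.
RECOMMENDED_KEYS = (
    "OTEL_LOGS_EXPORTER",
    "OTEL_EXPORTER_OTLP_ENDPOINT",
    "OTEL_EXPORTER_OTLP_PROTOCOL",
)


def check_telemetry_env(settings):
    """Return a list of problems found in the parsed settings dict."""
    env = settings.get("env", {})
    problems = []
    for key, expected in REQUIRED_ENV.items():
        actual = env.get(key)
        if actual != expected:
            problems.append(f"{key}: expected {expected!r}, found {actual!r}")
    for key in RECOMMENDED_KEYS:
        if key not in env:
            problems.append(f"{key}: not set")
    return problems


def check_settings_file(path=None):
    """Load a settings file (default: ~/.claude/settings.json) and check it."""
    path = Path(path) if path else Path.home() / ".claude" / "settings.json"
    return check_telemetry_env(json.loads(path.read_text()))


if __name__ == "__main__":
    issues = check_settings_file()
    print("\n".join(issues) if issues else "Telemetry settings look complete.")
```

A JSON syntax error surfaces as a `json.JSONDecodeError`, which covers the "Verify JSON is valid" step as well.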
|
||||
|
||||
**If endpoint unreachable:**
|
||||
- Verify OTEL Collector container is running
|
||||
- Check firewall settings
|
||||
- Try the HTTP endpoint instead: set `OTEL_EXPORTER_OTLP_ENDPOINT` to `http://localhost:4318` and `OTEL_EXPORTER_OTLP_PROTOCOL` to `http/protobuf`
|
||||
|
||||
**If still no data:**
|
||||
- Check OTEL Collector logs for incoming connections
|
||||
- Verify Claude Code is running (not just idle)
|
||||
- Wait 60 seconds (default export interval)
|
||||
|
||||
---
|
||||
|
||||
### Settings.json Syntax Errors
|
||||
|
||||
**Symptom:** Claude Code won't start or shows errors
|
||||
|
||||
**Diagnosis:**
|
||||
```bash
|
||||
# Validate JSON
|
||||
jq empty ~/.claude/settings.json
|
||||
|
||||
# Pretty-print to find issues
|
||||
jq . ~/.claude/settings.json
|
||||
```
|
||||
|
||||
**Common Issues:**
|
||||
- Missing commas between properties
|
||||
- Trailing commas before closing braces
|
||||
- Unescaped quotes in strings
|
||||
- Incorrect nesting
|
||||
|
||||
**Solution:**
|
||||
```bash
|
||||
# Restore backup
|
||||
cp ~/.claude/settings.json.backup ~/.claude/settings.json
|
||||
|
||||
# Or fix JSON manually with editor
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Grafana Issues

### Can't Access Grafana

**Symptom:** `localhost:3000` doesn't load

**Diagnosis:**
```bash
# Check if Grafana is running
docker ps | grep grafana

# Check Grafana logs
docker compose logs grafana

# Check port availability
lsof -i :3000
```

**Solutions:**
1. Verify container is running: `docker compose up -d grafana`
2. Wait 30 seconds for Grafana to initialize
3. Try `http://127.0.0.1:3000` instead
4. Check Docker network: `docker network inspect claude-telemetry`

---

### Dashboard Shows "Datasource Not Found"

**Symptom:** Dashboard panels show "datasource prometheus not found"

**Cause:** The dashboard has a hardcoded datasource UID that doesn't match your Grafana instance

**Diagnosis:**
1. Go to: http://localhost:3000/connections/datasources
2. Click on Prometheus datasource
3. Note the UID from URL (e.g., `PBFA97CFB590B2093`)

**Solution:**
```bash
# Get your datasource UID (assumes the default admin:admin credentials)
DATASOURCE_UID=$(curl -s -u admin:admin http://localhost:3000/api/datasources | jq -r '.[] | select(.type=="prometheus") | .uid')

echo "Your Prometheus datasource UID: $DATASOURCE_UID"

# Update dashboard JSON: swap the hardcoded UID for yours
cd ~/.claude/telemetry/dashboards
sed "s/PBFA97CFB590B2093/$DATASOURCE_UID/g" claude-code-overview.json > claude-code-overview-fixed.json

# Re-import the fixed dashboard
```

---

### Dashboard Shows "No Data"

**Symptom:** Dashboard loads but all panels show "No data"

**Diagnosis Steps:**

**1. Check Prometheus has data:**
```bash
# Query Prometheus directly
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code

# Should see metrics like:
# "claude_code_claude_code_session_count_total"
# "claude_code_claude_code_cost_usage_USD_total"
```

**2. Check datasource connection:**
- Go to: http://localhost:3000/connections/datasources
- Click Prometheus
- Click "Save & Test"
- Should show: "Successfully queried the Prometheus API"

**3. Verify metric names in queries:**
```bash
# Check if metrics use double prefix
curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_session_count_total' | jq .
```

**Solutions:**

**If metrics don't exist:**
- Claude Code hasn't sent data yet (wait 60 seconds)
- OTEL Collector isn't receiving data (check container logs)
- Settings.json wasn't configured correctly

**If metrics exist but dashboard shows no data:**
- Dashboard queries use wrong metric names
- Update queries to use double prefix: `claude_code_claude_code_*`
- Check time range (top-right corner of Grafana)

**If single prefix metrics exist (`claude_code_*`):**

Your setup uses old naming. Update dashboard:
```bash
# Replace double prefix with single
sed 's/claude_code_claude_code_/claude_code_/g' dashboard.json > dashboard-fixed.json
```
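
To decide which of the cases above applies, the metric-name listing can be classified automatically. The sketch below is illustrative only; `detect_prefix` is a hypothetical helper that reads any text containing metric names on stdin and reports whether the double prefix, the single prefix, or neither is present.

```shell
# Classify the naming scheme found in a list of metric names read from stdin.
# Prints "double", "single", or "none".
detect_prefix() {
  names=$(cat)
  if printf '%s\n' "$names" | grep -q 'claude_code_claude_code_'; then
    echo "double"
  elif printf '%s\n' "$names" | grep -q 'claude_code_'; then
    echo "single"
  else
    echo "none"
  fi
}
```

For example: `curl -s 'http://localhost:9090/api/v1/label/__name__/values' | detect_prefix`. A result of `none` points to the "metrics don't exist" case rather than a naming mismatch.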

---

## Prometheus Issues

### Prometheus Shows No Targets

**Symptom:** Prometheus UI (localhost:9090) → Status → Targets shows no targets or DOWN status

**Diagnosis:**
```bash
# Check Prometheus config
cat ~/.claude/telemetry/prometheus.yml

# Check if OTEL Collector is reachable from Prometheus (3 pings, then exit)
docker exec -it claude-prometheus ping -c 3 otel-collector
```

**Solutions:**
1. Verify `prometheus.yml` has correct scrape_configs
2. Ensure OTEL Collector is running
3. Check Docker network connectivity
4. Restart Prometheus: `docker compose restart prometheus`
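
The same target status shown in the UI is available from the Prometheus HTTP API at `/api/v1/targets`, which is handy for checking after each fix. The following is a rough sketch: `summarize_target_health` is a hypothetical helper, and it counts `"health"` fields with `grep` on the compact JSON rather than parsing it.

```shell
# Summarize target health from the JSON returned by /api/v1/targets (stdin).
# Counts occurrences of "health":"up" and "health":"down" as plain text.
summarize_target_health() {
  targets_json=$(cat)
  up=$(printf '%s' "$targets_json" | { grep -o '"health":"up"' || true; } | wc -l | tr -d ' ')
  down=$(printf '%s' "$targets_json" | { grep -o '"health":"down"' || true; } | wc -l | tr -d ' ')
  echo "up=$up down=$down"
}
```

Usage: `curl -s http://localhost:9090/api/v1/targets | summarize_target_health`. An output of `up=0 down=0` suggests that no scrape targets are configured at all, which corresponds to solution 1.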

---

### Prometheus Can't Scrape OTEL Collector

**Symptom:** Target shows as DOWN with error "context deadline exceeded"

**Diagnosis:**
```bash
# Check if OTEL Collector is exposing metrics
curl http://localhost:8889/metrics

# Check OTEL Collector logs
docker compose logs otel-collector
```

**Solutions:**
1. Verify OTEL Collector prometheus exporter is configured
2. Check port 8889 is exposed in docker-compose.yml
3. Restart OTEL Collector: `docker compose restart otel-collector`
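
When the `/metrics` endpoint does respond, it helps to confirm that it actually carries Claude Code series and not only the collector's own output. A minimal sketch, assuming the Prometheus text exposition format; `count_claude_series` is a hypothetical helper name.

```shell
# Count sample lines whose metric name starts with claude_code_ in
# Prometheus text exposition format read from stdin.
# Lines starting with '#' (HELP/TYPE comments) are skipped.
count_claude_series() {
  { grep -v '^#' || true; } | { grep -c '^claude_code_' || true; }
}
```

Usage: `curl -s http://localhost:8889/metrics | count_claude_series`. A count of 0 with a reachable endpoint suggests the collector is running but has not received data from Claude Code.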

---

## Metric Issues

### Metrics Have Double Prefix

**Symptom:** Metrics are named `claude_code_claude_code_*` instead of `claude_code_*`

**Explanation:** This is expected behavior with the current OTEL Collector configuration:
- First `claude_code` = Prometheus exporter namespace
- Second `claude_code` = Original metric name
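
The composition described above can be sketched as a small function. This is a simplified illustration, not the exporter's actual code: it assumes the original OpenTelemetry metric name is dotted (for example `claude_code.session.count`, an assumption not stated in this document), that dots become underscores, and that counters receive a `_total` suffix. Unit suffixes such as `_seconds` or `_USD` are ignored here. `prom_name` is a hypothetical helper name.

```shell
# Build a Prometheus-style name from an exporter namespace and a dotted
# OpenTelemetry counter name. Simplified: unit suffixes are not handled.
prom_name() {
  namespace="$1"
  otel_name=$(printf '%s' "$2" | tr '.' '_')
  if [ -n "$namespace" ]; then
    echo "${namespace}_${otel_name}_total"
  else
    echo "${otel_name}_total"
  fi
}

# prom_name claude_code claude_code.session.count
#   prints claude_code_claude_code_session_count_total
# prom_name "" claude_code.session.count
#   prints claude_code_session_count_total
```

The second call shows why clearing the namespace yields the single-prefix form.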

**Solutions:**

**Option 1: Accept it (Recommended)**
- Update dashboard queries to use double prefix
- This is the standard configuration

**Option 2: Remove namespace prefix**
Update `otel-collector-config.yml`:
```yaml
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: ""  # Remove namespace
```

Then restart: `docker compose restart otel-collector`

---

### Old Metrics Still Showing

**Symptom:** After changing configuration, old metrics still appear

**Cause:** Prometheus retains metrics until the retention period expires

**Solutions:**

**Quick fix: Delete Prometheus data:**
```bash
# WARNING: this deletes all stored metric history
docker compose down
# Volume name depends on your compose project name; list volumes with `docker volume ls`
docker volume rm claude-telemetry_prometheus-data
docker compose up -d
```

**Proper fix: Wait for retention:**
- Default retention is 15 days
- Old metrics will automatically disappear
- New metrics will coexist temporarily

---

## Network Issues

### Can't Reach OTEL Endpoint from Claude Code

**Symptom:** Claude Code can't connect to `localhost:4317`

**Diagnosis:**
```bash
# Test gRPC endpoint
nc -zv localhost 4317

# Test HTTP endpoint (OTLP/HTTP needs an explicit content type)
curl -v http://localhost:4318/v1/metrics -H 'Content-Type: application/json' -d '{}'
```

**Solutions:**

**If connection refused:**
1. Check OTEL Collector is running
2. Verify ports are exposed in docker-compose.yml
3. Check firewall/antivirus blocking localhost connections

**If timeout:**
1. Increase export timeout in settings.json (the standard OpenTelemetry variable is `OTEL_EXPORTER_OTLP_TIMEOUT`, in milliseconds)
2. Try HTTP protocol instead of gRPC

**macOS-specific:**
- If Claude Code itself runs inside a container, use `http://host.docker.internal:4317` instead of `localhost:4317`
- Or use bridge network mode
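
On machines where `nc` is not installed, the same port check can be approximated with bash alone. This is a sketch under the assumption that the shell is bash, since `/dev/tcp` is a bash feature and not a real device file; `port_open` and `probe_report` are hypothetical helper names.

```shell
# Return 0 if a TCP connection to host:port succeeds, non-zero otherwise.
# Requires bash: /dev/tcp/<host>/<port> is handled by bash itself.
port_open() {
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Print a one-line open/closed report for host:port.
probe_report() {
  if port_open "$1" "$2"; then
    echo "$1:$2 open"
  else
    echo "$1:$2 closed"
  fi
}
```

Usage: `probe_report localhost 4317` and `probe_report localhost 4318`. There is no timeout in this sketch, so it suits local endpoints, where a refused connection fails immediately.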

---

### Enterprise Endpoint Unreachable

**Symptom:** Can't connect to company OTEL endpoint

**Diagnosis:**
```bash
# Test connectivity (ICMP may be blocked even when the endpoint is reachable)
ping -c 3 otel.company.com

# Test port
nc -zv otel.company.com 4317

# Test with VPN
# (Ensure corporate VPN is connected)
```

**Solutions:**
1. Connect to corporate VPN
2. Check firewall allows outbound connections
3. Verify endpoint URL is correct
4. Try HTTP endpoint (port 4318) instead of gRPC
5. Contact platform team to verify endpoint is accessible

---

## Performance Issues

### High Memory Usage

**Symptom:** OTEL Collector or Prometheus using excessive memory

**Diagnosis:**
```bash
# Check container resource usage
docker stats

# Check Prometheus TSDB size
du -sh ~/.claude/telemetry/prometheus-data
```

**Solutions:**

**OTEL Collector:**
Reduce memory_limiter in `otel-collector-config.yml`:
```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 256  # Reduce from 512
```

**Prometheus:**
Reduce retention:
```yaml
command:
  - '--storage.tsdb.retention.time=7d'  # Reduce from 15d
  - '--storage.tsdb.retention.size=1GB'
```

---

### Slow Grafana Dashboards

**Symptom:** Dashboards take a long time to load or time out

**Diagnosis:**
```bash
# Check query performance in Prometheus
# Go to: http://localhost:9090/graph
# Run expensive queries like: sum by (account_uuid, model, type) (...)
```

**Solutions:**
1. Reduce dashboard time range (use 6h instead of 7d)
2. Increase dashboard refresh interval (1m → 5m)
3. Use recording rules for complex queries
4. Reduce number of panels
5. Use simpler aggregations

---

## Data Quality Issues

### Unexpected Cost Values

**Symptom:** Cost metrics seem incorrect

**Diagnosis:**
```bash
# Check raw cost values
curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_cost_usage_USD_total' | jq .

# Check token usage
curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_token_usage_tokens_total' | jq .
```

**Causes:**
- Cost is a cumulative counter (not reset between sessions)
- Dashboard may be using wrong time range
- Model pricing may have changed

**Solutions:**
- Use `increase(claude_code_claude_code_cost_usage_USD_total[24h])`, not raw counter values
- Verify pricing in metrics reference
- Check Claude Code version (pricing may vary)
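
To see why a raw counter value misleads, consider a series of counter samples over a time window. The sketch below is a simplified model of what `increase()` computes: it sums the growth between consecutive samples and treats any drop as a counter reset. Prometheus additionally extrapolates to the window boundaries, so real results can differ slightly. `counter_increase` is a hypothetical helper name.

```shell
# Compute the total increase of a counter from samples on stdin (one per line).
# A sample lower than its predecessor is treated as a reset to zero.
counter_increase() {
  awk '
    NR > 1 {
      if ($1 >= prev) total += $1 - prev   # normal growth
      else            total += $1          # reset: count up from zero
    }
    { prev = $1 }
    END { print total + 0 }
  '
}

# printf "10\n12\n15\n2\n5\n" | counter_increase   prints 10
```

In this example the last raw value is 5, yet the counter grew by 10 over the window (2 + 3 before the reset, then 2 + 3 after it), which is the figure a cost panel should report.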

---

### Missing Sessions

**Symptom:** Some Claude Code sessions are not recorded

**Causes:**
1. Claude Code wasn't restarted after settings update
2. OTEL Collector was down during session
3. Export interval hadn't elapsed yet (60 seconds default)
4. Network issue prevented export

**Solutions:**
- Always restart Claude Code after settings changes
- Monitor OTEL Collector uptime
- Check OTEL Collector logs for export errors
- Reduce the export interval if near-real-time data is needed (`OTEL_METRIC_EXPORT_INTERVAL`, in milliseconds)

---

## Getting Help

### Collect Debug Information

When asking for help, provide:

```bash
# 1. Container status
docker compose ps

# 2. Container logs (last 50 lines)
docker compose logs --tail=50

# 3. Configuration files
cat ~/.claude/telemetry/otel-collector-config.yml
cat ~/.claude/telemetry/prometheus.yml

# 4. Claude Code telemetry settings (redact sensitive info such as OTEL_EXPORTER_OTLP_HEADERS!)
jq '.env | with_entries(select(.key | test("^(OTEL_|CLAUDE_CODE_)")))' ~/.claude/settings.json

# 5. Prometheus metrics list
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code

# 6. System info
docker --version
docker compose version
uname -a
```
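
Since step 4 asks for redaction, a small filter can mask likely secrets before the output is shared. This is a rough sketch: `redact_secrets` is a hypothetical helper, the list of key fragments is an assumption, and values containing escaped quotes are not handled.

```shell
# Mask the values of JSON keys whose names contain HEADERS, TOKEN, KEY,
# SECRET, or PASSWORD. Reads JSON text on stdin, writes the masked text.
redact_secrets() {
  sed -E 's/("[^"]*(HEADERS|TOKEN|KEY|SECRET|PASSWORD)[^"]*"[[:space:]]*:[[:space:]]*)"[^"]*"/\1"<redacted>"/g'
}
```

Usage: `jq '.env' ~/.claude/settings.json | redact_secrets`. Review the output by hand before posting it, because a pattern-based filter can miss secrets stored under unexpected key names.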

### Enable Debug Logging

**OTEL Collector:**
```yaml
exporters:
  debug:
    verbosity: detailed  # Change from 'normal'

service:
  telemetry:
    logs:
      level: debug  # Change from 'info'
```

**Claude Code:**
Add to settings.json:
```json
"env": {
  "OTEL_LOG_LEVEL": "debug"
}
```

Then check logs:
```bash
docker compose logs -f otel-collector
```

---

## Additional Resources

- **OTEL Collector Docs:** https://opentelemetry.io/docs/collector/
- **Prometheus Troubleshooting:** https://prometheus.io/docs/prometheus/latest/troubleshooting/
- **Grafana Troubleshooting:** https://grafana.com/docs/grafana/latest/troubleshooting/
- **Docker Compose Docs:** https://docs.docker.com/compose/