Initial commit

2025-11-29 18:16:51 +08:00
commit 4e8a12140c
88 changed files with 17078 additions and 0 deletions
--- a/skills/otel-monitoring-setup/data/metrics-reference.md
+++ b/skills/otel-monitoring-setup/data/metrics-reference.md
@@ -0,0 +1,381 @@
+# Claude Code Metrics Reference
+
+Complete reference for all Claude Code OpenTelemetry metrics.
+
+**Important:** All metrics use a double prefix: `claude_code_claude_code_*`
+
+---
+
+## Metric Categories
+
+1. **Usage Metrics** - Session counts, active time
+2. **Token Metrics** - Input, output, cached tokens
+3. **Cost Metrics** - API costs by model
+4. **Productivity Metrics** - LOC, commits, PRs
+5. **Error Metrics** - Failures, retries
+
+---
+
+## Usage Metrics
+
+### claude_code_claude_code_session_count_total
+
+**Type:** Counter
+**Description:** Total number of Claude Code sessions started
+**Labels:**
+- `account_uuid` - Anonymous user identifier
+- `version` - Claude Code version (e.g., "1.2.3")
+
+**Example Query:**
+```promql
+# Total sessions across all users
+sum(claude_code_claude_code_session_count_total)
+
+# Sessions by version
+sum by (version) (claude_code_claude_code_session_count_total)
+
+# New sessions in last 24h
+increase(claude_code_claude_code_session_count_total[24h])
+```
+
+---
+
+### claude_code_claude_code_active_time_seconds_total
+
+**Type:** Counter
+**Description:** Total active time spent in Claude Code sessions (in seconds)
+**Labels:**
+- `account_uuid` - Anonymous user identifier
+- `version` - Claude Code version
+
+**Example Query:**
+```promql
+# Total active hours
+sum(claude_code_claude_code_active_time_seconds_total) / 3600
+
+# Active hours per day
+increase(claude_code_claude_code_active_time_seconds_total[24h]) / 3600
+
+# Average active time per session (seconds)
+sum(increase(claude_code_claude_code_active_time_seconds_total[24h]))
+/
+sum(increase(claude_code_claude_code_session_count_total[24h]))
+```
+
+**Note:** "Active time" means time when Claude Code is actively processing or responding to user input.
+
+---
+
+## Token Metrics
+
+### claude_code_claude_code_token_usage_tokens_total
+
+**Type:** Counter
+**Description:** Total tokens consumed by Claude Code API calls
+**Labels:**
+- `type` - Token type: `input`, `output`, `cache_creation`, `cache_read`
+- `model` - Model name (e.g., "claude-sonnet-4-5-20250929", "claude-opus-4-20250514")
+- `account_uuid` - Anonymous user identifier
+- `version` - Claude Code version
+
+**Token Types Explained:**
+- **input:** User messages and tool results sent to Claude
+- **output:** Claude's responses (text and tool calls)
+- **cache_creation:** Tokens written to prompt cache (billed at a premium over the input rate: 1.25x for the default 5-minute cache)
+- **cache_read:** Tokens read from prompt cache (billed at 10% of input rate)
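+
+As a rough illustration of the cache read rate, the sketch below estimates how much prompt caching saved over 24 hours, assuming cache reads are billed at 10% of the input rate as stated above. The USD variant assumes the $3/MTok Sonnet input price from the pricing reference in the Cost Metrics section and a `model` label matching `claude-sonnet-4-5.*`; adjust both for your models. It ignores the cost of writing to the cache, so treat the result as an upper bound.
+
+```promql
+# Input-token equivalents saved by cache reads
+# (each cached token costs ~10% of a fresh input token, so ~90% is saved)
+sum(increase(claude_code_claude_code_token_usage_tokens_total{type="cache_read"}[24h])) * 0.9
+
+# Approximate USD saved on Sonnet (assumes $3 per million input tokens)
+sum(increase(claude_code_claude_code_token_usage_tokens_total{type="cache_read", model=~"claude-sonnet-4-5.*"}[24h]))
+* 0.9 * 3 / 1e6
+```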
+
+**Example Query:**
+```promql
+# Total tokens by type (24h)
+sum by (type) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
+
+# Tokens by model (24h)
+sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
+
+# Cache hit rate
+sum(increase(claude_code_claude_code_token_usage_tokens_total{type="cache_read"}[24h]))
+/
+sum(increase(claude_code_claude_code_token_usage_tokens_total{type=~"input|cache_creation|cache_read"}[24h]))
+
+# Token usage rate (per minute)
+rate(claude_code_claude_code_token_usage_tokens_total[5m]) * 60
+```
+
+---
+
+## Cost Metrics
+
+### claude_code_claude_code_cost_usage_USD_total
+
+**Type:** Counter
+**Description:** Total API costs in USD
+**Labels:**
+- `model` - Model name
+- `account_uuid` - Anonymous user identifier
+- `version` - Claude Code version
+
+**Pricing Reference (verify against current Anthropic pricing):**
+- **Claude Sonnet 4.5:** $3/MTok input, $15/MTok output
+- **Claude Opus 4:** $15/MTok input, $75/MTok output
+- **Cache read:** 10% of input price
+- **Cache write:** 125% of input price (default 5-minute cache)
+
+**Example Query:**
+```promql
+# Total cost (24h)
+sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
+
+# Cost by model (24h)
+sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
+
+# Cost per hour
+rate(claude_code_claude_code_cost_usage_USD_total[5m]) * 3600
+
+# Average cost per session
+sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
+/
+sum(increase(claude_code_claude_code_session_count_total[24h]))
+
+# Cumulative cost over time
+sum(claude_code_claude_code_cost_usage_USD_total)
+```
+
+---
+
+## Productivity Metrics
+
+### claude_code_claude_code_lines_of_code_count_total
+
+**Type:** Counter
+**Description:** Total lines of code modified (added + changed + deleted)
+**Labels:**
+- `type` - Modification type: `added`, `changed`, `deleted`
+- `account_uuid` - Anonymous user identifier
+- `version` - Claude Code version
+
+**Example Query:**
+```promql
+# Total LOC modified
+sum(claude_code_claude_code_lines_of_code_count_total)
+
+# LOC by type (24h)
+sum by (type) (increase(claude_code_claude_code_lines_of_code_count_total[24h]))
+
+# LOC per hour
+rate(claude_code_claude_code_lines_of_code_count_total[5m]) * 3600
+
+# Lines per dollar
+sum(increase(claude_code_claude_code_lines_of_code_count_total[24h]))
+/
+sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
+```
+
+---
+
+### claude_code_claude_code_commit_count_total
+
+**Type:** Counter
+**Description:** Total git commits created by Claude Code
+**Labels:**
+- `account_uuid` - Anonymous user identifier
+- `version` - Claude Code version
+
+**Example Query:**
+```promql
+# Total commits
+sum(claude_code_claude_code_commit_count_total)
+
+# Commits per day
+increase(claude_code_claude_code_commit_count_total[24h])
+
+# Commits per session
+increase(claude_code_claude_code_commit_count_total[24h])
+/
+increase(claude_code_claude_code_session_count_total[24h])
+```
+
+---
+
+### claude_code_claude_code_pr_count_total
+
+**Type:** Counter
+**Description:** Total pull requests created by Claude Code
+**Labels:**
+- `account_uuid` - Anonymous user identifier
+- `version` - Claude Code version
+
+**Example Query:**
+```promql
+# Total PRs
+sum(claude_code_claude_code_pr_count_total)
+
+# PRs per week
+increase(claude_code_claude_code_pr_count_total[7d])
+```
+
+---
+
+## Cardinality and Resource Attributes
+
+### Resource Attributes
+
+All metrics include these resource attributes (configured in settings.json):
+
+```json
+"OTEL_RESOURCE_ATTRIBUTES": "environment=local,deployment=poc,team=platform"
+```
+
+**Common Attributes:**
+- `service.name` = "claude-code" (set by OTEL Collector)
+- `environment` - Deployment environment (local, dev, staging, prod)
+- `deployment` - Deployment type (poc, enterprise)
+- `team` - Team identifier
+- `department` - Department identifier
+- `project` - Project identifier
+
+**Querying with Resource Attributes:**
+```promql
+# Filter by environment
+sum(claude_code_claude_code_cost_usage_USD_total{environment="prod"})
+
+# Aggregate by team
+sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
+```
+
+---
+
+## Metric Naming Convention
+
+**Format:** `claude_code_claude_code_<metric_name>_<unit>_<type>`
+
+**Why double prefix?**
+- First `claude_code` comes from Prometheus exporter namespace in OTEL Collector config
+- Second `claude_code` comes from the original metric name in Claude Code
+- This is expected behavior with the current configuration
+
+**Components:**
+- `<metric_name>`: Descriptive name (e.g., `token_usage`, `cost_usage`)
+- `<unit>`: Unit of measurement (e.g., `tokens`, `USD`, `seconds`, `count`)
+- `<type>`: Metric type (always `total` for counters)
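+
+To see which names this convention produces in a given setup, a selector on the `__name__` label can list every metric under the prefix. This is a sketch that assumes the double prefix is in effect; with a single prefix, shorten the regex accordingly.
+
+```promql
+# One result per metric name, with the number of series behind each
+count by (__name__) ({__name__=~"claude_code_claude_code_.+"})
+```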
+
+---
+
+## Querying Best Practices
+
+### Use increase() for Counters
+
+Counters are cumulative, so use `increase()` for time windows:
+
+```promql
+# ✅ Correct - Shows cost in last 24h
+increase(claude_code_claude_code_cost_usage_USD_total[24h])
+
+# ❌ Wrong - Shows cumulative cost since start
+claude_code_claude_code_cost_usage_USD_total
+```
+
+### Use rate() for Rates
+
+Calculate per-second rate, then multiply for desired unit:
+
+```promql
+# Cost per hour
+rate(claude_code_claude_code_cost_usage_USD_total[5m]) * 3600
+
+# Tokens per minute
+rate(claude_code_claude_code_token_usage_tokens_total[5m]) * 60
+```
+
+### Aggregate with sum()
+
+Combine metrics across labels:
+
+```promql
+# Total tokens (all types)
+sum(claude_code_claude_code_token_usage_tokens_total)
+
+# Total tokens by type
+sum by (type) (claude_code_claude_code_token_usage_tokens_total)
+
+# Total cost across all models
+sum(claude_code_claude_code_cost_usage_USD_total)
+```
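+
+Aggregation is also what makes ratios between two different metrics work. Binary operators match series on identical label sets, and the cost metric carries a `model` label that the session counter does not, so a bare division of the two returns no data. A minimal sketch, aggregating both sides down to the same label first:
+
+```promql
+# Cost per session, per user: both sides reduced to account_uuid only
+sum by (account_uuid) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
+/
+sum by (account_uuid) (increase(claude_code_claude_code_session_count_total[24h]))
+```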
+
+---
+
+## Example Dashboards
+
+### Executive Summary (single values)
+
+```promql
+# Total cost, last 30 days
+sum(increase(claude_code_claude_code_cost_usage_USD_total[30d]))
+
+# Total LOC, last 30 days
+sum(increase(claude_code_claude_code_lines_of_code_count_total[30d]))
+
+# Active users (unique account_uuids)
+count(count by (account_uuid) (claude_code_claude_code_session_count_total))
+
+# Average session cost
+sum(increase(claude_code_claude_code_cost_usage_USD_total[30d]))
+/
+sum(increase(claude_code_claude_code_session_count_total[30d]))
+```
+
+### Cost Tracking
+
+```promql
+# Daily cost trend
+sum(increase(claude_code_claude_code_cost_usage_USD_total[1d]))
+
+# Cost by model (pie chart)
+sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[7d]))
+
+# Cost by team (bar chart)
+sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[7d]))
+```
+
+### Productivity Tracking
+
+```promql
+# LOC per day
+sum(increase(claude_code_claude_code_lines_of_code_count_total[1d]))
+
+# Commits per week
+sum(increase(claude_code_claude_code_commit_count_total[7d]))
+
+# Efficiency: LOC per dollar
+sum(increase(claude_code_claude_code_lines_of_code_count_total[30d]))
+/
+sum(increase(claude_code_claude_code_cost_usage_USD_total[30d]))
+```
+
+---
+
+## Retention and Storage
+
+**Default Prometheus Retention:** 15 days
+
+**Adjust retention:**
+```yaml
+# In docker-compose.yml, under the Prometheus service.
+# Retention is set with command-line flags, not in prometheus.yml.
+command:
+  - '--storage.tsdb.retention.time=90d'
+  - '--storage.tsdb.retention.size=50GB'
+```
+
+**Disk usage estimation:**
+- ~1-2 MB per day per active user
+- ~30-60 MB per month per active user
+- ~360-720 MB per year per active user
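+
+These figures are estimates. If Prometheus scrapes its own `/metrics` endpoint (an assumption; the scrape configuration is not shown in this reference), its built-in TSDB metrics give the actual footprint to compare against. Note that the block size covers every stored series, not only Claude Code metrics.
+
+```promql
+# Size of persisted TSDB blocks on disk, in MB
+prometheus_tsdb_storage_blocks_bytes / 1024 / 1024
+
+# Rough per-user footprint: block size divided by distinct account_uuid values
+(prometheus_tsdb_storage_blocks_bytes / 1024 / 1024)
+/ scalar(count(count by (account_uuid) (claude_code_claude_code_session_count_total)))
+```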
+
+**For long-term storage:** Consider using Prometheus remote write to send data to a time-series database like VictoriaMetrics, Cortex, or Thanos.
+
+---
+
+## Additional Resources
+
+- **Official OTEL Docs:** https://opentelemetry.io/docs/
+- **Prometheus Query Docs:** https://prometheus.io/docs/prometheus/latest/querying/basics/
+- **PromQL Examples:** See `prometheus-queries.md`
--- a/skills/otel-monitoring-setup/data/prometheus-queries.md
+++ b/skills/otel-monitoring-setup/data/prometheus-queries.md
@@ -0,0 +1,405 @@
+# Useful Prometheus Queries (PromQL)
+
+Collection of useful PromQL queries for Claude Code telemetry analysis.
+
+**Note:** All queries use the double prefix: `claude_code_claude_code_*`
+
+---
+
+## Cost Analysis
+
+### Daily Cost Trend
+```promql
+sum(increase(claude_code_claude_code_cost_usage_USD_total[1d]))
+```
+
+### Cost by Model
+```promql
+sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
+```
+
+### Cost per Hour (Rate)
+```promql
+rate(claude_code_claude_code_cost_usage_USD_total[5m]) * 3600
+```
+
+### Average Cost per Session
+```promql
+sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
+/
+sum(increase(claude_code_claude_code_session_count_total[24h]))
+```
+
+### Cumulative Monthly Cost
+```promql
+sum(increase(claude_code_claude_code_cost_usage_USD_total[30d]))
+```
+
+### Cost by Team
+```promql
+sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
+```
+
+### Projected Monthly Cost (based on last 7 days)
+```promql
+(sum(increase(claude_code_claude_code_cost_usage_USD_total[7d])) / 7) * 30
+```
+
+---
+
+## Token Usage
+
+### Total Tokens by Type
+```promql
+sum by (type) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
+```
+
+### Tokens by Model
+```promql
+sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
+```
+
+### Cache Hit Rate
+```promql
+sum(increase(claude_code_claude_code_token_usage_tokens_total{type="cache_read"}[24h]))
+/
+sum(increase(claude_code_claude_code_token_usage_tokens_total{type=~"input|cache_creation|cache_read"}[24h]))
+* 100
+```
+
+### Input vs Output Token Ratio
+```promql
+sum(increase(claude_code_claude_code_token_usage_tokens_total{type="input"}[24h]))
+/
+sum(increase(claude_code_claude_code_token_usage_tokens_total{type="output"}[24h]))
+```
+
+### Token Usage Rate (per minute)
+```promql
+sum by (type) (rate(claude_code_claude_code_token_usage_tokens_total[5m]) * 60)
+```
+
+### Total Tokens (All Time)
+```promql
+sum(claude_code_claude_code_token_usage_tokens_total)
+```
+
+---
+
+## Productivity Metrics
+
+### Total Lines of Code Modified
+```promql
+sum(claude_code_claude_code_lines_of_code_count_total)
+```
+
+### LOC by Type (Added, Changed, Deleted)
+```promql
+sum by (type) (increase(claude_code_claude_code_lines_of_code_count_total[24h]))
+```
+
+### LOC per Hour
+```promql
+rate(claude_code_claude_code_lines_of_code_count_total[5m]) * 3600
+```
+
+### Lines per Dollar (Efficiency)
+```promql
+sum(increase(claude_code_claude_code_lines_of_code_count_total[24h]))
+/
+sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
+```
+
+### Commits per Day
+```promql
+increase(claude_code_claude_code_commit_count_total[24h])
+```
+
+### PRs per Week
+```promql
+increase(claude_code_claude_code_pr_count_total[7d])
+```
+
+### LOC per Commit
+```promql
+sum(increase(claude_code_claude_code_lines_of_code_count_total[24h]))
+/
+sum(increase(claude_code_claude_code_commit_count_total[24h]))
+```
+
+---
+
+## Session Analytics
+
+### Total Sessions
+```promql
+sum(claude_code_claude_code_session_count_total)
+```
+
+### New Sessions (24h)
+```promql
+increase(claude_code_claude_code_session_count_total[24h])
+```
+
+### Active Users (Unique account_uuids)
+```promql
+count(count by (account_uuid) (claude_code_claude_code_session_count_total))
+```
+
+### Average Session Duration
+```promql
+sum(increase(claude_code_claude_code_active_time_seconds_total[24h]))
+/
+sum(increase(claude_code_claude_code_session_count_total[24h]))
+/ 60
+```
+*Result in minutes*
+
+### Total Active Hours (24h)
+```promql
+sum(increase(claude_code_claude_code_active_time_seconds_total[24h])) / 3600
+```
+
+### Sessions by Version
+```promql
+sum by (version) (increase(claude_code_claude_code_session_count_total[24h]))
+```
+
+---
+
+## Team Aggregation
+
+### Cost by Team (Last 24h)
+```promql
+sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
+```
+
+### LOC by Team (Last 24h)
+```promql
+sum by (team) (increase(claude_code_claude_code_lines_of_code_count_total[24h]))
+```
+
+### Active Users per Team
+```promql
+count by (team) (count by (team, account_uuid) (claude_code_claude_code_session_count_total))
+```
+
+### Team Efficiency (LOC per Dollar)
+```promql
+sum by (team) (increase(claude_code_claude_code_lines_of_code_count_total[24h]))
+/
+sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
+```
+
+### Top Spending Teams (Last 7 days)
+```promql
+topk(5, sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[7d])))
+```
+
+---
+
+## Model Comparison
+
+### Cost by Model (Pie Chart)
+```promql
+sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[7d]))
+```
+
+### Token Efficiency by Model (Tokens per Dollar)
+```promql
+sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
+/
+sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
+```
+
+### Most Used Model
+```promql
+topk(1, sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h])))
+```
+
+### Model Usage Distribution (%)
+```promql
+sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
+/
+sum(increase(claude_code_claude_code_token_usage_tokens_total[24h]))
+* 100
+```
+
+---
+
+## Alerting Queries
+
+### High Daily Cost Alert (> $50)
+```promql
+sum(increase(claude_code_claude_code_cost_usage_USD_total[24h])) > 50
+```
+
+### Cost Spike Alert (50% increase compared to yesterday)
+```promql
+sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
+/
+sum(increase(claude_code_claude_code_cost_usage_USD_total[24h] offset 24h))
+> 1.5
+```
+
+### No Activity Alert (no sessions in last hour)
+```promql
+increase(claude_code_claude_code_session_count_total[1h]) == 0
+```
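+
+A comparison against zero only fires for series that exist. If no session counter has been reported at all (for example, on a fresh Prometheus or after the series has gone stale), the expression above returns an empty result rather than 0. A hedged complement is `absent()`, which returns 1 exactly when the selector matches nothing:
+
+```promql
+# Fires when no session series exists at all
+absent(claude_code_claude_code_session_count_total)
+
+# Fires when no series exists, or when no new sessions were counted in the last hour
+absent(claude_code_claude_code_session_count_total)
+or
+sum(increase(claude_code_claude_code_session_count_total[1h])) == 0
+```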
+
+### Low Cache Hit Rate Alert (< 20%)
+```promql
+(
+  sum(increase(claude_code_claude_code_token_usage_tokens_total{type="cache_read"}[1h]))
+  /
+  sum(increase(claude_code_claude_code_token_usage_tokens_total{type=~"input|cache_creation|cache_read"}[1h]))
+  * 100
+) < 20
+```
+
+---
+
+## Forecasting
+
+### Projected Monthly Cost (based on last 7 days)
+```promql
+(sum(increase(claude_code_claude_code_cost_usage_USD_total[7d])) / 7) * 30
+```
+
+### Projected Annual Cost (based on last 30 days)
+```promql
+(sum(increase(claude_code_claude_code_cost_usage_USD_total[30d])) / 30) * 365
+```
+
+### Average Daily Cost (Last 30 days)
+```promql
+sum(increase(claude_code_claude_code_cost_usage_USD_total[30d])) / 30
+```
+
+### Growth Rate (Week over Week)
+```promql
+(
+  sum(increase(claude_code_claude_code_cost_usage_USD_total[7d]))
+  -
+  sum(increase(claude_code_claude_code_cost_usage_USD_total[7d] offset 7d))
+)
+/
+sum(increase(claude_code_claude_code_cost_usage_USD_total[7d] offset 7d))
+* 100
+```
+*Result as percentage*
+
+---
+
+## Debugging Queries
+
+### Check if Metrics Exist
+```promql
+claude_code_claude_code_session_count_total
+```
+
+### List All Claude Code Metrics
+```bash
+# Use Prometheus UI or API
+curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code
+```
+
+### Check Metric Labels
+```promql
+# Returns all label combinations
+count by (account_uuid, version, team, environment) (claude_code_claude_code_session_count_total)
+```
+
+### Latest Value for All Metrics
+```promql
+# Session count
+claude_code_claude_code_session_count_total
+
+# Cost
+claude_code_claude_code_cost_usage_USD_total
+
+# Tokens
+claude_code_claude_code_token_usage_tokens_total
+
+# LOC
+claude_code_claude_code_lines_of_code_count_total
+```
+
+### Metrics Cardinality (Number of Time Series)
+```promql
+count(claude_code_claude_code_token_usage_tokens_total)
+```
+
+---
+
+## Recording Rules
+
+Save these as Prometheus recording rules for faster dashboard queries:
+
+```yaml
+groups:
+  - name: claude_code_aggregations
+    interval: 1m
+    rules:
+      # Daily cost
+      - record: claude_code:cost_usd:daily
+        expr: sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
+
+      # Cost by team
+      - record: claude_code:cost_usd:daily:by_team
+        expr: sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
+
+      # Cache hit rate
+      - record: claude_code:cache_hit_rate:daily
+        expr: |
+          sum(increase(claude_code_claude_code_token_usage_tokens_total{type="cache_read"}[24h]))
+          /
+          sum(increase(claude_code_claude_code_token_usage_tokens_total{type=~"input|cache_creation|cache_read"}[24h]))
+          * 100
+
+      # LOC efficiency
+      - record: claude_code:loc_per_dollar:daily
+        expr: |
+          sum(increase(claude_code_claude_code_lines_of_code_count_total[24h]))
+          /
+          sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
+```
+
+Then use simplified queries:
+```promql
+# Instead of complex query, just use:
+claude_code:cost_usd:daily
+claude_code:cost_usd:daily:by_team
+```
+
+---
+
+## Visualization Tips
+
+### Time Series Panel
+- Use `rate()` for smooth trends
+- Set legend to `{{label_name}}` for clarity
+- Enable "Lines" draw style with opacity
+
+### Stat Panel
+- Use `lastNotNull` for counters
+- Use `increase(<metric>[24h])` for daily totals
+- Add thresholds for color coding
+
+### Bar Chart
+- Use `sum by (label)` for grouping
+- Sort by value descending
+- Limit to top 10 with `topk(10, ...)`
+
+### Pie Chart
+- Calculate percentages with division
+- Use `sum by (label)` for segments
+- Limit to top categories
+
+---
+
+## Additional Resources
+
+- **Prometheus Query Docs:** https://prometheus.io/docs/prometheus/latest/querying/basics/
+- **PromQL Examples:** https://prometheus.io/docs/prometheus/latest/querying/examples/
+- **Grafana Query Editor:** https://grafana.com/docs/grafana/latest/datasources/prometheus/
--- a/skills/otel-monitoring-setup/data/troubleshooting.md
+++ b/skills/otel-monitoring-setup/data/troubleshooting.md
@@ -0,0 +1,658 @@
+# Troubleshooting Guide
+
+Common issues and solutions for Claude Code OpenTelemetry setup.
+
+---
+
+## Container Issues
+
+### Docker Not Running
+
+**Symptom:** `Cannot connect to the Docker daemon`
+
+**Diagnosis:**
+```bash
+docker info
+```
+
+**Solutions:**
+1. Start Docker Desktop application
+2. Wait for Docker to fully initialize
+3. Check system tray for Docker icon
+4. Verify Docker daemon is running: `ps aux | grep docker`
+
+---
+
+### Containers Won't Start
+
+**Symptom:** Containers exit immediately after `docker compose up`
+
+**Diagnosis:**
+```bash
+# Check container logs
+docker compose logs
+
+# Check specific service
+docker compose logs otel-collector
+docker compose logs prometheus
+```
+
+**Common Causes:**
+
+**1. OTEL Collector Configuration Error**
+```bash
+# Check for errors
+docker compose logs otel-collector | grep -i error
+
+# Common issues:
+# - Deprecated logging exporter
+# - Deprecated 'address' field in telemetry.metrics
+```
+
+**Solution A - Deprecated logging exporter:**
+Update `otel-collector-config.yml`:
+```yaml
+exporters:
+  debug:
+    verbosity: normal
+  # NOT:
+  # logging:
+  #   loglevel: info
+```
+
+**Solution B - Deprecated 'address' field (v0.123.0+):**
+
+If logs show: `'address' has invalid keys` or similar error:
+
+Update `otel-collector-config.yml`:
+```yaml
+service:
+  telemetry:
+    metrics:
+      level: detailed
+      # REMOVE this line (deprecated in v0.123.0+):
+      # address: ":8888"
+```
+
+The `address` field in `service.telemetry.metrics` is deprecated in newer OTEL Collector versions. Remove it; the collector then uses its default internal metrics endpoint.
+
+**2. Port Already in Use**
+```bash
+# Check which ports are in use
+lsof -i :3000  # Grafana
+lsof -i :4317  # OTEL gRPC
+lsof -i :4318  # OTEL HTTP
+lsof -i :8889  # OTEL Prometheus exporter
+lsof -i :9090  # Prometheus
+lsof -i :3100  # Loki
+```
+
+**Solution:**
+- Stop conflicting service
+- Or change port in docker-compose.yml
+
+**3. Volume Permission Issues**
+```bash
+# Check volume permissions
+docker volume ls
+docker volume inspect claude-telemetry_prometheus-data
+```
+
+**Solution:**
+```bash
+# Remove and recreate volumes
+docker compose down -v
+docker compose up -d
+```
+
+---
+
+### Containers Keep Restarting
+
+**Symptom:** Container status shows "Restarting"
+
+**Diagnosis:**
+```bash
+docker compose ps
+docker compose logs --tail=50 <service-name>
+```
+
+**Solutions:**
+1. Check memory limits: Increase memory_limiter in OTEL config
+2. Check disk space: `df -h`
+3. Check for configuration errors in logs
+4. Restart Docker Desktop
+
+---
+
+## Claude Code Settings Issues
+
+### 🚨 CRITICAL: Telemetry Not Sending (Most Common Issue)
+
+**Symptom:** No metrics appearing in Prometheus after Claude Code restart
+
+**ROOT CAUSE (90% of cases):** Missing required exporter environment variables
+
+Even when `CLAUDE_CODE_ENABLE_TELEMETRY=1` is set, telemetry **will not send** without explicit exporter configuration. This is the #1 most common issue.
+
+**Diagnosis Checklist:**
+
+**1. Check REQUIRED exporters (MOST IMPORTANT):**
+```bash
+jq '.env.OTEL_METRICS_EXPORTER' ~/.claude/settings.json
+# Must return: "otlp" (NOT null, NOT missing)
+
+jq '.env.OTEL_LOGS_EXPORTER' ~/.claude/settings.json
+# Should return: "otlp" (recommended for event tracking)
+```
+
+**If either returns `null` or is missing, this is your problem!**
+
+**2. Verify telemetry is enabled:**
+```bash
+jq '.env.CLAUDE_CODE_ENABLE_TELEMETRY' ~/.claude/settings.json
+# Should return: "1"
+```
+
+**3. Check OTEL endpoint:**
+```bash
+jq '.env.OTEL_EXPORTER_OTLP_ENDPOINT' ~/.claude/settings.json
+# Should return: "http://localhost:4317" (for local setup)
+```
+
+**4. Verify JSON is valid:**
+```bash
+jq empty ~/.claude/settings.json
+# No output = valid JSON
+```
+
+**5. Check if Claude Code was restarted:**
+```bash
+# Telemetry config only loads at startup!
+# Must quit and restart Claude Code completely
+```
+
+**6. Test OTEL endpoint connectivity:**
+```bash
+nc -zv localhost 4317
+# Should show: Connection to localhost port 4317 [tcp/*] succeeded!
+```
+
+**Solutions:**
+
+**If exporters are missing (MOST COMMON):**
+
+Add these REQUIRED settings to ~/.claude/settings.json:
+```json
+{
+  "env": {
+    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
+    "OTEL_METRICS_EXPORTER": "otlp",
+    "OTEL_LOGS_EXPORTER": "otlp",
+    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
+    "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc"
+  }
+}
+```
+
+Then you **must restart Claude Code** (settings only load at startup).
+
+**If endpoint unreachable:**
+- Verify OTEL Collector container is running
+- Check firewall settings
+- Try the HTTP endpoint instead: `http://localhost:4318` (and set `OTEL_EXPORTER_OTLP_PROTOCOL` to `http/protobuf`)
+
+**If still no data:**
+- Check OTEL Collector logs for incoming connections
+- Verify Claude Code is running (not just idle)
+- Wait 60 seconds (default export interval)
+
+---
+
+### Settings.json Syntax Errors
+
+**Symptom:** Claude Code won't start or shows errors
+
+**Diagnosis:**
+```bash
+# Validate JSON
+jq empty ~/.claude/settings.json
+
+# Pretty-print to find issues
+jq . ~/.claude/settings.json
+```
+
+**Common Issues:**
+- Missing commas between properties
+- Trailing commas before closing braces
+- Unescaped quotes in strings
+- Incorrect nesting
+
+**Solution:**
+```bash
+# Restore backup
+cp ~/.claude/settings.json.backup ~/.claude/settings.json
+
+# Or fix JSON manually with editor
+```
+
+---
+
+## Grafana Issues
+
+### Can't Access Grafana
+
+**Symptom:** `localhost:3000` doesn't load
+
+**Diagnosis:**
+```bash
+# Check if Grafana is running
+docker ps | grep grafana
+
+# Check Grafana logs
+docker compose logs grafana
+
+# Check port availability
+lsof -i :3000
+```
+
+**Solutions:**
+1. Verify container is running: `docker compose up -d grafana`
+2. Wait 30 seconds for Grafana to initialize
+3. Try `http://127.0.0.1:3000` instead
+4. Check Docker network: `docker network inspect claude-telemetry`
+
+---
+
+### Dashboard Shows "Datasource Not Found"
+
+**Symptom:** Dashboard panels show "datasource prometheus not found"
+
+**Cause:** Dashboard has hardcoded datasource UID that doesn't match your Grafana instance
+
+**Diagnosis:**
+1. Go to: http://localhost:3000/connections/datasources
+2. Click on Prometheus datasource
+3. Note the UID from URL (e.g., `PBFA97CFB590B2093`)
+
+**Solution:**
+```bash
+# Get your datasource UID
+DATASOURCE_UID=$(curl -s -u admin:admin http://localhost:3000/api/datasources | jq -r '.[] | select(.type=="prometheus") | .uid')
+
+echo "Your Prometheus datasource UID: $DATASOURCE_UID"
+
+# Update dashboard JSON
+cd ~/.claude/telemetry/dashboards
+sed "s/PBFA97CFB590B2093/$DATASOURCE_UID/g" claude-code-overview.json > claude-code-overview-fixed.json
+
+# Re-import the fixed dashboard
+```
+
+---
+
+### Dashboard Shows "No Data"
+
+**Symptom:** Dashboard loads but all panels show "No data"
+
+**Diagnosis Steps:**
+
+**1. Check Prometheus has data:**
+```bash
+# Query Prometheus directly
+curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code
+
+# Should see metrics like:
+# "claude_code_claude_code_session_count_total"
+# "claude_code_claude_code_cost_usage_USD_total"
+```
+
+**2. Check datasource connection:**
+- Go to: http://localhost:3000/connections/datasources
+- Click Prometheus
+- Click "Save & Test"
+- Should show: "Successfully queried the Prometheus API"
+
+**3. Verify metric names in queries:**
+```bash
+# Check if metrics use double prefix
+curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_session_count_total' | jq .
+```
+
+**Solutions:**
+
+**If metrics don't exist:**
+- Claude Code hasn't sent data yet (wait 60 seconds)
+- OTEL Collector isn't receiving data (check container logs)
+- Settings.json wasn't configured correctly
+
+**If metrics exist but dashboard shows no data:**
+- Dashboard queries use wrong metric names
+- Update queries to use double prefix: `claude_code_claude_code_*`
+- Check time range (top-right corner of Grafana)
+
+**If single prefix metrics exist (`claude_code_*`):**
+Your setup uses old naming. Update dashboard:
+```bash
+# Replace double prefix with single
+sed 's/claude_code_claude_code_/claude_code_/g' dashboard.json > dashboard-fixed.json
+```
+
+---
+
+## Prometheus Issues
+
+### Prometheus Shows No Targets
+
+**Symptom:** Prometheus UI (localhost:9090) → Status → Targets shows no targets or DOWN status
+
+**Diagnosis:**
+```bash
+# Check Prometheus config
+cat ~/.claude/telemetry/prometheus.yml
+
+# Check if OTEL Collector is reachable from Prometheus
+docker exec -it claude-prometheus ping otel-collector
+```
+
+**Solutions:**
+1. Verify `prometheus.yml` has correct scrape_configs
+2. Ensure OTEL Collector is running
+3. Check Docker network connectivity
+4. Restart Prometheus: `docker compose restart prometheus`
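+
+Target health can also be checked from the Prometheus query page. Prometheus records a synthetic `up` series for every configured target; the `job` and `instance` label values depend on your `prometheus.yml`, so none are assumed here:
+
+```promql
+# 1 = last scrape succeeded, 0 = last scrape failed; empty result = no targets configured
+up
+
+# Only the failing targets
+up == 0
+```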
+
+---
+
+### Prometheus Can't Scrape OTEL Collector
+
+**Symptom:** Target shows as DOWN with error "context deadline exceeded"
+
+**Diagnosis:**
+```bash
+# Check if OTEL Collector is exposing metrics
+curl http://localhost:8889/metrics
+
+# Check OTEL Collector logs
+docker compose logs otel-collector
+```
+
+**Solutions:**
+1. Verify OTEL Collector prometheus exporter is configured
+2. Check port 8889 is exposed in docker-compose.yml
+3. Restart OTEL Collector: `docker compose restart otel-collector`
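+
+"context deadline exceeded" means the scrape did not finish within the scrape timeout. Prometheus records per-target scrape statistics automatically, which can help show whether the collector responds slowly or returns nothing:
+
+```promql
+# Seconds each scrape took; values near the configured scrape_timeout indicate a slow target
+scrape_duration_seconds
+
+# Samples returned by the last scrape; 0 means no metrics were ingested from the target
+scrape_samples_scraped
+```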
+
+---
+
+## Metric Issues
+
+### Metrics Have Double Prefix
+
+**Symptom:** Metrics are named `claude_code_claude_code_*` instead of `claude_code_*`
+
+**Explanation:** This is expected behavior with the current OTEL Collector configuration:
+- First `claude_code` = Prometheus exporter namespace
+- Second `claude_code` = Original metric name
+
+**Solutions:**
+
+**Option 1: Accept it (Recommended)**
+- Update dashboard queries to use double prefix
+- This is the standard configuration
+
+**Option 2: Remove namespace prefix**
+Update `otel-collector-config.yml`:
+```yaml
+exporters:
+  prometheus:
+    endpoint: "0.0.0.0:8889"
+    namespace: ""  # Remove namespace
+```
+
+Then restart: `docker compose restart otel-collector`
+
+---
+
+### Old Metrics Still Showing
+
+**Symptom:** After changing configuration, old metrics still appear
+
+**Cause:** Prometheus retains metrics until retention period expires
+
+**Solutions:**
+
+**Quick fix: Delete Prometheus data:**
+```bash
+docker compose down
+docker volume rm claude-telemetry_prometheus-data
+docker compose up -d
+```
+
+**Proper fix: Wait for retention:**
+- Default retention is 15 days
+- Old metrics will automatically disappear
+- New metrics will coexist temporarily
+
+---
+
+## Network Issues
+
+### Can't Reach OTEL Endpoint from Claude Code
+
+**Symptom:** Claude Code can't connect to `localhost:4317`
+
+**Diagnosis:**
+```bash
+# Test gRPC endpoint
+nc -zv localhost 4317
+
+# Test HTTP endpoint
+curl -v http://localhost:4318/v1/metrics -d '{}'
+```
+
+**Solutions:**
+
+**If connection refused:**
+1. Check OTEL Collector is running
+2. Verify ports are exposed in docker-compose.yml
+3. Check firewall/antivirus blocking localhost connections
+
+**If timeout:**
+1. Increase export timeout in settings.json
+2. Try HTTP protocol instead of gRPC
+
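+**macOS-specific (Docker Desktop):**
+- If `localhost:4317` does not reach the collector (for example, when Claude Code itself runs inside a container), use `http://host.docker.internal:4317` instead
+- Or use bridge network mode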
+
+---
+
+### Enterprise Endpoint Unreachable
+
+**Symptom:** Can't connect to company OTEL endpoint
+
+**Diagnosis:**
+```bash
+# Test connectivity (ICMP is often blocked, so a failed ping alone is not conclusive)
+ping -c 4 otel.company.com
+
+# Test port
+nc -zv otel.company.com 4317
+
+# If either test fails, connect to the corporate VPN and repeat both
+```
+
+**Solutions:**
+1. Connect to corporate VPN
+2. Check firewall allows outbound connections
+3. Verify endpoint URL is correct
+4. Try HTTP endpoint (port 4318) instead of gRPC
+5. Contact platform team to verify endpoint is accessible
+
+---
+
+## Performance Issues
+
+### High Memory Usage
+
+**Symptom:** OTEL Collector or Prometheus using excessive memory
+
+**Diagnosis:**
+```bash
+# Check container resource usage
+docker stats
+
+# Check Prometheus TSDB size
+du -sh ~/.claude/telemetry/prometheus-data
+```
+
+**Solutions:**
+
+**OTEL Collector:**
+Lower the `memory_limiter` limit in `otel-collector-config.yml`:
+```yaml
+processors:
+  memory_limiter:
+    check_interval: 1s
+    limit_mib: 256  # Reduce from 512
+```
+
+**Prometheus:**
+Reduce retention in the Prometheus service `command` in `docker-compose.yml`:
+```yaml
+command:
+  - '--storage.tsdb.retention.time=7d'  # Reduce from 15d
+  - '--storage.tsdb.retention.size=1GB'
+```
+
+---
+
+### Slow Grafana Dashboards
+
+**Symptom:** Dashboards take a long time to load or time out
+
+**Diagnosis:**
+```bash
+# Check query performance in Prometheus
+# Go to: http://localhost:9090/graph
+# Run expensive queries like: sum by (account_uuid, model, type) (...)
+```
+
+**Solutions:**
+1. Reduce dashboard time range (use 6h instead of 7d)
+2. Increase dashboard refresh interval (1m → 5m)
+3. Use recording rules for complex queries
+4. Reduce number of panels
+5. Use simpler aggregations
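+
+As an illustration of solution 3, a recording rule precomputes an expensive aggregation so panels read one small series instead. The sketch below writes a rules file; the file name `recording-rules.yml`, the group name, and the recorded series name are hypothetical, and it assumes the cost metric carries a `model` label. The file still has to be mounted into the Prometheus container and listed under `rule_files:` in `prometheus.yml`.
+
+```bash
+# Assumption: telemetry configs live in ~/.claude/telemetry (override with TELEMETRY_DIR).
+TELEMETRY_DIR="${TELEMETRY_DIR:-$HOME/.claude/telemetry}"
+mkdir -p "$TELEMETRY_DIR"
+
+# Quoted 'EOF' keeps the shell from expanding anything inside the YAML.
+cat > "$TELEMETRY_DIR/recording-rules.yml" <<'EOF'
+groups:
+  - name: claude_code_recording
+    interval: 1m
+    rules:
+      # Hourly cost per model, evaluated once a minute.
+      - record: claude_code:cost_usd:increase1h
+        expr: sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[1h]))
+EOF
+
+echo "Wrote $TELEMETRY_DIR/recording-rules.yml"
+```
+
+Dashboards can then query `claude_code:cost_usd:increase1h` directly instead of re-evaluating the `increase` on every refresh.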
+
+---
+
+## Data Quality Issues
+
+### Unexpected Cost Values
+
+**Symptom:** Cost metrics seem incorrect
+
+**Diagnosis:**
+```bash
+# Check raw cost values
+curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_cost_usage_USD_total' | jq .
+
+# Check token usage
+curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_token_usage_tokens_total' | jq .
+```
+
+**Causes:**
+- Cost is a cumulative counter (not reset between sessions), so raw values only ever grow
+- Dashboard may be using wrong time range
+- Model pricing may have changed
+
+**Solutions:**
+- Query `increase(claude_code_claude_code_cost_usage_USD_total[24h])` rather than raw counter values
+- Verify pricing in metrics reference
+- Check Claude Code version (pricing may vary)
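+
+The first solution can be checked from the command line. This sketch asks Prometheus for the cost accrued over the last 24 hours per model; it assumes Prometheus is on `localhost:9090` as in the diagnosis above and that the cost metric has a `model` label.
+
+```bash
+PROM_URL="${PROM_URL:-http://localhost:9090}"
+QUERY='sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))'
+
+# -G sends the URL-encoded query as GET parameters, so spaces and brackets are safe.
+curl -s -G "${PROM_URL}/api/v1/query" --data-urlencode "query=${QUERY}" | jq .
+```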
+
+---
+
+### Missing Sessions
+
+**Symptom:** Some Claude Code sessions not recorded
+
+**Causes:**
+1. Claude Code wasn't restarted after settings update
+2. OTEL Collector was down during session
+3. Export interval hadn't elapsed yet (60 seconds default)
+4. Network issue prevented export
+
+**Solutions:**
+- Always restart Claude Code after settings changes
+- Monitor OTEL Collector uptime
+- Check OTEL Collector logs for export errors
+- Reduce the export interval if near-real-time data is needed
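+
+A sketch of the last suggestion, assuming `OTEL_METRIC_EXPORT_INTERVAL` (milliseconds; the 60-second default noted above corresponds to `60000`) is the variable controlling the interval. The same key and value can go in the `env` block of settings.json instead.
+
+```bash
+# Export metrics every 10 s instead of the default 60 s.
+export OTEL_METRIC_EXPORT_INTERVAL=10000
+echo "Export interval: $((OTEL_METRIC_EXPORT_INTERVAL / 1000))s"
+```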
+
+---
+
+## Getting Help
+
+### Collect Debug Information
+
+When asking for help, include the output of these commands:
+
+```bash
+# 1. Container status
+docker compose ps
+
+# 2. Container logs (last 50 lines)
+docker compose logs --tail=50
+
+# 3. Configuration files
+cat ~/.claude/telemetry/otel-collector-config.yml
+cat ~/.claude/telemetry/prometheus.yml
+
+# 4. Claude Code settings (redact sensitive info!)
+jq '.env | with_entries(select(.key | startswith("OTEL_")))' ~/.claude/settings.json
+
+# 5. Prometheus metrics list
+curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code
+
+# 6. System info
+docker --version
+docker compose version
+uname -a
+```
+
+### Enable Debug Logging
+
+**OTEL Collector:**
+```yaml
+exporters:
+  debug:
+    verbosity: detailed  # Change from 'normal'
+
+service:
+  telemetry:
+    logs:
+      level: debug  # Change from 'info'
+```
+
+**Claude Code:**
+Add to settings.json:
+```json
+"env": {
+  "OTEL_LOG_LEVEL": "debug"
+}
+```
+
+Then check logs:
+```bash
+docker compose logs -f otel-collector
+```
+
+---
+
+## Additional Resources
+
+- **OTEL Collector Docs:** https://opentelemetry.io/docs/collector/
+- **Prometheus Troubleshooting:** https://prometheus.io/docs/prometheus/latest/troubleshooting/
+- **Grafana Troubleshooting:** https://grafana.com/docs/grafana/latest/troubleshooting/
+- **Docker Compose Docs:** https://docs.docker.com/compose/