Initial commit

Zhongwei Li
2025-11-29 18:16:40 +08:00
commit f125e90b9f
370 changed files with 67769 additions and 0 deletions


@@ -0,0 +1,14 @@
# Changelog
## 1.1.0
- Renamed from claude-code-otel-setup to otel-monitoring-setup
- Refactored to Anthropic progressive disclosure pattern
- Updated description with "Use PROACTIVELY when..." format
## 1.0.0
- Initial skill release
- Local PoC mode with Docker stack
- Enterprise mode for centralized infrastructure
- Grafana dashboard imports


@@ -0,0 +1,558 @@
# Claude Code OpenTelemetry Setup Skill
Automated workflow for setting up OpenTelemetry-based telemetry collection for Claude Code usage monitoring, cost tracking, and productivity analytics.
**Version:** 1.1.0
**Author:** Prometheus Team
---
## Features
- **Mode 1: Local PoC Setup** - Full Docker stack with Grafana dashboards
- **Mode 2: Enterprise Setup** - Connect to centralized infrastructure
- Automated configuration file generation
- Dashboard import with UID detection
- Verification and testing procedures
- Comprehensive troubleshooting guides
---
## Quick Start
### Prerequisites
**For Mode 1 (Local PoC):**
- Docker Desktop installed and running
- Claude Code installed
- Write access to `~/.claude/settings.json`
**For Mode 2 (Enterprise):**
- OTEL Collector endpoint URL
- Authentication credentials
- Write access to `~/.claude/settings.json`
### Installation
This skill is designed to be invoked by Claude Code. No manual installation required.
### Usage
**Mode 1 - Local PoC Setup:**
```
"Set up Claude Code telemetry locally"
"I want to try OpenTelemetry with Claude Code"
"Create a local telemetry stack for me"
```
**Mode 2 - Enterprise Setup:**
```
"Connect Claude Code to our company OTEL endpoint at otel.company.com:4317"
"Set up telemetry for team rollout"
"Configure enterprise telemetry"
```
---
## What Gets Collected?
### Metrics
- **Session counts and active time** - How much you use Claude Code
- **Token usage** - Input, output, cached tokens by model
- **API costs** - Spend tracking by model and time
- **Lines of code** - Code modifications (added, changed, deleted)
- **Commits and PRs** - Git activity tracking
### Events/Logs
- User prompts (if enabled)
- Tool executions
- API requests
- Session lifecycle
**Privacy:** Metrics are anonymized. Source code content is never collected.
---
## Directory Structure
```
otel-monitoring-setup/
├── SKILL.md # Main skill definition
├── README.md # This file
├── modes/
│ ├── mode1-poc-setup.md # Detailed local setup workflow
│ └── mode2-enterprise.md # Detailed enterprise setup workflow
├── templates/
│ ├── docker-compose.yml # Docker Compose configuration
│ ├── otel-collector-config.yml # OTEL Collector configuration
│ ├── prometheus.yml # Prometheus scrape configuration
│ ├── grafana-datasources.yml # Grafana datasource provisioning
│ ├── settings.json.local # Local telemetry settings template
│ ├── settings.json.enterprise # Enterprise settings template
│ ├── start-telemetry.sh # Start script
│ └── stop-telemetry.sh # Stop script
├── dashboards/
│ ├── README.md # Dashboard import guide
│ ├── claude-code-overview.json # Comprehensive dashboard
│ └── claude-code-simple.json # Simplified dashboard
└── data/
├── metrics-reference.md # Complete metrics documentation
├── prometheus-queries.md # Useful PromQL queries
└── troubleshooting.md # Common issues and solutions
```
---
## Mode 1: Local PoC Setup
**What it does:**
- Creates `~/.claude/telemetry/` directory
- Generates Docker Compose configuration
- Starts 4 containers: OTEL Collector, Prometheus, Loki, Grafana
- Updates Claude Code settings.json
- Imports Grafana dashboards
- Verifies data flow
**Time:** 5-7 minutes
**Output:**
- Grafana: http://localhost:3000 (admin/admin)
- Prometheus: http://localhost:9090
- Working dashboards with real data
**Detailed workflow:** See `modes/mode1-poc-setup.md`
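Once the stack is up, a quick health check confirms everything is wired together. A minimal sketch, assuming the Mode 1 defaults (Grafana on port 3000, Prometheus on port 9090, generated files under `~/.claude/telemetry/`):
```bash
# Run from ~/.claude/telemetry/ where the generated docker-compose.yml lives
docker compose ps

# Prometheus should list claude_code metrics within ~60s of restarting Claude Code
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code

# Grafana health endpoint should report "database": "ok"
curl -s http://localhost:3000/api/health | jq .
```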
---
## Mode 2: Enterprise Setup
**What it does:**
- Collects enterprise OTEL endpoint details
- Updates Claude Code settings.json with endpoint and auth
- Adds team/environment resource attributes
- Tests connectivity (optional)
- Provides team rollout documentation
**Time:** 2-3 minutes
**Output:**
- Claude Code configured to send to central endpoint
- Connectivity verified
- Team rollout guide generated
**Detailed workflow:** See `modes/mode2-enterprise.md`
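Before rolling the configuration out to a team, it is worth confirming the central endpoint is reachable from a developer machine. A hedged sketch, using the placeholder host `otel.company.com` from the examples above:
```bash
# OTLP/gRPC reachability (default port 4317)
nc -zv otel.company.com 4317

# OTLP/HTTP reachability, if your collector also exposes it (default port 4318)
nc -zv otel.company.com 4318

# For HTTPS endpoints, confirm the TLS handshake and certificate chain
openssl s_client -connect otel.company.com:4317 </dev/null 2>/dev/null | head -5
```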
---
## Example Dashboards
### Overview Dashboard
Includes:
- Total Lines of Code (all-time)
- Total Cost (24h)
- Total Tokens (24h)
- Active Time (24h)
- Cost Over Time (timeseries)
- Token Usage by Type (stacked)
- Lines of Code Modified (bar chart)
- Commits Created (24h)
### Custom Queries
See `data/prometheus-queries.md` for 50+ ready-to-use PromQL queries:
- Cost analysis
- Token usage
- Productivity metrics
- Team aggregation
- Model comparison
- Alerting rules
---
## Common Use Cases
### Individual Developer
**Goal:** Track personal Claude Code usage and costs
**Setup:**
```
Mode 1 (Local PoC)
```
**Access:**
- Personal Grafana dashboard at localhost:3000
- All data stays local
---
### Team Pilot (5-10 Users)
**Goal:** Aggregate metrics across pilot users
**Setup:**
```
Mode 2 (Enterprise)
```
**Architecture:**
- Centralized OTEL Collector
- Team-level Prometheus/Grafana
- Aggregated dashboards
---
### Enterprise Rollout (100+ Users)
**Goal:** Organization-wide cost tracking and productivity analytics
**Setup:**
```
Mode 2 (Enterprise) + Managed Infrastructure
```
**Features:**
- Department/team/project attribution
- Chargeback reporting
- Executive dashboards
- Trend analysis
---
## Troubleshooting
### Quick Checks
**Containers not starting:**
```bash
docker compose logs
```
**No metrics in Prometheus:**
1. Restart Claude Code (telemetry loads at startup)
2. Wait 60 seconds (export interval)
3. Check OTEL Collector logs: `docker compose logs otel-collector`
**Dashboard shows "No data":**
1. Verify metric names use double prefix: `claude_code_claude_code_*`
2. Check time range (top-right corner)
3. Verify datasource UID matches
**Full troubleshooting guide:** See `data/troubleshooting.md`
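For a one-shot health check that combines the items above, something like this sketch (assuming the Mode 1 defaults and the `~/.claude/telemetry/` layout) can be dropped into a script:
```bash
#!/usr/bin/env bash
# Quick local-stack health check; run from ~/.claude/telemetry/
set -u

echo "== Containers =="
docker compose ps

echo "== Required settings in ~/.claude/settings.json =="
jq '.env | {CLAUDE_CODE_ENABLE_TELEMETRY, OTEL_METRICS_EXPORTER, OTEL_LOGS_EXPORTER}' ~/.claude/settings.json

echo "== Claude Code metrics in Prometheus =="
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code \
  || echo "No claude_code metrics yet (restart Claude Code and wait ~60s)"
```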
---
## Known Issues
### Issue 1: 🚨 CRITICAL - Missing OTEL Exporters
**Description:** Claude Code not sending telemetry even with `CLAUDE_CODE_ENABLE_TELEMETRY=1`
**Cause:** Missing required `OTEL_METRICS_EXPORTER` and `OTEL_LOGS_EXPORTER` settings
**Solution:** The skill templates include these by default. **Always verify** they're present in settings.json. See Configuration Reference for details.
---
### Issue 2: OTEL Collector Deprecated 'address' Field
**Description:** Collector crashes with "'address' has invalid keys" error
**Cause:** The `address` field in `service.telemetry.metrics` is deprecated in collector v0.123.0+
**Solution:** Skill templates have this removed. If using custom config, remove the deprecated field.
---
### Issue 3: Metric Double Prefix
**Description:** Metrics are named `claude_code_claude_code_*` instead of `claude_code_*`
**Cause:** OTEL Collector Prometheus exporter adds namespace prefix
**Solution:** This is expected. Dashboards use correct naming.
---
### Issue 4: Dashboard Datasource UID Mismatch
**Description:** Dashboard shows "datasource prometheus not found"
**Cause:** Dashboard has hardcoded UID that doesn't match your Grafana
**Solution:** Skill automatically detects and fixes UID during import
---
### Issue 5: OTEL Collector Deprecated Exporter
**Description:** Container fails with "logging exporter has been deprecated"
**Cause:** Old OTEL configuration
**Solution:** Skill uses `debug` exporter (not deprecated `logging`)
---
## Configuration Reference
### Settings.json (Local)
**🚨 CRITICAL REQUIREMENTS:**
The following settings are **REQUIRED** (not optional) for telemetry to work:
- `CLAUDE_CODE_ENABLE_TELEMETRY: "1"` - Enables telemetry system
- `OTEL_METRICS_EXPORTER: "otlp"` - **REQUIRED** to send metrics (most common missing setting!)
- `OTEL_LOGS_EXPORTER: "otlp"` - **REQUIRED** to send events/logs
Without `OTEL_METRICS_EXPORTER` and `OTEL_LOGS_EXPORTER`, telemetry will not be sent even if `CLAUDE_CODE_ENABLE_TELEMETRY=1` is set.
```json
{
"env": {
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
"OTEL_METRICS_EXPORTER": "otlp", // REQUIRED!
"OTEL_LOGS_EXPORTER": "otlp", // REQUIRED!
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
"OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
"OTEL_METRIC_EXPORT_INTERVAL": "60000",
"OTEL_LOGS_EXPORT_INTERVAL": "5000",
"OTEL_LOG_USER_PROMPTS": "1",
"OTEL_METRICS_INCLUDE_SESSION_ID": "true",
"OTEL_METRICS_INCLUDE_VERSION": "true",
"OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
"OTEL_RESOURCE_ATTRIBUTES": "environment=local,deployment=poc"
}
}
```
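The skill writes these settings for you, but if you are merging them into an existing `~/.claude/settings.json` by hand, a `jq`-based sketch like the following keeps the rest of the file intact. Note that the `// REQUIRED!` annotations above are for illustration only; the real file must stay valid JSON (the troubleshooting guide validates it with `jq empty`).
```bash
# Back up first (matches the rollback path in the Support section)
cp ~/.claude/settings.json ~/.claude/settings.json.backup

# Merge the required telemetry variables without touching other settings
jq '.env += {
      "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
      "OTEL_METRICS_EXPORTER": "otlp",
      "OTEL_LOGS_EXPORTER": "otlp",
      "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
      "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317"
    }' ~/.claude/settings.json > /tmp/settings.json.new \
  && mv /tmp/settings.json.new ~/.claude/settings.json
```
Restart Claude Code afterwards; the settings are only read at startup.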
### Settings.json (Enterprise)
**Same CRITICAL requirements apply:**
- `OTEL_METRICS_EXPORTER: "otlp"` - **REQUIRED!**
- `OTEL_LOGS_EXPORTER: "otlp"` - **REQUIRED!**
```json
{
"env": {
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
"OTEL_METRICS_EXPORTER": "otlp", // REQUIRED!
"OTEL_LOGS_EXPORTER": "otlp", // REQUIRED!
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
"OTEL_EXPORTER_OTLP_ENDPOINT": "https://otel.company.com:4317",
"OTEL_EXPORTER_OTLP_HEADERS": "Authorization=Bearer YOUR_API_KEY",
"OTEL_METRIC_EXPORT_INTERVAL": "60000",
"OTEL_LOGS_EXPORT_INTERVAL": "5000",
"OTEL_LOG_USER_PROMPTS": "1",
"OTEL_METRICS_INCLUDE_SESSION_ID": "true",
"OTEL_METRICS_INCLUDE_VERSION": "true",
"OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
"OTEL_RESOURCE_ATTRIBUTES": "team=platform,environment=production"
}
}
```
---
## Management
### Start Telemetry Stack (Mode 1)
```bash
~/.claude/telemetry/start-telemetry.sh
```
### Stop Telemetry Stack (Mode 1)
```bash
~/.claude/telemetry/stop-telemetry.sh
```
### Check Status
```bash
docker compose ps
```
### View Logs
```bash
docker compose logs -f
```
### Restart Services
```bash
docker compose restart
```
---
## Data Retention
**Default:** 15 days in Prometheus
**Adjust retention:**
Edit the Prometheus `command` flags in `docker-compose.yml` (retention is controlled by command-line flags, not by `prometheus.yml`):
```yaml
command:
- '--storage.tsdb.retention.time=90d'
- '--storage.tsdb.retention.size=50GB'
```
**Disk usage:** ~1-2 MB per day per active user
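To check actual disk usage, the Docker volumes backing the stack can be inspected directly (a rough sketch; volume names depend on the Compose project name, and the Prometheus self-metric shown may change between versions):
```bash
# Approximate on-disk size of the stack's volumes
docker system df -v | grep -iE 'prometheus|grafana|loki'

# Prometheus' own view of TSDB block size in bytes
curl -s 'http://localhost:9090/api/v1/query?query=prometheus_tsdb_storage_blocks_bytes' | jq '.data.result'
```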
---
## Security Considerations
### Local Setup (Mode 1)
- Grafana accessible only on localhost
- Default credentials: admin/admin (change after first login)
- No external network exposure
- Data stored in Docker volumes
### Enterprise Setup (Mode 2)
- Use HTTPS endpoints
- Store API keys securely (environment variables, secrets manager; see the sketch below)
- Enable mTLS for production
- Tag metrics with team/project for proper attribution
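One way to keep the API key out of templates and shell history, as suggested above, is to inject it from an environment variable (or a secrets-manager lookup) when writing `settings.json`. A hedged sketch, assuming the key has been exported as `OTEL_API_KEY`:
```bash
# Assumes OTEL_API_KEY is already exported, e.g. from your secrets manager
cp ~/.claude/settings.json ~/.claude/settings.json.backup
jq --arg key "$OTEL_API_KEY" \
   '.env.OTEL_EXPORTER_OTLP_HEADERS = "Authorization=Bearer \($key)"' \
   ~/.claude/settings.json > /tmp/settings.json.new \
  && mv /tmp/settings.json.new ~/.claude/settings.json
```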
---
## Performance Tuning
### Reduce OTEL Collector Memory
Edit `otel-collector-config.yml`:
```yaml
processors:
memory_limiter:
limit_mib: 256 # Reduce from default
```
### Reduce Prometheus Retention
Edit `docker-compose.yml`:
```yaml
command:
- '--storage.tsdb.retention.time=7d' # Reduce from 15d
```
### Optimize Dashboard Queries
- Use recording rules for expensive queries
- Reduce dashboard time ranges
- Increase refresh intervals
See `data/prometheus-queries.md` for recording rule examples.
---
## Integration Examples
### Cost Alerts (PagerDuty/Slack)
```yaml
# claude-code-alerts.yml (Prometheus alerting rules; route notifications to PagerDuty/Slack via Alertmanager)
groups:
- name: claude_code_cost
rules:
- alert: HighDailyCost
expr: sum(increase(claude_code_claude_code_cost_usage_USD_total[24h])) > 100
annotations:
summary: "Claude Code daily cost exceeded $100"
```
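Rules in this format belong in a Prometheus rule file (referenced via `rule_files:` in `prometheus.yml`); Alertmanager only handles routing to PagerDuty or Slack. Before loading the file, it can be validated with `promtool` (a sketch, assuming the rule is saved as `claude-code-alerts.yml` and `promtool` is available locally or inside the Prometheus container):
```bash
# Validate the alert rule syntax before wiring it into prometheus.yml
promtool check rules claude-code-alerts.yml
```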
### Weekly Cost Reports (Email)
Use Grafana Reporting:
1. Create dashboard with cost panels
2. Set up email delivery
3. Schedule weekly reports
### Chargeback Integration
Export metrics to data warehouse:
```yaml
# Use Prometheus remote write
remote_write:
- url: "https://datawarehouse.company.com/prometheus"
```
---
## Contributing
This skill is maintained by the Prometheus Team.
**Feedback:** Open an issue or contact the team
**Improvements:** Submit pull requests with enhancements
---
## Changelog
### Version 1.1.0 (2025-11-01)
**Critical Updates from Production Testing:**
- 🚨 **CRITICAL FIX**: Documented missing OTEL_METRICS_EXPORTER/OTEL_LOGS_EXPORTER as #1 cause of "telemetry not working"
- ✅ Added deprecated `address` field fix for OTEL Collector v0.123.0+
- ✅ Enhanced troubleshooting with prominent exporter configuration section
- ✅ Updated all documentation with CRITICAL warnings for required settings
- ✅ Added comprehensive Known Issues section covering production scenarios
- ✅ Verified templates have correct exporter configuration
**What Changed:**
- Troubleshooting guide now prioritizes missing exporters as root cause
- Known Issues expanded from 3 to 6 issues with production learnings
- Configuration Reference includes prominent CRITICAL requirements callout
- SKILL.md Important Reminders section updated with exporter warnings
### Version 1.0.0 (2025-10-31)
**Initial Release:**
- Mode 1: Local PoC setup with full Docker stack
- Mode 2: Enterprise setup with centralized endpoint
- Comprehensive documentation and troubleshooting
- Dashboard templates with correct metric naming
- Automated UID detection and replacement
**Known Issues Fixed:**
- ✅ OTEL Collector deprecated logging exporter
- ✅ Dashboard datasource UID mismatch
- ✅ Metric double prefix handling
- ✅ Loki exporter configuration
---
## Additional Resources
- **Claude Code Monitoring Docs:** https://docs.claude.com/claude-code/monitoring
- **OpenTelemetry Docs:** https://opentelemetry.io/docs/
- **Prometheus Docs:** https://prometheus.io/docs/
- **Grafana Docs:** https://grafana.com/docs/
---
## License
Internal use within the Elsevier organization.
---
## Support
**Issues?** Check `data/troubleshooting.md` first
**Questions?** Contact Prometheus Team or #claude-code-telemetry channel
**Emergency?** Rollback with: `cp ~/.claude/settings.json.backup ~/.claude/settings.json`
---
**Ready to monitor your Claude Code usage!** 🚀


@@ -0,0 +1,150 @@
---
name: otel-monitoring-setup
description: Use PROACTIVELY when setting up OpenTelemetry monitoring for Claude Code usage tracking, cost analysis, or productivity metrics. Provides local PoC mode (full Docker stack with Grafana) and enterprise mode (centralized infrastructure). Configures telemetry collection, imports dashboards, and verifies data flow. Not for non-Claude telemetry or custom metric definitions.
---
# Claude Code OpenTelemetry Setup
Automated workflow for setting up OpenTelemetry-based telemetry collection for Claude Code usage monitoring, cost tracking, and productivity analytics.
## Quick Decision Matrix
| User Request | Mode | Action |
|--------------|------|--------|
| "Set up telemetry locally" | Mode 1 | Full PoC stack |
| "I want to try OpenTelemetry" | Mode 1 | Full PoC stack |
| "Connect to company endpoint" | Mode 2 | Enterprise config |
| "Set up for team rollout" | Mode 2 | Enterprise + docs |
| "Dashboard not working" | Troubleshoot | See known issues |
## Mode 1: Local PoC Setup
**Goal**: Complete local telemetry stack for individual developer
**Creates**:
- OpenTelemetry Collector (receives data)
- Prometheus (stores metrics)
- Loki (stores logs)
- Grafana (dashboards)
**Prerequisites**:
- Docker Desktop running
- 2GB free disk space
- Write access to ~/.claude/
**Time**: 5-7 minutes
**Workflow**: `modes/mode1-poc-setup.md`
**Output**:
- Grafana at http://localhost:3000 (admin/admin)
- Management scripts in ~/.claude/telemetry/
## Mode 2: Enterprise Setup
**Goal**: Connect Claude Code to centralized company infrastructure
**Required Info**:
- OTEL Collector endpoint URL
- Authentication (API key or certificates)
- Team/department identifier
**Time**: 2-3 minutes
**Workflow**: `modes/mode2-enterprise.md`
**Output**:
- settings.json configured for central endpoint
- Team rollout documentation
## Critical Configuration
**REQUIRED in settings.json** (without these, telemetry won't work):
```json
{
"env": {
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
"OTEL_METRICS_EXPORTER": "otlp",
"OTEL_LOGS_EXPORTER": "otlp",
"OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317"
}
}
```
**Must restart Claude Code after settings changes!**
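A quick sanity check before restarting (a sketch using `jq`, as elsewhere in this skill):
```bash
# Both lines must print "otlp"; null or missing output means telemetry will not be sent
jq -r '.env.OTEL_METRICS_EXPORTER, .env.OTEL_LOGS_EXPORTER' ~/.claude/settings.json
```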
## Pre-Flight Check
Always run before setup:
```bash
# Verify Docker is running
docker info > /dev/null 2>&1 || echo "Start Docker Desktop first"
# Check available ports
for port in 3000 4317 4318 8889 9090; do
lsof -i :$port > /dev/null 2>&1 && echo "Port $port in use"
done
# Check disk space (need 2GB)
df -h ~/.claude
```
## Metrics Collected
- Session counts and active time
- Token usage (input/output/cached)
- API costs by model (USD)
- Lines of code modified
- Commits and PRs created
## Management Commands
```bash
# Start telemetry stack
~/.claude/telemetry/start-telemetry.sh
# Stop (preserves data)
~/.claude/telemetry/stop-telemetry.sh
# Full cleanup (removes all data)
~/.claude/telemetry/cleanup-telemetry.sh
```
## Common Issues
### No Data in Dashboard
1. Check OTEL_METRICS_EXPORTER and OTEL_LOGS_EXPORTER are set
2. Verify Claude Code was restarted
3. See `reference/known-issues.md`
### Datasource Not Found
Dashboard has wrong UID. Detect your UID:
```bash
curl -s http://admin:admin@localhost:3000/api/datasources | jq '.[0].uid'
```
Replace in dashboard JSON and re-import.
### Metric Names Double Prefix
Metrics use `claude_code_claude_code_*` format. Update dashboard queries accordingly.
## Reference Documentation
- `modes/mode1-poc-setup.md` - Detailed local setup workflow
- `modes/mode2-enterprise.md` - Enterprise configuration steps
- `reference/known-issues.md` - Troubleshooting guide
- `templates/` - Configuration file templates
- `dashboards/` - Grafana dashboard JSON files
## Safety Checklist
- [ ] Backup settings.json before modification
- [ ] Verify Docker is running first
- [ ] Check ports are available
- [ ] Test data flow before declaring success
- [ ] Provide cleanup instructions
---
**Version**: 1.1.0 | **Author**: Prometheus Team


@@ -0,0 +1,160 @@
# Grafana Dashboard Templates
This directory contains pre-configured Grafana dashboards for Claude Code telemetry.
## Available Dashboards
### 1. claude-code-overview.json
**Comprehensive dashboard with all key metrics**
**Panels:**
- Total Lines of Code (all-time counter)
- Total Cost (24h rolling window)
- Total Tokens (24h rolling window)
- Active Time (24h rolling window)
- Cost Over Time (per hour rate)
- Token Usage by Type (stacked timeseries)
- Lines of Code Modified (bar chart)
- Commits Created (24h counter)
**Metrics Used:**
- `claude_code_claude_code_lines_of_code_count_total`
- `claude_code_claude_code_cost_usage_USD_total`
- `claude_code_claude_code_token_usage_tokens_total`
- `claude_code_claude_code_active_time_seconds_total`
- `claude_code_claude_code_commit_count_total`
**Note:** This dashboard uses the correct double-prefix metric names.
### 2. claude-code-simple.json
**Simplified dashboard for quick overview**
**Panels:**
- Active Sessions
- Total Cost (24h)
- Total Tokens (24h)
- Active Time (24h)
- Cost Over Time
- Token Usage by Type
**Use Case:** Lightweight dashboard for basic monitoring without detailed breakdowns.
## Importing Dashboards
### Method 1: Grafana UI (Recommended)
1. Access Grafana: http://localhost:3000
2. Login with admin/admin
3. Go to: Dashboards → New → Import
4. Click "Upload JSON file"
5. Select the dashboard JSON file
6. Click "Import"
### Method 2: Grafana API
```bash
# Get the datasource UID first
DATASOURCE_UID=$(curl -s -u admin:admin http://localhost:3000/api/datasources | jq -r '.[] | select(.type=="prometheus") | .uid')
# Update the dashboard with the correct UID and wrap it in the import payload
# that the /api/dashboards/db endpoint expects ({"dashboard": ..., "overwrite": ...})
cat claude-code-overview.json | jq --arg uid "$DATASOURCE_UID" '
  walk(if type == "object" and .datasource.type == "prometheus" then .datasource.uid = $uid else . end)
  | {dashboard: ., overwrite: true}
' > dashboard-updated.json
# Import dashboard
curl -X POST http://localhost:3000/api/dashboards/db \
-u admin:admin \
-H "Content-Type: application/json" \
-d @dashboard-updated.json
```
## Datasource UID Configuration
**Important:** The dashboards have a hardcoded Prometheus datasource UID: `PBFA97CFB590B2093`
If your Grafana instance has a different UID, you need to replace it:
```bash
# Find your datasource UID
curl -s -u admin:admin http://localhost:3000/api/datasources | jq '.[] | select(.type=="prometheus") | {name, uid}'
# Replace UID in dashboard
YOUR_UID="YOUR_ACTUAL_UID_HERE"
cat claude-code-overview.json | sed "s/PBFA97CFB590B2093/$YOUR_UID/g" > claude-code-overview-fixed.json
# Import the fixed version
```
The skill handles this automatically during Mode 1 setup!
## Customizing Dashboards
### Adding Custom Panels
Use these PromQL queries as templates:
**Total Tokens by Model:**
```promql
sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
```
**Cost per Session:**
```promql
increase(claude_code_claude_code_cost_usage_USD_total[24h])
/
increase(claude_code_claude_code_session_count_total[24h])
```
**Lines of Code per Hour:**
```promql
rate(claude_code_claude_code_lines_of_code_count_total[5m]) * 3600
```
**Average Session Duration:**
```promql
increase(claude_code_claude_code_active_time_seconds_total[24h])
/
increase(claude_code_claude_code_session_count_total[24h])
```
### Time Range Recommendations
- **Real-time monitoring:** Last 15 minutes, 30s refresh
- **Daily review:** Last 24 hours, 1m refresh
- **Weekly analysis:** Last 7 days, 5m refresh
- **Monthly reports:** Last 30 days, 15m refresh
## Troubleshooting
### Dashboard Shows "No Data"
1. **Check data source connection:**
```bash
curl -s http://localhost:3000/api/health | jq .
```
2. **Verify Prometheus has data:**
```bash
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code
```
3. **Check metric naming:**
- Ensure queries use double prefix: `claude_code_claude_code_*`
- Not single prefix: `claude_code_*`
### Dashboard Shows "Datasource Not Found"
- Your datasource UID doesn't match the dashboard
- Follow the "Datasource UID Configuration" section above
### Panels Show Different Time Ranges
- Set dashboard time range at top-right
- Individual panels inherit from dashboard unless overridden
- Check panel settings: Edit → Query Options → Time Range
## Additional Resources
- **Metric Reference:** See `../data/metrics-reference.md`
- **PromQL Queries:** See `../data/prometheus-queries.md`
- **Grafana Docs:** https://grafana.com/docs/grafana/latest/


@@ -0,0 +1,391 @@
{
"title": "Claude Code - Overview (Working)",
"description": "High-level overview of Claude Code usage, costs, and performance",
"tags": ["claude-code", "overview"],
"timezone": "browser",
"schemaVersion": 42,
"version": 1,
"refresh": "30s",
"panels": [
{
"fieldConfig": {
"defaults": {
"color": {"mode": "thresholds"},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 10},
{"color": "red", "value": 50}
]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
"id": 1,
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "auto",
"percentChangeColorMode": "standard",
"reduceOptions": {
"calcs": ["lastNotNull"],
"fields": "",
"values": false
},
"showPercentChange": false,
"textMode": "auto",
"wideLayout": true
},
"pluginVersion": "12.2.1",
"targets": [
{
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"expr": "claude_code_claude_code_lines_of_code_count_total",
"refId": "A"
}
],
"title": "Total Lines of Code",
"type": "stat"
},
{
"fieldConfig": {
"defaults": {
"color": {"mode": "thresholds"},
"decimals": 2,
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 5},
{"color": "red", "value": 10}
]
},
"unit": "currencyUSD"
},
"overrides": []
},
"gridPos": {"h": 4, "w": 6, "x": 6, "y": 0},
"id": 2,
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "auto",
"percentChangeColorMode": "standard",
"reduceOptions": {
"calcs": ["lastNotNull"],
"fields": "",
"values": false
},
"showPercentChange": false,
"textMode": "auto",
"wideLayout": true
},
"pluginVersion": "12.2.1",
"targets": [
{
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"expr": "increase(claude_code_claude_code_cost_usage_USD_total[24h])",
"refId": "A"
}
],
"title": "Total Cost (24h)",
"type": "stat"
},
{
"fieldConfig": {
"defaults": {
"color": {"mode": "thresholds"},
"decimals": 0,
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [{"color": "green", "value": null}]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {"h": 4, "w": 6, "x": 12, "y": 0},
"id": 3,
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "auto",
"percentChangeColorMode": "standard",
"reduceOptions": {
"calcs": ["lastNotNull"],
"fields": "",
"values": false
},
"showPercentChange": false,
"textMode": "auto",
"wideLayout": true
},
"pluginVersion": "12.2.1",
"targets": [
{
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"expr": "increase(claude_code_claude_code_token_usage_tokens_total[24h])",
"refId": "A"
}
],
"title": "Total Tokens (24h)",
"type": "stat"
},
{
"fieldConfig": {
"defaults": {
"color": {"mode": "palette-classic"},
"decimals": 1,
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [{"color": "green", "value": null}]
},
"unit": "h"
},
"overrides": []
},
"gridPos": {"h": 4, "w": 6, "x": 18, "y": 0},
"id": 4,
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "auto",
"percentChangeColorMode": "standard",
"reduceOptions": {
"calcs": ["lastNotNull"],
"fields": "",
"values": false
},
"showPercentChange": false,
"textMode": "auto",
"wideLayout": true
},
"pluginVersion": "12.2.1",
"targets": [
{
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"expr": "increase(claude_code_claude_code_active_time_seconds_total[24h]) / 3600",
"refId": "A"
}
],
"title": "Active Time (24h)",
"type": "stat"
},
{
"fieldConfig": {
"defaults": {
"color": {"mode": "palette-classic"},
"custom": {
"axisBorderShow": false,
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"barWidthFactor": 0.6,
"drawStyle": "line",
"fillOpacity": 10,
"gradientMode": "none",
"hideFrom": {"legend": false, "tooltip": false, "viz": false},
"insertNulls": false,
"lineInterpolation": "linear",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {"type": "linear"},
"showPoints": "auto",
"spanNulls": false,
"stacking": {"group": "A", "mode": "none"},
"thresholdsStyle": {"mode": "off"}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [{"color": "green", "value": null}]
},
"unit": "currencyUSD"
},
"overrides": []
},
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
"id": 5,
"options": {
"legend": {"calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true},
"tooltip": {"mode": "single", "sort": "none"}
},
"pluginVersion": "12.2.1",
"targets": [
{
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"expr": "rate(claude_code_claude_code_cost_usage_USD_total[5m]) * 3600",
"legendFormat": "Cost per hour",
"refId": "A"
}
],
"title": "Cost Over Time (per hour)",
"type": "timeseries"
},
{
"fieldConfig": {
"defaults": {
"color": {"mode": "palette-classic"},
"custom": {
"axisBorderShow": false,
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"barWidthFactor": 0.6,
"drawStyle": "line",
"fillOpacity": 20,
"gradientMode": "none",
"hideFrom": {"legend": false, "tooltip": false, "viz": false},
"insertNulls": false,
"lineInterpolation": "linear",
"lineWidth": 2,
"pointSize": 5,
"scaleDistribution": {"type": "linear"},
"showPoints": "auto",
"spanNulls": false,
"stacking": {"group": "A", "mode": "normal"},
"thresholdsStyle": {"mode": "off"}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [{"color": "green", "value": null}]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 4},
"id": 6,
"options": {
"legend": {"calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true},
"tooltip": {"mode": "single", "sort": "none"}
},
"pluginVersion": "12.2.1",
"targets": [
{
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"expr": "sum by (type) (rate(claude_code_claude_code_token_usage_tokens_total[5m]) * 60)",
"legendFormat": "{{type}}",
"refId": "A"
}
],
"title": "Token Usage by Type",
"type": "timeseries"
},
{
"fieldConfig": {
"defaults": {
"color": {"mode": "palette-classic"},
"custom": {
"axisBorderShow": false,
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"barWidthFactor": 0.6,
"drawStyle": "bars",
"fillOpacity": 80,
"gradientMode": "none",
"hideFrom": {"legend": false, "tooltip": false, "viz": false},
"insertNulls": false,
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {"type": "linear"},
"showPoints": "never",
"spanNulls": false,
"stacking": {"group": "A", "mode": "none"},
"thresholdsStyle": {"mode": "off"}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [{"color": "green", "value": null}]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 12},
"id": 7,
"options": {
"legend": {"calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true},
"tooltip": {"mode": "single", "sort": "none"}
},
"pluginVersion": "12.2.1",
"targets": [
{
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"expr": "sum by (type) (rate(claude_code_claude_code_lines_of_code_count_total[5m]) * 60)",
"legendFormat": "{{type}}",
"refId": "A"
}
],
"title": "Lines of Code Modified",
"type": "timeseries"
},
{
"fieldConfig": {
"defaults": {
"color": {"mode": "palette-classic"},
"decimals": 0,
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [{"color": "green", "value": null}]
},
"unit": "short"
},
"overrides": []
},
"gridPos": {"h": 6, "w": 12, "x": 12, "y": 12},
"id": 10,
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "auto",
"percentChangeColorMode": "standard",
"reduceOptions": {
"calcs": ["lastNotNull"],
"fields": "",
"values": false
},
"showPercentChange": false,
"textMode": "auto",
"wideLayout": true
},
"pluginVersion": "12.2.1",
"targets": [
{
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"expr": "increase(claude_code_claude_code_commit_count_total[24h])",
"refId": "A"
}
],
"title": "Commits Created (24h)",
"type": "stat"
}
],
"time": {"from": "now-6h", "to": "now"},
"timepicker": {},
"timezone": "browser",
"version": 1
}


@@ -0,0 +1,179 @@
{
"title": "Claude Code - Overview",
"description": "High-level overview of Claude Code usage, costs, and performance",
"tags": ["claude-code", "overview"],
"timezone": "browser",
"schemaVersion": 38,
"version": 1,
"refresh": "30s",
"panels": [
{
"id": 1,
"gridPos": { "h": 4, "w": 6, "x": 0, "y": 0 },
"type": "stat",
"title": "Active Sessions",
"targets": [
{
"expr": "sum(claude_code_session_count_total)",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "value": null, "color": "green" }
]
}
}
},
"options": {
"reduceOptions": {
"values": false,
"calcs": ["lastNotNull"]
}
}
},
{
"id": 2,
"gridPos": { "h": 4, "w": 6, "x": 6, "y": 0 },
"type": "stat",
"title": "Total Cost (24h)",
"targets": [
{
"expr": "sum(increase(claude_code_cost_usage_total[24h]))",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "currencyUSD",
"decimals": 2,
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "value": null, "color": "green" },
{ "value": 5, "color": "yellow" },
{ "value": 10, "color": "red" }
]
}
}
},
"options": {
"reduceOptions": {
"values": false,
"calcs": ["lastNotNull"]
}
}
},
{
"id": 3,
"gridPos": { "h": 4, "w": 6, "x": 12, "y": 0 },
"type": "stat",
"title": "Total Tokens (24h)",
"targets": [
{
"expr": "sum(increase(claude_code_token_usage_total[24h]))",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"decimals": 0,
"color": { "mode": "thresholds" },
"thresholds": {
"mode": "absolute",
"steps": [
{ "value": null, "color": "green" }
]
}
}
},
"options": {
"reduceOptions": {
"values": false,
"calcs": ["lastNotNull"]
}
}
},
{
"id": 4,
"gridPos": { "h": 4, "w": 6, "x": 18, "y": 0 },
"type": "stat",
"title": "Active Time (24h)",
"targets": [
{
"expr": "sum(increase(claude_code_active_time_total_seconds[24h])) / 3600",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "h",
"decimals": 1,
"color": { "mode": "palette-classic" }
}
},
"options": {
"reduceOptions": {
"values": false,
"calcs": ["lastNotNull"]
}
}
},
{
"id": 5,
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 4 },
"type": "timeseries",
"title": "Cost Over Time (per hour)",
"targets": [
{
"expr": "sum(rate(claude_code_cost_usage_total[5m])) * 3600",
"legendFormat": "Cost per hour",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "currencyUSD",
"custom": {
"drawStyle": "line",
"lineWidth": 2,
"fillOpacity": 10,
"showPoints": "auto"
}
}
}
},
{
"id": 6,
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 4 },
"type": "timeseries",
"title": "Token Usage by Type",
"targets": [
{
"expr": "sum by (type) (rate(claude_code_token_usage_total[5m]) * 60)",
"legendFormat": "{{type}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "short",
"custom": {
"drawStyle": "line",
"lineWidth": 2,
"fillOpacity": 20,
"showPoints": "auto",
"stacking": { "mode": "normal" }
}
}
}
}
]
}


@@ -0,0 +1,381 @@
# Claude Code Metrics Reference
Complete reference for all Claude Code OpenTelemetry metrics.
**Important:** All metrics use a double prefix: `claude_code_claude_code_*`
---
## Metric Categories
1. **Usage Metrics** - Session counts, active time
2. **Token Metrics** - Input, output, cached tokens
3. **Cost Metrics** - API costs by model
4. **Productivity Metrics** - LOC, commits, PRs
5. **Error Metrics** - Failures, retries
---
## Usage Metrics
### claude_code_claude_code_session_count_total
**Type:** Counter
**Description:** Total number of Claude Code sessions started
**Labels:**
- `account_uuid` - Anonymous user identifier
- `version` - Claude Code version (e.g., "1.2.3")
**Example Query:**
```promql
# Total sessions across all users
sum(claude_code_claude_code_session_count_total)
# Sessions by version
sum by (version) (claude_code_claude_code_session_count_total)
# New sessions in last 24h
increase(claude_code_claude_code_session_count_total[24h])
```
---
### claude_code_claude_code_active_time_seconds_total
**Type:** Counter
**Description:** Total active time spent in Claude Code sessions (in seconds)
**Labels:**
- `account_uuid` - Anonymous user identifier
- `version` - Claude Code version
**Example Query:**
```promql
# Total active hours
sum(claude_code_claude_code_active_time_seconds_total) / 3600
# Active hours per day
increase(claude_code_claude_code_active_time_seconds_total[24h]) / 3600
# Average session duration
increase(claude_code_claude_code_active_time_seconds_total[24h])
/
increase(claude_code_claude_code_session_count_total[24h])
```
**Note:** "Active time" means time when Claude Code is actively processing or responding to user input.
---
## Token Metrics
### claude_code_claude_code_token_usage_tokens_total
**Type:** Counter
**Description:** Total tokens consumed by Claude Code API calls
**Labels:**
- `type` - Token type: `input`, `output`, `cache_creation`, `cache_read`
- `model` - Model name (e.g., "claude-sonnet-4-5-20250929", "claude-opus-4-20250514")
- `account_uuid` - Anonymous user identifier
- `version` - Claude Code version
**Token Types Explained:**
- **input:** User messages and tool results sent to Claude
- **output:** Claude's responses (text and tool calls)
- **cache_creation:** Tokens written to prompt cache (billed at input rate)
- **cache_read:** Tokens read from prompt cache (billed at 10% of input rate)
**Example Query:**
```promql
# Total tokens by type (24h)
sum by (type) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
# Tokens by model (24h)
sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
# Cache hit rate
sum(increase(claude_code_claude_code_token_usage_tokens_total{type="cache_read"}[24h]))
/
sum(increase(claude_code_claude_code_token_usage_tokens_total{type=~"input|cache_creation|cache_read"}[24h]))
# Token usage rate (per minute)
rate(claude_code_claude_code_token_usage_tokens_total[5m]) * 60
```
---
## Cost Metrics
### claude_code_claude_code_cost_usage_USD_total
**Type:** Counter
**Description:** Total API costs in USD
**Labels:**
- `model` - Model name
- `account_uuid` - Anonymous user identifier
- `version` - Claude Code version
**Pricing Reference (as of Jan 2025):**
- **Claude Sonnet 4.5:** $3/MTok input, $15/MTok output
- **Claude Opus 4:** $15/MTok input, $75/MTok output
- **Cache read:** 10% of input price
- **Cache write:** Same as input price
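As a rough illustration of how these rates map onto the cost metric (illustrative arithmetic only, not a reproduction of actual billing): a day with 1.0M input tokens and 0.2M output tokens on Claude Sonnet 4.5 comes to about $6.
```bash
# 1.0 MTok input * $3 + 0.2 MTok output * $15 (Sonnet 4.5 rates above)
echo "1.0 * 3 + 0.2 * 15" | bc   # prints 6.0
```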
**Example Query:**
```promql
# Total cost (24h)
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
# Cost by model (24h)
sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
# Cost per hour
rate(claude_code_claude_code_cost_usage_USD_total[5m]) * 3600
# Average cost per session
increase(claude_code_claude_code_cost_usage_USD_total[24h])
/
increase(claude_code_claude_code_session_count_total[24h])
# Cumulative cost over time
sum(claude_code_claude_code_cost_usage_USD_total)
```
---
## Productivity Metrics
### claude_code_claude_code_lines_of_code_count_total
**Type:** Counter
**Description:** Total lines of code modified (added + changed + deleted)
**Labels:**
- `type` - Modification type: `added`, `changed`, `deleted`
- `account_uuid` - Anonymous user identifier
- `version` - Claude Code version
**Example Query:**
```promql
# Total LOC modified
sum(claude_code_claude_code_lines_of_code_count_total)
# LOC by type (24h)
sum by (type) (increase(claude_code_claude_code_lines_of_code_count_total[24h]))
# LOC per hour
rate(claude_code_claude_code_lines_of_code_count_total[5m]) * 3600
# Lines per dollar
sum(increase(claude_code_claude_code_lines_of_code_count_total[24h]))
/
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
```
---
### claude_code_claude_code_commit_count_total
**Type:** Counter
**Description:** Total git commits created by Claude Code
**Labels:**
- `account_uuid` - Anonymous user identifier
- `version` - Claude Code version
**Example Query:**
```promql
# Total commits
sum(claude_code_claude_code_commit_count_total)
# Commits per day
increase(claude_code_claude_code_commit_count_total[24h])
# Commits per session
increase(claude_code_claude_code_commit_count_total[24h])
/
increase(claude_code_claude_code_session_count_total[24h])
```
---
### claude_code_claude_code_pr_count_total
**Type:** Counter
**Description:** Total pull requests created by Claude Code
**Labels:**
- `account_uuid` - Anonymous user identifier
- `version` - Claude Code version
**Example Query:**
```promql
# Total PRs
sum(claude_code_claude_code_pr_count_total)
# PRs per week
increase(claude_code_claude_code_pr_count_total[7d])
```
---
## Cardinality and Resource Attributes
### Resource Attributes
All metrics include these resource attributes (configured in settings.json):
```json
"OTEL_RESOURCE_ATTRIBUTES": "environment=local,deployment=poc,team=platform"
```
**Common Attributes:**
- `service.name` = "claude-code" (set by OTEL Collector)
- `environment` - Deployment environment (local, dev, staging, prod)
- `deployment` - Deployment type (poc, enterprise)
- `team` - Team identifier
- `department` - Department identifier
- `project` - Project identifier
**Querying with Resource Attributes:**
```promql
# Filter by environment
sum(claude_code_claude_code_cost_usage_USD_total{environment="production"})
# Aggregate by team
sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
```
---
## Metric Naming Convention
**Format:** `claude_code_claude_code_<metric_name>_<unit>_<type>`
**Why double prefix?**
- First `claude_code` comes from Prometheus exporter namespace in OTEL Collector config
- Second `claude_code` comes from the original metric name in Claude Code
- This is expected behavior with the current configuration
**Components:**
- `<metric_name>`: Descriptive name (e.g., `token_usage`, `cost_usage`)
- `<unit>`: Unit of measurement (e.g., `tokens`, `USD`, `seconds`, `count`)
- `<type>`: Metric type (always `total` for counters)
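To see where the first prefix comes from in your own setup, you can inspect the generated collector config (a sketch, assuming the Mode 1 layout under `~/.claude/telemetry/` and that the Prometheus exporter sets a `namespace` there):
```bash
# The `namespace` under the prometheus exporter produces the first claude_code_ prefix
grep -A 3 'prometheus:' ~/.claude/telemetry/otel-collector-config.yml
```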
---
## Querying Best Practices
### Use increase() for Counters
Counters are cumulative, so use `increase()` for time windows:
```promql
# ✅ Correct - Shows cost in last 24h
increase(claude_code_claude_code_cost_usage_USD_total[24h])
# ❌ Wrong - Shows cumulative cost since start
claude_code_claude_code_cost_usage_USD_total
```
### Use rate() for Rates
Calculate per-second rate, then multiply for desired unit:
```promql
# Cost per hour
rate(claude_code_claude_code_cost_usage_USD_total[5m]) * 3600
# Tokens per minute
rate(claude_code_claude_code_token_usage_tokens_total[5m]) * 60
```
### Aggregate with sum()
Combine metrics across labels:
```promql
# Total tokens (all types)
sum(claude_code_claude_code_token_usage_tokens_total)
# Total tokens by type
sum by (type) (claude_code_claude_code_token_usage_tokens_total)
# Total cost across all models
sum(claude_code_claude_code_cost_usage_USD_total)
```
---
## Example Dashboards
### Executive Summary (single values)
```promql
# Total cost this month
sum(increase(claude_code_claude_code_cost_usage_USD_total[30d]))
# Total LOC this month
sum(increase(claude_code_claude_code_lines_of_code_count_total[30d]))
# Active users (unique account_uuids)
count(count by (account_uuid) (claude_code_claude_code_session_count_total))
# Average session cost
sum(increase(claude_code_claude_code_cost_usage_USD_total[30d]))
/
sum(increase(claude_code_claude_code_session_count_total[30d]))
```
### Cost Tracking
```promql
# Daily cost trend
sum(increase(claude_code_claude_code_cost_usage_USD_total[1d]))
# Cost by model (pie chart)
sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[7d]))
# Cost by team (bar chart)
sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[7d]))
```
### Productivity Tracking
```promql
# LOC per day
sum(increase(claude_code_claude_code_lines_of_code_count_total[1d]))
# Commits per week
sum(increase(claude_code_claude_code_commit_count_total[7d]))
# Efficiency: LOC per dollar
sum(increase(claude_code_claude_code_lines_of_code_count_total[30d]))
/
sum(increase(claude_code_claude_code_cost_usage_USD_total[30d]))
```
---
## Retention and Storage
**Default Prometheus Retention:** 15 days
**Adjust retention:**
```yaml
# Prometheus command-line flags, set in docker-compose.yml
command:
- '--storage.tsdb.retention.time=90d'
- '--storage.tsdb.retention.size=50GB'
```
**Disk usage estimation:**
- ~1-2 MB per day per active user
- ~30-60 MB per month per active user
- ~360-720 MB per year per active user
**For long-term storage:** Consider using Prometheus remote write to send data to a time-series database like VictoriaMetrics, Cortex, or Thanos.
---
## Additional Resources
- **Official OTEL Docs:** https://opentelemetry.io/docs/
- **Prometheus Query Docs:** https://prometheus.io/docs/prometheus/latest/querying/basics/
- **PromQL Examples:** See `prometheus-queries.md`


@@ -0,0 +1,405 @@
# Useful Prometheus Queries (PromQL)
Collection of useful PromQL queries for Claude Code telemetry analysis.
**Note:** All queries use the double prefix: `claude_code_claude_code_*`
---
## Cost Analysis
### Daily Cost Trend
```promql
sum(increase(claude_code_claude_code_cost_usage_USD_total[1d]))
```
### Cost by Model
```promql
sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
```
### Cost per Hour (Rate)
```promql
rate(claude_code_claude_code_cost_usage_USD_total[5m]) * 3600
```
### Average Cost per Session
```promql
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
/
sum(increase(claude_code_claude_code_session_count_total[24h]))
```
### Cumulative Monthly Cost
```promql
sum(increase(claude_code_claude_code_cost_usage_USD_total[30d]))
```
### Cost by Team
```promql
sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
```
### Projected Monthly Cost (based on last 7 days)
```promql
(sum(increase(claude_code_claude_code_cost_usage_USD_total[7d])) / 7) * 30
```
---
## Token Usage
### Total Tokens by Type
```promql
sum by (type) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
```
### Tokens by Model
```promql
sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
```
### Cache Hit Rate
```promql
sum(increase(claude_code_claude_code_token_usage_tokens_total{type="cache_read"}[24h]))
/
sum(increase(claude_code_claude_code_token_usage_tokens_total{type=~"input|cache_creation|cache_read"}[24h]))
* 100
```
### Input vs Output Token Ratio
```promql
sum(increase(claude_code_claude_code_token_usage_tokens_total{type="input"}[24h]))
/
sum(increase(claude_code_claude_code_token_usage_tokens_total{type="output"}[24h]))
```
### Token Usage Rate (per minute)
```promql
sum by (type) (rate(claude_code_claude_code_token_usage_tokens_total[5m]) * 60)
```
### Total Tokens (All Time)
```promql
sum(claude_code_claude_code_token_usage_tokens_total)
```
---
## Productivity Metrics
### Total Lines of Code Modified
```promql
sum(claude_code_claude_code_lines_of_code_count_total)
```
### LOC by Type (Added, Changed, Deleted)
```promql
sum by (type) (increase(claude_code_claude_code_lines_of_code_count_total[24h]))
```
### LOC per Hour
```promql
rate(claude_code_claude_code_lines_of_code_count_total[5m]) * 3600
```
### Lines per Dollar (Efficiency)
```promql
sum(increase(claude_code_claude_code_lines_of_code_count_total[24h]))
/
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
```
### Commits per Day
```promql
increase(claude_code_claude_code_commit_count_total[24h])
```
### PRs per Week
```promql
increase(claude_code_claude_code_pr_count_total[7d])
```
### LOC per Commit
```promql
sum(increase(claude_code_claude_code_lines_of_code_count_total[24h]))
/
sum(increase(claude_code_claude_code_commit_count_total[24h]))
```
---
## Session Analytics
### Total Sessions
```promql
sum(claude_code_claude_code_session_count_total)
```
### New Sessions (24h)
```promql
increase(claude_code_claude_code_session_count_total[24h])
```
### Active Users (Unique account_uuids)
```promql
count(count by (account_uuid) (claude_code_claude_code_session_count_total))
```
### Average Session Duration
```promql
sum(increase(claude_code_claude_code_active_time_seconds_total[24h]))
/
sum(increase(claude_code_claude_code_session_count_total[24h]))
/ 60
```
*Result in minutes*
### Total Active Hours (24h)
```promql
sum(increase(claude_code_claude_code_active_time_seconds_total[24h])) / 3600
```
### Sessions by Version
```promql
sum by (version) (increase(claude_code_claude_code_session_count_total[24h]))
```
---
## Team Aggregation
### Cost by Team (Last 24h)
```promql
sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
```
### LOC by Team (Last 24h)
```promql
sum by (team) (increase(claude_code_claude_code_lines_of_code_count_total[24h]))
```
### Active Users per Team
```promql
count by (team) (count by (team, account_uuid) (claude_code_claude_code_session_count_total))
```
### Team Efficiency (LOC per Dollar)
```promql
sum by (team) (increase(claude_code_claude_code_lines_of_code_count_total[24h]))
/
sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
```
### Top Spending Teams (Last 7 days)
```promql
topk(5, sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[7d])))
```
---
## Model Comparison
### Cost by Model (Pie Chart)
```promql
sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[7d]))
```
### Token Efficiency by Model (Tokens per Dollar)
```promql
sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
/
sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
```
### Most Used Model
```promql
topk(1, sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h])))
```
### Model Usage Distribution (%)
```promql
sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
/
sum(increase(claude_code_claude_code_token_usage_tokens_total[24h]))
* 100
```
---
## Alerting Queries
### High Daily Cost Alert (> $50)
```promql
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h])) > 50
```
### Cost Spike Alert (50% increase compared to yesterday)
```promql
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
/
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h] offset 24h))
> 1.5
```
### No Activity Alert (no sessions in last hour)
```promql
increase(claude_code_claude_code_session_count_total[1h]) == 0
```
### Low Cache Hit Rate Alert (< 20%)
```promql
(
sum(increase(claude_code_claude_code_token_usage_tokens_total{type="cache_read"}[1h]))
/
sum(increase(claude_code_claude_code_token_usage_tokens_total{type=~"input|cache_creation|cache_read"}[1h]))
* 100
) < 20
```
---
## Forecasting
### Projected Monthly Cost (based on last 7 days)
```promql
(sum(increase(claude_code_claude_code_cost_usage_USD_total[7d])) / 7) * 30
```
### Projected Annual Cost (based on last 30 days)
```promql
(sum(increase(claude_code_claude_code_cost_usage_USD_total[30d])) / 30) * 365
```
### Average Daily Cost (Last 30 days)
```promql
sum(increase(claude_code_claude_code_cost_usage_USD_total[30d])) / 30
```
### Growth Rate (Week over Week)
```promql
(
sum(increase(claude_code_claude_code_cost_usage_USD_total[7d]))
-
sum(increase(claude_code_claude_code_cost_usage_USD_total[7d] offset 7d))
)
/
sum(increase(claude_code_claude_code_cost_usage_USD_total[7d] offset 7d))
* 100
```
*Result as percentage*
---
## Debugging Queries
### Check if Metrics Exist
```promql
claude_code_claude_code_session_count_total
```
### List All Claude Code Metrics
```
# Use Prometheus UI or API
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code
```
### Check Metric Labels
```promql
# Returns all label combinations
count by (account_uuid, version, team, environment) (claude_code_claude_code_session_count_total)
```
### Latest Value for All Metrics
```promql
# Session count
claude_code_claude_code_session_count_total
# Cost
claude_code_claude_code_cost_usage_USD_total
# Tokens
claude_code_claude_code_token_usage_tokens_total
# LOC
claude_code_claude_code_lines_of_code_count_total
```
### Metrics Cardinality (Number of Time Series)
```promql
count(claude_code_claude_code_token_usage_tokens_total)
```
---
## Recording Rules
Save these as Prometheus recording rules for faster dashboard queries:
```yaml
groups:
- name: claude_code_aggregations
interval: 1m
rules:
# Daily cost
- record: claude_code:cost_usd:daily
expr: sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
# Cost by team
- record: claude_code:cost_usd:daily:by_team
expr: sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
# Cache hit rate
- record: claude_code:cache_hit_rate:daily
expr: |
sum(increase(claude_code_claude_code_token_usage_tokens_total{type="cache_read"}[24h]))
/
sum(increase(claude_code_claude_code_token_usage_tokens_total{type=~"input|cache_creation|cache_read"}[24h]))
* 100
# LOC efficiency
- record: claude_code:loc_per_dollar:daily
expr: |
sum(increase(claude_code_claude_code_lines_of_code_count_total[24h]))
/
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
```
Then use simplified queries:
```promql
# Instead of complex query, just use:
claude_code:cost_usd:daily
claude_code:cost_usd:daily:by_team
```
---
## Visualization Tips
### Time Series Panel
- Use `rate()` for smooth trends
- Set legend to `{{label_name}}` for clarity
- Enable "Lines" draw style with opacity
### Stat Panel
- Use `lastNotNull` for counters
- Use `increase([24h])` for daily totals
- Add thresholds for color coding
### Bar Chart
- Use `sum by (label)` for grouping
- Sort by value descending
- Limit to top 10 with `topk(10, ...)`
### Pie Chart
- Calculate percentages with division
- Use `sum by (label)` for segments
- Limit to top categories
---
## Additional Resources
- **Prometheus Query Docs:** https://prometheus.io/docs/prometheus/latest/querying/basics/
- **PromQL Examples:** https://prometheus.io/docs/prometheus/latest/querying/examples/
- **Grafana Query Editor:** https://grafana.com/docs/grafana/latest/datasources/prometheus/


@@ -0,0 +1,658 @@
# Troubleshooting Guide
Common issues and solutions for Claude Code OpenTelemetry setup.
---
## Container Issues
### Docker Not Running
**Symptom:** `Cannot connect to the Docker daemon`
**Diagnosis:**
```bash
docker info
```
**Solutions:**
1. Start Docker Desktop application
2. Wait for Docker to fully initialize
3. Check system tray for Docker icon
4. Verify Docker daemon is running: `ps aux | grep docker`
---
### Containers Won't Start
**Symptom:** Containers exit immediately after `docker compose up`
**Diagnosis:**
```bash
# Check container logs
docker compose logs
# Check specific service
docker compose logs otel-collector
docker compose logs prometheus
```
**Common Causes:**
**1. OTEL Collector Configuration Error**
```bash
# Check for errors
docker compose logs otel-collector | grep -i error
# Common issues:
# - Deprecated logging exporter
# - Deprecated 'address' field in telemetry.metrics
```
**Solution A - Deprecated logging exporter:**
Update `otel-collector-config.yml`:
```yaml
exporters:
debug:
verbosity: normal
# NOT:
# logging:
# loglevel: info
```
**Solution B - Deprecated 'address' field (v0.123.0+):**
If logs show: `'address' has invalid keys` or similar error:
Update `otel-collector-config.yml`:
```yaml
service:
telemetry:
metrics:
level: detailed
# REMOVE this line (deprecated in v0.123.0+):
# address: ":8888"
```
The `address` field in `service.telemetry.metrics` is deprecated in newer OTEL Collector versions. Simply remove it; the collector will fall back to its default internal metrics endpoint.
**2. Port Already in Use**
```bash
# Check which ports are in use
lsof -i :3000 # Grafana
lsof -i :4317 # OTEL gRPC
lsof -i :4318 # OTEL HTTP
lsof -i :8889 # OTEL Prometheus exporter
lsof -i :9090 # Prometheus
lsof -i :3100 # Loki
```
**Solution:**
- Stop conflicting service
- Or change port in docker-compose.yml
**3. Volume Permission Issues**
```bash
# Check volume permissions
docker volume ls
docker volume inspect claude-telemetry_prometheus-data
```
**Solution:**
```bash
# Remove and recreate volumes
docker compose down -v
docker compose up -d
```
---
### Containers Keep Restarting
**Symptom:** Container status shows "Restarting"
**Diagnosis:**
```bash
docker compose ps
docker compose logs --tail=50 <service-name>
```
**Solutions:**
1. Check memory limits: Increase memory_limiter in OTEL config
2. Check disk space: `df -h`
3. Check for configuration errors in logs
4. Restart Docker Desktop
---
## Claude Code Settings Issues
### 🚨 CRITICAL: Telemetry Not Sending (Most Common Issue)
**Symptom:** No metrics appearing in Prometheus after Claude Code restart
**ROOT CAUSE (90% of cases):** Missing required exporter environment variables
Even when `CLAUDE_CODE_ENABLE_TELEMETRY=1` is set, telemetry **will not be sent** without explicit exporter configuration. This is by far the most common issue.
**Diagnosis Checklist:**
**1. Check REQUIRED exporters (MOST IMPORTANT):**
```bash
jq '.env.OTEL_METRICS_EXPORTER' ~/.claude/settings.json
# Must return: "otlp" (NOT null, NOT missing)
jq '.env.OTEL_LOGS_EXPORTER' ~/.claude/settings.json
# Should return: "otlp" (recommended for event tracking)
```
**If either returns `null` or is missing, this is your problem!**
**2. Verify telemetry is enabled:**
```bash
jq '.env.CLAUDE_CODE_ENABLE_TELEMETRY' ~/.claude/settings.json
# Should return: "1"
```
**3. Check OTEL endpoint:**
```bash
jq '.env.OTEL_EXPORTER_OTLP_ENDPOINT' ~/.claude/settings.json
# Should return: "http://localhost:4317" (for local setup)
```
**4. Verify JSON is valid:**
```bash
jq empty ~/.claude/settings.json
# No output = valid JSON
```
**5. Check if Claude Code was restarted:**
```bash
# Telemetry config only loads at startup!
# Must quit and restart Claude Code completely
```
**6. Test OTEL endpoint connectivity:**
```bash
nc -zv localhost 4317
# Should show: Connection to localhost port 4317 [tcp/*] succeeded!
```
**Solutions:**
**If exporters are missing (MOST COMMON):**
Add these REQUIRED settings to ~/.claude/settings.json:
```json
{
"env": {
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
"OTEL_METRICS_EXPORTER": "otlp",
"OTEL_LOGS_EXPORTER": "otlp",
"OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc"
}
}
```
Then **MUST restart Claude Code** (settings only load at startup).
**If endpoint unreachable:**
- Verify OTEL Collector container is running
- Check firewall settings
- Try HTTP endpoint instead: `http://localhost:4318`
**If still no data:**
- Check OTEL Collector logs for incoming connections
- Verify Claude Code is running (not just idle)
- Wait 60 seconds (default export interval)
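If you want to try the HTTP fallback mentioned above, a minimal sketch using jq (restart Claude Code afterwards for the change to take effect):
```bash
# Sketch: fall back to OTLP over HTTP on port 4318.
cp ~/.claude/settings.json ~/.claude/settings.json.backup
jq '.env.OTEL_EXPORTER_OTLP_PROTOCOL = "http/protobuf"
    | .env.OTEL_EXPORTER_OTLP_ENDPOINT = "http://localhost:4318"' \
  ~/.claude/settings.json > ~/.claude/settings.json.tmp \
  && mv ~/.claude/settings.json.tmp ~/.claude/settings.json
```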
---
### Settings.json Syntax Errors
**Symptom:** Claude Code won't start or shows errors
**Diagnosis:**
```bash
# Validate JSON
jq empty ~/.claude/settings.json
# Pretty-print to find issues
jq . ~/.claude/settings.json
```
**Common Issues:**
- Missing commas between properties
- Trailing commas before closing braces
- Unescaped quotes in strings
- Incorrect nesting
**Solution:**
```bash
# Restore backup
cp ~/.claude/settings.json.backup ~/.claude/settings.json
# Or fix JSON manually with editor
```
---
## Grafana Issues
### Can't Access Grafana
**Symptom:** `localhost:3000` doesn't load
**Diagnosis:**
```bash
# Check if Grafana is running
docker ps | grep grafana
# Check Grafana logs
docker compose logs grafana
# Check port availability
lsof -i :3000
```
**Solutions:**
1. Verify container is running: `docker compose up -d grafana`
2. Wait 30 seconds for Grafana to initialize
3. Try `http://127.0.0.1:3000` instead
4. Check Docker network: `docker network inspect claude-telemetry`
---
### Dashboard Shows "Datasource Not Found"
**Symptom:** Dashboard panels show "datasource prometheus not found"
**Cause:** Dashboard has hardcoded datasource UID that doesn't match your Grafana instance
**Diagnosis:**
1. Go to: http://localhost:3000/connections/datasources
2. Click on Prometheus datasource
3. Note the UID from URL (e.g., `PBFA97CFB590B2093`)
**Solution:**
```bash
# Get your datasource UID
DATASOURCE_UID=$(curl -s -u admin:admin http://localhost:3000/api/datasources | jq -r '.[] | select(.type=="prometheus") | .uid')
echo "Your Prometheus datasource UID: $DATASOURCE_UID"
# Update dashboard JSON
cd ~/.claude/telemetry/dashboards
sed "s/PBFA97CFB590B2093/$DATASOURCE_UID/g" claude-code-overview.json > claude-code-overview-fixed.json
# Re-import the fixed dashboard
```
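One way to re-import the fixed dashboard without the UI is the Grafana HTTP API; this sketch assumes the default admin/admin credentials used by this stack and wraps the raw dashboard JSON in the payload `/api/dashboards/db` expects:
```bash
# Sketch: re-import the corrected dashboard via the Grafana API.
jq '{dashboard: ., overwrite: true}' claude-code-overview-fixed.json \
  | curl -s -X POST http://admin:admin@localhost:3000/api/dashboards/db \
      -H "Content-Type: application/json" \
      -d @- | jq .
```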
---
### Dashboard Shows "No Data"
**Symptom:** Dashboard loads but all panels show "No data"
**Diagnosis Steps:**
**1. Check Prometheus has data:**
```bash
# Query Prometheus directly
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code
# Should see metrics like:
# "claude_code_claude_code_session_count_total"
# "claude_code_claude_code_cost_usage_USD_total"
```
**2. Check datasource connection:**
- Go to: http://localhost:3000/connections/datasources
- Click Prometheus
- Click "Save & Test"
- Should show: "Successfully queried the Prometheus API"
**3. Verify metric names in queries:**
```bash
# Check if metrics use double prefix
curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_session_count_total' | jq .
```
**Solutions:**
**If metrics don't exist:**
- Claude Code hasn't sent data yet (wait 60 seconds)
- OTEL Collector isn't receiving data (check container logs)
- Settings.json wasn't configured correctly
**If metrics exist but dashboard shows no data:**
- Dashboard queries use wrong metric names
- Update queries to use double prefix: `claude_code_claude_code_*`
- Check time range (top-right corner of Grafana)
**If single prefix metrics exist (`claude_code_*`):**
Your setup uses old naming. Update dashboard:
```bash
# Replace double prefix with single
sed 's/claude_code_claude_code_/claude_code_/g' dashboard.json > dashboard-fixed.json
```
---
## Prometheus Issues
### Prometheus Shows No Targets
**Symptom:** Prometheus UI (localhost:9090) → Status → Targets shows no targets or DOWN status
**Diagnosis:**
```bash
# Check Prometheus config
cat ~/.claude/telemetry/prometheus.yml
# Check if OTEL Collector is reachable from Prometheus
docker exec -it claude-prometheus ping otel-collector
```
**Solutions:**
1. Verify `prometheus.yml` has correct scrape_configs
2. Ensure OTEL Collector is running
3. Check Docker network connectivity
4. Restart Prometheus: `docker compose restart prometheus`
---
### Prometheus Can't Scrape OTEL Collector
**Symptom:** Target shows as DOWN with error "context deadline exceeded"
**Diagnosis:**
```bash
# Check if OTEL Collector is exposing metrics
curl http://localhost:8889/metrics
# Check OTEL Collector logs
docker compose logs otel-collector
```
**Solutions:**
1. Verify OTEL Collector prometheus exporter is configured
2. Check port 8889 is exposed in docker-compose.yml
3. Restart OTEL Collector: `docker compose restart otel-collector`
---
## Metric Issues
### Metrics Have Double Prefix
**Symptom:** Metrics are named `claude_code_claude_code_*` instead of `claude_code_*`
**Explanation:** This is expected behavior with the current OTEL Collector configuration:
- First `claude_code` = Prometheus exporter namespace
- Second `claude_code` = Original metric name
**Solutions:**
**Option 1: Accept it (Recommended)**
- Update dashboard queries to use double prefix
- This is the standard configuration
**Option 2: Remove namespace prefix**
Update `otel-collector-config.yml`:
```yaml
exporters:
prometheus:
endpoint: "0.0.0.0:8889"
namespace: "" # Remove namespace
```
Then restart: `docker compose restart otel-collector`
---
### Old Metrics Still Showing
**Symptom:** After changing configuration, old metrics still appear
**Cause:** Prometheus retains metrics until retention period expires
**Solutions:**
**Quick fix: Delete Prometheus data:**
```bash
docker compose down
docker volume rm claude-telemetry_prometheus-data
docker compose up -d
```
**Proper fix: Wait for retention:**
- Default retention is 15 days
- Old metrics will automatically disappear
- New metrics will coexist temporarily
---
## Network Issues
### Can't Reach OTEL Endpoint from Claude Code
**Symptom:** Claude Code can't connect to `localhost:4317`
**Diagnosis:**
```bash
# Test gRPC endpoint
nc -zv localhost 4317
# Test HTTP endpoint
curl -v http://localhost:4318/v1/metrics -d '{}'
```
**Solutions:**
**If connection refused:**
1. Check OTEL Collector is running
2. Verify ports are exposed in docker-compose.yml
3. Check firewall/antivirus blocking localhost connections
**If timeout:**
1. Increase export timeout in settings.json
2. Try HTTP protocol instead of gRPC
**If Claude Code itself runs inside a container (e.g., a devcontainer):**
- On Docker Desktop (macOS/Windows), use `http://host.docker.internal:4317` instead of `localhost:4317` so the container can reach the collector's published port on the host
- Or attach the container to the telemetry stack's Docker network and target `otel-collector:4317` directly
---
### Enterprise Endpoint Unreachable
**Symptom:** Can't connect to company OTEL endpoint
**Diagnosis:**
```bash
# Test connectivity
ping otel.company.com
# Test port
nc -zv otel.company.com 4317
# Test with VPN
# (Ensure corporate VPN is connected)
```
**Solutions:**
1. Connect to corporate VPN
2. Check firewall allows outbound connections
3. Verify endpoint URL is correct
4. Try HTTP endpoint (port 4318) instead of gRPC
5. Contact platform team to verify endpoint is accessible
---
## Performance Issues
### High Memory Usage
**Symptom:** OTEL Collector or Prometheus using excessive memory
**Diagnosis:**
```bash
# Check container resource usage
docker stats
# Check Prometheus TSDB size
du -sh ~/.claude/telemetry/prometheus-data
```
**Solutions:**
**OTEL Collector:**
Reduce memory_limiter in `otel-collector-config.yml`:
```yaml
processors:
memory_limiter:
check_interval: 1s
limit_mib: 256 # Reduce from 512
```
**Prometheus:**
Reduce retention:
```yaml
command:
- '--storage.tsdb.retention.time=7d' # Reduce from 15d
- '--storage.tsdb.retention.size=1GB'
```
---
### Slow Grafana Dashboards
**Symptom:** Dashboards take long time to load or timeout
**Diagnosis:**
```bash
# Check query performance in Prometheus
# Go to: http://localhost:9090/graph
# Run expensive queries like: sum by (account_uuid, model, type) (...)
```
**Solutions:**
1. Reduce dashboard time range (use 6h instead of 7d)
2. Increase dashboard refresh interval (1m → 5m)
3. Use recording rules for complex queries
4. Reduce number of panels
5. Use simpler aggregations
---
## Data Quality Issues
### Unexpected Cost Values
**Symptom:** Cost metrics seem incorrect
**Diagnosis:**
```bash
# Check raw cost values
curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_cost_usage_USD_total' | jq .
# Check token usage
curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_token_usage_tokens_total' | jq .
```
**Causes:**
- Cost is cumulative counter (not reset between sessions)
- Dashboard may be using wrong time range
- Model pricing may have changed
**Solutions:**
- Use `increase(<metric>[24h])` rather than raw counter values
- Verify pricing in metrics reference
- Check Claude Code version (pricing may vary)
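For example, a query for cost accrued over the last 24 hours, broken down by model (assuming the double-prefixed metric name used throughout this setup):
```bash
# Example: cost per model over the last 24 hours, queried via the Prometheus HTTP API.
curl -s -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))' \
  | jq '.data.result'
```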
---
### Missing Sessions
**Symptom:** Some Claude Code sessions not recorded
**Causes:**
1. Claude Code wasn't restarted after settings update
2. OTEL Collector was down during session
3. Export interval hadn't elapsed yet (60 seconds default)
4. Network issue prevented export
**Solutions:**
- Always restart Claude Code after settings changes
- Monitor OTEL Collector uptime
- Check OTEL Collector logs for export errors
- Reduce export interval if real-time data needed
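For example, a sketch that lowers the metric export interval to 10 seconds (the value is in milliseconds; restart Claude Code afterwards):
```bash
# Sketch: export metrics every 10 seconds instead of the 60-second default.
jq '.env.OTEL_METRIC_EXPORT_INTERVAL = "10000"' ~/.claude/settings.json > ~/.claude/settings.json.tmp \
  && mv ~/.claude/settings.json.tmp ~/.claude/settings.json
```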
---
## Getting Help
### Collect Debug Information
When asking for help, provide:
```bash
# 1. Container status
docker compose ps
# 2. Container logs (last 50 lines)
docker compose logs --tail=50
# 3. Configuration files
cat ~/.claude/telemetry/otel-collector-config.yml
cat ~/.claude/telemetry/prometheus.yml
# 4. Claude Code settings (redact sensitive info!)
jq '.env | with_entries(select(.key | startswith("OTEL_")))' ~/.claude/settings.json
# 5. Prometheus metrics list
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code
# 6. System info
docker --version
docker compose version
uname -a
```
### Enable Debug Logging
**OTEL Collector:**
```yaml
exporters:
debug:
verbosity: detailed # Change from 'normal'
service:
telemetry:
logs:
level: debug # Change from 'info'
```
**Claude Code:**
Add to settings.json:
```json
"env": {
"OTEL_LOG_LEVEL": "debug"
}
```
Then check logs:
```bash
docker compose logs -f otel-collector
```
---
## Additional Resources
- **OTEL Collector Docs:** https://opentelemetry.io/docs/collector/
- **Prometheus Troubleshooting:** https://prometheus.io/docs/prometheus/latest/troubleshooting/
- **Grafana Troubleshooting:** https://grafana.com/docs/grafana/latest/troubleshooting/
- **Docker Compose Docs:** https://docs.docker.com/compose/

View File

@@ -0,0 +1,812 @@
# Mode 1: Local PoC Setup - Detailed Workflow
Complete step-by-step process for setting up a local OpenTelemetry stack for Claude Code telemetry.
---
## Overview
**Goal:** Create a complete local telemetry monitoring stack
**Time:** 5-7 minutes
**Prerequisites:** Docker Desktop, Claude Code, 2GB+ free disk space
**Output:** Running Grafana dashboard with Claude Code metrics
---
## Phase 0: Prerequisites Verification
### Step 0.1: Check Docker Installation
```bash
# Check if Docker is installed
docker --version
# Expected: Docker version 20.10.0 or higher
```
**If not installed:**
```
Docker is not installed. Please install Docker Desktop:
- Mac: https://docs.docker.com/desktop/install/mac-install/
- Linux: https://docs.docker.com/desktop/install/linux-install/
- Windows: https://docs.docker.com/desktop/install/windows-install/
```
**Stop if:** Docker not installed
### Step 0.2: Verify Docker is Running
```bash
# Check Docker daemon
docker ps
# Expected: List of containers (or empty list)
# Error: "Cannot connect to Docker daemon" means Docker isn't running
```
**If not running:**
```
Docker Desktop is not running. Please:
1. Open Docker Desktop application
2. Wait for the whale icon to be stable (not animated)
3. Try again
```
**Stop if:** Docker not running
### Step 0.3: Check Docker Compose
```bash
# Modern Docker includes compose
docker compose version
# Expected: Docker Compose version v2.x.x or higher
```
**Note:** We use `docker compose` (not `docker-compose`)
### Step 0.4: Check Available Ports
```bash
# Check if ports are available
lsof -i :3000 -i :4317 -i :4318 -i :8889 -i :9090 -i :3100
# Expected: No output (ports are free)
```
**If ports in use:**
```
The following ports are required but already in use:
- 3000: Grafana
- 4317: OTEL Collector (gRPC)
- 4318: OTEL Collector (HTTP)
- 8889: OTEL Collector (Prometheus exporter)
- 9090: Prometheus
- 3100: Loki
Options:
1. Stop services using these ports
2. Modify port mappings in docker-compose.yml (advanced)
```
**Stop if:** Critical ports (3000, 4317, 9090) are in use
### Step 0.5: Check Disk Space
```bash
# Check available disk space
df -h ~
# Minimum: 2GB free (for Docker images ~1.5GB + data volumes)
# Recommended: 5GB+ free for comfortable operation
```
**If low disk space:**
```
Low disk space detected. Setup requires:
- Initial: ~1.5GB for Docker images (OTEL, Prometheus, Grafana, Loki)
- Runtime: 500MB+ for data volumes (grows over time)
- Minimum: 2GB free disk space required
Please free up space before continuing.
```
---
## Phase 1: Directory Structure Creation
### Step 1.1: Create Base Directory
```bash
mkdir -p ~/.claude/telemetry/{dashboards,docs}
cd ~/.claude/telemetry
```
**Verify:**
```bash
ls -la ~/.claude/telemetry
# Should show: dashboards/ and docs/ directories
```
---
## Phase 2: Configuration File Generation
### Step 2.1: Create docker-compose.yml
**Template:** `templates/docker-compose-template.yml`
```yaml
services:
# OpenTelemetry Collector - receives telemetry from Claude Code
otel-collector:
image: otel/opentelemetry-collector-contrib:0.115.1
container_name: claude-otel-collector
command: ["--config=/etc/otel-collector-config.yml"]
volumes:
- ./otel-collector-config.yml:/etc/otel-collector-config.yml
ports:
- "4317:4317" # OTLP gRPC receiver
- "4318:4318" # OTLP HTTP receiver
- "8889:8889" # Prometheus metrics exporter
networks:
- claude-telemetry
# Prometheus - stores metrics
prometheus:
image: prom/prometheus:v2.55.1
container_name: claude-prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--web.enable-lifecycle'
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus-data:/prometheus
ports:
- "9090:9090"
networks:
- claude-telemetry
depends_on:
- otel-collector
# Loki - stores logs
loki:
image: grafana/loki:3.0.0
container_name: claude-loki
ports:
- "3100:3100"
command: -config.file=/etc/loki/local-config.yaml
volumes:
- loki-data:/loki
networks:
- claude-telemetry
# Grafana - visualization dashboards
grafana:
image: grafana/grafana:11.3.0
container_name: claude-grafana
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_USER=admin
- GF_SECURITY_ADMIN_PASSWORD=admin
- GF_USERS_ALLOW_SIGN_UP=false
volumes:
- grafana-data:/var/lib/grafana
- ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
networks:
- claude-telemetry
depends_on:
- prometheus
- loki
networks:
claude-telemetry:
driver: bridge
volumes:
prometheus-data:
loki-data:
grafana-data:
```
**Write to:** `~/.claude/telemetry/docker-compose.yml`
**Note on Image Versions:**
- Versions are pinned to prevent breaking changes from upstream
- Current versions (tested and stable):
- OTEL Collector: 0.115.1
- Prometheus: v2.55.1
- Loki: 3.0.0
- Grafana: 11.3.0
- To update: Change version tags in docker-compose.yml and run `docker compose pull`
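A typical update sequence, for reference:
```bash
# After bumping the version tags in docker-compose.yml:
cd ~/.claude/telemetry
docker compose pull    # fetch the new images
docker compose up -d   # recreate only the containers whose image changed
```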
### Step 2.2: Create OTEL Collector Configuration
**Template:** `templates/otel-collector-config-template.yml`
**CRITICAL:** Use `debug` exporter, not deprecated `logging` exporter
```yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 10s
send_batch_size: 1024
resource:
attributes:
- key: service.name
value: claude-code
action: upsert
memory_limiter:
check_interval: 1s
limit_mib: 512
exporters:
# Export metrics to Prometheus
prometheus:
endpoint: "0.0.0.0:8889"
namespace: claude_code
const_labels:
source: claude_code_telemetry
# Export logs to Loki via OTLP HTTP
otlphttp/loki:
endpoint: http://loki:3100/otlp
tls:
insecure: true
# Debug exporter (replaces deprecated logging exporter)
debug:
verbosity: normal
service:
pipelines:
metrics:
receivers: [otlp]
processors: [memory_limiter, batch, resource]
exporters: [prometheus, debug]
logs:
receivers: [otlp]
processors: [memory_limiter, batch, resource]
exporters: [otlphttp/loki, debug]
telemetry:
logs:
level: info
```
**Write to:** `~/.claude/telemetry/otel-collector-config.yml`
### Step 2.3: Create Prometheus Configuration
**Template:** `templates/prometheus-config-template.yml`
```yaml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'otel-collector'
static_configs:
- targets: ['otel-collector:8889']
```
**Write to:** `~/.claude/telemetry/prometheus.yml`
### Step 2.4: Create Grafana Datasources Configuration
**Template:** `templates/grafana-datasources-template.yml`
```yaml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: true
- name: Loki
type: loki
access: proxy
url: http://loki:3100
editable: true
```
**Write to:** `~/.claude/telemetry/grafana-datasources.yml`
### Step 2.5: Create Management Scripts
**Start Script:**
```bash
#!/bin/bash
# start-telemetry.sh
echo "🚀 Starting Claude Code Telemetry Stack..."
# Check if Docker is running
if ! docker info > /dev/null 2>&1; then
echo "❌ Docker is not running. Please start Docker Desktop."
exit 1
fi
cd ~/.claude/telemetry || exit 1
# Start containers
docker compose up -d
# Wait for services to be ready
echo "⏳ Waiting for services to start..."
sleep 5
# Check container status
echo ""
echo "📊 Container Status:"
docker ps --filter "name=claude-" --format "table {{.Names}}\t{{.Status}}"
echo ""
echo "✅ Telemetry stack started!"
echo ""
echo "🌐 Access URLs:"
echo " Grafana: http://localhost:3000 (admin/admin)"
echo " Prometheus: http://localhost:9090"
echo " Loki: http://localhost:3100"
echo ""
echo "📝 Next steps:"
echo " 1. Restart Claude Code to activate telemetry"
echo " 2. Import dashboards into Grafana"
echo " 3. Use Claude Code normally - metrics will appear in ~60 seconds"
```
**Write to:** `~/.claude/telemetry/start-telemetry.sh`
```bash
chmod +x ~/.claude/telemetry/start-telemetry.sh
```
**Stop Script:**
```bash
#!/bin/bash
# stop-telemetry.sh
echo "🛑 Stopping Claude Code Telemetry Stack..."
cd ~/.claude/telemetry || exit 1
docker compose down
echo "✅ Telemetry stack stopped"
echo ""
echo "Note: Data is preserved in Docker volumes."
echo "To start again: ./start-telemetry.sh"
echo "To completely remove all data: ./cleanup-telemetry.sh"
```
**Write to:** `~/.claude/telemetry/stop-telemetry.sh`
```bash
chmod +x ~/.claude/telemetry/stop-telemetry.sh
```
**Cleanup Script (Full Data Removal):**
```bash
#!/bin/bash
# cleanup-telemetry.sh
echo "⚠️ WARNING: This will remove ALL telemetry data including:"
echo " - All containers"
echo " - All Docker volumes (Grafana, Prometheus, Loki data)"
echo " - Network configuration"
echo ""
read -p "Are you sure you want to proceed? (yes/no): " -r
echo
if [[ ! $REPLY =~ ^[Yy][Ee][Ss]$ ]]; then
echo "Cleanup cancelled."
exit 0
fi
echo "Performing full cleanup of Claude Code telemetry stack..."
cd ~/.claude/telemetry || exit 1
docker compose down -v
echo ""
echo "✅ Full cleanup complete!"
echo ""
echo "Removed:"
echo " ✓ All containers (otel-collector, prometheus, loki, grafana)"
echo " ✓ All volumes (all historical data)"
echo " ✓ Network configuration"
echo ""
echo "Preserved:"
echo " ✓ Configuration files in ~/.claude/telemetry/"
echo " ✓ Claude Code settings in ~/.claude/settings.json"
echo ""
echo "To start fresh: ./start-telemetry.sh"
```
**Write to:** `~/.claude/telemetry/cleanup-telemetry.sh`
```bash
chmod +x ~/.claude/telemetry/cleanup-telemetry.sh
```
---
## Phase 3: Start Docker Containers
### Step 3.1: Start All Services
```bash
cd ~/.claude/telemetry
docker compose up -d
```
**Expected output:**
```
[+] Running 5/5
✔ Network claude_claude-telemetry Created
✔ Container claude-loki Started
✔ Container claude-otel-collector Started
✔ Container claude-prometheus Started
✔ Container claude-grafana Started
```
### Step 3.2: Verify Containers are Running
```bash
docker ps --filter "name=claude-" --format "table {{.Names}}\t{{.Status}}"
```
**Expected:** All 4 containers showing "Up X seconds/minutes"
**If OTEL Collector is not running:**
```bash
# Check logs
docker logs claude-otel-collector
```
**Common issue:** "logging exporter deprecated" error
**Solution:** Config file uses `debug` exporter (already fixed in template)
### Step 3.3: Wait for Services to be Healthy
```bash
# Give services time to initialize
sleep 10
# Test Prometheus
curl -s http://localhost:9090/-/healthy
# Expected: Prometheus is Healthy.
# Test Grafana
curl -s http://localhost:3000/api/health | jq
# Expected: {"database": "ok", ...}
```
---
## Phase 4: Update Claude Code Settings
### Step 4.1: Backup Existing Settings
```bash
cp ~/.claude/settings.json ~/.claude/settings.json.backup
```
### Step 4.2: Read Current Settings
```bash
# Read existing settings
cat ~/.claude/settings.json
```
### Step 4.3: Merge Telemetry Configuration
**Add to settings.json `env` section:**
```json
{
"env": {
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
"OTEL_METRICS_EXPORTER": "otlp",
"OTEL_LOGS_EXPORTER": "otlp",
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
"OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
"OTEL_METRIC_EXPORT_INTERVAL": "60000",
"OTEL_LOGS_EXPORT_INTERVAL": "5000",
"OTEL_LOG_USER_PROMPTS": "1",
"OTEL_METRICS_INCLUDE_SESSION_ID": "true",
"OTEL_METRICS_INCLUDE_VERSION": "true",
"OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
"OTEL_RESOURCE_ATTRIBUTES": "environment=local,deployment=poc"
}
}
```
**Template:** `templates/settings-env-template.json`
**Note:** Merge with existing env vars, don't replace entire settings file
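Below is a minimal merge sketch using jq; it shows only a subset of the variables above (extend the object with the remaining ones) and validates the result before replacing the file:
```bash
# Sketch: merge the local telemetry variables into the existing env block,
# validate the generated JSON, and only then replace the original file.
jq '.env = (.env // {}) + {
      "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
      "OTEL_METRICS_EXPORTER": "otlp",
      "OTEL_LOGS_EXPORTER": "otlp",
      "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
      "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
      "OTEL_METRIC_EXPORT_INTERVAL": "60000",
      "OTEL_LOGS_EXPORT_INTERVAL": "5000"
    }' ~/.claude/settings.json > ~/.claude/settings.json.tmp \
  && jq empty ~/.claude/settings.json.tmp \
  && mv ~/.claude/settings.json.tmp ~/.claude/settings.json
```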
### Step 4.4: Verify Settings Updated
```bash
cat ~/.claude/settings.json | grep CLAUDE_CODE_ENABLE_TELEMETRY
# Expected: "CLAUDE_CODE_ENABLE_TELEMETRY": "1"
```
---
## Phase 5: Grafana Dashboard Import
### Step 5.1: Detect Prometheus Datasource UID
**Option A: Via Grafana API**
```bash
curl -s http://admin:admin@localhost:3000/api/datasources | \
jq '.[] | select(.type=="prometheus") | {name, uid}'
```
**Expected:**
```json
{
"name": "Prometheus",
"uid": "PBFA97CFB590B2093"
}
```
**Option B: Manual Detection**
1. Open http://localhost:3000
2. Go to Connections → Data sources
3. Click Prometheus
4. Note the UID from the URL: `/datasources/edit/{UID}`
### Step 5.2: Fix Dashboard with Correct UID
**Read dashboard template:** `dashboards/claude-code-overview-template.json`
**Replace all instances of:**
```json
"datasource": {
"type": "prometheus",
"uid": "prometheus"
}
```
**With:**
```json
"datasource": {
"type": "prometheus",
"uid": "PBFA97CFB590B2093"
}
```
**Use detected UID from Step 5.1**
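One way to do the detection and replacement in a single pass (a sketch; it assumes the template file from this skill is in the current directory and the local admin/admin credentials):
```bash
# Sketch: detect the Prometheus datasource UID and substitute it into the dashboard template.
DATASOURCE_UID=$(curl -s http://admin:admin@localhost:3000/api/datasources \
  | jq -r '.[] | select(.type=="prometheus") | .uid')
sed "s/\"uid\": \"prometheus\"/\"uid\": \"$DATASOURCE_UID\"/g" \
  claude-code-overview-template.json \
  > ~/.claude/telemetry/dashboards/claude-code-overview.json
```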
### Step 5.3: Verify Metric Names
**CRITICAL:** Claude Code metrics use double prefix: `claude_code_claude_code_*`
**Verify actual metric names:**
```bash
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | \
grep claude_code
```
**Expected metrics:**
- `claude_code_claude_code_active_time_seconds_total`
- `claude_code_claude_code_commit_count_total`
- `claude_code_claude_code_cost_usage_USD_total`
- `claude_code_claude_code_lines_of_code_count_total`
- `claude_code_claude_code_token_usage_tokens_total`
**Dashboard queries must use these exact names**
### Step 5.4: Save Corrected Dashboard
**Write to:** `~/.claude/telemetry/dashboards/claude-code-overview.json`
### Step 5.5: Import Dashboard
**Option A: Via Grafana UI**
1. Open http://localhost:3000 (admin/admin)
2. Dashboards → New → Import
3. Upload JSON file: `~/.claude/telemetry/dashboards/claude-code-overview.json`
4. Click Import
**Option B: Via API**
```bash
curl -X POST http://admin:admin@localhost:3000/api/dashboards/db \
-H "Content-Type: application/json" \
-d @~/.claude/telemetry/dashboards/claude-code-overview.json
```
---
## Phase 6: Verification & Testing
### Step 6.1: Verify OTEL Collector Receiving Data
**Note:** Claude Code must be restarted for telemetry to activate!
```bash
# Check OTEL Collector logs for incoming data
docker logs claude-otel-collector --tail 50 | grep -i "received"
```
**Expected:** Messages about receiving OTLP data
**If no data:**
```
Reminder: You must restart Claude Code for telemetry to activate.
1. Exit current Claude Code session
2. Start new session: claude
3. Wait 60 seconds
4. Check again
```
### Step 6.2: Query Prometheus for Metrics
```bash
# Check if any claude_code metrics exist
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | \
jq '.data[] | select(. | startswith("claude_code"))'
```
**Expected:** List of claude_code metrics
**Sample query:**
```bash
curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_lines_of_code_count_total' | \
jq '.data.result'
```
**Expected:** Non-empty result array
### Step 6.3: Test Grafana Dashboard
1. Open http://localhost:3000
2. Navigate to imported dashboard
3. Check panels show data (or "No data" if Claude Code hasn't been used yet)
**If "No data":**
- Normal if Claude Code hasn't generated any activity yet
- Use Claude Code for 1-2 minutes
- Refresh dashboard
**If "Datasource not found":**
- UID mismatch - go back to Step 5.1
**If queries fail:**
- Metric name mismatch - verify double prefix
### Step 6.4: Generate Test Data
**To populate dashboard quickly:**
```
Use Claude Code to:
1. Ask a question (generates token usage)
2. Request a code modification (generates LOC metrics)
3. Have a conversation (generates active time)
```
**Wait 60 seconds, then refresh Grafana dashboard**
---
## Phase 7: Documentation & Quickstart Guide
### Step 7.1: Create Quickstart Guide
**Write to:** `~/.claude/telemetry/docs/quickstart.md`
**Include:**
- URLs and credentials
- Management commands (start/stop)
- What metrics are being collected
- How to access dashboards
- Troubleshooting quick reference
**Template:** `data/quickstart-template.md`
### Step 7.2: Provide User Summary
```
✅ Setup Complete!
📦 Installation:
Location: ~/.claude/telemetry/
Containers: 4 running (OTEL Collector, Prometheus, Loki, Grafana)
🌐 Access URLs:
Grafana: http://localhost:3000 (admin/admin)
Prometheus: http://localhost:9090
OTEL Collector: localhost:4317 (gRPC), localhost:4318 (HTTP)
📊 Dashboards Imported:
✓ Claude Code - Overview
📝 What's Being Collected:
• Session counts and active time
• Token usage (input/output/cached)
• API costs by model
• Lines of code modified
• Commits and PRs created
• Tool execution metrics
⚙️ Management:
Start: ~/.claude/telemetry/start-telemetry.sh
Stop: ~/.claude/telemetry/stop-telemetry.sh (preserves data)
Cleanup: ~/.claude/telemetry/cleanup-telemetry.sh (removes all data)
Logs: docker logs claude-otel-collector
🚀 Next Steps:
1. ✅ Restart Claude Code (telemetry activates on startup)
2. Use Claude Code normally
3. Check dashboard in ~60 seconds
4. Review quickstart: ~/.claude/telemetry/docs/quickstart.md
📚 Documentation:
- Quickstart: ~/.claude/telemetry/docs/quickstart.md
- Metrics Reference: data/metrics-reference.md
- Troubleshooting: data/troubleshooting.md
```
---
## Cleanup Instructions
### Remove Stack (Keep Data)
```bash
cd ~/.claude/telemetry
docker compose down
```
### Remove Stack and Data
```bash
cd ~/.claude/telemetry
docker compose down -v
```
### Remove Telemetry from Claude Code
Edit `~/.claude/settings.json` and remove the `env` section with telemetry variables, or set:
```json
"CLAUDE_CODE_ENABLE_TELEMETRY": "0"
```
Then restart Claude Code.
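A one-line sketch for the kill switch, if you prefer jq over manual editing:
```bash
# Sketch: disable telemetry, then restart Claude Code.
jq '.env.CLAUDE_CODE_ENABLE_TELEMETRY = "0"' ~/.claude/settings.json > ~/.claude/settings.json.tmp \
  && mv ~/.claude/settings.json.tmp ~/.claude/settings.json
```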
---
## Troubleshooting
See `data/troubleshooting.md` for detailed solutions to common issues.
**Quick fixes:**
- Container won't start → Check logs: `docker logs claude-otel-collector`
- No metrics → Restart Claude Code
- Dashboard broken → Verify datasource UID
- Wrong metric names → Use double prefix: `claude_code_claude_code_*`

View File

@@ -0,0 +1,572 @@
# Mode 2: Enterprise Setup (Connect to Existing Infrastructure)
**Goal:** Configure Claude Code to send telemetry to centralized company infrastructure
**When to use:**
- Company has centralized OTEL Collector endpoint
- Team rollout scenario
- Want aggregated team metrics
- Privacy/compliance requires centralized control
- No need for local Grafana dashboards
**Prerequisites:**
- OTEL Collector endpoint URL (e.g., `https://otel.company.com:4317`)
- Authentication credentials (API key or mTLS certificates)
- Optional: Team/department identifiers
- Write access to `~/.claude/settings.json`
**Estimated Time:** 2-3 minutes
---
## Phase 0: Gather Requirements
### Step 0.1: Collect endpoint information from user
Ask the user for the following details:
1. **OTEL Collector Endpoint URL**
- Format: `https://otel.company.com:4317` or `http://otel.company.com:4318`
- Protocol: gRPC (port 4317) or HTTP (port 4318)
2. **Authentication Method**
- API Key/Bearer Token
- mTLS certificates
- Basic Auth
- No authentication (internal network)
3. **Team/Environment Identifiers**
- Team name (e.g., `team=platform`)
- Environment (e.g., `environment=production`)
- Department (e.g., `department=engineering`)
- Any other custom attributes
4. **Optional: Protocol Preferences**
- Default: gRPC (more efficient)
- Alternative: HTTP (better firewall compatibility)
**Example Questions:**
```
To configure enterprise telemetry, I need a few details:
1. **Endpoint:** What is your OTEL Collector endpoint URL?
(e.g., https://otel.company.com:4317)
2. **Protocol:** HTTPS or HTTP? gRPC or HTTP/protobuf?
3. **Authentication:** Do you have an API key, certificate, or other credentials?
4. **Team identifier:** What team/department should metrics be tagged with?
(e.g., team=platform, department=engineering)
```
---
## Phase 1: Backup Existing Settings
### Step 1.1: Backup settings.json
**Always backup before modifying!**
```bash
# Check if settings.json exists
if [ -f ~/.claude/settings.json ]; then
cp ~/.claude/settings.json ~/.claude/settings.json.backup.$(date +%Y%m%d-%H%M%S)
echo "✅ Backup created: ~/.claude/settings.json.backup.$(date +%Y%m%d-%H%M%S)"
else
echo "⚠️ No existing settings.json found - will create new one"
fi
```
### Step 1.2: Read existing settings
```bash
# Check current settings
cat ~/.claude/settings.json
```
**Important:** Preserve all existing settings when adding telemetry configuration!
---
## Phase 2: Update Claude Code Settings
### Step 2.1: Determine configuration based on authentication method
**Scenario A: API Key Authentication**
```json
{
"env": {
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
"OTEL_METRICS_EXPORTER": "otlp",
"OTEL_LOGS_EXPORTER": "otlp",
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
"OTEL_EXPORTER_OTLP_ENDPOINT": "https://otel.company.com:4317",
"OTEL_EXPORTER_OTLP_HEADERS": "Authorization=Bearer YOUR_API_KEY_HERE",
"OTEL_METRIC_EXPORT_INTERVAL": "60000",
"OTEL_LOGS_EXPORT_INTERVAL": "5000",
"OTEL_LOG_USER_PROMPTS": "1",
"OTEL_METRICS_INCLUDE_SESSION_ID": "true",
"OTEL_METRICS_INCLUDE_VERSION": "true",
"OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
"OTEL_RESOURCE_ATTRIBUTES": "team=platform,environment=production,deployment=enterprise"
}
}
```
**Scenario B: mTLS Certificate Authentication**
```json
{
"env": {
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
"OTEL_METRICS_EXPORTER": "otlp",
"OTEL_LOGS_EXPORTER": "otlp",
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
"OTEL_EXPORTER_OTLP_ENDPOINT": "https://otel.company.com:4317",
"OTEL_EXPORTER_OTLP_CERTIFICATE": "/path/to/ca-cert.pem",
"OTEL_EXPORTER_OTLP_CLIENT_KEY": "/path/to/client-key.pem",
"OTEL_EXPORTER_OTLP_CLIENT_CERTIFICATE": "/path/to/client-cert.pem",
"OTEL_METRIC_EXPORT_INTERVAL": "60000",
"OTEL_LOGS_EXPORT_INTERVAL": "5000",
"OTEL_LOG_USER_PROMPTS": "1",
"OTEL_METRICS_INCLUDE_SESSION_ID": "true",
"OTEL_METRICS_INCLUDE_VERSION": "true",
"OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
"OTEL_RESOURCE_ATTRIBUTES": "team=platform,environment=production,deployment=enterprise"
}
}
```
**Scenario C: HTTP Protocol (Port 4318)**
```json
{
"env": {
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
"OTEL_METRICS_EXPORTER": "otlp",
"OTEL_LOGS_EXPORTER": "otlp",
"OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
"OTEL_EXPORTER_OTLP_ENDPOINT": "https://otel.company.com:4318",
"OTEL_EXPORTER_OTLP_HEADERS": "Authorization=Bearer YOUR_API_KEY_HERE",
"OTEL_METRIC_EXPORT_INTERVAL": "60000",
"OTEL_LOGS_EXPORT_INTERVAL": "5000",
"OTEL_LOG_USER_PROMPTS": "1",
"OTEL_METRICS_INCLUDE_SESSION_ID": "true",
"OTEL_METRICS_INCLUDE_VERSION": "true",
"OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
"OTEL_RESOURCE_ATTRIBUTES": "team=platform,environment=production,deployment=enterprise"
}
}
```
**Scenario D: No Authentication (Internal Network)**
```json
{
"env": {
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
"OTEL_METRICS_EXPORTER": "otlp",
"OTEL_LOGS_EXPORTER": "otlp",
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
"OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel.internal.company.com:4317",
"OTEL_METRIC_EXPORT_INTERVAL": "60000",
"OTEL_LOGS_EXPORT_INTERVAL": "5000",
"OTEL_LOG_USER_PROMPTS": "1",
"OTEL_METRICS_INCLUDE_SESSION_ID": "true",
"OTEL_METRICS_INCLUDE_VERSION": "true",
"OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
"OTEL_RESOURCE_ATTRIBUTES": "team=platform,environment=production,deployment=enterprise"
}
}
```
### Step 2.2: Update settings.json
**Method 1: Manual Update (Safest)**
1. Open `~/.claude/settings.json` in editor
2. Merge the telemetry configuration into existing `env` object
3. Preserve all other settings
4. Save file
**Method 2: Programmatic Update (Use with Caution)**
```bash
# Create merged settings (requires jq)
cat ~/.claude/settings.json | jq '. + {
"env": (.env // {} | . + {
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
"OTEL_METRICS_EXPORTER": "otlp",
"OTEL_LOGS_EXPORTER": "otlp",
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
"OTEL_EXPORTER_OTLP_ENDPOINT": "https://otel.company.com:4317",
"OTEL_EXPORTER_OTLP_HEADERS": "Authorization=Bearer YOUR_API_KEY_HERE",
"OTEL_METRIC_EXPORT_INTERVAL": "60000",
"OTEL_LOGS_EXPORT_INTERVAL": "5000",
"OTEL_LOG_USER_PROMPTS": "1",
"OTEL_METRICS_INCLUDE_SESSION_ID": "true",
"OTEL_METRICS_INCLUDE_VERSION": "true",
"OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
"OTEL_RESOURCE_ATTRIBUTES": "team=platform,environment=production,deployment=enterprise"
})
}' > ~/.claude/settings.json.new
# Validate JSON
if jq empty ~/.claude/settings.json.new 2>/dev/null; then
mv ~/.claude/settings.json.new ~/.claude/settings.json
echo "✅ Settings updated successfully"
else
echo "❌ Generated JSON is invalid - keeping original settings"
rm ~/.claude/settings.json.new
fi
```
### Step 2.3: Validate configuration
```bash
# Check that settings.json is valid JSON
jq empty ~/.claude/settings.json
# Display telemetry configuration
jq '.env | with_entries(select(.key | startswith("OTEL_") or . == "CLAUDE_CODE_ENABLE_TELEMETRY"))' ~/.claude/settings.json
```
---
## Phase 3: Test Connectivity (Optional)
### Step 3.1: Test OTEL endpoint reachability
```bash
# Test gRPC endpoint (port 4317)
nc -zv otel.company.com 4317
# Test HTTP endpoint (port 4318)
curl -v https://otel.company.com:4318/v1/metrics -d '{}' -H "Content-Type: application/json"
```
### Step 3.2: Validate authentication
```bash
# Test with API key
curl -v https://otel.company.com:4318/v1/metrics \
-H "Authorization: Bearer YOUR_API_KEY_HERE" \
-H "Content-Type: application/json" \
-d '{}'
# Expected: 200 (accepted) or 401/403 (endpoint reachable, credentials rejected)
# Unexpected: Connection refused, timeout (network issue)
```
---
## Phase 4: User Instructions
### Step 4.1: Provide restart instructions
**Display to user:**
```
✅ Configuration complete!
**Important Next Steps:**
1. **Restart Claude Code** for telemetry to take effect
- Telemetry configuration is only loaded at startup
- Close all Claude Code sessions and restart
2. **Verify with your platform team** that they see metrics
- Metrics should appear within 60 seconds of restart
- Tagged with: team=platform, environment=production
- Metric prefix: claude_code_claude_code_*
3. **Dashboard access**
- Contact your platform team for Grafana/dashboard URLs
- Dashboards should be centrally managed
**Troubleshooting:**
If metrics don't appear:
- Check network connectivity to OTEL endpoint
- Verify authentication credentials are correct
- Check firewall rules allow outbound connections
- Review OTEL Collector logs on backend (platform team)
- Verify OTEL_EXPORTER_OTLP_ENDPOINT is correct
**Rollback:**
If you need to disable telemetry:
- Restore backup: cp ~/.claude/settings.json.backup.TIMESTAMP ~/.claude/settings.json
- Or set: "CLAUDE_CODE_ENABLE_TELEMETRY": "0"
```
---
## Phase 5: Create Team Rollout Documentation
### Step 5.1: Generate rollout guide for team distribution
**Create file: `claude-code-telemetry-setup-guide.md`**
```markdown
# Claude Code Telemetry Setup Guide
**For:** [Team Name] Team Members
**Last Updated:** [Date]
## Overview
We're collecting Claude Code usage telemetry to:
- Track API costs and optimize spending
- Measure productivity metrics (LOC, commits, PRs)
- Understand token usage patterns
- Identify high-value use cases
**Privacy:** All metrics are aggregated and anonymized at the team level.
## Setup Instructions
### Step 1: Backup Your Settings
```bash
cp ~/.claude/settings.json ~/.claude/settings.json.backup
```
### Step 2: Update Configuration
Add the following to your `~/.claude/settings.json`:
```json
{
"env": {
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
"OTEL_METRICS_EXPORTER": "otlp",
"OTEL_LOGS_EXPORTER": "otlp",
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
"OTEL_EXPORTER_OTLP_ENDPOINT": "https://otel.company.com:4317",
"OTEL_EXPORTER_OTLP_HEADERS": "Authorization=Bearer [PROVIDED_BY_PLATFORM_TEAM]",
"OTEL_METRIC_EXPORT_INTERVAL": "60000",
"OTEL_LOGS_EXPORT_INTERVAL": "5000",
"OTEL_LOG_USER_PROMPTS": "1",
"OTEL_METRICS_INCLUDE_SESSION_ID": "true",
"OTEL_METRICS_INCLUDE_VERSION": "true",
"OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
"OTEL_RESOURCE_ATTRIBUTES": "team=[TEAM_NAME],environment=production"
}
}
```
**Important:** Replace `[PROVIDED_BY_PLATFORM_TEAM]` with your API key.
### Step 3: Restart Claude Code
Close all Claude Code sessions and restart for changes to take effect.
### Step 4: Verify Setup
After 5 minutes of usage:
1. Check team dashboard: [DASHBOARD_URL]
2. Verify your metrics appear in the team aggregation
3. Contact [TEAM_CONTACT] if you have issues
## What's Being Collected?
**Metrics:**
- Session counts and active time
- Token usage (input, output, cached)
- API costs by model
- Lines of code modified
- Commits and PRs created
**Events/Logs:**
- User prompts (anonymized)
- Tool executions
- API requests
**NOT Collected:**
- Source code content
- File names or paths
- Personal identifiers (beyond account UUID for deduplication)
## Dashboard Access
**Team Dashboard:** [URL]
**Login:** Use your company SSO
## Support
**Issues?** Contact [TEAM_CONTACT] or #claude-code-telemetry Slack channel
**Opt-Out:** Contact [TEAM_CONTACT] if you need to opt out for specific projects
```
---
## Phase 6: Success Criteria
### Checklist for Mode 2 completion:
- ✅ Backed up existing settings.json
- ✅ Updated settings with correct OTEL endpoint
- ✅ Added authentication (API key or certificates)
- ✅ Set team/environment resource attributes
- ✅ Validated JSON configuration
- ✅ Tested connectivity (optional)
- ✅ Provided restart instructions to user
- ✅ Created team rollout documentation (if applicable)
**Expected outcome:**
- Claude Code sends telemetry to central endpoint within 60 seconds of restart
- Platform team can see metrics tagged with team identifier
- User has clear instructions for verification and troubleshooting
---
## Troubleshooting
### Issue 1: Connection Refused
**Symptoms:** Claude Code can't reach OTEL endpoint
**Checks:**
```bash
# Test network connectivity
ping otel.company.com
# Test port access
nc -zv otel.company.com 4317
# Check corporate VPN/proxy
echo $HTTPS_PROXY
```
**Solutions:**
- Connect to corporate VPN
- Use HTTP proxy if required: `HTTPS_PROXY=http://proxy.company.com:8080`
- Try HTTP protocol (port 4318) instead of gRPC
- Contact network team to allow outbound connections
### Issue 2: Authentication Failed
**Symptoms:** 401 or 403 errors in logs
**Checks:**
```bash
# Verify API key format
jq '.env.OTEL_EXPORTER_OTLP_HEADERS' ~/.claude/settings.json
# Test manually
curl -v https://otel.company.com:4318/v1/metrics \
-H "Authorization: Bearer YOUR_KEY" \
-d '{}'
```
**Solutions:**
- Verify API key is correct and not expired
- Check header format: `Authorization=Bearer TOKEN` (no quotes, equals sign)
- Confirm permissions with platform team
- Try rotating API key
### Issue 3: Metrics Not Appearing
**Symptoms:** Platform team doesn't see metrics after 5 minutes
**Checks:**
```bash
# Verify telemetry is enabled
jq '.env.CLAUDE_CODE_ENABLE_TELEMETRY' ~/.claude/settings.json
# Check endpoint configuration
jq '.env.OTEL_EXPORTER_OTLP_ENDPOINT' ~/.claude/settings.json
# Confirm Claude Code was restarted
ps aux | grep claude
```
**Solutions:**
- Restart Claude Code (telemetry loads at startup only)
- Verify endpoint URL has correct protocol and port
- Check with platform team if OTEL Collector is receiving data
- Review OTEL Collector logs for errors
- Verify resource attributes match expected format
### Issue 4: Certificate Errors (mTLS)
**Symptoms:** SSL/TLS handshake errors
**Checks:**
```bash
# Verify certificate paths
ls -la /path/to/client-cert.pem
ls -la /path/to/client-key.pem
ls -la /path/to/ca-cert.pem
# Check certificate validity
openssl x509 -in /path/to/client-cert.pem -noout -dates
```
**Solutions:**
- Ensure certificate files are readable
- Verify certificates haven't expired
- Check certificate chain is complete
- Confirm CA certificate matches server
- Contact platform team for new certificates if needed
---
## Enterprise Configuration Examples
### Example 1: Multi-Environment Setup
**Development:**
```json
"OTEL_RESOURCE_ATTRIBUTES": "team=platform,environment=development,user=john.doe"
```
**Staging:**
```json
"OTEL_RESOURCE_ATTRIBUTES": "team=platform,environment=staging,user=john.doe"
```
**Production:**
```json
"OTEL_RESOURCE_ATTRIBUTES": "team=platform,environment=production,user=john.doe"
```
### Example 2: Department-Level Aggregation
```json
"OTEL_RESOURCE_ATTRIBUTES": "department=engineering,team=platform,squad=backend,environment=production"
```
Enables queries like:
- Cost by department
- Usage by team within department
- Squad-level productivity metrics
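For instance, assuming the central backend promotes these resource attributes to metric labels (this depends on the platform team's collector configuration), a cost-by-team query might look like the sketch below; the Prometheus URL is a placeholder:
```bash
# Hedged example: total cost per team in engineering over the last 7 days.
# The Prometheus URL and the team/department labels are assumptions about the
# central setup; your platform team's configuration will differ.
curl -s -G 'https://prometheus.company.com/api/v1/query' \
  --data-urlencode 'query=sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total{department="engineering"}[7d]))' \
  | jq '.data.result'
```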
### Example 3: Project-Based Tagging
```json
"OTEL_RESOURCE_ATTRIBUTES": "team=platform,project=api-v2-migration,environment=production"
```
Track costs and effort for specific initiatives.
---
## Additional Resources
- **OTEL Specification:** https://opentelemetry.io/docs/specs/otel/
- **Claude Code Metrics Reference:** See `data/metrics-reference.md`
- **Enterprise Architecture:** See `data/enterprise-architecture.md`
- **Team Dashboard Queries:** See `data/prometheus-queries.md`
---
**Mode 2 Complete!**

View File

@@ -0,0 +1,214 @@
# Known Issues & Fixes
Common problems and solutions for Claude Code OpenTelemetry setup.
## Issue 1: Missing OTEL Exporters (Most Common)
**Problem**: Claude Code not sending telemetry even with `CLAUDE_CODE_ENABLE_TELEMETRY=1`
**Cause**: Missing required exporter settings
**Symptoms**:
- No metrics in Prometheus after restart
- OTEL Collector logs show no incoming connections
- Dashboard shows "No data"
**Fix**: Add to settings.json:
```json
{
"env": {
"OTEL_METRICS_EXPORTER": "otlp",
"OTEL_LOGS_EXPORTER": "otlp"
}
}
```
**Important**: Restart Claude Code after adding!
## Issue 2: OTEL Collector Deprecated 'address' Field
**Problem**: OTEL Collector crashes with "'address' has invalid keys" error
**Cause**: The `address` field in `service.telemetry.metrics` is deprecated in v0.123.0+
**Fix**: Remove the address field:
```yaml
service:
telemetry:
metrics:
level: detailed
# REMOVE: address: ":8888"
```
## Issue 3: OTEL Collector Deprecated Exporter
**Problem**: OTEL Collector fails with "logging exporter has been deprecated"
**Fix**: Use `debug` exporter instead:
```yaml
exporters:
debug:
verbosity: normal
service:
pipelines:
metrics:
exporters: [prometheus, debug]
```
## Issue 4: Dashboard Datasource Not Found
**Problem**: Grafana dashboard shows "datasource prometheus not found"
**Cause**: Dashboard has hardcoded UID that doesn't match your setup
**Fix**:
1. Detect your actual UID:
```bash
curl -s http://admin:admin@localhost:3000/api/datasources | jq -r '.[] | select(.type=="prometheus") | .uid'
```
2. Replace all occurrences in dashboard JSON:
```bash
sed -i '' 's/"uid": "prometheus"/"uid": "YOUR_ACTUAL_UID"/g' dashboard.json
```
3. Re-import the dashboard
## Issue 5: Metric Names Double Prefix
**Problem**: Dashboard queries fail because metrics have format `claude_code_claude_code_*`
**Cause**: Claude Code adds prefix, OTEL Collector adds another
**Affected Metrics**:
- `claude_code_claude_code_lines_of_code_count_total`
- `claude_code_claude_code_cost_usage_USD_total`
- `claude_code_claude_code_token_usage_tokens_total`
- `claude_code_claude_code_active_time_seconds_total`
- `claude_code_claude_code_commit_count_total`
**Fix**: Update dashboard queries to use actual metric names
**Verify actual names**:
```bash
curl -s http://localhost:9090/api/v1/label/__name__/values | jq '.data[]' | grep claude
```
## Issue 6: No Data in Prometheus
**Diagnostic Steps**:
1. **Check containers running**:
```bash
docker ps --format "table {{.Names}}\t{{.Status}}"
```
2. **Check OTEL Collector logs**:
```bash
docker logs claude-otel-collector 2>&1 | tail -50
```
3. **Query Prometheus directly**:
```bash
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq '.data.result'
```
4. **Verify Claude Code settings**:
```bash
cat ~/.claude/settings.json | jq '.env'
```
**Common Causes**:
- Claude Code not restarted after settings change
- Missing OTEL_METRICS_EXPORTER setting
- Wrong endpoint (should be localhost:4317 for local)
- Firewall blocking ports
## Issue 7: Port Conflicts
**Problem**: Container fails to start due to port already in use
**Check ports**:
```bash
for port in 3000 4317 4318 8889 9090; do
lsof -i :$port && echo "Port $port in use"
done
```
**Solutions**:
- Stop conflicting service
- Change port in docker-compose.yml
- Use different port mapping
## Issue 8: Docker Not Running
**Problem**: Commands fail with "Cannot connect to Docker daemon"
**Fix**:
1. Start Docker Desktop application
2. Wait for it to fully initialize
3. Verify: `docker info`
## Issue 9: Insufficient Disk Space
**Problem**: Containers fail to start or crash
**Required**: Minimum 2GB free
**Check**:
```bash
df -h ~/.claude
```
**Solutions**:
- Clean Docker: `docker system prune`
- Remove old images: `docker image prune -a`
- Clear telemetry volumes: `~/.claude/telemetry/cleanup-telemetry.sh`
## Issue 10: Grafana Dashboard Empty After Import
**Diagnostic Steps**:
1. Check time range (upper right) - data might be outside range
2. Verify datasource is connected (green checkmark in settings)
3. Run test query in Explore view
4. Check metric names match actual names in Prometheus
## Debugging Commands
```bash
# Full container status
docker compose -f ~/.claude/telemetry/docker-compose.yml ps
# OTEL Collector config validation
docker exec claude-otel-collector cat /etc/otel-collector-config.yml
# Prometheus targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets'
# Grafana datasources
curl -s http://admin:admin@localhost:3000/api/datasources | jq '.'
# All available metrics
curl -s http://localhost:9090/api/v1/label/__name__/values | jq '.data | length'
```
## Getting Help
If issues persist:
1. Collect diagnostics:
```bash
docker compose -f ~/.claude/telemetry/docker-compose.yml logs > telemetry-logs.txt
cat ~/.claude/settings.json | jq '.env' > settings-env.txt
```
2. Check versions:
```bash
docker --version
docker compose version
```
3. Provide: logs, settings, versions, and exact error message

View File

@@ -0,0 +1,38 @@
#!/bin/bash
# Full Cleanup of Claude Code Telemetry Stack
# WARNING: This removes all data including Docker volumes
echo "⚠️ WARNING: This will remove ALL telemetry data including:"
echo " - All containers"
echo " - All Docker volumes (Grafana, Prometheus, Loki data)"
echo " - Network configuration"
echo ""
read -p "Are you sure you want to proceed? (yes/no): " -r
echo
if [[ ! $REPLY =~ ^[Yy][Ee][Ss]$ ]]; then
echo "Cleanup cancelled."
exit 0
fi
echo "Performing full cleanup of Claude Code telemetry stack..."
# Navigate to telemetry directory
cd ~/.claude/telemetry || exit 1
# Stop and remove containers, networks, and volumes
docker compose down -v
echo ""
echo "✅ Full cleanup complete!"
echo ""
echo "Removed:"
echo " ✓ All containers (otel-collector, prometheus, loki, grafana)"
echo " ✓ All volumes (all historical data)"
echo " ✓ Network configuration"
echo ""
echo "Preserved:"
echo " ✓ Configuration files in ~/.claude/telemetry/"
echo " ✓ Claude Code settings in ~/.claude/settings.json"
echo ""
echo "To start fresh: ./start-telemetry.sh"

View File

@@ -0,0 +1,74 @@
services:
# OpenTelemetry Collector - receives telemetry from Claude Code
otel-collector:
image: otel/opentelemetry-collector-contrib:0.115.1
container_name: claude-otel-collector
command: ["--config=/etc/otel-collector-config.yml"]
volumes:
- ./otel-collector-config.yml:/etc/otel-collector-config.yml
ports:
- "4317:4317" # OTLP gRPC receiver
- "4318:4318" # OTLP HTTP receiver
- "8889:8889" # Prometheus metrics exporter
networks:
- claude-telemetry
# Prometheus - stores metrics
prometheus:
image: prom/prometheus:v2.55.1
container_name: claude-prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--web.enable-lifecycle'
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus-data:/prometheus
ports:
- "9090:9090"
networks:
- claude-telemetry
depends_on:
- otel-collector
# Loki - stores logs
loki:
image: grafana/loki:3.0.0
container_name: claude-loki
ports:
- "3100:3100"
command: -config.file=/etc/loki/local-config.yaml
volumes:
- loki-data:/loki
networks:
- claude-telemetry
# Grafana - visualization dashboards
grafana:
image: grafana/grafana:11.3.0
container_name: claude-grafana
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_USER=admin
- GF_SECURITY_ADMIN_PASSWORD=admin
- GF_USERS_ALLOW_SIGN_UP=false
volumes:
- grafana-data:/var/lib/grafana
- ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
networks:
- claude-telemetry
depends_on:
- prometheus
- loki
networks:
claude-telemetry:
driver: bridge
volumes:
prometheus-data:
loki-data:
grafana-data:

View File

@@ -0,0 +1,19 @@
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: true
jsonData:
timeInterval: "15s"
- name: Loki
type: loki
access: proxy
url: http://loki:3100
editable: true
jsonData:
maxLines: 1000

View File

@@ -0,0 +1,56 @@
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 10s
send_batch_size: 1024
resource:
attributes:
- key: service.name
value: claude-code
action: upsert
memory_limiter:
check_interval: 1s
limit_mib: 512
exporters:
# Export metrics to Prometheus
prometheus:
endpoint: "0.0.0.0:8889"
namespace: claude_code
const_labels:
source: claude_code_telemetry
# Export logs to Loki via OTLP HTTP
otlphttp/loki:
endpoint: http://loki:3100/otlp
tls:
insecure: true
# Debug exporter (outputs to console for troubleshooting)
debug:
verbosity: normal
service:
pipelines:
metrics:
receivers: [otlp]
processors: [memory_limiter, batch, resource]
exporters: [prometheus, debug]
logs:
receivers: [otlp]
processors: [memory_limiter, batch, resource]
exporters: [otlphttp/loki, debug]
telemetry:
logs:
level: info

View File

@@ -0,0 +1,14 @@
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'otel-collector'
static_configs:
- targets: ['otel-collector:8889']
labels:
source: 'claude-code'
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']

View File

@@ -0,0 +1,17 @@
{
"env": {
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
"OTEL_METRICS_EXPORTER": "otlp",
"OTEL_LOGS_EXPORTER": "otlp",
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
"OTEL_EXPORTER_OTLP_ENDPOINT": "https://otel.company.com:4317",
"OTEL_EXPORTER_OTLP_HEADERS": "Authorization=Bearer YOUR_API_KEY_HERE",
"OTEL_METRIC_EXPORT_INTERVAL": "60000",
"OTEL_LOGS_EXPORT_INTERVAL": "5000",
"OTEL_LOG_USER_PROMPTS": "1",
"OTEL_METRICS_INCLUDE_SESSION_ID": "true",
"OTEL_METRICS_INCLUDE_VERSION": "true",
"OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
"OTEL_RESOURCE_ATTRIBUTES": "team=TEAM_NAME,environment=production,deployment=enterprise"
}
}

View File

@@ -0,0 +1,16 @@
{
"env": {
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
"OTEL_METRICS_EXPORTER": "otlp",
"OTEL_LOGS_EXPORTER": "otlp",
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
"OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
"OTEL_METRIC_EXPORT_INTERVAL": "60000",
"OTEL_LOGS_EXPORT_INTERVAL": "5000",
"OTEL_LOG_USER_PROMPTS": "1",
"OTEL_METRICS_INCLUDE_SESSION_ID": "true",
"OTEL_METRICS_INCLUDE_VERSION": "true",
"OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
"OTEL_RESOURCE_ATTRIBUTES": "environment=local,deployment=poc"
}
}

View File

@@ -0,0 +1,39 @@
#!/bin/bash
# Start Claude Code Telemetry Stack
echo "Starting Claude Code telemetry stack..."
# Check if Docker is running
if ! docker info > /dev/null 2>&1; then
echo "❌ Error: Docker is not running. Please start Docker Desktop first."
exit 1
fi
# Navigate to telemetry directory
cd ~/.claude/telemetry || exit 1
# Start containers
docker compose up -d
# Wait for services to be ready
echo "Waiting for services to be ready..."
sleep 10
# Check container status
echo ""
echo "Container Status:"
docker compose ps
echo ""
echo "✅ Telemetry stack started!"
echo ""
echo "Access Points:"
echo " - Grafana: http://localhost:3000 (admin/admin)"
echo " - Prometheus: http://localhost:9090"
echo " - Loki: http://localhost:3100"
echo ""
echo "OTEL Endpoints:"
echo " - gRPC: http://localhost:4317"
echo " - HTTP: http://localhost:4318"
echo ""
echo "Next: Restart Claude Code to start sending telemetry data"

View File

@@ -0,0 +1,16 @@
#!/bin/bash
# Stop Claude Code Telemetry Stack
echo "Stopping Claude Code telemetry stack..."
# Navigate to telemetry directory
cd ~/.claude/telemetry || exit 1
# Stop containers
docker compose down
echo "✅ Telemetry stack stopped!"
echo ""
echo "Note: Data is preserved in Docker volumes."
echo "To start again: ./start-telemetry.sh"
echo "To completely remove all data: ./cleanup-telemetry.sh"