Initial commit
14
skills/claude-code/otel-monitoring-setup/CHANGELOG.md
Normal file
# Changelog

## 1.1.0

- Renamed from claude-code-otel-setup to otel-monitoring-setup
- Refactored to Anthropic progressive disclosure pattern
- Updated description with "Use PROACTIVELY when..." format

## 1.0.0

- Initial skill release
- Local PoC mode with Docker stack
- Enterprise mode for centralized infrastructure
- Grafana dashboard imports
558
skills/claude-code/otel-monitoring-setup/README.md
Normal file
@@ -0,0 +1,558 @@
|
||||
# Claude Code OpenTelemetry Setup Skill
|
||||
|
||||
Automated workflow for setting up OpenTelemetry telemetry collection for Claude Code usage monitoring, cost tracking, and productivity analytics.
|
||||
|
||||
**Version:** 1.1.0
|
||||
**Author:** Prometheus Team
|
||||
|
||||
---
|
||||
|
||||
## Features
|
||||
|
||||
- **Mode 1: Local PoC Setup** - Full Docker stack with Grafana dashboards
|
||||
- **Mode 2: Enterprise Setup** - Connect to centralized infrastructure
|
||||
- Automated configuration file generation
|
||||
- Dashboard import with UID detection
|
||||
- Verification and testing procedures
|
||||
- Comprehensive troubleshooting guides
|
||||
|
||||
---
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Prerequisites
|
||||
|
||||
**For Mode 1 (Local PoC):**
|
||||
- Docker Desktop installed and running
|
||||
- Claude Code installed
|
||||
- Write access to `~/.claude/settings.json`
|
||||
|
||||
**For Mode 2 (Enterprise):**
|
||||
- OTEL Collector endpoint URL
|
||||
- Authentication credentials
|
||||
- Write access to `~/.claude/settings.json`
|
||||
|
||||
### Installation
|
||||
|
||||
This skill is designed to be invoked by Claude Code. No manual installation required.
|
||||
|
||||
### Usage
|
||||
|
||||
**Mode 1 - Local PoC Setup:**
|
||||
```
|
||||
"Set up Claude Code telemetry locally"
|
||||
"I want to try OpenTelemetry with Claude Code"
|
||||
"Create a local telemetry stack for me"
|
||||
```
|
||||
|
||||
**Mode 2 - Enterprise Setup:**
|
||||
```
|
||||
"Connect Claude Code to our company OTEL endpoint at otel.company.com:4317"
|
||||
"Set up telemetry for team rollout"
|
||||
"Configure enterprise telemetry"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## What Gets Collected?
|
||||
|
||||
### Metrics
|
||||
- **Session counts and active time** - How much you use Claude Code
|
||||
- **Token usage** - Input, output, cached tokens by model
|
||||
- **API costs** - Spend tracking by model and time
|
||||
- **Lines of code** - Code modifications (added, changed, deleted)
|
||||
- **Commits and PRs** - Git activity tracking
|
||||
|
||||
### Events/Logs
|
||||
- User prompts (if enabled)
|
||||
- Tool executions
|
||||
- API requests
|
||||
- Session lifecycle
|
||||
|
||||
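**Privacy:** User prompt text is logged only when `OTEL_LOG_USER_PROMPTS` is `1` (the settings templates in this README enable it). Metrics include session ID and account UUID attributes when the corresponding `OTEL_METRICS_INCLUDE_*` settings are `true`. Source code content is never collected.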
|
||||
|
||||
---
|
||||
|
||||
## Directory Structure
|
||||
|
||||
```
|
||||
otel-monitoring-setup/
|
||||
├── SKILL.md # Main skill definition
|
||||
├── README.md # This file
|
||||
├── modes/
|
||||
│ ├── mode1-poc-setup.md # Detailed local setup workflow
|
||||
│ └── mode2-enterprise.md # Detailed enterprise setup workflow
|
||||
├── templates/
|
||||
│ ├── docker-compose.yml # Docker Compose configuration
|
||||
│ ├── otel-collector-config.yml # OTEL Collector configuration
|
||||
│ ├── prometheus.yml # Prometheus scrape configuration
|
||||
│ ├── grafana-datasources.yml # Grafana datasource provisioning
|
||||
│ ├── settings.json.local # Local telemetry settings template
|
||||
│ ├── settings.json.enterprise # Enterprise settings template
|
||||
│ ├── start-telemetry.sh # Start script
|
||||
│ └── stop-telemetry.sh # Stop script
|
||||
├── dashboards/
|
||||
│ ├── README.md # Dashboard import guide
|
||||
│ ├── claude-code-overview.json # Comprehensive dashboard
|
||||
│ └── claude-code-simple.json # Simplified dashboard
|
||||
└── data/
|
||||
  ├── metrics-reference.md # Complete metrics documentation
  ├── prometheus-queries.md # Useful PromQL queries
  └── troubleshooting.md # Common issues and solutions
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Mode 1: Local PoC Setup
|
||||
|
||||
**What it does:**
|
||||
- Creates `~/.claude/telemetry/` directory
|
||||
- Generates Docker Compose configuration
|
||||
- Starts 4 containers: OTEL Collector, Prometheus, Loki, Grafana
|
||||
- Updates Claude Code settings.json
|
||||
- Imports Grafana dashboards
|
||||
- Verifies data flow
|
||||
|
||||
**Time:** 5-7 minutes
|
||||
|
||||
**Output:**
|
||||
- Grafana: http://localhost:3000 (admin/admin)
|
||||
- Prometheus: http://localhost:9090
|
||||
- Working dashboards with real data
|
||||
|
||||
**Detailed workflow:** See `modes/mode1-poc-setup.md`
|
||||
|
||||
---
|
||||
|
||||
## Mode 2: Enterprise Setup
|
||||
|
||||
**What it does:**
|
||||
- Collects enterprise OTEL endpoint details
|
||||
- Updates Claude Code settings.json with endpoint and auth
|
||||
- Adds team/environment resource attributes
|
||||
- Tests connectivity (optional)
|
||||
- Provides team rollout documentation
|
||||
|
||||
**Time:** 2-3 minutes
|
||||
|
||||
**Output:**
|
||||
- Claude Code configured to send to central endpoint
|
||||
- Connectivity verified
|
||||
- Team rollout guide generated
|
||||
|
||||
**Detailed workflow:** See `modes/mode2-enterprise.md`
|
||||
|
||||
---
|
||||
|
||||
## Example Dashboards
|
||||
|
||||
### Overview Dashboard
|
||||
|
||||
Includes:
|
||||
- Total Lines of Code (all-time)
|
||||
- Total Cost (24h)
|
||||
- Total Tokens (24h)
|
||||
- Active Time (24h)
|
||||
- Cost Over Time (timeseries)
|
||||
- Token Usage by Type (stacked)
|
||||
- Lines of Code Modified (bar chart)
|
||||
- Commits Created (24h)
|
||||
|
||||
### Custom Queries
|
||||
|
||||
See `data/prometheus-queries.md` for 50+ ready-to-use PromQL queries:
|
||||
- Cost analysis
|
||||
- Token usage
|
||||
- Productivity metrics
|
||||
- Team aggregation
|
||||
- Model comparison
|
||||
- Alerting rules
|
||||
|
||||
---
|
||||
|
||||
## Common Use Cases
|
||||
|
||||
### Individual Developer
|
||||
|
||||
**Goal:** Track personal Claude Code usage and costs
|
||||
|
||||
**Setup:**
|
||||
```
|
||||
Mode 1 (Local PoC)
|
||||
```
|
||||
|
||||
**Access:**
|
||||
- Personal Grafana dashboard at localhost:3000
|
||||
- All data stays local
|
||||
|
||||
---
|
||||
|
||||
### Team Pilot (5-10 Users)
|
||||
|
||||
**Goal:** Aggregate metrics across pilot users
|
||||
|
||||
**Setup:**
|
||||
```
|
||||
Mode 2 (Enterprise)
|
||||
```
|
||||
|
||||
**Architecture:**
|
||||
- Centralized OTEL Collector
|
||||
- Team-level Prometheus/Grafana
|
||||
- Aggregated dashboards
|
||||
|
||||
---
|
||||
|
||||
### Enterprise Rollout (100+ Users)
|
||||
|
||||
**Goal:** Organization-wide cost tracking and productivity analytics
|
||||
|
||||
**Setup:**
|
||||
```
|
||||
Mode 2 (Enterprise) + Managed Infrastructure
|
||||
```
|
||||
|
||||
**Features:**
|
||||
- Department/team/project attribution
|
||||
- Chargeback reporting
|
||||
- Executive dashboards
|
||||
- Trend analysis
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Quick Checks
|
||||
|
||||
**Containers not starting:**
|
||||
```bash
|
||||
docker compose logs
|
||||
```
|
||||
|
||||
**No metrics in Prometheus:**
|
||||
1. Restart Claude Code (telemetry loads at startup)
|
||||
2. Wait 60 seconds (export interval)
|
||||
3. Check OTEL Collector logs: `docker compose logs otel-collector`
|
||||
|
||||
**Dashboard shows "No data":**
|
||||
1. Verify metric names use double prefix: `claude_code_claude_code_*`
|
||||
2. Check time range (top-right corner)
|
||||
3. Verify datasource UID matches
|
||||
|
||||
**Full troubleshooting guide:** See `data/troubleshooting.md`
|
||||
|
||||
---
|
||||
|
||||
## Known Issues
|
||||
|
||||
### Issue 1: 🚨 CRITICAL - Missing OTEL Exporters
|
||||
|
||||
**Description:** Claude Code not sending telemetry even with `CLAUDE_CODE_ENABLE_TELEMETRY=1`
|
||||
|
||||
**Cause:** Missing required `OTEL_METRICS_EXPORTER` and `OTEL_LOGS_EXPORTER` settings
|
||||
|
||||
**Solution:** The skill templates include these by default. **Always verify** they're present in settings.json. See Configuration Reference for details.
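
A quick way to run that check is sketched below. It only greps for the three key names in a settings file, so it confirms presence, not correct values. The function name `check_telemetry_keys` is illustrative and not part of the skill.

```bash
#!/usr/bin/env bash
# Report which required telemetry keys are missing from a settings file.
# Prints one "missing: KEY" line per absent key and returns 1 if any are absent.
check_telemetry_keys() {
  local settings_file="$1"
  local missing=0
  local key
  for key in CLAUDE_CODE_ENABLE_TELEMETRY OTEL_METRICS_EXPORTER OTEL_LOGS_EXPORTER; do
    # Match the quoted key name as it appears in the JSON "env" block
    if ! grep -q "\"$key\"" "$settings_file"; then
      echo "missing: $key"
      missing=1
    fi
  done
  return "$missing"
}
```

Usage: `check_telemetry_keys ~/.claude/settings.json && echo "required keys present"`.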
|
||||
|
||||
---
|
||||
|
||||
### Issue 2: OTEL Collector Deprecated 'address' Field
|
||||
|
||||
**Description:** Collector crashes with "'address' has invalid keys" error
|
||||
|
||||
**Cause:** The `address` field in `service.telemetry.metrics` is deprecated in collector v0.123.0+
|
||||
|
||||
**Solution:** Skill templates have this removed. If using custom config, remove the deprecated field.
|
||||
|
||||
---
|
||||
|
||||
### Issue 3: Metric Double Prefix
|
||||
|
||||
**Description:** Metrics are named `claude_code_claude_code_*` instead of `claude_code_*`
|
||||
|
||||
**Cause:** OTEL Collector Prometheus exporter adds namespace prefix
|
||||
|
||||
**Solution:** This is expected. Dashboards use correct naming.
|
||||
|
||||
---
|
||||
|
||||
### Issue 4: Dashboard Datasource UID Mismatch
|
||||
|
||||
**Description:** Dashboard shows "datasource prometheus not found"
|
||||
|
||||
**Cause:** Dashboard has hardcoded UID that doesn't match your Grafana
|
||||
|
||||
**Solution:** Skill automatically detects and fixes UID during import
|
||||
|
||||
---
|
||||
|
||||
### Issue 5: OTEL Collector Deprecated Exporter
|
||||
|
||||
**Description:** Container fails with "logging exporter has been deprecated"
|
||||
|
||||
**Cause:** Old OTEL configuration
|
||||
|
||||
**Solution:** Skill uses `debug` exporter (not deprecated `logging`)
|
||||
|
||||
---
|
||||
|
||||
## Configuration Reference
|
||||
|
||||
### Settings.json (Local)
|
||||
|
||||
**🚨 CRITICAL REQUIREMENTS:**
|
||||
|
||||
The following settings are **REQUIRED** (not optional) for telemetry to work:
|
||||
- `CLAUDE_CODE_ENABLE_TELEMETRY: "1"` - Enables telemetry system
|
||||
- `OTEL_METRICS_EXPORTER: "otlp"` - **REQUIRED** to send metrics (most common missing setting!)
|
||||
- `OTEL_LOGS_EXPORTER: "otlp"` - **REQUIRED** to send events/logs
|
||||
|
||||
Without `OTEL_METRICS_EXPORTER` and `OTEL_LOGS_EXPORTER`, telemetry will not send even if `CLAUDE_CODE_ENABLE_TELEMETRY=1` is set.
|
||||
|
||||
```json
|
||||
{
|
||||
"env": {
|
||||
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
|
||||
"OTEL_METRICS_EXPORTER": "otlp",
|
||||
"OTEL_LOGS_EXPORTER": "otlp",
|
||||
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
|
||||
"OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
|
||||
"OTEL_METRIC_EXPORT_INTERVAL": "60000",
|
||||
"OTEL_LOGS_EXPORT_INTERVAL": "5000",
|
||||
"OTEL_LOG_USER_PROMPTS": "1",
|
||||
"OTEL_METRICS_INCLUDE_SESSION_ID": "true",
|
||||
"OTEL_METRICS_INCLUDE_VERSION": "true",
|
||||
"OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
|
||||
"OTEL_RESOURCE_ATTRIBUTES": "environment=local,deployment=poc"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Settings.json (Enterprise)
|
||||
|
||||
**Same CRITICAL requirements apply:**
|
||||
- `OTEL_METRICS_EXPORTER: "otlp"` - **REQUIRED!**
|
||||
- `OTEL_LOGS_EXPORTER: "otlp"` - **REQUIRED!**
|
||||
|
||||
```json
|
||||
{
|
||||
"env": {
|
||||
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
|
||||
"OTEL_METRICS_EXPORTER": "otlp",
|
||||
"OTEL_LOGS_EXPORTER": "otlp",
|
||||
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
|
||||
"OTEL_EXPORTER_OTLP_ENDPOINT": "https://otel.company.com:4317",
|
||||
"OTEL_EXPORTER_OTLP_HEADERS": "Authorization=Bearer YOUR_API_KEY",
|
||||
"OTEL_METRIC_EXPORT_INTERVAL": "60000",
|
||||
"OTEL_LOGS_EXPORT_INTERVAL": "5000",
|
||||
"OTEL_LOG_USER_PROMPTS": "1",
|
||||
"OTEL_METRICS_INCLUDE_SESSION_ID": "true",
|
||||
"OTEL_METRICS_INCLUDE_VERSION": "true",
|
||||
"OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
|
||||
"OTEL_RESOURCE_ATTRIBUTES": "team=platform,environment=production"
|
||||
}
|
||||
}
|
||||
```
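
`OTEL_RESOURCE_ATTRIBUTES` takes comma-separated `key=value` pairs, as in the templates above. A minimal sketch for assembling that string from arguments follows. The helper name `build_resource_attributes` is hypothetical, and rejecting whitespace is a conservative assumption rather than a documented rule of this skill.

```bash
#!/usr/bin/env bash
# Join key=value arguments into one comma-separated string.
# Returns non-zero if an argument is not key=value or contains whitespace.
build_resource_attributes() {
  local pair
  for pair in "$@"; do
    case "$pair" in
      *[[:space:]]*) echo "invalid (whitespace): $pair" >&2; return 1 ;;
      ?*=?*) ;;  # at least one character on each side of "="
      *) echo "invalid (expected key=value): $pair" >&2; return 1 ;;
    esac
  done
  local IFS=,
  echo "$*"  # "$*" joins the arguments with the first character of IFS
}

build_resource_attributes team=platform environment=production
# prints: team=platform,environment=production
```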
|
||||
|
||||
---
|
||||
|
||||
## Management
|
||||
|
||||
### Start Telemetry Stack (Mode 1)
|
||||
|
||||
```bash
|
||||
~/.claude/telemetry/start-telemetry.sh
|
||||
```
|
||||
|
||||
### Stop Telemetry Stack (Mode 1)
|
||||
|
||||
```bash
|
||||
~/.claude/telemetry/stop-telemetry.sh
|
||||
```
|
||||
|
||||
### Check Status
|
||||
|
||||
```bash
|
||||
docker compose ps
|
||||
```
|
||||
|
||||
### View Logs
|
||||
|
||||
```bash
|
||||
docker compose logs -f
|
||||
```
|
||||
|
||||
### Restart Services
|
||||
|
||||
```bash
|
||||
docker compose restart
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Data Retention
|
||||
|
||||
**Default:** 15 days in Prometheus
|
||||
|
||||
**Adjust retention:**
|
||||
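Set the flags on the Prometheus service's `command` in `docker-compose.yml`; retention is controlled by command-line flags, not by `prometheus.yml`: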
|
||||
```yaml
|
||||
command:
|
||||
- '--storage.tsdb.retention.time=90d'
|
||||
- '--storage.tsdb.retention.size=50GB'
|
||||
```
|
||||
|
||||
**Disk usage:** ~1-2 MB per day per active user
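
As a rough worked example of that estimate, the following multiplies users, retention days, and per-user daily volume. The function name and the sample figures (100 users, 90 days) are illustrative; only the 1-2 MB/day range comes from this README.

```bash
#!/usr/bin/env bash
# Estimate Prometheus disk usage in MB: users * retention days * MB per user per day.
estimate_prometheus_disk_mb() {
  local users="$1" retention_days="$2" mb_per_user_day="$3"
  echo $(( users * retention_days * mb_per_user_day ))
}

estimate_prometheus_disk_mb 100 90 2   # upper bound: prints 18000 (about 18 GB)
```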
|
||||
|
||||
---
|
||||
|
||||
## Security Considerations
|
||||
|
||||
### Local Setup (Mode 1)
|
||||
|
||||
- Grafana accessible only on localhost
|
||||
- Default credentials: admin/admin (change after first login)
|
||||
- No external network exposure
|
||||
- Data stored in Docker volumes
|
||||
|
||||
### Enterprise Setup (Mode 2)
|
||||
|
||||
- Use HTTPS endpoints
|
||||
- Store API keys securely (environment variables, secrets manager)
|
||||
- Enable mTLS for production
|
||||
- Tag metrics with team/project for proper attribution
|
||||
|
||||
---
|
||||
|
||||
## Performance Tuning
|
||||
|
||||
### Reduce OTEL Collector Memory
|
||||
|
||||
Edit `otel-collector-config.yml`:
|
||||
```yaml
|
||||
processors:
  memory_limiter:
    limit_mib: 256 # Lower this value to cap collector memory
|
||||
```
|
||||
|
||||
### Reduce Prometheus Retention
|
||||
|
||||
Edit `docker-compose.yml`:
|
||||
```yaml
|
||||
command:
|
||||
- '--storage.tsdb.retention.time=7d' # Reduce from 15d
|
||||
```
|
||||
|
||||
### Optimize Dashboard Queries
|
||||
|
||||
- Use recording rules for expensive queries
|
||||
- Reduce dashboard time ranges
|
||||
- Increase refresh intervals
|
||||
|
||||
See `data/prometheus-queries.md` for recording rule examples
|
||||
|
||||
---
|
||||
|
||||
## Integration Examples
|
||||
|
||||
### Cost Alerts (PagerDuty/Slack)
|
||||
|
||||
```yaml
|
||||
# Prometheus alerting rules file (loaded via rule_files in prometheus.yml);
# Alertmanager then routes the fired alert to PagerDuty or Slack.
groups:
  - name: claude_code_cost
    rules:
      - alert: HighDailyCost
        expr: sum(increase(claude_code_claude_code_cost_usage_USD_total[24h])) > 100
        annotations:
          summary: "Claude Code daily cost exceeded $100"
|
||||
```
|
||||
|
||||
### Weekly Cost Reports (Email)
|
||||
|
||||
Use Grafana Reporting:
|
||||
1. Create dashboard with cost panels
|
||||
2. Set up email delivery
|
||||
3. Schedule weekly reports
|
||||
|
||||
### Chargeback Integration
|
||||
|
||||
Export metrics to data warehouse:
|
||||
```yaml
|
||||
# Use Prometheus remote write
|
||||
remote_write:
|
||||
- url: "https://datawarehouse.company.com/prometheus"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Contributing
|
||||
|
||||
This skill is maintained by the Prometheus Team.
|
||||
|
||||
**Feedback:** Open an issue or contact the team
|
||||
|
||||
**Improvements:** Submit pull requests with enhancements
|
||||
|
||||
---
|
||||
|
||||
## Changelog
|
||||
|
||||
### Version 1.1.0 (2025-11-01)
|
||||
|
||||
**Critical Updates from Production Testing:**
|
||||
- 🚨 **CRITICAL FIX**: Documented missing OTEL_METRICS_EXPORTER/OTEL_LOGS_EXPORTER as #1 cause of "telemetry not working"
|
||||
- ✅ Added deprecated `address` field fix for OTEL Collector v0.123.0+
|
||||
- ✅ Enhanced troubleshooting with prominent exporter configuration section
|
||||
- ✅ Updated all documentation with CRITICAL warnings for required settings
|
||||
- ✅ Added comprehensive Known Issues section covering production scenarios
|
||||
- ✅ Verified templates have correct exporter configuration
|
||||
|
||||
**What Changed:**
|
||||
- Troubleshooting guide now prioritizes missing exporters as root cause
|
||||
- Known Issues expanded from 3 to 5 issues with production learnings
|
||||
- Configuration Reference includes prominent CRITICAL requirements callout
|
||||
- SKILL.md Important Reminders section updated with exporter warnings
|
||||
|
||||
### Version 1.0.0 (2025-10-31)
|
||||
|
||||
**Initial Release:**
|
||||
- Mode 1: Local PoC setup with full Docker stack
|
||||
- Mode 2: Enterprise setup with centralized endpoint
|
||||
- Comprehensive documentation and troubleshooting
|
||||
- Dashboard templates with correct metric naming
|
||||
- Automated UID detection and replacement
|
||||
|
||||
**Known Issues Fixed:**
|
||||
- ✅ OTEL Collector deprecated logging exporter
|
||||
- ✅ Dashboard datasource UID mismatch
|
||||
- ✅ Metric double prefix handling
|
||||
- ✅ Loki exporter configuration
|
||||
|
||||
---
|
||||
|
||||
## Additional Resources
|
||||
|
||||
- **Claude Code Monitoring Docs:** https://docs.claude.com/claude-code/monitoring
|
||||
- **OpenTelemetry Docs:** https://opentelemetry.io/docs/
|
||||
- **Prometheus Docs:** https://prometheus.io/docs/
|
||||
- **Grafana Docs:** https://grafana.com/docs/
|
||||
|
||||
---
|
||||
|
||||
## License
|
||||
|
||||
Internal use within Elsevier organization.
|
||||
|
||||
---
|
||||
|
||||
## Support
|
||||
|
||||
**Issues?** Check `data/troubleshooting.md` first
|
||||
|
||||
**Questions?** Contact Prometheus Team or #claude-code-telemetry channel
|
||||
|
||||
**Emergency?** Rollback with: `cp ~/.claude/settings.json.backup ~/.claude/settings.json`
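
The rollback command assumes a `settings.json.backup` file exists. A minimal sketch of creating one before any change is shown here; the function name is hypothetical and the skill's own backup step may differ.

```bash
#!/usr/bin/env bash
# Copy the settings file to <file>.backup, preserving mode and timestamps.
# Defaults to ~/.claude/settings.json when no path is given.
backup_settings() {
  local settings_file="${1:-$HOME/.claude/settings.json}"
  if [ -f "$settings_file" ]; then
    cp -p "$settings_file" "$settings_file.backup"
    echo "backed up: $settings_file.backup"
  else
    echo "nothing to back up: $settings_file not found"
  fi
}
```

Run `backup_settings` before editing; each run overwrites the previous backup.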
|
||||
|
||||
---
|
||||
|
||||
**Ready to monitor your Claude Code usage!** 🚀
|
||||
150
skills/claude-code/otel-monitoring-setup/SKILL.md
Normal file
@@ -0,0 +1,150 @@
|
||||
---
|
||||
name: otel-monitoring-setup
|
||||
description: Use PROACTIVELY when setting up OpenTelemetry monitoring for Claude Code usage tracking, cost analysis, or productivity metrics. Provides local PoC mode (full Docker stack with Grafana) and enterprise mode (centralized infrastructure). Configures telemetry collection, imports dashboards, and verifies data flow. Not for non-Claude telemetry or custom metric definitions.
|
||||
---
|
||||
|
||||
# Claude Code OpenTelemetry Setup
|
||||
|
||||
Automated workflow for setting up OpenTelemetry telemetry collection for Claude Code usage monitoring, cost tracking, and productivity analytics.
|
||||
|
||||
## Quick Decision Matrix
|
||||
|
||||
| User Request | Mode | Action |
|
||||
|--------------|------|--------|
|
||||
| "Set up telemetry locally" | Mode 1 | Full PoC stack |
|
||||
| "I want to try OpenTelemetry" | Mode 1 | Full PoC stack |
|
||||
| "Connect to company endpoint" | Mode 2 | Enterprise config |
|
||||
| "Set up for team rollout" | Mode 2 | Enterprise + docs |
|
||||
| "Dashboard not working" | Troubleshoot | See known issues |
|
||||
|
||||
## Mode 1: Local PoC Setup
|
||||
|
||||
**Goal**: Complete local telemetry stack for individual developer
|
||||
|
||||
**Creates**:
|
||||
- OpenTelemetry Collector (receives data)
|
||||
- Prometheus (stores metrics)
|
||||
- Loki (stores logs)
|
||||
- Grafana (dashboards)
|
||||
|
||||
**Prerequisites**:
|
||||
- Docker Desktop running
|
||||
- 2GB free disk space
|
||||
- Write access to ~/.claude/
|
||||
|
||||
**Time**: 5-7 minutes
|
||||
|
||||
**Workflow**: `modes/mode1-poc-setup.md`
|
||||
|
||||
**Output**:
|
||||
- Grafana at http://localhost:3000 (admin/admin)
|
||||
- Management scripts in ~/.claude/telemetry/
|
||||
|
||||
## Mode 2: Enterprise Setup
|
||||
|
||||
**Goal**: Connect Claude Code to centralized company infrastructure
|
||||
|
||||
**Required Info**:
|
||||
- OTEL Collector endpoint URL
|
||||
- Authentication (API key or certificates)
|
||||
- Team/department identifier
|
||||
|
||||
**Time**: 2-3 minutes
|
||||
|
||||
**Workflow**: `modes/mode2-enterprise.md`
|
||||
|
||||
**Output**:
|
||||
- settings.json configured for central endpoint
|
||||
- Team rollout documentation
|
||||
|
||||
## Critical Configuration
|
||||
|
||||
**REQUIRED in settings.json** (without these, telemetry won't work):
|
||||
|
||||
```json
|
||||
{
|
||||
"env": {
|
||||
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
|
||||
"OTEL_METRICS_EXPORTER": "otlp",
|
||||
"OTEL_LOGS_EXPORTER": "otlp",
|
||||
"OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Must restart Claude Code after settings changes!**
|
||||
|
||||
## Pre-Flight Check
|
||||
|
||||
Always run before setup:
|
||||
|
||||
```bash
|
||||
# Verify Docker is running
|
||||
docker info > /dev/null 2>&1 || echo "Start Docker Desktop first"
|
||||
|
||||
# Check available ports
|
||||
for port in 3000 4317 4318 8889 9090; do
|
||||
lsof -i :$port > /dev/null 2>&1 && echo "Port $port in use"
|
||||
done
|
||||
|
||||
# Check disk space (need 2GB)
|
||||
df -h ~/.claude
|
||||
```
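
The disk check above only prints usage. A sketch of turning it into a pass/fail test follows, assuming the POSIX `df -Pk` output format (available space in the fourth column, in KiB). `has_free_space` is an illustrative name, not part of the skill.

```bash
#!/usr/bin/env bash
# Succeed if the filesystem holding DIR has at least REQUIRED_KB KiB available.
has_free_space() {
  local dir="$1" required_kb="$2" available_kb
  # -P forces one line per filesystem; -k reports 1024-byte blocks
  available_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  [ "$available_kb" -ge "$required_kb" ]
}
```

For the 2GB requirement: `has_free_space "$HOME/.claude" 2097152 || echo "Less than 2GB free"` (2 * 1024 * 1024 KiB).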
|
||||
|
||||
## Metrics Collected
|
||||
|
||||
- Session counts and active time
|
||||
- Token usage (input/output/cached)
|
||||
- API costs by model (USD)
|
||||
- Lines of code modified
|
||||
- Commits and PRs created
|
||||
|
||||
## Management Commands
|
||||
|
||||
```bash
|
||||
# Start telemetry stack
|
||||
~/.claude/telemetry/start-telemetry.sh
|
||||
|
||||
# Stop (preserves data)
|
||||
~/.claude/telemetry/stop-telemetry.sh
|
||||
|
||||
# Full cleanup (removes all data)
|
||||
~/.claude/telemetry/cleanup-telemetry.sh
|
||||
```
|
||||
|
||||
## Common Issues
|
||||
|
||||
### No Data in Dashboard
|
||||
1. Check OTEL_METRICS_EXPORTER and OTEL_LOGS_EXPORTER are set
|
||||
2. Verify Claude Code was restarted
|
||||
3. See `reference/known-issues.md`
|
||||
|
||||
### Datasource Not Found
|
||||
Dashboard has wrong UID. Detect your UID:
|
||||
```bash
|
||||
curl -s -u admin:admin http://localhost:3000/api/datasources | jq -r '.[] | select(.type=="prometheus") | .uid'
|
||||
```
|
||||
Replace in dashboard JSON and re-import.
|
||||
|
||||
### Metric Names Double Prefix
|
||||
Metrics use `claude_code_claude_code_*` format. Update dashboard queries accordingly.
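
One way to update queries in bulk is a `sed` filter that rewrites any `claude_code_` metric prefix to the double form. This is a sketch: the function name `add_double_prefix` is hypothetical, and it assumes every `claude_code_*` identifier in the input is a metric name that should carry the double prefix. Names that already have the double prefix are left unchanged.

```bash
#!/usr/bin/env bash
# Read queries or dashboard JSON on stdin and write the result to stdout.
# One or more leading "claude_code_" prefixes collapse to exactly two.
add_double_prefix() {
  sed -E 's/(^|[^A-Za-z0-9_])(claude_code_)+/\1claude_code_claude_code_/g'
}
```

Usage: `add_double_prefix < dashboard.json > dashboard-fixed.json`, then review the diff before importing.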
|
||||
|
||||
## Reference Documentation
|
||||
|
||||
- `modes/mode1-poc-setup.md` - Detailed local setup workflow
|
||||
- `modes/mode2-enterprise.md` - Enterprise configuration steps
|
||||
- `reference/known-issues.md` - Troubleshooting guide
|
||||
- `templates/` - Configuration file templates
|
||||
- `dashboards/` - Grafana dashboard JSON files
|
||||
|
||||
## Safety Checklist
|
||||
|
||||
- [ ] Backup settings.json before modification
|
||||
- [ ] Verify Docker is running first
|
||||
- [ ] Check ports are available
|
||||
- [ ] Test data flow before declaring success
|
||||
- [ ] Provide cleanup instructions
|
||||
|
||||
---
|
||||
|
||||
**Version**: 1.1.0 | **Author**: Prometheus Team
|
||||
160
skills/claude-code/otel-monitoring-setup/dashboards/README.md
Normal file
@@ -0,0 +1,160 @@
|
||||
# Grafana Dashboard Templates
|
||||
|
||||
This directory contains pre-configured Grafana dashboards for Claude Code telemetry.
|
||||
|
||||
## Available Dashboards
|
||||
|
||||
### 1. claude-code-overview.json
|
||||
**Comprehensive dashboard with all key metrics**
|
||||
|
||||
**Panels:**
|
||||
- Total Lines of Code (all-time counter)
|
||||
- Total Cost (24h rolling window)
|
||||
- Total Tokens (24h rolling window)
|
||||
- Active Time (24h rolling window)
|
||||
- Cost Over Time (per hour rate)
|
||||
- Token Usage by Type (stacked timeseries)
|
||||
- Lines of Code Modified (bar chart)
|
||||
- Commits Created (24h counter)
|
||||
|
||||
**Metrics Used:**
|
||||
- `claude_code_claude_code_lines_of_code_count_total`
|
||||
- `claude_code_claude_code_cost_usage_USD_total`
|
||||
- `claude_code_claude_code_token_usage_tokens_total`
|
||||
- `claude_code_claude_code_active_time_seconds_total`
|
||||
- `claude_code_claude_code_commit_count_total`
|
||||
|
||||
**Note:** This dashboard uses the correct double-prefix metric names.
|
||||
|
||||
### 2. claude-code-simple.json
|
||||
**Simplified dashboard for quick overview**
|
||||
|
||||
**Panels:**
|
||||
- Active Sessions
|
||||
- Total Cost (24h)
|
||||
- Total Tokens (24h)
|
||||
- Active Time (24h)
|
||||
- Cost Over Time
|
||||
- Token Usage by Type
|
||||
|
||||
**Use Case:** Lightweight dashboard for basic monitoring without detailed breakdowns.
|
||||
|
||||
## Importing Dashboards
|
||||
|
||||
### Method 1: Grafana UI (Recommended)
|
||||
|
||||
1. Access Grafana: http://localhost:3000
|
||||
2. Login with admin/admin
|
||||
3. Go to: Dashboards → New → Import
|
||||
4. Click "Upload JSON file"
|
||||
5. Select the dashboard JSON file
|
||||
6. Click "Import"
|
||||
|
||||
### Method 2: Grafana API
|
||||
|
||||
```bash
|
||||
# Get the datasource UID first
|
||||
DATASOURCE_UID=$(curl -s -u admin:admin http://localhost:3000/api/datasources | jq -r '.[] | select(.type=="prometheus") | .uid')
|
||||
|
||||
# Set the datasource UID and wrap the dashboard in the {dashboard, overwrite} payload that /api/dashboards/db expects
jq --arg uid "$DATASOURCE_UID" '
  {dashboard: walk(if type == "object" and (.datasource | type) == "object" and .datasource.type == "prometheus" then .datasource.uid = $uid else . end), overwrite: true}
' claude-code-overview.json > dashboard-updated.json
|
||||
|
||||
# Import dashboard
|
||||
curl -X POST http://localhost:3000/api/dashboards/db \
|
||||
-u admin:admin \
|
||||
-H "Content-Type: application/json" \
|
||||
-d @dashboard-updated.json
|
||||
```
|
||||
|
||||
## Datasource UID Configuration
|
||||
|
||||
**Important:** The dashboards have a hardcoded Prometheus datasource UID: `PBFA97CFB590B2093`
|
||||
|
||||
If your Grafana instance has a different UID, you need to replace it:
|
||||
|
||||
```bash
|
||||
# Find your datasource UID
|
||||
curl -s -u admin:admin http://localhost:3000/api/datasources | jq '.[] | select(.type=="prometheus") | {name, uid}'
|
||||
|
||||
# Replace UID in dashboard
|
||||
YOUR_UID="YOUR_ACTUAL_UID_HERE"
|
||||
cat claude-code-overview.json | sed "s/PBFA97CFB590B2093/$YOUR_UID/g" > claude-code-overview-fixed.json
|
||||
|
||||
# Import the fixed version
|
||||
```
|
||||
|
||||
The skill handles this automatically during Mode 1 setup!
|
||||
|
||||
## Customizing Dashboards
|
||||
|
||||
### Adding Custom Panels
|
||||
|
||||
Use these PromQL queries as templates:
|
||||
|
||||
**Total Tokens by Model:**
|
||||
```promql
|
||||
sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
|
||||
```
|
||||
|
||||
**Cost per Session:**
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
/
sum(increase(claude_code_claude_code_session_count_total[24h]))
|
||||
```
|
||||
|
||||
**Lines of Code per Hour:**
|
||||
```promql
|
||||
rate(claude_code_claude_code_lines_of_code_count_total[5m]) * 3600
|
||||
```
|
||||
|
||||
**Average Session Duration:**
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_active_time_seconds_total[24h]))
/
sum(increase(claude_code_claude_code_session_count_total[24h]))
|
||||
```
|
||||
|
||||
### Time Range Recommendations
|
||||
|
||||
- **Real-time monitoring:** Last 15 minutes, 30s refresh
|
||||
- **Daily review:** Last 24 hours, 1m refresh
|
||||
- **Weekly analysis:** Last 7 days, 5m refresh
|
||||
- **Monthly reports:** Last 30 days, 15m refresh
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Dashboard Shows "No Data"
|
||||
|
||||
1. **Check that Grafana is up:**
|
||||
```bash
|
||||
curl -s http://localhost:3000/api/health | jq .
|
||||
```
|
||||
|
||||
2. **Verify Prometheus has data:**
|
||||
```bash
|
||||
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code
|
||||
```
|
||||
|
||||
3. **Check metric naming:**
|
||||
- Ensure queries use double prefix: `claude_code_claude_code_*`
|
||||
- Not single prefix: `claude_code_*`
|
||||
|
||||
### Dashboard Shows "Datasource Not Found"
|
||||
|
||||
- Your datasource UID doesn't match the dashboard
|
||||
- Follow the "Datasource UID Configuration" section above
|
||||
|
||||
### Panels Show Different Time Ranges
|
||||
|
||||
- Set dashboard time range at top-right
|
||||
- Individual panels inherit from dashboard unless overridden
|
||||
- Check panel settings: Edit → Query Options → Time Range
|
||||
|
||||
## Additional Resources
|
||||
|
||||
- **Metric Reference:** See `../data/metrics-reference.md`
|
||||
- **PromQL Queries:** See `../data/prometheus-queries.md`
|
||||
- **Grafana Docs:** https://grafana.com/docs/grafana/latest/
|
||||
@@ -0,0 +1,391 @@
|
||||
{
|
||||
"title": "Claude Code - Overview (Working)",
|
||||
"description": "High-level overview of Claude Code usage, costs, and performance",
|
||||
"tags": ["claude-code", "overview"],
|
||||
"timezone": "browser",
|
||||
"schemaVersion": 42,
|
||||
"version": 1,
|
||||
"refresh": "30s",
|
||||
"panels": [
|
||||
{
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"color": {"mode": "thresholds"},
|
||||
"mappings": [],
|
||||
"thresholds": {
|
||||
"mode": "absolute",
|
||||
"steps": [
|
||||
{"color": "green", "value": null},
|
||||
{"color": "yellow", "value": 10},
|
||||
{"color": "red", "value": 50}
|
||||
]
|
||||
},
|
||||
"unit": "short"
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
|
||||
"id": 1,
|
||||
"options": {
|
||||
"colorMode": "value",
|
||||
"graphMode": "area",
|
||||
"justifyMode": "auto",
|
||||
"orientation": "auto",
|
||||
"percentChangeColorMode": "standard",
|
||||
"reduceOptions": {
|
||||
"calcs": ["lastNotNull"],
|
||||
"fields": "",
|
||||
"values": false
|
||||
},
|
||||
"showPercentChange": false,
|
||||
"textMode": "auto",
|
||||
"wideLayout": true
|
||||
},
|
||||
"pluginVersion": "12.2.1",
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
|
||||
"expr": "claude_code_claude_code_lines_of_code_count_total",
|
||||
"refId": "A"
|
||||
}
|
||||
],
|
||||
"title": "Total Lines of Code",
|
||||
"type": "stat"
|
||||
},
|
||||
{
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"color": {"mode": "thresholds"},
|
||||
"decimals": 2,
|
||||
"mappings": [],
|
||||
"thresholds": {
|
||||
"mode": "absolute",
|
||||
"steps": [
|
||||
{"color": "green", "value": null},
|
||||
{"color": "yellow", "value": 5},
|
||||
{"color": "red", "value": 10}
|
||||
]
|
||||
},
|
||||
"unit": "currencyUSD"
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"gridPos": {"h": 4, "w": 6, "x": 6, "y": 0},
|
||||
"id": 2,
|
||||
"options": {
|
||||
"colorMode": "value",
|
||||
"graphMode": "area",
|
||||
"justifyMode": "auto",
|
||||
"orientation": "auto",
|
||||
"percentChangeColorMode": "standard",
|
||||
"reduceOptions": {
|
||||
"calcs": ["lastNotNull"],
|
||||
"fields": "",
|
||||
"values": false
|
||||
},
|
||||
"showPercentChange": false,
|
||||
"textMode": "auto",
|
||||
"wideLayout": true
|
||||
},
|
||||
"pluginVersion": "12.2.1",
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
|
||||
"expr": "increase(claude_code_claude_code_cost_usage_USD_total[24h])",
|
||||
"refId": "A"
|
||||
}
|
||||
],
|
||||
"title": "Total Cost (24h)",
|
||||
"type": "stat"
|
||||
},
|
||||
{
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"color": {"mode": "thresholds"},
|
||||
"decimals": 0,
|
||||
"mappings": [],
|
||||
"thresholds": {
|
||||
"mode": "absolute",
|
||||
"steps": [{"color": "green", "value": null}]
|
||||
},
|
||||
"unit": "short"
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"gridPos": {"h": 4, "w": 6, "x": 12, "y": 0},
|
||||
"id": 3,
|
||||
"options": {
|
||||
"colorMode": "value",
|
||||
"graphMode": "area",
|
||||
"justifyMode": "auto",
|
||||
"orientation": "auto",
|
||||
"percentChangeColorMode": "standard",
|
||||
"reduceOptions": {
|
||||
"calcs": ["lastNotNull"],
|
||||
"fields": "",
|
||||
"values": false
|
||||
},
|
||||
"showPercentChange": false,
|
||||
"textMode": "auto",
|
||||
"wideLayout": true
|
||||
},
|
||||
"pluginVersion": "12.2.1",
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
|
||||
"expr": "increase(claude_code_claude_code_token_usage_tokens_total[24h])",
|
||||
"refId": "A"
|
||||
}
|
||||
],
|
||||
"title": "Total Tokens (24h)",
|
||||
"type": "stat"
|
||||
},
|
||||
{
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"color": {"mode": "palette-classic"},
|
||||
"decimals": 1,
|
||||
"mappings": [],
|
||||
"thresholds": {
|
||||
"mode": "absolute",
|
||||
"steps": [{"color": "green", "value": null}]
|
||||
},
|
||||
"unit": "h"
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"gridPos": {"h": 4, "w": 6, "x": 18, "y": 0},
|
||||
"id": 4,
|
||||
"options": {
|
||||
"colorMode": "value",
|
||||
"graphMode": "area",
|
||||
"justifyMode": "auto",
|
||||
"orientation": "auto",
|
||||
"percentChangeColorMode": "standard",
|
||||
"reduceOptions": {
|
||||
"calcs": ["lastNotNull"],
|
||||
"fields": "",
|
||||
"values": false
|
||||
},
|
||||
"showPercentChange": false,
|
||||
"textMode": "auto",
|
||||
"wideLayout": true
|
||||
},
|
||||
"pluginVersion": "12.2.1",
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
|
||||
"expr": "increase(claude_code_claude_code_active_time_seconds_total[24h]) / 3600",
|
||||
"refId": "A"
|
||||
}
|
||||
],
|
||||
"title": "Active Time (24h)",
|
||||
"type": "stat"
|
||||
},
|
||||
{
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"color": {"mode": "palette-classic"},
|
||||
"custom": {
|
||||
"axisBorderShow": false,
|
||||
"axisCenteredZero": false,
|
||||
"axisColorMode": "text",
|
||||
"axisLabel": "",
|
||||
"axisPlacement": "auto",
|
||||
"barAlignment": 0,
|
||||
"barWidthFactor": 0.6,
|
||||
"drawStyle": "line",
|
||||
"fillOpacity": 10,
|
||||
"gradientMode": "none",
|
||||
"hideFrom": {"legend": false, "tooltip": false, "viz": false},
|
||||
"insertNulls": false,
|
||||
"lineInterpolation": "linear",
|
||||
"lineWidth": 2,
|
||||
"pointSize": 5,
|
||||
"scaleDistribution": {"type": "linear"},
|
||||
"showPoints": "auto",
|
||||
"spanNulls": false,
|
||||
"stacking": {"group": "A", "mode": "none"},
|
||||
"thresholdsStyle": {"mode": "off"}
|
||||
},
|
||||
"mappings": [],
|
||||
"thresholds": {
|
||||
"mode": "absolute",
|
||||
"steps": [{"color": "green", "value": null}]
|
||||
},
|
||||
"unit": "currencyUSD"
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
|
||||
"id": 5,
|
||||
"options": {
|
||||
"legend": {"calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true},
|
||||
"tooltip": {"mode": "single", "sort": "none"}
|
||||
},
|
||||
"pluginVersion": "12.2.1",
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
|
||||
"expr": "rate(claude_code_claude_code_cost_usage_USD_total[5m]) * 3600",
|
||||
"legendFormat": "Cost per hour",
|
||||
"refId": "A"
|
||||
}
|
||||
],
|
||||
"title": "Cost Over Time (per hour)",
|
||||
"type": "timeseries"
|
||||
},
|
||||
{
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"color": {"mode": "palette-classic"},
|
||||
"custom": {
|
||||
"axisBorderShow": false,
|
||||
"axisCenteredZero": false,
|
||||
"axisColorMode": "text",
|
||||
"axisLabel": "",
|
||||
"axisPlacement": "auto",
|
||||
"barAlignment": 0,
|
||||
"barWidthFactor": 0.6,
|
||||
"drawStyle": "line",
|
||||
"fillOpacity": 20,
|
||||
"gradientMode": "none",
|
||||
"hideFrom": {"legend": false, "tooltip": false, "viz": false},
|
||||
"insertNulls": false,
|
||||
"lineInterpolation": "linear",
|
||||
"lineWidth": 2,
|
||||
"pointSize": 5,
|
||||
"scaleDistribution": {"type": "linear"},
|
||||
"showPoints": "auto",
|
||||
"spanNulls": false,
|
||||
"stacking": {"group": "A", "mode": "normal"},
|
||||
"thresholdsStyle": {"mode": "off"}
|
||||
},
|
||||
"mappings": [],
|
||||
"thresholds": {
|
||||
"mode": "absolute",
|
||||
"steps": [{"color": "green", "value": null}]
|
||||
},
|
||||
"unit": "short"
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 4},
|
||||
"id": 6,
|
||||
"options": {
|
||||
"legend": {"calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true},
|
||||
"tooltip": {"mode": "single", "sort": "none"}
|
||||
},
|
||||
"pluginVersion": "12.2.1",
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
|
||||
"expr": "sum by (type) (rate(claude_code_claude_code_token_usage_tokens_total[5m]) * 60)",
|
||||
"legendFormat": "{{type}}",
|
||||
"refId": "A"
|
||||
}
|
||||
],
|
||||
"title": "Token Usage by Type",
|
||||
"type": "timeseries"
|
||||
},
|
||||
{
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"color": {"mode": "palette-classic"},
|
||||
"custom": {
|
||||
"axisBorderShow": false,
|
||||
"axisCenteredZero": false,
|
||||
"axisColorMode": "text",
|
||||
"axisLabel": "",
|
||||
"axisPlacement": "auto",
|
||||
"barAlignment": 0,
|
||||
"barWidthFactor": 0.6,
|
||||
"drawStyle": "bars",
|
||||
"fillOpacity": 80,
|
||||
"gradientMode": "none",
|
||||
"hideFrom": {"legend": false, "tooltip": false, "viz": false},
|
||||
"insertNulls": false,
|
||||
"lineInterpolation": "linear",
|
||||
"lineWidth": 1,
|
||||
"pointSize": 5,
|
||||
"scaleDistribution": {"type": "linear"},
|
||||
"showPoints": "never",
|
||||
"spanNulls": false,
|
||||
"stacking": {"group": "A", "mode": "none"},
|
||||
"thresholdsStyle": {"mode": "off"}
|
||||
},
|
||||
"mappings": [],
|
||||
"thresholds": {
|
||||
"mode": "absolute",
|
||||
"steps": [{"color": "green", "value": null}]
|
||||
},
|
||||
"unit": "short"
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 12},
|
||||
"id": 7,
|
||||
"options": {
|
||||
"legend": {"calcs": [], "displayMode": "list", "placement": "bottom", "showLegend": true},
|
||||
"tooltip": {"mode": "single", "sort": "none"}
|
||||
},
|
||||
"pluginVersion": "12.2.1",
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
|
||||
"expr": "sum by (type) (rate(claude_code_claude_code_lines_of_code_count_total[5m]) * 60)",
|
||||
"legendFormat": "{{type}}",
|
||||
"refId": "A"
|
||||
}
|
||||
],
|
||||
"title": "Lines of Code Modified",
|
||||
"type": "timeseries"
|
||||
},
|
||||
{
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"color": {"mode": "palette-classic"},
|
||||
"decimals": 0,
|
||||
"mappings": [],
|
||||
"thresholds": {
|
||||
"mode": "absolute",
|
||||
"steps": [{"color": "green", "value": null}]
|
||||
},
|
||||
"unit": "short"
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"gridPos": {"h": 6, "w": 12, "x": 12, "y": 12},
|
||||
"id": 10,
|
||||
"options": {
|
||||
"colorMode": "value",
|
||||
"graphMode": "area",
|
||||
"justifyMode": "auto",
|
||||
"orientation": "auto",
|
||||
"percentChangeColorMode": "standard",
|
||||
"reduceOptions": {
|
||||
"calcs": ["lastNotNull"],
|
||||
"fields": "",
|
||||
"values": false
|
||||
},
|
||||
"showPercentChange": false,
|
||||
"textMode": "auto",
|
||||
"wideLayout": true
|
||||
},
|
||||
"pluginVersion": "12.2.1",
|
||||
"targets": [
|
||||
{
|
||||
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
|
||||
"expr": "increase(claude_code_claude_code_commit_count_total[24h])",
|
||||
"refId": "A"
|
||||
}
|
||||
],
|
||||
"title": "Commits Created (24h)",
|
||||
"type": "stat"
|
||||
}
|
||||
],
|
||||
"time": {"from": "now-6h", "to": "now"},
|
||||
"timepicker": {},
|
||||
"timezone": "browser",
|
||||
"version": 1
|
||||
}
|
||||
@@ -0,0 +1,179 @@
|
||||
{
|
||||
"title": "Claude Code - Overview",
|
||||
"description": "High-level overview of Claude Code usage, costs, and performance",
|
||||
"tags": ["claude-code", "overview"],
|
||||
"timezone": "browser",
|
||||
"schemaVersion": 38,
|
||||
"version": 1,
|
||||
"refresh": "30s",
|
||||
"panels": [
|
||||
{
|
||||
"id": 1,
|
||||
"gridPos": { "h": 4, "w": 6, "x": 0, "y": 0 },
|
||||
"type": "stat",
|
||||
"title": "Active Sessions",
|
||||
"targets": [
|
||||
{
|
||||
"expr": "sum(claude_code_session_count_total)",
|
||||
"refId": "A"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "short",
|
||||
"color": { "mode": "thresholds" },
|
||||
"thresholds": {
|
||||
"mode": "absolute",
|
||||
"steps": [
|
||||
{ "value": null, "color": "green" }
|
||||
]
|
||||
}
|
||||
}
|
||||
},
|
||||
"options": {
|
||||
"reduceOptions": {
|
||||
"values": false,
|
||||
"calcs": ["lastNotNull"]
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"id": 2,
|
||||
"gridPos": { "h": 4, "w": 6, "x": 6, "y": 0 },
|
||||
"type": "stat",
|
||||
"title": "Total Cost (24h)",
|
||||
"targets": [
|
||||
{
|
||||
"expr": "sum(increase(claude_code_cost_usage_total[24h]))",
|
||||
"refId": "A"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "currencyUSD",
|
||||
"decimals": 2,
|
||||
"color": { "mode": "thresholds" },
|
||||
"thresholds": {
|
||||
"mode": "absolute",
|
||||
"steps": [
|
||||
{ "value": null, "color": "green" },
|
||||
{ "value": 5, "color": "yellow" },
|
||||
{ "value": 10, "color": "red" }
|
||||
]
|
||||
}
|
||||
}
|
||||
},
|
||||
"options": {
|
||||
"reduceOptions": {
|
||||
"values": false,
|
||||
"calcs": ["lastNotNull"]
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"id": 3,
|
||||
"gridPos": { "h": 4, "w": 6, "x": 12, "y": 0 },
|
||||
"type": "stat",
|
||||
"title": "Total Tokens (24h)",
|
||||
"targets": [
|
||||
{
|
||||
"expr": "sum(increase(claude_code_token_usage_total[24h]))",
|
||||
"refId": "A"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "short",
|
||||
"decimals": 0,
|
||||
"color": { "mode": "thresholds" },
|
||||
"thresholds": {
|
||||
"mode": "absolute",
|
||||
"steps": [
|
||||
{ "value": null, "color": "green" }
|
||||
]
|
||||
}
|
||||
}
|
||||
},
|
||||
"options": {
|
||||
"reduceOptions": {
|
||||
"values": false,
|
||||
"calcs": ["lastNotNull"]
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"id": 4,
|
||||
"gridPos": { "h": 4, "w": 6, "x": 18, "y": 0 },
|
||||
"type": "stat",
|
||||
"title": "Active Time (24h)",
|
||||
"targets": [
|
||||
{
|
||||
"expr": "sum(increase(claude_code_active_time_total_seconds[24h])) / 3600",
|
||||
"refId": "A"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "h",
|
||||
"decimals": 1,
|
||||
"color": { "mode": "palette-classic" }
|
||||
}
|
||||
},
|
||||
"options": {
|
||||
"reduceOptions": {
|
||||
"values": false,
|
||||
"calcs": ["lastNotNull"]
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"id": 5,
|
||||
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 4 },
|
||||
"type": "timeseries",
|
||||
"title": "Cost Over Time (per hour)",
|
||||
"targets": [
|
||||
{
|
||||
"expr": "sum(rate(claude_code_cost_usage_total[5m])) * 3600",
|
||||
"legendFormat": "Cost per hour",
|
||||
"refId": "A"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "currencyUSD",
|
||||
"custom": {
|
||||
"drawStyle": "line",
|
||||
"lineWidth": 2,
|
||||
"fillOpacity": 10,
|
||||
"showPoints": "auto"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"id": 6,
|
||||
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 4 },
|
||||
"type": "timeseries",
|
||||
"title": "Token Usage by Type",
|
||||
"targets": [
|
||||
{
|
||||
"expr": "sum by (type) (rate(claude_code_token_usage_total[5m]) * 60)",
|
||||
"legendFormat": "{{type}}",
|
||||
"refId": "A"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "short",
|
||||
"custom": {
|
||||
"drawStyle": "line",
|
||||
"lineWidth": 2,
|
||||
"fillOpacity": 20,
|
||||
"showPoints": "auto",
|
||||
"stacking": { "mode": "normal" }
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -0,0 +1,381 @@
|
||||
# Claude Code Metrics Reference
|
||||
|
||||
Complete reference for all Claude Code OpenTelemetry metrics.
|
||||
|
||||
**Important:** All metrics use a double prefix: `claude_code_claude_code_*`
|
||||
|
||||
---
|
||||
|
||||
## Metric Categories
|
||||
|
||||
1. **Usage Metrics** - Session counts, active time
2. **Token Metrics** - Input, output, cached tokens
3. **Cost Metrics** - API costs by model
4. **Productivity Metrics** - LOC, commits, PRs
5. **Error Metrics** - Failures, retries
|
||||
|
||||
---
|
||||
|
||||
## Usage Metrics
|
||||
|
||||
### claude_code_claude_code_session_count_total
|
||||
|
||||
**Type:** Counter
|
||||
**Description:** Total number of Claude Code sessions started
|
||||
**Labels:**
|
||||
- `account_uuid` - Anonymous user identifier
|
||||
- `version` - Claude Code version (e.g., "1.2.3")
|
||||
|
||||
**Example Query:**
|
||||
```promql
|
||||
# Total sessions across all users
|
||||
sum(claude_code_claude_code_session_count_total)
|
||||
|
||||
# Sessions by version
|
||||
sum by (version) (claude_code_claude_code_session_count_total)
|
||||
|
||||
# New sessions in last 24h
|
||||
increase(claude_code_claude_code_session_count_total[24h])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### claude_code_claude_code_active_time_seconds_total
|
||||
|
||||
**Type:** Counter
|
||||
**Description:** Total active time spent in Claude Code sessions (in seconds)
|
||||
**Labels:**
|
||||
- `account_uuid` - Anonymous user identifier
|
||||
- `version` - Claude Code version
|
||||
|
||||
**Example Query:**
|
||||
```promql
|
||||
# Total active hours
|
||||
sum(claude_code_claude_code_active_time_seconds_total) / 3600
|
||||
|
||||
# Active hours per day
|
||||
increase(claude_code_claude_code_active_time_seconds_total[24h]) / 3600
|
||||
|
||||
# Average session duration in seconds (aggregate both sides so the result is a single value)
sum(increase(claude_code_claude_code_active_time_seconds_total[24h]))
/
sum(increase(claude_code_claude_code_session_count_total[24h]))
|
||||
```
|
||||
|
||||
**Note:** "Active time" means time when Claude Code is actively processing or responding to user input.
|
||||
|
||||
---
|
||||
|
||||
## Token Metrics
|
||||
|
||||
### claude_code_claude_code_token_usage_tokens_total
|
||||
|
||||
**Type:** Counter
|
||||
**Description:** Total tokens consumed by Claude Code API calls
|
||||
**Labels:**
|
||||
- `type` - Token type: `input`, `output`, `cache_creation`, `cache_read`
|
||||
- `model` - Model name (e.g., "claude-sonnet-4-5-20250929", "claude-opus-4-20250514")
|
||||
- `account_uuid` - Anonymous user identifier
|
||||
- `version` - Claude Code version
|
||||
|
||||
**Token Types Explained:**
|
||||
- **input:** User messages and tool results sent to Claude
|
||||
- **output:** Claude's responses (text and tool calls)
|
||||
- **cache_creation:** Tokens written to prompt cache (billed at 1.25x the input rate for the default 5-minute cache)
|
||||
- **cache_read:** Tokens read from prompt cache (billed at 10% of input rate)
|
||||
|
||||
**Example Query:**
|
||||
```promql
|
||||
# Total tokens by type (24h)
|
||||
sum by (type) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
|
||||
|
||||
# Tokens by model (24h)
|
||||
sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
|
||||
|
||||
# Cache hit rate
sum(increase(claude_code_claude_code_token_usage_tokens_total{type="cache_read"}[24h]))
/
sum(increase(claude_code_claude_code_token_usage_tokens_total{type=~"input|cache_creation|cache_read"}[24h]))
|
||||
|
||||
# Token usage rate (per minute)
|
||||
rate(claude_code_claude_code_token_usage_tokens_total[5m]) * 60
|
||||
```
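The cache hit rate query above divides cache-read tokens by all input-side tokens. As a sanity check of that arithmetic, here is a minimal Python sketch. The function name and the sample counts are illustrative assumptions, not values from a real deployment.

```python
def cache_hit_rate(input_tokens: float, cache_creation: float, cache_read: float) -> float:
    """Fraction of input-side tokens served from the prompt cache.

    Mirrors the PromQL above: cache_read / (input + cache_creation + cache_read).
    Returns 0.0 when there is no input-side traffic, to avoid dividing by zero.
    """
    total = input_tokens + cache_creation + cache_read
    if total == 0:
        return 0.0
    return cache_read / total


# Hypothetical 24h token increases, chosen only for illustration.
rate = cache_hit_rate(input_tokens=20_000, cache_creation=30_000, cache_read=50_000)
print(f"{rate:.0%}")  # prints "50%"
```

Unlike this sketch, the PromQL version does not return 0 for an empty denominator: it returns no data when the series are absent, or `NaN` when they exist but did not increase.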
|
||||
|
||||
---
|
||||
|
||||
## Cost Metrics
|
||||
|
||||
### claude_code_claude_code_cost_usage_USD_total
|
||||
|
||||
**Type:** Counter
|
||||
**Description:** Total API costs in USD
|
||||
**Labels:**
|
||||
- `model` - Model name
|
||||
- `account_uuid` - Anonymous user identifier
|
||||
- `version` - Claude Code version
|
||||
|
||||
**Pricing Reference (as of Jan 2025):**
|
||||
- **Claude Sonnet 4.5:** $3/MTok input, $15/MTok output
|
||||
- **Claude Opus 4:** $15/MTok input, $75/MTok output
|
||||
- **Cache read:** 10% of input price
- **Cache write:** 125% of input price (default 5-minute cache)
|
||||
|
||||
**Example Query:**
|
||||
```promql
|
||||
# Total cost (24h)
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
|
||||
# Cost by model (24h)
|
||||
sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
|
||||
# Cost per hour
|
||||
rate(claude_code_claude_code_cost_usage_USD_total[5m]) * 3600
|
||||
|
||||
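# Average cost per session (sum both sides: cost series carry a `model` label that session counts lack, so unaggregated division matches nothing)
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
/
sum(increase(claude_code_claude_code_session_count_total[24h]))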
|
||||
|
||||
# Cumulative cost over time
|
||||
sum(claude_code_claude_code_cost_usage_USD_total)
|
||||
```
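To relate the cost counter to the token counter, the following Python sketch estimates a cost from token counts using the per-MTok prices in the Pricing Reference above. The dictionary keys, the function name, and the sample token counts are hypothetical, and cache writes are left out for brevity. The exported cost metric remains the authoritative figure.

```python
# Prices in USD per million tokens (MTok), copied from the Pricing Reference above.
# The short model keys are hypothetical labels used only in this sketch.
PRICES = {
    "sonnet-4.5": {"input": 3.0, "output": 15.0},
    "opus-4": {"input": 15.0, "output": 75.0},
}
CACHE_READ_FACTOR = 0.10  # cache reads are billed at 10% of the input price


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int,
                      cache_read_tokens: int = 0) -> float:
    """Estimate API cost in USD from token counts (cache writes not modeled)."""
    price = PRICES[model]
    per_token_in = price["input"] / 1_000_000
    per_token_out = price["output"] / 1_000_000
    return (
        input_tokens * per_token_in
        + output_tokens * per_token_out
        + cache_read_tokens * per_token_in * CACHE_READ_FACTOR
    )


# Hypothetical daily usage: 1M input, 200k output, 2M cache-read tokens.
cost = estimate_cost_usd("sonnet-4.5", input_tokens=1_000_000,
                         output_tokens=200_000, cache_read_tokens=2_000_000)
print(f"${cost:.2f}")  # prints "$6.60"
```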
|
||||
|
||||
---
|
||||
|
||||
## Productivity Metrics
|
||||
|
||||
### claude_code_claude_code_lines_of_code_count_total
|
||||
|
||||
**Type:** Counter
|
||||
**Description:** Total lines of code modified (added + changed + deleted)
|
||||
**Labels:**
|
||||
- `type` - Modification type: `added`, `changed`, `deleted`
|
||||
- `account_uuid` - Anonymous user identifier
|
||||
- `version` - Claude Code version
|
||||
|
||||
**Example Query:**
|
||||
```promql
|
||||
# Total LOC modified
|
||||
sum(claude_code_claude_code_lines_of_code_count_total)
|
||||
|
||||
# LOC by type (24h)
|
||||
sum by (type) (increase(claude_code_claude_code_lines_of_code_count_total[24h]))
|
||||
|
||||
# LOC per hour
|
||||
rate(claude_code_claude_code_lines_of_code_count_total[5m]) * 3600
|
||||
|
||||
# Lines per dollar
|
||||
sum(increase(claude_code_claude_code_lines_of_code_count_total[24h]))
|
||||
/
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### claude_code_claude_code_commit_count_total
|
||||
|
||||
**Type:** Counter
|
||||
**Description:** Total git commits created by Claude Code
|
||||
**Labels:**
|
||||
- `account_uuid` - Anonymous user identifier
|
||||
- `version` - Claude Code version
|
||||
|
||||
**Example Query:**
|
||||
```promql
|
||||
# Total commits
|
||||
sum(claude_code_claude_code_commit_count_total)
|
||||
|
||||
# Commits per day
|
||||
increase(claude_code_claude_code_commit_count_total[24h])
|
||||
|
||||
# Commits per session
|
||||
increase(claude_code_claude_code_commit_count_total[24h])
|
||||
/
|
||||
increase(claude_code_claude_code_session_count_total[24h])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### claude_code_claude_code_pr_count_total
|
||||
|
||||
**Type:** Counter
|
||||
**Description:** Total pull requests created by Claude Code
|
||||
**Labels:**
|
||||
- `account_uuid` - Anonymous user identifier
|
||||
- `version` - Claude Code version
|
||||
|
||||
**Example Query:**
|
||||
```promql
|
||||
# Total PRs
|
||||
sum(claude_code_claude_code_pr_count_total)
|
||||
|
||||
# PRs per week
|
||||
increase(claude_code_claude_code_pr_count_total[7d])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Cardinality and Resource Attributes
|
||||
|
||||
### Resource Attributes
|
||||
|
||||
All metrics include these resource attributes (configured in settings.json):
|
||||
|
||||
```json
|
||||
"OTEL_RESOURCE_ATTRIBUTES": "environment=local,deployment=poc,team=platform"
|
||||
```
|
||||
|
||||
**Common Attributes:**
|
||||
- `service.name` = "claude-code" (set by OTEL Collector)
|
||||
- `environment` - Deployment environment (local, dev, staging, prod)
|
||||
- `deployment` - Deployment type (poc, enterprise)
|
||||
- `team` - Team identifier
|
||||
- `department` - Department identifier
|
||||
- `project` - Project identifier
|
||||
|
||||
**Querying with Resource Attributes:**
|
||||
```promql
|
||||
# Filter by environment
|
||||
sum(claude_code_claude_code_cost_usage_USD_total{environment="prod"})
|
||||
|
||||
# Aggregate by team
|
||||
sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Metric Naming Convention
|
||||
|
||||
**Format:** `claude_code_claude_code_<metric_name>_<unit>_<type>`
|
||||
|
||||
**Why double prefix?**
|
||||
- First `claude_code` comes from Prometheus exporter namespace in OTEL Collector config
|
||||
- Second `claude_code` comes from the original metric name in Claude Code
|
||||
- This is expected behavior with the current configuration
|
||||
|
||||
**Components:**
|
||||
- `<metric_name>`: Descriptive name (e.g., `token_usage`, `cost_usage`)
|
||||
- `<unit>`: Unit of measurement (e.g., `tokens`, `USD`, `seconds`, `count`)
|
||||
- `<type>`: Metric type (always `total` for counters)
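The naming format can be sketched as a small string-building helper. This is only an illustration of the convention described above; the constant and function names are assumptions introduced for this sketch.

```python
NAMESPACE = "claude_code"      # first prefix: Prometheus exporter namespace
SOURCE_PREFIX = "claude_code"  # second prefix: from the original metric name


def prometheus_name(metric_name: str, unit: str, metric_type: str = "total") -> str:
    """Build the exported name: <namespace>_<prefix>_<metric_name>_<unit>_<type>."""
    return "_".join([NAMESPACE, SOURCE_PREFIX, metric_name, unit, metric_type])


print(prometheus_name("token_usage", "tokens"))
print(prometheus_name("cost_usage", "USD"))
print(prometheus_name("active_time", "seconds"))
```

The three printed names match the token, cost, and active-time metrics documented earlier in this reference.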
|
||||
|
||||
---
|
||||
|
||||
## Querying Best Practices
|
||||
|
||||
### Use increase() for Counters
|
||||
|
||||
Counters are cumulative, so use `increase()` for time windows:
|
||||
|
||||
```promql
|
||||
# ✅ Correct - Shows cost in last 24h
|
||||
increase(claude_code_claude_code_cost_usage_USD_total[24h])
|
||||
|
||||
# ❌ Wrong - Shows cumulative cost since start
|
||||
claude_code_claude_code_cost_usage_USD_total
|
||||
```
|
||||
|
||||
### Use rate() for Rates
|
||||
|
||||
Calculate per-second rate, then multiply for desired unit:
|
||||
|
||||
```promql
|
||||
# Cost per hour
|
||||
rate(claude_code_claude_code_cost_usage_USD_total[5m]) * 3600
|
||||
|
||||
# Tokens per minute
|
||||
rate(claude_code_claude_code_token_usage_tokens_total[5m]) * 60
|
||||
```
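The "per-second rate, then multiply" step can be illustrated with a simplified Python sketch. It deliberately ignores counter resets and the window extrapolation that Prometheus applies in `rate()`, so it approximates the idea rather than reimplementing the function. The function name and sample values are hypothetical.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second rate between the first and last (timestamp, value) samples.

    Simplification: no counter-reset handling and no extrapolation to the
    window boundaries, both of which Prometheus performs.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 == t0:
        return 0.0
    return (v1 - v0) / (t1 - t0)


# Hypothetical cost counter sampled over a 5-minute window: (seconds, USD).
window = [(0, 10.00), (150, 10.05), (300, 10.10)]
per_second = simple_rate(window)
print(round(per_second * 3600, 2))  # scale the per-second rate to cost per hour
```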
|
||||
|
||||
### Aggregate with sum()
|
||||
|
||||
Combine metrics across labels:
|
||||
|
||||
```promql
|
||||
# Total tokens (all types)
|
||||
sum(claude_code_claude_code_token_usage_tokens_total)
|
||||
|
||||
# Total tokens by type
|
||||
sum by (type) (claude_code_claude_code_token_usage_tokens_total)
|
||||
|
||||
# Total cost across all models
|
||||
sum(claude_code_claude_code_cost_usage_USD_total)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Example Dashboards
|
||||
|
||||
### Executive Summary (single values)
|
||||
|
||||
```promql
|
||||
# Total cost this month
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[30d]))
|
||||
|
||||
# Total LOC this month
|
||||
sum(increase(claude_code_claude_code_lines_of_code_count_total[30d]))
|
||||
|
||||
# Active users (unique account_uuids)
|
||||
count(count by (account_uuid) (claude_code_claude_code_session_count_total))
|
||||
|
||||
# Average session cost
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[30d]))
|
||||
/
|
||||
sum(increase(claude_code_claude_code_session_count_total[30d]))
|
||||
```
|
||||
|
||||
### Cost Tracking
|
||||
|
||||
```promql
|
||||
# Daily cost trend
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[1d]))
|
||||
|
||||
# Cost by model (pie chart)
|
||||
sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[7d]))
|
||||
|
||||
# Cost by team (bar chart)
|
||||
sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[7d]))
|
||||
```
|
||||
|
||||
### Productivity Tracking
|
||||
|
||||
```promql
|
||||
# LOC per day
|
||||
sum(increase(claude_code_claude_code_lines_of_code_count_total[1d]))
|
||||
|
||||
# Commits per week
|
||||
sum(increase(claude_code_claude_code_commit_count_total[7d]))
|
||||
|
||||
# Efficiency: LOC per dollar
|
||||
sum(increase(claude_code_claude_code_lines_of_code_count_total[30d]))
|
||||
/
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[30d]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Retention and Storage
|
||||
|
||||
**Default Prometheus Retention:** 15 days
|
||||
|
||||
**Adjust retention:**
|
||||
```yaml
|
||||
# In docker-compose.yml, under the prometheus service (these are command-line flags, not prometheus.yml settings)
|
||||
command:
|
||||
- '--storage.tsdb.retention.time=90d'
|
||||
- '--storage.tsdb.retention.size=50GB'
|
||||
```
|
||||
|
||||
**Disk usage estimation:**
|
||||
- ~1-2 MB per day per active user
|
||||
- ~30-60 MB per month per active user
|
||||
- ~360-720 MB per year per active user
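To size a retention window, the per-user estimate above can be scaled by user count and retention days. This is a rough sketch that assumes the 1-2 MB per user per day figure holds linearly. The function name and the example of 50 users are hypothetical.

```python
def estimate_storage_mb(active_users: int, retention_days: int,
                        mb_per_user_day: tuple[float, float] = (1.0, 2.0)) -> tuple[float, float]:
    """Return a (low, high) storage estimate in MB for the given retention."""
    low, high = mb_per_user_day
    return (active_users * retention_days * low,
            active_users * retention_days * high)


# Hypothetical team of 50 active users with 90 days of retention.
low, high = estimate_storage_mb(active_users=50, retention_days=90)
print(f"{low / 1024:.1f}-{high / 1024:.1f} GB")  # prints "4.4-8.8 GB"
```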
|
||||
|
||||
**For long-term storage:** Consider using Prometheus remote write to send data to a time-series database like VictoriaMetrics, Cortex, or Thanos.
|
||||
|
||||
---
|
||||
|
||||
## Additional Resources
|
||||
|
||||
- **Official OTEL Docs:** https://opentelemetry.io/docs/
|
||||
- **Prometheus Query Docs:** https://prometheus.io/docs/prometheus/latest/querying/basics/
|
||||
- **PromQL Examples:** See `prometheus-queries.md`
|
||||
@@ -0,0 +1,405 @@
|
||||
# Useful Prometheus Queries (PromQL)
|
||||
|
||||
Collection of useful PromQL queries for Claude Code telemetry analysis.
|
||||
|
||||
**Note:** All queries use the double prefix: `claude_code_claude_code_*`
|
||||
|
||||
---
|
||||
|
||||
## Cost Analysis
|
||||
|
||||
### Daily Cost Trend
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[1d]))
|
||||
```
|
||||
|
||||
### Cost by Model
|
||||
```promql
|
||||
sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
```
|
||||
|
||||
### Cost per Hour (Rate)
|
||||
```promql
|
||||
rate(claude_code_claude_code_cost_usage_USD_total[5m]) * 3600
|
||||
```
|
||||
|
||||
### Average Cost per Session
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
/
|
||||
sum(increase(claude_code_claude_code_session_count_total[24h]))
|
||||
```
|
||||
|
||||
### Cumulative Monthly Cost
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[30d]))
|
||||
```
|
||||
|
||||
### Cost by Team
|
||||
```promql
|
||||
sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
```
|
||||
|
||||
### Projected Monthly Cost (based on last 7 days)
|
||||
```promql
|
||||
(sum(increase(claude_code_claude_code_cost_usage_USD_total[7d])) / 7) * 30
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Token Usage
|
||||
|
||||
### Total Tokens by Type
|
||||
```promql
|
||||
sum by (type) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
|
||||
```
|
||||
|
||||
### Tokens by Model
|
||||
```promql
|
||||
sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
|
||||
```
|
||||
|
||||
### Cache Hit Rate
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_token_usage_tokens_total{type="cache_read"}[24h]))
|
||||
/
|
||||
sum(increase(claude_code_claude_code_token_usage_tokens_total{type=~"input|cache_creation|cache_read"}[24h]))
|
||||
* 100
|
||||
```
|
||||
|
||||
### Input vs Output Token Ratio
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_token_usage_tokens_total{type="input"}[24h]))
|
||||
/
|
||||
sum(increase(claude_code_claude_code_token_usage_tokens_total{type="output"}[24h]))
|
||||
```
|
||||
|
||||
### Token Usage Rate (per minute)
|
||||
```promql
|
||||
sum by (type) (rate(claude_code_claude_code_token_usage_tokens_total[5m]) * 60)
|
||||
```
|
||||
|
||||
### Total Tokens (All Time)
|
||||
```promql
|
||||
sum(claude_code_claude_code_token_usage_tokens_total)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Productivity Metrics
|
||||
|
||||
### Total Lines of Code Modified
|
||||
```promql
|
||||
sum(claude_code_claude_code_lines_of_code_count_total)
|
||||
```
|
||||
|
||||
### LOC by Type (Added, Changed, Deleted)
|
||||
```promql
|
||||
sum by (type) (increase(claude_code_claude_code_lines_of_code_count_total[24h]))
|
||||
```
|
||||
|
||||
### LOC per Hour
|
||||
```promql
|
||||
rate(claude_code_claude_code_lines_of_code_count_total[5m]) * 3600
|
||||
```
|
||||
|
||||
### Lines per Dollar (Efficiency)
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_lines_of_code_count_total[24h]))
|
||||
/
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
```
|
||||
|
||||
### Commits per Day
|
||||
```promql
|
||||
increase(claude_code_claude_code_commit_count_total[24h])
|
||||
```
|
||||
|
||||
### PRs per Week
|
||||
```promql
|
||||
increase(claude_code_claude_code_pr_count_total[7d])
|
||||
```
|
||||
|
||||
### LOC per Commit
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_lines_of_code_count_total[24h]))
|
||||
/
|
||||
sum(increase(claude_code_claude_code_commit_count_total[24h]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Session Analytics
|
||||
|
||||
### Total Sessions
|
||||
```promql
|
||||
sum(claude_code_claude_code_session_count_total)
|
||||
```
|
||||
|
||||
### New Sessions (24h)
|
||||
```promql
|
||||
increase(claude_code_claude_code_session_count_total[24h])
|
||||
```
|
||||
|
||||
### Active Users (Unique account_uuids)
|
||||
```promql
|
||||
count(count by (account_uuid) (claude_code_claude_code_session_count_total))
|
||||
```
|
||||
|
||||
### Average Session Duration
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_active_time_seconds_total[24h]))
|
||||
/
|
||||
sum(increase(claude_code_claude_code_session_count_total[24h]))
|
||||
/ 60
|
||||
```
|
||||
*Result in minutes*
|
||||
|
||||
### Total Active Hours (24h)
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_active_time_seconds_total[24h])) / 3600
|
||||
```
|
||||
|
||||
### Sessions by Version
|
||||
```promql
|
||||
sum by (version) (increase(claude_code_claude_code_session_count_total[24h]))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Team Aggregation
|
||||
|
||||
### Cost by Team (Last 24h)
|
||||
```promql
|
||||
sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
```
|
||||
|
||||
### LOC by Team (Last 24h)
|
||||
```promql
|
||||
sum by (team) (increase(claude_code_claude_code_lines_of_code_count_total[24h]))
|
||||
```
|
||||
|
||||
### Active Users per Team
|
||||
```promql
|
||||
count by (team) (count by (team, account_uuid) (claude_code_claude_code_session_count_total))
|
||||
```
|
||||
|
||||
### Team Efficiency (LOC per Dollar)
|
||||
```promql
|
||||
sum by (team) (increase(claude_code_claude_code_lines_of_code_count_total[24h]))
|
||||
/
|
||||
sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
```
|
||||
|
||||
### Top Spending Teams (Last 7 days)
|
||||
```promql
|
||||
topk(5, sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[7d])))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Model Comparison
|
||||
|
||||
### Cost by Model (Pie Chart)
|
||||
```promql
|
||||
sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[7d]))
|
||||
```
|
||||
|
||||
### Token Efficiency by Model (Tokens per Dollar)
|
||||
```promql
|
||||
sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
|
||||
/
|
||||
sum by (model) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
```
|
||||
|
||||
### Most Used Model
|
||||
```promql
|
||||
topk(1, sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h])))
|
||||
```
|
||||
|
||||
### Model Usage Distribution (%)
|
||||
```promql
|
||||
sum by (model) (increase(claude_code_claude_code_token_usage_tokens_total[24h]))
|
||||
/
|
||||
sum(increase(claude_code_claude_code_token_usage_tokens_total[24h]))
|
||||
* 100
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Alerting Queries
|
||||
|
||||
### High Daily Cost Alert (> $50)
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h])) > 50
|
||||
```
|
||||
|
||||
### Cost Spike Alert (50% increase compared to yesterday)
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
/
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h] offset 24h))
|
||||
> 1.5
|
||||
```
|
||||
|
||||
### No Activity Alert (no sessions in last hour)
|
||||
```promql
|
||||
increase(claude_code_claude_code_session_count_total[1h]) == 0
|
||||
```
|
||||
|
||||
### Low Cache Hit Rate Alert (< 20%)
|
||||
```promql
|
||||
(
  sum(increase(claude_code_claude_code_token_usage_tokens_total{type="cache_read"}[1h]))
  /
  sum(increase(claude_code_claude_code_token_usage_tokens_total{type=~"input|cache_creation|cache_read"}[1h]))
  * 100
) < 20
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Forecasting
|
||||
|
||||
### Projected Monthly Cost (based on last 7 days)
|
||||
```promql
|
||||
(sum(increase(claude_code_claude_code_cost_usage_USD_total[7d])) / 7) * 30
|
||||
```
|
||||
|
||||
### Projected Annual Cost (based on last 30 days)
|
||||
```promql
|
||||
(sum(increase(claude_code_claude_code_cost_usage_USD_total[30d])) / 30) * 365
|
||||
```
|
||||
|
||||
### Average Daily Cost (Last 30 days)
|
||||
```promql
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[30d])) / 30
|
||||
```
|
||||
|
||||
### Growth Rate (Week over Week)
|
||||
```promql
|
||||
(
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[7d]))
|
||||
-
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[7d] offset 7d))
|
||||
)
|
||||
/
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[7d] offset 7d))
|
||||
* 100
|
||||
```
|
||||
*Result as percentage*
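
The projection and growth formulas in this section reduce to simple arithmetic on the totals that `increase()` returns. A minimal shell sketch of that arithmetic, assuming the totals have already been read from Prometheus; the function names `project_cost` and `growth_pct` are illustrative and not part of this skill:

```bash
#!/usr/bin/env bash
# Hypothetical helpers mirroring the PromQL forecasting formulas above.

# project_cost TOTAL WINDOW_DAYS TARGET_DAYS
# Average daily cost over the window, scaled to the target period.
project_cost() {
  awk -v total="$1" -v window="$2" -v target="$3" \
    'BEGIN { printf "%.2f\n", (total / window) * target }'
}

# growth_pct CURRENT PREVIOUS
# Percentage change between two periods; prints "n/a" when the
# previous period is zero, where the ratio is not meaningful.
growth_pct() {
  awk -v cur="$1" -v prev="$2" 'BEGIN {
    if (prev == 0) { print "n/a"; exit }
    printf "%.2f\n", (cur - prev) / prev * 100
  }'
}
```

For example, `project_cost 70 7 30` scales a $70 week to a 30-day month, and `growth_pct 120 100` compares this week's $120 with last week's $100.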
|
||||
|
||||
---
|
||||
|
||||
## Debugging Queries
|
||||
|
||||
### Check if Metrics Exist
|
||||
```promql
|
||||
claude_code_claude_code_session_count_total
|
||||
```
|
||||
|
||||
### List All Claude Code Metrics
|
||||
```bash
|
||||
# Use Prometheus UI or API
|
||||
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code
|
||||
```
|
||||
|
||||
### Check Metric Labels
|
||||
```promql
|
||||
# Returns all label combinations
|
||||
count by (account_uuid, version, team, environment) (claude_code_claude_code_session_count_total)
|
||||
```
|
||||
|
||||
### Latest Value for All Metrics
|
||||
```promql
|
||||
# Session count
|
||||
claude_code_claude_code_session_count_total
|
||||
|
||||
# Cost
|
||||
claude_code_claude_code_cost_usage_USD_total
|
||||
|
||||
# Tokens
|
||||
claude_code_claude_code_token_usage_tokens_total
|
||||
|
||||
# LOC
|
||||
claude_code_claude_code_lines_of_code_count_total
|
||||
```
|
||||
|
||||
### Metrics Cardinality (Number of Time Series)
|
||||
```promql
|
||||
count(claude_code_claude_code_token_usage_tokens_total)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Recording Rules
|
||||
|
||||
Save these as Prometheus recording rules for faster dashboard queries:
|
||||
|
||||
```yaml
|
||||
groups:
|
||||
- name: claude_code_aggregations
|
||||
interval: 1m
|
||||
rules:
|
||||
# Daily cost
|
||||
- record: claude_code:cost_usd:daily
|
||||
expr: sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
|
||||
# Cost by team
|
||||
- record: claude_code:cost_usd:daily:by_team
|
||||
expr: sum by (team) (increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
|
||||
# Cache hit rate
|
||||
- record: claude_code:cache_hit_rate:daily
|
||||
expr: |
|
||||
sum(increase(claude_code_claude_code_token_usage_tokens_total{type="cache_read"}[24h]))
|
||||
/
|
||||
sum(increase(claude_code_claude_code_token_usage_tokens_total{type=~"input|cache_creation|cache_read"}[24h]))
|
||||
* 100
|
||||
|
||||
# LOC efficiency
|
||||
- record: claude_code:loc_per_dollar:daily
|
||||
expr: |
|
||||
sum(increase(claude_code_claude_code_lines_of_code_count_total[24h]))
|
||||
/
|
||||
sum(increase(claude_code_claude_code_cost_usage_USD_total[24h]))
|
||||
```
|
||||
|
||||
Then use simplified queries:
|
||||
```promql
|
||||
# Instead of complex query, just use:
|
||||
claude_code:cost_usd:daily
|
||||
claude_code:cost_usd:daily:by_team
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Visualization Tips
|
||||
|
||||
### Time Series Panel
|
||||
- Use `rate()` for smooth trends
|
||||
- Set legend to `{{label_name}}` for clarity
|
||||
- Enable "Lines" draw style with opacity
|
||||
|
||||
### Stat Panel
|
||||
- Use `lastNotNull` for counters
|
||||
- Use `increase([24h])` for daily totals
|
||||
- Add thresholds for color coding
|
||||
|
||||
### Bar Chart
|
||||
- Use `sum by (label)` for grouping
|
||||
- Sort by value descending
|
||||
- Limit to top 10 with `topk(10, ...)`
|
||||
|
||||
### Pie Chart
|
||||
- Calculate percentages with division
|
||||
- Use `sum by (label)` for segments
|
||||
- Limit to top categories
|
||||
|
||||
---
|
||||
|
||||
## Additional Resources
|
||||
|
||||
- **Prometheus Query Docs:** https://prometheus.io/docs/prometheus/latest/querying/basics/
|
||||
- **PromQL Examples:** https://prometheus.io/docs/prometheus/latest/querying/examples/
|
||||
- **Grafana Query Editor:** https://grafana.com/docs/grafana/latest/datasources/prometheus/
|
||||
658
skills/claude-code/otel-monitoring-setup/data/troubleshooting.md
Normal file
@@ -0,0 +1,658 @@
|
||||
# Troubleshooting Guide
|
||||
|
||||
Common issues and solutions for Claude Code OpenTelemetry setup.
|
||||
|
||||
---
|
||||
|
||||
## Container Issues
|
||||
|
||||
### Docker Not Running
|
||||
|
||||
**Symptom:** `Cannot connect to the Docker daemon`
|
||||
|
||||
**Diagnosis:**
|
||||
```bash
|
||||
docker info
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
1. Start Docker Desktop application
|
||||
2. Wait for Docker to fully initialize
|
||||
3. Check system tray for Docker icon
|
||||
4. Verify Docker daemon is running: `ps aux | grep docker`
|
||||
|
||||
---
|
||||
|
||||
### Containers Won't Start
|
||||
|
||||
**Symptom:** Containers exit immediately after `docker compose up`
|
||||
|
||||
**Diagnosis:**
|
||||
```bash
|
||||
# Check container logs
|
||||
docker compose logs
|
||||
|
||||
# Check specific service
|
||||
docker compose logs otel-collector
|
||||
docker compose logs prometheus
|
||||
```
|
||||
|
||||
**Common Causes:**
|
||||
|
||||
**1. OTEL Collector Configuration Error**
|
||||
```bash
|
||||
# Check for errors
|
||||
docker compose logs otel-collector | grep -i error
|
||||
|
||||
# Common issues:
|
||||
# - Deprecated logging exporter
|
||||
# - Deprecated 'address' field in telemetry.metrics
|
||||
```
|
||||
|
||||
**Solution A - Deprecated logging exporter:**
|
||||
Update `otel-collector-config.yml`:
|
||||
```yaml
|
||||
exporters:
|
||||
debug:
|
||||
verbosity: normal
|
||||
# NOT:
|
||||
# logging:
|
||||
# loglevel: info
|
||||
```
|
||||
|
||||
**Solution B - Deprecated 'address' field (v0.123.0+):**
|
||||
|
||||
If logs show: `'address' has invalid keys` or similar error:
|
||||
|
||||
Update `otel-collector-config.yml`:
|
||||
```yaml
|
||||
service:
|
||||
telemetry:
|
||||
metrics:
|
||||
level: detailed
|
||||
# REMOVE this line (deprecated in v0.123.0+):
|
||||
# address: ":8888"
|
||||
```
|
||||
|
||||
The `address` field in `service.telemetry.metrics` is deprecated in newer OTEL Collector versions. Remove it; the collector then uses its default internal metrics endpoint.
|
||||
|
||||
**2. Port Already in Use**
|
||||
```bash
|
||||
# Check which ports are in use
|
||||
lsof -i :3000 # Grafana
|
||||
lsof -i :4317 # OTEL gRPC
|
||||
lsof -i :4318 # OTEL HTTP
|
||||
lsof -i :8889 # OTEL Prometheus exporter
|
||||
lsof -i :9090 # Prometheus
|
||||
lsof -i :3100 # Loki
|
||||
```
|
||||
|
||||
**Solution:**
|
||||
- Stop conflicting service
|
||||
- Or change port in docker-compose.yml
|
||||
|
||||
**3. Volume Permission Issues**
|
||||
```bash
|
||||
# Check volume permissions
|
||||
docker volume ls
|
||||
docker volume inspect claude-telemetry_prometheus-data
|
||||
```
|
||||
|
||||
**Solution:**
|
||||
```bash
|
||||
# Remove and recreate volumes
|
||||
docker compose down -v
|
||||
docker compose up -d
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Containers Keep Restarting
|
||||
|
||||
**Symptom:** Container status shows "Restarting"
|
||||
|
||||
**Diagnosis:**
|
||||
```bash
|
||||
docker compose ps
|
||||
docker compose logs --tail=50 <service-name>
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
1. Check memory limits: increase `limit_mib` under `memory_limiter` in the OTEL config
|
||||
2. Check disk space: `df -h`
|
||||
3. Check for configuration errors in logs
|
||||
4. Restart Docker Desktop
|
||||
|
||||
---
|
||||
|
||||
## Claude Code Settings Issues
|
||||
|
||||
### 🚨 CRITICAL: Telemetry Not Sending (Most Common Issue)
|
||||
|
||||
**Symptom:** No metrics appearing in Prometheus after Claude Code restart
|
||||
|
||||
**ROOT CAUSE (90% of cases):** Missing required exporter environment variables
|
||||
|
||||
Even when `CLAUDE_CODE_ENABLE_TELEMETRY=1` is set, telemetry **will not send** without explicit exporter configuration. This is the #1 most common issue.
|
||||
|
||||
**Diagnosis Checklist:**
|
||||
|
||||
**1. Check REQUIRED exporters (MOST IMPORTANT):**
|
||||
```bash
|
||||
jq '.env.OTEL_METRICS_EXPORTER' ~/.claude/settings.json
|
||||
# Must return: "otlp" (NOT null, NOT missing)
|
||||
|
||||
jq '.env.OTEL_LOGS_EXPORTER' ~/.claude/settings.json
|
||||
# Should return: "otlp" (recommended for event tracking)
|
||||
```
|
||||
|
||||
**If either returns `null` or is missing, this is your problem!**
|
||||
|
||||
**2. Verify telemetry is enabled:**
|
||||
```bash
|
||||
jq '.env.CLAUDE_CODE_ENABLE_TELEMETRY' ~/.claude/settings.json
|
||||
# Should return: "1"
|
||||
```
|
||||
|
||||
**3. Check OTEL endpoint:**
|
||||
```bash
|
||||
jq '.env.OTEL_EXPORTER_OTLP_ENDPOINT' ~/.claude/settings.json
|
||||
# Should return: "http://localhost:4317" (for local setup)
|
||||
```
|
||||
|
||||
**4. Verify JSON is valid:**
```bash
jq empty ~/.claude/settings.json
# No output = valid JSON
```

**5. Check if Claude Code was restarted:**
```bash
# Telemetry config only loads at startup!
# Must quit and restart Claude Code completely
```

**6. Test OTEL endpoint connectivity:**
|
||||
```bash
|
||||
nc -zv localhost 4317
|
||||
# Should show: Connection to localhost port 4317 [tcp/*] succeeded!
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
|
||||
**If exporters are missing (MOST COMMON):**
|
||||
|
||||
Add these REQUIRED settings to ~/.claude/settings.json:
|
||||
```json
|
||||
{
|
||||
"env": {
|
||||
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
|
||||
"OTEL_METRICS_EXPORTER": "otlp",
|
||||
"OTEL_LOGS_EXPORTER": "otlp",
|
||||
"OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
|
||||
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Then **restart Claude Code**; settings are only loaded at startup.
|
||||
|
||||
**If endpoint unreachable:**
|
||||
- Verify OTEL Collector container is running
|
||||
- Check firewall settings
|
||||
- Try HTTP endpoint instead: `http://localhost:4318`
|
||||
|
||||
**If still no data:**
|
||||
- Check OTEL Collector logs for incoming connections
|
||||
- Verify Claude Code is running (not just idle)
|
||||
- Wait 60 seconds (default export interval)
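
The diagnosis checklist above can be folded into one check. This is a minimal sketch, assuming the three values have already been read from `~/.claude/settings.json` (for example with `jq -r`, which prints `null` for a missing key); the function name `diagnose_telemetry_env` is hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: reports which required telemetry settings are wrong.

# diagnose_telemetry_env ENABLED METRICS_EXPORTER ENDPOINT
# Prints one line per problem, or "OK"; returns the number of problems.
diagnose_telemetry_env() {
  local enabled="$1" metrics_exporter="$2" endpoint="$3" problems=0

  if [ "$enabled" != "1" ]; then
    echo "CLAUDE_CODE_ENABLE_TELEMETRY must be \"1\""
    problems=$((problems + 1))
  fi
  if [ "$metrics_exporter" != "otlp" ]; then
    echo "OTEL_METRICS_EXPORTER must be \"otlp\""
    problems=$((problems + 1))
  fi
  if [ -z "$endpoint" ] || [ "$endpoint" = "null" ]; then
    echo "OTEL_EXPORTER_OTLP_ENDPOINT is not set"
    problems=$((problems + 1))
  fi

  if [ "$problems" -eq 0 ]; then
    echo "OK"
  fi
  return "$problems"
}
```

A possible invocation passes the output of `jq -r '.env.CLAUDE_CODE_ENABLE_TELEMETRY' ~/.claude/settings.json` and the two matching `jq -r` calls as the three arguments. A restart of Claude Code is still required after any fix.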
|
||||
|
||||
---
|
||||
|
||||
### Settings.json Syntax Errors
|
||||
|
||||
**Symptom:** Claude Code won't start or shows errors
|
||||
|
||||
**Diagnosis:**
|
||||
```bash
|
||||
# Validate JSON
|
||||
jq empty ~/.claude/settings.json
|
||||
|
||||
# Pretty-print to find issues
|
||||
jq . ~/.claude/settings.json
|
||||
```
|
||||
|
||||
**Common Issues:**
|
||||
- Missing commas between properties
|
||||
- Trailing commas before closing braces
|
||||
- Unescaped quotes in strings
|
||||
- Incorrect nesting
|
||||
|
||||
**Solution:**
|
||||
```bash
|
||||
# Restore backup
|
||||
cp ~/.claude/settings.json.backup ~/.claude/settings.json
|
||||
|
||||
# Or fix JSON manually with editor
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Grafana Issues
|
||||
|
||||
### Can't Access Grafana
|
||||
|
||||
**Symptom:** `localhost:3000` doesn't load
|
||||
|
||||
**Diagnosis:**
|
||||
```bash
|
||||
# Check if Grafana is running
|
||||
docker ps | grep grafana
|
||||
|
||||
# Check Grafana logs
|
||||
docker compose logs grafana
|
||||
|
||||
# Check port availability
|
||||
lsof -i :3000
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
1. Verify container is running: `docker compose up -d grafana`
|
||||
2. Wait 30 seconds for Grafana to initialize
|
||||
3. Try `http://127.0.0.1:3000` instead
|
||||
4. Check Docker network: `docker network inspect claude-telemetry`
|
||||
|
||||
---
|
||||
|
||||
### Dashboard Shows "Datasource Not Found"
|
||||
|
||||
**Symptom:** Dashboard panels show "datasource prometheus not found"
|
||||
|
||||
**Cause:** Dashboard has hardcoded datasource UID that doesn't match your Grafana instance
|
||||
|
||||
**Diagnosis:**
|
||||
1. Go to: http://localhost:3000/connections/datasources
|
||||
2. Click on Prometheus datasource
|
||||
3. Note the UID from URL (e.g., `PBFA97CFB590B2093`)
|
||||
|
||||
**Solution:**
|
||||
```bash
|
||||
# Get your datasource UID
|
||||
DATASOURCE_UID=$(curl -s -u admin:admin http://localhost:3000/api/datasources | jq -r '.[] | select(.type=="prometheus") | .uid')
|
||||
|
||||
echo "Your Prometheus datasource UID: $DATASOURCE_UID"
|
||||
|
||||
# Update dashboard JSON
|
||||
cd ~/.claude/telemetry/dashboards
|
||||
sed "s/PBFA97CFB590B2093/$DATASOURCE_UID/g" claude-code-overview.json > claude-code-overview-fixed.json
|
||||
|
||||
# Re-import the fixed dashboard
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Dashboard Shows "No Data"
|
||||
|
||||
**Symptom:** Dashboard loads but all panels show "No data"
|
||||
|
||||
**Diagnosis Steps:**
|
||||
|
||||
**1. Check Prometheus has data:**
|
||||
```bash
|
||||
# Query Prometheus directly
|
||||
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code
|
||||
|
||||
# Should see metrics like:
|
||||
# "claude_code_claude_code_session_count_total"
|
||||
# "claude_code_claude_code_cost_usage_USD_total"
|
||||
```
|
||||
|
||||
**2. Check datasource connection:**
|
||||
- Go to: http://localhost:3000/connections/datasources
|
||||
- Click Prometheus
|
||||
- Click "Save & Test"
|
||||
- Should show: "Successfully queried the Prometheus API"
|
||||
|
||||
**3. Verify metric names in queries:**
|
||||
```bash
|
||||
# Check if metrics use double prefix
|
||||
curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_session_count_total' | jq .
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
|
||||
**If metrics don't exist:**
|
||||
- Claude Code hasn't sent data yet (wait 60 seconds)
|
||||
- OTEL Collector isn't receiving data (check container logs)
|
||||
- Settings.json wasn't configured correctly
|
||||
|
||||
**If metrics exist but dashboard shows no data:**
|
||||
- Dashboard queries use wrong metric names
|
||||
- Update queries to use double prefix: `claude_code_claude_code_*`
|
||||
- Check time range (top-right corner of Grafana)
|
||||
|
||||
**If single prefix metrics exist (`claude_code_*`):**
|
||||
Your setup uses the old naming. Update the dashboard:
|
||||
```bash
|
||||
# Replace double prefix with single
|
||||
sed 's/claude_code_claude_code_/claude_code_/g' dashboard.json > dashboard-fixed.json
|
||||
```
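
To decide which of the cases above applies, the metric name list can be classified automatically. A minimal sketch, assuming the input is the output of the Prometheus `label/__name__/values` query used earlier; the function name `detect_prefix_style` is hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: classifies metric names read from stdin.

# Prints "double" if any claude_code_claude_code_* name is present,
# "single" if only claude_code_* names are present, otherwise "none".
detect_prefix_style() {
  local names
  names=$(cat)
  # The double-prefix check must come first, because every
  # double-prefixed name also matches the single-prefix pattern.
  if printf '%s\n' "$names" | grep -q 'claude_code_claude_code_'; then
    echo "double"
  elif printf '%s\n' "$names" | grep -q 'claude_code_'; then
    echo "single"
  else
    echo "none"
  fi
}
```

For example: `curl -s 'http://localhost:9090/api/v1/label/__name__/values' | detect_prefix_style`. A result of `none` points to the "metrics don't exist" case rather than a naming mismatch.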
|
||||
|
||||
---
|
||||
|
||||
## Prometheus Issues
|
||||
|
||||
### Prometheus Shows No Targets
|
||||
|
||||
**Symptom:** Prometheus UI (localhost:9090) → Status → Targets shows no targets or DOWN status
|
||||
|
||||
**Diagnosis:**
|
||||
```bash
|
||||
# Check Prometheus config
|
||||
cat ~/.claude/telemetry/prometheus.yml
|
||||
|
||||
# Check if OTEL Collector is reachable from Prometheus
|
||||
docker exec -it claude-prometheus ping otel-collector
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
1. Verify `prometheus.yml` has correct scrape_configs
|
||||
2. Ensure OTEL Collector is running
|
||||
3. Check Docker network connectivity
|
||||
4. Restart Prometheus: `docker compose restart prometheus`
|
||||
|
||||
---
|
||||
|
||||
### Prometheus Can't Scrape OTEL Collector
|
||||
|
||||
**Symptom:** Target shows as DOWN with error "context deadline exceeded"
|
||||
|
||||
**Diagnosis:**
|
||||
```bash
|
||||
# Check if OTEL Collector is exposing metrics
|
||||
curl http://localhost:8889/metrics
|
||||
|
||||
# Check OTEL Collector logs
|
||||
docker compose logs otel-collector
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
1. Verify OTEL Collector prometheus exporter is configured
|
||||
2. Check port 8889 is exposed in docker-compose.yml
|
||||
3. Restart OTEL Collector: `docker compose restart otel-collector`
|
||||
|
||||
---
|
||||
|
||||
## Metric Issues
|
||||
|
||||
### Metrics Have Double Prefix
|
||||
|
||||
**Symptom:** Metrics are named `claude_code_claude_code_*` instead of `claude_code_*`
|
||||
|
||||
**Explanation:** This is expected behavior with the current OTEL Collector configuration:
|
||||
- First `claude_code` = Prometheus exporter namespace
|
||||
- Second `claude_code` = Original metric name
|
||||
|
||||
**Solutions:**
|
||||
|
||||
**Option 1: Accept it (Recommended)**
|
||||
- Update dashboard queries to use double prefix
|
||||
- This is the standard configuration
|
||||
|
||||
**Option 2: Remove namespace prefix**
|
||||
Update `otel-collector-config.yml`:
|
||||
```yaml
|
||||
exporters:
|
||||
prometheus:
|
||||
endpoint: "0.0.0.0:8889"
|
||||
namespace: "" # Remove namespace
|
||||
```
|
||||
|
||||
Then restart: `docker compose restart otel-collector`
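
The double prefix is easier to follow with the renaming written out. The sketch below is a simplified approximation, not the exporter's actual implementation: it assumes Claude Code emits OTel metric names such as `claude_code.cost.usage` with unit `USD`, and that the exporter replaces dots with underscores, prepends the namespace, and appends the unit and `_total` for counters. The function name `prom_metric_name` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical, simplified model of how the exported name is built.

# prom_metric_name NAMESPACE OTEL_NAME [UNIT]
prom_metric_name() {
  local namespace="$1" otel_name="$2" unit="${3:-}" name
  # Dots in the OTel name become underscores.
  name=$(printf '%s' "$otel_name" | tr '.' '_')
  # A non-empty exporter namespace is prepended (the first "claude_code").
  if [ -n "$namespace" ]; then
    name="${namespace}_${name}"
  fi
  # Assumed behavior: the unit, when given, is appended before the suffix.
  if [ -n "$unit" ]; then
    name="${name}_${unit}"
  fi
  echo "${name}_total"
}
```

With the namespace set, `prom_metric_name claude_code claude_code.cost.usage USD` reproduces the double-prefixed name used throughout the queries; with an empty namespace (Option 2), the single-prefix form results.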
|
||||
|
||||
---
|
||||
|
||||
### Old Metrics Still Showing
|
||||
|
||||
**Symptom:** After changing configuration, old metrics still appear
|
||||
|
||||
**Cause:** Prometheus retains metrics until retention period expires
|
||||
|
||||
**Solutions:**
|
||||
|
||||
**Quick fix: Delete Prometheus data:**
|
||||
```bash
|
||||
docker compose down
|
||||
docker volume rm claude-telemetry_prometheus-data
|
||||
docker compose up -d
|
||||
```
|
||||
|
||||
**Proper fix: Wait for retention:**
|
||||
- Default retention is 15 days
|
||||
- Old metrics will automatically disappear
|
||||
- New metrics will coexist temporarily
|
||||
|
||||
---
|
||||
|
||||
## Network Issues
|
||||
|
||||
### Can't Reach OTEL Endpoint from Claude Code
|
||||
|
||||
**Symptom:** Claude Code can't connect to `localhost:4317`
|
||||
|
||||
**Diagnosis:**
|
||||
```bash
|
||||
# Test gRPC endpoint
|
||||
nc -zv localhost 4317
|
||||
|
||||
# Test HTTP endpoint
|
||||
curl -v -H 'Content-Type: application/json' -d '{}' http://localhost:4318/v1/metrics
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
|
||||
**If connection refused:**
|
||||
1. Check OTEL Collector is running
|
||||
2. Verify ports are exposed in docker-compose.yml
|
||||
3. Check firewall/antivirus blocking localhost connections
|
||||
|
||||
**If timeout:**
|
||||
1. Increase export timeout in settings.json
|
||||
2. Try HTTP protocol instead of gRPC
|
||||
|
||||
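**macOS-specific (Docker Desktop):**
- If Claude Code itself runs inside a container, use `http://host.docker.internal:4317` instead of `localhost:4317`; from the host, `localhost:4317` already reaches the published port
- Or use bridge network mode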
|
||||
|
||||
---
|
||||
|
||||
### Enterprise Endpoint Unreachable
|
||||
|
||||
**Symptom:** Can't connect to company OTEL endpoint
|
||||
|
||||
**Diagnosis:**
|
||||
```bash
|
||||
# Test connectivity
|
||||
ping otel.company.com
|
||||
|
||||
# Test port
|
||||
nc -zv otel.company.com 4317
|
||||
|
||||
# Test with VPN
|
||||
# (Ensure corporate VPN is connected)
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
1. Connect to corporate VPN
|
||||
2. Check firewall allows outbound connections
|
||||
3. Verify endpoint URL is correct
|
||||
4. Try HTTP endpoint (port 4318) instead of gRPC
|
||||
5. Contact platform team to verify endpoint is accessible
|
||||
|
||||
---
|
||||
|
||||
## Performance Issues
|
||||
|
||||
### High Memory Usage
|
||||
|
||||
**Symptom:** OTEL Collector or Prometheus using excessive memory
|
||||
|
||||
**Diagnosis:**
|
||||
```bash
|
||||
# Check container resource usage
|
||||
docker stats
|
||||
|
||||
# Check Prometheus TSDB size
|
||||
docker system df -v | grep prometheus-data   # data lives in a named Docker volume, not under ~/.claude/telemetry
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
|
||||
**OTEL Collector:**
|
||||
Reduce `limit_mib` under `memory_limiter` in `otel-collector-config.yml`:
|
||||
```yaml
|
||||
processors:
|
||||
memory_limiter:
|
||||
check_interval: 1s
|
||||
limit_mib: 256 # Reduce from 512
|
||||
```
|
||||
|
||||
**Prometheus:**
|
||||
Reduce retention:
|
||||
```yaml
|
||||
command:
|
||||
- '--storage.tsdb.retention.time=7d' # Reduce from 15d
|
||||
- '--storage.tsdb.retention.size=1GB'
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Slow Grafana Dashboards
|
||||
|
||||
**Symptom:** Dashboards take a long time to load or time out
|
||||
|
||||
**Diagnosis:**
|
||||
```bash
|
||||
# Check query performance in Prometheus
|
||||
# Go to: http://localhost:9090/graph
|
||||
# Run expensive queries like: sum by (account_uuid, model, type) (...)
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
1. Reduce dashboard time range (use 6h instead of 7d)
|
||||
2. Increase dashboard refresh interval (1m → 5m)
|
||||
3. Use recording rules for complex queries
|
||||
4. Reduce number of panels
|
||||
5. Use simpler aggregations
|
||||
|
||||
---
|
||||
|
||||
## Data Quality Issues
|
||||
|
||||
### Unexpected Cost Values
|
||||
|
||||
**Symptom:** Cost metrics seem incorrect
|
||||
|
||||
**Diagnosis:**
|
||||
```bash
|
||||
# Check raw cost values
|
||||
curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_cost_usage_USD_total' | jq .
|
||||
|
||||
# Check token usage
|
||||
curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_token_usage_tokens_total' | jq .
|
||||
```
|
||||
|
||||
**Causes:**
|
||||
- Cost is a cumulative counter (not reset between sessions)
|
||||
- Dashboard may be using wrong time range
|
||||
- Model pricing may have changed
|
||||
|
||||
**Solutions:**
|
||||
- Use `increase([24h])` not raw counter values
|
||||
- Verify pricing in metrics reference
|
||||
- Check Claude Code version (pricing may vary)
|
||||
|
||||
---
|
||||
|
||||
### Missing Sessions
|
||||
|
||||
**Symptom:** Some Claude Code sessions not recorded
|
||||
|
||||
**Causes:**
|
||||
1. Claude Code wasn't restarted after settings update
|
||||
2. OTEL Collector was down during session
|
||||
3. Export interval hadn't elapsed yet (60 seconds default)
|
||||
4. Network issue prevented export
|
||||
|
||||
**Solutions:**
|
||||
- Always restart Claude Code after settings changes
|
||||
- Monitor OTEL Collector uptime
|
||||
- Check OTEL Collector logs for export errors
|
||||
- Reduce the export interval if near-real-time data is needed
|
||||
|
||||
---
|
||||
|
||||
## Getting Help
|
||||
|
||||
### Collect Debug Information
|
||||
|
||||
When asking for help, provide:
|
||||
|
||||
```bash
|
||||
# 1. Container status
|
||||
docker compose ps
|
||||
|
||||
# 2. Container logs (last 50 lines)
|
||||
docker compose logs --tail=50
|
||||
|
||||
# 3. Configuration files
|
||||
cat ~/.claude/telemetry/otel-collector-config.yml
|
||||
cat ~/.claude/telemetry/prometheus.yml
|
||||
|
||||
# 4. Claude Code settings (redact sensitive info!)
|
||||
jq '.env | with_entries(select(.key | startswith("OTEL_")))' ~/.claude/settings.json
|
||||
|
||||
# 5. Prometheus metrics list
|
||||
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq . | grep claude_code
|
||||
|
||||
# 6. System info
|
||||
docker --version
|
||||
docker compose version
|
||||
uname -a
|
||||
```
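
The `jq` filter in step 4 prints every `OTEL_*` variable, which can include credentials (for example in `OTEL_EXPORTER_OTLP_HEADERS`). A minimal redaction sketch, assuming one `"KEY": "value"` pair per line as `jq` prints it and values without escaped quotes; the function name `redact_secrets` and the key patterns are illustrative:

```bash
#!/usr/bin/env bash
# Hypothetical helper: masks values of keys that look sensitive.

# Reads JSON-like lines from stdin and replaces the value of any key
# containing HEADERS, TOKEN, PASSWORD, or SECRET with a placeholder.
redact_secrets() {
  sed -E 's/("[^"]*(HEADERS|TOKEN|PASSWORD|SECRET)[^"]*"[[:space:]]*:[[:space:]]*)"[^"]*"/\1"***REDACTED***"/'
}
```

For example, append `| redact_secrets` to the `jq` command in step 4 before pasting the output anywhere. Review the result by eye as well, since the pattern list is not exhaustive.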
|
||||
|
||||
### Enable Debug Logging
|
||||
|
||||
**OTEL Collector:**
|
||||
```yaml
|
||||
exporters:
|
||||
debug:
|
||||
verbosity: detailed # Change from 'normal'
|
||||
|
||||
service:
|
||||
telemetry:
|
||||
logs:
|
||||
level: debug # Change from 'info'
|
||||
```
|
||||
|
||||
**Claude Code:**
|
||||
Add to settings.json:
|
||||
```json
{
  "env": {
    "OTEL_LOG_LEVEL": "debug"
  }
}
```
|
||||
|
||||
Then check logs:
|
||||
```bash
|
||||
docker compose logs -f otel-collector
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Additional Resources
|
||||
|
||||
- **OTEL Collector Docs:** https://opentelemetry.io/docs/collector/
|
||||
- **Prometheus Troubleshooting:** https://prometheus.io/docs/prometheus/latest/troubleshooting/
|
||||
- **Grafana Troubleshooting:** https://grafana.com/docs/grafana/latest/troubleshooting/
|
||||
- **Docker Compose Docs:** https://docs.docker.com/compose/
|
||||
@@ -0,0 +1,812 @@
|
||||
# Mode 1: Local PoC Setup - Detailed Workflow
|
||||
|
||||
Complete step-by-step process for setting up a local OpenTelemetry stack for Claude Code telemetry.
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
**Goal:** Create a complete local telemetry monitoring stack
|
||||
**Time:** 5-7 minutes
|
||||
**Prerequisites:** Docker Desktop, Claude Code, 2GB+ free disk space
|
||||
**Output:** Running Grafana dashboard with Claude Code metrics
|
||||
|
||||
---
|
||||
|
||||
## Phase 0: Prerequisites Verification
|
||||
|
||||
### Step 0.1: Check Docker Installation
|
||||
|
||||
```bash
|
||||
# Check if Docker is installed
|
||||
docker --version
|
||||
|
||||
# Expected: Docker version 20.10.0 or higher
|
||||
```
|
||||
|
||||
**If not installed:**
|
||||
```
|
||||
Docker is not installed. Please install Docker Desktop:
|
||||
- Mac: https://docs.docker.com/desktop/install/mac-install/
|
||||
- Linux: https://docs.docker.com/desktop/install/linux-install/
|
||||
- Windows: https://docs.docker.com/desktop/install/windows-install/
|
||||
```
|
||||
|
||||
**Stop if:** Docker not installed
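
The "20.10.0 or higher" expectation can be checked mechanically instead of by eye. A minimal sketch, assuming plain `MAJOR.MINOR.PATCH` version strings; the function name `version_ge` is hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: numeric comparison of dotted version strings.

# version_ge HAVE WANT
# Succeeds (exit 0) when HAVE is greater than or equal to WANT.
version_ge() {
  awk -v have="$1" -v want="$2" 'BEGIN {
    split(have, h, ".")
    split(want, w, ".")
    # Compare major, minor, patch in order; missing parts count as 0.
    for (i = 1; i <= 3; i++) {
      if ((h[i] + 0) > (w[i] + 0)) exit 0
      if ((h[i] + 0) < (w[i] + 0)) exit 1
    }
    exit 0
  }'
}
```

One possible use is `version_ge "$(docker version --format '{{.Client.Version}}')" 20.10.0 || echo "Docker is too old"`. Version strings with suffixes (such as `-rc1`) are compared on their leading numbers only.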
|
||||
|
||||
### Step 0.2: Verify Docker is Running
|
||||
|
||||
```bash
|
||||
# Check Docker daemon
|
||||
docker ps
|
||||
|
||||
# Expected: List of containers (or empty list)
|
||||
# Error: "Cannot connect to Docker daemon" means Docker isn't running
|
||||
```
|
||||
|
||||
**If not running:**
|
||||
```
|
||||
Docker Desktop is not running. Please:
|
||||
1. Open Docker Desktop application
|
||||
2. Wait for the whale icon to be stable (not animated)
|
||||
3. Try again
|
||||
```
|
||||
|
||||
**Stop if:** Docker not running
|
||||
|
||||
### Step 0.3: Check Docker Compose
|
||||
|
||||
```bash
|
||||
# Modern Docker includes compose
|
||||
docker compose version
|
||||
|
||||
# Expected: Docker Compose version v2.x.x or higher
|
||||
```
|
||||
|
||||
**Note:** We use `docker compose` (not `docker-compose`)
|
||||
|
||||
### Step 0.4: Check Available Ports
|
||||
|
||||
```bash
|
||||
# Check if ports are available
|
||||
lsof -i :3000 -i :4317 -i :4318 -i :8889 -i :9090 -i :3100
|
||||
|
||||
# Expected: No output (ports are free)
|
||||
```
|
||||
|
||||
**If ports in use:**
|
||||
```
|
||||
The following ports are required but already in use:
|
||||
- 3000: Grafana
|
||||
- 4317: OTEL Collector (gRPC)
|
||||
- 4318: OTEL Collector (HTTP)
|
||||
- 8889: OTEL Collector (Prometheus exporter)
|
||||
- 9090: Prometheus
|
||||
- 3100: Loki
|
||||
|
||||
Options:
|
||||
1. Stop services using these ports
|
||||
2. Modify port mappings in docker-compose.yml (advanced)
|
||||
```
|
||||
|
||||
**Stop if:** Critical ports (3000, 4317, 9090) are in use
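
The port check and the stop condition can be combined into one script. This is a minimal sketch built on the port list above; the function names are hypothetical, and `lsof` is assumed to be installed as in the earlier command:

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the port availability check.

# Maps a port number to the service that needs it.
port_service() {
  case "$1" in
    3000) echo "Grafana" ;;
    4317) echo "OTEL Collector (gRPC)" ;;
    4318) echo "OTEL Collector (HTTP)" ;;
    8889) echo "OTEL Collector (Prometheus exporter)" ;;
    9090) echo "Prometheus" ;;
    3100) echo "Loki" ;;
    *)    echo "unknown" ;;
  esac
}

# Succeeds for the ports this step treats as critical.
is_critical_port() {
  case "$1" in
    3000|4317|9090) return 0 ;;
    *)              return 1 ;;
  esac
}

# Lists busy ports; returns 1 if any critical port is in use.
report_busy_ports() {
  local port status=0
  for port in 3000 4317 4318 8889 9090 3100; do
    if lsof -i ":$port" >/dev/null 2>&1; then
      echo "Port $port in use ($(port_service "$port"))"
      if is_critical_port "$port"; then status=1; fi
    fi
  done
  return "$status"
}
```

Running `report_busy_ports || echo "Stop: free the critical ports first"` gives the same stop decision as the manual check.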
|
||||
|
||||
### Step 0.5: Check Disk Space
|
||||
|
||||
```bash
|
||||
# Check available disk space
|
||||
df -h ~
|
||||
|
||||
# Minimum: 2GB free (for Docker images ~1.5GB + data volumes)
|
||||
# Recommended: 5GB+ free for comfortable operation
|
||||
```
|
||||
|
||||
**If low disk space:**
|
||||
```
|
||||
Low disk space detected. Setup requires:
|
||||
- Initial: ~1.5GB for Docker images (OTEL, Prometheus, Grafana, Loki)
|
||||
- Runtime: 500MB+ for data volumes (grows over time)
|
||||
- Minimum: 2GB free disk space required
|
||||
|
||||
Please free up space before continuing.
|
||||
```
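
The 2GB minimum can also be tested in a script rather than read off the `df -h` output. A minimal sketch, assuming the available space is supplied in kilobytes; the function name `has_min_free_space` is hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper for the disk space check.

# has_min_free_space AVAILABLE_KB REQUIRED_GB
# Succeeds when the available kilobytes cover the required gigabytes.
has_min_free_space() {
  [ "$1" -ge $(( $2 * 1024 * 1024 )) ]
}
```

The available kilobytes for the home directory can be taken from the fourth column of the POSIX-format output, for example `has_min_free_space "$(df -Pk ~ | awk 'NR==2 {print $4}')" 2 || echo "Less than 2GB free"`.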
|
||||
|
||||
---
|
||||
|
||||
## Phase 1: Directory Structure Creation
|
||||
|
||||
### Step 1.1: Create Base Directory
|
||||
|
||||
```bash
|
||||
mkdir -p ~/.claude/telemetry/{dashboards,docs}
|
||||
cd ~/.claude/telemetry
|
||||
```
|
||||
|
||||
**Verify:**
|
||||
```bash
|
||||
ls -la ~/.claude/telemetry
|
||||
# Should show: dashboards/ and docs/ directories
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Phase 2: Configuration File Generation
|
||||
|
||||
### Step 2.1: Create docker-compose.yml
|
||||
|
||||
**Template:** `templates/docker-compose-template.yml`
|
||||
|
||||
```yaml
|
||||
services:
|
||||
# OpenTelemetry Collector - receives telemetry from Claude Code
|
||||
otel-collector:
|
||||
image: otel/opentelemetry-collector-contrib:0.115.1
|
||||
container_name: claude-otel-collector
|
||||
command: ["--config=/etc/otel-collector-config.yml"]
|
||||
volumes:
|
||||
- ./otel-collector-config.yml:/etc/otel-collector-config.yml
|
||||
ports:
|
||||
- "4317:4317" # OTLP gRPC receiver
|
||||
- "4318:4318" # OTLP HTTP receiver
|
||||
- "8889:8889" # Prometheus metrics exporter
|
||||
networks:
|
||||
- claude-telemetry
|
||||
|
||||
# Prometheus - stores metrics
|
||||
prometheus:
|
||||
image: prom/prometheus:v2.55.1
|
||||
container_name: claude-prometheus
|
||||
command:
|
||||
- '--config.file=/etc/prometheus/prometheus.yml'
|
||||
- '--storage.tsdb.path=/prometheus'
|
||||
- '--web.console.libraries=/etc/prometheus/console_libraries'
|
||||
- '--web.console.templates=/etc/prometheus/consoles'
|
||||
- '--web.enable-lifecycle'
|
||||
volumes:
|
||||
- ./prometheus.yml:/etc/prometheus/prometheus.yml
|
||||
- prometheus-data:/prometheus
|
||||
ports:
|
||||
- "9090:9090"
|
||||
networks:
|
||||
- claude-telemetry
|
||||
depends_on:
|
||||
- otel-collector
|
||||
|
||||
# Loki - stores logs
|
||||
loki:
|
||||
image: grafana/loki:3.0.0
|
||||
container_name: claude-loki
|
||||
ports:
|
||||
- "3100:3100"
|
||||
command: -config.file=/etc/loki/local-config.yaml
|
||||
volumes:
|
||||
- loki-data:/loki
|
||||
networks:
|
||||
- claude-telemetry
|
||||
|
||||
# Grafana - visualization dashboards
|
||||
grafana:
|
||||
image: grafana/grafana:11.3.0
|
||||
container_name: claude-grafana
|
||||
ports:
|
||||
- "3000:3000"
|
||||
environment:
|
||||
- GF_SECURITY_ADMIN_USER=admin
|
||||
- GF_SECURITY_ADMIN_PASSWORD=admin
|
||||
- GF_USERS_ALLOW_SIGN_UP=false
|
||||
volumes:
|
||||
- grafana-data:/var/lib/grafana
|
||||
- ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
|
||||
networks:
|
||||
- claude-telemetry
|
||||
depends_on:
|
||||
- prometheus
|
||||
- loki
|
||||
|
||||
networks:
|
||||
claude-telemetry:
|
||||
driver: bridge
|
||||
|
||||
volumes:
|
||||
prometheus-data:
|
||||
loki-data:
|
||||
grafana-data:
|
||||
```
|
||||
|
||||
**Write to:** `~/.claude/telemetry/docker-compose.yml`
|
||||
|
||||
**Note on Image Versions:**
|
||||
- Versions are pinned to prevent breaking changes from upstream
|
||||
- Current versions (tested and stable):
|
||||
- OTEL Collector: 0.115.1
|
||||
- Prometheus: v2.55.1
|
||||
- Loki: 3.0.0
|
||||
- Grafana: 11.3.0
|
||||
- To update: Change version tags in docker-compose.yml and run `docker compose pull`
|
||||
|
||||
### Step 2.2: Create OTEL Collector Configuration
|
||||
|
||||
**Template:** `templates/otel-collector-config-template.yml`
|
||||
|
||||
**CRITICAL:** Use `debug` exporter, not deprecated `logging` exporter
|
||||
|
||||
```yaml
|
||||
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024

  resource:
    attributes:
      - key: service.name
        value: claude-code
        action: upsert

  memory_limiter:
    check_interval: 1s
    limit_mib: 512

exporters:
  # Export metrics to Prometheus
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: claude_code
    const_labels:
      source: claude_code_telemetry

  # Export logs to Loki via OTLP HTTP
  otlphttp/loki:
    endpoint: http://loki:3100/otlp
    tls:
      insecure: true

  # Debug exporter (replaces deprecated logging exporter)
  debug:
    verbosity: normal

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [prometheus, debug]

    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [otlphttp/loki, debug]

  telemetry:
    logs:
      level: info
|
||||
```
|
||||
|
||||
**Write to:** `~/.claude/telemetry/otel-collector-config.yml`
|
||||
|
||||
### Step 2.3: Create Prometheus Configuration
|
||||
|
||||
**Template:** `templates/prometheus-config-template.yml`
|
||||
|
||||
```yaml
|
||||
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']
|
||||
```
|
||||
|
||||
**Write to:** `~/.claude/telemetry/prometheus.yml`
|
||||
|
||||
### Step 2.4: Create Grafana Datasources Configuration
|
||||
|
||||
**Template:** `templates/grafana-datasources-template.yml`
|
||||
|
||||
```yaml
|
||||
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: true
|
||||
```
|
||||
|
||||
**Write to:** `~/.claude/telemetry/grafana-datasources.yml`
|
||||
|
||||
### Step 2.5: Create Management Scripts
|
||||
|
||||
**Start Script:**
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# start-telemetry.sh
|
||||
|
||||
echo "🚀 Starting Claude Code Telemetry Stack..."
|
||||
|
||||
# Check if Docker is running
|
||||
if ! docker info > /dev/null 2>&1; then
|
||||
echo "❌ Docker is not running. Please start Docker Desktop."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
cd ~/.claude/telemetry || exit 1
|
||||
|
||||
# Start containers
|
||||
docker compose up -d
|
||||
|
||||
# Wait for services to be ready
|
||||
echo "⏳ Waiting for services to start..."
|
||||
sleep 5
|
||||
|
||||
# Check container status
|
||||
echo ""
|
||||
echo "📊 Container Status:"
|
||||
docker ps --filter "name=claude-" --format "table {{.Names}}\t{{.Status}}"
|
||||
|
||||
echo ""
|
||||
echo "✅ Telemetry stack started!"
|
||||
echo ""
|
||||
echo "🌐 Access URLs:"
|
||||
echo " Grafana: http://localhost:3000 (admin/admin)"
|
||||
echo " Prometheus: http://localhost:9090"
|
||||
echo " Loki: http://localhost:3100"
|
||||
echo ""
|
||||
echo "📝 Next steps:"
|
||||
echo " 1. Restart Claude Code to activate telemetry"
|
||||
echo " 2. Import dashboards into Grafana"
|
||||
echo " 3. Use Claude Code normally - metrics will appear in ~60 seconds"
|
||||
```
|
||||
|
||||
**Write to:** `~/.claude/telemetry/start-telemetry.sh`
|
||||
|
||||
```bash
|
||||
chmod +x ~/.claude/telemetry/start-telemetry.sh
|
||||
```
|
||||
|
||||
**Stop Script:**
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# stop-telemetry.sh
|
||||
|
||||
echo "🛑 Stopping Claude Code Telemetry Stack..."
|
||||
|
||||
cd ~/.claude/telemetry || exit 1
|
||||
|
||||
docker compose down
|
||||
|
||||
echo "✅ Telemetry stack stopped"
|
||||
echo ""
|
||||
echo "Note: Data is preserved in Docker volumes."
|
||||
echo "To start again: ./start-telemetry.sh"
|
||||
echo "To completely remove all data: ./cleanup-telemetry.sh"
|
||||
```
|
||||
|
||||
**Write to:** `~/.claude/telemetry/stop-telemetry.sh`
|
||||
|
||||
```bash
|
||||
chmod +x ~/.claude/telemetry/stop-telemetry.sh
|
||||
```
|
||||
|
||||
**Cleanup Script (Full Data Removal):**
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# cleanup-telemetry.sh
|
||||
|
||||
echo "⚠️ WARNING: This will remove ALL telemetry data including:"
|
||||
echo " - All containers"
|
||||
echo " - All Docker volumes (Grafana, Prometheus, Loki data)"
|
||||
echo " - Network configuration"
|
||||
echo ""
|
||||
read -p "Are you sure you want to proceed? (yes/no): " -r
|
||||
echo
|
||||
|
||||
if [[ ! $REPLY =~ ^[Yy][Ee][Ss]$ ]]; then
|
||||
echo "Cleanup cancelled."
|
||||
exit 0
|
||||
fi
|
||||
|
||||
echo "Performing full cleanup of Claude Code telemetry stack..."
|
||||
|
||||
cd ~/.claude/telemetry || exit 1
|
||||
|
||||
docker compose down -v
|
||||
|
||||
echo ""
|
||||
echo "✅ Full cleanup complete!"
|
||||
echo ""
|
||||
echo "Removed:"
|
||||
echo " ✓ All containers (otel-collector, prometheus, loki, grafana)"
|
||||
echo " ✓ All volumes (all historical data)"
|
||||
echo " ✓ Network configuration"
|
||||
echo ""
|
||||
echo "Preserved:"
|
||||
echo " ✓ Configuration files in ~/.claude/telemetry/"
|
||||
echo " ✓ Claude Code settings in ~/.claude/settings.json"
|
||||
echo ""
|
||||
echo "To start fresh: ./start-telemetry.sh"
|
||||
```
|
||||
|
||||
**Write to:** `~/.claude/telemetry/cleanup-telemetry.sh`
|
||||
|
||||
```bash
|
||||
chmod +x ~/.claude/telemetry/cleanup-telemetry.sh
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Phase 3: Start Docker Containers
|
||||
|
||||
### Step 3.1: Start All Services
|
||||
|
||||
```bash
|
||||
cd ~/.claude/telemetry
|
||||
docker compose up -d
|
||||
```
|
||||
|
||||
**Expected output:**
|
||||
```
|
||||
[+] Running 5/5
|
||||
✔ Network claude_claude-telemetry Created
|
||||
✔ Container claude-loki Started
|
||||
✔ Container claude-otel-collector Started
|
||||
✔ Container claude-prometheus Started
|
||||
✔ Container claude-grafana Started
|
||||
```
|
||||
|
||||
### Step 3.2: Verify Containers are Running
|
||||
|
||||
```bash
|
||||
docker ps --filter "name=claude-" --format "table {{.Names}}\t{{.Status}}"
|
||||
```
|
||||
|
||||
**Expected:** All 4 containers showing "Up X seconds/minutes"
|
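The container check in this step can also be scripted. The following is a minimal sketch, assuming the four container names shown in the expected output of Step 3.1; `missing_containers` is a hypothetical helper name and is not part of the skill's templates.

```bash
# missing_containers
# Prints the name of every expected telemetry container that is NOT running.
# Assumption: container names match the expected output of Step 3.1.
missing_containers() {
  local expected="claude-otel-collector claude-prometheus claude-loki claude-grafana"
  local running name
  # List only the names of running containers whose name contains "claude-"
  running="$(docker ps --filter "name=claude-" --format '{{.Names}}')"
  for name in $expected; do
    # -x matches the whole line, so "claude-loki" does not match "claude-loki-2"
    printf '%s\n' "$running" | grep -qx "$name" || echo "$name"
  done
}
```

An empty result means all four containers are up; any name that is printed is a container worth inspecting with `docker logs <name>`.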
||||
|
||||
**If OTEL Collector is not running:**
|
||||
```bash
|
||||
# Check logs
|
||||
docker logs claude-otel-collector
|
||||
```
|
||||
|
||||
**Common issue:** "logging exporter deprecated" error
|
||||
**Solution:** Make sure the config file uses the `debug` exporter instead of `logging` (the template already does)
|
||||
|
||||
### Step 3.3: Wait for Services to be Healthy
|
||||
|
||||
```bash
|
||||
# Give services time to initialize
|
||||
sleep 10
|
||||
|
||||
# Test Prometheus
|
||||
curl -s http://localhost:9090/-/healthy
|
||||
# Expected: Prometheus Server is Healthy.
|
||||
|
||||
# Test Grafana
|
||||
curl -s http://localhost:3000/api/health | jq
|
||||
# Expected: {"database": "ok", ...}
|
||||
```
|
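A fixed `sleep` can be too short on a slow machine and needlessly long on a fast one. As an alternative, here is a minimal polling sketch; `wait_for_url` is a hypothetical helper name, and the 60-second default timeout is an assumption, not a value taken from the skill.

```bash
# wait_for_url URL [TIMEOUT_SECONDS]
# Polls URL once per second until it answers with a non-error HTTP status
# or the timeout (default 60s, an assumed value) expires.
wait_for_url() {
  local url="$1" timeout="${2:-60}" elapsed=0
  # -f makes curl exit non-zero on HTTP errors (status >= 400)
  until curl -fsS -o /dev/null "$url" 2>/dev/null; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "Timed out after ${timeout}s waiting for $url" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "Ready: $url"
}

# Example calls for the health endpoints used in this step:
#   wait_for_url http://localhost:9090/-/healthy
#   wait_for_url http://localhost:3000/api/health
```

The function returns as soon as the service responds, so it can replace the `sleep` in the start script as well.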
||||
|
||||
---
|
||||
|
||||
## Phase 4: Update Claude Code Settings
|
||||
|
||||
### Step 4.1: Backup Existing Settings
|
||||
|
||||
```bash
|
||||
cp ~/.claude/settings.json ~/.claude/settings.json.backup
|
||||
```
|
||||
|
||||
### Step 4.2: Read Current Settings
|
||||
|
||||
```bash
|
||||
# Read existing settings
|
||||
cat ~/.claude/settings.json
|
||||
```
|
||||
|
||||
### Step 4.3: Merge Telemetry Configuration
|
||||
|
||||
**Add to settings.json `env` section:**
|
||||
|
||||
```json
|
||||
{
|
||||
"env": {
|
||||
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
|
||||
"OTEL_METRICS_EXPORTER": "otlp",
|
||||
"OTEL_LOGS_EXPORTER": "otlp",
|
||||
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
|
||||
"OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
|
||||
"OTEL_METRIC_EXPORT_INTERVAL": "60000",
|
||||
"OTEL_LOGS_EXPORT_INTERVAL": "5000",
|
||||
"OTEL_LOG_USER_PROMPTS": "1",
|
||||
"OTEL_METRICS_INCLUDE_SESSION_ID": "true",
|
||||
"OTEL_METRICS_INCLUDE_VERSION": "true",
|
||||
"OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
|
||||
"OTEL_RESOURCE_ATTRIBUTES": "environment=local,deployment=poc"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Template:** `templates/settings-env-template.json`
|
||||
|
||||
**Note:** Merge these variables into the existing `env` object; do not replace the entire settings file
|
||||
|
||||
### Step 4.4: Verify Settings Updated
|
||||
|
||||
```bash
|
||||
grep CLAUDE_CODE_ENABLE_TELEMETRY ~/.claude/settings.json
|
||||
# Expected: "CLAUDE_CODE_ENABLE_TELEMETRY": "1"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Phase 5: Grafana Dashboard Import
|
||||
|
||||
### Step 5.1: Detect Prometheus Datasource UID
|
||||
|
||||
**Option A: Via Grafana API**
|
||||
|
||||
```bash
|
||||
curl -s http://admin:admin@localhost:3000/api/datasources | \
|
||||
jq '.[] | select(.type=="prometheus") | {name, uid}'
|
||||
```
|
||||
|
||||
**Expected:**
|
||||
```json
|
||||
{
|
||||
"name": "Prometheus",
|
||||
"uid": "PBFA97CFB590B2093"
|
||||
}
|
||||
```
|
||||
|
||||
**Option B: Manual Detection**
|
||||
1. Open http://localhost:3000
|
||||
2. Go to Connections → Data sources
|
||||
3. Click Prometheus
|
||||
4. Note the UID from the URL: `/datasources/edit/{UID}`
|
||||
|
||||
### Step 5.2: Fix Dashboard with Correct UID
|
||||
|
||||
**Read dashboard template:** `dashboards/claude-code-overview-template.json`
|
||||
|
||||
**Replace all instances of:**
|
||||
```json
|
||||
"datasource": {
|
||||
"type": "prometheus",
|
||||
"uid": "prometheus"
|
||||
}
|
||||
```
|
||||
|
||||
**With:**
|
||||
```json
|
||||
"datasource": {
|
||||
"type": "prometheus",
|
||||
"uid": "PBFA97CFB590B2093"
|
||||
}
|
||||
```
|
||||
|
||||
**Use the UID detected in Step 5.1; `PBFA97CFB590B2093` above is only an example value**
|
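The replacement in this step can be sketched with `sed`. This is a minimal example, assuming the template marks its Prometheus datasource with the literal UID `prometheus` as shown above; `set_datasource_uid` is a hypothetical helper name.

```bash
# set_datasource_uid TEMPLATE_FILE UID
# Writes the template to stdout with every `"uid": "prometheus"` placeholder
# replaced by the detected UID. `"type": "prometheus"` is left untouched.
set_datasource_uid() {
  local template="$1" uid="$2"
  sed 's/"uid":[[:space:]]*"prometheus"/"uid": "'"$uid"'"/g' "$template"
}

# Example, using the paths from Steps 5.2 and 5.4:
#   set_datasource_uid dashboards/claude-code-overview-template.json "$UID_FROM_STEP_5_1" \
#     > ~/.claude/telemetry/dashboards/claude-code-overview.json
```

The sketch assumes the UID contains only letters, digits, `-`, and `_`, which holds for the example UID in Step 5.1; a UID containing `/` or `&` would need escaping before being passed to `sed`.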
||||
|
||||
### Step 5.3: Verify Metric Names
|
||||
|
||||
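**CRITICAL:** Claude Code metrics use a double prefix, `claude_code_claude_code_*`: the Prometheus exporter prepends its `namespace: claude_code` (Step 2.2) to metric names that already start with `claude_code`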
|
||||
|
||||
**Verify actual metric names:**
|
||||
```bash
|
||||
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | \
|
||||
jq -r '.data[] | select(startswith("claude_code"))'
|
||||
```
|
||||
|
||||
**Expected metrics:**
|
||||
- `claude_code_claude_code_active_time_seconds_total`
|
||||
- `claude_code_claude_code_commit_count_total`
|
||||
- `claude_code_claude_code_cost_usage_USD_total`
|
||||
- `claude_code_claude_code_lines_of_code_count_total`
|
||||
- `claude_code_claude_code_token_usage_tokens_total`
|
||||
|
||||
**Dashboard queries must use these exact names**
|
||||
|
||||
### Step 5.4: Save Corrected Dashboard
|
||||
|
||||
**Write to:** `~/.claude/telemetry/dashboards/claude-code-overview.json`
|
||||
|
||||
### Step 5.5: Import Dashboard
|
||||
|
||||
**Option A: Via Grafana UI**
|
||||
1. Open http://localhost:3000 (admin/admin)
|
||||
2. Dashboards → New → Import
|
||||
3. Upload JSON file: `~/.claude/telemetry/dashboards/claude-code-overview.json`
|
||||
4. Click Import
|
||||
|
||||
**Option B: Via API**
|
||||
```bash
|
||||
curl -X POST http://admin:admin@localhost:3000/api/dashboards/db \
|
||||
-H "Content-Type: application/json" \
|
||||
-d @"$HOME/.claude/telemetry/dashboards/claude-code-overview.json"
|
||||
```
|
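Grafana's `POST /api/dashboards/db` endpoint expects the dashboard model nested under a `dashboard` key rather than as the top-level object. If the file saved in Step 5.4 is a bare dashboard model, a wrapper is needed. The following is a minimal sketch; `dashboard_payload` is a hypothetical helper name, and the sketch assumes the file is valid JSON whose `id` field is `null` or absent.

```bash
# dashboard_payload DASHBOARD_FILE
# Wraps a bare dashboard model in the envelope the import API expects.
# "overwrite": true replaces an existing dashboard with the same UID.
dashboard_payload() {
  local file="$1"
  printf '{"dashboard": %s, "overwrite": true}\n' "$(cat "$file")"
}

# Example: pipe the payload to curl, which reads the body from stdin via @-
#   dashboard_payload "$HOME/.claude/telemetry/dashboards/claude-code-overview.json" | \
#     curl -X POST http://admin:admin@localhost:3000/api/dashboards/db \
#       -H "Content-Type: application/json" --data-binary @-
```

Building the envelope with `printf` avoids a `jq` dependency, at the cost of not validating the dashboard JSON before sending it.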
||||
|
||||
---
|
||||
|
||||
## Phase 6: Verification & Testing
|
||||
|
||||
### Step 6.1: Verify OTEL Collector Receiving Data
|
||||
|
||||
**Note:** Claude Code must be restarted for telemetry to activate!
|
||||
|
||||
```bash
|
||||
# Check OTEL Collector logs for incoming data
|
||||
docker logs claude-otel-collector --tail 50 2>&1 | grep -i "received"
|
||||
```
|
||||
|
||||
**Expected:** Messages about receiving OTLP data
|
||||
|
||||
**If no data:**
|
||||
```
|
||||
Reminder: You must restart Claude Code for telemetry to activate.
|
||||
1. Exit current Claude Code session
|
||||
2. Start new session: claude
|
||||
3. Wait 60 seconds
|
||||
4. Check again
|
||||
```
|
||||
|
||||
### Step 6.2: Query Prometheus for Metrics
|
||||
|
||||
```bash
|
||||
# Check if any claude_code metrics exist
|
||||
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | \
|
||||
jq '.data[] | select(. | startswith("claude_code"))'
|
||||
```
|
||||
|
||||
**Expected:** List of claude_code metrics
|
||||
|
||||
**Sample query:**
|
||||
```bash
|
||||
curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_lines_of_code_count_total' | \
|
||||
jq '.data.result'
|
||||
```
|
||||
|
||||
**Expected:** Non-empty result array
|
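The sample query above works unencoded because a bare metric name contains no special characters. Queries with spaces, brackets, or braces need URL encoding. Here is a minimal sketch that lets `curl` do the encoding; `prom_query` and the `PROM_URL` variable are hypothetical names introduced for illustration.

```bash
# prom_query PROMQL
# Runs an instant query against the Prometheus HTTP API.
# -G sends the data as URL query parameters; --data-urlencode encodes the PromQL.
# PROM_URL is an assumed override; it defaults to the local PoC address.
prom_query() {
  local base="${PROM_URL:-http://localhost:9090}"
  curl -sG "$base/api/v1/query" --data-urlencode "query=$1"
}

# Example: per-second token rate over the last 5 minutes
#   prom_query 'sum(rate(claude_code_claude_code_token_usage_tokens_total[5m]))' | jq '.data.result'
```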
||||
|
||||
### Step 6.3: Test Grafana Dashboard
|
||||
|
||||
1. Open http://localhost:3000
|
||||
2. Navigate to imported dashboard
|
||||
3. Check panels show data (or "No data" if Claude Code hasn't been used yet)
|
||||
|
||||
**If "No data":**
|
||||
- Normal if Claude Code hasn't generated any activity yet
|
||||
- Use Claude Code for 1-2 minutes
|
||||
- Refresh dashboard
|
||||
|
||||
**If "Datasource not found":**
|
||||
- UID mismatch - go back to Step 5.1
|
||||
|
||||
**If queries fail:**
|
||||
- Metric name mismatch - verify double prefix
|
||||
|
||||
### Step 6.4: Generate Test Data
|
||||
|
||||
**To populate dashboard quickly:**
|
||||
```
|
||||
Use Claude Code to:
|
||||
1. Ask a question (generates token usage)
|
||||
2. Request a code modification (generates LOC metrics)
|
||||
3. Have a conversation (generates active time)
|
||||
```
|
||||
|
||||
**Wait 60 seconds, then refresh Grafana dashboard**
|
||||
|
||||
---
|
||||
|
||||
## Phase 7: Documentation & Quickstart Guide
|
||||
|
||||
### Step 7.1: Create Quickstart Guide
|
||||
|
||||
**Write to:** `~/.claude/telemetry/docs/quickstart.md`
|
||||
|
||||
**Include:**
|
||||
- URLs and credentials
|
||||
- Management commands (start/stop)
|
||||
- What metrics are being collected
|
||||
- How to access dashboards
|
||||
- Troubleshooting quick reference
|
||||
|
||||
**Template:** `data/quickstart-template.md`
|
||||
|
||||
### Step 7.2: Provide User Summary
|
||||
|
||||
```
|
||||
✅ Setup Complete!
|
||||
|
||||
📦 Installation:
|
||||
Location: ~/.claude/telemetry/
|
||||
Containers: 4 running (OTEL Collector, Prometheus, Loki, Grafana)
|
||||
|
||||
🌐 Access URLs:
|
||||
Grafana: http://localhost:3000 (admin/admin)
|
||||
Prometheus: http://localhost:9090
|
||||
OTEL Collector: localhost:4317 (gRPC), localhost:4318 (HTTP)
|
||||
|
||||
📊 Dashboards Imported:
|
||||
✓ Claude Code - Overview
|
||||
|
||||
📝 What's Being Collected:
|
||||
• Session counts and active time
|
||||
• Token usage (input/output/cached)
|
||||
• API costs by model
|
||||
• Lines of code modified
|
||||
• Commits and PRs created
|
||||
• Tool execution metrics
|
||||
|
||||
⚙️ Management:
|
||||
Start: ~/.claude/telemetry/start-telemetry.sh
|
||||
Stop: ~/.claude/telemetry/stop-telemetry.sh (preserves data)
|
||||
Cleanup: ~/.claude/telemetry/cleanup-telemetry.sh (removes all data)
|
||||
Logs: docker logs claude-otel-collector
|
||||
|
||||
🚀 Next Steps:
|
||||
1. ✅ Restart Claude Code (telemetry activates on startup)
|
||||
2. Use Claude Code normally
|
||||
3. Check dashboard in ~60 seconds
|
||||
4. Review quickstart: ~/.claude/telemetry/docs/quickstart.md
|
||||
|
||||
📚 Documentation:
|
||||
- Quickstart: ~/.claude/telemetry/docs/quickstart.md
|
||||
- Metrics Reference: data/metrics-reference.md
|
||||
- Troubleshooting: data/troubleshooting.md
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Cleanup Instructions
|
||||
|
||||
### Remove Stack (Keep Data)
|
||||
```bash
|
||||
cd ~/.claude/telemetry
|
||||
docker compose down
|
||||
```
|
||||
|
||||
### Remove Stack and Data
|
||||
```bash
|
||||
cd ~/.claude/telemetry
|
||||
docker compose down -v
|
||||
```
|
||||
|
||||
### Remove Telemetry from Claude Code
|
||||
Edit `~/.claude/settings.json` and remove the telemetry variables from the `env` section (keep any unrelated variables), or set:
|
||||
```json
|
||||
"CLAUDE_CODE_ENABLE_TELEMETRY": "0"
|
||||
```
|
||||
|
||||
Then restart Claude Code.
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
See `data/troubleshooting.md` for detailed solutions to common issues.
|
||||
|
||||
**Quick fixes:**
|
||||
- Container won't start → Check logs: `docker logs claude-otel-collector`
|
||||
- No metrics → Restart Claude Code
|
||||
- Dashboard broken → Verify datasource UID
|
||||
- Wrong metric names → Use double prefix: `claude_code_claude_code_*`
|
||||
@@ -0,0 +1,572 @@
|
||||
# Mode 2: Enterprise Setup (Connect to Existing Infrastructure)
|
||||
|
||||
**Goal:** Configure Claude Code to send telemetry to centralized company infrastructure
|
||||
|
||||
**When to use:**
|
||||
- Company has centralized OTEL Collector endpoint
|
||||
- Team rollout scenario
|
||||
- Want aggregated team metrics
|
||||
- Privacy/compliance requires centralized control
|
||||
- No need for local Grafana dashboards
|
||||
|
||||
**Prerequisites:**
|
||||
- OTEL Collector endpoint URL (e.g., `https://otel.company.com:4317`)
|
||||
- Authentication credentials (API key or mTLS certificates)
|
||||
- Optional: Team/department identifiers
|
||||
- Write access to `~/.claude/settings.json`
|
||||
|
||||
**Estimated Time:** 2-3 minutes
|
||||
|
||||
---
|
||||
|
||||
## Phase 0: Gather Requirements
|
||||
|
||||
### Step 0.1: Collect endpoint information from user
|
||||
|
||||
Ask the user for the following details:
|
||||
|
||||
1. **OTEL Collector Endpoint URL**
|
||||
- Format: `https://otel.company.com:4317` or `http://otel.company.com:4318`
|
||||
- Protocol: gRPC (port 4317) or HTTP (port 4318)
|
||||
|
||||
2. **Authentication Method**
|
||||
- API Key/Bearer Token
|
||||
- mTLS certificates
|
||||
- Basic Auth
|
||||
- No authentication (internal network)
|
||||
|
||||
3. **Team/Environment Identifiers**
|
||||
- Team name (e.g., `team=platform`)
|
||||
- Environment (e.g., `environment=production`)
|
||||
- Department (e.g., `department=engineering`)
|
||||
- Any other custom attributes
|
||||
|
||||
4. **Optional: Protocol Preferences**
|
||||
- Default: gRPC (more efficient)
|
||||
- Alternative: HTTP (better firewall compatibility)
|
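The port convention above (4317 for gRPC, 4318 for HTTP) can be used to suggest a value for `OTEL_EXPORTER_OTLP_PROTOCOL` from the endpoint the user provides. This is a minimal sketch under that assumption; `protocol_for_endpoint` is a hypothetical helper, and a collector on a non-standard port still has to be confirmed with the user.

```bash
# protocol_for_endpoint ENDPOINT_URL
# Suggests an OTLP protocol from the conventional port numbers.
protocol_for_endpoint() {
  local port="${1##*:}"   # text after the last colon
  port="${port%%/*}"      # drop any trailing path
  case "$port" in
    4317) echo "grpc" ;;
    4318) echo "http/protobuf" ;;
    *)    echo "unknown"; return 1 ;;
  esac
}

# Example:
#   protocol_for_endpoint https://otel.company.com:4317
```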
||||
|
||||
**Example Questions:**
|
||||
|
||||
```
|
||||
To configure enterprise telemetry, I need a few details:
|
||||
|
||||
1. **Endpoint:** What is your OTEL Collector endpoint URL?
|
||||
(e.g., https://otel.company.com:4317)
|
||||
|
||||
2. **Protocol:** HTTPS or HTTP? gRPC or HTTP/protobuf?
|
||||
|
||||
3. **Authentication:** Do you have an API key, certificate, or other credentials?
|
||||
|
||||
4. **Team identifier:** What team/department should metrics be tagged with?
|
||||
(e.g., team=platform, department=engineering)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Phase 1: Backup Existing Settings
|
||||
|
||||
### Step 1.1: Backup settings.json
|
||||
|
||||
**Always backup before modifying!**
|
||||
|
||||
```bash
|
||||
# Check if settings.json exists
|
||||
if [ -f ~/.claude/settings.json ]; then
|
||||
backup="$HOME/.claude/settings.json.backup.$(date +%Y%m%d-%H%M%S)"
|
||||
cp ~/.claude/settings.json "$backup"
|
||||
echo "✅ Backup created: $backup"
|
||||
else
|
||||
echo "⚠️ No existing settings.json found - will create new one"
|
||||
fi
|
||||
```
|
||||
|
||||
### Step 1.2: Read existing settings
|
||||
|
||||
```bash
|
||||
# Check current settings
|
||||
cat ~/.claude/settings.json
|
||||
```
|
||||
|
||||
**Important:** Preserve all existing settings when adding telemetry configuration!
|
||||
|
||||
---
|
||||
|
||||
## Phase 2: Update Claude Code Settings
|
||||
|
||||
### Step 2.1: Determine configuration based on authentication method
|
||||
|
||||
**Scenario A: API Key Authentication**
|
||||
|
||||
```json
|
||||
{
|
||||
"env": {
|
||||
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
|
||||
"OTEL_METRICS_EXPORTER": "otlp",
|
||||
"OTEL_LOGS_EXPORTER": "otlp",
|
||||
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
|
||||
"OTEL_EXPORTER_OTLP_ENDPOINT": "https://otel.company.com:4317",
|
||||
"OTEL_EXPORTER_OTLP_HEADERS": "Authorization=Bearer YOUR_API_KEY_HERE",
|
||||
"OTEL_METRIC_EXPORT_INTERVAL": "60000",
|
||||
"OTEL_LOGS_EXPORT_INTERVAL": "5000",
|
||||
"OTEL_LOG_USER_PROMPTS": "1",
|
||||
"OTEL_METRICS_INCLUDE_SESSION_ID": "true",
|
||||
"OTEL_METRICS_INCLUDE_VERSION": "true",
|
||||
"OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
|
||||
"OTEL_RESOURCE_ATTRIBUTES": "team=platform,environment=production,deployment=enterprise"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Scenario B: mTLS Certificate Authentication**
|
||||
|
||||
```json
|
||||
{
|
||||
"env": {
|
||||
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
|
||||
"OTEL_METRICS_EXPORTER": "otlp",
|
||||
"OTEL_LOGS_EXPORTER": "otlp",
|
||||
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
|
||||
"OTEL_EXPORTER_OTLP_ENDPOINT": "https://otel.company.com:4317",
|
||||
"OTEL_EXPORTER_OTLP_CERTIFICATE": "/path/to/ca-cert.pem",
|
||||
"OTEL_EXPORTER_OTLP_CLIENT_KEY": "/path/to/client-key.pem",
|
||||
"OTEL_EXPORTER_OTLP_CLIENT_CERTIFICATE": "/path/to/client-cert.pem",
|
||||
"OTEL_METRIC_EXPORT_INTERVAL": "60000",
|
||||
"OTEL_LOGS_EXPORT_INTERVAL": "5000",
|
||||
"OTEL_LOG_USER_PROMPTS": "1",
|
||||
"OTEL_METRICS_INCLUDE_SESSION_ID": "true",
|
||||
"OTEL_METRICS_INCLUDE_VERSION": "true",
|
||||
"OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
|
||||
"OTEL_RESOURCE_ATTRIBUTES": "team=platform,environment=production,deployment=enterprise"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Scenario C: HTTP Protocol (Port 4318)**
|
||||
|
||||
```json
|
||||
{
|
||||
"env": {
|
||||
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
|
||||
"OTEL_METRICS_EXPORTER": "otlp",
|
||||
"OTEL_LOGS_EXPORTER": "otlp",
|
||||
"OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
|
||||
"OTEL_EXPORTER_OTLP_ENDPOINT": "https://otel.company.com:4318",
|
||||
"OTEL_EXPORTER_OTLP_HEADERS": "Authorization=Bearer YOUR_API_KEY_HERE",
|
||||
"OTEL_METRIC_EXPORT_INTERVAL": "60000",
|
||||
"OTEL_LOGS_EXPORT_INTERVAL": "5000",
|
||||
"OTEL_LOG_USER_PROMPTS": "1",
|
||||
"OTEL_METRICS_INCLUDE_SESSION_ID": "true",
|
||||
"OTEL_METRICS_INCLUDE_VERSION": "true",
|
||||
"OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
|
||||
"OTEL_RESOURCE_ATTRIBUTES": "team=platform,environment=production,deployment=enterprise"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Scenario D: No Authentication (Internal Network)**
|
||||
|
||||
```json
|
||||
{
|
||||
"env": {
|
||||
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
|
||||
"OTEL_METRICS_EXPORTER": "otlp",
|
||||
"OTEL_LOGS_EXPORTER": "otlp",
|
||||
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
|
||||
"OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel.internal.company.com:4317",
|
||||
"OTEL_METRIC_EXPORT_INTERVAL": "60000",
|
||||
"OTEL_LOGS_EXPORT_INTERVAL": "5000",
|
||||
"OTEL_LOG_USER_PROMPTS": "1",
|
||||
"OTEL_METRICS_INCLUDE_SESSION_ID": "true",
|
||||
"OTEL_METRICS_INCLUDE_VERSION": "true",
|
||||
"OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
|
||||
"OTEL_RESOURCE_ATTRIBUTES": "team=platform,environment=production,deployment=enterprise"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Step 2.2: Update settings.json
|
||||
|
||||
**Method 1: Manual Update (Safest)**
|
||||
|
||||
1. Open `~/.claude/settings.json` in editor
|
||||
2. Merge the telemetry configuration into existing `env` object
|
||||
3. Preserve all other settings
|
||||
4. Save file
|
||||
|
||||
**Method 2: Programmatic Update (Use with Caution)**
|
||||
|
||||
```bash
|
||||
# Read existing settings
|
||||
existing_settings=$(cat ~/.claude/settings.json)
|
||||
|
||||
# Create merged settings (requires jq)
|
||||
cat ~/.claude/settings.json | jq '. + {
|
||||
"env": (.env // {} | . + {
|
||||
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
|
||||
"OTEL_METRICS_EXPORTER": "otlp",
|
||||
"OTEL_LOGS_EXPORTER": "otlp",
|
||||
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
|
||||
"OTEL_EXPORTER_OTLP_ENDPOINT": "https://otel.company.com:4317",
|
||||
"OTEL_EXPORTER_OTLP_HEADERS": "Authorization=Bearer YOUR_API_KEY_HERE",
|
||||
"OTEL_METRIC_EXPORT_INTERVAL": "60000",
|
||||
"OTEL_LOGS_EXPORT_INTERVAL": "5000",
|
||||
"OTEL_LOG_USER_PROMPTS": "1",
|
||||
"OTEL_METRICS_INCLUDE_SESSION_ID": "true",
|
||||
"OTEL_METRICS_INCLUDE_VERSION": "true",
|
||||
"OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
|
||||
"OTEL_RESOURCE_ATTRIBUTES": "team=platform,environment=production,deployment=enterprise"
|
||||
})
|
||||
}' > ~/.claude/settings.json.new
|
||||
|
||||
# Validate JSON
|
||||
if [ -s ~/.claude/settings.json.new ] && jq empty ~/.claude/settings.json.new 2>/dev/null; then
|
||||
mv ~/.claude/settings.json.new ~/.claude/settings.json
|
||||
echo "✅ Settings updated successfully"
|
||||
else
|
||||
echo "❌ Invalid JSON - settings.json left unchanged"
|
||||
rm ~/.claude/settings.json.new
|
||||
fi
|
||||
```
|
||||
|
||||
### Step 2.3: Validate configuration
|
||||
|
||||
```bash
|
||||
# Check that settings.json is valid JSON
|
||||
jq empty ~/.claude/settings.json
|
||||
|
||||
# Display telemetry configuration
|
||||
jq '.env | with_entries(select(.key | startswith("OTEL_") or . == "CLAUDE_CODE_ENABLE_TELEMETRY"))' ~/.claude/settings.json
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Phase 3: Test Connectivity (Optional)
|
||||
|
||||
### Step 3.1: Test OTEL endpoint reachability
|
||||
|
||||
```bash
|
||||
# Test gRPC endpoint (port 4317)
|
||||
nc -zv otel.company.com 4317
|
||||
|
||||
# Test HTTP endpoint (port 4318)
|
||||
curl -v https://otel.company.com:4318/v1/metrics -d '{}' -H "Content-Type: application/json"
|
||||
```
|
||||
|
||||
### Step 3.2: Validate authentication
|
||||
|
||||
```bash
|
||||
# Test with API key
|
||||
curl -v https://otel.company.com:4318/v1/metrics \
|
||||
-H "Authorization: Bearer YOUR_API_KEY_HERE" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{}'
|
||||
|
||||
# Expected: 200 (request accepted) or 401/403 (endpoint reachable, credentials rejected)
|
||||
# Unexpected: Connection refused, timeout (network issue)
|
||||
```
|
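To separate network problems from authentication problems, the HTTP status of the test request can be classified in a script. The following is a minimal sketch; `classify_otlp_status` is a hypothetical helper, and the categories simply restate the expectations listed above.

```bash
# classify_otlp_status HTTP_STATUS
# Maps the status code of the test request to a likely cause.
# curl reports 000 when it received no HTTP response at all.
classify_otlp_status() {
  case "$1" in
    2??)     echo "ok: endpoint reachable and request accepted" ;;
    401|403) echo "auth: endpoint reachable, credentials rejected" ;;
    000)     echo "network: no HTTP response (DNS, firewall, proxy, or TLS failure)" ;;
    *)       echo "other: endpoint reachable, returned HTTP $1" ;;
  esac
}

# Example: capture only the status code, then classify it
#   status="$(curl -s -o /dev/null -w '%{http_code}' \
#     https://otel.company.com:4318/v1/metrics \
#     -H "Authorization: Bearer YOUR_API_KEY_HERE" \
#     -H "Content-Type: application/json" -d '{}')"
#   classify_otlp_status "$status"
```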
||||
|
||||
---
|
||||
|
||||
## Phase 4: User Instructions
|
||||
|
||||
### Step 4.1: Provide restart instructions
|
||||
|
||||
**Display to user:**
|
||||
|
||||
```
|
||||
✅ Configuration complete!
|
||||
|
||||
**Important Next Steps:**
|
||||
|
||||
1. **Restart Claude Code** for telemetry to take effect
|
||||
- Telemetry configuration is only loaded at startup
|
||||
- Close all Claude Code sessions and restart
|
||||
|
||||
2. **Verify with your platform team** that they see metrics
|
||||
- Metrics should appear within 60 seconds of restart
|
||||
- Tagged with: team=platform, environment=production
|
||||
- Metric prefix: claude_code_claude_code_*
|
||||
|
||||
3. **Dashboard access**
|
||||
- Contact your platform team for Grafana/dashboard URLs
|
||||
- Dashboards should be centrally managed
|
||||
|
||||
**Troubleshooting:**
|
||||
|
||||
If metrics don't appear:
|
||||
- Check network connectivity to OTEL endpoint
|
||||
- Verify authentication credentials are correct
|
||||
- Check firewall rules allow outbound connections
|
||||
- Review OTEL Collector logs on backend (platform team)
|
||||
- Verify OTEL_EXPORTER_OTLP_ENDPOINT is correct
|
||||
|
||||
**Rollback:**
|
||||
|
||||
If you need to disable telemetry:
|
||||
- Restore backup: cp ~/.claude/settings.json.backup.TIMESTAMP ~/.claude/settings.json
|
||||
- Or set: "CLAUDE_CODE_ENABLE_TELEMETRY": "0"
|
||||
```
|
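The rollback instruction leaves `TIMESTAMP` for the user to fill in. A minimal sketch for finding the newest timestamped backup follows; `latest_backup` is a hypothetical helper, and it only covers backups named `settings.json.backup.YYYYMMDD-HHMMSS` as created in Step 1.1.

```bash
# latest_backup [DIRECTORY]
# Prints the newest timestamped settings backup in DIRECTORY (default ~/.claude).
# Glob results are sorted by name, and the YYYYMMDD-HHMMSS suffix sorts
# chronologically, so the last match is the most recent backup.
latest_backup() {
  local dir="${1:-$HOME/.claude}" latest="" f
  for f in "$dir"/settings.json.backup.*; do
    if [ -e "$f" ]; then
      latest="$f"
    fi
  done
  if [ -z "$latest" ]; then
    echo "No timestamped backup found in $dir" >&2
    return 1
  fi
  printf '%s\n' "$latest"
}

# Example rollback:
#   cp "$(latest_backup)" ~/.claude/settings.json
```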
||||
|
||||
---
|
||||
|
||||
## Phase 5: Create Team Rollout Documentation
|
||||
|
||||
### Step 5.1: Generate rollout guide for team distribution
|
||||
|
||||
**Create file: `claude-code-telemetry-setup-guide.md`**
|
||||
|
||||
```markdown
|
||||
# Claude Code Telemetry Setup Guide
|
||||
|
||||
**For:** [Team Name] Team Members
|
||||
**Last Updated:** [Date]
|
||||
|
||||
## Overview
|
||||
|
||||
We're collecting Claude Code usage telemetry to:
|
||||
- Track API costs and optimize spending
|
||||
- Measure productivity metrics (LOC, commits, PRs)
|
||||
- Understand token usage patterns
|
||||
- Identify high-value use cases
|
||||
|
||||
**Privacy:** Dashboards report metrics aggregated at the team level. The raw telemetry configured in the setup instructions still includes session IDs and account UUIDs.
|
||||
|
||||
## Setup Instructions
|
||||
|
||||
### Step 1: Backup Your Settings
|
||||
|
||||
```bash
|
||||
cp ~/.claude/settings.json ~/.claude/settings.json.backup
|
||||
```
|
||||
|
||||
### Step 2: Update Configuration
|
||||
|
||||
Add the following to your `~/.claude/settings.json`:
|
||||
|
||||
```json
|
||||
{
|
||||
"env": {
|
||||
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
|
||||
"OTEL_METRICS_EXPORTER": "otlp",
|
||||
"OTEL_LOGS_EXPORTER": "otlp",
|
||||
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
|
||||
"OTEL_EXPORTER_OTLP_ENDPOINT": "https://otel.company.com:4317",
|
||||
"OTEL_EXPORTER_OTLP_HEADERS": "Authorization=Bearer [PROVIDED_BY_PLATFORM_TEAM]",
|
||||
"OTEL_METRIC_EXPORT_INTERVAL": "60000",
|
||||
"OTEL_LOGS_EXPORT_INTERVAL": "5000",
|
||||
"OTEL_LOG_USER_PROMPTS": "1",
|
||||
"OTEL_METRICS_INCLUDE_SESSION_ID": "true",
|
||||
"OTEL_METRICS_INCLUDE_VERSION": "true",
|
||||
"OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
|
||||
"OTEL_RESOURCE_ATTRIBUTES": "team=[TEAM_NAME],environment=production"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Important:** Replace `[PROVIDED_BY_PLATFORM_TEAM]` with your API key.
|
||||
|
||||
### Step 3: Restart Claude Code
|
||||
|
||||
Close all Claude Code sessions and restart for changes to take effect.
|
||||
|
||||
### Step 4: Verify Setup
|
||||
|
||||
After 5 minutes of usage:
|
||||
1. Check team dashboard: [DASHBOARD_URL]
|
||||
2. Verify your metrics appear in the team aggregation
|
||||
3. Contact [TEAM_CONTACT] if you have issues
|
||||
|
||||
## What's Being Collected?
|
||||
|
||||
**Metrics:**
|
||||
- Session counts and active time
|
||||
- Token usage (input, output, cached)
|
||||
- API costs by model
|
||||
- Lines of code modified
|
||||
- Commits and PRs created
|
||||
|
||||
**Events/Logs:**
|
||||
- User prompts (full prompt text, because `OTEL_LOG_USER_PROMPTS` is set to `1`)
|
||||
- Tool executions
|
||||
- API requests
|
||||
|
||||
**NOT Collected:**
|
||||
- Source code content
|
||||
- File names or paths
|
||||
- Personal identifiers (beyond account UUID for deduplication)
|
||||
|
||||
## Dashboard Access
|
||||
|
||||
**Team Dashboard:** [URL]
|
||||
**Login:** Use your company SSO
|
||||
|
||||
## Support
|
||||
|
||||
**Issues?** Contact [TEAM_CONTACT] or #claude-code-telemetry Slack channel
|
||||
|
||||
**Opt-Out:** Contact [TEAM_CONTACT] if you need to opt out for specific projects
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Phase 6: Success Criteria
|
||||
|
||||
### Checklist for Mode 2 completion:
|
||||
|
||||
- ✅ Backed up existing settings.json
|
||||
- ✅ Updated settings with correct OTEL endpoint
|
||||
- ✅ Added authentication (API key or certificates)
|
||||
- ✅ Set team/environment resource attributes
|
||||
- ✅ Validated JSON configuration
|
||||
- ✅ Tested connectivity (optional)
|
||||
- ✅ Provided restart instructions to user
|
||||
- ✅ Created team rollout documentation (if applicable)
|
||||
|
||||
**Expected outcome:**
|
||||
- Claude Code sends telemetry to central endpoint within 60 seconds of restart
|
||||
- Platform team can see metrics tagged with team identifier
|
||||
- User has clear instructions for verification and troubleshooting
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Issue 1: Connection Refused
|
||||
|
||||
**Symptoms:** Claude Code can't reach OTEL endpoint
|
||||
|
||||
**Checks:**
|
||||
```bash
|
||||
# Test network connectivity
|
||||
ping otel.company.com
|
||||
|
||||
# Test port access
|
||||
nc -zv otel.company.com 4317
|
||||
|
||||
# Check corporate VPN/proxy
|
||||
echo $HTTPS_PROXY
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
- Connect to corporate VPN
|
||||
- Use HTTP proxy if required: `HTTPS_PROXY=http://proxy.company.com:8080`
|
||||
- Try HTTP protocol (port 4318) instead of gRPC
|
||||
- Contact network team to allow outbound connections
|
||||
|
||||
### Issue 2: Authentication Failed
|
||||
|
||||
**Symptoms:** 401 or 403 errors in logs
|
||||
|
||||
**Checks:**
|
||||
```bash
|
||||
# Verify API key format
|
||||
jq '.env.OTEL_EXPORTER_OTLP_HEADERS' ~/.claude/settings.json
|
||||
|
||||
# Test manually
|
||||
curl -v https://otel.company.com:4318/v1/metrics \
|
||||
-H "Authorization: Bearer YOUR_KEY" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{}'
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
- Verify API key is correct and not expired
|
||||
- Check header format: `Authorization=Bearer TOKEN` (no quotes, equals sign)
|
||||
- Confirm permissions with platform team
|
||||
- Try rotating API key
|
||||
|
||||
### Issue 3: Metrics Not Appearing
|
||||
|
||||
**Symptoms:** Platform team doesn't see metrics after 5 minutes
|
||||
|
||||
**Checks:**
|
||||
```bash
|
||||
# Verify telemetry is enabled
|
||||
jq '.env.CLAUDE_CODE_ENABLE_TELEMETRY' ~/.claude/settings.json
|
||||
|
||||
# Check endpoint configuration
|
||||
jq '.env.OTEL_EXPORTER_OTLP_ENDPOINT' ~/.claude/settings.json
|
||||
|
||||
# Confirm Claude Code was restarted
|
||||
ps aux | grep claude
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
- Restart Claude Code (telemetry loads at startup only)
|
||||
- Verify endpoint URL has correct protocol and port
|
||||
- Check with platform team if OTEL Collector is receiving data
|
||||
- Review OTEL Collector logs for errors
|
||||
- Verify resource attributes match expected format
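
For the last point, a quick format check on `OTEL_RESOURCE_ATTRIBUTES` can catch stray spaces or a missing `=`. A minimal sketch, assuming the expected format is comma-separated `key=value` pairs without spaces, as in the examples in this guide; `valid_resource_attrs` is a hypothetical name, and the platform team may enforce additional rules:

```bash
#!/usr/bin/env bash
# Hypothetical check: succeed only for "key=value[,key=value...]" with no
# spaces, no empty keys or values, and no trailing comma.
valid_resource_attrs() {
  local re='^[A-Za-z0-9_.-]+=[^,= ]+(,[A-Za-z0-9_.-]+=[^,= ]+)*$'
  [[ "$1" =~ $re ]]
}
```

For example, `valid_resource_attrs "team=platform, environment=production"` fails because of the space after the comma.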
|
||||
|
||||
### Issue 4: Certificate Errors (mTLS)
|
||||
|
||||
**Symptoms:** SSL/TLS handshake errors
|
||||
|
||||
**Checks:**
|
||||
```bash
|
||||
# Verify certificate paths
|
||||
ls -la /path/to/client-cert.pem
|
||||
ls -la /path/to/client-key.pem
|
||||
ls -la /path/to/ca-cert.pem
|
||||
|
||||
# Check certificate validity
|
||||
openssl x509 -in /path/to/client-cert.pem -noout -dates
|
||||
```
|
||||
|
||||
**Solutions:**
|
||||
- Ensure certificate files are readable
|
||||
- Verify certificates haven't expired
|
||||
- Check certificate chain is complete
|
||||
- Confirm CA certificate matches server
|
||||
- Contact platform team for new certificates if needed
|
||||
|
||||
---
|
||||
|
||||
## Enterprise Configuration Examples
|
||||
|
||||
### Example 1: Multi-Environment Setup
|
||||
|
||||
**Development:**
|
||||
```json
|
||||
"OTEL_RESOURCE_ATTRIBUTES": "team=platform,environment=development,user=john.doe"
|
||||
```
|
||||
|
||||
**Staging:**
|
||||
```json
|
||||
"OTEL_RESOURCE_ATTRIBUTES": "team=platform,environment=staging,user=john.doe"
|
||||
```
|
||||
|
||||
**Production:**
|
||||
```json
|
||||
"OTEL_RESOURCE_ATTRIBUTES": "team=platform,environment=production,user=john.doe"
|
||||
```
|
||||
|
||||
### Example 2: Department-Level Aggregation
|
||||
|
||||
```json
|
||||
"OTEL_RESOURCE_ATTRIBUTES": "department=engineering,team=platform,squad=backend,environment=production"
|
||||
```
|
||||
|
||||
Enables queries like:
|
||||
- Cost by department
|
||||
- Usage by team within department
|
||||
- Squad-level productivity metrics
|
||||
|
||||
### Example 3: Project-Based Tagging
|
||||
|
||||
```json
|
||||
"OTEL_RESOURCE_ATTRIBUTES": "team=platform,project=api-v2-migration,environment=production"
|
||||
```
|
||||
|
||||
Track costs and effort for specific initiatives.
|
||||
|
||||
---
|
||||
|
||||
## Additional Resources
|
||||
|
||||
- **OTEL Specification:** https://opentelemetry.io/docs/specs/otel/
|
||||
- **Claude Code Metrics Reference:** See `data/metrics-reference.md`
|
||||
- **Enterprise Architecture:** See `data/enterprise-architecture.md`
|
||||
- **Team Dashboard Queries:** See `data/prometheus-queries.md`
|
||||
|
||||
---
|
||||
|
||||
**Mode 2 Complete!** ✅
|
||||
@@ -0,0 +1,214 @@
|
||||
# Known Issues & Fixes
|
||||
|
||||
Common problems and solutions for Claude Code OpenTelemetry setup.
|
||||
|
||||
## Issue 1: Missing OTEL Exporters (Most Common)
|
||||
|
||||
**Problem**: Claude Code not sending telemetry even with `CLAUDE_CODE_ENABLE_TELEMETRY=1`
|
||||
|
||||
**Cause**: Missing required exporter settings
|
||||
|
||||
**Symptoms**:
|
||||
- No metrics in Prometheus after restart
|
||||
- OTEL Collector logs show no incoming connections
|
||||
- Dashboard shows "No data"
|
||||
|
||||
**Fix**: Add to settings.json:
|
||||
```json
|
||||
{
|
||||
"env": {
|
||||
"OTEL_METRICS_EXPORTER": "otlp",
|
||||
"OTEL_LOGS_EXPORTER": "otlp"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Important**: Restart Claude Code after adding!
|
||||
|
||||
## Issue 2: OTEL Collector Deprecated 'address' Field
|
||||
|
||||
**Problem**: OTEL Collector crashes with "'address' has invalid keys" error
|
||||
|
||||
**Cause**: The `address` field in `service.telemetry.metrics` is deprecated in v0.123.0+
|
||||
|
||||
**Fix**: Remove the address field:
|
||||
```yaml
|
||||
service:
|
||||
telemetry:
|
||||
metrics:
|
||||
level: detailed
|
||||
# REMOVE: address: ":8888"
|
||||
```
|
||||
|
||||
## Issue 3: OTEL Collector Deprecated Exporter
|
||||
|
||||
**Problem**: OTEL Collector fails with "logging exporter has been deprecated"
|
||||
|
||||
**Fix**: Use `debug` exporter instead:
|
||||
```yaml
|
||||
exporters:
|
||||
debug:
|
||||
verbosity: normal
|
||||
|
||||
service:
|
||||
pipelines:
|
||||
metrics:
|
||||
exporters: [prometheus, debug]
|
||||
```
|
||||
|
||||
## Issue 4: Dashboard Datasource Not Found
|
||||
|
||||
**Problem**: Grafana dashboard shows "datasource prometheus not found"
|
||||
|
||||
**Cause**: Dashboard has hardcoded UID that doesn't match your setup
|
||||
|
||||
**Fix**:
|
||||
|
||||
1. Detect your actual UID:
|
||||
```bash
|
||||
curl -s http://admin:admin@localhost:3000/api/datasources | jq -r '.[] | select(.type == "prometheus") | .uid'
|
||||
```
|
||||
|
||||
2. Replace all occurrences in dashboard JSON:
|
||||
```bash
|
||||
# macOS/BSD sed syntax; with GNU sed (Linux), drop the empty '' after -i
sed -i '' 's/"uid": "prometheus"/"uid": "YOUR_ACTUAL_UID"/g' dashboard.json
|
||||
```
|
||||
|
||||
3. Re-import the dashboard
|
||||
|
||||
## Issue 5: Metric Names Double Prefix
|
||||
|
||||
**Problem**: Dashboard queries fail because metrics have format `claude_code_claude_code_*`
|
||||
|
||||
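**Cause**: Claude Code already emits metric names with a `claude_code` prefix, and the OTEL Collector's Prometheus exporter prepends its own `namespace: claude_code`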
|
||||
|
||||
**Affected Metrics**:
|
||||
- `claude_code_claude_code_lines_of_code_count_total`
|
||||
- `claude_code_claude_code_cost_usage_USD_total`
|
||||
- `claude_code_claude_code_token_usage_tokens_total`
|
||||
- `claude_code_claude_code_active_time_seconds_total`
|
||||
- `claude_code_claude_code_commit_count_total`
|
||||
|
||||
**Fix**: Update dashboard queries to use actual metric names
|
||||
|
||||
**Verify actual names**:
|
||||
```bash
|
||||
curl -s http://localhost:9090/api/v1/label/__name__/values | jq '.data[]' | grep claude
|
||||
```
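
If a dashboard file uses single-prefixed names, the rewrite can be scripted. A minimal sketch, assuming the dashboard JSON refers to metrics such as `claude_code_token_usage_tokens_total`; `double_prefix_metrics` is a hypothetical filter, so confirm the actual names with the command above first:

```bash
#!/usr/bin/env bash
# Hypothetical filter: rewrite claude_code_* names read from stdin to the
# double-prefixed form. Any run of "claude_code_" prefixes is collapsed to
# exactly two, so already double-prefixed names are left unchanged.
# Caveat: it also rewrites non-metric strings that contain "claude_code_"
# (for example a label value), so review the diff before re-importing.
double_prefix_metrics() {
  sed -E 's/(claude_code_)+/claude_code_claude_code_/g'
}
```

For example, `double_prefix_metrics < dashboard.json > dashboard.fixed.json` writes a rewritten copy and leaves the original file untouched.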
|
||||
|
||||
## Issue 6: No Data in Prometheus
|
||||
|
||||
**Diagnostic Steps**:
|
||||
|
||||
1. **Check containers running**:
|
||||
```bash
|
||||
docker ps --format "table {{.Names}}\t{{.Status}}"
|
||||
```
|
||||
|
||||
2. **Check OTEL Collector logs**:
|
||||
```bash
|
||||
docker logs claude-otel-collector 2>&1 | tail -50
|
||||
```
|
||||
|
||||
3. **Query Prometheus directly**:
|
||||
```bash
|
||||
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq '.data.result'
|
||||
```
|
||||
|
||||
4. **Verify Claude Code settings**:
|
||||
```bash
|
||||
cat ~/.claude/settings.json | jq '.env'
|
||||
```
|
||||
|
||||
**Common Causes**:
|
||||
- Claude Code not restarted after settings change
|
||||
- Missing OTEL_METRICS_EXPORTER setting
|
||||
- Wrong endpoint (should be `http://localhost:4317` for the local stack)
|
||||
- Firewall blocking ports
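
The first two causes can be ruled out with a small check of the `env` block. A minimal sketch, assuming the three settings below are the ones required for export, as in Issue 1 and the settings templates; `check_telemetry_env` is a hypothetical helper that reads `KEY=VALUE` lines on stdin:

```bash
#!/usr/bin/env bash
# Hypothetical helper: read KEY=VALUE lines on stdin and print every required
# telemetry setting that is absent or has a different value.
# Returns 1 if anything was reported, 0 otherwise.
check_telemetry_env() {
  local input pair problems=0
  input="$(cat)"
  for pair in CLAUDE_CODE_ENABLE_TELEMETRY=1 OTEL_METRICS_EXPORTER=otlp OTEL_LOGS_EXPORTER=otlp; do
    # -x matches the whole line, -F treats the pair as a fixed string
    if ! grep -qxF -- "$pair" <<<"$input"; then
      echo "missing or wrong: $pair"
      problems=1
    fi
  done
  return "$problems"
}
```

The input can be produced with `jq -r '.env | to_entries[] | "\(.key)=\(.value)"' ~/.claude/settings.json | check_telemetry_env`.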
|
||||
|
||||
## Issue 7: Port Conflicts
|
||||
|
||||
**Problem**: Container fails to start due to port already in use
|
||||
|
||||
**Check ports**:
|
||||
```bash
|
||||
for port in 3000 4317 4318 8889 9090; do
|
||||
lsof -i :$port && echo "Port $port in use"
|
||||
done
|
||||
```
|
||||
|
||||
**Solutions**:
|
||||
- Stop conflicting service
|
||||
- Change port in docker-compose.yml
|
||||
- Use different port mapping
|
||||
|
||||
## Issue 8: Docker Not Running
|
||||
|
||||
**Problem**: Commands fail with "Cannot connect to Docker daemon"
|
||||
|
||||
**Fix**:
|
||||
1. Start Docker Desktop application
|
||||
2. Wait for it to fully initialize
|
||||
3. Verify: `docker info`
|
||||
|
||||
## Issue 9: Insufficient Disk Space
|
||||
|
||||
**Problem**: Containers fail to start or crash
|
||||
|
||||
**Required**: Minimum 2GB free
|
||||
|
||||
**Check**:
|
||||
```bash
|
||||
df -h ~/.claude
|
||||
```
|
||||
|
||||
**Solutions**:
|
||||
- Clean Docker: `docker system prune`
|
||||
- Remove old images: `docker image prune -a`
|
||||
- Clear telemetry volumes: `~/.claude/telemetry/cleanup-telemetry.sh`
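
The 2GB requirement can be tested in a script before starting the stack. A minimal sketch using POSIX `df -Pk`; `has_free_space` is a hypothetical helper, and the path to check is an assumption, since Docker may keep images and volumes on a different filesystem than `~/.claude`:

```bash
#!/usr/bin/env bash
# Hypothetical check: succeed if the filesystem holding PATH has at least
# MIN_GB gigabytes available (default 2, matching the requirement above).
# Assumes the filesystem name in df output contains no spaces.
has_free_space() {
  local path="$1" min_gb="${2:-2}" free_kb
  # df -Pk prints one header line, then a line whose 4th column is
  # the available space in 1024-byte blocks
  free_kb="$(df -Pk "$path" | awk 'NR==2 {print $4}')"
  [ "$free_kb" -ge $(( min_gb * 1024 * 1024 )) ]
}
```

For example, `has_free_space ~/.claude || echo "Less than 2GB free"` prints a warning when space is short.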
|
||||
|
||||
## Issue 10: Grafana Dashboard Empty After Import
|
||||
|
||||
**Diagnostic Steps**:
|
||||
|
||||
1. Check time range (upper right) - data might be outside range
|
||||
2. Verify datasource is connected (green checkmark in settings)
|
||||
3. Run test query in Explore view
|
||||
4. Check metric names match actual names in Prometheus
|
||||
|
||||
## Debugging Commands
|
||||
|
||||
```bash
|
||||
# Full container status
|
||||
docker compose -f ~/.claude/telemetry/docker-compose.yml ps
|
||||
|
||||
# OTEL Collector config (the host file mounted into the container)
cat ~/.claude/telemetry/otel-collector-config.yml
|
||||
|
||||
# Prometheus targets
|
||||
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets'
|
||||
|
||||
# Grafana datasources
|
||||
curl -s http://admin:admin@localhost:3000/api/datasources | jq '.'
|
||||
|
||||
# All available metrics
|
||||
curl -s http://localhost:9090/api/v1/label/__name__/values | jq '.data | length'
|
||||
```
|
||||
|
||||
## Getting Help
|
||||
|
||||
If issues persist:
|
||||
|
||||
1. Collect diagnostics:
|
||||
```bash
|
||||
docker compose -f ~/.claude/telemetry/docker-compose.yml logs > telemetry-logs.txt
|
||||
cat ~/.claude/settings.json | jq '.env' > settings-env.txt
|
||||
```
|
||||
|
||||
2. Check versions:
|
||||
```bash
|
||||
docker --version
|
||||
docker compose version
|
||||
```
|
||||
|
||||
3. Provide: logs, settings, versions, and exact error message
|
||||
@@ -0,0 +1,38 @@
|
||||
#!/bin/bash
|
||||
# Full Cleanup of Claude Code Telemetry Stack
|
||||
# WARNING: This removes all data including Docker volumes
|
||||
|
||||
echo "⚠️ WARNING: This will remove ALL telemetry data including:"
|
||||
echo " - All containers"
|
||||
echo " - All Docker volumes (Grafana, Prometheus, Loki data)"
|
||||
echo " - Network configuration"
|
||||
echo ""
|
||||
read -p "Are you sure you want to proceed? (yes/no): " -r
|
||||
echo
|
||||
|
||||
if [[ ! $REPLY =~ ^[Yy][Ee][Ss]$ ]]; then
|
||||
echo "Cleanup cancelled."
|
||||
exit 0
|
||||
fi
|
||||
|
||||
echo "Performing full cleanup of Claude Code telemetry stack..."
|
||||
|
||||
# Navigate to telemetry directory
|
||||
cd ~/.claude/telemetry || exit 1
|
||||
|
||||
# Stop and remove containers, networks, and volumes
|
||||
docker compose down -v
|
||||
|
||||
echo ""
|
||||
echo "✅ Full cleanup complete!"
|
||||
echo ""
|
||||
echo "Removed:"
|
||||
echo " ✓ All containers (otel-collector, prometheus, loki, grafana)"
|
||||
echo " ✓ All volumes (all historical data)"
|
||||
echo " ✓ Network configuration"
|
||||
echo ""
|
||||
echo "Preserved:"
|
||||
echo " ✓ Configuration files in ~/.claude/telemetry/"
|
||||
echo " ✓ Claude Code settings in ~/.claude/settings.json"
|
||||
echo ""
|
||||
echo "To start fresh: ./start-telemetry.sh"
|
||||
@@ -0,0 +1,74 @@
|
||||
services:
|
||||
# OpenTelemetry Collector - receives telemetry from Claude Code
|
||||
otel-collector:
|
||||
image: otel/opentelemetry-collector-contrib:0.115.1
|
||||
container_name: claude-otel-collector
|
||||
command: ["--config=/etc/otel-collector-config.yml"]
|
||||
volumes:
|
||||
- ./otel-collector-config.yml:/etc/otel-collector-config.yml
|
||||
ports:
|
||||
- "4317:4317" # OTLP gRPC receiver
|
||||
- "4318:4318" # OTLP HTTP receiver
|
||||
- "8889:8889" # Prometheus metrics exporter
|
||||
networks:
|
||||
- claude-telemetry
|
||||
|
||||
# Prometheus - stores metrics
|
||||
prometheus:
|
||||
image: prom/prometheus:v2.55.1
|
||||
container_name: claude-prometheus
|
||||
command:
|
||||
- '--config.file=/etc/prometheus/prometheus.yml'
|
||||
- '--storage.tsdb.path=/prometheus'
|
||||
- '--web.console.libraries=/etc/prometheus/console_libraries'
|
||||
- '--web.console.templates=/etc/prometheus/consoles'
|
||||
- '--web.enable-lifecycle'
|
||||
volumes:
|
||||
- ./prometheus.yml:/etc/prometheus/prometheus.yml
|
||||
- prometheus-data:/prometheus
|
||||
ports:
|
||||
- "9090:9090"
|
||||
networks:
|
||||
- claude-telemetry
|
||||
depends_on:
|
||||
- otel-collector
|
||||
|
||||
# Loki - stores logs
|
||||
loki:
|
||||
image: grafana/loki:3.0.0
|
||||
container_name: claude-loki
|
||||
ports:
|
||||
- "3100:3100"
|
||||
command: -config.file=/etc/loki/local-config.yaml
|
||||
volumes:
|
||||
- loki-data:/loki
|
||||
networks:
|
||||
- claude-telemetry
|
||||
|
||||
# Grafana - visualization dashboards
|
||||
grafana:
|
||||
image: grafana/grafana:11.3.0
|
||||
container_name: claude-grafana
|
||||
ports:
|
||||
- "3000:3000"
|
||||
environment:
|
||||
- GF_SECURITY_ADMIN_USER=admin
|
||||
- GF_SECURITY_ADMIN_PASSWORD=admin
|
||||
- GF_USERS_ALLOW_SIGN_UP=false
|
||||
volumes:
|
||||
- grafana-data:/var/lib/grafana
|
||||
- ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
|
||||
networks:
|
||||
- claude-telemetry
|
||||
depends_on:
|
||||
- prometheus
|
||||
- loki
|
||||
|
||||
networks:
|
||||
claude-telemetry:
|
||||
driver: bridge
|
||||
|
||||
volumes:
|
||||
prometheus-data:
|
||||
loki-data:
|
||||
grafana-data:
|
||||
@@ -0,0 +1,19 @@
|
||||
apiVersion: 1
|
||||
|
||||
datasources:
|
||||
- name: Prometheus
|
||||
type: prometheus
|
||||
access: proxy
|
||||
url: http://prometheus:9090
|
||||
isDefault: true
|
||||
editable: true
|
||||
jsonData:
|
||||
timeInterval: "15s"
|
||||
|
||||
- name: Loki
|
||||
type: loki
|
||||
access: proxy
|
||||
url: http://loki:3100
|
||||
editable: true
|
||||
jsonData:
|
||||
maxLines: 1000
|
||||
@@ -0,0 +1,56 @@
|
||||
receivers:
|
||||
otlp:
|
||||
protocols:
|
||||
grpc:
|
||||
endpoint: 0.0.0.0:4317
|
||||
http:
|
||||
endpoint: 0.0.0.0:4318
|
||||
|
||||
processors:
|
||||
batch:
|
||||
timeout: 10s
|
||||
send_batch_size: 1024
|
||||
|
||||
resource:
|
||||
attributes:
|
||||
- key: service.name
|
||||
value: claude-code
|
||||
action: upsert
|
||||
|
||||
memory_limiter:
|
||||
check_interval: 1s
|
||||
limit_mib: 512
|
||||
|
||||
exporters:
|
||||
# Export metrics to Prometheus
|
||||
prometheus:
|
||||
endpoint: "0.0.0.0:8889"
|
||||
namespace: claude_code
|
||||
const_labels:
|
||||
source: claude_code_telemetry
|
||||
|
||||
# Export logs to Loki via OTLP HTTP
|
||||
otlphttp/loki:
|
||||
endpoint: http://loki:3100/otlp
|
||||
tls:
|
||||
insecure: true
|
||||
|
||||
# Debug exporter (outputs to console for troubleshooting)
|
||||
debug:
|
||||
verbosity: normal
|
||||
|
||||
service:
|
||||
pipelines:
|
||||
metrics:
|
||||
receivers: [otlp]
|
||||
processors: [memory_limiter, batch, resource]
|
||||
exporters: [prometheus, debug]
|
||||
|
||||
logs:
|
||||
receivers: [otlp]
|
||||
processors: [memory_limiter, batch, resource]
|
||||
exporters: [otlphttp/loki, debug]
|
||||
|
||||
telemetry:
|
||||
logs:
|
||||
level: info
|
||||
@@ -0,0 +1,14 @@
|
||||
global:
|
||||
scrape_interval: 15s
|
||||
evaluation_interval: 15s
|
||||
|
||||
scrape_configs:
|
||||
- job_name: 'otel-collector'
|
||||
static_configs:
|
||||
- targets: ['otel-collector:8889']
|
||||
labels:
|
||||
source: 'claude-code'
|
||||
|
||||
- job_name: 'prometheus'
|
||||
static_configs:
|
||||
- targets: ['localhost:9090']
|
||||
@@ -0,0 +1,17 @@
|
||||
{
|
||||
"env": {
|
||||
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
|
||||
"OTEL_METRICS_EXPORTER": "otlp",
|
||||
"OTEL_LOGS_EXPORTER": "otlp",
|
||||
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
|
||||
"OTEL_EXPORTER_OTLP_ENDPOINT": "https://otel.company.com:4317",
|
||||
"OTEL_EXPORTER_OTLP_HEADERS": "Authorization=Bearer YOUR_API_KEY_HERE",
|
||||
"OTEL_METRIC_EXPORT_INTERVAL": "60000",
|
||||
"OTEL_LOGS_EXPORT_INTERVAL": "5000",
|
||||
"OTEL_LOG_USER_PROMPTS": "1",
|
||||
"OTEL_METRICS_INCLUDE_SESSION_ID": "true",
|
||||
"OTEL_METRICS_INCLUDE_VERSION": "true",
|
||||
"OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
|
||||
"OTEL_RESOURCE_ATTRIBUTES": "team=TEAM_NAME,environment=production,deployment=enterprise"
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,16 @@
|
||||
{
|
||||
"env": {
|
||||
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
|
||||
"OTEL_METRICS_EXPORTER": "otlp",
|
||||
"OTEL_LOGS_EXPORTER": "otlp",
|
||||
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
|
||||
"OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
|
||||
"OTEL_METRIC_EXPORT_INTERVAL": "60000",
|
||||
"OTEL_LOGS_EXPORT_INTERVAL": "5000",
|
||||
"OTEL_LOG_USER_PROMPTS": "1",
|
||||
"OTEL_METRICS_INCLUDE_SESSION_ID": "true",
|
||||
"OTEL_METRICS_INCLUDE_VERSION": "true",
|
||||
"OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
|
||||
"OTEL_RESOURCE_ATTRIBUTES": "environment=local,deployment=poc"
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,39 @@
|
||||
#!/bin/bash
|
||||
# Start Claude Code Telemetry Stack
|
||||
|
||||
echo "Starting Claude Code telemetry stack..."
|
||||
|
||||
# Check if Docker is running
|
||||
if ! docker info > /dev/null 2>&1; then
|
||||
echo "❌ Error: Docker is not running. Please start Docker Desktop first."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Navigate to telemetry directory
|
||||
cd ~/.claude/telemetry || exit 1
|
||||
|
||||
# Start containers
|
||||
docker compose up -d
|
||||
|
||||
# Wait for services to be ready
|
||||
echo "Waiting for services to be ready..."
|
||||
sleep 10
|
||||
|
||||
# Check container status
|
||||
echo ""
|
||||
echo "Container Status:"
|
||||
docker compose ps
|
||||
|
||||
echo ""
|
||||
echo "✅ Telemetry stack started!"
|
||||
echo ""
|
||||
echo "Access Points:"
|
||||
echo " - Grafana: http://localhost:3000 (admin/admin)"
|
||||
echo " - Prometheus: http://localhost:9090"
|
||||
echo " - Loki: http://localhost:3100"
|
||||
echo ""
|
||||
echo "OTEL Endpoints:"
|
||||
echo " - gRPC: http://localhost:4317"
|
||||
echo " - HTTP: http://localhost:4318"
|
||||
echo ""
|
||||
echo "Next: Restart Claude Code to start sending telemetry data"
|
||||
@@ -0,0 +1,16 @@
|
||||
#!/bin/bash
|
||||
# Stop Claude Code Telemetry Stack
|
||||
|
||||
echo "Stopping Claude Code telemetry stack..."
|
||||
|
||||
# Navigate to telemetry directory
|
||||
cd ~/.claude/telemetry || exit 1
|
||||
|
||||
# Stop containers
|
||||
docker compose down
|
||||
|
||||
echo "✅ Telemetry stack stopped!"
|
||||
echo ""
|
||||
echo "Note: Data is preserved in Docker volumes."
|
||||
echo "To start again: ./start-telemetry.sh"
|
||||
echo "To completely remove all data: ./cleanup-telemetry.sh"
|
||||