Files
2025-11-29 18:16:51 +08:00

559 lines
14 KiB
Markdown

# Claude Code OpenTelemetry Setup Skill
Automated workflow for setting up OpenTelemetry telemetry collection for Claude Code usage monitoring, cost tracking, and productivity analytics.
**Version:** 1.0.0
**Author:** Prometheus Team
---
## Features
- **Mode 1: Local PoC Setup** - Full Docker stack with Grafana dashboards
- **Mode 2: Enterprise Setup** - Connect to centralized infrastructure
- Automated configuration file generation
- Dashboard import with UID detection
- Verification and testing procedures
- Comprehensive troubleshooting guides
---
## Quick Start
### Prerequisites
**For Mode 1 (Local PoC):**
- Docker Desktop installed and running
- Claude Code installed
- Write access to `~/.claude/settings.json`
**For Mode 2 (Enterprise):**
- OTEL Collector endpoint URL
- Authentication credentials
- Write access to `~/.claude/settings.json`
### Installation
This skill is designed to be invoked by Claude Code. No manual installation required.
### Usage
**Mode 1 - Local PoC Setup:**
```
"Set up Claude Code telemetry locally"
"I want to try OpenTelemetry with Claude Code"
"Create a local telemetry stack for me"
```
**Mode 2 - Enterprise Setup:**
```
"Connect Claude Code to our company OTEL endpoint at otel.company.com:4317"
"Set up telemetry for team rollout"
"Configure enterprise telemetry"
```
---
## What Gets Collected?
### Metrics
- **Session counts and active time** - How much you use Claude Code
- **Token usage** - Input, output, cached tokens by model
- **API costs** - Spend tracking by model and time
- **Lines of code** - Code modifications (added, changed, deleted)
- **Commits and PRs** - Git activity tracking
### Events/Logs
- User prompts (if enabled)
- Tool executions
- API requests
- Session lifecycle
**Privacy:** Metrics are anonymized. Source code content is never collected.
---
## Directory Structure
```
claude-code-otel-setup/
├── SKILL.md # Main skill definition
├── README.md # This file
├── modes/
│ ├── mode1-poc-setup.md # Detailed local setup workflow
│ └── mode2-enterprise.md # Detailed enterprise setup workflow
├── templates/
│ ├── docker-compose.yml # Docker Compose configuration
│ ├── otel-collector-config.yml # OTEL Collector configuration
│ ├── prometheus.yml # Prometheus scrape configuration
│ ├── grafana-datasources.yml # Grafana datasource provisioning
│ ├── settings.json.local # Local telemetry settings template
│ ├── settings.json.enterprise # Enterprise settings template
│ ├── start-telemetry.sh # Start script
│ └── stop-telemetry.sh # Stop script
├── dashboards/
│ ├── README.md # Dashboard import guide
│ ├── claude-code-overview.json # Comprehensive dashboard
│ └── claude-code-simple.json # Simplified dashboard
└── data/
├── metrics-reference.md # Complete metrics documentation
├── prometheus-queries.md # Useful PromQL queries
└── troubleshooting.md # Common issues and solutions
```
---
## Mode 1: Local PoC Setup
**What it does:**
- Creates `~/.claude/telemetry/` directory
- Generates Docker Compose configuration
- Starts 4 containers: OTEL Collector, Prometheus, Loki, Grafana
- Updates Claude Code settings.json
- Imports Grafana dashboards
- Verifies data flow
**Time:** 5-7 minutes
**Output:**
- Grafana: http://localhost:3000 (admin/admin)
- Prometheus: http://localhost:9090
- Working dashboards with real data
**Detailed workflow:** See `modes/mode1-poc-setup.md`
---
## Mode 2: Enterprise Setup
**What it does:**
- Collects enterprise OTEL endpoint details
- Updates Claude Code settings.json with endpoint and auth
- Adds team/environment resource attributes
- Tests connectivity (optional)
- Provides team rollout documentation
**Time:** 2-3 minutes
**Output:**
- Claude Code configured to send to central endpoint
- Connectivity verified
- Team rollout guide generated
**Detailed workflow:** See `modes/mode2-enterprise.md`
---
## Example Dashboards
### Overview Dashboard
Includes:
- Total Lines of Code (all-time)
- Total Cost (24h)
- Total Tokens (24h)
- Active Time (24h)
- Cost Over Time (timeseries)
- Token Usage by Type (stacked)
- Lines of Code Modified (bar chart)
- Commits Created (24h)
### Custom Queries
See `data/prometheus-queries.md` for 50+ ready-to-use PromQL queries:
- Cost analysis
- Token usage
- Productivity metrics
- Team aggregation
- Model comparison
- Alerting rules
---
## Common Use Cases
### Individual Developer
**Goal:** Track personal Claude Code usage and costs
**Setup:**
```
Mode 1 (Local PoC)
```
**Access:**
- Personal Grafana dashboard at localhost:3000
- All data stays local
---
### Team Pilot (5-10 Users)
**Goal:** Aggregate metrics across pilot users
**Setup:**
```
Mode 2 (Enterprise)
```
**Architecture:**
- Centralized OTEL Collector
- Team-level Prometheus/Grafana
- Aggregated dashboards
---
### Enterprise Rollout (100+ Users)
**Goal:** Organization-wide cost tracking and productivity analytics
**Setup:**
```
Mode 2 (Enterprise) + Managed Infrastructure
```
**Features:**
- Department/team/project attribution
- Chargeback reporting
- Executive dashboards
- Trend analysis
---
## Troubleshooting
### Quick Checks
**Containers not starting:**
```bash
docker compose logs
```
**No metrics in Prometheus:**
1. Restart Claude Code (telemetry loads at startup)
2. Wait 60 seconds (export interval)
3. Check OTEL Collector logs: `docker compose logs otel-collector`
**Dashboard shows "No data":**
1. Verify metric names use double prefix: `claude_code_claude_code_*`
2. Check time range (top-right corner)
3. Verify datasource UID matches
**Full troubleshooting guide:** See `data/troubleshooting.md`
---
## Known Issues
### Issue 1: 🚨 CRITICAL - Missing OTEL Exporters
**Description:** Claude Code not sending telemetry even with `CLAUDE_CODE_ENABLE_TELEMETRY=1`
**Cause:** Missing required `OTEL_METRICS_EXPORTER` and `OTEL_LOGS_EXPORTER` settings
**Solution:** The skill templates include these by default. **Always verify** they're present in settings.json. See Configuration Reference for details.
---
### Issue 2: OTEL Collector Deprecated 'address' Field
**Description:** Collector crashes with "'address' has invalid keys" error
**Cause:** The `address` field in `service.telemetry.metrics` is deprecated in collector v0.123.0+
**Solution:** Skill templates have this removed. If using custom config, remove the deprecated field.
---
### Issue 3: Metric Double Prefix
**Description:** Metrics are named `claude_code_claude_code_*` instead of `claude_code_*`
**Cause:** OTEL Collector Prometheus exporter adds namespace prefix
**Solution:** This is expected. Dashboards use correct naming.
---
### Issue 4: Dashboard Datasource UID Mismatch
**Description:** Dashboard shows "datasource prometheus not found"
**Cause:** Dashboard has hardcoded UID that doesn't match your Grafana
**Solution:** Skill automatically detects and fixes UID during import
---
### Issue 5: OTEL Collector Deprecated Exporter
**Description:** Container fails with "logging exporter has been deprecated"
**Cause:** Old OTEL configuration
**Solution:** Skill uses `debug` exporter (not deprecated `logging`)
---
## Configuration Reference
### Settings.json (Local)
**🚨 CRITICAL REQUIREMENTS:**
The following settings are **REQUIRED** (not optional) for telemetry to work:
- `CLAUDE_CODE_ENABLE_TELEMETRY: "1"` - Enables telemetry system
- `OTEL_METRICS_EXPORTER: "otlp"` - **REQUIRED** to send metrics (most common missing setting!)
- `OTEL_LOGS_EXPORTER: "otlp"` - **REQUIRED** to send events/logs
Without `OTEL_METRICS_EXPORTER` and `OTEL_LOGS_EXPORTER`, telemetry will not send even if `CLAUDE_CODE_ENABLE_TELEMETRY=1` is set.
```json
{
"env": {
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
"OTEL_METRICS_EXPORTER": "otlp", // REQUIRED!
"OTEL_LOGS_EXPORTER": "otlp", // REQUIRED!
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
"OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
"OTEL_METRIC_EXPORT_INTERVAL": "60000",
"OTEL_LOGS_EXPORT_INTERVAL": "5000",
"OTEL_LOG_USER_PROMPTS": "1",
"OTEL_METRICS_INCLUDE_SESSION_ID": "true",
"OTEL_METRICS_INCLUDE_VERSION": "true",
"OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
"OTEL_RESOURCE_ATTRIBUTES": "environment=local,deployment=poc"
}
}
```
### Settings.json (Enterprise)
**Same CRITICAL requirements apply:**
- `OTEL_METRICS_EXPORTER: "otlp"` - **REQUIRED!**
- `OTEL_LOGS_EXPORTER: "otlp"` - **REQUIRED!**
```json
{
"env": {
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
"OTEL_METRICS_EXPORTER": "otlp", // REQUIRED!
"OTEL_LOGS_EXPORTER": "otlp", // REQUIRED!
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
"OTEL_EXPORTER_OTLP_ENDPOINT": "https://otel.company.com:4317",
"OTEL_EXPORTER_OTLP_HEADERS": "Authorization=Bearer YOUR_API_KEY",
"OTEL_METRIC_EXPORT_INTERVAL": "60000",
"OTEL_LOGS_EXPORT_INTERVAL": "5000",
"OTEL_LOG_USER_PROMPTS": "1",
"OTEL_METRICS_INCLUDE_SESSION_ID": "true",
"OTEL_METRICS_INCLUDE_VERSION": "true",
"OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
"OTEL_RESOURCE_ATTRIBUTES": "team=platform,environment=production"
}
}
```
---
## Management
### Start Telemetry Stack (Mode 1)
```bash
~/.claude/telemetry/start-telemetry.sh
```
### Stop Telemetry Stack (Mode 1)
```bash
~/.claude/telemetry/stop-telemetry.sh
```
### Check Status
```bash
docker compose ps
```
### View Logs
```bash
docker compose logs -f
```
### Restart Services
```bash
docker compose restart
```
---
## Data Retention
**Default:** 15 days in Prometheus
**Adjust retention:**
Edit `docker-compose.yml` or `prometheus.yml`:
```yaml
command:
- '--storage.tsdb.retention.time=90d'
- '--storage.tsdb.retention.size=50GB'
```
**Disk usage:** ~1-2 MB per day per active user
---
## Security Considerations
### Local Setup (Mode 1)
- Grafana accessible only on localhost
- Default credentials: admin/admin (change after first login)
- No external network exposure
- Data stored in Docker volumes
### Enterprise Setup (Mode 2)
- Use HTTPS endpoints
- Store API keys securely (environment variables, secrets manager)
- Enable mTLS for production
- Tag metrics with team/project for proper attribution
---
## Performance Tuning
### Reduce OTEL Collector Memory
Edit `otel-collector-config.yml`:
```yaml
processors:
memory_limiter:
limit_mib: 256 # Reduce from default
```
### Reduce Prometheus Retention
Edit `docker-compose.yml`:
```yaml
command:
- '--storage.tsdb.retention.time=7d' # Reduce from 15d
```
### Optimize Dashboard Queries
- Use recording rules for expensive queries
- Reduce dashboard time ranges
- Increase refresh intervals
See `data/prometheus-queries.md` for recording rule examples
---
## Integration Examples
### Cost Alerts (PagerDuty/Slack)
```yaml
# alertmanager.yml
groups:
- name: claude_code_cost
rules:
- alert: HighDailyCost
expr: sum(increase(claude_code_claude_code_cost_usage_USD_total[24h])) > 100
annotations:
summary: "Claude Code daily cost exceeded $100"
```
### Weekly Cost Reports (Email)
Use Grafana Reporting:
1. Create dashboard with cost panels
2. Set up email delivery
3. Schedule weekly reports
### Chargeback Integration
Export metrics to data warehouse:
```yaml
# Use Prometheus remote write
remote_write:
- url: "https://datawarehouse.company.com/prometheus"
```
---
## Contributing
This skill is maintained by the Prometheus Team.
**Feedback:** Open an issue or contact the team
**Improvements:** Submit pull requests with enhancements
---
## Changelog
### Version 1.1.0 (2025-11-01)
**Critical Updates from Production Testing:**
- 🚨 **CRITICAL FIX**: Documented missing OTEL_METRICS_EXPORTER/OTEL_LOGS_EXPORTER as #1 cause of "telemetry not working"
- ✅ Added deprecated `address` field fix for OTEL Collector v0.123.0+
- ✅ Enhanced troubleshooting with prominent exporter configuration section
- ✅ Updated all documentation with CRITICAL warnings for required settings
- ✅ Added comprehensive Known Issues section covering production scenarios
- ✅ Verified templates have correct exporter configuration
**What Changed:**
- Troubleshooting guide now prioritizes missing exporters as root cause
- Known Issues expanded from 3 to 6 issues with production learnings
- Configuration Reference includes prominent CRITICAL requirements callout
- SKILL.md Important Reminders section updated with exporter warnings
### Version 1.0.0 (2025-10-31)
**Initial Release:**
- Mode 1: Local PoC setup with full Docker stack
- Mode 2: Enterprise setup with centralized endpoint
- Comprehensive documentation and troubleshooting
- Dashboard templates with correct metric naming
- Automated UID detection and replacement
**Known Issues Fixed:**
- ✅ OTEL Collector deprecated logging exporter
- ✅ Dashboard datasource UID mismatch
- ✅ Metric double prefix handling
- ✅ Loki exporter configuration
---
## Additional Resources
- **Claude Code Monitoring Docs:** https://docs.claude.com/claude-code/monitoring
- **OpenTelemetry Docs:** https://opentelemetry.io/docs/
- **Prometheus Docs:** https://prometheus.io/docs/
- **Grafana Docs:** https://grafana.com/docs/
---
## License
Internal use within Elsevier organization.
---
## Support
**Issues?** Check `data/troubleshooting.md` first
**Questions?** Contact Prometheus Team or #claude-code-telemetry channel
**Emergency?** Rollback with: `cp ~/.claude/settings.json.backup ~/.claude/settings.json`
---
**Ready to monitor your Claude Code usage!** 🚀