# Claude Code OpenTelemetry Setup Skill

Automated workflow for setting up OpenTelemetry collection for Claude Code usage monitoring, cost tracking, and productivity analytics.

**Version:** 1.1.0
**Author:** Prometheus Team

---

## Features

- **Mode 1: Local PoC Setup** - Full Docker stack with Grafana dashboards
- **Mode 2: Enterprise Setup** - Connect to centralized infrastructure
- Automated configuration file generation
- Dashboard import with UID detection
- Verification and testing procedures
- Comprehensive troubleshooting guides

---

## Quick Start

### Prerequisites

**For Mode 1 (Local PoC):**
- Docker Desktop installed and running
- Claude Code installed
- Write access to `~/.claude/settings.json`

**For Mode 2 (Enterprise):**
- OTEL Collector endpoint URL
- Authentication credentials
- Write access to `~/.claude/settings.json`

### Installation

This skill is designed to be invoked by Claude Code. No manual installation is required.

### Usage

**Mode 1 - Local PoC Setup:**
```
"Set up Claude Code telemetry locally"
"I want to try OpenTelemetry with Claude Code"
"Create a local telemetry stack for me"
```

**Mode 2 - Enterprise Setup:**
```
"Connect Claude Code to our company OTEL endpoint at otel.company.com:4317"
"Set up telemetry for team rollout"
"Configure enterprise telemetry"
```

---

## What Gets Collected?

### Metrics
- **Session counts and active time** - How much you use Claude Code
- **Token usage** - Input, output, cached tokens by model
- **API costs** - Spend tracking by model and time
- **Lines of code** - Code modifications (added, changed, deleted)
- **Commits and PRs** - Git activity tracking

### Events/Logs
- User prompts (if enabled)
- Tool executions
- API requests
- Session lifecycle

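**Privacy:** Source code content is never collected. Prompt text is exported only when `OTEL_LOG_USER_PROMPTS` is enabled. The settings templates in the Configuration Reference enable it and also attach session ID and account UUID to metrics, so remove those entries if prompt text or per-account attribution should not leave the machine.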

---

## Directory Structure

```
claude-code-otel-setup/
├── SKILL.md                  # Main skill definition
├── README.md                 # This file
├── modes/
│   ├── mode1-poc-setup.md    # Detailed local setup workflow
│   └── mode2-enterprise.md   # Detailed enterprise setup workflow
├── templates/
│   ├── docker-compose.yml    # Docker Compose configuration
│   ├── otel-collector-config.yml  # OTEL Collector configuration
│   ├── prometheus.yml        # Prometheus scrape configuration
│   ├── grafana-datasources.yml    # Grafana datasource provisioning
│   ├── settings.json.local   # Local telemetry settings template
│   ├── settings.json.enterprise  # Enterprise settings template
│   ├── start-telemetry.sh    # Start script
│   └── stop-telemetry.sh     # Stop script
├── dashboards/
│   ├── README.md             # Dashboard import guide
│   ├── claude-code-overview.json  # Comprehensive dashboard
│   └── claude-code-simple.json    # Simplified dashboard
└── data/
    ├── metrics-reference.md  # Complete metrics documentation
    ├── prometheus-queries.md # Useful PromQL queries
    └── troubleshooting.md    # Common issues and solutions
```

---

## Mode 1: Local PoC Setup

**What it does:**
- Creates `~/.claude/telemetry/` directory
- Generates Docker Compose configuration
- Starts 4 containers: OTEL Collector, Prometheus, Loki, Grafana
- Updates Claude Code settings.json
- Imports Grafana dashboards
- Verifies data flow

**Time:** 5-7 minutes

**Output:**
- Grafana: http://localhost:3000 (admin/admin)
- Prometheus: http://localhost:9090
- Working dashboards with real data

**Detailed workflow:** See `modes/mode1-poc-setup.md`

---

## Mode 2: Enterprise Setup

**What it does:**
- Collects enterprise OTEL endpoint details
- Updates Claude Code settings.json with endpoint and auth
- Adds team/environment resource attributes
- Tests connectivity (optional)
- Provides team rollout documentation

**Time:** 2-3 minutes

**Output:**
- Claude Code configured to send to central endpoint
- Connectivity verified (if the optional test was run)
- Team rollout guide generated

**Detailed workflow:** See `modes/mode2-enterprise.md`

---

## Example Dashboards

### Overview Dashboard

Includes:
- Total Lines of Code (all-time)
- Total Cost (24h)
- Total Tokens (24h)
- Active Time (24h)
- Cost Over Time (timeseries)
- Token Usage by Type (stacked)
- Lines of Code Modified (bar chart)
- Commits Created (24h)

### Custom Queries

See `data/prometheus-queries.md` for 50+ ready-to-use PromQL queries:
- Cost analysis
- Token usage
- Productivity metrics
- Team aggregation
- Model comparison
- Alerting rules

---

## Common Use Cases

### Individual Developer

**Goal:** Track personal Claude Code usage and costs

**Setup:**
```
Mode 1 (Local PoC)
```

**Access:**
- Personal Grafana dashboard at localhost:3000
- All data stays local

---

### Team Pilot (5-10 Users)

**Goal:** Aggregate metrics across pilot users

**Setup:**
```
Mode 2 (Enterprise)
```

**Architecture:**
- Centralized OTEL Collector
- Team-level Prometheus/Grafana
- Aggregated dashboards

---

### Enterprise Rollout (100+ Users)

**Goal:** Organization-wide cost tracking and productivity analytics

**Setup:**
```
Mode 2 (Enterprise) + Managed Infrastructure
```

**Features:**
- Department/team/project attribution
- Chargeback reporting
- Executive dashboards
- Trend analysis

---

## Troubleshooting

### Quick Checks

**Containers not starting** (run from `~/.claude/telemetry/`):
```bash
docker compose logs
```

**No metrics in Prometheus:**
1. Restart Claude Code (telemetry loads at startup)
2. Wait 60 seconds (export interval)
3. Check OTEL Collector logs: `docker compose logs otel-collector`
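
The checks above can be partly scripted. The following is a minimal sketch, assuming the Mode 1 Prometheus is reachable at `http://localhost:9090`; `extract_claude_metrics` and `list_claude_metrics` are hypothetical helper names, not part of the skill. It asks Prometheus for all known metric names through its label-values HTTP endpoint and keeps those that start with `claude_code`.

```bash
# Pull metric names that start with "claude_code" out of the JSON returned by
# Prometheus's /api/v1/label/__name__/values endpoint (read from stdin).
extract_claude_metrics() {
  grep -o '"claude_code[A-Za-z0-9_:]*"' | tr -d '"' | sort -u
}

# Query a Prometheus instance (default: the Mode 1 address) and list the names.
list_claude_metrics() {
  prom_url="${1:-http://localhost:9090}"
  curl -s "$prom_url/api/v1/label/__name__/values" | extract_claude_metrics
}

# Usage (requires the telemetry stack to be running):
#   list_claude_metrics
```

An empty result after the export interval has passed suggests that no Claude Code metrics have reached Prometheus yet.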

**Dashboard shows "No data":**
1. Verify metric names use double prefix: `claude_code_claude_code_*`
2. Check time range (top-right corner)
3. Verify datasource UID matches

**Full troubleshooting guide:** See `data/troubleshooting.md`

---

## Known Issues

### Issue 1: 🚨 CRITICAL - Missing OTEL Exporters

**Description:** Claude Code not sending telemetry even with `CLAUDE_CODE_ENABLE_TELEMETRY=1`

**Cause:** Missing required `OTEL_METRICS_EXPORTER` and `OTEL_LOGS_EXPORTER` settings

**Solution:** The skill templates include these by default. **Always verify** they're present in settings.json. See Configuration Reference for details.

---

### Issue 2: OTEL Collector Deprecated 'address' Field

**Description:** Collector crashes with "'address' has invalid keys" error

**Cause:** The deprecated `address` field under `service.telemetry.metrics` is rejected by collector v0.123.0+

**Solution:** The skill templates already omit this field. If you use a custom config, remove `address` from `service.telemetry.metrics`.

---

### Issue 3: Metric Double Prefix

**Description:** Metrics are named `claude_code_claude_code_*` instead of `claude_code_*`

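**Cause:** The OTEL Collector's Prometheus exporter prepends its configured namespace to every metric name, and Claude Code's metric names already start with `claude_code`

**Solution:** This is expected. The bundled dashboards already query the double-prefixed names.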
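
To make the naming concrete, the sketch below assembles the metric name used in this README's cost alert. It assumes the source metric is `claude_code.cost.usage` with unit `USD` and that the exporter namespace is `claude_code`; both are inferred from the name `claude_code_claude_code_cost_usage_USD_total`. `prom_name` is a hypothetical helper that models only this counter pattern, not the exporter's full naming rules.

```bash
# Build a Prometheus-style counter name:
#   <namespace>_<metric name with dots replaced by underscores>_<unit>_total
prom_name() {
  namespace="$1"
  otel_name="$2"
  unit="$3"
  base=$(printf '%s' "$otel_name" | tr '.' '_')
  printf '%s_%s_%s_total\n' "$namespace" "$base" "$unit"
}

prom_name claude_code claude_code.cost.usage USD
# prints: claude_code_claude_code_cost_usage_USD_total
```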

---

### Issue 4: Dashboard Datasource UID Mismatch

**Description:** Dashboard shows "datasource prometheus not found"

**Cause:** Dashboard has hardcoded UID that doesn't match your Grafana

**Solution:** Skill automatically detects and fixes UID during import
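
How the skill performs this fix is not shown in this README. As a rough illustration, a manual equivalent could look like the sketch below. It assumes the dashboard JSON stores the datasource reference as `"uid": "<value>"` and that you already know both the old and the new UID (for example from Grafana's data source settings page); `replace_datasource_uid` is a hypothetical helper.

```bash
# Rewrite every "uid": "<old_uid>" entry in a dashboard JSON file to <new_uid>.
# The result goes to stdout, so the original file is left untouched.
# Assumes the UIDs contain no characters that are special to sed.
replace_datasource_uid() {
  dashboard_file="$1"
  old_uid="$2"
  new_uid="$3"
  sed "s/\"uid\":[[:space:]]*\"$old_uid\"/\"uid\": \"$new_uid\"/g" "$dashboard_file"
}

# Usage:
#   replace_datasource_uid claude-code-overview.json prometheus abc123 > fixed.json
```

Because this is a plain text substitution, it would also rewrite any other `uid` field that happens to hold the old value, so review the output before importing it.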

---

### Issue 5: OTEL Collector Deprecated Exporter

**Description:** Container fails with "logging exporter has been deprecated"

**Cause:** Old OTEL configuration

**Solution:** Skill uses `debug` exporter (not deprecated `logging`)

---

## Configuration Reference

### Settings.json (Local)

**🚨 CRITICAL REQUIREMENTS:**

The following settings are **REQUIRED** (not optional) for telemetry to work:
- `CLAUDE_CODE_ENABLE_TELEMETRY: "1"` - Enables telemetry system
- `OTEL_METRICS_EXPORTER: "otlp"` - **REQUIRED** to send metrics (most common missing setting!)
- `OTEL_LOGS_EXPORTER: "otlp"` - **REQUIRED** to send events/logs

Without `OTEL_METRICS_EXPORTER` and `OTEL_LOGS_EXPORTER`, telemetry will not send even if `CLAUDE_CODE_ENABLE_TELEMETRY=1` is set.

```json
{
  "env": {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "OTEL_METRICS_EXPORTER": "otlp",
    "OTEL_LOGS_EXPORTER": "otlp",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
    "OTEL_METRIC_EXPORT_INTERVAL": "60000",
    "OTEL_LOGS_EXPORT_INTERVAL": "5000",
    "OTEL_LOG_USER_PROMPTS": "1",
    "OTEL_METRICS_INCLUDE_SESSION_ID": "true",
    "OTEL_METRICS_INCLUDE_VERSION": "true",
    "OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
    "OTEL_RESOURCE_ATTRIBUTES": "environment=local,deployment=poc"
  }
}
```
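
Since a missing exporter variable is described as the most common failure, a quick check of the settings file can help. This is a minimal sketch, not the skill's own verification step: `check_required_env` is a hypothetical helper, and it uses a plain-text `grep` rather than a JSON parser, so it only confirms that each key appears with the expected value.

```bash
# Report whether each required telemetry key appears with the expected value.
# Returns non-zero if any is missing. Text match only; does not validate JSON.
check_required_env() {
  settings_file="${1:-$HOME/.claude/settings.json}"
  status=0
  for pair in \
    'CLAUDE_CODE_ENABLE_TELEMETRY=1' \
    'OTEL_METRICS_EXPORTER=otlp' \
    'OTEL_LOGS_EXPORTER=otlp'
  do
    key="${pair%%=*}"    # text before the first "="
    value="${pair#*=}"   # text after the first "="
    if grep -Eq "\"$key\"[[:space:]]*:[[:space:]]*\"$value\"" "$settings_file"; then
      echo "ok       $key"
    else
      echo "MISSING  $key (expected \"$value\")"
      status=1
    fi
  done
  return "$status"
}

# Usage:
#   check_required_env                      # checks ~/.claude/settings.json
#   check_required_env /path/to/settings.json
```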

### Settings.json (Enterprise)

**Same CRITICAL requirements apply:**
- `OTEL_METRICS_EXPORTER: "otlp"` - **REQUIRED!**
- `OTEL_LOGS_EXPORTER: "otlp"` - **REQUIRED!**

```json
{
  "env": {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "OTEL_METRICS_EXPORTER": "otlp",
    "OTEL_LOGS_EXPORTER": "otlp",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "https://otel.company.com:4317",
    "OTEL_EXPORTER_OTLP_HEADERS": "Authorization=Bearer YOUR_API_KEY",
    "OTEL_METRIC_EXPORT_INTERVAL": "60000",
    "OTEL_LOGS_EXPORT_INTERVAL": "5000",
    "OTEL_LOG_USER_PROMPTS": "1",
    "OTEL_METRICS_INCLUDE_SESSION_ID": "true",
    "OTEL_METRICS_INCLUDE_VERSION": "true",
    "OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
    "OTEL_RESOURCE_ATTRIBUTES": "team=platform,environment=production"
  }
}
```

---

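## Management

Run the `docker compose` commands in this section from `~/.claude/telemetry/`, where Mode 1 generates the Docker Compose configuration.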

### Start Telemetry Stack (Mode 1)

```bash
~/.claude/telemetry/start-telemetry.sh
```

### Stop Telemetry Stack (Mode 1)

```bash
~/.claude/telemetry/stop-telemetry.sh
```

### Check Status

```bash
docker compose ps
```

### View Logs

```bash
docker compose logs -f
```

### Restart Services

```bash
docker compose restart
```

---

## Data Retention

**Default:** 15 days in Prometheus

**Adjust retention:**
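Edit the Prometheus service's `command` in `docker-compose.yml`. These are Prometheus command-line flags rather than `prometheus.yml` scrape settings; when both are set, whichever limit is reached first applies:
```yaml
command:
  # Setting `command` replaces the image defaults, so keep existing flags such as --config.file
  - '--storage.tsdb.retention.time=90d'
  - '--storage.tsdb.retention.size=50GB'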
```

**Disk usage:** ~1-2 MB per day per active user
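
For capacity planning, the per-user figure above can be multiplied out. A minimal sketch, assuming the ~1-2 MB per day per active user estimate holds and that growth is linear; `estimate_disk_mb` is a hypothetical helper.

```bash
# Estimate Prometheus disk usage in MB as a low-high range,
# using 1 MB (low) and 2 MB (high) per active user per day.
estimate_disk_mb() {
  users="$1"
  retention_days="$2"
  low=$((users * retention_days * 1))
  high=$((users * retention_days * 2))
  echo "${low}-${high} MB"
}

estimate_disk_mb 10 90
# prints: 900-1800 MB
```

Under these assumptions, ten users with 90 days of retention stay well below the 50GB size cap shown in the retention example.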

---

## Security Considerations

### Local Setup (Mode 1)

- Grafana accessible only on localhost
- Default credentials: admin/admin (change after first login)
- No external network exposure
- Data stored in Docker volumes

### Enterprise Setup (Mode 2)

- Use HTTPS endpoints
- Store API keys securely (environment variables, secrets manager)
- Enable mTLS for production
- Tag metrics with team/project for proper attribution

---

## Performance Tuning

### Reduce OTEL Collector Memory

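Edit `otel-collector-config.yml`. The `memory_limiter` processor only takes effect if it is also listed in the pipeline's `processors`:
```yaml
processors:
  memory_limiter:
    check_interval: 1s  # Must be a positive duration; 1s is a common choice
    limit_mib: 256      # Lower this from the value in your current config
```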

### Reduce Prometheus Retention

Edit `docker-compose.yml`:
```yaml
command:
  - '--storage.tsdb.retention.time=7d'  # Reduce from 15d
```

### Optimize Dashboard Queries

- Use recording rules for expensive queries
- Reduce dashboard time ranges
- Increase refresh intervals

See `data/prometheus-queries.md` for recording rule examples

---

## Integration Examples

### Cost Alerts (PagerDuty/Slack)

```yaml
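# Prometheus alerting rule file, loaded via rule_files in prometheus.yml
# (Alertmanager then routes the alert to PagerDuty/Slack)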
groups:
  - name: claude_code_cost
    rules:
      - alert: HighDailyCost
        expr: sum(increase(claude_code_claude_code_cost_usage_USD_total[24h])) > 100
        annotations:
          summary: "Claude Code daily cost exceeded $100"
```

### Weekly Cost Reports (Email)

Use Grafana Reporting:
1. Create dashboard with cost panels
2. Set up email delivery
3. Schedule weekly reports

### Chargeback Integration

Export metrics to data warehouse:
```yaml
# prometheus.yml: forward samples with Prometheus remote write
# (the target must accept the remote write protocol)
remote_write:
  - url: "https://datawarehouse.company.com/prometheus"
```

---

## Contributing

This skill is maintained by the Prometheus Team.

**Feedback:** Open an issue or contact the team

**Improvements:** Submit pull requests with enhancements

---

## Changelog

### Version 1.1.0 (2025-11-01)

**Critical Updates from Production Testing:**
- 🚨 **CRITICAL FIX**: Documented missing OTEL_METRICS_EXPORTER/OTEL_LOGS_EXPORTER as #1 cause of "telemetry not working"
- ✅ Added deprecated `address` field fix for OTEL Collector v0.123.0+
- ✅ Enhanced troubleshooting with prominent exporter configuration section
- ✅ Updated all documentation with CRITICAL warnings for required settings
- ✅ Added comprehensive Known Issues section covering production scenarios
- ✅ Verified templates have correct exporter configuration

**What Changed:**
- Troubleshooting guide now prioritizes missing exporters as root cause
- Known Issues expanded from 3 to 6 issues with production learnings
- Configuration Reference includes prominent CRITICAL requirements callout
- SKILL.md Important Reminders section updated with exporter warnings

### Version 1.0.0 (2025-10-31)

**Initial Release:**
- Mode 1: Local PoC setup with full Docker stack
- Mode 2: Enterprise setup with centralized endpoint
- Comprehensive documentation and troubleshooting
- Dashboard templates with correct metric naming
- Automated UID detection and replacement

**Known Issues Fixed:**
- ✅ OTEL Collector deprecated logging exporter
- ✅ Dashboard datasource UID mismatch
- ✅ Metric double prefix handling
- ✅ Loki exporter configuration

---

## Additional Resources

- **Claude Code Monitoring Docs:** https://docs.claude.com/claude-code/monitoring
- **OpenTelemetry Docs:** https://opentelemetry.io/docs/
- **Prometheus Docs:** https://prometheus.io/docs/
- **Grafana Docs:** https://grafana.com/docs/

---

## License

Internal use within the Elsevier organization.

---

## Support

**Issues?** Check `data/troubleshooting.md` first

**Questions?** Contact Prometheus Team or #claude-code-telemetry channel

**Emergency?** Rollback with: `cp ~/.claude/settings.json.backup ~/.claude/settings.json`
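
The rollback command assumes a backup exists. Below is a minimal sketch of taking that backup before any manual edit and restoring it afterwards. The `.backup` suffix matches the command above, while `backup_settings` and `rollback_settings` are hypothetical helper names, not part of the skill.

```bash
# Copy settings.json to settings.json.backup (refuses if the source is missing).
backup_settings() {
  settings_file="${1:-$HOME/.claude/settings.json}"
  [ -f "$settings_file" ] || { echo "no settings file at $settings_file" >&2; return 1; }
  cp "$settings_file" "$settings_file.backup"
}

# Restore settings.json from settings.json.backup (refuses if no backup exists).
rollback_settings() {
  settings_file="${1:-$HOME/.claude/settings.json}"
  [ -f "$settings_file.backup" ] || { echo "no backup at $settings_file.backup" >&2; return 1; }
  cp "$settings_file.backup" "$settings_file"
}

# Usage:
#   backup_settings      # before editing ~/.claude/settings.json
#   rollback_settings    # to undo the edit
```

Restart Claude Code after a rollback, since telemetry settings load at startup.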

---

**Ready to monitor your Claude Code usage!** 🚀