zhongwei/gh-cskiro-claudex-claude-code-tools

Files

Zhongwei Li 4e8a12140c Initial commit

2025-11-29 18:16:51 +08:00

14 KiB

Raw Permalink Blame History

Claude Code OpenTelemetry Setup Skill

Automated workflow for setting up OpenTelemetry telemetry collection for Claude Code usage monitoring, cost tracking, and productivity analytics.

Version: 1.0.0 Author: Prometheus Team

Features

Mode 1: Local PoC Setup - Full Docker stack with Grafana dashboards
Mode 2: Enterprise Setup - Connect to centralized infrastructure
Automated configuration file generation
Dashboard import with UID detection
Verification and testing procedures
Comprehensive troubleshooting guides

Quick Start

Prerequisites

For Mode 1 (Local PoC):

Docker Desktop installed and running
Claude Code installed
Write access to ~/.claude/settings.json

For Mode 2 (Enterprise):

OTEL Collector endpoint URL
Authentication credentials
Write access to ~/.claude/settings.json

Installation

This skill is designed to be invoked by Claude Code. No manual installation required.

Usage

Mode 1 - Local PoC Setup:

"Set up Claude Code telemetry locally"
"I want to try OpenTelemetry with Claude Code"
"Create a local telemetry stack for me"

Mode 2 - Enterprise Setup:

"Connect Claude Code to our company OTEL endpoint at otel.company.com:4317"
"Set up telemetry for team rollout"
"Configure enterprise telemetry"

What Gets Collected?

Metrics

Session counts and active time - How much you use Claude Code
Token usage - Input, output, cached tokens by model
API costs - Spend tracking by model and time
Lines of code - Code modifications (added, changed, deleted)
Commits and PRs - Git activity tracking

Events/Logs

User prompts (if enabled)
Tool executions
API requests
Session lifecycle

Privacy: Metrics are anonymized. Source code content is never collected.

Directory Structure

claude-code-otel-setup/
├── SKILL.md                  # Main skill definition
├── README.md                 # This file
├── modes/
│   ├── mode1-poc-setup.md    # Detailed local setup workflow
│   └── mode2-enterprise.md   # Detailed enterprise setup workflow
├── templates/
│   ├── docker-compose.yml    # Docker Compose configuration
│   ├── otel-collector-config.yml  # OTEL Collector configuration
│   ├── prometheus.yml        # Prometheus scrape configuration
│   ├── grafana-datasources.yml    # Grafana datasource provisioning
│   ├── settings.json.local   # Local telemetry settings template
│   ├── settings.json.enterprise  # Enterprise settings template
│   ├── start-telemetry.sh    # Start script
│   └── stop-telemetry.sh     # Stop script
├── dashboards/
│   ├── README.md             # Dashboard import guide
│   ├── claude-code-overview.json  # Comprehensive dashboard
│   └── claude-code-simple.json    # Simplified dashboard
└── data/
    ├── metrics-reference.md  # Complete metrics documentation
    ├── prometheus-queries.md # Useful PromQL queries
    └── troubleshooting.md    # Common issues and solutions

Mode 1: Local PoC Setup

What it does:

Creates ~/.claude/telemetry/ directory
Generates Docker Compose configuration
Starts 4 containers: OTEL Collector, Prometheus, Loki, Grafana
Updates Claude Code settings.json
Imports Grafana dashboards
Verifies data flow

Time: 5-7 minutes

Output:

Grafana: http://localhost:3000 (admin/admin)
Prometheus: http://localhost:9090
Working dashboards with real data

Detailed workflow: See modes/mode1-poc-setup.md

Mode 2: Enterprise Setup

What it does:

Collects enterprise OTEL endpoint details
Updates Claude Code settings.json with endpoint and auth
Adds team/environment resource attributes
Tests connectivity (optional)
Provides team rollout documentation

Time: 2-3 minutes

Output:

Claude Code configured to send to central endpoint
Connectivity verified
Team rollout guide generated

Detailed workflow: See modes/mode2-enterprise.md

Example Dashboards

Overview Dashboard

Includes:

Total Lines of Code (all-time)
Total Cost (24h)
Total Tokens (24h)
Active Time (24h)
Cost Over Time (timeseries)
Token Usage by Type (stacked)
Lines of Code Modified (bar chart)
Commits Created (24h)

Custom Queries

See data/prometheus-queries.md for 50+ ready-to-use PromQL queries:

Cost analysis
Token usage
Productivity metrics
Team aggregation
Model comparison
Alerting rules

Common Use Cases

Individual Developer

Goal: Track personal Claude Code usage and costs

Setup:

Mode 1 (Local PoC)

Access:

Personal Grafana dashboard at localhost:3000
All data stays local

Team Pilot (5-10 Users)

Goal: Aggregate metrics across pilot users

Setup:

Mode 2 (Enterprise)

Architecture:

Centralized OTEL Collector
Team-level Prometheus/Grafana
Aggregated dashboards

Enterprise Rollout (100+ Users)

Goal: Organization-wide cost tracking and productivity analytics

Setup:

Mode 2 (Enterprise) + Managed Infrastructure

Features:

Department/team/project attribution
Chargeback reporting
Executive dashboards
Trend analysis

Troubleshooting

Quick Checks

Containers not starting:

docker compose logs

No metrics in Prometheus:

Restart Claude Code (telemetry loads at startup)
Wait 60 seconds (export interval)
Check OTEL Collector logs: docker compose logs otel-collector

Dashboard shows "No data":

Verify metric names use double prefix: claude_code_claude_code_*
Check time range (top-right corner)
Verify datasource UID matches

Full troubleshooting guide: See data/troubleshooting.md

Known Issues

Issue 1: 🚨 CRITICAL - Missing OTEL Exporters

Description: Claude Code not sending telemetry even with CLAUDE_CODE_ENABLE_TELEMETRY=1

Cause: Missing required OTEL_METRICS_EXPORTER and OTEL_LOGS_EXPORTER settings

Solution: The skill templates include these by default. Always verify they're present in settings.json. See Configuration Reference for details.

Issue 2: OTEL Collector Deprecated 'address' Field

Description: Collector crashes with "'address' has invalid keys" error

Cause: The address field in service.telemetry.metrics is deprecated in collector v0.123.0+

Solution: Skill templates have this removed. If using custom config, remove the deprecated field.

Issue 3: Metric Double Prefix

Description: Metrics are named claude_code_claude_code_* instead of claude_code_*

Cause: OTEL Collector Prometheus exporter adds namespace prefix

Solution: This is expected. Dashboards use correct naming.

Issue 4: Dashboard Datasource UID Mismatch

Description: Dashboard shows "datasource prometheus not found"

Cause: Dashboard has hardcoded UID that doesn't match your Grafana

Solution: Skill automatically detects and fixes UID during import

Issue 5: OTEL Collector Deprecated Exporter

Description: Container fails with "logging exporter has been deprecated"

Cause: Old OTEL configuration

Solution: Skill uses debug exporter (not deprecated logging)

Configuration Reference

Settings.json (Local)

🚨 CRITICAL REQUIREMENTS:

The following settings are REQUIRED (not optional) for telemetry to work:

CLAUDE_CODE_ENABLE_TELEMETRY: "1" - Enables telemetry system
OTEL_METRICS_EXPORTER: "otlp" - REQUIRED to send metrics (most common missing setting!)
OTEL_LOGS_EXPORTER: "otlp" - REQUIRED to send events/logs

Without OTEL_METRICS_EXPORTER and OTEL_LOGS_EXPORTER, telemetry will not send even if CLAUDE_CODE_ENABLE_TELEMETRY=1 is set.

{
  "env": {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "OTEL_METRICS_EXPORTER": "otlp",           // REQUIRED!
    "OTEL_LOGS_EXPORTER": "otlp",              // REQUIRED!
    "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
    "OTEL_METRIC_EXPORT_INTERVAL": "60000",
    "OTEL_LOGS_EXPORT_INTERVAL": "5000",
    "OTEL_LOG_USER_PROMPTS": "1",
    "OTEL_METRICS_INCLUDE_SESSION_ID": "true",
    "OTEL_METRICS_INCLUDE_VERSION": "true",
    "OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
    "OTEL_RESOURCE_ATTRIBUTES": "environment=local,deployment=poc"
  }
}

Settings.json (Enterprise)

Same CRITICAL requirements apply:

OTEL_METRICS_EXPORTER: "otlp" - REQUIRED!
OTEL_LOGS_EXPORTER: "otlp" - REQUIRED!

{
  "env": {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "OTEL_METRICS_EXPORTER": "otlp",           // REQUIRED!
    "OTEL_LOGS_EXPORTER": "otlp",              // REQUIRED!
    "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "https://otel.company.com:4317",
    "OTEL_EXPORTER_OTLP_HEADERS": "Authorization=Bearer YOUR_API_KEY",
    "OTEL_METRIC_EXPORT_INTERVAL": "60000",
    "OTEL_LOGS_EXPORT_INTERVAL": "5000",
    "OTEL_LOG_USER_PROMPTS": "1",
    "OTEL_METRICS_INCLUDE_SESSION_ID": "true",
    "OTEL_METRICS_INCLUDE_VERSION": "true",
    "OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
    "OTEL_RESOURCE_ATTRIBUTES": "team=platform,environment=production"
  }
}

Management

Start Telemetry Stack (Mode 1)

~/.claude/telemetry/start-telemetry.sh

Stop Telemetry Stack (Mode 1)

~/.claude/telemetry/stop-telemetry.sh

Check Status

docker compose ps

View Logs

docker compose logs -f

Restart Services

docker compose restart

Data Retention

Default: 15 days in Prometheus

Adjust retention: Edit docker-compose.yml or prometheus.yml:

command:
  - '--storage.tsdb.retention.time=90d'
  - '--storage.tsdb.retention.size=50GB'

Disk usage: ~1-2 MB per day per active user

Security Considerations

Local Setup (Mode 1)

Grafana accessible only on localhost
Default credentials: admin/admin (change after first login)
No external network exposure
Data stored in Docker volumes

Enterprise Setup (Mode 2)

Use HTTPS endpoints
Store API keys securely (environment variables, secrets manager)
Enable mTLS for production
Tag metrics with team/project for proper attribution

Performance Tuning

Reduce OTEL Collector Memory

Edit otel-collector-config.yml:

processors:
  memory_limiter:
    limit_mib: 256  # Reduce from default

Reduce Prometheus Retention

Edit docker-compose.yml:

command:
  - '--storage.tsdb.retention.time=7d'  # Reduce from 15d

Optimize Dashboard Queries

Use recording rules for expensive queries
Reduce dashboard time ranges
Increase refresh intervals

See data/prometheus-queries.md for recording rule examples

Integration Examples

Cost Alerts (PagerDuty/Slack)

# alertmanager.yml
groups:
  - name: claude_code_cost
    rules:
      - alert: HighDailyCost
        expr: sum(increase(claude_code_claude_code_cost_usage_USD_total[24h])) > 100
        annotations:
          summary: "Claude Code daily cost exceeded $100"

Weekly Cost Reports (Email)

Use Grafana Reporting:

Create dashboard with cost panels
Set up email delivery
Schedule weekly reports

Chargeback Integration

Export metrics to data warehouse:

# Use Prometheus remote write
remote_write:
  - url: "https://datawarehouse.company.com/prometheus"

Contributing

This skill is maintained by the Prometheus Team.

Feedback: Open an issue or contact the team

Improvements: Submit pull requests with enhancements

Changelog

Version 1.1.0 (2025-11-01)

Critical Updates from Production Testing:

🚨 CRITICAL FIX: Documented missing OTEL_METRICS_EXPORTER/OTEL_LOGS_EXPORTER as #1 cause of "telemetry not working"
✅ Added deprecated address field fix for OTEL Collector v0.123.0+
✅ Enhanced troubleshooting with prominent exporter configuration section
✅ Updated all documentation with CRITICAL warnings for required settings
✅ Added comprehensive Known Issues section covering production scenarios
✅ Verified templates have correct exporter configuration

What Changed:

Troubleshooting guide now prioritizes missing exporters as root cause
Known Issues expanded from 3 to 6 issues with production learnings
Configuration Reference includes prominent CRITICAL requirements callout
SKILL.md Important Reminders section updated with exporter warnings

Version 1.0.0 (2025-10-31)

Initial Release:

Mode 1: Local PoC setup with full Docker stack
Mode 2: Enterprise setup with centralized endpoint
Comprehensive documentation and troubleshooting
Dashboard templates with correct metric naming
Automated UID detection and replacement

Known Issues Fixed:

✅ OTEL Collector deprecated logging exporter
✅ Dashboard datasource UID mismatch
✅ Metric double prefix handling
✅ Loki exporter configuration

Additional Resources

Claude Code Monitoring Docs: https://docs.claude.com/claude-code/monitoring
OpenTelemetry Docs: https://opentelemetry.io/docs/
Prometheus Docs: https://prometheus.io/docs/
Grafana Docs: https://grafana.com/docs/

License

Internal use within Elsevier organization.

Support

Issues? Check data/troubleshooting.md first

Questions? Contact Prometheus Team or #claude-code-telemetry channel

Emergency? Rollback with: cp ~/.claude/settings.json.backup ~/.claude/settings.json

Ready to monitor your Claude Code usage! 🚀

14 KiB Raw Permalink Blame History

Claude Code OpenTelemetry Setup Skill

Features

Quick Start

Prerequisites

Installation

Usage

What Gets Collected?

Metrics

Events/Logs

Directory Structure

Mode 1: Local PoC Setup

Mode 2: Enterprise Setup

Example Dashboards

Overview Dashboard

Custom Queries

Common Use Cases

Individual Developer

Team Pilot (5-10 Users)

Enterprise Rollout (100+ Users)

Troubleshooting

Quick Checks

Known Issues

Issue 1: 🚨 CRITICAL - Missing OTEL Exporters

Issue 2: OTEL Collector Deprecated 'address' Field

Issue 3: Metric Double Prefix

Issue 4: Dashboard Datasource UID Mismatch

Issue 5: OTEL Collector Deprecated Exporter

Configuration Reference

Settings.json (Local)

Settings.json (Enterprise)

Management

Start Telemetry Stack (Mode 1)

Stop Telemetry Stack (Mode 1)

Check Status

View Logs

Restart Services

Data Retention

Security Considerations

Local Setup (Mode 1)

Enterprise Setup (Mode 2)

Performance Tuning

Reduce OTEL Collector Memory

Reduce Prometheus Retention

Optimize Dashboard Queries

Integration Examples

Cost Alerts (PagerDuty/Slack)

Weekly Cost Reports (Email)

Chargeback Integration

Contributing

Changelog

Version 1.1.0 (2025-11-01)

Version 1.0.0 (2025-10-31)

Additional Resources

License

Support

14 KiB

Raw Permalink Blame History