Initial commit

2025-11-29 18:16:51 +08:00
commit 4e8a12140c
88 changed files with 17078 additions and 0 deletions
--- a/skills/otel-monitoring-setup/README.md
+++ b/skills/otel-monitoring-setup/README.md
@@ -0,0 +1,558 @@
+# Claude Code OpenTelemetry Setup Skill
+
+Automated workflow for setting up OpenTelemetry telemetry collection for Claude Code usage monitoring, cost tracking, and productivity analytics.
+
+**Version:** 1.1.0
+**Author:** Prometheus Team
+
+---
+
+## Features
+
+- **Mode 1: Local PoC Setup** - Full Docker stack with Grafana dashboards
+- **Mode 2: Enterprise Setup** - Connect to centralized infrastructure
+- Automated configuration file generation
+- Dashboard import with UID detection
+- Verification and testing procedures
+- Comprehensive troubleshooting guides
+
+---
+
+## Quick Start
+
+### Prerequisites
+
+**For Mode 1 (Local PoC):**
+- Docker Desktop installed and running
+- Claude Code installed
+- Write access to `~/.claude/settings.json`
+
+**For Mode 2 (Enterprise):**
+- OTEL Collector endpoint URL
+- Authentication credentials
+- Write access to `~/.claude/settings.json`
+
+### Installation
+
+This skill is designed to be invoked by Claude Code. No manual installation is required.
+
+### Usage
+
+**Mode 1 - Local PoC Setup:**
+```
+"Set up Claude Code telemetry locally"
+"I want to try OpenTelemetry with Claude Code"
+"Create a local telemetry stack for me"
+```
+
+**Mode 2 - Enterprise Setup:**
+```
+"Connect Claude Code to our company OTEL endpoint at otel.company.com:4317"
+"Set up telemetry for team rollout"
+"Configure enterprise telemetry"
+```
+
+---
+
+## What Gets Collected?
+
+### Metrics
+- **Session counts and active time** - How much you use Claude Code
+- **Token usage** - Input, output, cached tokens by model
+- **API costs** - Spend tracking by model and time
+- **Lines of code** - Code modifications (added, changed, deleted)
+- **Commits and PRs** - Git activity tracking
+
+### Events/Logs
+- User prompts (only when `OTEL_LOG_USER_PROMPTS=1`; the templates in Configuration Reference set it)
+- Tool executions
+- API requests
+- Session lifecycle
+
+**Privacy:** Metrics are anonymized. Source code content is never collected.
+
+---
+
+## Directory Structure
+
+```
+otel-monitoring-setup/
+├── SKILL.md                  # Main skill definition
+├── README.md                 # This file
+├── modes/
+│   ├── mode1-poc-setup.md    # Detailed local setup workflow
+│   └── mode2-enterprise.md   # Detailed enterprise setup workflow
+├── templates/
+│   ├── docker-compose.yml    # Docker Compose configuration
+│   ├── otel-collector-config.yml  # OTEL Collector configuration
+│   ├── prometheus.yml        # Prometheus scrape configuration
+│   ├── grafana-datasources.yml    # Grafana datasource provisioning
+│   ├── settings.json.local   # Local telemetry settings template
+│   ├── settings.json.enterprise  # Enterprise settings template
+│   ├── start-telemetry.sh    # Start script
+│   └── stop-telemetry.sh     # Stop script
+├── dashboards/
+│   ├── README.md             # Dashboard import guide
+│   ├── claude-code-overview.json  # Comprehensive dashboard
+│   └── claude-code-simple.json    # Simplified dashboard
+└── data/
+    ├── metrics-reference.md  # Complete metrics documentation
+    ├── prometheus-queries.md # Useful PromQL queries
+    └── troubleshooting.md    # Common issues and solutions
+```
+
+---
+
+## Mode 1: Local PoC Setup
+
+**What it does:**
+- Creates `~/.claude/telemetry/` directory
+- Generates Docker Compose configuration
+- Starts 4 containers: OTEL Collector, Prometheus, Loki, Grafana
+- Updates Claude Code settings.json
+- Imports Grafana dashboards
+- Verifies data flow
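+
+For orientation, a four-container stack of this kind might look roughly like the sketch below. This is not the skill's `templates/docker-compose.yml`: image tags, mount paths, and the Loki port are assumptions, and the generated file may differ.
+
+```yaml
+# Illustrative sketch only; compare with the generated docker-compose.yml.
+services:
+  otel-collector:
+    image: otel/opentelemetry-collector-contrib:latest
+    volumes:
+      - ./otel-collector-config.yml:/etc/otelcol-contrib/config.yaml:ro
+    ports:
+      - "127.0.0.1:4317:4317"   # OTLP gRPC from Claude Code
+  prometheus:
+    image: prom/prometheus:latest
+    volumes:
+      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
+    ports:
+      - "127.0.0.1:9090:9090"
+  loki:
+    image: grafana/loki:latest
+    ports:
+      - "127.0.0.1:3100:3100"   # assumed: Loki's default HTTP port
+  grafana:
+    image: grafana/grafana:latest
+    volumes:
+      - ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml:ro
+    ports:
+      - "127.0.0.1:3000:3000"   # bound to localhost only
+```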
+
+**Time:** 5-7 minutes
+
+**Output:**
+- Grafana: http://localhost:3000 (admin/admin)
+- Prometheus: http://localhost:9090
+- Working dashboards with real data
+
+**Detailed workflow:** See `modes/mode1-poc-setup.md`
+
+---
+
+## Mode 2: Enterprise Setup
+
+**What it does:**
+- Collects enterprise OTEL endpoint details
+- Updates Claude Code settings.json with endpoint and auth
+- Adds team/environment resource attributes
+- Tests connectivity (optional)
+- Provides team rollout documentation
+
+**Time:** 2-3 minutes
+
+**Output:**
+- Claude Code configured to send to central endpoint
+- Connectivity verified (if the optional test was run)
+- Team rollout guide generated
+
+**Detailed workflow:** See `modes/mode2-enterprise.md`
+
+---
+
+## Example Dashboards
+
+### Overview Dashboard
+
+Includes:
+- Total Lines of Code (all-time)
+- Total Cost (24h)
+- Total Tokens (24h)
+- Active Time (24h)
+- Cost Over Time (timeseries)
+- Token Usage by Type (stacked)
+- Lines of Code Modified (bar chart)
+- Commits Created (24h)
+
+### Custom Queries
+
+See `data/prometheus-queries.md` for 50+ ready-to-use PromQL queries:
+- Cost analysis
+- Token usage
+- Productivity metrics
+- Team aggregation
+- Model comparison
+- Alerting rules
+
+---
+
+## Common Use Cases
+
+### Individual Developer
+
+**Goal:** Track personal Claude Code usage and costs
+
+**Setup:**
+```
+Mode 1 (Local PoC)
+```
+
+**Access:**
+- Personal Grafana dashboard at localhost:3000
+- All data stays local
+
+---
+
+### Team Pilot (5-10 Users)
+
+**Goal:** Aggregate metrics across pilot users
+
+**Setup:**
+```
+Mode 2 (Enterprise)
+```
+
+**Architecture:**
+- Centralized OTEL Collector
+- Team-level Prometheus/Grafana
+- Aggregated dashboards
+
+---
+
+### Enterprise Rollout (100+ Users)
+
+**Goal:** Organization-wide cost tracking and productivity analytics
+
+**Setup:**
+```
+Mode 2 (Enterprise) + Managed Infrastructure
+```
+
+**Features:**
+- Department/team/project attribution
+- Chargeback reporting
+- Executive dashboards
+- Trend analysis
+
+---
+
+## Troubleshooting
+
+### Quick Checks
+
+**Containers not starting:**
+```bash
+docker compose logs
+```
+
+**No metrics in Prometheus:**
+1. Restart Claude Code (telemetry loads at startup)
+2. Wait 60 seconds (export interval)
+3. Check OTEL Collector logs: `docker compose logs otel-collector`
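+
+If the collector logs look healthy, it is also worth checking that Prometheus scrapes the collector at all. A scrape job along these lines is assumed here; the job name, the `otel-collector` host name, and port 8889 are assumptions, so compare them with your own `prometheus.yml`:
+
+```yaml
+# prometheus.yml (sketch; names and port are assumptions)
+scrape_configs:
+  - job_name: otel-collector
+    scrape_interval: 15s
+    static_configs:
+      - targets: ['otel-collector:8889']   # the collector's Prometheus exporter endpoint
+```
+
+The Prometheus UI at http://localhost:9090 lists each scrape target and its last error under Status > Targets.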
+
+**Dashboard shows "No data":**
+1. Verify metric names use double prefix: `claude_code_claude_code_*`
+2. Check time range (top-right corner)
+3. Verify datasource UID matches
+
+**Full troubleshooting guide:** See `data/troubleshooting.md`
+
+---
+
+## Known Issues
+
+### Issue 1: 🚨 CRITICAL - Missing OTEL Exporters
+
+**Description:** Claude Code not sending telemetry even with `CLAUDE_CODE_ENABLE_TELEMETRY=1`
+
+**Cause:** Missing required `OTEL_METRICS_EXPORTER` and `OTEL_LOGS_EXPORTER` settings
+
+**Solution:** The skill templates include these by default. **Always verify** they're present in settings.json. See Configuration Reference for details.
+
+---
+
+### Issue 2: OTEL Collector Deprecated 'address' Field
+
+**Description:** Collector crashes with "'address' has invalid keys" error
+
+**Cause:** The `address` field in `service.telemetry.metrics` is deprecated in collector v0.123.0+
+
+**Solution:** Skill templates have this removed. If using custom config, remove the deprecated field.
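+
+In newer collector releases, the collector's own metrics endpoint is configured through `readers` instead of `address`. A sketch, assuming you still want the internal metrics exposed on port 8888 (host and port are assumptions; check the release notes for your collector version):
+
+```yaml
+service:
+  telemetry:
+    metrics:
+      level: detailed
+      readers:
+        - pull:
+            exporter:
+              prometheus:
+                host: '0.0.0.0'
+                port: 8888   # assumed port for the collector's internal metrics
+```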
+
+---
+
+### Issue 3: Metric Double Prefix
+
+**Description:** Metrics are named `claude_code_claude_code_*` instead of `claude_code_*`
+
+**Cause:** OTEL Collector Prometheus exporter adds namespace prefix
+
+**Solution:** This is expected. Dashboards use correct naming.
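+
+The extra prefix comes from the `namespace` option of the collector's Prometheus exporter: Claude Code emits names such as `claude_code.cost.usage`, and the exporter prepends the namespace. A sketch, assuming the template uses `claude_code` as the namespace and 8889 as the exporter port:
+
+```yaml
+# otel-collector-config.yml (sketch; endpoint port is an assumption)
+exporters:
+  prometheus:
+    endpoint: '0.0.0.0:8889'
+    namespace: claude_code   # prepended to every metric name, hence claude_code_claude_code_*
+```
+
+Removing `namespace` would yield single-prefix names, but the bundled dashboards would then need their queries updated to match.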
+
+---
+
+### Issue 4: Dashboard Datasource UID Mismatch
+
+**Description:** Dashboard shows "datasource prometheus not found"
+
+**Cause:** Dashboard has hardcoded UID that doesn't match your Grafana
+
+**Solution:** Skill automatically detects and fixes UID during import
+
+---
+
+### Issue 5: OTEL Collector Deprecated Exporter
+
+**Description:** Container fails with "logging exporter has been deprecated"
+
+**Cause:** Old OTEL configuration
+
+**Solution:** Skill uses `debug` exporter (not deprecated `logging`)
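+
+If you maintain a custom collector config, the replacement can be sketched as follows. The pipeline shown is an assumption for illustration; keep whatever receivers and exporters your config already uses.
+
+```yaml
+exporters:
+  debug:
+    verbosity: detailed   # basic | normal | detailed
+
+service:
+  pipelines:
+    logs:
+      receivers: [otlp]
+      exporters: [debug]   # previously: [logging]
+```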
+
+---
+
+## Configuration Reference
+
+### Settings.json (Local)
+
+**🚨 CRITICAL REQUIREMENTS:**
+
+The following settings are **REQUIRED** (not optional) for telemetry to work:
+- `CLAUDE_CODE_ENABLE_TELEMETRY: "1"` - Enables telemetry system
+- `OTEL_METRICS_EXPORTER: "otlp"` - **REQUIRED** to send metrics (most common missing setting!)
+- `OTEL_LOGS_EXPORTER: "otlp"` - **REQUIRED** to send events/logs
+
+Without `OTEL_METRICS_EXPORTER` and `OTEL_LOGS_EXPORTER`, telemetry will not send even if `CLAUDE_CODE_ENABLE_TELEMETRY=1` is set.
+
+```json
+{
+  "env": {
+    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
+    "OTEL_METRICS_EXPORTER": "otlp",
+    "OTEL_LOGS_EXPORTER": "otlp",
+    "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
+    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
+    "OTEL_METRIC_EXPORT_INTERVAL": "60000",
+    "OTEL_LOGS_EXPORT_INTERVAL": "5000",
+    "OTEL_LOG_USER_PROMPTS": "1",
+    "OTEL_METRICS_INCLUDE_SESSION_ID": "true",
+    "OTEL_METRICS_INCLUDE_VERSION": "true",
+    "OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
+    "OTEL_RESOURCE_ATTRIBUTES": "environment=local,deployment=poc"
+  }
+}
+```
+
+### Settings.json (Enterprise)
+
+**Same CRITICAL requirements apply:**
+- `OTEL_METRICS_EXPORTER: "otlp"` - **REQUIRED!**
+- `OTEL_LOGS_EXPORTER: "otlp"` - **REQUIRED!**
+
+```json
+{
+  "env": {
+    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
+    "OTEL_METRICS_EXPORTER": "otlp",
+    "OTEL_LOGS_EXPORTER": "otlp",
+    "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
+    "OTEL_EXPORTER_OTLP_ENDPOINT": "https://otel.company.com:4317",
+    "OTEL_EXPORTER_OTLP_HEADERS": "Authorization=Bearer YOUR_API_KEY",
+    "OTEL_METRIC_EXPORT_INTERVAL": "60000",
+    "OTEL_LOGS_EXPORT_INTERVAL": "5000",
+    "OTEL_LOG_USER_PROMPTS": "1",
+    "OTEL_METRICS_INCLUDE_SESSION_ID": "true",
+    "OTEL_METRICS_INCLUDE_VERSION": "true",
+    "OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
+    "OTEL_RESOURCE_ATTRIBUTES": "team=platform,environment=production"
+  }
+}
+```
+
+---
+
+## Management
+
+### Start Telemetry Stack (Mode 1)
+
+```bash
+~/.claude/telemetry/start-telemetry.sh
+```
+
+### Stop Telemetry Stack (Mode 1)
+
+```bash
+~/.claude/telemetry/stop-telemetry.sh
+```
+
+### Check Status
+
+```bash
+docker compose ps
+```
+
+### View Logs
+
+```bash
+docker compose logs -f
+```
+
+### Restart Services
+
+```bash
+docker compose restart
+```
+
+---
+
+## Data Retention
+
+**Default:** 15 days in Prometheus
+
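+**Adjust retention:**
+Retention is controlled by Prometheus command-line flags, so set them on the Prometheus service's `command` in `docker-compose.yml`:
+```yaml
+command:
+  - '--config.file=/etc/prometheus/prometheus.yml'  # keep this (adjust to your mount path): overriding command drops the image's default flags
+  - '--storage.tsdb.retention.time=90d'   # delete data older than 90 days
+  - '--storage.tsdb.retention.size=50GB'  # or once the TSDB exceeds 50GB, whichever comes first
+```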
+
+**Disk usage:** ~1-2 MB per day per active user
+
+---
+
+## Security Considerations
+
+### Local Setup (Mode 1)
+
+- Grafana accessible only on localhost
+- Default credentials: admin/admin (change after first login)
+- No external network exposure
+- Data stored in Docker volumes
+
+### Enterprise Setup (Mode 2)
+
+- Use HTTPS endpoints
+- Store API keys securely (environment variables, secrets manager)
+- Enable mTLS for production
+- Tag metrics with team/project for proper attribution
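+
+On the collector side, mTLS for the OTLP receiver can be sketched as follows. The certificate paths are hypothetical, and your central collector may terminate TLS elsewhere (for example at a load balancer):
+
+```yaml
+# Central collector config (sketch; file paths are placeholders)
+receivers:
+  otlp:
+    protocols:
+      grpc:
+        endpoint: '0.0.0.0:4317'
+        tls:
+          cert_file: /certs/server.crt
+          key_file: /certs/server.key
+          client_ca_file: /certs/ca.crt   # requiring client certificates is what makes this mutual TLS
+```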
+
+---
+
+## Performance Tuning
+
+### Reduce OTEL Collector Memory
+
+Edit `otel-collector-config.yml`:
+```yaml
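+processors:
+  memory_limiter:
+    check_interval: 1s  # must be greater than zero; 1s is the commonly recommended value
+    limit_mib: 256      # hard memory cap in MiB; lower it to reduce usage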
+```
+
+### Reduce Prometheus Retention
+
+Edit `docker-compose.yml`:
+```yaml
+command:
+  - '--storage.tsdb.retention.time=7d'  # Reduce from 15d
+```
+
+### Optimize Dashboard Queries
+
+- Use recording rules for expensive queries
+- Reduce dashboard time ranges
+- Increase refresh intervals
+
+See `data/prometheus-queries.md` for recording rule examples
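+
+As an illustration (the group and rule names below are made up, not taken from `data/prometheus-queries.md`), a recording rule that precomputes hourly cost might look like:
+
+```yaml
+# Prometheus rule file, loaded via rule_files in prometheus.yml
+groups:
+  - name: claude_code_recording
+    interval: 1m
+    rules:
+      - record: claude_code:cost_usd:increase1h
+        expr: sum(increase(claude_code_claude_code_cost_usage_USD_total[1h]))
+```
+
+Dashboard panels can then query `claude_code:cost_usd:increase1h` directly instead of re-evaluating the `increase()` on every refresh.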
+
+---
+
+## Integration Examples
+
+### Cost Alerts (PagerDuty/Slack)
+
+```yaml
+# Prometheus alerting rule file (loaded via rule_files in prometheus.yml)
+groups:
+  - name: claude_code_cost
+    rules:
+      - alert: HighDailyCost
+        expr: sum(increase(claude_code_claude_code_cost_usage_USD_total[24h])) > 100
+        annotations:
+          summary: "Claude Code daily cost exceeded $100"
+```
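+
+The rule above only fires the alert; routing it to Slack or PagerDuty is Alertmanager's job. A minimal Slack sketch, where the webhook URL and channel are placeholders to replace with your own:
+
+```yaml
+# alertmanager.yml (sketch; webhook URL and channel are placeholders)
+route:
+  receiver: slack-cost
+receivers:
+  - name: slack-cost
+    slack_configs:
+      - api_url: https://hooks.slack.com/services/REPLACE/ME
+        channel: '#claude-code-telemetry'
+        send_resolved: true
+```
+
+For PagerDuty, the same receiver would use `pagerduty_configs` with a `routing_key` instead of `slack_configs`.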
+
+### Weekly Cost Reports (Email)
+
+Use Grafana Reporting:
+1. Create dashboard with cost panels
+2. Set up email delivery
+3. Schedule weekly reports
+
+### Chargeback Integration
+
+Export metrics to data warehouse:
+```yaml
+# prometheus.yml: forward samples with remote write (the endpoint must accept the remote-write protocol)
+remote_write:
+  - url: "https://datawarehouse.company.com/prometheus"
+```
+
+---
+
+## Contributing
+
+This skill is maintained by the Prometheus Team.
+
+**Feedback:** Open an issue or contact the team
+
+**Improvements:** Submit pull requests with enhancements
+
+---
+
+## Changelog
+
+### Version 1.1.0 (2025-11-01)
+
+**Critical Updates from Production Testing:**
+- 🚨 **CRITICAL FIX**: Documented missing OTEL_METRICS_EXPORTER/OTEL_LOGS_EXPORTER as #1 cause of "telemetry not working"
+- ✅ Added deprecated `address` field fix for OTEL Collector v0.123.0+
+- ✅ Enhanced troubleshooting with prominent exporter configuration section
+- ✅ Updated all documentation with CRITICAL warnings for required settings
+- ✅ Added comprehensive Known Issues section covering production scenarios
+- ✅ Verified templates have correct exporter configuration
+
+**What Changed:**
+- Troubleshooting guide now prioritizes missing exporters as root cause
+- Known Issues expanded from 3 to 6 issues with production learnings
+- Configuration Reference includes prominent CRITICAL requirements callout
+- SKILL.md Important Reminders section updated with exporter warnings
+
+### Version 1.0.0 (2025-10-31)
+
+**Initial Release:**
+- Mode 1: Local PoC setup with full Docker stack
+- Mode 2: Enterprise setup with centralized endpoint
+- Comprehensive documentation and troubleshooting
+- Dashboard templates with correct metric naming
+- Automated UID detection and replacement
+
+**Known Issues Fixed:**
+- ✅ OTEL Collector deprecated logging exporter
+- ✅ Dashboard datasource UID mismatch
+- ✅ Metric double prefix handling
+- ✅ Loki exporter configuration
+
+---
+
+## Additional Resources
+
+- **Claude Code Monitoring Docs:** https://docs.claude.com/claude-code/monitoring
+- **OpenTelemetry Docs:** https://opentelemetry.io/docs/
+- **Prometheus Docs:** https://prometheus.io/docs/
+- **Grafana Docs:** https://grafana.com/docs/
+
+---
+
+## License
+
+Internal use within Elsevier organization.
+
+---
+
+## Support
+
+**Issues?** Check `data/troubleshooting.md` first
+
+**Questions?** Contact Prometheus Team or #claude-code-telemetry channel
+
+**Emergency?** Rollback with: `cp ~/.claude/settings.json.backup ~/.claude/settings.json`
+
+---
+
+**Ready to monitor your Claude Code usage!** 🚀