Files
2025-11-29 18:16:51 +08:00

18 KiB

Mode 1: Local PoC Setup - Detailed Workflow

Complete step-by-step process for setting up a local OpenTelemetry stack for Claude Code telemetry.


Overview

Goal: Create a complete local telemetry monitoring stack Time: 5-7 minutes Prerequisites: Docker Desktop, Claude Code, 2GB+ free disk space Output: Running Grafana dashboard with Claude Code metrics


Phase 0: Prerequisites Verification

Step 0.1: Check Docker Installation

# Check if Docker is installed
docker --version

# Expected: Docker version 20.10.0 or higher

If not installed:

Docker is not installed. Please install Docker Desktop:
- Mac: https://docs.docker.com/desktop/install/mac-install/
- Linux: https://docs.docker.com/desktop/install/linux-install/
- Windows: https://docs.docker.com/desktop/install/windows-install/

Stop if: Docker not installed

Step 0.2: Verify Docker is Running

# Check Docker daemon
docker ps

# Expected: List of containers (or empty list)
# Error: "Cannot connect to Docker daemon" means Docker isn't running

If not running:

Docker Desktop is not running. Please:
1. Open Docker Desktop application
2. Wait for the whale icon to be stable (not animated)
3. Try again

Stop if: Docker not running

Step 0.3: Check Docker Compose

# Modern Docker includes compose
docker compose version

# Expected: Docker Compose version v2.x.x or higher

Note: We use docker compose (not docker-compose)

Step 0.4: Check Available Ports

# Check if ports are available
lsof -i :3000 -i :4317 -i :4318 -i :8889 -i :9090 -i :3100

# Expected: No output (ports are free)

If ports in use:

The following ports are required but already in use:
- 3000: Grafana
- 4317: OTEL Collector (gRPC)
- 4318: OTEL Collector (HTTP)
- 8889: OTEL Collector (Prometheus exporter)
- 9090: Prometheus
- 3100: Loki

Options:
1. Stop services using these ports
2. Modify port mappings in docker-compose.yml (advanced)

Stop if: Critical ports (3000, 4317, 9090) are in use

Step 0.5: Check Disk Space

# Check available disk space
df -h ~

# Minimum: 2GB free (for Docker images ~1.5GB + data volumes)
# Recommended: 5GB+ free for comfortable operation

If low disk space:

Low disk space detected. Setup requires:
- Initial: ~1.5GB for Docker images (OTEL, Prometheus, Grafana, Loki)
- Runtime: 500MB+ for data volumes (grows over time)
- Minimum: 2GB free disk space required

Please free up space before continuing.

Phase 1: Directory Structure Creation

Step 1.1: Create Base Directory

mkdir -p ~/.claude/telemetry/{dashboards,docs}
cd ~/.claude/telemetry

Verify:

ls -la ~/.claude/telemetry
# Should show: dashboards/ and docs/ directories

Phase 2: Configuration File Generation

Step 2.1: Create docker-compose.yml

Template: templates/docker-compose-template.yml

services:
  # OpenTelemetry Collector - receives telemetry from Claude Code
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.115.1
    container_name: claude-otel-collector
    command: ["--config=/etc/otel-collector-config.yml"]
    volumes:
      - ./otel-collector-config.yml:/etc/otel-collector-config.yml
    ports:
      - "4317:4317"   # OTLP gRPC receiver
      - "4318:4318"   # OTLP HTTP receiver
      - "8889:8889"   # Prometheus metrics exporter
    networks:
      - claude-telemetry

  # Prometheus - stores metrics
  prometheus:
    image: prom/prometheus:v2.55.1
    container_name: claude-prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"
    networks:
      - claude-telemetry
    depends_on:
      - otel-collector

  # Loki - stores logs
  loki:
    image: grafana/loki:3.0.0
    container_name: claude-loki
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/local-config.yaml
    volumes:
      - loki-data:/loki
    networks:
      - claude-telemetry

  # Grafana - visualization dashboards
  grafana:
    image: grafana/grafana:11.3.0
    container_name: claude-grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
    networks:
      - claude-telemetry
    depends_on:
      - prometheus
      - loki

networks:
  claude-telemetry:
    driver: bridge

volumes:
  prometheus-data:
  loki-data:
  grafana-data:

Write to: ~/.claude/telemetry/docker-compose.yml

Note on Image Versions:

  • Versions are pinned to prevent breaking changes from upstream
  • Current versions (tested and stable):
    • OTEL Collector: 0.115.1
    • Prometheus: v2.55.1
    • Loki: 3.0.0
    • Grafana: 11.3.0
  • To update: Change version tags in docker-compose.yml and run docker compose pull

Step 2.2: Create OTEL Collector Configuration

Template: templates/otel-collector-config-template.yml

CRITICAL: Use debug exporter, not deprecated logging exporter

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024

  resource:
    attributes:
      - key: service.name
        value: claude-code
        action: upsert

  memory_limiter:
    check_interval: 1s
    limit_mib: 512

exporters:
  # Export metrics to Prometheus
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: claude_code
    const_labels:
      source: claude_code_telemetry

  # Export logs to Loki via OTLP HTTP
  otlphttp/loki:
    endpoint: http://loki:3100/otlp
    tls:
      insecure: true

  # Debug exporter (replaces deprecated logging exporter)
  debug:
    verbosity: normal

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [prometheus, debug]

    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [otlphttp/loki, debug]

  telemetry:
    logs:
      level: info

Write to: ~/.claude/telemetry/otel-collector-config.yml

Step 2.3: Create Prometheus Configuration

Template: templates/prometheus-config-template.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']

Write to: ~/.claude/telemetry/prometheus.yml

Step 2.4: Create Grafana Datasources Configuration

Template: templates/grafana-datasources-template.yml

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: true

Write to: ~/.claude/telemetry/grafana-datasources.yml

Step 2.5: Create Management Scripts

Start Script:

#!/bin/bash
# start-telemetry.sh

echo "🚀 Starting Claude Code Telemetry Stack..."

# Check if Docker is running
if ! docker info > /dev/null 2>&1; then
    echo "❌ Docker is not running. Please start Docker Desktop."
    exit 1
fi

cd ~/.claude/telemetry || exit 1

# Start containers
docker compose up -d

# Wait for services to be ready
echo "⏳ Waiting for services to start..."
sleep 5

# Check container status
echo ""
echo "📊 Container Status:"
docker ps --filter "name=claude-" --format "table {{.Names}}\t{{.Status}}"

echo ""
echo "✅ Telemetry stack started!"
echo ""
echo "🌐 Access URLs:"
echo "   Grafana:    http://localhost:3000 (admin/admin)"
echo "   Prometheus: http://localhost:9090"
echo "   Loki:       http://localhost:3100"
echo ""
echo "📝 Next steps:"
echo "   1. Restart Claude Code to activate telemetry"
echo "   2. Import dashboards into Grafana"
echo "   3. Use Claude Code normally - metrics will appear in ~60 seconds"

Write to: ~/.claude/telemetry/start-telemetry.sh

chmod +x ~/.claude/telemetry/start-telemetry.sh

Stop Script:

#!/bin/bash
# stop-telemetry.sh

echo "🛑 Stopping Claude Code Telemetry Stack..."

cd ~/.claude/telemetry || exit 1

docker compose down

echo "✅ Telemetry stack stopped"
echo ""
echo "Note: Data is preserved in Docker volumes."
echo "To start again: ./start-telemetry.sh"
echo "To completely remove all data: ./cleanup-telemetry.sh"

Write to: ~/.claude/telemetry/stop-telemetry.sh

chmod +x ~/.claude/telemetry/stop-telemetry.sh

Cleanup Script (Full Data Removal):

#!/bin/bash
# cleanup-telemetry.sh

echo "⚠️  WARNING: This will remove ALL telemetry data including:"
echo "  - All containers"
echo "  - All Docker volumes (Grafana, Prometheus, Loki data)"
echo "  - Network configuration"
echo ""
read -p "Are you sure you want to proceed? (yes/no): " -r
echo

if [[ ! $REPLY =~ ^[Yy][Ee][Ss]$ ]]; then
    echo "Cleanup cancelled."
    exit 0
fi

echo "Performing full cleanup of Claude Code telemetry stack..."

cd ~/.claude/telemetry || exit 1

docker compose down -v

echo ""
echo "✅ Full cleanup complete!"
echo ""
echo "Removed:"
echo "  ✓ All containers (otel-collector, prometheus, loki, grafana)"
echo "  ✓ All volumes (all historical data)"
echo "  ✓ Network configuration"
echo ""
echo "Preserved:"
echo "  ✓ Configuration files in ~/.claude/telemetry/"
echo "  ✓ Claude Code settings in ~/.claude/settings.json"
echo ""
echo "To start fresh: ./start-telemetry.sh"

Write to: ~/.claude/telemetry/cleanup-telemetry.sh

chmod +x ~/.claude/telemetry/cleanup-telemetry.sh

Phase 3: Start Docker Containers

Step 3.1: Start All Services

cd ~/.claude/telemetry
docker compose up -d

Expected output:

[+] Running 5/5
 ✔ Network claude_claude-telemetry     Created
 ✔ Container claude-loki               Started
 ✔ Container claude-otel-collector     Started
 ✔ Container claude-prometheus         Started
 ✔ Container claude-grafana            Started

Step 3.2: Verify Containers are Running

docker ps --filter "name=claude-" --format "table {{.Names}}\t{{.Status}}"

Expected: All 4 containers showing "Up X seconds/minutes"

If OTEL Collector is not running:

# Check logs
docker logs claude-otel-collector

Common issue: "logging exporter deprecated" error Solution: Config file uses debug exporter (already fixed in template)

Step 3.3: Wait for Services to be Healthy

# Give services time to initialize
sleep 10

# Test Prometheus
curl -s http://localhost:9090/-/healthy
# Expected: Prometheus is Healthy.

# Test Grafana
curl -s http://localhost:3000/api/health | jq
# Expected: {"database": "ok", ...}

Phase 4: Update Claude Code Settings

Step 4.1: Backup Existing Settings

cp ~/.claude/settings.json ~/.claude/settings.json.backup

Step 4.2: Read Current Settings

# Read existing settings
cat ~/.claude/settings.json

Step 4.3: Merge Telemetry Configuration

Add to settings.json env section:

{
  "env": {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "OTEL_METRICS_EXPORTER": "otlp",
    "OTEL_LOGS_EXPORTER": "otlp",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
    "OTEL_METRIC_EXPORT_INTERVAL": "60000",
    "OTEL_LOGS_EXPORT_INTERVAL": "5000",
    "OTEL_LOG_USER_PROMPTS": "1",
    "OTEL_METRICS_INCLUDE_SESSION_ID": "true",
    "OTEL_METRICS_INCLUDE_VERSION": "true",
    "OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "true",
    "OTEL_RESOURCE_ATTRIBUTES": "environment=local,deployment=poc"
  }
}

Template: templates/settings-env-template.json

Note: Merge with existing env vars, don't replace entire settings file

Step 4.4: Verify Settings Updated

cat ~/.claude/settings.json | grep CLAUDE_CODE_ENABLE_TELEMETRY
# Expected: "CLAUDE_CODE_ENABLE_TELEMETRY": "1"

Phase 5: Grafana Dashboard Import

Step 5.1: Detect Prometheus Datasource UID

Option A: Via Grafana API

curl -s http://admin:admin@localhost:3000/api/datasources | \
  jq '.[] | select(.type=="prometheus") | {name, uid}'

Expected:

{
  "name": "Prometheus",
  "uid": "PBFA97CFB590B2093"
}

Option B: Manual Detection

  1. Open http://localhost:3000
  2. Go to Connections → Data sources
  3. Click Prometheus
  4. Note the UID from the URL: /datasources/edit/{UID}

Step 5.2: Fix Dashboard with Correct UID

Read dashboard template: dashboards/claude-code-overview-template.json

Replace all instances of:

"datasource": {
  "type": "prometheus",
  "uid": "prometheus"
}

With:

"datasource": {
  "type": "prometheus",
  "uid": "PBFA97CFB590B2093"
}

Use detected UID from Step 5.1

Step 5.3: Verify Metric Names

CRITICAL: Claude Code metrics use double prefix: claude_code_claude_code_*

Verify actual metric names:

curl -s 'http://localhost:9090/api/v1/label/__name__/values' | \
  grep claude_code

Expected metrics:

  • claude_code_claude_code_active_time_seconds_total
  • claude_code_claude_code_commit_count_total
  • claude_code_claude_code_cost_usage_USD_total
  • claude_code_claude_code_lines_of_code_count_total
  • claude_code_claude_code_token_usage_tokens_total

Dashboard queries must use these exact names

Step 5.4: Save Corrected Dashboard

Write to: ~/.claude/telemetry/dashboards/claude-code-overview.json

Step 5.5: Import Dashboard

Option A: Via Grafana UI

  1. Open http://localhost:3000 (admin/admin)
  2. Dashboards → New → Import
  3. Upload JSON file: ~/.claude/telemetry/dashboards/claude-code-overview.json
  4. Click Import

Option B: Via API

curl -X POST http://admin:admin@localhost:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @~/.claude/telemetry/dashboards/claude-code-overview.json

Phase 6: Verification & Testing

Step 6.1: Verify OTEL Collector Receiving Data

Note: Claude Code must be restarted for telemetry to activate!

# Check OTEL Collector logs for incoming data
docker logs claude-otel-collector --tail 50 | grep -i "received"

Expected: Messages about receiving OTLP data

If no data:

Reminder: You must restart Claude Code for telemetry to activate.
1. Exit current Claude Code session
2. Start new session: claude
3. Wait 60 seconds
4. Check again

Step 6.2: Query Prometheus for Metrics

# Check if any claude_code metrics exist
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | \
  jq '.data[] | select(. | startswith("claude_code"))'

Expected: List of claude_code metrics

Sample query:

curl -s 'http://localhost:9090/api/v1/query?query=claude_code_claude_code_lines_of_code_count_total' | \
  jq '.data.result'

Expected: Non-empty result array

Step 6.3: Test Grafana Dashboard

  1. Open http://localhost:3000
  2. Navigate to imported dashboard
  3. Check panels show data (or "No data" if Claude Code hasn't been used yet)

If "No data":

  • Normal if Claude Code hasn't generated any activity yet
  • Use Claude Code for 1-2 minutes
  • Refresh dashboard

If "Datasource not found":

  • UID mismatch - go back to Step 5.1

If queries fail:

  • Metric name mismatch - verify double prefix

Step 6.4: Generate Test Data

To populate dashboard quickly:

Use Claude Code to:
1. Ask a question (generates token usage)
2. Request a code modification (generates LOC metrics)
3. Have a conversation (generates active time)

Wait 60 seconds, then refresh Grafana dashboard


Phase 7: Documentation & Quickstart Guide

Step 7.1: Create Quickstart Guide

Write to: ~/.claude/telemetry/docs/quickstart.md

Include:

  • URLs and credentials
  • Management commands (start/stop)
  • What metrics are being collected
  • How to access dashboards
  • Troubleshooting quick reference

Template: data/quickstart-template.md

Step 7.2: Provide User Summary

✅ Setup Complete!

📦 Installation:
   Location: ~/.claude/telemetry/
   Containers: 4 running (OTEL Collector, Prometheus, Loki, Grafana)

🌐 Access URLs:
   Grafana:    http://localhost:3000 (admin/admin)
   Prometheus: http://localhost:9090
   OTEL Collector: localhost:4317 (gRPC), localhost:4318 (HTTP)

📊 Dashboards Imported:
   ✓ Claude Code - Overview

📝 What's Being Collected:
   • Session counts and active time
   • Token usage (input/output/cached)
   • API costs by model
   • Lines of code modified
   • Commits and PRs created
   • Tool execution metrics

⚙️  Management:
   Start:   ~/.claude/telemetry/start-telemetry.sh
   Stop:    ~/.claude/telemetry/stop-telemetry.sh (preserves data)
   Cleanup: ~/.claude/telemetry/cleanup-telemetry.sh (removes all data)
   Logs:    docker logs claude-otel-collector

🚀 Next Steps:
   1. ✅ Restart Claude Code (telemetry activates on startup)
   2. Use Claude Code normally
   3. Check dashboard in ~60 seconds
   4. Review quickstart: ~/.claude/telemetry/docs/quickstart.md

📚 Documentation:
   - Quickstart: ~/.claude/telemetry/docs/quickstart.md
   - Metrics Reference: data/metrics-reference.md
   - Troubleshooting: data/troubleshooting.md

Cleanup Instructions

Remove Stack (Keep Data)

cd ~/.claude/telemetry
docker compose down

Remove Stack and Data

cd ~/.claude/telemetry
docker compose down -v

Remove Telemetry from Claude Code

Edit ~/.claude/settings.json and remove the env section with telemetry variables, or set:

"CLAUDE_CODE_ENABLE_TELEMETRY": "0"

Then restart Claude Code.


Troubleshooting

See data/troubleshooting.md for detailed solutions to common issues.

Quick fixes:

  • Container won't start → Check logs: docker logs claude-otel-collector
  • No metrics → Restart Claude Code
  • Dashboard broken → Verify datasource UID
  • Wrong metric names → Use double prefix: claude_code_claude_code_*