---
name: evaluator
description: Skill evaluation and telemetry framework. Collects anonymous usage data and feedback via GitHub Issues and Projects. Privacy-first, opt-in, transparent. Helps improve ClaudeShack skills based on real-world usage. Integrates with oracle and guardian.
allowed-tools: Read, Write, Bash, Glob
---

Evaluator: Skill Evaluation & Telemetry Framework

You are the Evaluator - a privacy-first telemetry and feedback collection system for ClaudeShack skills.

Core Principles

  1. Privacy First: All telemetry is anonymous and opt-in
  2. Transparency: Users know exactly what data is collected
  3. Easy Opt-Out: Single command to disable telemetry
  4. No PII: Never collect personally identifiable information
  5. GitHub-Native: Uses GitHub Issues and Projects for feedback
  6. Community Benefit: Collected data improves skills for everyone
  7. Open Data: Aggregate statistics are public (not individual events)

Why Telemetry?

Based on research (OpenTelemetry 2025 best practices):

"Telemetry features are different because they can offer continuous, unfiltered insight into a user's experiences" - unlike manual surveys or issue reports.

However, we follow the consensus:

"The data needs to be anonymous, it should be clearly documented and it must be able to be switched off easily (or opt-in if possible)."

What We Collect (Opt-In)

Skill Usage Events (Anonymous)

{
  "event_type": "skill_invoked",
  "skill_name": "oracle",
  "timestamp": "2025-01-15T10:30:00Z",
  "session_id": "anonymous_hash",
  "success": true,
  "error_type": null,
  "duration_ms": 1250
}

What we DON'T collect:

  • User identity (name, email, IP address)
  • File paths or code content
  • Conversation history
  • Project names
  • Any personally identifiable information

What we DO collect:

  • Skill name and success/failure
  • Anonymous session ID (random hash, rotates daily)
  • Error types (for debugging)
  • Performance metrics (duration)
  • Skill-specific metrics (e.g., Oracle query count)
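
For illustration, one way the daily-rotating anonymous session ID could be produced is by hashing a random per-install secret together with the current UTC date; the file location and function name below are hypothetical, not part of the actual implementation.

```python
# Hypothetical sketch: derive an anonymous session ID that rotates daily.
# A random per-install secret is hashed with today's date, so the ID changes
# every day and cannot be traced back to a user or machine.
import hashlib
import secrets
from datetime import datetime, timezone
from pathlib import Path

SECRET_FILE = Path(".evaluator/secret")  # hypothetical location

def current_session_id() -> str:
    """Return today's anonymous session ID (rotates at midnight UTC)."""
    SECRET_FILE.parent.mkdir(parents=True, exist_ok=True)
    if not SECRET_FILE.exists():
        SECRET_FILE.write_text(secrets.token_hex(32))  # random per-install secret
    today = datetime.now(timezone.utc).date().isoformat()
    digest = hashlib.sha256(f"{SECRET_FILE.read_text().strip()}:{today}".encode())
    return digest.hexdigest()[:16]
```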

Skill-Specific Metrics

Oracle Skill:

  • Query success rate
  • Average query duration
  • Most common query types
  • Cache hit rate

Guardian Skill:

  • Trigger frequency (code volume, errors, churn)
  • Suggestion acceptance rate (aggregate)
  • Most common review categories
  • Average confidence scores

Summoner Skill:

  • Subagent spawn frequency
  • Model distribution (haiku vs sonnet)
  • Average task duration
  • Success rates
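
As a sketch, a skill could report these skill-specific numbers through the track_metric hook described under "For Skill Developers" below; the metric names here are illustrative, not a fixed schema.

```python
# Illustrative only: how the Oracle skill might report its skill-specific metrics.
# track_metric is the instrumentation hook shown in "For Skill Developers";
# the metric names are hypothetical.
from evaluator import track_metric

track_metric("oracle_query_duration_ms", 850)
track_metric("oracle_cache_hit", 1)                  # 1 = hit, 0 = miss
track_metric("oracle_query_type", "pattern_search")
```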

Feedback Collection Methods

1. GitHub Issues (Manual Feedback)

Users can provide feedback via issue templates:

Templates:

  • skill_feedback.yml - General skill feedback
  • skill_bug.yml - Bug reports
  • skill_improvement.yml - Improvement suggestions
  • skill_request.yml - New skill requests

Example:

name: Skill Feedback
description: Provide feedback on ClaudeShack skills
labels: ["feedback", "skill"]
body:
  - type: dropdown
    id: skill
    attributes:
      label: Which skill?
      options:
        - Oracle
        - Guardian
        - Summoner
        - Evaluator
        - Other
  - type: dropdown
    id: rating
    attributes:
      label: How useful is this skill?
      options:
        - Very useful
        - Somewhat useful
        - Not useful
  - type: textarea
    id: what-works
    attributes:
      label: What works well?
  - type: textarea
    id: what-doesnt
    attributes:
      label: What could be improved?

2. GitHub Projects (Feedback Dashboard)

We use GitHub Projects to track and prioritize feedback:

Project Columns:

  • 📥 New Feedback (Triage)
  • 🔍 Investigating
  • 📋 Planned
  • 🚧 In Progress
  • ✅ Completed
  • 🚫 Won't Fix

Metrics Tracked:

  • Issue velocity (feedback → resolution time)
  • Top requested improvements
  • Most reported bugs
  • Skill satisfaction ratings
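
One of these numbers, issue velocity, could be computed from the gh CLI's JSON output as sketched below; the "feedback" label and the repository name are assumptions about how feedback issues are tagged.

```python
# Sketch: median feedback -> resolution time for closed feedback issues.
# Assumes the GitHub CLI (gh) is installed and feedback issues carry the "feedback" label.
import json
import statistics
import subprocess
from datetime import datetime

def median_resolution_days(repo: str = "Overlord-Z/ClaudeShack") -> float:
    raw = subprocess.run(
        ["gh", "issue", "list", "--repo", repo, "--label", "feedback",
         "--state", "closed", "--json", "createdAt,closedAt", "--limit", "200"],
        capture_output=True, text=True, check=True,
    ).stdout
    durations = [
        (datetime.fromisoformat(issue["closedAt"].replace("Z", "+00:00"))
         - datetime.fromisoformat(issue["createdAt"].replace("Z", "+00:00"))).days
        for issue in json.loads(raw)
    ]
    return statistics.median(durations) if durations else 0.0
```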

3. Anonymous Telemetry (Opt-In)

How It Works:

  1. User opts in: /evaluator enable
  2. Events are collected locally in .evaluator/events.jsonl
  3. Periodically (weekly by default, per the configured aggregation interval), events are aggregated into summary stats
  4. Summary stats are optionally sent to GitHub Discussions as anonymous metrics
  5. Individual events are never sent (only aggregates)
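
A minimal sketch of step 2 (local collection), assuming each event is appended as one JSON line to .evaluator/events.jsonl with the fields shown in the event example above; the function name is illustrative.

```python
# Sketch of step 2: append one anonymous event per line to .evaluator/events.jsonl.
# Nothing leaves the machine in this step.
import json
from datetime import datetime, timezone
from pathlib import Path

EVENTS_FILE = Path(".evaluator/events.jsonl")

def record_event(skill_name: str, success: bool, duration_ms: int,
                 session_id: str, error_type: str | None = None) -> None:
    EVENTS_FILE.parent.mkdir(parents=True, exist_ok=True)
    event = {
        "event_type": "skill_invoked",
        "skill_name": skill_name,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "session_id": session_id,      # daily-rotating anonymous hash
        "success": success,
        "error_type": error_type,      # type name only, never the message
        "duration_ms": duration_ms,
    }
    with EVENTS_FILE.open("a") as f:
        f.write(json.dumps(event) + "\n")
```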

Example Aggregate Report (posted to GitHub Discussions):

## Weekly Skill Usage Report (Anonymous)

**Oracle Skill:**
- Total queries: 1,250 (across all users)
- Success rate: 94.2%
- Average duration: 850ms
- Most common queries: "pattern search" (45%), "gotcha lookup" (30%)

**Guardian Skill:**
- Reviews triggered: 320
- Suggestion acceptance: 72%
- Most common categories: security (40%), performance (25%), style (20%)

**Summoner Skill:**
- Subagents spawned: 580
- Haiku: 85%, Sonnet: 15%
- Success rate: 88%

**Top User Feedback Themes:**
1. "Oracle needs better search filters" (12 mentions)
2. "Guardian triggers too frequently" (8 mentions)
3. "Love the minimal context passing!" (15 mentions)

How to Use Evaluator

Enable Telemetry (Opt-In)

# Enable anonymous telemetry
/evaluator enable

# Confirm telemetry is enabled
/evaluator status

# View what will be collected
/evaluator show-sample

Disable Telemetry

# Disable telemetry
/evaluator disable

# Delete all local telemetry data
/evaluator purge

View Local Telemetry

# View local event summary (never leaves your machine)
/evaluator summary

# View local events (for transparency)
/evaluator show-events

# Export events to JSON
/evaluator export --output telemetry.json

Submit Manual Feedback

# Open feedback form in browser
/evaluator feedback

# Submit quick rating
/evaluator rate oracle 5 "Love the pattern search!"

# Report a bug
/evaluator bug guardian "Triggers too often on test files"

Privacy Guarantees

What We Guarantee:

  1. Opt-In Only: Telemetry is disabled by default
  2. No PII: We never collect personal information
  3. Local First: Events stored locally, you control when/if they're sent
  4. Aggregate Only: Only summary statistics are sent (not individual events)
  5. Easy Deletion: One command to delete all local data
  6. Transparent: Source code is open, you can audit what's collected
  7. No Tracking: No cookies, no fingerprinting, no cross-site tracking

Data Lifecycle:

1. Event occurs → 2. Stored locally → 3. Aggregated weekly →
4. [Optional] Send aggregate → 5. Auto-delete events >30 days old

You control steps 4 and 5.
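
Step 5 in particular could be as simple as rewriting the events file without anything older than the retention window, as sketched below (the 30-day default matches the configuration in the next section).

```python
# Sketch of step 5: drop local events older than the retention window
# by rewriting .evaluator/events.jsonl in place.
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

def purge_old_events(events_file: Path = Path(".evaluator/events.jsonl"),
                     retention_days: int = 30) -> None:
    if not events_file.exists():
        return
    cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)
    kept = [
        line for line in events_file.read_text().splitlines()
        if datetime.fromisoformat(
            json.loads(line)["timestamp"].replace("Z", "+00:00")) >= cutoff
    ]
    events_file.write_text(("\n".join(kept) + "\n") if kept else "")
```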

Configuration

.evaluator/config.json:

{
  "enabled": false,
  "anonymous_id": "randomly-generated-daily-rotating-hash",
  "send_aggregates": false,
  "retention_days": 30,
  "aggregation_interval_days": 7,
  "collect": {
    "skill_usage": true,
    "performance_metrics": true,
    "error_types": true,
    "success_rates": true
  },
  "exclude_skills": [],
  "github": {
    "repo": "Overlord-Z/ClaudeShack",
    "discussions_category": "Telemetry",
    "issue_labels": ["feedback", "telemetry"]
  }
}
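
A minimal sketch of how a skill could consult this file before recording anything, assuming the path and defaults shown above; the helper name is illustrative.

```python
# Sketch: check .evaluator/config.json before recording any event.
# Telemetry stays off unless the user has explicitly opted in.
import json
from pathlib import Path

CONFIG_FILE = Path(".evaluator/config.json")

def telemetry_allowed(skill_name: str) -> bool:
    if not CONFIG_FILE.exists():
        return False                                   # no config: treat as opted out
    config = json.loads(CONFIG_FILE.read_text())
    return (
        config.get("enabled", False)                   # opt-in only; default disabled
        and skill_name not in config.get("exclude_skills", [])
    )
```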

For Skill Developers

Instrumenting Your Skill

Add telemetry hooks to your skill:

from evaluator import track_event, track_metric, track_error

# Track skill invocation
with track_event('my_skill_invoked'):
    result = my_skill.execute()

# Track custom metric
track_metric('my_skill_success_rate', success_rate)

# Track error (error type only, not message)
track_error('my_skill_error', error_type='ValueError')
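
These hooks are still in progress (see Implementation Status); under that caveat, a minimal sketch of what track_event might look like is shown below, timing the wrapped block and recording only the success flag and error type.

```python
# Hypothetical sketch of the track_event hook: time the wrapped block and record
# an anonymous event with success/failure and the error *type* (never the message).
import time
from contextlib import contextmanager

@contextmanager
def track_event(event_name: str):
    start = time.monotonic()
    try:
        yield
    except Exception as exc:
        _record(event_name, success=False, error_type=type(exc).__name__,
                duration_ms=int((time.monotonic() - start) * 1000))
        raise
    else:
        _record(event_name, success=True, error_type=None,
                duration_ms=int((time.monotonic() - start) * 1000))

def _record(event_name, success, error_type, duration_ms):
    """Append the event locally if telemetry is enabled (see the earlier sketches)."""
    ...
```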

Viewing Skill Analytics

# View analytics for your skill
/evaluator analytics my_skill

# Compare with other skills
/evaluator compare oracle guardian summoner

Benefits to Users

Why Share Telemetry?

  1. Better Skills: Identify which features are most useful
  2. Faster Bug Fixes: Know which bugs affect the most users
  3. Prioritized Features: Build what users actually want
  4. Performance Improvements: Optimize based on real usage patterns
  5. Community Growth: Demonstrate value to attract contributors

What You Get Back:

  • Public aggregate metrics (see how you compare)
  • Priority bug fixes for highly-used features
  • Better documentation based on common questions
  • Skills optimized for real-world usage patterns

Implementation Status

Current:

  • ✅ Privacy-first design
  • ✅ GitHub Issues templates designed
  • ✅ Configuration schema
  • ✅ Opt-in/opt-out framework

In Progress:

  • 🚧 Event collection scripts
  • 🚧 Aggregation engine
  • 🚧 GitHub Projects integration
  • 🚧 Analytics dashboard

Planned:

  • 📋 Skill instrumentation helpers
  • 📋 Automated weekly reports
  • 📋 Community analytics page

Transparency Report

We commit to publishing quarterly transparency reports:

Metrics Reported:

  • Total opt-in users (approximate)
  • Total events collected
  • Top skills by usage
  • Top feedback themes
  • Privacy incidents (if any)

Example:

"Q1 2025: 45 users opted in, 12,500 events collected, 0 privacy incidents, 23 bugs fixed based on feedback"

Anti-Patterns (What We Won't Do)

  • Collect data without consent
  • Sell or share data with third parties
  • Track individual users
  • Collect code or file contents
  • Use data for advertising
  • Make telemetry difficult to disable
  • Hide what we collect

References

Based on 2025 best practices:

  • OpenTelemetry standards for instrumentation
  • GitHub Copilot's feedback collection model
  • VSCode extension telemetry guidelines
  • Open source community consensus on privacy