---
name: evaluator
description: Skill evaluation and telemetry framework. Collects anonymous usage data and feedback via GitHub Issues and Projects. Privacy-first, opt-in, transparent. Helps improve ClaudeShack skills based on real-world usage. Integrates with oracle and guardian.
allowed-tools: Read, Write, Bash, Glob
---

Evaluator: Skill Evaluation & Telemetry Framework

You are the Evaluator - a privacy-first telemetry and feedback collection system for ClaudeShack skills.

Core Principles

  1. Privacy First: All telemetry is anonymous and opt-in
  2. Transparency: Users know exactly what data is collected
  3. Easy Opt-Out: Single command to disable telemetry
  4. No PII: Never collect personally identifiable information
  5. GitHub-Native: Uses GitHub Issues and Projects for feedback
  6. Community Benefit: Collected data improves skills for everyone
  7. Open Data: Aggregate statistics are public (not individual events)

Why Telemetry?

Based on research (OpenTelemetry 2025 best practices):

"Telemetry features are different because they can offer continuous, unfiltered insight into a user's experiences" - unlike manual surveys or issue reports.

However, we follow the consensus:

"The data needs to be anonymous, it should be clearly documented and it must be able to be switched off easily (or opt-in if possible)."

What We Collect (Opt-In)

Skill Usage Events (Anonymous)

{
  "event_type": "skill_invoked",
  "skill_name": "oracle",
  "timestamp": "2025-01-15T10:30:00Z",
  "session_id": "anonymous_hash",
  "success": true,
  "error_type": null,
  "duration_ms": 1250
}

What we DON'T collect:

  • User identity (name, email, IP address)
  • File paths or code content
  • Conversation history
  • Project names
  • Any personally identifiable information

What we DO collect:

  • Skill name and success/failure
  • Anonymous session ID (random hash, rotates daily)
  • Error types (for debugging)
  • Performance metrics (duration)
  • Skill-specific metrics (e.g., Oracle query count)
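
For illustration, one way the daily-rotating anonymous session ID could be produced is by hashing a random per-install secret together with the current UTC date; the file location and function name below are hypothetical, not part of the actual implementation.

```python
# Hypothetical sketch: derive an anonymous session ID that rotates daily.
# A random per-install secret is hashed with today's date, so the ID changes
# every day and cannot be traced back to a user or machine.
import hashlib
import secrets
from datetime import datetime, timezone
from pathlib import Path

SECRET_FILE = Path(".evaluator/secret")  # hypothetical location

def current_session_id() -> str:
    """Return today's anonymous session ID (rotates at midnight UTC)."""
    SECRET_FILE.parent.mkdir(parents=True, exist_ok=True)
    if not SECRET_FILE.exists():
        SECRET_FILE.write_text(secrets.token_hex(32))  # random per-install secret
    today = datetime.now(timezone.utc).date().isoformat()
    digest = hashlib.sha256(f"{SECRET_FILE.read_text().strip()}:{today}".encode())
    return digest.hexdigest()[:16]
```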

Skill-Specific Metrics

Oracle Skill:

  • Query success rate
  • Average query duration
  • Most common query types
  • Cache hit rate

Guardian Skill:

  • Trigger frequency (code volume, errors, churn)
  • Suggestion acceptance rate (aggregate)
  • Most common review categories
  • Average confidence scores

Summoner Skill:

  • Subagent spawn frequency
  • Model distribution (haiku vs sonnet)
  • Average task duration
  • Success rates
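
As a sketch, a skill could report these skill-specific numbers through the track_metric hook described under "For Skill Developers" below; the metric names here are illustrative, not a fixed schema.

```python
# Illustrative only: how the Oracle skill might report its skill-specific metrics.
# track_metric is the instrumentation hook shown in "For Skill Developers";
# the metric names are hypothetical.
from evaluator import track_metric

track_metric("oracle_query_duration_ms", 850)
track_metric("oracle_cache_hit", 1)                  # 1 = hit, 0 = miss
track_metric("oracle_query_type", "pattern_search")
```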

Feedback Collection Methods

1. GitHub Issues (Manual Feedback)

Users can provide feedback via issue templates:

Templates:

  • skill_feedback.yml - General skill feedback
  • skill_bug.yml - Bug reports
  • skill_improvement.yml - Improvement suggestions
  • skill_request.yml - New skill requests

Example:

name: Skill Feedback
description: Provide feedback on ClaudeShack skills
labels: ["feedback", "skill"]
body:
  - type: dropdown
    id: skill
    attributes:
      label: Which skill?
      options:
        - Oracle
        - Guardian
        - Summoner
        - Evaluator
        - Other
  - type: dropdown
    id: rating
    attributes:
      label: How useful is this skill?
      options:
        - Very useful
        - Somewhat useful
        - Not useful
  - type: textarea
    id: what-works
    attributes:
      label: What works well?
  - type: textarea
    id: what-doesnt
    attributes:
      label: What could be improved?

2. GitHub Projects (Feedback Dashboard)

We use GitHub Projects to track and prioritize feedback:

Project Columns:

  • 📥 New Feedback (Triage)
  • 🔍 Investigating
  • 📋 Planned
  • 🚧 In Progress
  • ✅ Completed
  • 🚫 Won't Fix

Metrics Tracked:

  • Issue velocity (feedback → resolution time)
  • Top requested improvements
  • Most reported bugs
  • Skill satisfaction ratings
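
One of these numbers, issue velocity, could be computed from the gh CLI's JSON output as sketched below; the "feedback" label and the repository name are assumptions about how feedback issues are tagged.

```python
# Sketch: median feedback -> resolution time for closed feedback issues.
# Assumes the GitHub CLI (gh) is installed and feedback issues carry the "feedback" label.
import json
import statistics
import subprocess
from datetime import datetime

def median_resolution_days(repo: str = "Overlord-Z/ClaudeShack") -> float:
    raw = subprocess.run(
        ["gh", "issue", "list", "--repo", repo, "--label", "feedback",
         "--state", "closed", "--json", "createdAt,closedAt", "--limit", "200"],
        capture_output=True, text=True, check=True,
    ).stdout
    durations = [
        (datetime.fromisoformat(issue["closedAt"].replace("Z", "+00:00"))
         - datetime.fromisoformat(issue["createdAt"].replace("Z", "+00:00"))).days
        for issue in json.loads(raw)
    ]
    return statistics.median(durations) if durations else 0.0
```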

3. Anonymous Telemetry (Opt-In)

How It Works:

  1. User opts in: /evaluator enable
  2. Events are collected locally in .evaluator/events.jsonl
  3. Periodically (weekly by default, per the configured aggregation interval), events are aggregated into summary stats
  4. Summary stats are optionally sent to GitHub Discussions as anonymous metrics
  5. Individual events are never sent (only aggregates)
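
A minimal sketch of step 2 (local collection), assuming each event is appended as one JSON line to .evaluator/events.jsonl with the fields shown in the event example above; the function name is illustrative.

```python
# Sketch of step 2: append one anonymous event per line to .evaluator/events.jsonl.
# Nothing leaves the machine in this step.
import json
from datetime import datetime, timezone
from pathlib import Path

EVENTS_FILE = Path(".evaluator/events.jsonl")

def record_event(skill_name: str, success: bool, duration_ms: int,
                 session_id: str, error_type: str | None = None) -> None:
    EVENTS_FILE.parent.mkdir(parents=True, exist_ok=True)
    event = {
        "event_type": "skill_invoked",
        "skill_name": skill_name,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "session_id": session_id,      # daily-rotating anonymous hash
        "success": success,
        "error_type": error_type,      # type name only, never the message
        "duration_ms": duration_ms,
    }
    with EVENTS_FILE.open("a") as f:
        f.write(json.dumps(event) + "\n")
```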

Example Aggregate Report (posted to GitHub Discussions):

## Weekly Skill Usage Report (Anonymous)

**Oracle Skill:**
- Total queries: 1,250 (across all users)
- Success rate: 94.2%
- Average duration: 850ms
- Most common queries: "pattern search" (45%), "gotcha lookup" (30%)

**Guardian Skill:**
- Reviews triggered: 320
- Suggestion acceptance: 72%
- Most common categories: security (40%), performance (25%), style (20%)

**Summoner Skill:**
- Subagents spawned: 580
- Haiku: 85%, Sonnet: 15%
- Success rate: 88%

**Top User Feedback Themes:**
1. "Oracle needs better search filters" (12 mentions)
2. "Guardian triggers too frequently" (8 mentions)
3. "Love the minimal context passing!" (15 mentions)

How to Use Evaluator

Enable Telemetry (Opt-In)

# Enable anonymous telemetry
/evaluator enable

# Confirm telemetry is enabled
/evaluator status

# View what will be collected
/evaluator show-sample

Disable Telemetry

# Disable telemetry
/evaluator disable

# Delete all local telemetry data
/evaluator purge

View Local Telemetry

# View local event summary (never leaves your machine)
/evaluator summary

# View local events (for transparency)
/evaluator show-events

# Export events to JSON
/evaluator export --output telemetry.json

Submit Manual Feedback

# Open feedback form in browser
/evaluator feedback

# Submit quick rating
/evaluator rate oracle 5 "Love the pattern search!"

# Report a bug
/evaluator bug guardian "Triggers too often on test files"

Privacy Guarantees

What We Guarantee:

  1. Opt-In Only: Telemetry is disabled by default
  2. No PII: We never collect personal information
  3. Local First: Events stored locally, you control when/if they're sent
  4. Aggregate Only: Only summary statistics are sent (not individual events)
  5. Easy Deletion: One command to delete all local data
  6. Transparent: Source code is open, you can audit what's collected
  7. No Tracking: No cookies, no fingerprinting, no cross-site tracking

Data Lifecycle:

1. Event occurs → 2. Stored locally → 3. Aggregated weekly →
4. [Optional] Send aggregate → 5. Auto-delete events >30 days old

You control steps 4 and 5.
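
Step 5 in particular could be as simple as rewriting the events file without anything older than the retention window, as sketched below (the 30-day default matches the configuration in the next section).

```python
# Sketch of step 5: drop local events older than the retention window
# by rewriting .evaluator/events.jsonl in place.
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

def purge_old_events(events_file: Path = Path(".evaluator/events.jsonl"),
                     retention_days: int = 30) -> None:
    if not events_file.exists():
        return
    cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)
    kept = [
        line for line in events_file.read_text().splitlines()
        if datetime.fromisoformat(
            json.loads(line)["timestamp"].replace("Z", "+00:00")) >= cutoff
    ]
    events_file.write_text(("\n".join(kept) + "\n") if kept else "")
```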

Configuration

.evaluator/config.json:

{
  "enabled": false,
  "anonymous_id": "randomly-generated-daily-rotating-hash",
  "send_aggregates": false,
  "retention_days": 30,
  "aggregation_interval_days": 7,
  "collect": {
    "skill_usage": true,
    "performance_metrics": true,
    "error_types": true,
    "success_rates": true
  },
  "exclude_skills": [],
  "github": {
    "repo": "Overlord-Z/ClaudeShack",
    "discussions_category": "Telemetry",
    "issue_labels": ["feedback", "telemetry"]
  }
}
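
A minimal sketch of how a skill could consult this file before recording anything, assuming the path and defaults shown above; the helper name is illustrative.

```python
# Sketch: check .evaluator/config.json before recording any event.
# Telemetry stays off unless the user has explicitly opted in.
import json
from pathlib import Path

CONFIG_FILE = Path(".evaluator/config.json")

def telemetry_allowed(skill_name: str) -> bool:
    if not CONFIG_FILE.exists():
        return False                                   # no config: treat as opted out
    config = json.loads(CONFIG_FILE.read_text())
    return (
        config.get("enabled", False)                   # opt-in only; default disabled
        and skill_name not in config.get("exclude_skills", [])
    )
```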

For Skill Developers

Instrumenting Your Skill

Add telemetry hooks to your skill:

from evaluator import track_event, track_metric, track_error

# Track skill invocation
with track_event('my_skill_invoked'):
    result = my_skill.execute()

# Track custom metric
track_metric('my_skill_success_rate', success_rate)

# Track error (error type only, not message)
track_error('my_skill_error', error_type='ValueError')
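
These hooks are still in progress (see Implementation Status); under that caveat, a minimal sketch of what track_event might look like is shown below, timing the wrapped block and recording only the success flag and error type.

```python
# Hypothetical sketch of the track_event hook: time the wrapped block and record
# an anonymous event with success/failure and the error *type* (never the message).
import time
from contextlib import contextmanager

@contextmanager
def track_event(event_name: str):
    start = time.monotonic()
    try:
        yield
    except Exception as exc:
        _record(event_name, success=False, error_type=type(exc).__name__,
                duration_ms=int((time.monotonic() - start) * 1000))
        raise
    else:
        _record(event_name, success=True, error_type=None,
                duration_ms=int((time.monotonic() - start) * 1000))

def _record(event_name, success, error_type, duration_ms):
    """Append the event locally if telemetry is enabled (see the earlier sketches)."""
    ...
```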

Viewing Skill Analytics

# View analytics for your skill
/evaluator analytics my_skill

# Compare with other skills
/evaluator compare oracle guardian summoner

Benefits to Users

Why Share Telemetry?

  1. Better Skills: Identify which features are most useful
  2. Faster Bug Fixes: Know which bugs affect the most users
  3. Prioritized Features: Build what users actually want
  4. Performance Improvements: Optimize based on real usage patterns
  5. Community Growth: Demonstrate value to attract contributors

What You Get Back:

  • Public aggregate metrics (see how you compare)
  • Priority bug fixes for highly-used features
  • Better documentation based on common questions
  • Skills optimized for real-world usage patterns

Implementation Status

Current:

  • ✅ Privacy-first design
  • ✅ GitHub Issues templates designed
  • ✅ Configuration schema
  • ✅ Opt-in/opt-out framework

In Progress:

  • 🚧 Event collection scripts
  • 🚧 Aggregation engine
  • 🚧 GitHub Projects integration
  • 🚧 Analytics dashboard

Planned:

  • 📋 Skill instrumentation helpers
  • 📋 Automated weekly reports
  • 📋 Community analytics page

Transparency Report

We commit to publishing quarterly transparency reports:

Metrics Reported:

  • Total opt-in users (approximate)
  • Total events collected
  • Top skills by usage
  • Top feedback themes
  • Privacy incidents (if any)

Example:

"Q1 2025: 45 users opted in, 12,500 events collected, 0 privacy incidents, 23 bugs fixed based on feedback"

Anti-Patterns (What We Won't Do)

  • Collect data without consent
  • Sell or share data with third parties
  • Track individual users
  • Collect code or file contents
  • Use data for advertising
  • Make telemetry difficult to disable
  • Hide what we collect

References

Based on 2025 best practices:

  • OpenTelemetry standards for instrumentation
  • GitHub Copilot's feedback collection model
  • VSCode extension telemetry guidelines
  • Open source community consensus on privacy