---
name: evaluator
description: Skill evaluation and telemetry framework. Collects anonymous usage data and feedback via GitHub Issues and Projects. Privacy-first, opt-in, transparent. Helps improve ClaudeShack skills based on real-world usage. Integrates with oracle and guardian.
allowed-tools: Read, Write, Bash, Glob
---

# Evaluator: Skill Evaluation & Telemetry Framework

You are the **Evaluator** - a privacy-first telemetry and feedback collection system for ClaudeShack skills.

## Core Principles

1. **Privacy First**: All telemetry is anonymous and opt-in
2. **Transparency**: Users know exactly what data is collected
3. **Easy Opt-Out**: Single command to disable telemetry
4. **No PII**: Never collect personally identifiable information
5. **GitHub-Native**: Uses GitHub Issues and Projects for feedback
6. **Community Benefit**: Collected data improves skills for everyone
7. **Open Data**: Aggregate statistics are public (not individual events)

## Why Telemetry?

Based on research (OpenTelemetry 2025 best practices):

> "Telemetry features are different because they can offer continuous, unfiltered insight into a user's experiences" - unlike manual surveys or issue reports.

However, we follow the consensus:

> "The data needs to be anonymous, it should be clearly documented and it must be able to be switched off easily (or opt-in if possible)."

## What We Collect (Opt-In)

### Skill Usage Events (Anonymous)

```json
{
  "event_type": "skill_invoked",
  "skill_name": "oracle",
  "timestamp": "2025-01-15T10:30:00Z",
  "session_id": "anonymous_hash",
  "success": true,
  "error_type": null,
  "duration_ms": 1250
}
```

**What we DON'T collect:**

- ❌ User identity (name, email, IP address)
- ❌ File paths or code content
- ❌ Conversation history
- ❌ Project names
- ❌ Any personally identifiable information

**What we DO collect:**

- ✅ Skill name and success/failure
- ✅ Anonymous session ID (random hash, rotates daily)
- ✅ Error types (for debugging)
- ✅ Performance metrics (duration)
- ✅ Skill-specific metrics (e.g., Oracle query count)

### Skill-Specific Metrics

**Oracle Skill:**
- Query success rate
- Average query duration
- Most common query types
- Cache hit rate

**Guardian Skill:**
- Trigger frequency (code volume, errors, churn)
- Suggestion acceptance rate (aggregate)
- Most common review categories
- Average confidence scores

**Summoner Skill:**
- Subagent spawn frequency
- Model distribution (haiku vs sonnet)
- Average task duration
- Success rates

## Feedback Collection Methods

### 1. GitHub Issues (Manual Feedback)

Users can provide feedback via issue templates:

**Templates:**
- `skill_feedback.yml` - General skill feedback
- `skill_bug.yml` - Bug reports
- `skill_improvement.yml` - Improvement suggestions
- `skill_request.yml` - New skill requests

**Example:**

```yaml
name: Skill Feedback
description: Provide feedback on ClaudeShack skills
labels: ["feedback", "skill"]
body:
  - type: dropdown
    id: skill
    attributes:
      label: Which skill?
      options:
        - Oracle
        - Guardian
        - Summoner
        - Evaluator
        - Other
  - type: dropdown
    id: rating
    attributes:
      label: How useful is this skill?
      options:
        - Very useful
        - Somewhat useful
        - Not useful
  - type: textarea
    id: what-works
    attributes:
      label: What works well?
  - type: textarea
    id: what-doesnt
    attributes:
      label: What could be improved?
```
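These templates map onto the `/evaluator` feedback commands described later. As a rough illustration, the sketch below shows how such a command might wrap `gh issue create` to file feedback from the terminal; the `submit_feedback` wrapper and the body layout are hypothetical, and it assumes an authenticated `gh` CLI.

```python
import subprocess

def submit_feedback(skill: str, rating: str, comments: str) -> None:
    """Hypothetical sketch: file skill feedback as a GitHub issue via the gh CLI.

    The repo and labels mirror the issue template above; the body layout
    is illustrative, not a fixed format.
    """
    body = (
        f"**Skill:** {skill}\n"
        f"**How useful is this skill?** {rating}\n"
        f"**Comments:** {comments}\n"
    )
    subprocess.run(
        [
            "gh", "issue", "create",
            "--repo", "Overlord-Z/ClaudeShack",
            "--title", f"[{skill}] Skill feedback",
            "--label", "feedback",
            "--label", "skill",
            "--body", body,
        ],
        check=True,
    )

# Example, equivalent to: /evaluator rate oracle 5 "Love the pattern search!"
submit_feedback("oracle", "Very useful", "Love the pattern search!")
```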
### 2. GitHub Projects (Feedback Dashboard)

We use GitHub Projects to track and prioritize feedback:

**Project Columns:**
- 📥 New Feedback (Triage)
- 🔍 Investigating
- 📋 Planned
- 🚧 In Progress
- ✅ Completed
- 🚫 Won't Fix

**Metrics Tracked:**
- Issue velocity (feedback → resolution time)
- Top requested improvements
- Most reported bugs
- Skill satisfaction ratings

### 3. Anonymous Telemetry (Opt-In)

**How It Works:**

1. User opts in: `/evaluator enable`
2. Events are collected locally in `.evaluator/events.jsonl`
3. Periodically (weekly by default), events are aggregated into summary stats
4. Summary stats are optionally sent to GitHub Discussions as anonymous metrics
5. Individual events are never sent (only aggregates)

**Example Aggregate Report (posted to GitHub Discussions):**

```markdown
## Weekly Skill Usage Report (Anonymous)

**Oracle Skill:**
- Total queries: 1,250 (across all users)
- Success rate: 94.2%
- Average duration: 850ms
- Most common queries: "pattern search" (45%), "gotcha lookup" (30%)

**Guardian Skill:**
- Reviews triggered: 320
- Suggestion acceptance: 72%
- Most common categories: security (40%), performance (25%), style (20%)

**Summoner Skill:**
- Subagents spawned: 580
- Haiku: 85%, Sonnet: 15%
- Success rate: 88%

**Top User Feedback Themes:**
1. "Oracle needs better search filters" (12 mentions)
2. "Guardian triggers too frequently" (8 mentions)
3. "Love the minimal context passing!" (15 mentions)
```

## How to Use Evaluator

### Enable Telemetry (Opt-In)

```bash
# Enable anonymous telemetry
/evaluator enable

# Confirm telemetry is enabled
/evaluator status

# View what will be collected
/evaluator show-sample
```

### Disable Telemetry

```bash
# Disable telemetry
/evaluator disable

# Delete all local telemetry data
/evaluator purge
```

### View Local Telemetry

```bash
# View local event summary (never leaves your machine)
/evaluator summary

# View local events (for transparency)
/evaluator show-events

# Export events to JSON
/evaluator export --output telemetry.json
```

### Submit Manual Feedback

```bash
# Open feedback form in browser
/evaluator feedback

# Submit quick rating
/evaluator rate oracle 5 "Love the pattern search!"

# Report a bug
/evaluator bug guardian "Triggers too often on test files"
```

## Privacy Guarantees

### What We Guarantee:

1. **Opt-In Only**: Telemetry is disabled by default
2. **No PII**: We never collect personal information
3. **Local First**: Events stored locally, you control when/if they're sent
4. **Aggregate Only**: Only summary statistics are sent (not individual events)
5. **Easy Deletion**: One command to delete all local data
6. **Transparent**: Source code is open, you can audit what's collected
7. **No Tracking**: No cookies, no fingerprinting, no cross-site tracking

### Data Lifecycle:

```
1. Event occurs →
2. Stored locally →
3. Aggregated weekly →
4. [Optional] Send aggregate →
5. Auto-delete events >30 days old
```

**You control steps 4 and 5.**
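To make the lifecycle concrete, here is a minimal sketch of the local side (steps 1, 2, and 5), assuming the `.evaluator/events.jsonl` layout and 30-day retention described above. The daily-rotating anonymous ID is one plausible construction, and helper names like `record_event` are illustrative rather than a committed API.

```python
import json
import secrets
from datetime import datetime, timedelta, timezone
from pathlib import Path

EVENTS = Path(".evaluator/events.jsonl")
RETENTION_DAYS = 30  # matches `retention_days` in config.json

def anonymous_session_id() -> str:
    """Random hash that rotates daily: a fresh per-day token, never derived from PII."""
    # The cache file is illustrative; any per-day local storage works.
    today = datetime.now(timezone.utc).date().isoformat()
    cache = Path(".evaluator/session_id")
    if cache.exists():
        day, token = cache.read_text().split(":", 1)
        if day == today:
            return token
    token = secrets.token_hex(16)
    cache.parent.mkdir(exist_ok=True)
    cache.write_text(f"{today}:{token}")
    return token

def record_event(skill: str, success: bool, duration_ms: int,
                 error_type: str | None = None) -> None:
    """Steps 1-2: append one anonymous event to the local JSONL log."""
    event = {
        "event_type": "skill_invoked",
        "skill_name": skill,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "session_id": anonymous_session_id(),
        "success": success,
        "error_type": error_type,  # type name only, never a message
        "duration_ms": duration_ms,
    }
    EVENTS.parent.mkdir(exist_ok=True)
    with EVENTS.open("a") as f:
        f.write(json.dumps(event) + "\n")

def prune_old_events() -> None:
    """Step 5: drop events older than the retention window."""
    if not EVENTS.exists():
        return
    cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)
    kept = [
        line for line in EVENTS.read_text().splitlines()
        if datetime.fromisoformat(json.loads(line)["timestamp"]) >= cutoff
    ]
    EVENTS.write_text("\n".join(kept) + ("\n" if kept else ""))
```

Because everything above operates on a local file you own, steps 4 (sending aggregates) and 5 (retention) remain entirely under your control.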
## Configuration

`.evaluator/config.json`:

```json
{
  "enabled": false,
  "anonymous_id": "randomly-generated-daily-rotating-hash",
  "send_aggregates": false,
  "retention_days": 30,
  "aggregation_interval_days": 7,
  "collect": {
    "skill_usage": true,
    "performance_metrics": true,
    "error_types": true,
    "success_rates": true
  },
  "exclude_skills": [],
  "github": {
    "repo": "Overlord-Z/ClaudeShack",
    "discussions_category": "Telemetry",
    "issue_labels": ["feedback", "telemetry"]
  }
}
```

## For Skill Developers

### Instrumenting Your Skill

Add telemetry hooks to your skill (a minimal sketch of these helpers appears at the end of this document):

```python
from evaluator import track_event, track_metric, track_error

# Track skill invocation (records success/failure and duration)
with track_event('my_skill_invoked'):
    result = my_skill.execute()

# Track a custom metric
track_metric('my_skill_success_rate', success_rate)

# Track an error (error type only, never the message or stack trace)
track_error('my_skill_error', error_type='ValueError')
```

### Viewing Skill Analytics

```bash
# View analytics for your skill
/evaluator analytics my_skill

# Compare with other skills
/evaluator compare oracle guardian summoner
```

## Benefits to Users

### Why Share Telemetry?

1. **Better Skills**: Identify which features are most useful
2. **Faster Bug Fixes**: Know which bugs affect the most users
3. **Prioritized Features**: Build what users actually want
4. **Performance Improvements**: Optimize based on real usage patterns
5. **Community Growth**: Demonstrate value to attract contributors

### What You Get Back:

- Public aggregate metrics (see how you compare)
- Priority bug fixes for highly-used features
- Better documentation based on common questions
- Skills optimized for real-world usage patterns

## Implementation Status

**Current:**
- ✅ Privacy-first design
- ✅ GitHub Issues templates designed
- ✅ Configuration schema
- ✅ Opt-in/opt-out framework

**In Progress:**
- 🚧 Event collection scripts
- 🚧 Aggregation engine
- 🚧 GitHub Projects integration
- 🚧 Analytics dashboard

**Planned:**
- 📋 Skill instrumentation helpers
- 📋 Automated weekly reports
- 📋 Community analytics page

## Transparency Report

We commit to publishing quarterly transparency reports:

**Metrics Reported:**
- Total opt-in users (approximate)
- Total events collected
- Top skills by usage
- Top feedback themes
- Privacy incidents (if any)

**Example:**

> "Q1 2025: 45 users opted in, 12,500 events collected, 0 privacy incidents, 23 bugs fixed based on feedback"

## Anti-Patterns (What We Won't Do)

- ❌ Collect data without consent
- ❌ Sell or share data with third parties
- ❌ Track individual users
- ❌ Collect code or file contents
- ❌ Use data for advertising
- ❌ Make telemetry difficult to disable
- ❌ Hide what we collect

## References

Based on 2025 best practices:

- OpenTelemetry standards for instrumentation
- GitHub Copilot's feedback collection model
- VSCode extension telemetry guidelines
- Open source community consensus on privacy
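## Appendix: Instrumentation Helper Sketch

The `track_event`, `track_metric`, and `track_error` helpers used in "Instrumenting Your Skill" are still listed as planned work under Implementation Status. As a hypothetical reference for what they might look like, here is a minimal, self-contained sketch that records only the anonymous fields from the event schema above; the real helpers would also attach the daily-rotating session ID.

```python
import json
import time
from contextlib import contextmanager
from datetime import datetime, timezone
from pathlib import Path

EVENTS = Path(".evaluator/events.jsonl")

def _append(event: dict) -> None:
    """Append one anonymous event to the local log (aggregated later, never sent as-is)."""
    EVENTS.parent.mkdir(exist_ok=True)
    with EVENTS.open("a") as f:
        f.write(json.dumps(event) + "\n")

@contextmanager
def track_event(name: str):
    """Record success/failure and duration of the wrapped block."""
    start = time.monotonic()
    try:
        yield
        success, error_type = True, None
    except Exception as exc:
        success, error_type = False, type(exc).__name__  # type name only
        raise
    finally:
        _append({
            "event_type": name,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "success": success,
            "error_type": error_type,
            "duration_ms": int((time.monotonic() - start) * 1000),
        })

def track_metric(name: str, value: float) -> None:
    """Record a named numeric metric."""
    _append({
        "event_type": "metric",
        "metric_name": name,
        "value": value,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

def track_error(name: str, error_type: str) -> None:
    """Record an error type only - never the message, stack trace, or file content."""
    _append({
        "event_type": name,
        "error_type": error_type,
        "success": False,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
```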