---
name: evaluator
description: Skill evaluation and telemetry framework. Collects anonymous usage data and feedback via GitHub Issues and Projects. Privacy-first, opt-in, transparent. Helps improve ClaudeShack skills based on real-world usage. Integrates with oracle and guardian.
allowed-tools: Read, Write, Bash, Glob
---

# Evaluator: Skill Evaluation & Telemetry Framework

You are the **Evaluator** - a privacy-first telemetry and feedback collection system for ClaudeShack skills.

## Core Principles

1. **Privacy First**: All telemetry is anonymous and opt-in
2. **Transparency**: Users know exactly what data is collected
3. **Easy Opt-Out**: Single command to disable telemetry
4. **No PII**: Never collect personally identifiable information
5. **GitHub-Native**: Uses GitHub Issues and Projects for feedback
6. **Community Benefit**: Collected data improves skills for everyone
7. **Open Data**: Aggregate statistics are public (not individual events)

## Why Telemetry?

Based on research (OpenTelemetry 2025 best practices):

> "Telemetry features are different because they can offer continuous, unfiltered insight into a user's experiences" - unlike manual surveys or issue reports.

However, we follow the consensus:

> "The data needs to be anonymous, it should be clearly documented and it must be able to be switched off easily (or opt-in if possible)."

## What We Collect (Opt-In)

### Skill Usage Events (Anonymous)

```json
{
  "event_type": "skill_invoked",
  "skill_name": "oracle",
  "timestamp": "2025-01-15T10:30:00Z",
  "session_id": "anonymous_hash",
  "success": true,
  "error_type": null,
  "duration_ms": 1250
}
```

**What we DON'T collect:**
- ❌ User identity (name, email, IP address)
- ❌ File paths or code content
- ❌ Conversation history
- ❌ Project names
- ❌ Any personally identifiable information

**What we DO collect:**
- ✅ Skill name and success/failure
- ✅ Anonymous session ID (random hash, rotates daily)
- ✅ Error types (for debugging)
- ✅ Performance metrics (duration)
- ✅ Skill-specific metrics (e.g., Oracle query count)
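
A daily-rotating session ID like the one above can be derived from a random local salt plus the current date, so it never encodes anything about the user or machine. This is an illustrative sketch, not the skill's actual code; the helper name and salt file location are assumptions:

```python
import hashlib
import secrets
from datetime import datetime, timezone
from pathlib import Path

def anonymous_session_id(salt_file: Path = Path(".evaluator/salt")) -> str:
    """Hash a random local salt with today's UTC date. The ID is stable
    within a day, rotates at midnight, and cannot be traced to a user."""
    salt_file.parent.mkdir(parents=True, exist_ok=True)
    if not salt_file.exists():
        salt_file.write_text(secrets.token_hex(32))
    today = datetime.now(timezone.utc).date().isoformat()
    digest = hashlib.sha256(f"{salt_file.read_text()}:{today}".encode())
    return digest.hexdigest()[:16]
```

Because the salt never leaves the machine and the hash is truncated, aggregates can be grouped per-session without any cross-day linkability.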

### Skill-Specific Metrics

**Oracle Skill:**
- Query success rate
- Average query duration
- Most common query types
- Cache hit rate

**Guardian Skill:**
- Trigger frequency (code volume, errors, churn)
- Suggestion acceptance rate (aggregate)
- Most common review categories
- Average confidence scores

**Summoner Skill:**
- Subagent spawn frequency
- Model distribution (haiku vs sonnet)
- Average task duration
- Success rates

## Feedback Collection Methods

### 1. GitHub Issues (Manual Feedback)

Users can provide feedback via issue templates:

**Templates:**
- `skill_feedback.yml` - General skill feedback
- `skill_bug.yml` - Bug reports
- `skill_improvement.yml` - Improvement suggestions
- `skill_request.yml` - New skill requests

**Example:**
```yaml
name: Skill Feedback
description: Provide feedback on ClaudeShack skills
labels: ["feedback", "skill"]
body:
  - type: dropdown
    id: skill
    attributes:
      label: Which skill?
      options:
        - Oracle
        - Guardian
        - Summoner
        - Evaluator
        - Other
  - type: dropdown
    id: rating
    attributes:
      label: How useful is this skill?
      options:
        - Very useful
        - Somewhat useful
        - Not useful
  - type: textarea
    id: what-works
    attributes:
      label: What works well?
  - type: textarea
    id: what-doesnt
    attributes:
      label: What could be improved?
```
### 2. GitHub Projects (Feedback Dashboard)

We use GitHub Projects to track and prioritize feedback:

**Project Columns:**
- 📥 New Feedback (Triage)
- 🔍 Investigating
- 📋 Planned
- 🚧 In Progress
- ✅ Completed
- 🚫 Won't Fix

**Metrics Tracked:**
- Issue velocity (feedback → resolution time)
- Top requested improvements
- Most reported bugs
- Skill satisfaction ratings

### 3. Anonymous Telemetry (Opt-In)

**How It Works:**

1. User opts in: `/evaluator enable`
2. Events are collected locally in `.evaluator/events.jsonl`
3. Periodically (weekly by default), events are aggregated into summary stats
4. Summary stats are optionally sent to GitHub Discussions as anonymous metrics
5. Individual events are never sent (only aggregates)
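
Steps 2 and 3 amount to an append-then-aggregate pipeline. The sketch below is illustrative, not the skill's actual code: the `record_event` and `aggregate` names are assumptions, and only the `.evaluator/events.jsonl` path comes from the steps above:

```python
import json
from collections import defaultdict
from pathlib import Path

EVENTS = Path(".evaluator/events.jsonl")

def record_event(event: dict) -> None:
    """Append one anonymous event locally; nothing leaves the machine."""
    EVENTS.parent.mkdir(parents=True, exist_ok=True)
    with EVENTS.open("a") as f:
        f.write(json.dumps(event) + "\n")

def aggregate() -> dict:
    """Reduce raw events to per-skill summary stats -- the only form
    of data ever eligible for sharing."""
    stats = defaultdict(lambda: {"invocations": 0, "successes": 0, "total_ms": 0})
    for line in EVENTS.read_text().splitlines():
        event = json.loads(line)
        s = stats[event["skill_name"]]
        s["invocations"] += 1
        s["successes"] += event["success"]
        s["total_ms"] += event.get("duration_ms", 0)
    return {
        skill: {
            "success_rate": round(s["successes"] / s["invocations"], 3),
            "avg_duration_ms": s["total_ms"] // s["invocations"],
        }
        for skill, s in stats.items()
    }
```

Raw events stay on disk; only the dictionary returned by `aggregate()` would ever be posted.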

**Example Aggregate Report (posted to GitHub Discussions):**

```markdown
## Weekly Skill Usage Report (Anonymous)

**Oracle Skill:**
- Total queries: 1,250 (across all users)
- Success rate: 94.2%
- Average duration: 850ms
- Most common queries: "pattern search" (45%), "gotcha lookup" (30%)

**Guardian Skill:**
- Reviews triggered: 320
- Suggestion acceptance: 72%
- Most common categories: security (40%), performance (25%), style (20%)

**Summoner Skill:**
- Subagents spawned: 580
- Haiku: 85%, Sonnet: 15%
- Success rate: 88%

**Top User Feedback Themes:**
1. "Oracle needs better search filters" (12 mentions)
2. "Guardian triggers too frequently" (8 mentions)
3. "Love the minimal context passing!" (15 mentions)
```

## How to Use Evaluator

### Enable Telemetry (Opt-In)

```bash
# Enable anonymous telemetry
/evaluator enable

# Confirm telemetry is enabled
/evaluator status

# View what will be collected
/evaluator show-sample
```

### Disable Telemetry

```bash
# Disable telemetry
/evaluator disable

# Delete all local telemetry data
/evaluator purge
```

### View Local Telemetry

```bash
# View local event summary (never leaves your machine)
/evaluator summary

# View local events (for transparency)
/evaluator show-events

# Export events to JSON
/evaluator export --output telemetry.json
```

### Submit Manual Feedback

```bash
# Open feedback form in browser
/evaluator feedback

# Submit quick rating
/evaluator rate oracle 5 "Love the pattern search!"

# Report a bug
/evaluator bug guardian "Triggers too often on test files"
```

## Privacy Guarantees

### What We Guarantee:

1. **Opt-In Only**: Telemetry is disabled by default
2. **No PII**: We never collect personal information
3. **Local First**: Events stored locally, you control when/if they're sent
4. **Aggregate Only**: Only summary statistics are sent (not individual events)
5. **Easy Deletion**: One command to delete all local data
6. **Transparent**: Source code is open, you can audit what's collected
7. **No Tracking**: No cookies, no fingerprinting, no cross-site tracking

### Data Lifecycle:

```
1. Event occurs → 2. Stored locally → 3. Aggregated weekly →
4. [Optional] Send aggregate → 5. Auto-delete events >30 days old
```

**You control steps 4 and 5.**
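
Step 5 of the lifecycle (dropping events past the retention window) can be sketched as a rewrite-in-place over the events file. The function name is an assumption; the 30-day default matches `retention_days` in the configuration:

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

def prune_events(path: Path, retention_days: int = 30) -> int:
    """Keep only events newer than the retention cutoff; return the
    number of events deleted."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)
    kept, dropped = [], 0
    for line in path.read_text().splitlines():
        event = json.loads(line)
        # Timestamps use the "Z" suffix, as in the event example above
        ts = datetime.fromisoformat(event["timestamp"].replace("Z", "+00:00"))
        if ts >= cutoff:
            kept.append(line)
        else:
            dropped += 1
    path.write_text("\n".join(kept) + ("\n" if kept else ""))
    return dropped
```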

## Configuration

`.evaluator/config.json`:

```json
{
  "enabled": false,
  "anonymous_id": "randomly-generated-daily-rotating-hash",
  "send_aggregates": false,
  "retention_days": 30,
  "aggregation_interval_days": 7,
  "collect": {
    "skill_usage": true,
    "performance_metrics": true,
    "error_types": true,
    "success_rates": true
  },
  "exclude_skills": [],
  "github": {
    "repo": "Overlord-Z/ClaudeShack",
    "discussions_category": "Telemetry",
    "issue_labels": ["feedback", "telemetry"]
  }
}
```
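
A loader for this config would merge user settings over safe defaults, so a missing or partial file never enables telemetry. This is a sketch under that assumption; only the `enabled`, `exclude_skills`, and `retention_days` keys are exercised, and the helper names are illustrative:

```python
import json
from pathlib import Path

# Defaults are deliberately opt-out: telemetry off, sharing off
DEFAULTS = {
    "enabled": False,
    "send_aggregates": False,
    "retention_days": 30,
    "exclude_skills": [],
}

def load_config(path: Path = Path(".evaluator/config.json")) -> dict:
    """Merge user config over defaults; absence of a file means 'off'."""
    config = dict(DEFAULTS)
    if path.exists():
        config.update(json.loads(path.read_text()))
    return config

def should_track(skill: str, config: dict) -> bool:
    """Track only if the user opted in and didn't exclude this skill."""
    return config["enabled"] and skill not in config["exclude_skills"]
```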

## For Skill Developers

### Instrumenting Your Skill

Add telemetry hooks to your skill:

```python
from evaluator import track_event, track_metric, track_error

# Track skill invocation
with track_event('my_skill_invoked'):
    result = my_skill.execute()

# Track custom metric
track_metric('my_skill_success_rate', success_rate)

# Track error (error type only, not message)
track_error('my_skill_error', error_type='ValueError')
```
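
The `evaluator` module itself is not shown here; a minimal sketch of what `track_event` could look like as a context manager, consistent with the event schema above (the implementation details are assumptions, not the skill's actual code):

```python
import json
import time
from contextlib import contextmanager
from datetime import datetime, timezone
from pathlib import Path

EVENTS = Path(".evaluator/events.jsonl")

@contextmanager
def track_event(event_type: str):
    """Time the wrapped block and log success/failure plus the error
    *type* only -- never the message, which could contain user data."""
    start = time.monotonic()
    success, error_type = True, None
    try:
        yield
    except Exception as exc:
        success, error_type = False, type(exc).__name__
        raise  # telemetry must never change control flow
    finally:
        EVENTS.parent.mkdir(parents=True, exist_ok=True)
        with EVENTS.open("a") as f:
            f.write(json.dumps({
                "event_type": event_type,
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "success": success,
                "error_type": error_type,
                "duration_ms": int((time.monotonic() - start) * 1000),
            }) + "\n")
```

Re-raising inside the `except` branch keeps the hook transparent: the caller sees the original exception, and the event is still written by the `finally` block.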

### Viewing Skill Analytics

```bash
# View analytics for your skill
/evaluator analytics my_skill

# Compare with other skills
/evaluator compare oracle guardian summoner
```

## Benefits to Users

### Why Share Telemetry?

1. **Better Skills**: Identify which features are most useful
2. **Faster Bug Fixes**: Know which bugs affect the most users
3. **Prioritized Features**: Build what users actually want
4. **Performance Improvements**: Optimize based on real usage patterns
5. **Community Growth**: Demonstrate value to attract contributors

### What You Get Back:

- Public aggregate metrics (see how you compare)
- Priority bug fixes for highly-used features
- Better documentation based on common questions
- Skills optimized for real-world usage patterns

## Implementation Status

**Current:**
- ✅ Privacy-first design
- ✅ GitHub Issues templates designed
- ✅ Configuration schema
- ✅ Opt-in/opt-out framework

**In Progress:**
- 🚧 Event collection scripts
- 🚧 Aggregation engine
- 🚧 GitHub Projects integration
- 🚧 Analytics dashboard

**Planned:**
- 📋 Skill instrumentation helpers
- 📋 Automated weekly reports
- 📋 Community analytics page

## Transparency Report

We commit to publishing quarterly transparency reports:

**Metrics Reported:**
- Total opt-in users (approximate)
- Total events collected
- Top skills by usage
- Top feedback themes
- Privacy incidents (if any)

**Example:**
> "Q1 2025: 45 users opted in, 12,500 events collected, 0 privacy incidents, 23 bugs fixed based on feedback"

## Anti-Patterns (What We Won't Do)

- ❌ Collect data without consent
- ❌ Sell or share data with third parties
- ❌ Track individual users
- ❌ Collect code or file contents
- ❌ Use data for advertising
- ❌ Make telemetry difficult to disable
- ❌ Hide what we collect

## References

Based on 2025 best practices:
- OpenTelemetry standards for instrumentation
- GitHub Copilot's feedback collection model
- VSCode extension telemetry guidelines
- Open source community consensus on privacy