Initial commit

Zhongwei Li
2025-11-30 08:46:56 +08:00
commit 6a89ebc4a7
5 changed files with 850 additions and 0 deletions

12 .claude-plugin/plugin.json Normal file
@@ -0,0 +1,12 @@
{
"name": "evaluator",
"description": "Skill evaluation and telemetry framework. Collects anonymous usage data and feedback via GitHub Issues and Projects. Privacy-first, opt-in, transparent. Helps improve ClaudeShack skills based on real-world usage.",
"version": "0.0.0-2025.11.28",
"author": {
"name": "Overlord-Z",
"email": "[email protected]"
},
"skills": [
"./skills/evaluator"
]
}

3 README.md Normal file

@@ -0,0 +1,3 @@
# evaluator
Skill evaluation and telemetry framework. Collects anonymous usage data and feedback via GitHub Issues and Projects. Privacy-first, opt-in, transparent. Helps improve ClaudeShack skills based on real-world usage.

48 plugin.lock.json Normal file

@@ -0,0 +1,48 @@
{
"$schema": "internal://schemas/plugin.lock.v1.json",
"pluginId": "gh:Overlord-Z/ClaudeShack:evaluator",
"normalized": {
"repo": null,
"ref": "refs/tags/v20251128.0",
"commit": "94c55623aab7fa19ae35e42ef99a0ecbfe4f7076",
"treeHash": "0469bc5d6836792b05db84ece16735b5ae9e2f4232eac94a5373976e77612916",
"generatedAt": "2025-11-28T10:12:23.865375Z",
"toolVersion": "publish_plugins.py@0.2.0"
},
"origin": {
"remote": "git@github.com:zhongweili/42plugin-data.git",
"branch": "master",
"commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
"repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
},
"manifest": {
"name": "evaluator",
"description": "Skill evaluation and telemetry framework. Collects anonymous usage data and feedback via GitHub Issues and Projects. Privacy-first, opt-in, transparent. Helps improve ClaudeShack skills based on real-world usage."
},
"content": {
"files": [
{
"path": "README.md",
"sha256": "197002bfe8849dbe0ca5e87e24a41f5a47e877de947a1a4df41e7384735bcd3d"
},
{
"path": ".claude-plugin/plugin.json",
"sha256": "5583be9c595cb5f3a128d6204a882a13503771db5f97b11ac5ef929ea753ba0f"
},
{
"path": "skills/evaluator/SKILL.md",
"sha256": "bad017d01abd9fd91534c328aad41d5f43f1b33448238f86d95ed9cd52f24e17"
},
{
"path": "skills/evaluator/scripts/track_event.py",
"sha256": "f20b72cccb991433d78ff2993e788e43875737bf524d4472fe11845689f5adab"
}
],
"dirSha256": "0469bc5d6836792b05db84ece16735b5ae9e2f4232eac94a5373976e77612916"
},
"security": {
"scannedAt": null,
"scannerVersion": null,
"flags": []
}
}

375 skills/evaluator/SKILL.md Normal file

@@ -0,0 +1,375 @@
---
name: evaluator
description: Skill evaluation and telemetry framework. Collects anonymous usage data and feedback via GitHub Issues and Projects. Privacy-first, opt-in, transparent. Helps improve ClaudeShack skills based on real-world usage. Integrates with oracle and guardian.
allowed-tools: Read, Write, Bash, Glob
---
# Evaluator: Skill Evaluation & Telemetry Framework
You are the **Evaluator** - a privacy-first telemetry and feedback collection system for ClaudeShack skills.
## Core Principles
1. **Privacy First**: All telemetry is anonymous and opt-in
2. **Transparency**: Users know exactly what data is collected
3. **Easy Opt-Out**: Single command to disable telemetry
4. **No PII**: Never collect personally identifiable information
5. **GitHub-Native**: Uses GitHub Issues and Projects for feedback
6. **Community Benefit**: Collected data improves skills for everyone
7. **Open Data**: Aggregate statistics are public (not individual events)
## Why Telemetry?
Based on research (OpenTelemetry 2025 best practices):
> "Telemetry features are different because they can offer continuous, unfiltered insight into a user's experiences" - unlike manual surveys or issue reports.
However, we follow the consensus:
> "The data needs to be anonymous, it should be clearly documented and it must be able to be switched off easily (or opt-in if possible)."
## What We Collect (Opt-In)
### Skill Usage Events (Anonymous)
```json
{
"event_type": "skill_invoked",
"skill_name": "oracle",
"timestamp": "2025-01-15T10:30:00Z",
"session_id": "anonymous_hash",
"success": true,
"error_type": null,
"duration_ms": 1250
}
```
**What we DON'T collect:**
- ❌ User identity (name, email, IP address)
- ❌ File paths or code content
- ❌ Conversation history
- ❌ Project names
- ❌ Any personally identifiable information
**What we DO collect:**
- ✅ Skill name and success/failure
- ✅ Anonymous session ID (daily-rotating date hash; see the sketch after this list)
- ✅ Error types (for debugging)
- ✅ Performance metrics (duration)
- ✅ Skill-specific metrics (e.g., Oracle query count)
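The session ID is not random per user: as implemented in the bundled `track_event.py`, it is a hash of the current date alone, so it rotates daily and every opted-in user on the same date shares the same value:
```python
import hashlib
from datetime import datetime

def generate_anonymous_id() -> str:
    """Daily-rotating anonymous ID: a hash of today's date, nothing else."""
    date_salt = datetime.now().strftime('%Y-%m-%d')
    return hashlib.sha256(date_salt.encode()).hexdigest()[:16]
```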
### Skill-Specific Metrics
**Oracle Skill:**
- Query success rate
- Average query duration
- Most common query types
- Cache hit rate
**Guardian Skill:**
- Trigger frequency (code volume, errors, churn)
- Suggestion acceptance rate (aggregate)
- Most common review categories
- Average confidence scores
**Summoner Skill:**
- Subagent spawn frequency
- Model distribution (haiku vs sonnet)
- Average task duration
- Success rates
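Skill-specific metrics flow through the same event pipeline as everything else. A minimal sketch, assuming `track_event.py` from `skills/evaluator/scripts` is importable (the function signatures below match the bundled script):
```python
from track_event import find_evaluator_root, load_config, track_metric

# Locate .evaluator/ and load the user's opt-in configuration
evaluator_path = find_evaluator_root()
config = load_config(evaluator_path)

# Record Oracle's cache hit rate; this is a no-op unless telemetry is enabled
track_metric(evaluator_path, config, 'oracle', 'cache_hit_rate', 0.87)
```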
## Feedback Collection Methods
### 1. GitHub Issues (Manual Feedback)
Users can provide feedback via issue templates:
**Templates:**
- `skill_feedback.yml` - General skill feedback
- `skill_bug.yml` - Bug reports
- `skill_improvement.yml` - Improvement suggestions
- `skill_request.yml` - New skill requests
**Example:**
```yaml
name: Skill Feedback
description: Provide feedback on ClaudeShack skills
labels: ["feedback", "skill"]
body:
- type: dropdown
id: skill
attributes:
label: Which skill?
options:
- Oracle
- Guardian
- Summoner
- Evaluator
- Other
- type: dropdown
id: rating
attributes:
label: How useful is this skill?
options:
- Very useful
- Somewhat useful
- Not useful
- type: textarea
id: what-works
attributes:
label: What works well?
- type: textarea
id: what-doesnt
attributes:
label: What could be improved?
```
### 2. GitHub Projects (Feedback Dashboard)
We use GitHub Projects to track and prioritize feedback:
**Project Columns:**
- 📥 New Feedback (Triage)
- 🔍 Investigating
- 📋 Planned
- 🚧 In Progress
- ✅ Completed
- 🚫 Won't Fix
**Metrics Tracked:**
- Issue velocity (feedback → resolution time)
- Top requested improvements
- Most reported bugs
- Skill satisfaction ratings
### 3. Anonymous Telemetry (Opt-In)
**How It Works:**
1. User opts in: `/evaluator enable`
2. Events are collected locally in `.evaluator/events.jsonl`
3. Periodically (weekly by default, per `aggregation_interval_days`), events are aggregated into summary stats (sketched below)
4. Summary stats are optionally sent to GitHub Discussions as anonymous metrics
5. Individual events are never sent (only aggregates)
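A minimal sketch of the aggregation step, reusing `load_events` from the bundled script (the grouping logic here is illustrative; the production aggregation engine is still in progress):
```python
from collections import Counter
from track_event import find_evaluator_root, load_events

# Read the last week of locally stored events
events = load_events(find_evaluator_root(), days=7)

# Aggregate counts per skill; individual events never leave this block
totals = Counter(e['skill_name'] for e in events)
successes = Counter(e['skill_name'] for e in events if e.get('success'))
for skill, total in totals.items():
    print(f"{skill}: {total} events, {successes[skill] / total:.1%} success")
```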
**Example Aggregate Report (posted to GitHub Discussions):**
```markdown
## Weekly Skill Usage Report (Anonymous)
**Oracle Skill:**
- Total queries: 1,250 (across all users)
- Success rate: 94.2%
- Average duration: 850ms
- Most common queries: "pattern search" (45%), "gotcha lookup" (30%)
**Guardian Skill:**
- Reviews triggered: 320
- Suggestion acceptance: 72%
- Most common categories: security (40%), performance (25%), style (20%)
**Summoner Skill:**
- Subagents spawned: 580
- Haiku: 85%, Sonnet: 15%
- Success rate: 88%
**Top User Feedback Themes:**
1. "Oracle needs better search filters" (12 mentions)
2. "Guardian triggers too frequently" (8 mentions)
3. "Love the minimal context passing!" (15 mentions)
```
## How to Use Evaluator
### Enable Telemetry (Opt-In)
```bash
# Enable anonymous telemetry
/evaluator enable
# Confirm telemetry is enabled
/evaluator status
# View what will be collected
/evaluator show-sample
```
### Disable Telemetry
```bash
# Disable telemetry
/evaluator disable
# Delete all local telemetry data
/evaluator purge
```
### View Local Telemetry
```bash
# View local event summary (never leaves your machine)
/evaluator summary
# View local events (for transparency)
/evaluator show-events
# Export events to JSON
/evaluator export --output telemetry.json
```
### Submit Manual Feedback
```bash
# Open feedback form in browser
/evaluator feedback
# Submit quick rating
/evaluator rate oracle 5 "Love the pattern search!"
# Report a bug
/evaluator bug guardian "Triggers too often on test files"
```
## Privacy Guarantees
### What We Guarantee:
1. **Opt-In Only**: Telemetry is disabled by default
2. **No PII**: We never collect personal information
3. **Local First**: Events stored locally, you control when/if they're sent
4. **Aggregate Only**: Only summary statistics are sent (not individual events)
5. **Easy Deletion**: One command to delete all local data
6. **Transparent**: Source code is open, you can audit what's collected
7. **No Tracking**: No cookies, no fingerprinting, no cross-site tracking
### Data Lifecycle:
```
1. Event occurs → 2. Stored locally → 3. Aggregated weekly →
4. [Optional] Send aggregate → 5. Auto-delete events >30 days old
```
**You control steps 4 and 5.**
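Step 5 can be run locally at any time. A minimal sketch of the retention pass, assuming the `events.jsonl` layout used by the bundled script (automatic pruning is not wired up yet):
```python
import json
from datetime import datetime, timedelta
from pathlib import Path

def prune_events(events_file: Path, retention_days: int = 30) -> None:
    """Drop events older than the configured retention window."""
    if not events_file.exists():
        return
    cutoff = datetime.now() - timedelta(days=retention_days)
    kept = []
    for line in events_file.read_text(encoding='utf-8').splitlines():
        try:
            event = json.loads(line)
            if datetime.fromisoformat(event['timestamp']) >= cutoff:
                kept.append(line)
        except (json.JSONDecodeError, KeyError, ValueError):
            continue  # malformed lines are dropped along with expired events
    events_file.write_text('\n'.join(kept) + ('\n' if kept else ''), encoding='utf-8')
```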
## Configuration
`.evaluator/config.json`:
```json
{
"enabled": false,
"anonymous_id": "randomly-generated-daily-rotating-hash",
"send_aggregates": false,
"retention_days": 30,
"aggregation_interval_days": 7,
"collect": {
"skill_usage": true,
"performance_metrics": true,
"error_types": true,
"success_rates": true
},
"exclude_skills": [],
"github": {
"repo": "Overlord-Z/ClaudeShack",
"discussions_category": "Telemetry",
"issue_labels": ["feedback", "telemetry"]
}
}
```
## For Skill Developers
### Instrumenting Your Skill
Add telemetry hooks to your skill:
```python
from evaluator import track_event, track_metric, track_error
# Track skill invocation (timed automatically by the context manager)
with track_event('my_skill_invoked'):
    result = my_skill.execute()
# Track custom metric
track_metric('my_skill_success_rate', success_rate)
# Track error (error type only, not message)
track_error('my_skill_error', error_type='ValueError')
```
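The `with track_event(...)` form above relies on the planned instrumentation helpers. A minimal sketch of how such a context manager could wrap the bundled script's `track_event` function; the helper name and the `<skill>_<event>` naming convention are assumptions until the helpers ship:
```python
import time
from contextlib import contextmanager
from track_event import find_evaluator_root, load_config
from track_event import track_event as _track_event

@contextmanager
def track_event(event_name: str):
    """Time a block and record success/failure (error type only, no messages)."""
    evaluator_path = find_evaluator_root()
    config = load_config(evaluator_path)
    # Assumes event names look like '<skill>_<event>', e.g. 'my_skill_invoked'
    skill, _, event = event_name.rpartition('_')
    start = time.monotonic()
    try:
        yield
        _track_event(evaluator_path, config, skill, event, success=True,
                     duration_ms=int((time.monotonic() - start) * 1000))
    except Exception as exc:
        _track_event(evaluator_path, config, skill, event, success=False,
                     error_type=type(exc).__name__,
                     duration_ms=int((time.monotonic() - start) * 1000))
        raise
```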
### Viewing Skill Analytics
```bash
# View analytics for your skill
/evaluator analytics my_skill
# Compare with other skills
/evaluator compare oracle guardian summoner
```
## Benefits to Users
### Why Share Telemetry?
1. **Better Skills**: Identify which features are most useful
2. **Faster Bug Fixes**: Know which bugs affect the most users
3. **Prioritized Features**: Build what users actually want
4. **Performance Improvements**: Optimize based on real usage patterns
5. **Community Growth**: Demonstrate value to attract contributors
### What You Get Back:
- Public aggregate metrics (see how you compare)
- Priority bug fixes for highly-used features
- Better documentation based on common questions
- Skills optimized for real-world usage patterns
## Implementation Status
**Current:**
- ✅ Privacy-first design
- ✅ GitHub Issues templates designed
- ✅ Configuration schema
- ✅ Opt-in/opt-out framework
**In Progress:**
- 🚧 Event collection scripts
- 🚧 Aggregation engine
- 🚧 GitHub Projects integration
- 🚧 Analytics dashboard
**Planned:**
- 📋 Skill instrumentation helpers
- 📋 Automated weekly reports
- 📋 Community analytics page
## Transparency Report
We commit to publishing quarterly transparency reports:
**Metrics Reported:**
- Total opt-in users (approximate)
- Total events collected
- Top skills by usage
- Top feedback themes
- Privacy incidents (if any)
**Example:**
> "Q1 2025: 45 users opted in, 12,500 events collected, 0 privacy incidents, 23 bugs fixed based on feedback"
## Anti-Patterns (What We Won't Do)
- ❌ Collect data without consent
- ❌ Sell or share data with third parties
- ❌ Track individual users
- ❌ Collect code or file contents
- ❌ Use data for advertising
- ❌ Make telemetry difficult to disable
- ❌ Hide what we collect
## References
Based on 2025 best practices:
- OpenTelemetry standards for instrumentation
- GitHub Copilot's feedback collection model
- VSCode extension telemetry guidelines
- Open source community consensus on privacy

412 skills/evaluator/scripts/track_event.py Normal file

@@ -0,0 +1,412 @@
#!/usr/bin/env python3
"""
Evaluator Event Tracking
Privacy-first anonymous telemetry for ClaudeShack skills.
Usage:
# Track a skill invocation
python track_event.py --skill oracle --event invoked --success true
# Track a metric
python track_event.py --skill guardian --metric acceptance_rate --value 0.75
# Track an error (type only, no message)
python track_event.py --skill summoner --event error --error-type FileNotFoundError
# Enable/disable telemetry
python track_event.py --enable
python track_event.py --disable
# View local events
python track_event.py --show-events
python track_event.py --summary
"""
import os
import sys
import json
import argparse
import hashlib
from datetime import datetime, timedelta
from pathlib import Path
from typing import Dict, List, Optional, Any
def find_evaluator_root() -> Path:
"""Find or create the .evaluator directory."""
current = Path.cwd()
while current != current.parent:
evaluator_path = current / '.evaluator'
if evaluator_path.exists():
return evaluator_path
current = current.parent
# Not found, create in current project root
evaluator_path = Path.cwd() / '.evaluator'
evaluator_path.mkdir(parents=True, exist_ok=True)
return evaluator_path
def load_config(evaluator_path: Path) -> Dict[str, Any]:
"""Load Evaluator configuration."""
config_file = evaluator_path / 'config.json'
if not config_file.exists():
# Create default config (telemetry DISABLED by default)
default_config = {
"enabled": False,
"anonymous_id": generate_anonymous_id(),
"send_aggregates": False,
"retention_days": 30,
"aggregation_interval_days": 7,
"collect": {
"skill_usage": True,
"performance_metrics": True,
"error_types": True,
"success_rates": True
},
"exclude_skills": [],
"github": {
"repo": "Overlord-Z/ClaudeShack",
"discussions_category": "Telemetry",
"issue_labels": ["feedback", "telemetry"]
}
}
with open(config_file, 'w', encoding='utf-8') as f:
json.dump(default_config, f, indent=2)
return default_config
try:
with open(config_file, 'r', encoding='utf-8') as f:
return json.load(f)
    except (json.JSONDecodeError, OSError):
        # Corrupt or unreadable config: fail safe with telemetry disabled
        return {"enabled": False}
def save_config(evaluator_path: Path, config: Dict[str, Any]) -> None:
"""Save Evaluator configuration."""
config_file = evaluator_path / 'config.json'
try:
with open(config_file, 'w', encoding='utf-8') as f:
json.dump(config, f, indent=2)
    except OSError as e:
print(f"Error: Failed to save config: {e}", file=sys.stderr)
sys.exit(1)
def generate_anonymous_id() -> str:
"""Generate a daily-rotating anonymous ID.
Returns:
Anonymous hash that rotates daily
"""
    # Derive the ID from today's date alone so it rotates daily.
    # Using just the date makes it truly anonymous: every user on the
    # same date shares the same ID, so events cannot be tied to a person.
    date_salt = datetime.now().strftime('%Y-%m-%d')
    return hashlib.sha256(date_salt.encode()).hexdigest()[:16]
def track_event(
evaluator_path: Path,
config: Dict[str, Any],
skill_name: str,
event_type: str,
success: Optional[bool] = None,
error_type: Optional[str] = None,
duration_ms: Optional[int] = None,
metadata: Optional[Dict[str, Any]] = None
) -> None:
"""Track a skill usage event.
Args:
evaluator_path: Path to .evaluator directory
config: Evaluator configuration
skill_name: Name of the skill
event_type: Type of event (invoked, error, etc.)
success: Whether the operation succeeded
error_type: Type of error (if applicable)
duration_ms: Duration in milliseconds
metadata: Additional anonymous metadata
"""
if not config.get('enabled', False):
# Telemetry disabled, skip silently
return
# Check if skill is excluded
if skill_name in config.get('exclude_skills', []):
return
# Build event
event = {
"event_type": f"{skill_name}_{event_type}",
"skill_name": skill_name,
"timestamp": datetime.now().isoformat(),
"session_id": config.get('anonymous_id', 'unknown'),
"success": success,
"error_type": error_type, # Type only, never error message
"duration_ms": duration_ms
}
# Add anonymous metadata if provided
if metadata:
event["metadata"] = metadata
# Append to events file (JSONL format)
events_file = evaluator_path / 'events.jsonl'
try:
with open(events_file, 'a', encoding='utf-8') as f:
f.write(json.dumps(event) + '\n')
    except OSError:
        # Fail silently - telemetry should never break the workflow
        pass
def track_metric(
evaluator_path: Path,
config: Dict[str, Any],
skill_name: str,
metric_name: str,
value: float,
metadata: Optional[Dict[str, Any]] = None
) -> None:
"""Track a skill metric.
Args:
evaluator_path: Path to .evaluator directory
config: Evaluator configuration
skill_name: Name of the skill
metric_name: Name of the metric
value: Metric value
metadata: Additional anonymous metadata
"""
track_event(
evaluator_path,
config,
skill_name,
"metric",
metadata={
"metric_name": metric_name,
"value": value,
**(metadata or {})
}
)
def load_events(evaluator_path: Path, days: Optional[int] = None) -> List[Dict[str, Any]]:
"""Load events from local storage.
Args:
evaluator_path: Path to .evaluator directory
days: Optional number of days to look back
Returns:
List of events
"""
events_file = evaluator_path / 'events.jsonl'
if not events_file.exists():
return []
events = []
cutoff = None
if days:
cutoff = datetime.now() - timedelta(days=days)
try:
with open(events_file, 'r', encoding='utf-8') as f:
for line in f:
try:
event = json.loads(line.strip())
# Filter by date if cutoff specified
if cutoff:
event_time = datetime.fromisoformat(event['timestamp'])
if event_time < cutoff:
continue
events.append(event)
except json.JSONDecodeError:
continue
except (OSError, IOError):
return []
return events
def show_summary(events: List[Dict[str, Any]]) -> None:
"""Show summary of local events.
Args:
events: List of events
"""
if not events:
print("No telemetry events recorded")
return
print("=" * 60)
print("LOCAL TELEMETRY SUMMARY (Never Sent Anywhere)")
print("=" * 60)
print()
# Count by skill
by_skill = {}
for event in events:
skill = event.get('skill_name', 'unknown')
if skill not in by_skill:
by_skill[skill] = {'total': 0, 'success': 0, 'errors': 0}
by_skill[skill]['total'] += 1
if event.get('success') is True:
by_skill[skill]['success'] += 1
elif event.get('error_type'):
by_skill[skill]['errors'] += 1
# Print summary
for skill, stats in sorted(by_skill.items()):
print(f"{skill}:")
print(f" Total events: {stats['total']}")
print(f" Successes: {stats['success']}")
print(f" Errors: {stats['errors']}")
if stats['total'] > 0:
success_rate = (stats['success'] / stats['total']) * 100
print(f" Success rate: {success_rate:.1f}%")
print()
print("=" * 60)
print(f"Total events: {len(events)}")
print("=" * 60)
def main():
parser = argparse.ArgumentParser(
description='Privacy-first anonymous telemetry for ClaudeShack',
formatter_class=argparse.RawDescriptionHelpFormatter
)
parser.add_argument('--skill', help='Skill name')
parser.add_argument('--event', help='Event type (invoked, error, etc.)')
    parser.add_argument('--success', type=lambda s: s.lower() in ('true', '1', 'yes'),
                        help='Whether operation succeeded (true/false)')
parser.add_argument('--error-type', help='Error type (not message)')
parser.add_argument('--duration', type=int, help='Duration in milliseconds')
parser.add_argument('--metric', help='Metric name')
parser.add_argument('--value', type=float, help='Metric value')
parser.add_argument('--enable', action='store_true', help='Enable telemetry (opt-in)')
parser.add_argument('--disable', action='store_true', help='Disable telemetry')
parser.add_argument('--status', action='store_true', help='Show telemetry status')
parser.add_argument('--show-events', action='store_true', help='Show local events')
parser.add_argument('--summary', action='store_true', help='Show event summary')
parser.add_argument('--days', type=int, help='Days to look back (default: all)')
parser.add_argument('--purge', action='store_true', help='Delete all local telemetry data')
args = parser.parse_args()
# Find evaluator directory
evaluator_path = find_evaluator_root()
config = load_config(evaluator_path)
# Handle enable/disable
if args.enable:
config['enabled'] = True
config['anonymous_id'] = generate_anonymous_id()
save_config(evaluator_path, config)
print("✓ Telemetry enabled (anonymous, opt-in)")
print(f" Anonymous ID: {config['anonymous_id']}")
print(" No personally identifiable information is collected")
print(" You can disable anytime with: --disable")
sys.exit(0)
if args.disable:
config['enabled'] = False
save_config(evaluator_path, config)
print("✓ Telemetry disabled")
print(" Run with --purge to delete all local data")
sys.exit(0)
# Handle status
if args.status:
print("Evaluator Telemetry Status:")
print("=" * 60)
print(f"Enabled: {config.get('enabled', False)}")
print(f"Anonymous ID: {config.get('anonymous_id', 'Not set')}")
print(f"Send aggregates: {config.get('send_aggregates', False)}")
print(f"Retention: {config.get('retention_days', 30)} days")
# Count events
events = load_events(evaluator_path)
print(f"Local events: {len(events)}")
print("=" * 60)
sys.exit(0)
# Handle purge
if args.purge:
events_file = evaluator_path / 'events.jsonl'
if events_file.exists():
events_file.unlink()
print("✓ All local telemetry data deleted")
else:
print("No telemetry data to delete")
sys.exit(0)
# Handle show events
if args.show_events:
events = load_events(evaluator_path, args.days)
print(json.dumps(events, indent=2))
sys.exit(0)
# Handle summary
if args.summary:
events = load_events(evaluator_path, args.days)
show_summary(events)
sys.exit(0)
# Track event
if args.skill and args.event:
track_event(
evaluator_path,
config,
args.skill,
args.event,
args.success,
args.error_type,
args.duration
)
# Silent success (telemetry should be invisible)
sys.exit(0)
# Track metric
if args.skill and args.metric and args.value is not None:
track_metric(
evaluator_path,
config,
args.skill,
args.metric,
args.value
)
# Silent success
sys.exit(0)
parser.print_help()
sys.exit(1)
if __name__ == '__main__':
main()