| description |
|---|
| Run automated benchmark tests on Claude Code agents and track performance over time |
## Usage

```bash
# Create a new benchmark suite
/benchmark-agent --create <agent-name>

# Run benchmarks
/benchmark-agent <agent-name>
/benchmark-agent --all
/benchmark-agent --all --marketing
/benchmark-agent --all --tech

# Advanced options
/benchmark-agent <agent-name> --rotate
/benchmark-agent --report-only
/benchmark-agent <agent-name> --verbose
/benchmark-agent <agent-name> --marketing-summary
```
## Commands

### Create New Benchmark

```bash
/benchmark-agent --create my-content-agent
```

What happens:

- Launches the `test-suite-creator` agent
- Asks you 5 questions about your agent
- Generates a complete benchmark suite:
  - 5 diverse test cases
  - Ground truth expectations (JSON)
  - Scoring rubric (METRICS.md)
  - Documentation

Time: < 1 hour from start to first benchmark
### Run Single Agent

```bash
/benchmark-agent seo-specialist
```

What happens:

- Loads the test suite for `seo-specialist`
- Executes all test cases
- Scores results via `benchmark-judge`
- Updates performance history
- Generates a detailed report

Output:

```
# Benchmark Results: seo-specialist

Overall Score: 90/100 ✅ PASS
Trend: ⬆️ Improving (+2 from baseline)

Individual Tests:
- Test #01: 82/100 ✓
- Test #02: 96/100 ✓
- Test #03: 92/100 ✓

Recommendation: DEPLOY v2
```

Time: 2-5 minutes (depends on agent complexity)
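The summary lines of the report can be reproduced from the per-test scores and the stored baseline. A minimal sketch, assuming the 80/100 pass threshold documented under Error Messages below; the function name is illustrative, not part of the tool:

```python
# Sketch: derive the summary lines of a single-agent benchmark report.
# Assumes per-test scores out of 100 and a stored baseline score.

def summarize_run(test_scores: list[int], baseline: int, threshold: int = 80):
    overall = round(sum(test_scores) / len(test_scores))
    status = "✅ PASS" if overall >= threshold else "❌ FAIL"
    delta = overall - baseline
    arrow = "⬆️" if delta > 0 else ("⬇️" if delta < 0 else "➡️")
    return overall, status, f"{arrow} {delta:+d} from baseline"

# The report above: tests 82, 96, 92 against a baseline of 88
print(summarize_run([82, 96, 92], baseline=88))
# (90, '✅ PASS', '⬆️ +2 from baseline')
```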
### Run All Agents

```bash
/benchmark-agent --all
```

What happens:

- Loads all agents from the registry
- Runs the benchmark on each
- Generates a summary report

Filters:

```bash
/benchmark-agent --all --marketing   # Marketing agents only
/benchmark-agent --all --tech        # Tech repo agents only
```

Output:

```
# Benchmark Results: All Agents

Summary:

| Agent              | Score  | Status  | Trend |
|--------------------|--------|---------|-------|
| seo-specialist     | 90/100 | ✅ Pass | ⬆️ +2 |
| content-publishing | 97/100 | ✅ Pass | ➡️ 0  |
| weekly-planning    | 85/100 | ✅ Pass | ⬆️ +3 |

Overall health: 6/7 passing (85.7%)
```

Time: 10-30 minutes (depends on number of agents)
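The "Overall health" line is just a pass rate over the per-agent scores. A sketch, again assuming the 80/100 threshold; the function name and the fourth agent are illustrative:

```python
# Sketch: compute the fleet-wide health line from per-agent scores.
def overall_health(scores: dict[str, int], threshold: int = 80) -> str:
    passing = sum(1 for s in scores.values() if s >= threshold)
    total = len(scores)
    return f"Overall health: {passing}/{total} passing ({passing / total:.1%})"

print(overall_health({
    "seo-specialist": 90,
    "content-publishing": 97,
    "weekly-planning": 85,
    "hypothetical-failing-agent": 75,
}))
# Overall health: 3/4 passing (75.0%)
```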
### Report Only

```bash
/benchmark-agent --report-only
/benchmark-agent seo-specialist --report-only
```

What happens:

- Skips test execution
- Reads the latest run from history
- Generates a report from stored data

Use cases:

- Quick status check
- Share results with the team
- Review historical performance

Time: < 5 seconds
### Test Rotation

```bash
/benchmark-agent seo-specialist --rotate
```

What happens:

- Runs the normal benchmark
- Analyzes test performance
- Suggests new tests (if the agent is scoring 95+)
- Suggests retiring tests (if a test scores 100 three times in a row)
- You approve or reject the suggestions

Example output:

```
## Test Rotation Suggestion

Current performance:
- Test #01: 95/100
- Test #02: 96/100
- Test #03: 97/100

Recommendation: Add Test #04 (long-form listicle)

Rationale:
- Agent is mastering the current tests
- Need to test SEO on 2000+ word content
- The listicle format has unique challenges

Accept? (yes/no)
```
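The two rotation rules above (suggest new tests at 95+, retire a test after three straight 100s) can be sketched as a small heuristic. The thresholds mirror the description; the function name and data layout are illustrative:

```python
# Sketch: test-rotation heuristic from the rules above.
# history maps test id -> recent scores, oldest first.

def rotation_suggestions(history: dict[str, list[int]]) -> list[str]:
    suggestions = []
    latest = {test: scores[-1] for test, scores in history.items()}
    if latest and all(score >= 95 for score in latest.values()):
        suggestions.append("ADD: agent is mastering current tests - add a harder one")
    for test, scores in history.items():
        if len(scores) >= 3 and all(s == 100 for s in scores[-3:]):
            suggestions.append(f"RETIRE: {test} scored 100 three runs in a row")
    return suggestions

print(rotation_suggestions({
    "01-mediocre-content": [90, 93, 95],
    "02-excellent-content": [100, 100, 100],
    "03-spam-detection": [94, 95, 97],
}))
```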
### Verbose Mode

```bash
/benchmark-agent seo-specialist --verbose
```

What happens: shows detailed execution steps:

- Test file loading
- Agent invocation
- Judge scoring process
- Performance calculation

Use for:

- Debugging
- Understanding the workflow
- Investigating unexpected results
### Marketing Summary

```bash
/benchmark-agent seo-specialist --marketing-summary
```

What happens: generates marketing-ready content about agent performance.

Output:

```
# seo-specialist Performance Update

Latest score: 90/100 ✅
Improvement: +2.3% over 8 days

What Improved:
✨ More accurate scoring on mediocre content
✨ Zero false positives on excellent content
✨ Consistent spam detection

Real-World Impact:
Automated SEO auditing for blog posts with improved accuracy.

*Benchmarked using Agent Benchmark Kit*
```

Use for:

- Blog posts
- Social media
- Performance updates
- Customer communication
## Configuration

### Registry File

Location: `~/.agent-benchmarks/registry.yml`

Structure:

```yaml
agents:
  seo-specialist:
    name: "seo-specialist"
    location: "marketing"
    test_suite: "~/.agent-benchmarks/seo-specialist/"
    baseline_score: 88
    target_score: 90
    status: "production"
  content-publishing-specialist:
    name: "content-publishing-specialist"
    location: "marketing"
    test_suite: "~/.agent-benchmarks/content-publishing-specialist/"
    baseline_score: 97.5
    target_score: 95
    status: "production"
```

Add a new agent:

```bash
/benchmark-agent --create my-agent
# Automatically adds the agent to the registry
```
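Once `registry.yml` is parsed (e.g. with PyYAML's `safe_load`), each entry should carry the fields shown above. A hedged sketch of a validation pass over the already-parsed structure; the required-field set comes from the example, the function name is illustrative:

```python
# Sketch: validate parsed registry entries against the fields shown above.
REQUIRED_FIELDS = {"name", "location", "test_suite",
                   "baseline_score", "target_score", "status"}

def validate_registry(registry: dict) -> list[str]:
    problems = []
    for agent, entry in registry.get("agents", {}).items():
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            problems.append(f"{agent}: missing {sorted(missing)}")
    return problems

parsed = {"agents": {"seo-specialist": {
    "name": "seo-specialist", "location": "marketing",
    "test_suite": "~/.agent-benchmarks/seo-specialist/",
    "baseline_score": 88, "target_score": 90, "status": "production",
}}}
print(validate_registry(parsed))  # []
```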
### Performance History

Location: `~/.agent-benchmarks/performance-history.json`

Structure:

```json
{
  "seo-specialist": {
    "baseline": { "version": "v1", "score": 88 },
    "current": { "version": "v2", "score": 90 },
    "runs": [
      {
        "id": "run-001",
        "timestamp": "2025-11-01T10:00:00Z",
        "score": 88,
        "tests": {...}
      },
      {
        "id": "run-002",
        "timestamp": "2025-11-09T14:30:00Z",
        "score": 90,
        "tests": {...}
      }
    ]
  }
}
```

This file is managed automatically - no manual editing is needed.
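Reading the history file with the stdlib `json` module is enough to reproduce the trend arrow shown in reports; a sketch under the schema above (the `trend` helper is illustrative, and the per-run `tests` objects are omitted):

```python
import json

# Sketch: compute an agent's trend from performance-history.json data.
def trend(history: dict, agent: str) -> str:
    entry = history[agent]
    delta = entry["current"]["score"] - entry["baseline"]["score"]
    arrow = "⬆️" if delta > 0 else ("⬇️" if delta < 0 else "➡️")
    return f"{arrow} {delta:+d} from baseline ({len(entry['runs'])} runs recorded)"

# Example with the structure shown above (per-run tests omitted):
history = json.loads("""
{
  "seo-specialist": {
    "baseline": { "version": "v1", "score": 88 },
    "current": { "version": "v2", "score": 90 },
    "runs": [
      { "id": "run-001", "timestamp": "2025-11-01T10:00:00Z", "score": 88 },
      { "id": "run-002", "timestamp": "2025-11-09T14:30:00Z", "score": 90 }
    ]
  }
}
""")
print(trend(history, "seo-specialist"))
# ⬆️ +2 from baseline (2 runs recorded)
```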
## Examples

### Example 1: Create and run your first benchmark

```bash
# 1. Create the benchmark suite
/benchmark-agent --create seo-specialist

# Answer the questions:
# > What does your agent do?
#   "Audits blog posts for SEO optimization"
# > What validations does it perform?
#   "Keyword usage, meta descriptions, content length, structure"
# > What are the edge cases?
#   "Keyword stuffing, perfect content, very short content"
# > What does perfect output look like?
#   "700+ words, good keyword density, proper structure"
# > What does failing output look like?
#   "Thin content, no meta, keyword stuffing"

# 2. Review the generated suite
ls ~/.agent-benchmarks/seo-specialist/

# 3. Run the benchmark
/benchmark-agent seo-specialist

# 4. View results
# (Displayed automatically)
```
### Example 2: Weekly benchmark run

```bash
# Run all production agents
/benchmark-agent --all

# Review the summary
# Identify any regressions
# Investigate agents below threshold
```
### Example 3: After prompt changes

```bash
# You made changes to the agent prompt
# and want to validate the improvement

# Run the benchmark
/benchmark-agent seo-specialist

# Compare to baseline. Look for:
# - Overall score increase
# - Specific test improvements
# - No new regressions
```
### Example 4: Generate marketing content

```bash
# The agent improved and you want to share it
/benchmark-agent seo-specialist --marketing-summary

# Copy the output to a blog post
# Share on social media
# Include in documentation
```
## Workflow Behind the Scenes

When you run `/benchmark-agent seo-specialist`, this happens:

1. The slash command receives your input
2. It invokes the `benchmark-orchestrator` agent
3. The orchestrator:
   - Loads the agent config
   - For each test:
     - Reads the test file
     - Invokes the agent under test
     - Captures the output
     - Invokes `benchmark-judge`
     - Records the score
   - Calculates the overall score
   - Updates the performance history
   - Generates the report
4. Returns the report to you

You see the final report; behind the scenes runs the full orchestration workflow.
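The orchestration loop above reduces to something like the following. This is only a structural sketch: `run_agent` and `judge` are stand-ins for the real agent invocations, which the tool performs internally and this document does not specify:

```python
# Sketch: the orchestration loop described above.
# run_agent() and judge() are placeholders for the real agent calls.

def run_agent(agent: str, test_id: str) -> str:
    return f"<output of {agent} on {test_id}>"   # placeholder output

def judge(output: str, expected: str) -> int:
    return 90                                    # placeholder score

def orchestrate(agent: str, tests: dict[str, str]) -> dict:
    scores = {}
    for test_id, expected in tests.items():
        output = run_agent(agent, test_id)       # invoke agent under test
        scores[test_id] = judge(output, expected)  # invoke benchmark-judge
    overall = round(sum(scores.values()) / len(scores))
    return {"agent": agent, "tests": scores, "overall": overall}

report = orchestrate("seo-specialist", {"01": "exp-01", "02": "exp-02"})
print(report["overall"])  # 90
```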
## Directory Structure

```
~/.agent-benchmarks/
├── registry.yml                  # Agent registry
├── performance-history.json      # All agent history
├── seo-specialist/               # Agent benchmark suite
│   ├── test-cases/
│   │   ├── TEST-METADATA.md
│   │   ├── 01-mediocre-content.md
│   │   ├── 02-excellent-content.md
│   │   └── ...
│   ├── ground-truth/
│   │   ├── 01-expected.json
│   │   ├── 02-expected.json
│   │   └── ...
│   ├── results/
│   │   ├── run-001-results.md
│   │   ├── run-002-results.md
│   │   └── summary.md
│   ├── METRICS.md
│   ├── README.md
│   └── QUICK-START.md
└── content-publishing-specialist/
    └── [similar structure]
```
## Error Messages

### Agent not found

```
❌ Error: Agent 'xyz' not found in registry

Available agents:
- seo-specialist
- content-publishing-specialist
- weekly-planning-specialist

Did you mean: seo-specialist?

To create a new benchmark:
/benchmark-agent --create xyz
```
### No test suite

```
❌ Error: No test suite found for 'my-agent'

The agent is registered but has no test cases.

Create a benchmark suite:
/benchmark-agent --create my-agent
```
### Below threshold

```
⚠️ Warning: Agent scored below threshold

Score: 75/100
Threshold: 80/100
Status: ❌ FAIL

Recommendation: Do NOT deploy
- Review failing tests
- Investigate regressions
- Iterate on the agent prompt
- Re-run the benchmark
```
## Tips

### Tip 1: Run before deploying

```bash
# Made prompt changes?
# Run the benchmark before deploying
/benchmark-agent my-agent

# Only deploy if:
# - Score ≥ 80/100
# - No regressions on critical tests
# - Improvement over baseline (ideally)
```

### Tip 2: Weekly health checks

```bash
# Set up a weekly routine.
# Every Monday morning:
/benchmark-agent --all

# Review the summary
# Investigate any regressions
# Celebrate improvements
```

### Tip 3: Use reports in PRs

```bash
# Making agent changes in a PR?
# Include the benchmark results
/benchmark-agent my-agent --report-only

# Copy the markdown to the PR description
# Show before/after scores
# Justify changes with data
```

### Tip 4: Track improvement journeys

Document your agent's evolution:

```
Week 1: 88/100 (baseline)
Week 2: 90/100 (+2 - added calibration)
Week 3: 92/100 (+2 - improved recommendations)
Week 4: 94/100 (+2 - edge case handling)
```

Great content for:

- Blog posts
- Case studies
- Team updates
## Next Steps

After creating your first benchmark:

- ✅ Run it - Get a baseline score
- ✅ Review results - Understand strengths and weaknesses
- ✅ Iterate - Improve the agent prompt based on data
- ✅ Re-run - Validate improvements
- ✅ Deploy - Ship the better agent to production

After establishing multiple benchmarks:

- ✅ Schedule weekly runs - `/benchmark-agent --all`
- ✅ Track trends - Performance history over time
- ✅ Rotate tests - Keep agents challenged
- ✅ Share results - Marketing content, team updates
## Learn More

- Getting Started Guide - Installation and your first benchmark
- Test Creation Guide - How to design effective tests
- Scoring Rubrics - How to create fair scoring
- Advanced Usage - Test rotation, tips, best practices
## Troubleshooting

**Problem:** Command not found
**Solution:** Run the install script: `./scripts/install.sh`

**Problem:** Agent execution timeout
**Solution:** Increase the timeout in config or simplify the test case

**Problem:** Judge scoring seems incorrect
**Solution:** Review the ground truth expectations and update the rubric

**Problem:** Can't find test files
**Solution:** Check the directory structure and ensure files are in the correct location
## Support

- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Docs: Full Documentation

Built with ❤️ by BrandCast

*Automated agent QA for production use.*