592 lines
12 KiB
Markdown
592 lines
12 KiB
Markdown
---
|
|
description: Run automated benchmark tests on Claude Code agents and track performance over time
|
|
---
|
|
|
|
## Usage
|
|
|
|
```bash
|
|
# Create a new benchmark suite
|
|
/benchmark-agent --create <agent-name>
|
|
|
|
# Run benchmarks
|
|
/benchmark-agent <agent-name>
|
|
/benchmark-agent --all
|
|
/benchmark-agent --all --marketing
|
|
/benchmark-agent --all --tech
|
|
|
|
# Advanced options
|
|
/benchmark-agent <agent-name> --rotate
|
|
/benchmark-agent --report-only
|
|
/benchmark-agent <agent-name> --verbose
|
|
/benchmark-agent <agent-name> --marketing-summary
|
|
```
|
|
|
|
---
|
|
|
|
## Commands
|
|
|
|
### Create New Benchmark
|
|
|
|
```bash
|
|
/benchmark-agent --create my-content-agent
|
|
```
|
|
|
|
**What happens:**
|
|
1. Launches `test-suite-creator` agent
|
|
2. Asks you 5 questions about your agent
|
|
3. Generates complete benchmark suite:
|
|
- 5 diverse test cases
|
|
- Ground truth expectations (JSON)
|
|
- Scoring rubric (METRICS.md)
|
|
- Documentation
|
|
|
|
**Time:** < 1 hour from start to first benchmark
|
|
|
|
---
|
|
|
|
### Run Single Agent
|
|
|
|
```bash
|
|
/benchmark-agent seo-specialist
|
|
```
|
|
|
|
**What happens:**
|
|
1. Loads test suite for `seo-specialist`
|
|
2. Executes all test cases
|
|
3. Scores results via `benchmark-judge`
|
|
4. Updates performance history
|
|
5. Generates detailed report
|
|
|
|
**Output:**
|
|
```markdown
|
|
# Benchmark Results: seo-specialist
|
|
|
|
Overall Score: 90/100 ✅ PASS
|
|
Trend: ⬆️ Improving (+2 from baseline)
|
|
|
|
Individual Tests:
|
|
- Test #01: 82/100 ✓
|
|
- Test #02: 96/100 ✓
|
|
- Test #03: 92/100 ✓
|
|
|
|
Recommendation: DEPLOY v2
|
|
```
|
|
|
|
**Time:** 2-5 minutes (depends on agent complexity)
|
|
|
|
---
|
|
|
|
### Run All Agents
|
|
|
|
```bash
|
|
/benchmark-agent --all
|
|
```
|
|
|
|
**What happens:**
|
|
1. Loads all agents from registry
|
|
2. Runs benchmark on each
|
|
3. Generates summary report
|
|
|
|
**Filters:**
|
|
```bash
|
|
/benchmark-agent --all --marketing # Marketing agents only
|
|
/benchmark-agent --all --tech # Tech repo agents only
|
|
```
|
|
|
|
**Output:**
|
|
```markdown
|
|
# Benchmark Results: All Agents
|
|
|
|
Summary:
|
|
| Agent | Score | Status | Trend |
|
|
|------------------------|--------|--------|-------|
|
|
| seo-specialist | 90/100 | ✅ Pass | ⬆️ +2 |
|
|
| content-publishing | 97/100 | ✅ Pass | ➡️ 0 |
|
|
| weekly-planning | 85/100 | ✅ Pass | ⬆️ +3 |
|
|
|
|
Overall health: 6/7 passing (85.7%)
|
|
```
|
|
|
|
**Time:** 10-30 minutes (depends on number of agents)
|
|
|
|
---
|
|
|
|
### Report Only
|
|
|
|
```bash
|
|
/benchmark-agent --report-only
|
|
/benchmark-agent seo-specialist --report-only
|
|
```
|
|
|
|
**What happens:**
|
|
1. Skips test execution
|
|
2. Reads latest run from history
|
|
3. Generates report from stored data
|
|
|
|
**Use cases:**
|
|
- Quick status check
|
|
- Share results with team
|
|
- Review historical performance
|
|
|
|
**Time:** < 5 seconds
|
|
|
|
---
|
|
|
|
### Test Rotation
|
|
|
|
```bash
|
|
/benchmark-agent seo-specialist --rotate
|
|
```
|
|
|
|
**What happens:**
|
|
1. Runs normal benchmark
|
|
2. Analyzes test performance
|
|
3. Suggests new tests (if agent scoring 95+)
|
|
4. Suggests retiring tests (if scoring 100 three times)
|
|
5. You approve/reject suggestions
|
|
|
|
**Example output:**
|
|
```markdown
|
|
## Test Rotation Suggestion
|
|
|
|
Current performance:
|
|
- Test #01: 95/100
|
|
- Test #02: 96/100
|
|
- Test #03: 97/100
|
|
|
|
Recommendation: Add Test #04 (long-form listicle)
|
|
|
|
Rationale:
|
|
- Agent mastering current tests
|
|
- Need to test SEO on 2000+ word content
|
|
- Listicle format has unique challenges
|
|
|
|
Accept? (yes/no)
|
|
```
|
|
|
|
---
|
|
|
|
### Verbose Mode
|
|
|
|
```bash
|
|
/benchmark-agent seo-specialist --verbose
|
|
```
|
|
|
|
**What happens:**
|
|
Shows detailed execution steps:
|
|
- Test file loading
|
|
- Agent invocation
|
|
- Judge scoring process
|
|
- Performance calculation
|
|
|
|
**Use for:**
|
|
- Debugging
|
|
- Understanding workflow
|
|
- Investigating unexpected results
|
|
|
|
---
|
|
|
|
### Marketing Summary
|
|
|
|
```bash
|
|
/benchmark-agent seo-specialist --marketing-summary
|
|
```
|
|
|
|
**What happens:**
|
|
Generates marketing-ready content about agent performance.
|
|
|
|
**Output:**
|
|
```markdown
|
|
# seo-specialist Performance Update
|
|
|
|
Latest score: 90/100 ✅
|
|
Improvement: +2.3% over 8 days
|
|
|
|
What Improved:
|
|
✨ More accurate scoring on mediocre content
|
|
✨ Zero false positives on excellent content
|
|
✨ Consistent spam detection
|
|
|
|
Real-World Impact:
|
|
Automated SEO auditing for blog posts with improved accuracy.
|
|
|
|
*Benchmarked using Agent Benchmark Kit*
|
|
```
|
|
|
|
**Use for:**
|
|
- Blog posts
|
|
- Social media
|
|
- Performance updates
|
|
- Customer communication
|
|
|
|
---
|
|
|
|
## Configuration
|
|
|
|
### Registry File
|
|
|
|
**Location:** `~/.agent-benchmarks/registry.yml`
|
|
|
|
**Structure:**
|
|
```yaml
|
|
agents:
|
|
seo-specialist:
|
|
name: "seo-specialist"
|
|
location: "marketing"
|
|
test_suite: "~/.agent-benchmarks/seo-specialist/"
|
|
baseline_score: 88
|
|
target_score: 90
|
|
status: "production"
|
|
|
|
content-publishing-specialist:
|
|
name: "content-publishing-specialist"
|
|
location: "marketing"
|
|
test_suite: "~/.agent-benchmarks/content-publishing-specialist/"
|
|
baseline_score: 97.5
|
|
target_score: 95
|
|
status: "production"
|
|
```
|
|
|
|
**Add new agent:**
|
|
```bash
|
|
/benchmark-agent --create my-agent
|
|
# Automatically adds to registry
|
|
```
|
|
|
|
---
|
|
|
|
### Performance History
|
|
|
|
**Location:** `~/.agent-benchmarks/performance-history.json`
|
|
|
|
**Structure:**
|
|
```json
|
|
{
|
|
"seo-specialist": {
|
|
"baseline": { "version": "v1", "score": 88 },
|
|
"current": { "version": "v2", "score": 90 },
|
|
"runs": [
|
|
{
|
|
"id": "run-001",
|
|
"timestamp": "2025-11-01T10:00:00Z",
|
|
"score": 88,
|
|
"tests": {...}
|
|
},
|
|
{
|
|
"id": "run-002",
|
|
"timestamp": "2025-11-09T14:30:00Z",
|
|
"score": 90,
|
|
"tests": {...}
|
|
}
|
|
]
|
|
}
|
|
}
|
|
```
|
|
|
|
**Managed automatically** - no manual editing needed
|
|
|
|
---
|
|
|
|
## Examples
|
|
|
|
### Example 1: Create and run first benchmark
|
|
|
|
```bash
|
|
# 1. Create benchmark suite
|
|
/benchmark-agent --create seo-specialist
|
|
|
|
# Answer questions:
|
|
# > What does your agent do?
|
|
# "Audits blog posts for SEO optimization"
|
|
# > What validations does it perform?
|
|
# "Keyword usage, meta descriptions, content length, structure"
|
|
# > What are edge cases?
|
|
# "Keyword stuffing, perfect content, very short content"
|
|
# > What's perfect output?
|
|
# "700+ words, good keyword density, proper structure"
|
|
# > What's failing output?
|
|
# "Thin content, no meta, keyword stuffing"
|
|
|
|
# 2. Review generated suite
|
|
ls ~/.agent-benchmarks/seo-specialist/
|
|
|
|
# 3. Run benchmark
|
|
/benchmark-agent seo-specialist
|
|
|
|
# 4. View results
|
|
# (Automatically displayed)
|
|
```
|
|
|
|
---
|
|
|
|
### Example 2: Weekly benchmark run
|
|
|
|
```bash
|
|
# Run all production agents
|
|
/benchmark-agent --all
|
|
|
|
# Review summary
|
|
# Identify any regressions
|
|
# Investigate agents below threshold
|
|
```
|
|
|
|
---
|
|
|
|
### Example 3: After prompt changes
|
|
|
|
```bash
|
|
# Made changes to agent prompt
|
|
# Want to validate improvement
|
|
|
|
# Run benchmark
|
|
/benchmark-agent seo-specialist
|
|
|
|
# Compare to baseline
|
|
# Look for:
|
|
# - Overall score increase
|
|
# - Specific test improvements
|
|
# - No new regressions
|
|
```
|
|
|
|
---
|
|
|
|
### Example 4: Generate marketing content
|
|
|
|
```bash
|
|
# Agent improved, want to share
|
|
|
|
/benchmark-agent seo-specialist --marketing-summary
|
|
|
|
# Copy output to blog post
|
|
# Share on social media
|
|
# Include in documentation
|
|
```
|
|
|
|
---
|
|
|
|
## Workflow Behind the Scenes
|
|
|
|
When you run `/benchmark-agent seo-specialist`, this happens:
|
|
|
|
1. **Slash command** receives input
|
|
2. **Invokes** `benchmark-orchestrator` agent
|
|
3. **Orchestrator:**
|
|
- Loads agent config
|
|
- For each test:
|
|
- Reads test file
|
|
- Invokes agent under test
|
|
- Captures output
|
|
- Invokes `benchmark-judge`
|
|
- Records score
|
|
- Calculates overall score
|
|
- Updates performance history
|
|
- Generates report
|
|
4. **Returns** report to you
|
|
|
|
**You see:** Final report
|
|
**Behind the scenes:** Full orchestration workflow
|
|
|
|
---
|
|
|
|
## Directory Structure
|
|
|
|
```
|
|
~/.agent-benchmarks/
|
|
├── registry.yml # Agent registry
|
|
├── performance-history.json # All agent history
|
|
├── seo-specialist/ # Agent benchmark suite
|
|
│ ├── test-cases/
|
|
│ │ ├── TEST-METADATA.md
|
|
│ │ ├── 01-mediocre-content.md
|
|
│ │ ├── 02-excellent-content.md
|
|
│ │ └── ...
|
|
│ ├── ground-truth/
|
|
│ │ ├── 01-expected.json
|
|
│ │ ├── 02-expected.json
|
|
│ │ └── ...
|
|
│ ├── results/
|
|
│ │ ├── run-001-results.md
|
|
│ │ ├── run-002-results.md
|
|
│ │ └── summary.md
|
|
│ ├── METRICS.md
|
|
│ ├── README.md
|
|
│ └── QUICK-START.md
|
|
└── content-publishing-specialist/
|
|
└── [similar structure]
|
|
```
|
|
|
|
---
|
|
|
|
## Error Messages
|
|
|
|
### Agent not found
|
|
|
|
```markdown
|
|
❌ Error: Agent 'xyz' not found in registry
|
|
|
|
Available agents:
|
|
- seo-specialist
|
|
- content-publishing-specialist
|
|
- weekly-planning-specialist
|
|
|
|
Did you mean: seo-specialist?
|
|
|
|
To create new benchmark:
|
|
/benchmark-agent --create xyz
|
|
```
|
|
|
|
---
|
|
|
|
### No test suite
|
|
|
|
```markdown
|
|
❌ Error: No test suite found for 'my-agent'
|
|
|
|
The agent is registered but has no test cases.
|
|
|
|
Create benchmark suite:
|
|
/benchmark-agent --create my-agent
|
|
```
|
|
|
|
---
|
|
|
|
### Below threshold
|
|
|
|
```markdown
|
|
⚠️ Warning: Agent scored below threshold
|
|
|
|
Score: 75/100
|
|
Threshold: 80/100
|
|
Status: ❌ FAIL
|
|
|
|
Recommendation: Do NOT deploy
|
|
- Review failing tests
|
|
- Investigate regressions
|
|
- Iterate on agent prompt
|
|
- Re-run benchmark
|
|
```
|
|
|
|
---
|
|
|
|
## Tips
|
|
|
|
### Tip 1: Run before deploying
|
|
|
|
```bash
|
|
# Made prompt changes?
|
|
# Run benchmark before deploying
|
|
|
|
/benchmark-agent my-agent
|
|
|
|
# Only deploy if:
|
|
# - Score ≥ 80/100
|
|
# - No regressions on critical tests
|
|
# - Improvement over baseline (ideally)
|
|
```
|
|
|
|
---
|
|
|
|
### Tip 2: Weekly health checks
|
|
|
|
```bash
|
|
# Set up weekly routine
|
|
# Every Monday morning:
|
|
|
|
/benchmark-agent --all
|
|
|
|
# Review summary
|
|
# Investigate any regressions
|
|
# Celebrate improvements
|
|
```
|
|
|
|
---
|
|
|
|
### Tip 3: Use reports in PRs
|
|
|
|
```bash
|
|
# Making agent changes in PR?
|
|
# Include benchmark results
|
|
|
|
/benchmark-agent my-agent --report-only
|
|
|
|
# Copy markdown to PR description
|
|
# Show before/after scores
|
|
# Justify changes with data
|
|
```
|
|
|
|
---
|
|
|
|
### Tip 4: Track improvement journeys
|
|
|
|
```bash
|
|
# Document your agent's evolution
|
|
|
|
Week 1: 88/100 (baseline)
|
|
Week 2: 90/100 (+2 - added calibration)
|
|
Week 3: 92/100 (+2 - improved recommendations)
|
|
Week 4: 94/100 (+2 - edge case handling)
|
|
|
|
# Great content for:
|
|
# - Blog posts
|
|
# - Case studies
|
|
# - Team updates
|
|
```
|
|
|
|
---
|
|
|
|
## Next Steps
|
|
|
|
### After creating your first benchmark:
|
|
|
|
1. ✅ **Run it** - Get baseline score
|
|
2. ✅ **Review results** - Understand strengths/weaknesses
|
|
3. ✅ **Iterate** - Improve agent prompt based on data
|
|
4. ✅ **Re-run** - Validate improvements
|
|
5. ✅ **Deploy** - Ship better agent to production
|
|
|
|
### After establishing multiple benchmarks:
|
|
|
|
1. ✅ **Schedule weekly runs** - `/benchmark-agent --all`
|
|
2. ✅ **Track trends** - Performance history over time
|
|
3. ✅ **Rotate tests** - Keep agents challenged
|
|
4. ✅ **Share results** - Marketing content, team updates
|
|
|
|
---
|
|
|
|
## Learn More
|
|
|
|
- **[Getting Started Guide](../docs/getting-started.md)** - Installation and first benchmark
|
|
- **[Test Creation Guide](../docs/test-creation-guide.md)** - How to design effective tests
|
|
- **[Scoring Rubrics](../docs/scoring-rubrics.md)** - How to create fair scoring
|
|
- **[Advanced Usage](../docs/advanced-usage.md)** - Test rotation, tips, best practices
|
|
|
|
---
|
|
|
|
## Troubleshooting
|
|
|
|
**Problem:** Command not found
|
|
**Solution:** Run install script: `./scripts/install.sh`
|
|
|
|
**Problem:** Agent execution timeout
|
|
**Solution:** Increase timeout in config or simplify test case
|
|
|
|
**Problem:** Judge scoring seems incorrect
|
|
**Solution:** Review ground truth expectations, update rubric
|
|
|
|
**Problem:** Can't find test files
|
|
**Solution:** Check directory structure, ensure files are in correct location
|
|
|
|
---
|
|
|
|
## Support
|
|
|
|
- **Issues:** [GitHub Issues](https://github.com/BrandCast-Signage/agent-benchmark-kit/issues)
|
|
- **Discussions:** [GitHub Discussions](https://github.com/BrandCast-Signage/agent-benchmark-kit/discussions)
|
|
- **Docs:** [Full Documentation](../docs/)
|
|
|
|
---
|
|
|
|
**Built with ❤️ by [BrandCast](https://brandcast.app)**
|
|
|
|
Automated agent QA for production use.
|