| description |
|---|
| Run automated benchmark tests on Claude Code agents and track performance over time |
## Usage

```bash
# Create a new benchmark suite
/benchmark-agent --create <agent-name>

# Run benchmarks
/benchmark-agent <agent-name>
/benchmark-agent --all
/benchmark-agent --all --marketing
/benchmark-agent --all --tech

# Advanced options
/benchmark-agent <agent-name> --rotate
/benchmark-agent --report-only
/benchmark-agent <agent-name> --verbose
/benchmark-agent <agent-name> --marketing-summary
```
## Commands

### Create New Benchmark

```bash
/benchmark-agent --create my-content-agent
```

What happens:

- Launches the `test-suite-creator` agent
- Asks you 5 questions about your agent
- Generates a complete benchmark suite:
  - 5 diverse test cases
  - Ground truth expectations (JSON)
  - Scoring rubric (METRICS.md)
  - Documentation

Time: < 1 hour from start to first benchmark
### Run Single Agent

```bash
/benchmark-agent seo-specialist
```

What happens:

- Loads the test suite for `seo-specialist`
- Executes all test cases
- Scores results via `benchmark-judge`
- Updates performance history
- Generates a detailed report

Output:

```
# Benchmark Results: seo-specialist

Overall Score: 90/100 ✅ PASS
Trend: ⬆️ Improving (+2 from baseline)

Individual Tests:
- Test #01: 82/100 ✓
- Test #02: 96/100 ✓
- Test #03: 92/100 ✓

Recommendation: DEPLOY v2
```

Time: 2-5 minutes (depends on agent complexity)
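The summary lines of the report can be reproduced from the per-test scores and the stored baseline. A minimal sketch, assuming the 80/100 pass threshold documented under Error Messages below; the function name is illustrative, not part of the tool:

```python
# Sketch: derive the summary lines of a single-agent benchmark report.
# Assumes per-test scores out of 100 and a stored baseline score.

def summarize_run(test_scores: list[int], baseline: int, threshold: int = 80):
    overall = round(sum(test_scores) / len(test_scores))
    status = "✅ PASS" if overall >= threshold else "❌ FAIL"
    delta = overall - baseline
    arrow = "⬆️" if delta > 0 else ("⬇️" if delta < 0 else "➡️")
    return overall, status, f"{arrow} {delta:+d} from baseline"

# The report above: tests 82, 96, 92 against a baseline of 88
print(summarize_run([82, 96, 92], baseline=88))
# (90, '✅ PASS', '⬆️ +2 from baseline')
```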
### Run All Agents

```bash
/benchmark-agent --all
```

What happens:

- Loads all agents from the registry
- Runs the benchmark on each
- Generates a summary report

Filters:

```bash
/benchmark-agent --all --marketing   # Marketing agents only
/benchmark-agent --all --tech        # Tech repo agents only
```

Output:

```
# Benchmark Results: All Agents

Summary:

| Agent              | Score  | Status  | Trend |
|--------------------|--------|---------|-------|
| seo-specialist     | 90/100 | ✅ Pass | ⬆️ +2 |
| content-publishing | 97/100 | ✅ Pass | ➡️ 0  |
| weekly-planning    | 85/100 | ✅ Pass | ⬆️ +3 |

Overall health: 6/7 passing (85.7%)
```

Time: 10-30 minutes (depends on number of agents)
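The "Overall health" line is just a pass rate over the per-agent scores. A sketch, again assuming the 80/100 threshold; the function name and the fourth agent are illustrative:

```python
# Sketch: compute the fleet-wide health line from per-agent scores.
def overall_health(scores: dict[str, int], threshold: int = 80) -> str:
    passing = sum(1 for s in scores.values() if s >= threshold)
    total = len(scores)
    return f"Overall health: {passing}/{total} passing ({passing / total:.1%})"

print(overall_health({
    "seo-specialist": 90,
    "content-publishing": 97,
    "weekly-planning": 85,
    "hypothetical-failing-agent": 75,
}))
# Overall health: 3/4 passing (75.0%)
```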
### Report Only

```bash
/benchmark-agent --report-only
/benchmark-agent seo-specialist --report-only
```

What happens:

- Skips test execution
- Reads the latest run from history
- Generates a report from stored data

Use cases:

- Quick status check
- Share results with the team
- Review historical performance

Time: < 5 seconds
### Test Rotation

```bash
/benchmark-agent seo-specialist --rotate
```

What happens:

- Runs the normal benchmark
- Analyzes test performance
- Suggests new tests (if the agent is scoring 95+)
- Suggests retiring tests (if a test scores 100 three times in a row)
- You approve or reject the suggestions

Example output:

```
## Test Rotation Suggestion

Current performance:
- Test #01: 95/100
- Test #02: 96/100
- Test #03: 97/100

Recommendation: Add Test #04 (long-form listicle)

Rationale:
- Agent is mastering the current tests
- Need to test SEO on 2000+ word content
- The listicle format has unique challenges

Accept? (yes/no)
```
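The two rotation rules above (suggest new tests at 95+, retire a test after three straight 100s) can be sketched as a small heuristic. The thresholds mirror the description; the function name and data layout are illustrative:

```python
# Sketch: test-rotation heuristic from the rules above.
# history maps test id -> recent scores, oldest first.

def rotation_suggestions(history: dict[str, list[int]]) -> list[str]:
    suggestions = []
    latest = {test: scores[-1] for test, scores in history.items()}
    if latest and all(score >= 95 for score in latest.values()):
        suggestions.append("ADD: agent is mastering current tests - add a harder one")
    for test, scores in history.items():
        if len(scores) >= 3 and all(s == 100 for s in scores[-3:]):
            suggestions.append(f"RETIRE: {test} scored 100 three runs in a row")
    return suggestions

print(rotation_suggestions({
    "01-mediocre-content": [90, 93, 95],
    "02-excellent-content": [100, 100, 100],
    "03-spam-detection": [94, 95, 97],
}))
```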
### Verbose Mode

```bash
/benchmark-agent seo-specialist --verbose
```

What happens: shows detailed execution steps:

- Test file loading
- Agent invocation
- Judge scoring process
- Performance calculation

Use for:

- Debugging
- Understanding the workflow
- Investigating unexpected results
### Marketing Summary

```bash
/benchmark-agent seo-specialist --marketing-summary
```

What happens: generates marketing-ready content about agent performance.

Output:

```
# seo-specialist Performance Update

Latest score: 90/100 ✅
Improvement: +2.3% over 8 days

What Improved:
✨ More accurate scoring on mediocre content
✨ Zero false positives on excellent content
✨ Consistent spam detection

Real-World Impact:
Automated SEO auditing for blog posts with improved accuracy.

*Benchmarked using Agent Benchmark Kit*
```

Use for:

- Blog posts
- Social media
- Performance updates
- Customer communication
## Configuration

### Registry File

Location: `~/.agent-benchmarks/registry.yml`

Structure:

```yaml
agents:
  seo-specialist:
    name: "seo-specialist"
    location: "marketing"
    test_suite: "~/.agent-benchmarks/seo-specialist/"
    baseline_score: 88
    target_score: 90
    status: "production"
  content-publishing-specialist:
    name: "content-publishing-specialist"
    location: "marketing"
    test_suite: "~/.agent-benchmarks/content-publishing-specialist/"
    baseline_score: 97.5
    target_score: 95
    status: "production"
```

Add a new agent:

```bash
/benchmark-agent --create my-agent
# Automatically adds the agent to the registry
```
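Once `registry.yml` is parsed (e.g. with PyYAML's `safe_load`), each entry should carry the fields shown above. A hedged sketch of a validation pass over the already-parsed structure; the required-field set comes from the example, the function name is illustrative:

```python
# Sketch: validate parsed registry entries against the fields shown above.
REQUIRED_FIELDS = {"name", "location", "test_suite",
                   "baseline_score", "target_score", "status"}

def validate_registry(registry: dict) -> list[str]:
    problems = []
    for agent, entry in registry.get("agents", {}).items():
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            problems.append(f"{agent}: missing {sorted(missing)}")
    return problems

parsed = {"agents": {"seo-specialist": {
    "name": "seo-specialist", "location": "marketing",
    "test_suite": "~/.agent-benchmarks/seo-specialist/",
    "baseline_score": 88, "target_score": 90, "status": "production",
}}}
print(validate_registry(parsed))  # []
```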
### Performance History

Location: `~/.agent-benchmarks/performance-history.json`

Structure:

```json
{
  "seo-specialist": {
    "baseline": { "version": "v1", "score": 88 },
    "current": { "version": "v2", "score": 90 },
    "runs": [
      {
        "id": "run-001",
        "timestamp": "2025-11-01T10:00:00Z",
        "score": 88,
        "tests": {...}
      },
      {
        "id": "run-002",
        "timestamp": "2025-11-09T14:30:00Z",
        "score": 90,
        "tests": {...}
      }
    ]
  }
}
```

This file is managed automatically - no manual editing is needed.
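Reading the history file with the stdlib `json` module is enough to reproduce the trend arrow shown in reports; a sketch under the schema above (the `trend` helper is illustrative, and the per-run `tests` objects are omitted):

```python
import json

# Sketch: compute an agent's trend from performance-history.json data.
def trend(history: dict, agent: str) -> str:
    entry = history[agent]
    delta = entry["current"]["score"] - entry["baseline"]["score"]
    arrow = "⬆️" if delta > 0 else ("⬇️" if delta < 0 else "➡️")
    return f"{arrow} {delta:+d} from baseline ({len(entry['runs'])} runs recorded)"

# Example with the structure shown above (per-run tests omitted):
history = json.loads("""
{
  "seo-specialist": {
    "baseline": { "version": "v1", "score": 88 },
    "current": { "version": "v2", "score": 90 },
    "runs": [
      { "id": "run-001", "timestamp": "2025-11-01T10:00:00Z", "score": 88 },
      { "id": "run-002", "timestamp": "2025-11-09T14:30:00Z", "score": 90 }
    ]
  }
}
""")
print(trend(history, "seo-specialist"))
# ⬆️ +2 from baseline (2 runs recorded)
```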
## Examples

### Example 1: Create and run your first benchmark

```bash
# 1. Create the benchmark suite
/benchmark-agent --create seo-specialist

# Answer the questions:
# > What does your agent do?
#   "Audits blog posts for SEO optimization"
# > What validations does it perform?
#   "Keyword usage, meta descriptions, content length, structure"
# > What are the edge cases?
#   "Keyword stuffing, perfect content, very short content"
# > What does perfect output look like?
#   "700+ words, good keyword density, proper structure"
# > What does failing output look like?
#   "Thin content, no meta, keyword stuffing"

# 2. Review the generated suite
ls ~/.agent-benchmarks/seo-specialist/

# 3. Run the benchmark
/benchmark-agent seo-specialist

# 4. View results
# (Displayed automatically)
```
### Example 2: Weekly benchmark run

```bash
# Run all production agents
/benchmark-agent --all

# Review the summary
# Identify any regressions
# Investigate agents below threshold
```
### Example 3: After prompt changes

```bash
# You made changes to the agent prompt
# and want to validate the improvement

# Run the benchmark
/benchmark-agent seo-specialist

# Compare to baseline. Look for:
# - Overall score increase
# - Specific test improvements
# - No new regressions
```
### Example 4: Generate marketing content

```bash
# The agent improved and you want to share it
/benchmark-agent seo-specialist --marketing-summary

# Copy the output to a blog post
# Share on social media
# Include in documentation
```
## Workflow Behind the Scenes

When you run `/benchmark-agent seo-specialist`, this happens:

1. The slash command receives your input
2. It invokes the `benchmark-orchestrator` agent
3. The orchestrator:
   - Loads the agent config
   - For each test:
     - Reads the test file
     - Invokes the agent under test
     - Captures the output
     - Invokes `benchmark-judge`
     - Records the score
   - Calculates the overall score
   - Updates the performance history
   - Generates the report
4. Returns the report to you

You see the final report; behind the scenes runs the full orchestration workflow.
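The orchestration loop above reduces to something like the following. This is only a structural sketch: `run_agent` and `judge` are stand-ins for the real agent invocations, which the tool performs internally and this document does not specify:

```python
# Sketch: the orchestration loop described above.
# run_agent() and judge() are placeholders for the real agent calls.

def run_agent(agent: str, test_id: str) -> str:
    return f"<output of {agent} on {test_id}>"   # placeholder output

def judge(output: str, expected: str) -> int:
    return 90                                    # placeholder score

def orchestrate(agent: str, tests: dict[str, str]) -> dict:
    scores = {}
    for test_id, expected in tests.items():
        output = run_agent(agent, test_id)       # invoke agent under test
        scores[test_id] = judge(output, expected)  # invoke benchmark-judge
    overall = round(sum(scores.values()) / len(scores))
    return {"agent": agent, "tests": scores, "overall": overall}

report = orchestrate("seo-specialist", {"01": "exp-01", "02": "exp-02"})
print(report["overall"])  # 90
```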
## Directory Structure

```
~/.agent-benchmarks/
├── registry.yml                  # Agent registry
├── performance-history.json      # All agent history
├── seo-specialist/               # Agent benchmark suite
│   ├── test-cases/
│   │   ├── TEST-METADATA.md
│   │   ├── 01-mediocre-content.md
│   │   ├── 02-excellent-content.md
│   │   └── ...
│   ├── ground-truth/
│   │   ├── 01-expected.json
│   │   ├── 02-expected.json
│   │   └── ...
│   ├── results/
│   │   ├── run-001-results.md
│   │   ├── run-002-results.md
│   │   └── summary.md
│   ├── METRICS.md
│   ├── README.md
│   └── QUICK-START.md
└── content-publishing-specialist/
    └── [similar structure]
```
## Error Messages

### Agent not found

```
❌ Error: Agent 'xyz' not found in registry

Available agents:
- seo-specialist
- content-publishing-specialist
- weekly-planning-specialist

Did you mean: seo-specialist?

To create a new benchmark:
/benchmark-agent --create xyz
```
### No test suite

```
❌ Error: No test suite found for 'my-agent'

The agent is registered but has no test cases.

Create a benchmark suite:
/benchmark-agent --create my-agent
```
### Below threshold

```
⚠️ Warning: Agent scored below threshold

Score: 75/100
Threshold: 80/100
Status: ❌ FAIL

Recommendation: Do NOT deploy
- Review failing tests
- Investigate regressions
- Iterate on the agent prompt
- Re-run the benchmark
```
## Tips

### Tip 1: Run before deploying

```bash
# Made prompt changes?
# Run the benchmark before deploying
/benchmark-agent my-agent

# Only deploy if:
# - Score ≥ 80/100
# - No regressions on critical tests
# - Improvement over baseline (ideally)
```

### Tip 2: Weekly health checks

```bash
# Set up a weekly routine.
# Every Monday morning:
/benchmark-agent --all

# Review the summary
# Investigate any regressions
# Celebrate improvements
```

### Tip 3: Use reports in PRs

```bash
# Making agent changes in a PR?
# Include the benchmark results
/benchmark-agent my-agent --report-only

# Copy the markdown to the PR description
# Show before/after scores
# Justify changes with data
```

### Tip 4: Track improvement journeys

Document your agent's evolution:

```
Week 1: 88/100 (baseline)
Week 2: 90/100 (+2 - added calibration)
Week 3: 92/100 (+2 - improved recommendations)
Week 4: 94/100 (+2 - edge case handling)
```

Great content for:

- Blog posts
- Case studies
- Team updates
## Next Steps

After creating your first benchmark:

- ✅ Run it - Get a baseline score
- ✅ Review results - Understand strengths and weaknesses
- ✅ Iterate - Improve the agent prompt based on data
- ✅ Re-run - Validate improvements
- ✅ Deploy - Ship the better agent to production

After establishing multiple benchmarks:

- ✅ Schedule weekly runs - `/benchmark-agent --all`
- ✅ Track trends - Performance history over time
- ✅ Rotate tests - Keep agents challenged
- ✅ Share results - Marketing content, team updates
## Learn More

- Getting Started Guide - Installation and your first benchmark
- Test Creation Guide - How to design effective tests
- Scoring Rubrics - How to create fair scoring
- Advanced Usage - Test rotation, tips, best practices
## Troubleshooting

**Problem:** Command not found
**Solution:** Run the install script: `./scripts/install.sh`

**Problem:** Agent execution timeout
**Solution:** Increase the timeout in config or simplify the test case

**Problem:** Judge scoring seems incorrect
**Solution:** Review the ground truth expectations and update the rubric

**Problem:** Can't find test files
**Solution:** Check the directory structure and ensure files are in the correct location
## Support

- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Docs: Full Documentation

Built with ❤️ by BrandCast

*Automated agent QA for production use.*