gh-brandcast-signage-agent-…/commands/benchmark-agent.md
2025-11-29 18:02:28 +08:00

description
Run automated benchmark tests on Claude Code agents and track performance over time

Usage

# Create a new benchmark suite
/benchmark-agent --create <agent-name>

# Run benchmarks
/benchmark-agent <agent-name>
/benchmark-agent --all
/benchmark-agent --all --marketing
/benchmark-agent --all --tech

# Advanced options
/benchmark-agent <agent-name> --rotate
/benchmark-agent --report-only
/benchmark-agent <agent-name> --verbose
/benchmark-agent <agent-name> --marketing-summary

Commands

Create New Benchmark

/benchmark-agent --create my-content-agent

What happens:

  1. Launches test-suite-creator agent
  2. Asks you 5 questions about your agent
  3. Generates complete benchmark suite:
    • 5 diverse test cases
    • Ground truth expectations (JSON)
    • Scoring rubric (METRICS.md)
    • Documentation

Time: < 1 hour from start to first benchmark


Run Single Agent

/benchmark-agent seo-specialist

What happens:

  1. Loads test suite for seo-specialist
  2. Executes all test cases
  3. Scores results via benchmark-judge
  4. Updates performance history
  5. Generates detailed report

Output:

# Benchmark Results: seo-specialist

Overall Score: 90/100 ✅ PASS
Trend: ⬆️ Improving (+2 from baseline)

Individual Tests:
- Test #01: 82/100 ✓
- Test #02: 96/100 ✓
- Test #03: 92/100 ✓

Recommendation: DEPLOY v2

Time: 2-5 minutes (depends on agent complexity)
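The overall score above is the average of the individual test scores, and the trend compares it to the stored baseline. A minimal sketch of that aggregation (the rounding and trend labels here are illustrative assumptions, not the tool's exact logic):

```python
# Sketch of how the overall score and trend line could be derived.
# The averaging, rounding, and trend labels are assumptions.

def summarize(test_scores, baseline, threshold=80):
    overall = round(sum(test_scores) / len(test_scores))
    delta = overall - baseline
    trend = ("⬆️ Improving" if delta > 0
             else "➡️ Stable" if delta == 0
             else "⬇️ Regressing")
    status = "✅ PASS" if overall >= threshold else "❌ FAIL"
    return overall, delta, trend, status

overall, delta, trend, status = summarize([82, 96, 92], baseline=88)
print(f"Overall Score: {overall}/100 {status}")       # Overall Score: 90/100 ✅ PASS
print(f"Trend: {trend} ({delta:+d} from baseline)")   # Trend: ⬆️ Improving (+2 from baseline)
```

The example scores (82, 96, 92) average to exactly the 90/100 shown in the report above.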


Run All Agents

/benchmark-agent --all

What happens:

  1. Loads all agents from registry
  2. Runs benchmark on each
  3. Generates summary report

Filters:

/benchmark-agent --all --marketing  # Marketing agents only
/benchmark-agent --all --tech       # Tech repo agents only

Output:

# Benchmark Results: All Agents

Summary:
| Agent                  | Score  | Status | Trend |
|------------------------|--------|--------|-------|
| seo-specialist         | 90/100 | ✅ Pass | ⬆️ +2  |
| content-publishing     | 97/100 | ✅ Pass | ➡️  0  |
| weekly-planning        | 85/100 | ✅ Pass | ⬆️ +3  |

Overall health: 6/7 passing (85.7%)

Time: 10-30 minutes (depends on number of agents)
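The "Overall health" line in the summary is just the share of agents at or above the pass threshold. For instance (the per-agent scores below are hypothetical, chosen to match the 6/7 example):

```python
# Illustrative pass-rate calculation for the --all summary line.
# Agent names and scores are hypothetical examples.
scores = {"seo-specialist": 90, "content-publishing": 97, "weekly-planning": 85,
          "agent-4": 88, "agent-5": 91, "agent-6": 83, "agent-7": 75}

passing = sum(1 for s in scores.values() if s >= 80)  # threshold of 80
rate = round(passing / len(scores) * 100, 1)
print(f"Overall health: {passing}/{len(scores)} passing ({rate}%)")
# Overall health: 6/7 passing (85.7%)
```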


Report Only

/benchmark-agent --report-only
/benchmark-agent seo-specialist --report-only

What happens:

  1. Skips test execution
  2. Reads latest run from history
  3. Generates report from stored data

Use cases:

  • Quick status check
  • Share results with team
  • Review historical performance

Time: < 5 seconds


Test Rotation

/benchmark-agent seo-specialist --rotate

What happens:

  1. Runs normal benchmark
  2. Analyzes test performance
  3. Suggests new tests (if the agent is scoring 95+)
  4. Suggests retiring tests (if a test has scored 100 three times in a row)
  5. You approve/reject suggestions

Example output:

## Test Rotation Suggestion

Current performance:
- Test #01: 95/100
- Test #02: 96/100
- Test #03: 97/100

Recommendation: Add Test #04 (long-form listicle)

Rationale:
- Agent mastering current tests
- Need to test SEO on 2000+ word content
- Listicle format has unique challenges

Accept? (yes/no)
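The rotation heuristic described above (add tests when the agent scores 95+, retire tests after three perfect runs) can be sketched as follows. The thresholds come from this doc, but the function itself is illustrative:

```python
# Illustrative sketch of the test-rotation rules; not the tool's actual code.

def rotation_suggestions(history):
    """history: {test_id: [scores, most recent last]}"""
    suggestions = []
    latest = [runs[-1] for runs in history.values()]
    if latest and min(latest) >= 95:
        # Agent is mastering every current test: suggest adding a harder one.
        suggestions.append("add a new, harder test")
    for test_id, runs in history.items():
        if len(runs) >= 3 and all(s == 100 for s in runs[-3:]):
            # Three perfect runs in a row: suggest retiring the test.
            suggestions.append(f"retire {test_id} (perfect 3x)")
    return suggestions

print(rotation_suggestions({
    "01": [90, 95, 95],
    "02": [100, 100, 100],
    "03": [96, 97, 97],
}))
```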

Verbose Mode

/benchmark-agent seo-specialist --verbose

What happens: Shows detailed execution steps:

  • Test file loading
  • Agent invocation
  • Judge scoring process
  • Performance calculation

Use for:

  • Debugging
  • Understanding workflow
  • Investigating unexpected results

Marketing Summary

/benchmark-agent seo-specialist --marketing-summary

What happens: Generates marketing-ready content about agent performance.

Output:

# seo-specialist Performance Update

Latest score: 90/100 ✅
Improvement: +2.3% over 8 days

What Improved:
✨ More accurate scoring on mediocre content
✨ Zero false positives on excellent content
✨ Consistent spam detection

Real-World Impact:
Automated SEO auditing for blog posts with improved accuracy.

*Benchmarked using Agent Benchmark Kit*

Use for:

  • Blog posts
  • Social media
  • Performance updates
  • Customer communication

Configuration

Registry File

Location: ~/.agent-benchmarks/registry.yml

Structure:

agents:
  seo-specialist:
    name: "seo-specialist"
    location: "marketing"
    test_suite: "~/.agent-benchmarks/seo-specialist/"
    baseline_score: 88
    target_score: 90
    status: "production"

  content-publishing-specialist:
    name: "content-publishing-specialist"
    location: "marketing"
    test_suite: "~/.agent-benchmarks/content-publishing-specialist/"
    baseline_score: 97.5
    target_score: 95
    status: "production"

Add new agent:

/benchmark-agent --create my-agent
# Automatically adds to registry

Performance History

Location: ~/.agent-benchmarks/performance-history.json

Structure:

{
  "seo-specialist": {
    "baseline": { "version": "v1", "score": 88 },
    "current": { "version": "v2", "score": 90 },
    "runs": [
      {
        "id": "run-001",
        "timestamp": "2025-11-01T10:00:00Z",
        "score": 88,
        "tests": {...}
      },
      {
        "id": "run-002",
        "timestamp": "2025-11-09T14:30:00Z",
        "score": 90,
        "tests": {...}
      }
    ]
  }
}

Managed automatically - no manual editing needed
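Reading the stored history back out to report a trend is straightforward. A minimal sketch (field names match the structure shown above; error handling omitted):

```python
import json

# Minimal reader for a performance-history.json entry, using the
# structure shown above (embedded here as a string for demonstration).
raw = """{
  "seo-specialist": {
    "baseline": {"version": "v1", "score": 88},
    "current": {"version": "v2", "score": 90},
    "runs": [
      {"id": "run-001", "timestamp": "2025-11-01T10:00:00Z", "score": 88},
      {"id": "run-002", "timestamp": "2025-11-09T14:30:00Z", "score": 90}
    ]
  }
}"""

history = json.loads(raw)
agent = history["seo-specialist"]
delta = agent["current"]["score"] - agent["baseline"]["score"]
print(f"{agent['current']['version']}: {agent['current']['score']}/100 ({delta:+d} vs baseline)")
# v2: 90/100 (+2 vs baseline)
```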


Examples

Example 1: Create and run first benchmark

# 1. Create benchmark suite
/benchmark-agent --create seo-specialist

# Answer questions:
# > What does your agent do?
#   "Audits blog posts for SEO optimization"
# > What validations does it perform?
#   "Keyword usage, meta descriptions, content length, structure"
# > What are edge cases?
#   "Keyword stuffing, perfect content, very short content"
# > What's perfect output?
#   "700+ words, good keyword density, proper structure"
# > What's failing output?
#   "Thin content, no meta, keyword stuffing"

# 2. Review generated suite
ls ~/.agent-benchmarks/seo-specialist/

# 3. Run benchmark
/benchmark-agent seo-specialist

# 4. View results
# (Automatically displayed)

Example 2: Weekly benchmark run

# Run all production agents
/benchmark-agent --all

# Review summary
# Identify any regressions
# Investigate agents below threshold

Example 3: After prompt changes

# Made changes to agent prompt
# Want to validate improvement

# Run benchmark
/benchmark-agent seo-specialist

# Compare to baseline
# Look for:
# - Overall score increase
# - Specific test improvements
# - No new regressions

Example 4: Generate marketing content

# Agent improved, want to share

/benchmark-agent seo-specialist --marketing-summary

# Copy output to blog post
# Share on social media
# Include in documentation

Workflow Behind the Scenes

When you run /benchmark-agent seo-specialist, this happens:

  1. Slash command receives input
  2. Invokes benchmark-orchestrator agent
  3. Orchestrator:
    • Loads agent config
    • For each test:
      • Reads test file
      • Invokes agent under test
      • Captures output
      • Invokes benchmark-judge
      • Records score
    • Calculates overall score
    • Updates performance history
    • Generates report
  4. Returns report to you

You see: the final report.
Behind the scenes: the full orchestration workflow.
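With the agent and judge invocations stubbed out, the orchestration loop above reduces to a simple sketch (run_agent and judge here are stand-ins for the real benchmark-orchestrator's invocations):

```python
# Illustrative orchestration loop; run_agent and judge are stubs
# standing in for the real agent-under-test and benchmark-judge calls.

def run_benchmark(test_cases, run_agent, judge):
    results = {}
    for test_id, test_input in test_cases.items():
        output = run_agent(test_input)                 # invoke agent under test
        results[test_id] = judge(test_input, output)   # score via judge
    overall = round(sum(results.values()) / len(results))
    return overall, results

# Stubbed demo:
tests = {"01": "mediocre post", "02": "excellent post"}
overall, scores = run_benchmark(
    tests,
    run_agent=lambda t: f"audit of {t}",
    judge=lambda t, o: 90,
)
print(overall, scores)  # 90 {'01': 90, '02': 90}
```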


Directory Structure

~/.agent-benchmarks/
├── registry.yml                          # Agent registry
├── performance-history.json              # All agent history
├── seo-specialist/                       # Agent benchmark suite
│   ├── test-cases/
│   │   ├── TEST-METADATA.md
│   │   ├── 01-mediocre-content.md
│   │   ├── 02-excellent-content.md
│   │   └── ...
│   ├── ground-truth/
│   │   ├── 01-expected.json
│   │   ├── 02-expected.json
│   │   └── ...
│   ├── results/
│   │   ├── run-001-results.md
│   │   ├── run-002-results.md
│   │   └── summary.md
│   ├── METRICS.md
│   ├── README.md
│   └── QUICK-START.md
└── content-publishing-specialist/
    └── [similar structure]

Error Messages

Agent not found

❌ Error: Agent 'xyz' not found in registry

Available agents:
- seo-specialist
- content-publishing-specialist
- weekly-planning-specialist

Did you mean: seo-specialist?

To create new benchmark:
/benchmark-agent --create xyz
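A "Did you mean" hint like the one above can be produced with stdlib fuzzy matching. A sketch using the registered agent names (the cutoff value is an assumption):

```python
import difflib

registered = ["seo-specialist", "content-publishing-specialist",
              "weekly-planning-specialist"]

def suggest(name):
    # Return the closest registered agent name, or None if nothing is close.
    matches = difflib.get_close_matches(name, registered, n=1, cutoff=0.6)
    return matches[0] if matches else None

print(suggest("seo-specialst"))  # seo-specialist
```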

No test suite

❌ Error: No test suite found for 'my-agent'

The agent is registered but has no test cases.

Create benchmark suite:
/benchmark-agent --create my-agent

Below threshold

⚠️ Warning: Agent scored below threshold

Score: 75/100
Threshold: 80/100
Status: ❌ FAIL

Recommendation: Do NOT deploy
- Review failing tests
- Investigate regressions
- Iterate on agent prompt
- Re-run benchmark
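The deploy gate itself amounts to a simple threshold comparison; a minimal sketch matching the warning above:

```python
# Illustrative deploy gate; the 80/100 threshold comes from this doc.
def deploy_gate(score, threshold=80):
    if score >= threshold:
        return "✅ PASS", "DEPLOY"
    return "❌ FAIL", "Do NOT deploy"

print(deploy_gate(75))  # ('❌ FAIL', 'Do NOT deploy')
```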

Tips

Tip 1: Run before deploying

# Made prompt changes?
# Run benchmark before deploying

/benchmark-agent my-agent

# Only deploy if:
# - Score ≥ 80/100
# - No regressions on critical tests
# - Improvement over baseline (ideally)

Tip 2: Weekly health checks

# Set up weekly routine
# Every Monday morning:

/benchmark-agent --all

# Review summary
# Investigate any regressions
# Celebrate improvements

Tip 3: Use reports in PRs

# Making agent changes in PR?
# Include benchmark results

/benchmark-agent my-agent --report-only

# Copy markdown to PR description
# Show before/after scores
# Justify changes with data

Tip 4: Track improvement journeys

# Document your agent's evolution

Week 1: 88/100 (baseline)
Week 2: 90/100 (+2 - added calibration)
Week 3: 92/100 (+2 - improved recommendations)
Week 4: 94/100 (+2 - edge case handling)

# Great content for:
# - Blog posts
# - Case studies
# - Team updates

Next Steps

After creating your first benchmark:

  1. Run it - Get baseline score
  2. Review results - Understand strengths/weaknesses
  3. Iterate - Improve agent prompt based on data
  4. Re-run - Validate improvements
  5. Deploy - Ship better agent to production

After establishing multiple benchmarks:

  1. Schedule weekly runs - /benchmark-agent --all
  2. Track trends - Performance history over time
  3. Rotate tests - Keep agents challenged
  4. Share results - Marketing content, team updates

Learn More


Troubleshooting

Problem: Command not found
Solution: Run the install script: ./scripts/install.sh

Problem: Agent execution timeout
Solution: Increase the timeout in config, or simplify the test case

Problem: Judge scoring seems incorrect
Solution: Review ground truth expectations and update the rubric

Problem: Can't find test files
Solution: Check the directory structure and ensure files are in the correct location


Support


Built with ❤️ by BrandCast

Automated agent QA for production use.