Evaluation and Testing for Subagents

<evaluation_framework>

<task_completion> Primary metric: Proportion of tasks completed correctly and satisfactorily.

Measure:

  • Did the subagent complete the requested task?
  • Did it produce the expected output?
  • Would a human consider the task "done"?

Testing approach: Create test cases with known expected outcomes, invoke subagent, compare results. </task_completion>
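A minimal harness for this might look like the sketch below. It assumes a hypothetical `invoke_subagent` runner built on the Claude Code CLI's headless `-p` flag; swap in whatever entry point your setup actually uses.

```python
# Minimal task-completion harness. The `claude -p` headless invocation is an
# assumption about your setup; replace invoke_subagent with your own runner.
import subprocess
from dataclasses import dataclass

@dataclass
class TaskCase:
    task: str                     # instruction given to the subagent
    expected_phrases: list[str]   # strings a correct result must contain

def invoke_subagent(task: str) -> str:
    result = subprocess.run(["claude", "-p", task],
                            capture_output=True, text=True, timeout=600)
    return result.stdout

def task_completed(case: TaskCase) -> bool:
    output = invoke_subagent(case.task).lower()
    return all(phrase.lower() in output for phrase in case.expected_phrases)

cases = [
    TaskCase("Review the latest commit for security issues",
             expected_phrases=["SQL injection", "critical"]),
]
completion_rate = sum(task_completed(c) for c in cases) / len(cases)
print(f"Task completion rate: {completion_rate:.0%}")
```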

<tool_correctness> Secondary metric: Whether subagent calls correct tools for given task.

Measure:

  • Are tool selections appropriate for the task?
  • Does it use tools efficiently (not calling unnecessary tools)?
  • Does it use tools in correct sequence?

Testing approach: Review tool call patterns in execution logs. </tool_correctness>
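One way to automate the review, assuming your logging captures each invocation's tool calls as an ordered list of tool names (the log representation and expected sequence here are illustrative):

```python
# Tool-correctness check over an execution log. EXPECTED_SEQUENCE and the
# log format (a list of tool-name strings) are assumptions to adapt.
EXPECTED_SEQUENCE = ["Grep", "Read", "Bash"]   # tools expected, in this relative order
ALLOWED_TOOLS = set(EXPECTED_SEQUENCE)

def check_tool_calls(tool_calls: list[str]) -> dict:
    unexpected = [t for t in tool_calls if t not in ALLOWED_TOOLS]
    # Expected tools must appear in relative order; other calls may interleave.
    remaining = iter(tool_calls)
    in_order = all(any(call == expected for call in remaining)
                   for expected in EXPECTED_SEQUENCE)
    return {"unexpected_tools": unexpected,
            "expected_order_followed": in_order,
            "total_calls": len(tool_calls)}

print(check_tool_calls(["Grep", "Read", "Read", "Bash"]))
# {'unexpected_tools': [], 'expected_order_followed': True, 'total_calls': 4}
```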

<output_quality> Quality metric: Assess quality of subagent-generated outputs.

Measure:

  • Accuracy of analysis
  • Completeness of coverage
  • Clarity of communication
  • Adherence to specified format

Testing approach: Human review or LLM-as-judge evaluation. </output_quality>

<error_handling> Resilience metric: How well subagent handles failures and edge cases.

Measure:

  • Graceful handling of missing files
  • Recovery from tool failures
  • Appropriate responses to unexpected inputs
  • Boundary condition handling

Testing approach: Inject failures (missing files, malformed data) and verify responses. </error_handling>
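A fault-injection sketch along these lines, again assuming a headless `claude -p` runner as the entry point (an assumption about your setup):

```python
# Point the subagent at a deliberately broken fixture (here: an empty
# directory with no git repo) and assert it degrades gracefully.
import subprocess, tempfile

def invoke_subagent(task: str, cwd: str) -> str:
    # Headless `claude -p` is assumed; substitute your actual runner.
    return subprocess.run(["claude", "-p", task], cwd=cwd,
                          capture_output=True, text=True).stdout

def test_handles_missing_git_repo():
    with tempfile.TemporaryDirectory() as workdir:   # no .git, no code
        output = invoke_subagent(
            "Review the most recent commit for security issues", cwd=workdir)
        lowered = output.lower()
        assert "git" in lowered or "repository" in lowered   # explains the problem
        assert "traceback" not in lowered                     # no raw crash output
```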

<performance> Performance metrics: Response time and resource usage.

Measure:

  • Token usage (cost)
  • Latency (response time)
  • Number of tool calls

Testing approach: Monitor metrics across multiple invocations, track trends. </performance> </evaluation_framework>
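A lightweight way to collect these numbers per invocation, assuming your runner can report a token count (where that count comes from depends on your setup):

```python
# Record latency and token usage per invocation to a CSV and watch the trend.
import csv, time
from pathlib import Path

METRICS_FILE = Path("subagent_metrics.csv")

def record_invocation(task: str, run_fn) -> str:
    """run_fn(task) -> (output_text, tokens_used); provided by your runner."""
    start = time.monotonic()
    output, tokens_used = run_fn(task)
    latency = time.monotonic() - start
    is_new = not METRICS_FILE.exists()
    with METRICS_FILE.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["timestamp", "task", "latency_s", "tokens"])
        writer.writerow([time.time(), task, f"{latency:.2f}", tokens_used])
    return output
```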

<g_eval>

G-Eval: Use LLMs with chain-of-thought to evaluate outputs against ANY custom criteria defined in natural language.

**Custom criterion**: "Security review completeness"
Evaluate the security review output on a 1-5 scale:

1. Missing critical vulnerability types
2. Covers basic vulnerabilities but misses some common patterns
3. Covers standard OWASP Top 10 vulnerabilities
4. Comprehensive coverage including framework-specific issues
5. Exceptional coverage including business logic vulnerabilities

Think step-by-step about which vulnerabilities were checked and which were missed.

Implementation: Pass subagent output and criteria to Claude, get structured evaluation.
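A sketch of that call using the Anthropic Python SDK; the model id and the SCORE-line parsing convention are illustrative choices, not fixed requirements:

```python
# G-Eval style judge: send the rubric plus the subagent's output to Claude
# and parse a 1-5 score from the response.
import re
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

RUBRIC = """Evaluate the security review output on a 1-5 scale:
1. Missing critical vulnerability types
2. Covers basic vulnerabilities but misses some common patterns
3. Covers standard OWASP Top 10 vulnerabilities
4. Comprehensive coverage including framework-specific issues
5. Exceptional coverage including business logic vulnerabilities

Think step-by-step about which vulnerabilities were checked and which were
missed, then end your answer with a line: SCORE: <1-5>"""

def g_eval(subagent_output: str) -> int:
    message = client.messages.create(
        model="claude-sonnet-4-20250514",   # illustrative model id
        max_tokens=1024,
        messages=[{"role": "user",
                   "content": f"{RUBRIC}\n\nOutput to evaluate:\n{subagent_output}"}],
    )
    text = message.content[0].text
    match = re.search(r"SCORE:\s*([1-5])", text)
    return int(match.group(1)) if match else 0
```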

When to use: Complex quality metrics that can't be measured programmatically (thoroughness, insight quality, appropriateness of recommendations). </g_eval>

<validation_strategies>

<offline_testing> Offline validation: Test before deployment with synthetic scenarios.

Process:

  1. Create representative test cases covering:
    • Happy path scenarios
    • Edge cases (boundary conditions, unusual inputs)
    • Error conditions (missing data, tool failures)
    • Adversarial inputs (malformed, malicious)
  2. Invoke subagent with each test case
  3. Compare outputs to expected results
  4. Document failures and iterate on prompt

Example test suite for code-reviewer subagent:

Test 1 (Happy path): Recent commit with SQL injection vulnerability
Expected: Identifies SQL injection, provides fix, rates as Critical

Test 2 (Edge case): No recent code changes
Expected: Confirms review completed, no issues found

Test 3 (Error condition): Git repository not initialized
Expected: Gracefully handles missing git, provides helpful message

Test 4 (Adversarial): Obfuscated code with hidden vulnerability
Expected: Identifies pattern despite obfuscation
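The same suite can be expressed as data for an automated runner; the fixture paths and required phrases below are illustrative:

```python
# Each case pairs a fixture (how the scenario is set up) with simple
# expectations on the output. `invoke(fixture_path)` is your own runner.
TEST_SUITE = [
    {"name": "happy_path_sql_injection",
     "fixture": "fixtures/repo_with_sqli_commit",
     "must_mention": ["sql injection", "critical"]},
    {"name": "no_recent_changes",
     "fixture": "fixtures/clean_repo",
     "must_mention": ["no issues"]},
    {"name": "missing_git_repo",
     "fixture": "fixtures/empty_dir",
     "must_mention": ["git"]},
    {"name": "obfuscated_vulnerability",
     "fixture": "fixtures/obfuscated_sqli",
     "must_mention": ["injection"]},
]

def run_suite(invoke) -> list[str]:
    failures = []
    for case in TEST_SUITE:
        output = invoke(case["fixture"]).lower()
        if not all(phrase in output for phrase in case["must_mention"]):
            failures.append(case["name"])
    return failures   # empty list means every case passed
```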

</offline_testing>

<simulation_testing> Simulation testing: Run subagent in realistic but controlled environments.

Use cases:

  • Testing against historical issues (can it find bugs that were previously fixed?)
  • Benchmark datasets (SWE-bench for code agents)
  • Controlled codebases with known vulnerabilities

Benefit: Higher confidence than synthetic tests, safer than production testing. </simulation_testing>

<online_monitoring> Production monitoring: Track metrics during real usage.

Key metrics:

  • Success rate (completed vs failed tasks)
  • User satisfaction (explicit feedback)
  • Retry rate (how often users reinvoke after failure)
  • Token usage trends (increasing = potential prompt issues)
  • Error rates by error type

Implementation: Log all invocations with context, outcomes, and metrics. Review regularly for patterns. </online_monitoring> </validation_strategies>
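A minimal logging setup for this, writing one JSON line per invocation (the field names are a suggested starting point):

```python
# Append one JSON line per invocation, then review the log for success rate,
# retries, token trends, and recurring error types.
import json, time
from pathlib import Path

LOG_PATH = Path("subagent_invocations.jsonl")

def log_invocation(task: str, outcome: str, tokens: int, error: str | None = None):
    entry = {"timestamp": time.time(), "task": task,
             "outcome": outcome,       # "success", "failure", or "retry"
             "tokens": tokens, "error": error}
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def success_rate() -> float:
    entries = [json.loads(line) for line in LOG_PATH.read_text().splitlines()]
    return sum(e["outcome"] == "success" for e in entries) / max(len(entries), 1)
```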

<evaluation_driven_development>

Philosophy: Integrate evaluation throughout subagent lifecycle, not just at validation stage.

  1. **Initial creation**: Define success criteria before writing prompt
  2. **Development**: Test after each prompt iteration
  3. **Pre-deployment**: Comprehensive offline testing
  4. **Deployment**: Online monitoring with metrics collection
  5. **Iteration**: Regular review of failures, update prompt based on learnings
  6. **Continuous**: Ongoing evaluation → feedback → refinement cycles

Anti-pattern: Writing subagent, deploying, never measuring effectiveness or iterating.

Best practice: Treat subagent prompts as living documents that evolve based on real-world performance data. </evaluation_driven_development>

<testing_checklist>

<before_deployment> Before deploying a subagent, complete this validation:

Basic functionality:

  • Invoke with representative task, verify completion
  • Check output format matches specification
  • Verify workflow steps are followed in sequence
  • Confirm constraints are respected

Edge cases:

  • Test with missing/incomplete data
  • Test with unusual but valid inputs
  • Test with boundary conditions (empty files, large files, etc.)

Error handling:

  • Test with unavailable tools (if tool access restricted)
  • Test with malformed inputs
  • Verify graceful degradation when ideal path fails

Quality checks:

  • Human review of outputs for accuracy
  • Verify no hallucinations or fabricated information
  • Check output is actionable and useful

Security:

  • Verify tool access follows least privilege
  • Check for potential unsafe operations
  • Ensure sensitive data handling is appropriate

Documentation:

  • Description field clearly indicates when to use
  • Role and focus areas are specific
  • Workflow is complete and logical </before_deployment> </testing_checklist>

<synthetic_data>

<when_to_use> Synthetic data generation is useful for:

  • Cold starts: No real usage data yet
  • Edge cases: Rare scenarios hard to capture from real data
  • Adversarial testing: Security, robustness testing
  • Scenario coverage: Systematic coverage of input space </when_to_use>

<generation_approaches> Persona-based generation: Create test cases from different user personas.

Persona: Junior developer
Task: "Fix the bug where the login page crashes"
Expected behavior: Subagent provides detailed debugging steps

Persona: Senior engineer
Task: "Investigate authentication flow security"
Expected behavior: Subagent performs deep security analysis

Scenario simulation: Generate variations of common scenarios.

Scenario: SQL injection vulnerability review
Variations:
- Direct SQL concatenation
- ORM with raw queries
- Prepared statements (should pass)
- Stored procedures with dynamic SQL

</generation_approaches>
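A small sketch of scenario expansion, turning the SQL injection variations above into synthetic test cases (the code snippets are deliberately tiny stand-ins, not realistic application code):

```python
# Expand scenario variations into synthetic cases with an expected verdict.
VARIATIONS = {
    "direct_concatenation": (
        'query = "SELECT * FROM users WHERE id = " + user_id', "flag"),
    "orm_raw_query": (
        'db.execute(f"SELECT * FROM users WHERE id = {user_id}")', "flag"),
    "prepared_statement": (
        'db.execute("SELECT * FROM users WHERE id = ?", (user_id,))', "pass"),
    "stored_procedure_dynamic_sql": (
        "EXEC('SELECT * FROM users WHERE id = ' + @user_id)", "flag"),
}

def synthetic_cases():
    for name, (snippet, expected_verdict) in VARIATIONS.items():
        yield {"name": f"sqli_{name}", "code": snippet,
               "expected_verdict": expected_verdict}

for case in synthetic_cases():
    print(case["name"], "->", case["expected_verdict"])
```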

<critical_limitation> Never rely exclusively on synthetic data.

Maintain a validation set of real usage examples. Synthetic data can miss:

  • Real-world complexity
  • Actual user intent patterns
  • Production environment constraints
  • Emergent usage patterns

Best practice: 70% synthetic (for coverage), 30% real (for reality check). </critical_limitation> </synthetic_data>

<llm_as_judge>

<basic_pattern> Use LLM to evaluate subagent outputs when human review is impractical at scale.

Example evaluation prompt:

You are evaluating a security code review performed by an AI subagent.

Review output:
{subagent_output}

Code that was reviewed:
{code}

Evaluate on these criteria:
1. Accuracy: Are identified vulnerabilities real? (Yes/Partial/No)
2. Completeness: Were obvious vulnerabilities missed? (None missed/Some missed/Many missed)
3. Actionability: Are fixes specific and implementable? (Very/Somewhat/Not really)

Provide:
- Overall grade (A/B/C/D/F)
- Specific issues with the review
- What a human reviewer would have done differently
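When running this prompt at scale, a template plus a structured grade line keeps results machine-readable; the model call itself can follow the same SDK pattern as the G-Eval sketch earlier. The GRADE-line convention here is an added assumption:

```python
# Fill the judge prompt and pull out the letter grade from the response.
import re

JUDGE_TEMPLATE = """You are evaluating a security code review performed by an AI subagent.

Review output:
{subagent_output}

Code that was reviewed:
{code}

(criteria 1-3 as above)

End your answer with a line: GRADE: <A/B/C/D/F>"""

def build_judge_prompt(subagent_output: str, code: str) -> str:
    return JUDGE_TEMPLATE.format(subagent_output=subagent_output, code=code)

def extract_grade(judge_response: str) -> str | None:
    match = re.search(r"GRADE:\s*([ABCDF])", judge_response)
    return match.group(1) if match else None
```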

</basic_pattern>

<comparison_pattern> Ground truth comparison: When correct answer is known.

Expected vulnerabilities in test code:
1. SQL injection on line 42
2. XSS vulnerability on line 67
3. Missing authentication check on line 103

Subagent identified:
{subagent_findings}

Calculate:
- Precision: % of identified issues that are real
- Recall: % of real issues that were identified
- F1 score: Harmonic mean of precision and recall
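A direct way to compute these, reducing findings to coarse (vulnerability type, line) pairs so matches are easy to count; the matching granularity is a judgment call:

```python
# Precision / recall / F1 of subagent findings against known ground truth.
def score_findings(expected: set, identified: set) -> dict:
    true_positives = len(expected & identified)
    precision = true_positives / len(identified) if identified else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

expected = {("sql_injection", 42), ("xss", 67), ("missing_auth", 103)}
identified = {("sql_injection", 42), ("xss", 67), ("hardcoded_secret", 12)}
print(score_findings(expected, identified))
# precision 0.67, recall 0.67, f1 0.67
```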

</comparison_pattern> </llm_as_judge>

<test_driven_development>

Anthropic guidance: "Test-driven development becomes even more powerful with agentic coding."

**Before writing subagent prompt**:

  1. Define expected input/output pairs
  2. Create test cases that subagent must pass
  3. Write initial prompt
  4. Run tests, observe failures
  5. Refine prompt based on failures
  6. Repeat until all tests pass

Example for test-writer subagent:

Test 1:
Input: Function that adds two numbers
Expected output: Test file with:
  - Happy path (2 + 2 = 4)
  - Edge cases (0 + 0, negative numbers)
  - Type errors (string + number)

Test 2:
Input: Async function that fetches user data
Expected output: Test file with:
  - Successful fetch
  - Network error handling
  - Invalid user ID handling
  - Mocked HTTP calls (no real API calls)

Invoke subagent → check if outputs match expectations → iterate on prompt.
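A bare-bones version of that loop; `invoke_subagent(prompt, task)` is a placeholder for however you run the subagent with a candidate prompt:

```python
# Run every test case against the current prompt and report what failed;
# edit the prompt and rerun until the failure list is empty.
def run_tdd_cycle(prompt_path: str, cases: list[dict], invoke_subagent) -> list[str]:
    prompt = open(prompt_path).read()
    failures = []
    for case in cases:
        output = invoke_subagent(prompt, case["input"]).lower()
        missing = [req for req in case["required"] if req.lower() not in output]
        if missing:
            failures.append(f"{case['name']}: missing {missing}")
    return failures

cases = [
    {"name": "sync_add_function",
     "input": "def add(a, b): return a + b",
     "required": ["happy path", "negative", "TypeError"]},
    {"name": "async_fetch_user",
     "input": "async def fetch_user(user_id): ...",
     "required": ["mock", "network error", "invalid user"]},
]
```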

Benefit: Clear acceptance criteria before development, objective measure of prompt quality. </test_driven_development>

<anti_patterns>

<anti_pattern name="no_testing"> Deploying subagents without any validation

Risk: Subagent fails on real tasks, wastes user time, damages trust.

Fix: Minimum viable testing = invoke with 3 representative tasks before deploying. </anti_pattern>

<anti_pattern name="only_happy_path"> Testing only ideal scenarios

Risk: Subagent fails on edge cases, error conditions, or unusual (but valid) inputs.

Fix: Test matrix covering happy path, edge cases, and error conditions. </anti_pattern>

<anti_pattern name="no_metrics"> No measurement of effectiveness

Risk: Can't tell if prompt changes improve or degrade performance.

Fix: Define at least one quantitative metric (task completion rate, output quality score). </anti_pattern>

<anti_pattern name="test_once_deploy_forever"> Testing once at creation, never revisiting

Risk: Subagent degrades over time as usage patterns shift, codebases change, or models update.

Fix: Periodic re-evaluation with current usage patterns and edge cases. </anti_pattern> </anti_patterns>