---
name: testing-agents-with-subagents
description: Agent testing methodology - run agents with test inputs, observe outputs, iterate until outputs are accurate and well-structured.
trigger:
  - Before deploying a new agent
  - After editing an existing agent
  - Agent produces structured outputs that must be accurate
skip_when:
  - Agent is simple passthrough → minimal testing needed
  - Agent already tested for this use case
related:
  complementary:
    - test-driven-development
---

# Testing Agents With Subagents

## Overview

Testing agents is TDD applied to AI worker definitions.

You run agents with known test inputs (RED - observe incorrect outputs), fix the agent definition (GREEN - outputs now correct), then handle edge cases (REFACTOR - robust under all conditions).

Core principle: If you didn't run an agent with test inputs and verify its outputs, you don't know if the agent works correctly.

REQUIRED BACKGROUND: You MUST understand ring-default:test-driven-development before using this skill. That skill defines the fundamental RED-GREEN-REFACTOR cycle. This skill provides agent-specific test formats (test inputs, output verification, accuracy metrics).

Key difference from testing-skills-with-subagents:

  • Skills = instructions that guide behavior; test if agent follows rules under pressure
  • Agents = separate Claude instances via Task tool; test if they produce correct outputs

## The Iron Law

NO AGENT DEPLOYMENT WITHOUT RED-GREEN-REFACTOR TESTING FIRST

About to deploy an agent without completing the test cycle? You have ONLY one option:

STOP. TEST FIRST. THEN DEPLOY.

You CANNOT:

  • "Deploy and monitor for issues"
  • "Test with first real usage"
  • "Quick smoke test is enough"
  • "Tested manually in Claude UI"
  • "One test case passed"
  • "Agent prompt looks correct"
  • "Based on working template"
  • "Deploy now, test in parallel"
  • "Production is down, no time to test"

ZERO exceptions. Simple agent, expert confidence, time pressure, production outage - NONE override testing.

Why this is absolute: Untested agents fail in production. Every time. The question is not IF but WHEN and HOW BADLY. A 20-minute test suite prevents hours of debugging and lost trust.

## When to Use

Test agents that:

  • Analyze code/designs and produce findings (reviewers)
  • Generate structured outputs (planners, analyzers)
  • Make decisions or categorizations (severity, priority)
  • Have defined output schemas that must be followed
  • Are used in parallel workflows where consistency matters

Test exemptions require explicit human partner approval:

  • Simple pass-through agents (just reformatting) - only if human partner confirms
  • Agents without structured outputs - only if human partner confirms
  • You CANNOT self-determine test exemption
  • When in doubt → TEST

## TDD Mapping for Agent Testing

| TDD Phase | Agent Testing | What You Do |
|-----------|---------------|-------------|
| RED | Run with test inputs | Dispatch agent, observe incorrect/incomplete outputs |
| Verify RED | Document failures | Capture exact output issues verbatim |
| GREEN | Fix agent definition | Update prompt/schema to address failures |
| Verify GREEN | Re-run tests | Agent now produces correct outputs |
| REFACTOR | Test edge cases | Ambiguous inputs, empty inputs, complex scenarios |
| Stay GREEN | Re-verify all | Previous tests still pass after changes |

Same cycle as code TDD, different test format.

## RED Phase: Baseline Testing (Observe Failures)

Goal: Run agent with known test inputs - observe what's wrong, document exact failures.

This is identical to TDD's "write failing test first" - you MUST see what the agent actually produces before fixing the definition.

Process:

  • Create test inputs (known issues, edge cases, clean inputs)
  • Run agent - dispatch via Task tool with test inputs
  • Compare outputs - expected vs actual
  • Document failures - missing findings, wrong severity, bad format
  • Identify patterns - which input types cause failures?

### Test Input Categories

| Category | Purpose | Example |
|----------|---------|---------|
| Known Issues | Verify agent finds real problems | Code with SQL injection, hardcoded secrets |
| Clean Inputs | Verify no false positives | Well-written code with no issues |
| Edge Cases | Verify robustness | Empty files, huge files, unusual patterns |
| Ambiguous Cases | Verify judgment | Code that could go either way |
| Severity Calibration | Verify severity accuracy | Mix of critical, high, medium, low issues |

### Minimum Test Suite Requirements

Before deploying ANY agent, you MUST have:

| Agent Type | Minimum Test Cases | Required Coverage |
|------------|--------------------|-------------------|
| Reviewer agents | 6 tests | 2 known issues, 2 clean, 1 edge case, 1 ambiguous |
| Analyzer agents | 5 tests | 2 typical, 1 empty, 1 large, 1 malformed |
| Decision agents | 4 tests | 2 clear cases, 2 boundary cases |
| Planning agents | 5 tests | 2 standard, 1 complex, 1 minimal, 1 edge case |
Fewer tests = incomplete testing = DO NOT DEPLOY.

One test case proves nothing. Three tests are still suspect. Six tests are the minimum for confidence.

### Example Test Suite for Code Reviewer

```markdown
## Test Case 1: Known SQL Injection
**Input:** Function with string concatenation in SQL query
**Expected:** CRITICAL finding, references OWASP A03:2021
**Actual:** [Run agent, record output]

## Test Case 2: Clean Authentication
**Input:** Well-implemented JWT validation with proper error handling
**Expected:** No findings or LOW-severity suggestions only
**Actual:** [Run agent, record output]

## Test Case 3: Ambiguous Error Handling
**Input:** Error caught but only logged, not re-thrown
**Expected:** MEDIUM finding about silent failures
**Actual:** [Run agent, record output]

## Test Case 4: Empty File
**Input:** Empty source file
**Expected:** Graceful handling, no crash, maybe LOW finding
**Actual:** [Run agent, record output]
```
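The same suite can also be kept as structured data so a harness can iterate over it and record outputs verbatim. A minimal sketch, assuming a Python harness; the `TestCase` fields are illustrative, not a format this skill defines:

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    name: str          # e.g. "Known SQL Injection"
    input_code: str    # snippet handed to the agent
    expected: str      # expected finding, severity, or "no findings"
    actual: str = ""   # filled in verbatim after the agent runs

# Hypothetical suite mirroring the markdown test cases above
SUITE = [
    TestCase(
        name="Known SQL Injection",
        input_code='query = "SELECT * FROM users WHERE id = " + user_id',
        expected="CRITICAL finding, references OWASP A03:2021",
    ),
    TestCase(
        name="Clean Authentication",
        input_code="# well-implemented JWT validation ...",
        expected="No findings or LOW-severity suggestions only",
    ),
]
```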

### Running the Test

Use the Task tool to dispatch the agent:

````python
Task(
    subagent_type="ring-default:code-reviewer",
    prompt="""
    Review this code for security issues:

    ```python
    def get_user(user_id):
        query = "SELECT * FROM users WHERE id = " + user_id
        return db.execute(query)
    ```

    Provide findings with severity levels.
    """
)
````


**Document exact output.** Don't summarize - capture verbatim.
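To keep the RED phase reproducible, the dispatch-and-record step can be scripted. A sketch building on the `SUITE` structure above; `run_agent` stands in for the Task dispatch and is an assumption, not an API this skill provides:

```python
def red_phase(suite, run_agent):
    """Run every test case and capture the agent's output verbatim.

    `run_agent` is whatever dispatches the reviewer agent (e.g. a thin
    wrapper around the Task call above) and returns its raw output.
    """
    for case in suite:
        case.actual = run_agent(case.input_code)  # store verbatim, no summarizing
        print(f"## {case.name}")
        print(f"Expected: {case.expected}")
        print(f"Actual:   {case.actual}")
```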

## GREEN Phase: Fix Agent Definition (Make Tests Pass)

Write/update agent definition addressing specific failures documented in RED phase.

**Common fixes:**

| Failure Type | Fix Approach |
|--------------|--------------|
| Missing findings | Add explicit instructions to check for X |
| Wrong severity | Add severity calibration examples |
| Bad output format | Add output schema with examples |
| False positives | Add "don't flag X when Y" instructions |
| Incomplete analysis | Add "always check A, B, C" checklist |

### Example Fix: Severity Calibration

**RED Phase Failure:**

Agent marked hardcoded password as MEDIUM instead of CRITICAL


**GREEN Phase Fix (add to agent definition):**
```markdown
## Severity Calibration

**CRITICAL:** Immediate exploitation possible
- Hardcoded secrets (passwords, API keys, tokens)
- SQL injection with user input
- Authentication bypass

**HIGH:** Exploitation requires additional steps
- Missing input validation
- Improper error handling exposing internals

**MEDIUM:** Security weakness, not directly exploitable
- Missing rate limiting
- Verbose error messages

**LOW:** Best practice violations
- Missing security headers
- Outdated dependencies (no known CVEs)
```

### Re-run Tests

After fixing, re-run ALL test cases:

```markdown
## Test Results After Fix

| Test Case | Expected | Actual | Pass/Fail |
|-----------|----------|--------|-----------|
| SQL Injection | CRITICAL | CRITICAL | ✅ PASS |
| Clean Auth | No findings | No findings | ✅ PASS |
| Ambiguous Error | MEDIUM | MEDIUM | ✅ PASS |
| Empty File | Graceful | Graceful | ✅ PASS |
```

If any test fails: continue fixing, re-test.

## VERIFY GREEN: Output Verification

Goal: Confirm agent produces correct, well-structured outputs consistently.

### Output Schema Compliance

If agent has defined output schema, verify compliance:

```markdown
## Expected Schema
- Summary (1-2 sentences)
- Findings (array of {severity, location, description, recommendation})
- Overall assessment (PASS/FAIL with conditions)

## Actual Output Analysis
- Summary: ✅ Present, correct format
- Findings: ✅ Array, all fields present
- Overall assessment: ❌ Missing conditions for FAIL
```
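If the agent is instructed to emit JSON, schema compliance can be checked mechanically instead of by eye. A minimal sketch, assuming an output shape matching the schema above; the field and function names are illustrative:

```python
import json

REQUIRED_FINDING_FIELDS = {"severity", "location", "description", "recommendation"}

def check_schema(raw_output: str) -> list[str]:
    """Return a list of schema violations; an empty list means compliant."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    problems = []
    if not data.get("summary"):
        problems.append("missing summary")
    for i, finding in enumerate(data.get("findings", [])):
        missing = REQUIRED_FINDING_FIELDS - set(finding)
        if missing:
            problems.append(f"finding {i} missing fields: {sorted(missing)}")
    if data.get("assessment") not in {"PASS", "FAIL"}:
        problems.append("assessment must be PASS or FAIL")
    return problems
```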

### Accuracy Metrics

Track agent accuracy across the test suite:

| Metric | Target | Actual |
|--------|--------|--------|
| True Positives (found real issues) | 100% | [X]% |
| False Positives (flagged non-issues) | <10% | [X]% |
| False Negatives (missed real issues) | <5% | [X]% |
| Severity Accuracy | >90% | [X]% |
| Schema Compliance | 100% | [X]% |
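When expected and actual findings are recorded as structured data, these rates can be computed rather than estimated. A rough sketch, assuming findings are identified by a simple key such as issue-type:location (an assumption, not a required convention):

```python
def accuracy_metrics(expected: set[str], actual: set[str]) -> dict[str, float]:
    """Compare expected vs actual finding identifiers for one test run."""
    true_pos = len(expected & actual)
    return {
        "true_positive_rate": true_pos / len(expected) if expected else 1.0,
        "false_positive_rate": len(actual - expected) / len(actual) if actual else 0.0,
        "false_negative_rate": len(expected - actual) / len(expected) if expected else 0.0,
    }

# Example: the agent missed one real issue and flagged one non-issue
print(accuracy_metrics(
    expected={"sql-injection:line-3", "hardcoded-secret:line-7"},
    actual={"sql-injection:line-3", "verbose-error:line-12"},
))
```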

### Consistency Testing

Run the same test input 3 times. Outputs should be consistent:

```markdown
## Consistency Test: SQL Injection Input

Run 1: CRITICAL, SQL injection, line 3
Run 2: CRITICAL, SQL injection, line 3
Run 3: CRITICAL, SQL injection, line 3

Consistency: ✅ 100% (all runs identical findings)
```

Inconsistency indicates agent definition is ambiguous.
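A consistency check is easy to script on top of the same harness. A sketch reusing the hypothetical `run_agent` dispatcher from the RED-phase example; note that comparing raw text may be too strict, so in practice compare extracted findings:

```python
def consistency_test(run_agent, input_code: str, runs: int = 3) -> bool:
    """Dispatch the same input several times and report whether outputs match."""
    outputs = [run_agent(input_code) for _ in range(runs)]
    consistent = len(set(outputs)) == 1  # exact match; comparing parsed findings is usually better
    for i, out in enumerate(outputs, start=1):
        print(f"Run {i}: {out}")
    print("Consistency:", "100%" if consistent else "outputs differ")
    return consistent
```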

## REFACTOR Phase: Edge Cases and Robustness

Agent passes basic tests? Now test edge cases.

### Edge Case Categories

| Category | Test Cases |
|----------|------------|
| Empty/Null | Empty file, null input, whitespace only |
| Large | 10K line file, deeply nested code |
| Unusual | Minified code, generated code, config files |
| Multi-language | Mixed JS/TS, embedded SQL, templates |
| Ambiguous | Code that could be good or bad depending on context |

### Stress Testing

```markdown
## Stress Test: Large File

**Input:** 5000-line file with 20 known issues scattered throughout
**Expected:** All 20 issues found, reasonable response time
**Actual:** [Run agent, record output]

## Stress Test: Complex Nesting

**Input:** 15-level deep callback hell
**Expected:** Findings about complexity, maintainability
**Actual:** [Run agent, record output]
```
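Large-file stress inputs are easiest to grade when the issues are seeded deliberately, so you know exactly which lines should be flagged. A sketch of one way to generate such an input; the filler code is meant to be reviewed as text, not executed:

```python
import random

def make_stress_input(total_lines: int = 5000, seeded_issues: int = 20):
    """Build a synthetic file with known issues seeded at random lines.

    Returns the file text and the set of seeded line numbers, so recall
    can be measured exactly against the agent's findings.
    """
    issue_lines = set(random.sample(range(total_lines), seeded_issues))
    lines = []
    for i in range(total_lines):
        if i in issue_lines:
            lines.append(f'query = "SELECT * FROM t WHERE id = " + user_input  # seeded issue {i}')
        else:
            lines.append(f"value_{i} = transform(value_{max(i - 1, 0)})")
    return "\n".join(lines), issue_lines
```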

### Ambiguity Testing

````markdown
## Ambiguity Test: Context-Dependent Security

**Input:**
```python
# Internal admin tool, not exposed to users
password = "admin123"  # Default for local dev
```

**Expected:** Agent should note context but still flag
**Actual:** [Run agent, record output]

**Analysis:** Does agent handle nuance appropriately?
````


### Plugging Holes

For each edge case failure, add explicit handling:

**Before:**

```markdown
Review code for security issues.
```

**After:**

```markdown
Review code for security issues.

**Edge case handling:**
- Empty files: Return "No code to review" with PASS
- Large files (>5K lines): Focus on high-risk patterns first
- Minified code: Note limitations, review what's readable
- Context comments: Consider but don't use to dismiss issues
```

## Testing Parallel Agent Workflows

When agents run in parallel (like 3 reviewers), test the combined workflow:

### Parallel Consistency

```markdown
## Parallel Test: Same Input to All Reviewers

Input: Authentication module with mixed issues

| Reviewer | Findings | Overlap |
|----------|----------|---------|
| code-reviewer | 5 findings | - |
| business-logic-reviewer | 3 findings | 1 shared |
| security-reviewer | 4 findings | 2 shared |

Analysis:
- Total unique findings: 9
- Appropriate overlap (security issues found by both security and code reviewer)
- No contradictions
```
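Overlap and contradictions are easier to spot when each reviewer's findings are keyed the same way and compared as sets. A small sketch; the reviewer names match the table above, while the finding keys are illustrative:

```python
from itertools import combinations

# Hypothetical findings keyed as "issue-type:file:line"
findings = {
    "code-reviewer": {
        "null-check:auth.py:42", "sql-injection:auth.py:88", "dead-code:auth.py:10",
        "magic-number:auth.py:55", "long-function:auth.py:120",
    },
    "business-logic-reviewer": {
        "null-check:auth.py:42", "missing-audit-log:auth.py:60", "wrong-rounding:auth.py:73",
    },
    "security-reviewer": {
        "sql-injection:auth.py:88", "null-check:auth.py:42",
        "weak-hash:auth.py:95", "hardcoded-secret:auth.py:12",
    },
}

unique = set().union(*findings.values())
print(f"Total unique findings: {len(unique)}")
for a, b in combinations(findings, 2):
    shared = findings[a] & findings[b]
    print(f"{a} / {b}: {len(shared)} shared -> {sorted(shared)}")
```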

### Aggregation Testing

```markdown
## Aggregation Test: Severity Consistency

Same issue found by multiple reviewers:

| Reviewer | Finding | Severity |
|----------|---------|----------|
| code-reviewer | Missing null check | MEDIUM |
| business-logic-reviewer | Missing null check | HIGH |

Problem: Inconsistent severity for same issue
Fix: Align severity calibration across all reviewers
```
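The same parsed findings can be cross-checked for severity disagreements automatically. A sketch, assuming each reviewer's output has been reduced to finding → severity pairs (an illustrative structure):

```python
from collections import defaultdict

reviewer_severities = {
    "code-reviewer": {"Missing null check": "MEDIUM"},
    "business-logic-reviewer": {"Missing null check": "HIGH"},
}

# Group the severity each reviewer assigned to the same finding
by_finding = defaultdict(dict)
for reviewer, assigned in reviewer_severities.items():
    for finding, severity in assigned.items():
        by_finding[finding][reviewer] = severity

for finding, severities in by_finding.items():
    if len(set(severities.values())) > 1:
        print(f"Severity disagreement on '{finding}': {severities}")
```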

## Agent Testing Checklist

Before deploying agent, verify you followed RED-GREEN-REFACTOR:

RED Phase:

  • Created test inputs (known issues, clean code, edge cases)
  • Ran agent with test inputs
  • Documented failures verbatim (missing findings, wrong severity, bad format)

GREEN Phase:

  • Updated agent definition addressing specific failures
  • Re-ran test inputs
  • All basic tests now pass

REFACTOR Phase:

  • Tested edge cases (empty, large, unusual, ambiguous)
  • Tested stress scenarios (many issues, complex code)
  • Added explicit edge case handling to definition
  • Verified consistency (multiple runs produce same results)
  • Verified schema compliance
  • Tested parallel workflow integration (if applicable)
  • Re-ran ALL tests after each change

Metrics (for reviewer agents):

  • True positive rate: >95%
  • False positive rate: <10%
  • False negative rate: <5%
  • Severity accuracy: >90%
  • Schema compliance: 100%
  • Consistency: >95%

## Prohibited Testing Shortcuts

You CANNOT substitute proper testing with:

| Shortcut | Why It Fails |
|----------|--------------|
| Reading agent definition carefully | Reading ≠ executing. Must run agent with inputs. |
| Manual testing in Claude UI | Ad-hoc ≠ reproducible. No baseline documented. |
| "Looks good to me" review | Visual inspection misses runtime failures. |
| Basing on proven template | Templates need validation for YOUR use case. |
| Expert prompt engineering knowledge | Expertise doesn't prevent bugs. Tests do. |
| Testing after first production use | Production is not QA. Test before deployment. |
| Monitoring production for issues | Reactive ≠ proactive. Catch issues before users do. |
| Deploy now, test in parallel | Parallel testing still means untested code in production. |

None of these substitutes for running the agent with documented test inputs and comparing its outputs.

## Testing Agent Modifications

EVERY agent edit requires re-running the FULL test suite:

| Change Type | Required Action |
|-------------|-----------------|
| Prompt wording changes | Full re-test |
| Severity calibration updates | Full re-test |
| Output schema modifications | Full re-test |
| Adding edge case handling | Full re-test |
| "Small" one-line changes | Full re-test |
| Typo fixes in prompt | Full re-test |

"Small change" is not an exception. One-line prompt changes can completely alter LLM behavior. Re-test always.

## Common Mistakes

❌ **Testing with only "happy path" inputs** - Agent works with obvious issues but misses subtle ones. ✅ Fix: Include ambiguous cases and edge cases in test suite.

❌ **Not documenting exact outputs** - "Agent was wrong" doesn't tell you what to fix. ✅ Fix: Capture agent output verbatim, compare to expected.

❌ **Fixing without re-running all tests** - Fix one issue, break another. ✅ Fix: Re-run entire test suite after each change.

❌ **Testing single agent in isolation when used in parallel** - Individual agents work, but combined workflow fails. ✅ Fix: Test parallel dispatch and output aggregation.

❌ **Not testing consistency** - Agent gives different answers for same input. ✅ Fix: Run same input 3+ times, verify consistent output.

❌ **Skipping severity calibration** - Agent finds issues but severity is inconsistent. ✅ Fix: Add explicit severity examples to agent definition.

❌ **Not testing edge cases** - Agent works for normal code, crashes on edge cases. ✅ Fix: Test empty, large, unusual, and ambiguous inputs.

❌ **Single test case validation** - "One test passed" proves nothing about agent behavior. ✅ Fix: Minimum 4-6 test cases per agent type.

❌ **Manual UI testing as substitute** - Ad-hoc testing doesn't create reproducible baselines. ✅ Fix: Document all test inputs and expected outputs.

❌ **Skipping re-test for "small" changes** - One-line prompt changes can break everything. ✅ Fix: Re-run full suite after ANY modification.

## Rationalization Table

| Excuse | Reality |
|--------|---------|
| "Agent prompt is obviously correct" | Obvious prompts fail in practice. Test proves correctness. |
| "Tested manually in Claude UI" | Ad-hoc ≠ reproducible. No baseline documented. |
| "One test case passed" | Sample size = 1 proves nothing. Need 4-6 cases minimum. |
| "Will test after first production use" | Production is not QA. Test before deployment. Always. |
| "Reading prompt is sufficient review" | Reading ≠ executing. Must run agent with inputs. |
| "Changes are small, re-test unnecessary" | Small changes cause big failures. Re-run full suite. |
| "Based agent on proven template" | Templates need validation for your use case. Test anyway. |
| "Expert in prompt engineering" | Expertise doesn't prevent bugs. Tests do. |
| "Production is down, no time to test" | Deploying untested fix may make outage worse. Test first. |
| "Deploy now, test in parallel" | Untested code in production = unknown behavior. Unacceptable. |
| "Quick smoke test is enough" | Smoke test misses edge cases. Full suite required. |
| "Simple pass-through agent" | You cannot self-determine exemptions. Get human approval. |

## Red Flags - STOP and Test Now

If you catch yourself thinking ANY of these, STOP. You're about to violate the Iron Law:

  • Agent edited but tests not re-run
  • "Looks good" without execution
  • Single test case only
  • No documented baseline
  • No edge case testing
  • Manual verification only
  • "Will test in production"
  • "Based on template, should work"
  • "Just a small prompt change"
  • "No time to test properly"
  • "One quick test is enough"
  • "Agent is simple, obviously works"
  • "Expert intuition says it's fine"
  • "Production is down, skip testing"
  • "Deploy now, test in parallel"

All of these mean: STOP. Run full RED-GREEN-REFACTOR cycle NOW.

## Quick Reference (TDD Cycle for Agents)

| TDD Phase | Agent Testing | Success Criteria |
|-----------|---------------|------------------|
| RED | Run with test inputs | Document exact output failures |
| Verify RED | Capture verbatim | Have specific issues to fix |
| GREEN | Fix agent definition | All basic tests pass |
| Verify GREEN | Re-run all tests | No regressions |
| REFACTOR | Test edge cases | Robust under all conditions |
| Stay GREEN | Full test suite | All tests pass, metrics met |

## Example: Testing a New Reviewer Agent

### Step 1: Create Test Suite

```markdown
# security-reviewer Test Suite

## Test 1: SQL Injection (Known Critical)
Input: `query = "SELECT * FROM users WHERE id = " + user_id`
Expected: CRITICAL, SQL injection, OWASP A03:2021

## Test 2: Parameterized Query (Clean)
Input: `query = "SELECT * FROM users WHERE id = ?"; db.execute(query, [user_id])`
Expected: No security findings

## Test 3: Hardcoded Secret (Known Critical)
Input: `API_KEY = "sk-1234567890abcdef"`
Expected: CRITICAL, hardcoded secret

## Test 4: Environment Variable (Clean)
Input: `API_KEY = os.environ.get("API_KEY")`
Expected: No security findings (or LOW suggestion for validation)

## Test 5: Empty File
Input: (empty)
Expected: Graceful handling

## Test 6: Ambiguous - Internal Tool
Input: `password = "dev123"  # Local development only`
Expected: Flag but acknowledge context
```

### Step 2: Run RED Phase

```
Test 1: ❌ Found issue but marked HIGH not CRITICAL
Test 2: ✅ No findings
Test 3: ❌ Missed entirely
Test 4: ✅ No findings
Test 5: ❌ Error: "No code provided"
Test 6: ❌ Dismissed due to comment
```

### Step 3: GREEN Phase - Fix Definition

Add to agent:

  1. Severity calibration with SQL injection = CRITICAL
  2. Explicit check for hardcoded secrets pattern
  3. Empty file handling instruction
  4. "Context comments don't dismiss security issues"

### Step 4: Re-run Tests

```
Test 1: ✅ CRITICAL
Test 2: ✅ No findings
Test 3: ✅ CRITICAL
Test 4: ✅ No findings
Test 5: ✅ "No code to review"
Test 6: ✅ Flagged with context acknowledgment
```

### Step 5: REFACTOR - Edge Cases

Add tests for: minified code, 10K line file, mixed languages, nested vulnerabilities.

Run, fix, repeat until all pass.

## The Bottom Line

Agent testing IS TDD. Same principles, same cycle, same benefits.

If you wouldn't deploy code without tests, don't deploy agents without testing them.

RED-GREEN-REFACTOR for agents works exactly like RED-GREEN-REFACTOR for code:

  1. RED: See what's wrong (run with test inputs)
  2. GREEN: Fix it (update agent definition)
  3. REFACTOR: Make it robust (edge cases, consistency)

Evidence before deployment. Always.