---
name: testing-agents-with-subagents
description: |
  Agent testing methodology - run agents with test inputs, observe outputs,
  iterate until outputs are accurate and well-structured.
trigger: |
  - Before deploying a new agent
  - After editing an existing agent
  - Agent produces structured outputs that must be accurate
skip_when: |
  - Agent is simple passthrough → minimal testing needed
  - Agent already tested for this use case
related:
  complementary: [test-driven-development]
---

# Testing Agents With Subagents

## Overview

**Testing agents is TDD applied to AI worker definitions.** You run agents with known test inputs (RED - observe incorrect outputs), fix the agent definition (GREEN - outputs now correct), then handle edge cases (REFACTOR - robust under all conditions).

**Core principle:** If you didn't run an agent with test inputs and verify its outputs, you don't know if the agent works correctly.

**REQUIRED BACKGROUND:** You MUST understand `ring-default:test-driven-development` before using this skill. That skill defines the fundamental RED-GREEN-REFACTOR cycle. This skill provides agent-specific test formats (test inputs, output verification, accuracy metrics).

**Key difference from testing-skills-with-subagents:**

- **Skills** = instructions that guide behavior; test if the agent follows rules under pressure
- **Agents** = separate Claude instances via Task tool; test if they produce correct outputs

## The Iron Law

```
NO AGENT DEPLOYMENT WITHOUT RED-GREEN-REFACTOR TESTING FIRST
```

About to deploy an agent without completing the test cycle? You have ONLY one option:

### STOP. TEST FIRST. THEN DEPLOY.

**You CANNOT:**

- ❌ "Deploy and monitor for issues"
- ❌ "Test with first real usage"
- ❌ "Quick smoke test is enough"
- ❌ "Tested manually in Claude UI"
- ❌ "One test case passed"
- ❌ "Agent prompt looks correct"
- ❌ "Based on working template"
- ❌ "Deploy now, test in parallel"
- ❌ "Production is down, no time to test"

**ZERO exceptions. Simple agent, expert confidence, time pressure, production outage - NONE override testing.**

**Why this is absolute:** Untested agents fail in production. Every time. The question is not IF but WHEN and HOW BADLY. A 20-minute test suite prevents hours of debugging and lost trust.

## When to Use

Test agents that:

- Analyze code/designs and produce findings (reviewers)
- Generate structured outputs (planners, analyzers)
- Make decisions or categorizations (severity, priority)
- Have defined output schemas that must be followed
- Are used in parallel workflows where consistency matters

**Test exemptions require explicit human partner approval:**

- Simple pass-through agents (just reformatting) - **only if human partner confirms**
- Agents without structured outputs - **only if human partner confirms**
- **You CANNOT self-determine test exemption**
- **When in doubt → TEST**

## TDD Mapping for Agent Testing

| TDD Phase | Agent Testing | What You Do |
|-----------|---------------|-------------|
| **RED** | Run with test inputs | Dispatch agent, observe incorrect/incomplete outputs |
| **Verify RED** | Document failures | Capture exact output issues verbatim |
| **GREEN** | Fix agent definition | Update prompt/schema to address failures |
| **Verify GREEN** | Re-run tests | Agent now produces correct outputs |
| **REFACTOR** | Test edge cases | Ambiguous inputs, empty inputs, complex scenarios |
| **Stay GREEN** | Re-verify all | Previous tests still pass after changes |

Same cycle as code TDD, different test format.
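To make that test format concrete, here is a minimal sketch of how one agent test case could be recorded as data. The `AgentTestCase` name and its fields are illustrative, not part of any existing framework.

```python
from dataclasses import dataclass


@dataclass
class AgentTestCase:
    """One entry in an agent test suite (illustrative structure, not an existing framework)."""
    name: str         # e.g. "Known SQL injection"
    category: str     # known-issue | clean | edge-case | ambiguous | severity-calibration
    test_input: str   # the exact code/text handed to the agent
    expected: str     # expected findings, severity, and format - written down before any run
    actual: str = ""  # verbatim agent output, filled in during the RED phase


# Example: the first RED-phase test case for a reviewer agent, captured as data
sql_injection_case = AgentTestCase(
    name="Known SQL injection",
    category="known-issue",
    test_input='query = "SELECT * FROM users WHERE id = " + user_id',
    expected="CRITICAL finding, references OWASP A03:2021",
)
```

Keeping test cases as data rather than ad-hoc prompts is what makes the RED baseline reproducible after every later fix.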
## RED Phase: Baseline Testing (Observe Failures)

**Goal:** Run agent with known test inputs - observe what's wrong, document exact failures.

This is identical to TDD's "write failing test first" - you MUST see what the agent actually produces before fixing the definition.

**Process:**

- [ ] **Create test inputs** (known issues, edge cases, clean inputs)
- [ ] **Run agent** - dispatch via Task tool with test inputs
- [ ] **Compare outputs** - expected vs actual
- [ ] **Document failures** - missing findings, wrong severity, bad format
- [ ] **Identify patterns** - which input types cause failures?

### Test Input Categories

| Category | Purpose | Example |
|----------|---------|---------|
| **Known Issues** | Verify agent finds real problems | Code with SQL injection, hardcoded secrets |
| **Clean Inputs** | Verify no false positives | Well-written code with no issues |
| **Edge Cases** | Verify robustness | Empty files, huge files, unusual patterns |
| **Ambiguous Cases** | Verify judgment | Code that could go either way |
| **Severity Calibration** | Verify severity accuracy | Mix of critical, high, medium, low issues |

### Minimum Test Suite Requirements

Before deploying ANY agent, you MUST have:

| Agent Type | Minimum Test Cases | Required Coverage |
|------------|-------------------|-------------------|
| **Reviewer agents** | 6 tests | 2 known issues, 2 clean, 1 edge case, 1 ambiguous |
| **Analyzer agents** | 5 tests | 2 typical, 1 empty, 1 large, 1 malformed |
| **Decision agents** | 4 tests | 2 clear cases, 2 boundary cases |
| **Planning agents** | 5 tests | 2 standard, 1 complex, 1 minimal, 1 edge case |

**Fewer tests = incomplete testing = DO NOT DEPLOY.**

One test case proves nothing. Three tests are suspicious. Six tests are the minimum for confidence.

### Example Test Suite for Code Reviewer

```markdown
## Test Case 1: Known SQL Injection
**Input:** Function with string concatenation in SQL query
**Expected:** CRITICAL finding, references OWASP A03:2021
**Actual:** [Run agent, record output]

## Test Case 2: Clean Authentication
**Input:** Well-implemented JWT validation with proper error handling
**Expected:** No findings or LOW-severity suggestions only
**Actual:** [Run agent, record output]

## Test Case 3: Ambiguous Error Handling
**Input:** Error caught but only logged, not re-thrown
**Expected:** MEDIUM finding about silent failures
**Actual:** [Run agent, record output]

## Test Case 4: Empty File
**Input:** Empty source file
**Expected:** Graceful handling, no crash, maybe LOW finding
**Actual:** [Run agent, record output]
```

### Running the Test

````markdown
Use Task tool to dispatch agent:

Task(
  subagent_type="ring-default:code-reviewer",
  prompt="""
  Review this code for security issues:

  ```python
  def get_user(user_id):
      query = "SELECT * FROM users WHERE id = " + user_id
      return db.execute(query)
  ```

  Provide findings with severity levels.
  """
)
````

**Document exact output.** Don't summarize - capture verbatim.
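As a minimal sketch, the RED-phase run can be scripted so the baseline is reproducible. It assumes a caller-supplied `dispatch_agent(subagent_type, prompt)` helper that wraps however you invoke the Task tool in your environment and returns the agent's raw text output; the helper and the test-case dicts are illustrative, not an existing API.

```python
RED_PHASE_CASES = [
    {
        "name": "Known SQL injection",
        "input": 'def get_user(user_id):\n'
                 '    query = "SELECT * FROM users WHERE id = " + user_id\n'
                 '    return db.execute(query)',
        "expected": "CRITICAL finding, SQL injection, references OWASP A03:2021",
    },
    {
        "name": "Empty file",
        "input": "",
        "expected": "Graceful handling, no crash, maybe LOW finding",
    },
]


def run_red_phase(dispatch_agent, cases=RED_PHASE_CASES,
                  subagent_type="ring-default:code-reviewer"):
    """Run every test case once and capture the agent's output verbatim."""
    results = []
    for case in cases:
        prompt = (
            "Review this code for security issues:\n\n"
            f"```python\n{case['input']}\n```\n\n"
            "Provide findings with severity levels."
        )
        output = dispatch_agent(subagent_type=subagent_type, prompt=prompt)
        # Capture the raw output - do not summarize. The expected-vs-actual
        # comparison happens afterwards, when documenting failures for GREEN.
        results.append({**case, "actual": output})
    return results
```

Writing down `expected` before any run is what distinguishes a real RED baseline from an ad-hoc smoke test.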
## GREEN Phase: Fix Agent Definition (Make Tests Pass)

Write/update the agent definition addressing the specific failures documented in the RED phase.

**Common fixes:**

| Failure Type | Fix Approach |
|--------------|--------------|
| Missing findings | Add explicit instructions to check for X |
| Wrong severity | Add severity calibration examples |
| Bad output format | Add output schema with examples |
| False positives | Add "don't flag X when Y" instructions |
| Incomplete analysis | Add "always check A, B, C" checklist |

### Example Fix: Severity Calibration

**RED Phase Failure:**

```
Agent marked hardcoded password as MEDIUM instead of CRITICAL
```

**GREEN Phase Fix (add to agent definition):**

```markdown
## Severity Calibration

**CRITICAL:** Immediate exploitation possible
- Hardcoded secrets (passwords, API keys, tokens)
- SQL injection with user input
- Authentication bypass

**HIGH:** Exploitation requires additional steps
- Missing input validation
- Improper error handling exposing internals

**MEDIUM:** Security weakness, not directly exploitable
- Missing rate limiting
- Verbose error messages

**LOW:** Best practice violations
- Missing security headers
- Outdated dependencies (no known CVEs)
```

### Re-run Tests

After fixing, re-run ALL test cases:

```markdown
## Test Results After Fix

| Test Case | Expected | Actual | Pass/Fail |
|-----------|----------|--------|-----------|
| SQL Injection | CRITICAL | CRITICAL | ✅ PASS |
| Clean Auth | No findings | No findings | ✅ PASS |
| Ambiguous Error | MEDIUM | MEDIUM | ✅ PASS |
| Empty File | Graceful | Graceful | ✅ PASS |
```

If any test fails: continue fixing, re-test.

## VERIFY GREEN: Output Verification

**Goal:** Confirm agent produces correct, well-structured outputs consistently.

### Output Schema Compliance

If the agent has a defined output schema, verify compliance:

```markdown
## Expected Schema
- Summary (1-2 sentences)
- Findings (array of {severity, location, description, recommendation})
- Overall assessment (PASS/FAIL with conditions)

## Actual Output Analysis
- Summary: ✅ Present, correct format
- Findings: ✅ Array, all fields present
- Overall assessment: ❌ Missing conditions for FAIL
```

### Accuracy Metrics

Track agent accuracy across the test suite:

| Metric | Target | Actual |
|--------|--------|--------|
| True Positives (found real issues) | 100% | [X]% |
| False Positives (flagged non-issues) | <10% | [X]% |
| False Negatives (missed real issues) | <5% | [X]% |
| Severity Accuracy | >90% | [X]% |
| Schema Compliance | 100% | [X]% |

### Consistency Testing

Run the same test input 3 times. Outputs should be consistent:

```markdown
## Consistency Test: SQL Injection Input

Run 1: CRITICAL, SQL injection, line 3
Run 2: CRITICAL, SQL injection, line 3
Run 3: CRITICAL, SQL injection, line 3

Consistency: ✅ 100% (all runs identical findings)
```

Inconsistency indicates the agent definition is ambiguous.
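A sketch of how the accuracy metrics and the consistency check could be computed from hand-labeled results. The field names (`real_issue`, `flagged`, `severity_correct`, `schema_valid`) and the counting conventions are assumptions for illustration, not a prescribed schema.

```python
def accuracy_metrics(results):
    """results: list of dicts labeled while reviewing outputs, with boolean fields
    real_issue, flagged, severity_correct, schema_valid."""
    real = [r for r in results if r["real_issue"]]        # inputs that contain real issues
    clean = [r for r in results if not r["real_issue"]]   # clean inputs

    return {
        "true_positive_rate": sum(r["flagged"] for r in real) / len(real) if real else 1.0,
        "false_positive_rate": sum(r["flagged"] for r in clean) / len(clean) if clean else 0.0,
        "false_negative_rate": sum(not r["flagged"] for r in real) / len(real) if real else 0.0,
        "severity_accuracy": sum(r["severity_correct"] for r in real) / len(real) if real else 1.0,
        "schema_compliance": sum(r["schema_valid"] for r in results) / len(results),
    }


def consistent(runs):
    """Consistency test: parsed findings (e.g. (severity, location) pairs) from
    repeated runs of the same input should all match the first run."""
    baseline = set(runs[0])
    return all(set(run) == baseline for run in runs)
```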
## REFACTOR Phase: Edge Cases and Robustness

Agent passes basic tests? Now test edge cases.

### Edge Case Categories

| Category | Test Cases |
|----------|------------|
| **Empty/Null** | Empty file, null input, whitespace only |
| **Large** | 10K line file, deeply nested code |
| **Unusual** | Minified code, generated code, config files |
| **Multi-language** | Mixed JS/TS, embedded SQL, templates |
| **Ambiguous** | Code that could be good or bad depending on context |

### Stress Testing

```markdown
## Stress Test: Large File
**Input:** 5000-line file with 20 known issues scattered throughout
**Expected:** All 20 issues found, reasonable response time
**Actual:** [Run agent, record output]

## Stress Test: Complex Nesting
**Input:** 15-level deep callback hell
**Expected:** Findings about complexity, maintainability
**Actual:** [Run agent, record output]
```

### Ambiguity Testing

````markdown
## Ambiguity Test: Context-Dependent Security
**Input:**
```python
# Internal admin tool, not exposed to users
password = "admin123"  # Default for local dev
```
**Expected:** Agent should note context but still flag
**Actual:** [Run agent, record output]
**Analysis:** Does agent handle nuance appropriately?
````

### Plugging Holes

For each edge case failure, add explicit handling:

**Before:**

```markdown
Review code for security issues.
```

**After:**

```markdown
Review code for security issues.

**Edge case handling:**
- Empty files: Return "No code to review" with PASS
- Large files (>5K lines): Focus on high-risk patterns first
- Minified code: Note limitations, review what's readable
- Context comments: Consider but don't use to dismiss issues
```

## Testing Parallel Agent Workflows

When agents run in parallel (like 3 reviewers), test the combined workflow:

### Parallel Consistency

```markdown
## Parallel Test: Same Input to All Reviewers

Input: Authentication module with mixed issues

| Reviewer | Findings | Overlap |
|----------|----------|---------|
| code-reviewer | 5 findings | - |
| business-logic-reviewer | 3 findings | 1 shared |
| security-reviewer | 4 findings | 2 shared |

Analysis:
- Total unique findings: 9
- Appropriate overlap (security issues found by both security and code reviewer)
- No contradictions
```

### Aggregation Testing

```markdown
## Aggregation Test: Severity Consistency

Same issue found by multiple reviewers:

| Reviewer | Finding | Severity |
|----------|---------|----------|
| code-reviewer | Missing null check | MEDIUM |
| business-logic-reviewer | Missing null check | HIGH |

Problem: Inconsistent severity for same issue
Fix: Align severity calibration across all reviewers
```

## Agent Testing Checklist

Before deploying an agent, verify you followed RED-GREEN-REFACTOR:

**RED Phase:**

- [ ] Created test inputs (known issues, clean code, edge cases)
- [ ] Ran agent with test inputs
- [ ] Documented failures verbatim (missing findings, wrong severity, bad format)

**GREEN Phase:**

- [ ] Updated agent definition addressing specific failures
- [ ] Re-ran test inputs
- [ ] All basic tests now pass

**REFACTOR Phase:**

- [ ] Tested edge cases (empty, large, unusual, ambiguous)
- [ ] Tested stress scenarios (many issues, complex code)
- [ ] Added explicit edge case handling to definition
- [ ] Verified consistency (multiple runs produce same results)
- [ ] Verified schema compliance
- [ ] Tested parallel workflow integration (if applicable)
- [ ] Re-ran ALL tests after each change

**Metrics (for reviewer agents):**

- [ ] True positive rate: >95%
- [ ] False positive rate: <10%
- [ ] False negative rate: <5%
- [ ] Severity accuracy: >90%
- [ ] Schema compliance: 100%
- [ ] Consistency: >95%
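These metric targets can be expressed as an explicit deployment gate. A sketch, assuming a `metrics` dict of fractions in [0, 1]; the `deployment_gate` name is illustrative and the thresholds simply mirror the checklist above.

```python
# Thresholds mirror the reviewer-agent checklist; all values are fractions in [0, 1].
THRESHOLDS = {
    "true_positive_rate": (">=", 0.95),
    "false_positive_rate": ("<=", 0.10),
    "false_negative_rate": ("<=", 0.05),
    "severity_accuracy": (">=", 0.90),
    "schema_compliance": (">=", 1.00),
    "consistency": (">=", 0.95),
}


def deployment_gate(metrics):
    """Return (ok, failures). Any failed or missing metric means: do not deploy."""
    failures = []
    for name, (op, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: not measured")
        elif op == ">=" and value < limit:
            failures.append(f"{name}: {value:.0%} below target {limit:.0%}")
        elif op == "<=" and value > limit:
            failures.append(f"{name}: {value:.0%} above limit {limit:.0%}")
    return not failures, failures
```

A missing metric is treated as a failure on purpose: "not measured" is not the same as "passed".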
## Prohibited Testing Shortcuts

**You CANNOT substitute proper testing with:**

| Shortcut | Why It Fails |
|----------|--------------|
| Reading agent definition carefully | Reading ≠ executing. Must run agent with inputs. |
| Manual testing in Claude UI | Ad-hoc ≠ reproducible. No baseline documented. |
| "Looks good to me" review | Visual inspection misses runtime failures. |
| Basing on proven template | Templates need validation for YOUR use case. |
| Expert prompt engineering knowledge | Expertise doesn't prevent bugs. Tests do. |
| Testing after first production use | Production is not QA. Test before deployment. |
| Monitoring production for issues | Reactive ≠ proactive. Catch issues before users do. |
| Deploy now, test in parallel | Parallel testing still means untested code in production. |

**ALL require running agent with documented test inputs and comparing outputs.**

## Testing Agent Modifications

**EVERY agent edit requires re-running the FULL test suite:**

| Change Type | Required Action |
|-------------|-----------------|
| Prompt wording changes | Full re-test |
| Severity calibration updates | Full re-test |
| Output schema modifications | Full re-test |
| Adding edge case handling | Full re-test |
| "Small" one-line changes | Full re-test |
| Typo fixes in prompt | Full re-test |

**"Small change" is not an exception.** One-line prompt changes can completely alter LLM behavior. Re-test always.

## Common Mistakes

**❌ Testing with only "happy path" inputs**
Agent works with obvious issues but misses subtle ones.
✅ Fix: Include ambiguous cases and edge cases in test suite.

**❌ Not documenting exact outputs**
"Agent was wrong" doesn't tell you what to fix.
✅ Fix: Capture agent output verbatim, compare to expected.

**❌ Fixing without re-running all tests**
Fix one issue, break another.
✅ Fix: Re-run entire test suite after each change.

**❌ Testing single agent in isolation when used in parallel**
Individual agents work, but combined workflow fails.
✅ Fix: Test parallel dispatch and output aggregation.

**❌ Not testing consistency**
Agent gives different answers for same input.
✅ Fix: Run same input 3+ times, verify consistent output.

**❌ Skipping severity calibration**
Agent finds issues but severity is inconsistent.
✅ Fix: Add explicit severity examples to agent definition.

**❌ Not testing edge cases**
Agent works for normal code, crashes on edge cases.
✅ Fix: Test empty, large, unusual, and ambiguous inputs.

**❌ Single test case validation**
"One test passed" proves nothing about agent behavior.
✅ Fix: Minimum 4-6 test cases per agent type.

**❌ Manual UI testing as substitute**
Ad-hoc testing doesn't create reproducible baselines.
✅ Fix: Document all test inputs and expected outputs.

**❌ Skipping re-test for "small" changes**
One-line prompt changes can break everything.
✅ Fix: Re-run full suite after ANY modification.

## Rationalization Table

| Excuse | Reality |
|--------|---------|
| "Agent prompt is obviously correct" | Obvious prompts fail in practice. Test proves correctness. |
| "Tested manually in Claude UI" | Ad-hoc ≠ reproducible. No baseline documented. |
| "One test case passed" | Sample size = 1 proves nothing. Need 4-6 cases minimum. |
| "Will test after first production use" | Production is not QA. Test before deployment. Always. |
| "Reading prompt is sufficient review" | Reading ≠ executing. Must run agent with inputs. |
| "Changes are small, re-test unnecessary" | Small changes cause big failures. Re-run full suite. |
| | "Based agent on proven template" | Templates need validation for your use case. Test anyway. | | "Expert in prompt engineering" | Expertise doesn't prevent bugs. Tests do. | | "Production is down, no time to test" | Deploying untested fix may make outage worse. Test first. | | "Deploy now, test in parallel" | Untested code in production = unknown behavior. Unacceptable. | | "Quick smoke test is enough" | Smoke test misses edge cases. Full suite required. | | "Simple pass-through agent" | You cannot self-determine exemptions. Get human approval. | ## Red Flags - STOP and Test Now If you catch yourself thinking ANY of these, STOP. You're about to violate the Iron Law: - Agent edited but tests not re-run - "Looks good" without execution - Single test case only - No documented baseline - No edge case testing - Manual verification only - "Will test in production" - "Based on template, should work" - "Just a small prompt change" - "No time to test properly" - "One quick test is enough" - "Agent is simple, obviously works" - "Expert intuition says it's fine" - "Production is down, skip testing" - "Deploy now, test in parallel" **All of these mean: STOP. Run full RED-GREEN-REFACTOR cycle NOW.** ## Quick Reference (TDD Cycle for Agents) | TDD Phase | Agent Testing | Success Criteria | |-----------|---------------|------------------| | **RED** | Run with test inputs | Document exact output failures | | **Verify RED** | Capture verbatim | Have specific issues to fix | | **GREEN** | Fix agent definition | All basic tests pass | | **Verify GREEN** | Re-run all tests | No regressions | | **REFACTOR** | Test edge cases | Robust under all conditions | | **Stay GREEN** | Full test suite | All tests pass, metrics met | ## Example: Testing a New Reviewer Agent ### Step 1: Create Test Suite ```markdown # security-reviewer Test Suite ## Test 1: SQL Injection (Known Critical) Input: `query = "SELECT * FROM users WHERE id = " + user_id` Expected: CRITICAL, SQL injection, OWASP A03:2021 ## Test 2: Parameterized Query (Clean) Input: `query = "SELECT * FROM users WHERE id = ?"; db.execute(query, [user_id])` Expected: No security findings ## Test 3: Hardcoded Secret (Known Critical) Input: `API_KEY = "sk-1234567890abcdef"` Expected: CRITICAL, hardcoded secret ## Test 4: Environment Variable (Clean) Input: `API_KEY = os.environ.get("API_KEY")` Expected: No security findings (or LOW suggestion for validation) ## Test 5: Empty File Input: (empty) Expected: Graceful handling ## Test 6: Ambiguous - Internal Tool Input: `password = "dev123" # Local development only` Expected: Flag but acknowledge context ``` ### Step 2: Run RED Phase ``` Test 1: L Found issue but marked HIGH not CRITICAL Test 2:  No findings Test 3: L Missed entirely Test 4:  No findings Test 5: L Error: "No code provided" Test 6: L Dismissed due to comment ``` ### Step 3: GREEN Phase - Fix Definition Add to agent: 1. Severity calibration with SQL injection = CRITICAL 2. Explicit check for hardcoded secrets pattern 3. Empty file handling instruction 4. "Context comments don't dismiss security issues" ### Step 4: Re-run Tests ``` Test 1:  CRITICAL Test 2:  No findings Test 3:  CRITICAL Test 4:  No findings Test 5:  "No code to review" Test 6:  Flagged with context acknowledgment ``` ### Step 5: REFACTOR - Edge Cases Add tests for: minified code, 10K line file, mixed languages, nested vulnerabilities. Run, fix, repeat until all pass. ## The Bottom Line **Agent testing IS TDD. 
## The Bottom Line

**Agent testing IS TDD. Same principles, same cycle, same benefits.**

If you wouldn't deploy code without tests, don't deploy agents without testing them.

RED-GREEN-REFACTOR for agents works exactly like RED-GREEN-REFACTOR for code:

1. **RED:** See what's wrong (run with test inputs)
2. **GREEN:** Fix it (update agent definition)
3. **REFACTOR:** Make it robust (edge cases, consistency)

**Evidence before deployment. Always.**