Initial commit

Zhongwei Li
2025-11-30 08:37:11 +08:00
commit 20b36ca9b1
56 changed files with 14530 additions and 0 deletions

---
name: testing-agents-with-subagents
description: |
  Agent testing methodology - run agents with test inputs, observe outputs,
  iterate until outputs are accurate and well-structured.
trigger: |
  - Before deploying a new agent
  - After editing an existing agent
  - Agent produces structured outputs that must be accurate
skip_when: |
  - Agent is simple passthrough → minimal testing needed
  - Agent already tested for this use case
related:
  complementary: [test-driven-development]
---
# Testing Agents With Subagents
## Overview
**Testing agents is TDD applied to AI worker definitions.**
You run agents with known test inputs (RED - observe incorrect outputs), fix the agent definition (GREEN - outputs now correct), then handle edge cases (REFACTOR - robust under all conditions).
**Core principle:** If you didn't run an agent with test inputs and verify its outputs, you don't know if the agent works correctly.
**REQUIRED BACKGROUND:** You MUST understand `ring-default:test-driven-development` before using this skill. That skill defines the fundamental RED-GREEN-REFACTOR cycle. This skill provides agent-specific test formats (test inputs, output verification, accuracy metrics).
**Key difference from testing-skills-with-subagents:**
- **Skills** = instructions that guide behavior; test if agent follows rules under pressure
- **Agents** = separate Claude instances via Task tool; test if they produce correct outputs
## The Iron Law
```
NO AGENT DEPLOYMENT WITHOUT RED-GREEN-REFACTOR TESTING FIRST
```
About to deploy an agent without completing the test cycle? You have ONLY one option:
### STOP. TEST FIRST. THEN DEPLOY.
**You CANNOT:**
- ❌ "Deploy and monitor for issues"
- ❌ "Test with first real usage"
- ❌ "Quick smoke test is enough"
- ❌ "Tested manually in Claude UI"
- ❌ "One test case passed"
- ❌ "Agent prompt looks correct"
- ❌ "Based on working template"
- ❌ "Deploy now, test in parallel"
- ❌ "Production is down, no time to test"
**ZERO exceptions. Simple agent, expert confidence, time pressure, production outage - NONE override testing.**
**Why this is absolute:** Untested agents fail in production. Every time. The question is not IF but WHEN and HOW BADLY. A 20-minute test suite prevents hours of debugging and lost trust.
## When to Use
Test agents that:
- Analyze code/designs and produce findings (reviewers)
- Generate structured outputs (planners, analyzers)
- Make decisions or categorizations (severity, priority)
- Have defined output schemas that must be followed
- Are used in parallel workflows where consistency matters
**Test exemptions require explicit human partner approval:**
- Simple pass-through agents (just reformatting) - **only if human partner confirms**
- Agents without structured outputs - **only if human partner confirms**
- **You CANNOT self-determine test exemption**
- **When in doubt → TEST**
## TDD Mapping for Agent Testing
| TDD Phase | Agent Testing | What You Do |
|-----------|---------------|-------------|
| **RED** | Run with test inputs | Dispatch agent, observe incorrect/incomplete outputs |
| **Verify RED** | Document failures | Capture exact output issues verbatim |
| **GREEN** | Fix agent definition | Update prompt/schema to address failures |
| **Verify GREEN** | Re-run tests | Agent now produces correct outputs |
| **REFACTOR** | Test edge cases | Ambiguous inputs, empty inputs, complex scenarios |
| **Stay GREEN** | Re-verify all | Previous tests still pass after changes |
Same cycle as code TDD, different test format.
## RED Phase: Baseline Testing (Observe Failures)
**Goal:** Run agent with known test inputs - observe what's wrong, document exact failures.
This is identical to TDD's "write failing test first" - you MUST see what the agent actually produces before fixing the definition.
**Process:**
- [ ] **Create test inputs** (known issues, edge cases, clean inputs)
- [ ] **Run agent** - dispatch via Task tool with test inputs
- [ ] **Compare outputs** - expected vs actual
- [ ] **Document failures** - missing findings, wrong severity, bad format
- [ ] **Identify patterns** - which input types cause failures?
### Test Input Categories
| Category | Purpose | Example |
|----------|---------|---------|
| **Known Issues** | Verify agent finds real problems | Code with SQL injection, hardcoded secrets |
| **Clean Inputs** | Verify no false positives | Well-written code with no issues |
| **Edge Cases** | Verify robustness | Empty files, huge files, unusual patterns |
| **Ambiguous Cases** | Verify judgment | Code that could go either way |
| **Severity Calibration** | Verify severity accuracy | Mix of critical, high, medium, low issues |
### Minimum Test Suite Requirements
Before deploying ANY agent, you MUST have:
| Agent Type | Minimum Test Cases | Required Coverage |
|------------|-------------------|-------------------|
| **Reviewer agents** | 6 tests | 2 known issues, 2 clean, 1 edge case, 1 ambiguous |
| **Analyzer agents** | 5 tests | 2 typical, 1 empty, 1 large, 1 malformed |
| **Decision agents** | 4 tests | 2 clear cases, 2 boundary cases |
| **Planning agents** | 5 tests | 2 standard, 1 complex, 1 minimal, 1 edge case |
**Fewer tests = incomplete testing = DO NOT DEPLOY.**
One test case proves nothing. Three tests are suspicious. Four to six tests, per the table above, are the minimum for confidence.
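These minimums can be enforced mechanically once the suite is kept as data (a sketch of that follows the example suite below). A minimal check; the `MINIMUMS` mapping and `check_suite_size` helper are illustrative, not part of any agent framework:
```python
# Minimum test-case counts from the table above (illustrative constants).
MINIMUMS = {"reviewer": 6, "analyzer": 5, "decision": 4, "planning": 5}

def check_suite_size(agent_type: str, suite: list) -> None:
    """Refuse to run a suite that is smaller than the documented minimum."""
    required = MINIMUMS.get(agent_type)
    if required is None:
        raise ValueError(f"unknown agent type: {agent_type}")
    if len(suite) < required:
        raise ValueError(
            f"{agent_type} agents need at least {required} test cases, got {len(suite)}"
        )
```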
### Example Test Suite for Code Reviewer
```markdown
## Test Case 1: Known SQL Injection
**Input:** Function with string concatenation in SQL query
**Expected:** CRITICAL finding, references OWASP A03:2021
**Actual:** [Run agent, record output]
## Test Case 2: Clean Authentication
**Input:** Well-implemented JWT validation with proper error handling
**Expected:** No findings or LOW-severity suggestions only
**Actual:** [Run agent, record output]
## Test Case 3: Ambiguous Error Handling
**Input:** Error caught but only logged, not re-thrown
**Expected:** MEDIUM finding about silent failures
**Actual:** [Run agent, record output]
## Test Case 4: Empty File
**Input:** Empty source file
**Expected:** Graceful handling, no crash, maybe LOW finding
**Actual:** [Run agent, record output]
```
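Keeping the same cases as structured data makes every run comparable against one baseline. A minimal sketch; the field names (`input`, `expected_severity`, `expected_topic`) are hypothetical, not a required format:
```python
# The four reviewer test cases above, kept as data so runs are reproducible.
TEST_SUITE = [
    {
        "name": "known_sql_injection",
        "input": 'query = "SELECT * FROM users WHERE id = " + user_id',
        "expected_severity": "CRITICAL",
        "expected_topic": "SQL injection",
    },
    {
        "name": "clean_authentication",
        "input": "well-implemented JWT validation with proper error handling",
        "expected_severity": None,  # no findings, or LOW suggestions only
        "expected_topic": None,
    },
    {
        "name": "ambiguous_error_handling",
        "input": "error caught but only logged, not re-thrown",
        "expected_severity": "MEDIUM",
        "expected_topic": "silent failure",
    },
    {
        "name": "empty_file",
        "input": "",
        "expected_severity": None,  # graceful handling, maybe LOW
        "expected_topic": None,
    },
]
```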
### Running the Test
````markdown
Use Task tool to dispatch agent:
Task(
subagent_type="ring-default:code-reviewer",
prompt="""
Review this code for security issues:
```python
def get_user(user_id):
query = "SELECT * FROM users WHERE id = " + user_id
return db.execute(query)
```
Provide findings with severity levels.
"""
)
````
**Document exact output.** Don't summarize - capture verbatim.
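A sketch of verbatim capture, assuming a hypothetical `dispatch_agent` callable that wraps the Task tool and returns the agent's raw text:
```python
import json
from datetime import datetime, timezone

def record_red_phase(test_suite, dispatch_agent, log_path="red_phase_results.jsonl"):
    """Run every test input and store the agent's exact output, unedited."""
    with open(log_path, "a", encoding="utf-8") as log:
        for case in test_suite:
            raw_output = dispatch_agent(case["input"])  # capture verbatim, never summarize
            log.write(json.dumps({
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "test_case": case["name"],
                "expected_severity": case.get("expected_severity"),
                "actual_output": raw_output,
            }) + "\n")
```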
## GREEN Phase: Fix Agent Definition (Make Tests Pass)
Write/update agent definition addressing specific failures documented in RED phase.
**Common fixes:**
| Failure Type | Fix Approach |
|--------------|--------------|
| Missing findings | Add explicit instructions to check for X |
| Wrong severity | Add severity calibration examples |
| Bad output format | Add output schema with examples |
| False positives | Add "don't flag X when Y" instructions |
| Incomplete analysis | Add "always check A, B, C" checklist |
### Example Fix: Severity Calibration
**RED Phase Failure:**
```
Agent marked hardcoded password as MEDIUM instead of CRITICAL
```
**GREEN Phase Fix (add to agent definition):**
```markdown
## Severity Calibration
**CRITICAL:** Immediate exploitation possible
- Hardcoded secrets (passwords, API keys, tokens)
- SQL injection with user input
- Authentication bypass
**HIGH:** Exploitation requires additional steps
- Missing input validation
- Improper error handling exposing internals
**MEDIUM:** Security weakness, not directly exploitable
- Missing rate limiting
- Verbose error messages
**LOW:** Best practice violations
- Missing security headers
- Outdated dependencies (no known CVEs)
```
### Re-run Tests
After fixing, re-run ALL test cases:
```markdown
## Test Results After Fix
| Test Case | Expected | Actual | Pass/Fail |
|-----------|----------|--------|-----------|
| SQL Injection | CRITICAL | CRITICAL | ✅ PASS |
| Clean Auth | No findings | No findings | ✅ PASS |
| Ambiguous Error | MEDIUM | MEDIUM | ✅ PASS |
| Empty File | Graceful | Graceful | ✅ PASS |
```
If any test fails: continue fixing, re-test.
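A sketch of this re-run step; `dispatch_agent` and `extract_severity` are assumed helpers (the latter pulls the reported severity, or `None`, out of the raw output):
```python
def rerun_suite(test_suite, dispatch_agent, extract_severity):
    """Re-run every test case after a fix and report pass/fail per case."""
    failures = []
    for case in test_suite:
        actual = extract_severity(dispatch_agent(case["input"]))
        status = "PASS" if actual == case.get("expected_severity") else "FAIL"
        print(f"{case['name']:<28} expected={case.get('expected_severity')} "
              f"actual={actual} -> {status}")
        if status == "FAIL":
            failures.append(case["name"])
    return failures  # non-empty means: keep fixing, then re-test everything
```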
## VERIFY GREEN: Output Verification
**Goal:** Confirm agent produces correct, well-structured outputs consistently.
### Output Schema Compliance
If agent has defined output schema, verify compliance:
```markdown
## Expected Schema
- Summary (1-2 sentences)
- Findings (array of {severity, location, description, recommendation})
- Overall assessment (PASS/FAIL with conditions)
## Actual Output Analysis
- Summary: ✅ Present, correct format
- Findings: ✅ Array, all fields present
- Overall assessment: ❌ Missing conditions for FAIL
```
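Schema compliance can be checked structurally before judging content. A minimal sketch against the expected schema above; the dictionary keys are assumptions about how the agent's output gets parsed:
```python
REQUIRED_FINDING_FIELDS = {"severity", "location", "description", "recommendation"}

def check_schema(output: dict) -> list[str]:
    """Return a list of structural problems for one parsed agent output."""
    problems = []
    if not output.get("summary"):
        problems.append("missing summary")
    findings = output.get("findings")
    if not isinstance(findings, list):
        problems.append("findings is not an array")
    else:
        for i, finding in enumerate(findings):
            missing = REQUIRED_FINDING_FIELDS - set(finding)
            if missing:
                problems.append(f"finding {i} missing fields: {sorted(missing)}")
    assessment = str(output.get("overall_assessment", ""))
    if assessment.startswith("FAIL") and "condition" not in assessment.lower():
        problems.append("FAIL assessment has no conditions")
    return problems
```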
### Accuracy Metrics
Track agent accuracy across test suite:
| Metric | Target | Actual |
|--------|--------|--------|
| True Positives (found real issues) | 100% | [X]% |
| False Positives (flagged non-issues) | <10% | [X]% |
| False Negatives (missed real issues) | <5% | [X]% |
| Severity Accuracy | >90% | [X]% |
| Schema Compliance | 100% | [X]% |
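Once expected and reported issues are recorded per test case, these rates fall out of simple counts. A sketch, assuming each result carries sets of expected and reported issue ids plus severity/schema flags:
```python
def accuracy_metrics(results):
    """Compute the metrics table from per-case results (field names are assumed)."""
    expected = sum(len(r["expected_issues"]) for r in results)
    reported = sum(len(r["reported_issues"]) for r in results)
    true_pos = sum(len(r["expected_issues"] & r["reported_issues"]) for r in results)
    return {
        "true_positive_rate": true_pos / expected if expected else 1.0,
        "false_positive_rate": (reported - true_pos) / reported if reported else 0.0,
        "false_negative_rate": (expected - true_pos) / expected if expected else 0.0,
        "severity_accuracy": sum(r["severity_matches"] for r in results) / max(true_pos, 1),
        "schema_compliance": sum(r["schema_ok"] for r in results) / len(results),
    }
```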
### Consistency Testing
Run same test input 3 times. Outputs should be consistent:
```markdown
## Consistency Test: SQL Injection Input
Run 1: CRITICAL, SQL injection, line 3
Run 2: CRITICAL, SQL injection, line 3
Run 3: CRITICAL, SQL injection, line 3
Consistency: ✅ 100% (all runs identical findings)
```
Inconsistency indicates agent definition is ambiguous.
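A sketch of the consistency check; `dispatch_agent` and `extract_findings` are assumed helpers, the latter normalizing each finding into a comparable tuple:
```python
def consistency_check(input_text, dispatch_agent, extract_findings, runs=3):
    """Dispatch the identical input several times and diff the findings."""
    per_run = []
    for _ in range(runs):
        # e.g., normalize each finding to (severity, title, line) before comparing
        per_run.append(frozenset(extract_findings(dispatch_agent(input_text))))
    consistent = len(set(per_run)) == 1
    if not consistent:
        print("Runs disagree - the agent definition is likely ambiguous")
    return consistent
```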
## REFACTOR Phase: Edge Cases and Robustness
Agent passes basic tests? Now test edge cases.
### Edge Case Categories
| Category | Test Cases |
|----------|------------|
| **Empty/Null** | Empty file, null input, whitespace only |
| **Large** | 10K line file, deeply nested code |
| **Unusual** | Minified code, generated code, config files |
| **Multi-language** | Mixed JS/TS, embedded SQL, templates |
| **Ambiguous** | Code that could be good or bad depending on context |
### Stress Testing
```markdown
## Stress Test: Large File
**Input:** 5000-line file with 20 known issues scattered throughout
**Expected:** All 20 issues found, reasonable response time
**Actual:** [Run agent, record output]
## Stress Test: Complex Nesting
**Input:** 15-level deep callback hell
**Expected:** Findings about complexity, maintainability
**Actual:** [Run agent, record output]
```
### Ambiguity Testing
````markdown
## Ambiguity Test: Context-Dependent Security
**Input:**
```python
# Internal admin tool, not exposed to users
password = "admin123" # Default for local dev
```
**Expected:** Agent should note context but still flag
**Actual:** [Run agent, record output]
**Analysis:** Does agent handle nuance appropriately?
````
### Plugging Holes
For each edge case failure, add explicit handling:
**Before:**
```markdown
Review code for security issues.
```
**After:**
```markdown
Review code for security issues.
**Edge case handling:**
- Empty files: Return "No code to review" with PASS
- Large files (>5K lines): Focus on high-risk patterns first
- Minified code: Note limitations, review what's readable
- Context comments: Consider but don't use to dismiss issues
```
## Testing Parallel Agent Workflows
When agents run in parallel (like 3 reviewers), test the combined workflow:
### Parallel Consistency
```markdown
## Parallel Test: Same Input to All Reviewers
Input: Authentication module with mixed issues
| Reviewer | Findings | Overlap |
|----------|----------|---------|
| code-reviewer | 5 findings | - |
| business-logic-reviewer | 3 findings | 1 shared |
| security-reviewer | 4 findings | 2 shared |
Analysis:
- Total unique findings: 9
- Appropriate overlap (security issues found by both security and code reviewer)
- No contradictions
```
### Aggregation Testing
```markdown
## Aggregation Test: Severity Consistency
Same issue found by multiple reviewers:
| Reviewer | Finding | Severity |
|----------|---------|----------|
| code-reviewer | Missing null check | MEDIUM |
| business-logic-reviewer | Missing null check | HIGH |
Problem: Inconsistent severity for same issue
Fix: Align severity calibration across all reviewers
```
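Severity disagreements on the same finding can be detected mechanically once reviewer outputs are parsed. A sketch, assuming findings keyed by `location` and `title`:
```python
from collections import defaultdict

def find_severity_conflicts(reviewer_outputs):
    """Flag findings that different reviewers report with different severities.

    `reviewer_outputs` is an assumed mapping of reviewer name to a list of
    findings, each a dict with `location`, `title`, and `severity`.
    """
    by_issue = defaultdict(dict)  # (location, title) -> {reviewer: severity}
    for reviewer, findings in reviewer_outputs.items():
        for f in findings:
            by_issue[(f["location"], f["title"])][reviewer] = f["severity"]
    return {
        issue: severities
        for issue, severities in by_issue.items()
        if len(set(severities.values())) > 1
    }  # non-empty result: align severity calibration across reviewers
```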
## Agent Testing Checklist
Before deploying agent, verify you followed RED-GREEN-REFACTOR:
**RED Phase:**
- [ ] Created test inputs (known issues, clean code, edge cases)
- [ ] Ran agent with test inputs
- [ ] Documented failures verbatim (missing findings, wrong severity, bad format)
**GREEN Phase:**
- [ ] Updated agent definition addressing specific failures
- [ ] Re-ran test inputs
- [ ] All basic tests now pass
**REFACTOR Phase:**
- [ ] Tested edge cases (empty, large, unusual, ambiguous)
- [ ] Tested stress scenarios (many issues, complex code)
- [ ] Added explicit edge case handling to definition
- [ ] Verified consistency (multiple runs produce same results)
- [ ] Verified schema compliance
- [ ] Tested parallel workflow integration (if applicable)
- [ ] Re-ran ALL tests after each change
**Metrics (for reviewer agents):**
- [ ] True positive rate: >95%
- [ ] False positive rate: <10%
- [ ] False negative rate: <5%
- [ ] Severity accuracy: >90%
- [ ] Schema compliance: 100%
- [ ] Consistency: >95%
## Prohibited Testing Shortcuts
**You CANNOT substitute proper testing with:**
| Shortcut | Why It Fails |
|----------|--------------|
| Reading agent definition carefully | Reading ≠ executing. Must run agent with inputs. |
| Manual testing in Claude UI | Ad-hoc ≠ reproducible. No baseline documented. |
| "Looks good to me" review | Visual inspection misses runtime failures. |
| Basing on proven template | Templates need validation for YOUR use case. |
| Expert prompt engineering knowledge | Expertise doesn't prevent bugs. Tests do. |
| Testing after first production use | Production is not QA. Test before deployment. |
| Monitoring production for issues | Reactive ≠ proactive. Catch issues before users do. |
| Deploy now, test in parallel | Parallel testing still means untested code in production. |
**ALL require running agent with documented test inputs and comparing outputs.**
## Testing Agent Modifications
**EVERY agent edit requires re-running the FULL test suite:**
| Change Type | Required Action |
|-------------|-----------------|
| Prompt wording changes | Full re-test |
| Severity calibration updates | Full re-test |
| Output schema modifications | Full re-test |
| Adding edge case handling | Full re-test |
| "Small" one-line changes | Full re-test |
| Typo fixes in prompt | Full re-test |
**"Small change" is not an exception.** One-line prompt changes can completely alter LLM behavior. Re-test always.
## Common Mistakes
**❌ Testing with only "happy path" inputs**
Agent works with obvious issues but misses subtle ones.
✅ Fix: Include ambiguous cases and edge cases in test suite.
**❌ Not documenting exact outputs**
"Agent was wrong" doesn't tell you what to fix.
✅ Fix: Capture agent output verbatim, compare to expected.
**❌ Fixing without re-running all tests**
Fix one issue, break another.
✅ Fix: Re-run entire test suite after each change.
**❌ Testing single agent in isolation when used in parallel**
Individual agents work, but combined workflow fails.
✅ Fix: Test parallel dispatch and output aggregation.
**❌ Not testing consistency**
Agent gives different answers for same input.
✅ Fix: Run same input 3+ times, verify consistent output.
**❌ Skipping severity calibration**
Agent finds issues but severity is inconsistent.
✅ Fix: Add explicit severity examples to agent definition.
**❌ Not testing edge cases**
Agent works for normal code, crashes on edge cases.
✅ Fix: Test empty, large, unusual, and ambiguous inputs.
**❌ Single test case validation**
"One test passed" proves nothing about agent behavior.
✅ Fix: Minimum 4-6 test cases per agent type.
**❌ Manual UI testing as substitute**
Ad-hoc testing doesn't create reproducible baselines.
✅ Fix: Document all test inputs and expected outputs.
**❌ Skipping re-test for "small" changes**
One-line prompt changes can break everything.
✅ Fix: Re-run full suite after ANY modification.
## Rationalization Table
| Excuse | Reality |
|--------|---------|
| "Agent prompt is obviously correct" | Obvious prompts fail in practice. Test proves correctness. |
| "Tested manually in Claude UI" | Ad-hoc ≠ reproducible. No baseline documented. |
| "One test case passed" | Sample size = 1 proves nothing. Need 4-6 cases minimum. |
| "Will test after first production use" | Production is not QA. Test before deployment. Always. |
| "Reading prompt is sufficient review" | Reading ≠ executing. Must run agent with inputs. |
| "Changes are small, re-test unnecessary" | Small changes cause big failures. Re-run full suite. |
| "Based agent on proven template" | Templates need validation for your use case. Test anyway. |
| "Expert in prompt engineering" | Expertise doesn't prevent bugs. Tests do. |
| "Production is down, no time to test" | Deploying untested fix may make outage worse. Test first. |
| "Deploy now, test in parallel" | Untested code in production = unknown behavior. Unacceptable. |
| "Quick smoke test is enough" | Smoke test misses edge cases. Full suite required. |
| "Simple pass-through agent" | You cannot self-determine exemptions. Get human approval. |
## Red Flags - STOP and Test Now
If you catch yourself thinking ANY of these, STOP. You're about to violate the Iron Law:
- Agent edited but tests not re-run
- "Looks good" without execution
- Single test case only
- No documented baseline
- No edge case testing
- Manual verification only
- "Will test in production"
- "Based on template, should work"
- "Just a small prompt change"
- "No time to test properly"
- "One quick test is enough"
- "Agent is simple, obviously works"
- "Expert intuition says it's fine"
- "Production is down, skip testing"
- "Deploy now, test in parallel"
**All of these mean: STOP. Run full RED-GREEN-REFACTOR cycle NOW.**
## Quick Reference (TDD Cycle for Agents)
| TDD Phase | Agent Testing | Success Criteria |
|-----------|---------------|------------------|
| **RED** | Run with test inputs | Document exact output failures |
| **Verify RED** | Capture verbatim | Have specific issues to fix |
| **GREEN** | Fix agent definition | All basic tests pass |
| **Verify GREEN** | Re-run all tests | No regressions |
| **REFACTOR** | Test edge cases | Robust under all conditions |
| **Stay GREEN** | Full test suite | All tests pass, metrics met |
## Example: Testing a New Reviewer Agent
### Step 1: Create Test Suite
```markdown
# security-reviewer Test Suite
## Test 1: SQL Injection (Known Critical)
Input: `query = "SELECT * FROM users WHERE id = " + user_id`
Expected: CRITICAL, SQL injection, OWASP A03:2021
## Test 2: Parameterized Query (Clean)
Input: `query = "SELECT * FROM users WHERE id = ?"; db.execute(query, [user_id])`
Expected: No security findings
## Test 3: Hardcoded Secret (Known Critical)
Input: `API_KEY = "sk-1234567890abcdef"`
Expected: CRITICAL, hardcoded secret
## Test 4: Environment Variable (Clean)
Input: `API_KEY = os.environ.get("API_KEY")`
Expected: No security findings (or LOW suggestion for validation)
## Test 5: Empty File
Input: (empty)
Expected: Graceful handling
## Test 6: Ambiguous - Internal Tool
Input: `password = "dev123" # Local development only`
Expected: Flag but acknowledge context
```
### Step 2: Run RED Phase
```
Test 1: ❌ Found issue but marked HIGH not CRITICAL
Test 2: ✅ No findings
Test 3: ❌ Missed entirely
Test 4: ✅ No findings
Test 5: ❌ Error: "No code provided"
Test 6: ❌ Dismissed due to comment
```
### Step 3: GREEN Phase - Fix Definition
Add to agent:
1. Severity calibration with SQL injection = CRITICAL
2. Explicit check for hardcoded secrets pattern
3. Empty file handling instruction
4. "Context comments don't dismiss security issues"
### Step 4: Re-run Tests
```
Test 1: ✅ CRITICAL
Test 2: ✅ No findings
Test 3: ✅ CRITICAL
Test 4: ✅ No findings
Test 5: ✅ "No code to review"
Test 6: ✅ Flagged with context acknowledgment
```
### Step 5: REFACTOR - Edge Cases
Add tests for: minified code, 10K line file, mixed languages, nested vulnerabilities.
Run, fix, repeat until all pass.
## The Bottom Line
**Agent testing IS TDD. Same principles, same cycle, same benefits.**
If you wouldn't deploy code without tests, don't deploy agents without testing them.
RED-GREEN-REFACTOR for agents works exactly like RED-GREEN-REFACTOR for code:
1. **RED:** See what's wrong (run with test inputs)
2. **GREEN:** Fix it (update agent definition)
3. **REFACTOR:** Make it robust (edge cases, consistency)
**Evidence before deployment. Always.**