---
name: testing-agents-with-subagents
description: |
  Agent testing methodology - run agents with test inputs, observe outputs,
  iterate until outputs are accurate and well-structured.

trigger: |
  - Before deploying a new agent
  - After editing an existing agent
  - Agent produces structured outputs that must be accurate

skip_when: |
  - Agent is simple passthrough → minimal testing needed
  - Agent already tested for this use case

related:
  complementary: [test-driven-development]
---

# Testing Agents With Subagents

## Overview

**Testing agents is TDD applied to AI worker definitions.**

You run agents with known test inputs (RED - observe incorrect outputs), fix the agent definition (GREEN - outputs now correct), then handle edge cases (REFACTOR - robust under all conditions).

**Core principle:** If you didn't run an agent with test inputs and verify its outputs, you don't know whether the agent works correctly.

**REQUIRED BACKGROUND:** You MUST understand `ring-default:test-driven-development` before using this skill. That skill defines the fundamental RED-GREEN-REFACTOR cycle. This skill provides agent-specific test formats (test inputs, output verification, accuracy metrics).

**Key difference from testing-skills-with-subagents:**
- **Skills** = instructions that guide behavior; test whether the agent follows rules under pressure
- **Agents** = separate Claude instances via the Task tool; test whether they produce correct outputs

## The Iron Law

```
NO AGENT DEPLOYMENT WITHOUT RED-GREEN-REFACTOR TESTING FIRST
```

About to deploy an agent without completing the test cycle? You have ONLY one option:

### STOP. TEST FIRST. THEN DEPLOY.

**You CANNOT:**
- ❌ "Deploy and monitor for issues"
- ❌ "Test with first real usage"
- ❌ "Quick smoke test is enough"
- ❌ "Tested manually in Claude UI"
- ❌ "One test case passed"
- ❌ "Agent prompt looks correct"
- ❌ "Based on working template"
- ❌ "Deploy now, test in parallel"
- ❌ "Production is down, no time to test"

**ZERO exceptions. Simple agent, expert confidence, time pressure, production outage - NONE override testing.**

**Why this is absolute:** Untested agents fail in production. Every time. The question is not IF but WHEN and HOW BADLY. A 20-minute test suite prevents hours of debugging and lost trust.

## When to Use

Test agents that:
- Analyze code/designs and produce findings (reviewers)
- Generate structured outputs (planners, analyzers)
- Make decisions or categorizations (severity, priority)
- Have defined output schemas that must be followed
- Are used in parallel workflows where consistency matters

**Test exemptions require explicit human partner approval:**
- Simple pass-through agents (just reformatting) - **only if human partner confirms**
- Agents without structured outputs - **only if human partner confirms**
- **You CANNOT self-determine test exemption**
- **When in doubt → TEST**

## TDD Mapping for Agent Testing

| TDD Phase | Agent Testing | What You Do |
|-----------|---------------|-------------|
| **RED** | Run with test inputs | Dispatch agent, observe incorrect/incomplete outputs |
| **Verify RED** | Document failures | Capture exact output issues verbatim |
| **GREEN** | Fix agent definition | Update prompt/schema to address failures |
| **Verify GREEN** | Re-run tests | Agent now produces correct outputs |
| **REFACTOR** | Test edge cases | Ambiguous inputs, empty inputs, complex scenarios |
| **Stay GREEN** | Re-verify all | Previous tests still pass after changes |

Same cycle as code TDD, different test format.

## RED Phase: Baseline Testing (Observe Failures)

**Goal:** Run the agent with known test inputs - observe what's wrong, document exact failures.

This is identical to TDD's "write failing test first" - you MUST see what the agent actually produces before fixing the definition.

**Process:**

- [ ] **Create test inputs** (known issues, edge cases, clean inputs)
- [ ] **Run agent** - dispatch via Task tool with test inputs
- [ ] **Compare outputs** - expected vs actual
- [ ] **Document failures** - missing findings, wrong severity, bad format
- [ ] **Identify patterns** - which input types cause failures?

### Test Input Categories

| Category | Purpose | Example |
|----------|---------|---------|
| **Known Issues** | Verify agent finds real problems | Code with SQL injection, hardcoded secrets |
| **Clean Inputs** | Verify no false positives | Well-written code with no issues |
| **Edge Cases** | Verify robustness | Empty files, huge files, unusual patterns |
| **Ambiguous Cases** | Verify judgment | Code that could go either way |
| **Severity Calibration** | Verify severity accuracy | Mix of critical, high, medium, low issues |

### Minimum Test Suite Requirements

Before deploying ANY agent, you MUST have:

| Agent Type | Minimum Test Cases | Required Coverage |
|------------|-------------------|-------------------|
| **Reviewer agents** | 6 tests | 2 known issues, 2 clean, 1 edge case, 1 ambiguous |
| **Analyzer agents** | 5 tests | 2 typical, 1 empty, 1 large, 1 malformed |
| **Decision agents** | 4 tests | 2 clear cases, 2 boundary cases |
| **Planning agents** | 5 tests | 2 standard, 1 complex, 1 minimal, 1 edge case |

**Fewer tests = incomplete testing = DO NOT DEPLOY.**

One test case proves nothing. Three tests are suspicious. Six tests are the minimum for confidence.

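Where it helps, the suite can live as data instead of prose so the coverage requirements stay checkable. A minimal sketch in Python; the `AgentTestCase` shape and its field names are illustrative assumptions, not part of this skill:

```python
from dataclasses import dataclass

@dataclass
class AgentTestCase:
    name: str        # e.g. "sql-injection"
    category: str    # known-issue | clean | edge | ambiguous
    input_text: str  # code or content handed to the agent
    expected: str    # expected findings/severity, in prose

# Minimum reviewer suite: 2 known issues, 2 clean, 1 edge case, 1 ambiguous
SUITE = [
    AgentTestCase("sql-injection", "known-issue",
                  'query = "SELECT * FROM users WHERE id = " + user_id',
                  "CRITICAL finding, SQL injection"),
    AgentTestCase("clean-auth", "clean",
                  "well-implemented JWT validation with proper error handling",
                  "No findings or LOW-severity suggestions only"),
    # ... remaining cases follow the coverage table above
]
```
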
### Example Test Suite for Code Reviewer

```markdown
## Test Case 1: Known SQL Injection
**Input:** Function with string concatenation in SQL query
**Expected:** CRITICAL finding, references OWASP A03:2021
**Actual:** [Run agent, record output]

## Test Case 2: Clean Authentication
**Input:** Well-implemented JWT validation with proper error handling
**Expected:** No findings or LOW-severity suggestions only
**Actual:** [Run agent, record output]

## Test Case 3: Ambiguous Error Handling
**Input:** Error caught but only logged, not re-thrown
**Expected:** MEDIUM finding about silent failures
**Actual:** [Run agent, record output]

## Test Case 4: Empty File
**Input:** Empty source file
**Expected:** Graceful handling, no crash, maybe LOW finding
**Actual:** [Run agent, record output]
```

### Running the Test

Use the Task tool to dispatch the agent:

````python
Task(
    subagent_type="ring-default:code-reviewer",
    prompt="""
Review this code for security issues:

```python
def get_user(user_id):
    query = "SELECT * FROM users WHERE id = " + user_id
    return db.execute(query)
```

Provide findings with severity levels.
""",
)
````

**Document exact output.** Don't summarize - capture verbatim.

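If you run many cases, a small harness keeps the verbatim-capture rule honest by recording every raw output beside its expected result. A sketch under two assumptions: a `dispatch_agent` wrapper around the Task tool (hypothetical) and the `AgentTestCase` shape from the earlier sketch:

```python
import json
from pathlib import Path

def run_suite(suite, subagent_type, out_path="baseline.json"):
    """Run every test case and store verbatim outputs as the RED baseline."""
    results = []
    for case in suite:
        raw_output = dispatch_agent(  # hypothetical wrapper around the Task tool
            subagent_type=subagent_type,
            prompt=f"Review this code for security issues:\n\n{case.input_text}",
        )
        results.append({
            "name": case.name,
            "expected": case.expected,
            "actual": raw_output,  # verbatim -- never summarized
        })
    Path(out_path).write_text(json.dumps(results, indent=2))
    return results
```
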
## GREEN Phase: Fix Agent Definition (Make Tests Pass)

Write or update the agent definition to address the specific failures documented in the RED phase.

**Common fixes:**

| Failure Type | Fix Approach |
|--------------|--------------|
| Missing findings | Add explicit instructions to check for X |
| Wrong severity | Add severity calibration examples |
| Bad output format | Add output schema with examples |
| False positives | Add "don't flag X when Y" instructions |
| Incomplete analysis | Add "always check A, B, C" checklist |

### Example Fix: Severity Calibration

**RED Phase Failure:**
```
Agent marked hardcoded password as MEDIUM instead of CRITICAL
```

**GREEN Phase Fix (add to agent definition):**
```markdown
## Severity Calibration

**CRITICAL:** Immediate exploitation possible
- Hardcoded secrets (passwords, API keys, tokens)
- SQL injection with user input
- Authentication bypass

**HIGH:** Exploitation requires additional steps
- Missing input validation
- Improper error handling exposing internals

**MEDIUM:** Security weakness, not directly exploitable
- Missing rate limiting
- Verbose error messages

**LOW:** Best practice violations
- Missing security headers
- Outdated dependencies (no known CVEs)
```

### Re-run Tests

After fixing, re-run ALL test cases:

```markdown
## Test Results After Fix

| Test Case | Expected | Actual | Pass/Fail |
|-----------|----------|--------|-----------|
| SQL Injection | CRITICAL | CRITICAL | PASS |
| Clean Auth | No findings | No findings | PASS |
| Ambiguous Error | MEDIUM | MEDIUM | PASS |
| Empty File | Graceful | Graceful | PASS |
```

If any test fails: continue fixing, re-test.

## VERIFY GREEN: Output Verification

**Goal:** Confirm the agent produces correct, well-structured outputs consistently.

### Output Schema Compliance

If the agent has a defined output schema, verify compliance:

```markdown
## Expected Schema
- Summary (1-2 sentences)
- Findings (array of {severity, location, description, recommendation})
- Overall assessment (PASS/FAIL with conditions)

## Actual Output Analysis
- Summary: ✅ Present, correct format
- Findings: ✅ Array, all fields present
- Overall assessment: ❌ Missing conditions for FAIL
```

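The structural half of this check can be automated. A minimal sketch, assuming the agent's output is parsed into JSON with the keys shown above (the exact key names are an assumption):

```python
REQUIRED_FINDING_FIELDS = {"severity", "location", "description", "recommendation"}

def check_schema(output: dict) -> list[str]:
    """Return a list of schema violations; an empty list means compliant."""
    violations = []
    if not output.get("summary"):
        violations.append("missing summary")
    for i, finding in enumerate(output.get("findings", [])):
        missing = REQUIRED_FINDING_FIELDS - finding.keys()
        if missing:
            violations.append(f"finding {i}: missing fields {sorted(missing)}")
    assessment = output.get("overall_assessment", "")
    if assessment.startswith("FAIL") and "condition" not in assessment.lower():
        violations.append("FAIL assessment missing conditions")
    return violations
```
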
### Accuracy Metrics

Track agent accuracy across the test suite:

| Metric | Target | Actual |
|--------|--------|--------|
| True Positives (found real issues) | 100% | [X]% |
| False Positives (flagged non-issues) | <10% | [X]% |
| False Negatives (missed real issues) | <5% | [X]% |
| Severity Accuracy | >90% | [X]% |
| Schema Compliance | 100% | [X]% |

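Once each result is hand-labeled as a true positive, false positive, or false negative, the table's rates are simple arithmetic. A sketch of one way to compute them; the labeling scheme is an assumption:

```python
def accuracy_metrics(graded: list[dict]) -> dict:
    """graded: one entry per expected/reported issue, labeled by hand.

    Each entry: {"kind": "tp" | "fp" | "fn", "severity_correct": bool | None}
    """
    tp = sum(1 for g in graded if g["kind"] == "tp")
    fp = sum(1 for g in graded if g["kind"] == "fp")
    fn = sum(1 for g in graded if g["kind"] == "fn")
    sev = [g["severity_correct"] for g in graded if g["severity_correct"] is not None]
    return {
        "true_positive_rate": tp / (tp + fn) if tp + fn else 1.0,
        "false_positive_rate": fp / (tp + fp) if tp + fp else 0.0,
        "false_negative_rate": fn / (tp + fn) if tp + fn else 0.0,
        "severity_accuracy": sum(sev) / len(sev) if sev else 1.0,
    }
```
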
### Consistency Testing

Run the same test input 3 times. Outputs should be consistent:

```markdown
## Consistency Test: SQL Injection Input

Run 1: CRITICAL, SQL injection, line 3
Run 2: CRITICAL, SQL injection, line 3
Run 3: CRITICAL, SQL injection, line 3

Consistency: 100% (all runs identical findings)
```

Inconsistency indicates the agent definition is ambiguous.

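A small loop makes this check repeatable. A sketch, reusing the hypothetical `dispatch_agent` wrapper from above; real comparisons should parse findings rather than diff raw text:

```python
def consistency_check(prompt: str, subagent_type: str, runs: int = 3) -> dict:
    """Dispatch the same input several times and compare outputs."""
    outputs = [
        dispatch_agent(subagent_type=subagent_type, prompt=prompt)  # hypothetical wrapper
        for _ in range(runs)
    ]
    # Crude comparison: exact text match. In practice, parse the findings and
    # compare (severity, location) pairs instead.
    unique = set(outputs)
    return {"runs": runs, "unique_outputs": len(unique), "consistent": len(unique) == 1}
```
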
## REFACTOR Phase: Edge Cases and Robustness

Agent passes basic tests? Now test edge cases.

### Edge Case Categories

| Category | Test Cases |
|----------|------------|
| **Empty/Null** | Empty file, null input, whitespace only |
| **Large** | 10K-line file, deeply nested code |
| **Unusual** | Minified code, generated code, config files |
| **Multi-language** | Mixed JS/TS, embedded SQL, templates |
| **Ambiguous** | Code that could be good or bad depending on context |

### Stress Testing

```markdown
## Stress Test: Large File

**Input:** 5000-line file with 20 known issues scattered throughout
**Expected:** All 20 issues found, reasonable response time
**Actual:** [Run agent, record output]

## Stress Test: Complex Nesting

**Input:** 15-level deep callback hell
**Expected:** Findings about complexity, maintainability
**Actual:** [Run agent, record output]
```

### Ambiguity Testing

````markdown
## Ambiguity Test: Context-Dependent Security

**Input:**
```python
# Internal admin tool, not exposed to users
password = "admin123"  # Default for local dev
```

**Expected:** Agent should note context but still flag
**Actual:** [Run agent, record output]

**Analysis:** Does agent handle nuance appropriately?
````

### Plugging Holes

For each edge case failure, add explicit handling:

**Before:**
```markdown
Review code for security issues.
```

**After:**
```markdown
Review code for security issues.

**Edge case handling:**
- Empty files: Return "No code to review" with PASS
- Large files (>5K lines): Focus on high-risk patterns first
- Minified code: Note limitations, review what's readable
- Context comments: Consider but don't use to dismiss issues
```

## Testing Parallel Agent Workflows

When agents run in parallel (like 3 reviewers), test the combined workflow:

### Parallel Consistency

```markdown
## Parallel Test: Same Input to All Reviewers

Input: Authentication module with mixed issues

| Reviewer | Findings | Overlap |
|----------|----------|---------|
| code-reviewer | 5 findings | - |
| business-logic-reviewer | 3 findings | 1 shared |
| security-reviewer | 4 findings | 2 shared |

Analysis:
- Total unique findings: 9
- Appropriate overlap (security issues found by both security and code reviewer)
- No contradictions
```

### Aggregation Testing

```markdown
## Aggregation Test: Severity Consistency

Same issue found by multiple reviewers:

| Reviewer | Finding | Severity |
|----------|---------|----------|
| code-reviewer | Missing null check | MEDIUM |
| business-logic-reviewer | Missing null check | HIGH |

Problem: Inconsistent severity for the same issue
Fix: Align severity calibration across all reviewers
```

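Overlap counts and severity conflicts can both be detected mechanically once findings are parsed. A sketch, assuming each reviewer's findings reduce to `(issue, severity)` pairs (an illustrative shape, not a defined schema):

```python
from collections import defaultdict

def aggregate(findings_by_reviewer: dict[str, list[tuple[str, str]]]) -> dict:
    """findings_by_reviewer: reviewer name -> [(issue, severity), ...]."""
    severities = defaultdict(dict)  # issue -> {reviewer: severity}
    for reviewer, findings in findings_by_reviewer.items():
        for issue, severity in findings:
            severities[issue][reviewer] = severity

    # Severity conflict: the same issue rated differently by different reviewers
    conflicts = {issue: by_rev for issue, by_rev in severities.items()
                 if len(set(by_rev.values())) > 1}
    return {"unique_findings": len(severities), "severity_conflicts": conflicts}
```
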
## Agent Testing Checklist

Before deploying an agent, verify you followed RED-GREEN-REFACTOR:

**RED Phase:**
- [ ] Created test inputs (known issues, clean code, edge cases)
- [ ] Ran agent with test inputs
- [ ] Documented failures verbatim (missing findings, wrong severity, bad format)

**GREEN Phase:**
- [ ] Updated agent definition addressing specific failures
- [ ] Re-ran test inputs
- [ ] All basic tests now pass

**REFACTOR Phase:**
- [ ] Tested edge cases (empty, large, unusual, ambiguous)
- [ ] Tested stress scenarios (many issues, complex code)
- [ ] Added explicit edge case handling to definition
- [ ] Verified consistency (multiple runs produce same results)
- [ ] Verified schema compliance
- [ ] Tested parallel workflow integration (if applicable)
- [ ] Re-ran ALL tests after each change

**Metrics (for reviewer agents):**
- [ ] True positive rate: >95%
- [ ] False positive rate: <10%
- [ ] False negative rate: <5%
- [ ] Severity accuracy: >90%
- [ ] Schema compliance: 100%
- [ ] Consistency: >95%

## Prohibited Testing Shortcuts

**You CANNOT substitute proper testing with:**

| Shortcut | Why It Fails |
|----------|--------------|
| Reading agent definition carefully | Reading ≠ executing. Must run agent with inputs. |
| Manual testing in Claude UI | Ad-hoc ≠ reproducible. No baseline documented. |
| "Looks good to me" review | Visual inspection misses runtime failures. |
| Basing on proven template | Templates need validation for YOUR use case. |
| Expert prompt engineering knowledge | Expertise doesn't prevent bugs. Tests do. |
| Testing after first production use | Production is not QA. Test before deployment. |
| Monitoring production for issues | Reactive ≠ proactive. Catch issues before users do. |
| Deploy now, test in parallel | Parallel testing still means untested code in production. |

**ALL require running the agent with documented test inputs and comparing outputs.**

## Testing Agent Modifications

**EVERY agent edit requires re-running the FULL test suite:**

| Change Type | Required Action |
|-------------|-----------------|
| Prompt wording changes | Full re-test |
| Severity calibration updates | Full re-test |
| Output schema modifications | Full re-test |
| Adding edge case handling | Full re-test |
| "Small" one-line changes | Full re-test |
| Typo fixes in prompt | Full re-test |

**"Small change" is not an exception.** One-line prompt changes can completely alter LLM behavior. Re-test always.

## Common Mistakes

**❌ Testing with only "happy path" inputs**
Agent works with obvious issues but misses subtle ones.
Fix: Include ambiguous cases and edge cases in the test suite.

**❌ Not documenting exact outputs**
"Agent was wrong" doesn't tell you what to fix.
Fix: Capture agent output verbatim, compare to expected.

**❌ Fixing without re-running all tests**
Fix one issue, break another.
Fix: Re-run the entire test suite after each change.

**❌ Testing a single agent in isolation when used in parallel**
Individual agents work, but the combined workflow fails.
Fix: Test parallel dispatch and output aggregation.

**❌ Not testing consistency**
Agent gives different answers for the same input.
Fix: Run the same input 3+ times, verify consistent output.

**❌ Skipping severity calibration**
Agent finds issues but severity is inconsistent.
Fix: Add explicit severity examples to the agent definition.

**❌ Not testing edge cases**
Agent works for normal code, crashes on edge cases.
Fix: Test empty, large, unusual, and ambiguous inputs.

**❌ Single test case validation**
"One test passed" proves nothing about agent behavior.
Fix: Minimum 4-6 test cases per agent type.

**❌ Manual UI testing as substitute**
Ad-hoc testing doesn't create reproducible baselines.
Fix: Document all test inputs and expected outputs.

**❌ Skipping re-test for "small" changes**
One-line prompt changes can break everything.
Fix: Re-run the full suite after ANY modification.

## Rationalization Table

| Excuse | Reality |
|--------|---------|
| "Agent prompt is obviously correct" | Obvious prompts fail in practice. Test proves correctness. |
| "Tested manually in Claude UI" | Ad-hoc ≠ reproducible. No baseline documented. |
| "One test case passed" | Sample size = 1 proves nothing. Need 4-6 cases minimum. |
| "Will test after first production use" | Production is not QA. Test before deployment. Always. |
| "Reading prompt is sufficient review" | Reading ≠ executing. Must run agent with inputs. |
| "Changes are small, re-test unnecessary" | Small changes cause big failures. Re-run full suite. |
| "Based agent on proven template" | Templates need validation for your use case. Test anyway. |
| "Expert in prompt engineering" | Expertise doesn't prevent bugs. Tests do. |
| "Production is down, no time to test" | Deploying an untested fix may make the outage worse. Test first. |
| "Deploy now, test in parallel" | Untested code in production = unknown behavior. Unacceptable. |
| "Quick smoke test is enough" | Smoke test misses edge cases. Full suite required. |
| "Simple pass-through agent" | You cannot self-determine exemptions. Get human approval. |

## Red Flags - STOP and Test Now

If you catch yourself thinking ANY of these, STOP. You're about to violate the Iron Law:

- Agent edited but tests not re-run
- "Looks good" without execution
- Single test case only
- No documented baseline
- No edge case testing
- Manual verification only
- "Will test in production"
- "Based on template, should work"
- "Just a small prompt change"
- "No time to test properly"
- "One quick test is enough"
- "Agent is simple, obviously works"
- "Expert intuition says it's fine"
- "Production is down, skip testing"
- "Deploy now, test in parallel"

**All of these mean: STOP. Run the full RED-GREEN-REFACTOR cycle NOW.**

## Quick Reference (TDD Cycle for Agents)

| TDD Phase | Agent Testing | Success Criteria |
|-----------|---------------|------------------|
| **RED** | Run with test inputs | Document exact output failures |
| **Verify RED** | Capture verbatim | Have specific issues to fix |
| **GREEN** | Fix agent definition | All basic tests pass |
| **Verify GREEN** | Re-run all tests | No regressions |
| **REFACTOR** | Test edge cases | Robust under all conditions |
| **Stay GREEN** | Full test suite | All tests pass, metrics met |

## Example: Testing a New Reviewer Agent

### Step 1: Create Test Suite

```markdown
# security-reviewer Test Suite

## Test 1: SQL Injection (Known Critical)
Input: `query = "SELECT * FROM users WHERE id = " + user_id`
Expected: CRITICAL, SQL injection, OWASP A03:2021

## Test 2: Parameterized Query (Clean)
Input: `query = "SELECT * FROM users WHERE id = ?"; db.execute(query, [user_id])`
Expected: No security findings

## Test 3: Hardcoded Secret (Known Critical)
Input: `API_KEY = "sk-1234567890abcdef"`
Expected: CRITICAL, hardcoded secret

## Test 4: Environment Variable (Clean)
Input: `API_KEY = os.environ.get("API_KEY")`
Expected: No security findings (or LOW suggestion for validation)

## Test 5: Empty File
Input: (empty)
Expected: Graceful handling

## Test 6: Ambiguous - Internal Tool
Input: `password = "dev123" # Local development only`
Expected: Flag but acknowledge context
```

### Step 2: Run RED Phase

```
Test 1: ❌ Found issue but marked HIGH, not CRITICAL
Test 2: ✅ No findings
Test 3: ❌ Missed entirely
Test 4: ✅ No findings
Test 5: ❌ Error: "No code provided"
Test 6: ❌ Dismissed due to comment
```

### Step 3: GREEN Phase - Fix Definition

Add to the agent:
1. Severity calibration with SQL injection = CRITICAL
2. Explicit check for hardcoded secret patterns
3. Empty file handling instruction
4. "Context comments don't dismiss security issues"

### Step 4: Re-run Tests

```
Test 1: ✅ CRITICAL
Test 2: ✅ No findings
Test 3: ✅ CRITICAL
Test 4: ✅ No findings
Test 5: ✅ "No code to review"
Test 6: ✅ Flagged with context acknowledgment
```

### Step 5: REFACTOR - Edge Cases

Add tests for: minified code, a 10K-line file, mixed languages, nested vulnerabilities.

Run, fix, repeat until all pass.

## The Bottom Line

**Agent testing IS TDD. Same principles, same cycle, same benefits.**

If you wouldn't deploy code without tests, don't deploy agents without testing them.

RED-GREEN-REFACTOR for agents works exactly like RED-GREEN-REFACTOR for code:
1. **RED:** See what's wrong (run with test inputs)
2. **GREEN:** Fix it (update agent definition)
3. **REFACTOR:** Make it robust (edge cases, consistency)

**Evidence before deployment. Always.**