Initial commit

Zhongwei Li
2025-11-30 08:37:11 +08:00
commit 20b36ca9b1
56 changed files with 14530 additions and 0 deletions

---
name: testing-agents-with-subagents
description: |
  Agent testing methodology - run agents with test inputs, observe outputs,
  iterate until outputs are accurate and well-structured.
trigger: |
  - Before deploying a new agent
  - After editing an existing agent
  - Agent produces structured outputs that must be accurate
skip_when: |
  - Agent is simple passthrough → minimal testing needed
  - Agent already tested for this use case
related:
  complementary: [test-driven-development]
---
# Testing Agents With Subagents
## Overview
**Testing agents is TDD applied to AI worker definitions.**
You run agents with known test inputs (RED - observe incorrect outputs), fix the agent definition (GREEN - outputs now correct), then handle edge cases (REFACTOR - robust under all conditions).
**Core principle:** If you didn't run an agent with test inputs and verify its outputs, you don't know if the agent works correctly.
**REQUIRED BACKGROUND:** You MUST understand `ring-default:test-driven-development` before using this skill. That skill defines the fundamental RED-GREEN-REFACTOR cycle. This skill provides agent-specific test formats (test inputs, output verification, accuracy metrics).
**Key difference from testing-skills-with-subagents:**
- **Skills** = instructions that guide behavior; test if agent follows rules under pressure
- **Agents** = separate Claude instances via Task tool; test if they produce correct outputs
## The Iron Law
```
NO AGENT DEPLOYMENT WITHOUT RED-GREEN-REFACTOR TESTING FIRST
```
About to deploy an agent without completing the test cycle? You have ONLY one option:
### STOP. TEST FIRST. THEN DEPLOY.
**You CANNOT:**
- ❌ "Deploy and monitor for issues"
- ❌ "Test with first real usage"
- ❌ "Quick smoke test is enough"
- ❌ "Tested manually in Claude UI"
- ❌ "One test case passed"
- ❌ "Agent prompt looks correct"
- ❌ "Based on working template"
- ❌ "Deploy now, test in parallel"
- ❌ "Production is down, no time to test"
**ZERO exceptions. Simple agent, expert confidence, time pressure, production outage - NONE override testing.**
**Why this is absolute:** Untested agents fail in production. Every time. The question is not IF but WHEN and HOW BADLY. A 20-minute test suite prevents hours of debugging and lost trust.
## When to Use
Test agents that:
- Analyze code/designs and produce findings (reviewers)
- Generate structured outputs (planners, analyzers)
- Make decisions or categorizations (severity, priority)
- Have defined output schemas that must be followed
- Are used in parallel workflows where consistency matters
**Test exemptions require explicit human partner approval:**
- Simple pass-through agents (just reformatting) - **only if human partner confirms**
- Agents without structured outputs - **only if human partner confirms**
- **You CANNOT self-determine test exemption**
- **When in doubt → TEST**
## TDD Mapping for Agent Testing
| TDD Phase | Agent Testing | What You Do |
|-----------|---------------|-------------|
| **RED** | Run with test inputs | Dispatch agent, observe incorrect/incomplete outputs |
| **Verify RED** | Document failures | Capture exact output issues verbatim |
| **GREEN** | Fix agent definition | Update prompt/schema to address failures |
| **Verify GREEN** | Re-run tests | Agent now produces correct outputs |
| **REFACTOR** | Test edge cases | Ambiguous inputs, empty inputs, complex scenarios |
| **Stay GREEN** | Re-verify all | Previous tests still pass after changes |
Same cycle as code TDD, different test format.
## RED Phase: Baseline Testing (Observe Failures)
**Goal:** Run agent with known test inputs - observe what's wrong, document exact failures.
This is identical to TDD's "write failing test first" - you MUST see what the agent actually produces before fixing the definition.
**Process:**
- [ ] **Create test inputs** (known issues, edge cases, clean inputs)
- [ ] **Run agent** - dispatch via Task tool with test inputs
- [ ] **Compare outputs** - expected vs actual
- [ ] **Document failures** - missing findings, wrong severity, bad format
- [ ] **Identify patterns** - which input types cause failures?
### Test Input Categories
| Category | Purpose | Example |
|----------|---------|---------|
| **Known Issues** | Verify agent finds real problems | Code with SQL injection, hardcoded secrets |
| **Clean Inputs** | Verify no false positives | Well-written code with no issues |
| **Edge Cases** | Verify robustness | Empty files, huge files, unusual patterns |
| **Ambiguous Cases** | Verify judgment | Code that could go either way |
| **Severity Calibration** | Verify severity accuracy | Mix of critical, high, medium, low issues |
### Minimum Test Suite Requirements
Before deploying ANY agent, you MUST have:
| Agent Type | Minimum Test Cases | Required Coverage |
|------------|-------------------|-------------------|
| **Reviewer agents** | 6 tests | 2 known issues, 2 clean, 1 edge case, 1 ambiguous |
| **Analyzer agents** | 5 tests | 2 typical, 1 empty, 1 large, 1 malformed |
| **Decision agents** | 4 tests | 2 clear cases, 2 boundary cases |
| **Planning agents** | 5 tests | 2 standard, 1 complex, 1 minimal, 1 edge case |
**Fewer tests = incomplete testing = DO NOT DEPLOY.**
One test case proves nothing. Three tests are suspicious. Four to six tests, per the table above, are the minimum for confidence.
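These minimums can be enforced mechanically once the suite is kept as data (a sketch of that follows the example suite below). A minimal check; the `MINIMUMS` mapping and `check_suite_size` helper are illustrative, not part of any agent framework:
```python
# Minimum test-case counts from the table above (illustrative constants).
MINIMUMS = {"reviewer": 6, "analyzer": 5, "decision": 4, "planning": 5}

def check_suite_size(agent_type: str, suite: list) -> None:
    """Refuse to run a suite that is smaller than the documented minimum."""
    required = MINIMUMS.get(agent_type)
    if required is None:
        raise ValueError(f"unknown agent type: {agent_type}")
    if len(suite) < required:
        raise ValueError(
            f"{agent_type} agents need at least {required} test cases, got {len(suite)}"
        )
```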
### Example Test Suite for Code Reviewer
```markdown
## Test Case 1: Known SQL Injection
**Input:** Function with string concatenation in SQL query
**Expected:** CRITICAL finding, references OWASP A03:2021
**Actual:** [Run agent, record output]
## Test Case 2: Clean Authentication
**Input:** Well-implemented JWT validation with proper error handling
**Expected:** No findings or LOW-severity suggestions only
**Actual:** [Run agent, record output]
## Test Case 3: Ambiguous Error Handling
**Input:** Error caught but only logged, not re-thrown
**Expected:** MEDIUM finding about silent failures
**Actual:** [Run agent, record output]
## Test Case 4: Empty File
**Input:** Empty source file
**Expected:** Graceful handling, no crash, maybe LOW finding
**Actual:** [Run agent, record output]
```
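Keeping the same cases as structured data makes every run comparable against one baseline. A minimal sketch; the field names (`input`, `expected_severity`, `expected_topic`) are hypothetical, not a required format:
```python
# The four reviewer test cases above, kept as data so runs are reproducible.
TEST_SUITE = [
    {
        "name": "known_sql_injection",
        "input": 'query = "SELECT * FROM users WHERE id = " + user_id',
        "expected_severity": "CRITICAL",
        "expected_topic": "SQL injection",
    },
    {
        "name": "clean_authentication",
        "input": "well-implemented JWT validation with proper error handling",
        "expected_severity": None,  # no findings, or LOW suggestions only
        "expected_topic": None,
    },
    {
        "name": "ambiguous_error_handling",
        "input": "error caught but only logged, not re-thrown",
        "expected_severity": "MEDIUM",
        "expected_topic": "silent failure",
    },
    {
        "name": "empty_file",
        "input": "",
        "expected_severity": None,  # graceful handling, maybe LOW
        "expected_topic": None,
    },
]
```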
### Running the Test
````markdown
Use Task tool to dispatch agent:
Task(
subagent_type="ring-default:code-reviewer",
prompt="""
Review this code for security issues:
```python
def get_user(user_id):
query = "SELECT * FROM users WHERE id = " + user_id
return db.execute(query)
```
Provide findings with severity levels.
"""
)
````
**Document exact output.** Don't summarize - capture verbatim.
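A sketch of verbatim capture, assuming a hypothetical `dispatch_agent` callable that wraps the Task tool and returns the agent's raw text:
```python
import json
from datetime import datetime, timezone

def record_red_phase(test_suite, dispatch_agent, log_path="red_phase_results.jsonl"):
    """Run every test input and store the agent's exact output, unedited."""
    with open(log_path, "a", encoding="utf-8") as log:
        for case in test_suite:
            raw_output = dispatch_agent(case["input"])  # capture verbatim, never summarize
            log.write(json.dumps({
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "test_case": case["name"],
                "expected_severity": case.get("expected_severity"),
                "actual_output": raw_output,
            }) + "\n")
```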
## GREEN Phase: Fix Agent Definition (Make Tests Pass)
Write/update agent definition addressing specific failures documented in RED phase.
**Common fixes:**
| Failure Type | Fix Approach |
|--------------|--------------|
| Missing findings | Add explicit instructions to check for X |
| Wrong severity | Add severity calibration examples |
| Bad output format | Add output schema with examples |
| False positives | Add "don't flag X when Y" instructions |
| Incomplete analysis | Add "always check A, B, C" checklist |
### Example Fix: Severity Calibration
**RED Phase Failure:**
```
Agent marked hardcoded password as MEDIUM instead of CRITICAL
```
**GREEN Phase Fix (add to agent definition):**
```markdown
## Severity Calibration
**CRITICAL:** Immediate exploitation possible
- Hardcoded secrets (passwords, API keys, tokens)
- SQL injection with user input
- Authentication bypass
**HIGH:** Exploitation requires additional steps
- Missing input validation
- Improper error handling exposing internals
**MEDIUM:** Security weakness, not directly exploitable
- Missing rate limiting
- Verbose error messages
**LOW:** Best practice violations
- Missing security headers
- Outdated dependencies (no known CVEs)
```
### Re-run Tests
After fixing, re-run ALL test cases:
```markdown
## Test Results After Fix
| Test Case | Expected | Actual | Pass/Fail |
|-----------|----------|--------|-----------|
| SQL Injection | CRITICAL | CRITICAL | ✅ PASS |
| Clean Auth | No findings | No findings | ✅ PASS |
| Ambiguous Error | MEDIUM | MEDIUM | ✅ PASS |
| Empty File | Graceful | Graceful | ✅ PASS |
```
If any test fails: continue fixing, re-test.
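A sketch of this re-run step; `dispatch_agent` and `extract_severity` are assumed helpers (the latter pulls the reported severity, or `None`, out of the raw output):
```python
def rerun_suite(test_suite, dispatch_agent, extract_severity):
    """Re-run every test case after a fix and report pass/fail per case."""
    failures = []
    for case in test_suite:
        actual = extract_severity(dispatch_agent(case["input"]))
        status = "PASS" if actual == case.get("expected_severity") else "FAIL"
        print(f"{case['name']:<28} expected={case.get('expected_severity')} "
              f"actual={actual} -> {status}")
        if status == "FAIL":
            failures.append(case["name"])
    return failures  # non-empty means: keep fixing, then re-test everything
```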
## VERIFY GREEN: Output Verification
**Goal:** Confirm agent produces correct, well-structured outputs consistently.
### Output Schema Compliance
If agent has defined output schema, verify compliance:
```markdown
## Expected Schema
- Summary (1-2 sentences)
- Findings (array of {severity, location, description, recommendation})
- Overall assessment (PASS/FAIL with conditions)
## Actual Output Analysis
- Summary: ✅ Present, correct format
- Findings: ✅ Array, all fields present
- Overall assessment: ❌ Missing conditions for FAIL
```
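Schema compliance can be checked structurally before judging content. A minimal sketch against the expected schema above; the dictionary keys are assumptions about how the agent's output gets parsed:
```python
REQUIRED_FINDING_FIELDS = {"severity", "location", "description", "recommendation"}

def check_schema(output: dict) -> list[str]:
    """Return a list of structural problems for one parsed agent output."""
    problems = []
    if not output.get("summary"):
        problems.append("missing summary")
    findings = output.get("findings")
    if not isinstance(findings, list):
        problems.append("findings is not an array")
    else:
        for i, finding in enumerate(findings):
            missing = REQUIRED_FINDING_FIELDS - set(finding)
            if missing:
                problems.append(f"finding {i} missing fields: {sorted(missing)}")
    assessment = str(output.get("overall_assessment", ""))
    if assessment.startswith("FAIL") and "condition" not in assessment.lower():
        problems.append("FAIL assessment has no conditions")
    return problems
```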
### Accuracy Metrics
Track agent accuracy across test suite:
| Metric | Target | Actual |
|--------|--------|--------|
| True Positives (found real issues) | 100% | [X]% |
| False Positives (flagged non-issues) | <10% | [X]% |
| False Negatives (missed real issues) | <5% | [X]% |
| Severity Accuracy | >90% | [X]% |
| Schema Compliance | 100% | [X]% |
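Once expected and reported issues are recorded per test case, these rates fall out of simple counts. A sketch, assuming each result carries sets of expected and reported issue ids plus severity/schema flags:
```python
def accuracy_metrics(results):
    """Compute the metrics table from per-case results (field names are assumed)."""
    expected = sum(len(r["expected_issues"]) for r in results)
    reported = sum(len(r["reported_issues"]) for r in results)
    true_pos = sum(len(r["expected_issues"] & r["reported_issues"]) for r in results)
    return {
        "true_positive_rate": true_pos / expected if expected else 1.0,
        "false_positive_rate": (reported - true_pos) / reported if reported else 0.0,
        "false_negative_rate": (expected - true_pos) / expected if expected else 0.0,
        "severity_accuracy": sum(r["severity_matches"] for r in results) / max(true_pos, 1),
        "schema_compliance": sum(r["schema_ok"] for r in results) / len(results),
    }
```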
### Consistency Testing
Run same test input 3 times. Outputs should be consistent:
```markdown
## Consistency Test: SQL Injection Input
Run 1: CRITICAL, SQL injection, line 3
Run 2: CRITICAL, SQL injection, line 3
Run 3: CRITICAL, SQL injection, line 3
Consistency: ✅ 100% (all runs identical findings)
```
Inconsistency indicates agent definition is ambiguous.
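A sketch of the consistency check; `dispatch_agent` and `extract_findings` are assumed helpers, the latter normalizing each finding into a comparable tuple:
```python
def consistency_check(input_text, dispatch_agent, extract_findings, runs=3):
    """Dispatch the identical input several times and diff the findings."""
    per_run = []
    for _ in range(runs):
        # e.g., normalize each finding to (severity, title, line) before comparing
        per_run.append(frozenset(extract_findings(dispatch_agent(input_text))))
    consistent = len(set(per_run)) == 1
    if not consistent:
        print("Runs disagree - the agent definition is likely ambiguous")
    return consistent
```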
## REFACTOR Phase: Edge Cases and Robustness
Agent passes basic tests? Now test edge cases.
### Edge Case Categories
| Category | Test Cases |
|----------|------------|
| **Empty/Null** | Empty file, null input, whitespace only |
| **Large** | 10K line file, deeply nested code |
| **Unusual** | Minified code, generated code, config files |
| **Multi-language** | Mixed JS/TS, embedded SQL, templates |
| **Ambiguous** | Code that could be good or bad depending on context |
### Stress Testing
```markdown
## Stress Test: Large File
**Input:** 5000-line file with 20 known issues scattered throughout
**Expected:** All 20 issues found, reasonable response time
**Actual:** [Run agent, record output]
## Stress Test: Complex Nesting
**Input:** 15-level deep callback hell
**Expected:** Findings about complexity, maintainability
**Actual:** [Run agent, record output]
```
### Ambiguity Testing
````markdown
## Ambiguity Test: Context-Dependent Security
**Input:**
```python
# Internal admin tool, not exposed to users
password = "admin123" # Default for local dev
```
**Expected:** Agent should note context but still flag
**Actual:** [Run agent, record output]
**Analysis:** Does agent handle nuance appropriately?
````
### Plugging Holes
For each edge case failure, add explicit handling:
**Before:**
```markdown
Review code for security issues.
```
**After:**
```markdown
Review code for security issues.
**Edge case handling:**
- Empty files: Return "No code to review" with PASS
- Large files (>5K lines): Focus on high-risk patterns first
- Minified code: Note limitations, review what's readable
- Context comments: Consider but don't use to dismiss issues
```
## Testing Parallel Agent Workflows
When agents run in parallel (like 3 reviewers), test the combined workflow:
### Parallel Consistency
```markdown
## Parallel Test: Same Input to All Reviewers
Input: Authentication module with mixed issues
| Reviewer | Findings | Overlap |
|----------|----------|---------|
| code-reviewer | 5 findings | - |
| business-logic-reviewer | 3 findings | 1 shared |
| security-reviewer | 4 findings | 2 shared |
Analysis:
- Total unique findings: 9
- Appropriate overlap (security issues found by both security and code reviewer)
- No contradictions
```
### Aggregation Testing
```markdown
## Aggregation Test: Severity Consistency
Same issue found by multiple reviewers:
| Reviewer | Finding | Severity |
|----------|---------|----------|
| code-reviewer | Missing null check | MEDIUM |
| business-logic-reviewer | Missing null check | HIGH |
Problem: Inconsistent severity for same issue
Fix: Align severity calibration across all reviewers
```
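Severity disagreements on the same finding can be detected mechanically once reviewer outputs are parsed. A sketch, assuming findings keyed by `location` and `title`:
```python
from collections import defaultdict

def find_severity_conflicts(reviewer_outputs):
    """Flag findings that different reviewers report with different severities.

    `reviewer_outputs` is an assumed mapping of reviewer name to a list of
    findings, each a dict with `location`, `title`, and `severity`.
    """
    by_issue = defaultdict(dict)  # (location, title) -> {reviewer: severity}
    for reviewer, findings in reviewer_outputs.items():
        for f in findings:
            by_issue[(f["location"], f["title"])][reviewer] = f["severity"]
    return {
        issue: severities
        for issue, severities in by_issue.items()
        if len(set(severities.values())) > 1
    }  # non-empty result: align severity calibration across reviewers
```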
## Agent Testing Checklist
Before deploying agent, verify you followed RED-GREEN-REFACTOR:
**RED Phase:**
- [ ] Created test inputs (known issues, clean code, edge cases)
- [ ] Ran agent with test inputs
- [ ] Documented failures verbatim (missing findings, wrong severity, bad format)
**GREEN Phase:**
- [ ] Updated agent definition addressing specific failures
- [ ] Re-ran test inputs
- [ ] All basic tests now pass
**REFACTOR Phase:**
- [ ] Tested edge cases (empty, large, unusual, ambiguous)
- [ ] Tested stress scenarios (many issues, complex code)
- [ ] Added explicit edge case handling to definition
- [ ] Verified consistency (multiple runs produce same results)
- [ ] Verified schema compliance
- [ ] Tested parallel workflow integration (if applicable)
- [ ] Re-ran ALL tests after each change
**Metrics (for reviewer agents):**
- [ ] True positive rate: >95%
- [ ] False positive rate: <10%
- [ ] False negative rate: <5%
- [ ] Severity accuracy: >90%
- [ ] Schema compliance: 100%
- [ ] Consistency: >95%
## Prohibited Testing Shortcuts
**You CANNOT substitute proper testing with:**
| Shortcut | Why It Fails |
|----------|--------------|
| Reading agent definition carefully | Reading ≠ executing. Must run agent with inputs. |
| Manual testing in Claude UI | Ad-hoc ≠ reproducible. No baseline documented. |
| "Looks good to me" review | Visual inspection misses runtime failures. |
| Basing on proven template | Templates need validation for YOUR use case. |
| Expert prompt engineering knowledge | Expertise doesn't prevent bugs. Tests do. |
| Testing after first production use | Production is not QA. Test before deployment. |
| Monitoring production for issues | Reactive ≠ proactive. Catch issues before users do. |
| Deploy now, test in parallel | Parallel testing still means untested code in production. |
**ALL require running agent with documented test inputs and comparing outputs.**
## Testing Agent Modifications
**EVERY agent edit requires re-running the FULL test suite:**
| Change Type | Required Action |
|-------------|-----------------|
| Prompt wording changes | Full re-test |
| Severity calibration updates | Full re-test |
| Output schema modifications | Full re-test |
| Adding edge case handling | Full re-test |
| "Small" one-line changes | Full re-test |
| Typo fixes in prompt | Full re-test |
**"Small change" is not an exception.** One-line prompt changes can completely alter LLM behavior. Re-test always.
## Common Mistakes
**❌ Testing with only "happy path" inputs**
Agent works with obvious issues but misses subtle ones.
✅ Fix: Include ambiguous cases and edge cases in test suite.
**❌ Not documenting exact outputs**
"Agent was wrong" doesn't tell you what to fix.
✅ Fix: Capture agent output verbatim, compare to expected.
**❌ Fixing without re-running all tests**
Fix one issue, break another.
✅ Fix: Re-run entire test suite after each change.
**❌ Testing single agent in isolation when used in parallel**
Individual agents work, but combined workflow fails.
✅ Fix: Test parallel dispatch and output aggregation.
**❌ Not testing consistency**
Agent gives different answers for same input.
✅ Fix: Run same input 3+ times, verify consistent output.
**❌ Skipping severity calibration**
Agent finds issues but severity is inconsistent.
✅ Fix: Add explicit severity examples to agent definition.
**❌ Not testing edge cases**
Agent works for normal code, crashes on edge cases.
✅ Fix: Test empty, large, unusual, and ambiguous inputs.
**❌ Single test case validation**
"One test passed" proves nothing about agent behavior.
✅ Fix: Minimum 4-6 test cases per agent type.
**❌ Manual UI testing as substitute**
Ad-hoc testing doesn't create reproducible baselines.
✅ Fix: Document all test inputs and expected outputs.
**❌ Skipping re-test for "small" changes**
One-line prompt changes can break everything.
✅ Fix: Re-run full suite after ANY modification.
## Rationalization Table
| Excuse | Reality |
|--------|---------|
| "Agent prompt is obviously correct" | Obvious prompts fail in practice. Test proves correctness. |
| "Tested manually in Claude UI" | Ad-hoc ≠ reproducible. No baseline documented. |
| "One test case passed" | Sample size = 1 proves nothing. Need 4-6 cases minimum. |
| "Will test after first production use" | Production is not QA. Test before deployment. Always. |
| "Reading prompt is sufficient review" | Reading ≠ executing. Must run agent with inputs. |
| "Changes are small, re-test unnecessary" | Small changes cause big failures. Re-run full suite. |
| "Based agent on proven template" | Templates need validation for your use case. Test anyway. |
| "Expert in prompt engineering" | Expertise doesn't prevent bugs. Tests do. |
| "Production is down, no time to test" | Deploying untested fix may make outage worse. Test first. |
| "Deploy now, test in parallel" | Untested code in production = unknown behavior. Unacceptable. |
| "Quick smoke test is enough" | Smoke test misses edge cases. Full suite required. |
| "Simple pass-through agent" | You cannot self-determine exemptions. Get human approval. |
## Red Flags - STOP and Test Now
If you catch yourself thinking ANY of these, STOP. You're about to violate the Iron Law:
- Agent edited but tests not re-run
- "Looks good" without execution
- Single test case only
- No documented baseline
- No edge case testing
- Manual verification only
- "Will test in production"
- "Based on template, should work"
- "Just a small prompt change"
- "No time to test properly"
- "One quick test is enough"
- "Agent is simple, obviously works"
- "Expert intuition says it's fine"
- "Production is down, skip testing"
- "Deploy now, test in parallel"
**All of these mean: STOP. Run full RED-GREEN-REFACTOR cycle NOW.**
## Quick Reference (TDD Cycle for Agents)
| TDD Phase | Agent Testing | Success Criteria |
|-----------|---------------|------------------|
| **RED** | Run with test inputs | Document exact output failures |
| **Verify RED** | Capture verbatim | Have specific issues to fix |
| **GREEN** | Fix agent definition | All basic tests pass |
| **Verify GREEN** | Re-run all tests | No regressions |
| **REFACTOR** | Test edge cases | Robust under all conditions |
| **Stay GREEN** | Full test suite | All tests pass, metrics met |
## Example: Testing a New Reviewer Agent
### Step 1: Create Test Suite
```markdown
# security-reviewer Test Suite
## Test 1: SQL Injection (Known Critical)
Input: `query = "SELECT * FROM users WHERE id = " + user_id`
Expected: CRITICAL, SQL injection, OWASP A03:2021
## Test 2: Parameterized Query (Clean)
Input: `query = "SELECT * FROM users WHERE id = ?"; db.execute(query, [user_id])`
Expected: No security findings
## Test 3: Hardcoded Secret (Known Critical)
Input: `API_KEY = "sk-1234567890abcdef"`
Expected: CRITICAL, hardcoded secret
## Test 4: Environment Variable (Clean)
Input: `API_KEY = os.environ.get("API_KEY")`
Expected: No security findings (or LOW suggestion for validation)
## Test 5: Empty File
Input: (empty)
Expected: Graceful handling
## Test 6: Ambiguous - Internal Tool
Input: `password = "dev123" # Local development only`
Expected: Flag but acknowledge context
```
### Step 2: Run RED Phase
```
Test 1: ❌ Found issue but marked HIGH not CRITICAL
Test 2: ✅ No findings
Test 3: ❌ Missed entirely
Test 4: ✅ No findings
Test 5: ❌ Error: "No code provided"
Test 6: ❌ Dismissed due to comment
```
### Step 3: GREEN Phase - Fix Definition
Add to agent:
1. Severity calibration with SQL injection = CRITICAL
2. Explicit check for hardcoded secrets pattern
3. Empty file handling instruction
4. "Context comments don't dismiss security issues"
### Step 4: Re-run Tests
```
Test 1: ✅ CRITICAL
Test 2: ✅ No findings
Test 3: ✅ CRITICAL
Test 4: ✅ No findings
Test 5: ✅ "No code to review"
Test 6: ✅ Flagged with context acknowledgment
```
### Step 5: REFACTOR - Edge Cases
Add tests for: minified code, 10K line file, mixed languages, nested vulnerabilities.
Run, fix, repeat until all pass.
## The Bottom Line
**Agent testing IS TDD. Same principles, same cycle, same benefits.**
If you wouldn't deploy code without tests, don't deploy agents without testing them.
RED-GREEN-REFACTOR for agents works exactly like RED-GREEN-REFACTOR for code:
1. **RED:** See what's wrong (run with test inputs)
2. **GREEN:** Fix it (update agent definition)
3. **REFACTOR:** Make it robust (edge cases, consistency)
**Evidence before deployment. Always.**