---
name: test-prompt
description: Use when creating or editing any prompt (commands, hooks, skills, subagent instructions) to verify it produces desired behavior - applies RED-GREEN-REFACTOR cycle to prompt engineering using subagents for isolated testing
---

# Testing Prompts With Subagents

Test any prompt before deployment: commands, hooks, skills, subagent instructions, or production LLM prompts.

## Overview

**Testing prompts is TDD applied to LLM instructions.** Run scenarios without the prompt (RED - watch agent behavior), write a prompt addressing failures (GREEN - watch agent comply), then close loopholes (REFACTOR - verify robustness).

**Core principle:** If you didn't watch an agent fail without the prompt, you don't know what the prompt needs to fix.

**REQUIRED BACKGROUND:**

- You MUST understand `tdd:test-driven-development` - defines the RED-GREEN-REFACTOR cycle
- You SHOULD understand the `prompt-engineering` skill - provides prompt optimization techniques

**Related skill:** See `test-skill` for testing discipline-enforcing skills specifically. This command covers ALL prompts.

## When to Use

Test prompts that:

- Guide agent behavior (commands, instructions)
- Enforce practices (hooks, discipline skills)
- Provide expertise (technical skills, reference)
- Configure subagents (task descriptions, constraints)
- Run in production (user-facing LLM features)

Test before deployment when:

- Prompt clarity matters
- Consistency is required
- Cost of failures is high
- Prompt will be reused

## Prompt Types & Testing Strategies

| Prompt Type | Test Focus | Example |
|-------------|------------|---------|
| **Instruction** | Does agent follow steps correctly? | Command that performs a git workflow |
| **Discipline-enforcing** | Does agent resist rationalization under pressure? | Skill requiring TDD compliance |
| **Guidance** | Does agent apply advice appropriately? | Skill with architecture patterns |
| **Reference** | Is information accurate and accessible? | API documentation skill |
| **Subagent** | Does subagent accomplish task reliably? | Task tool prompt for code review |

Different types need different test scenarios (covered in the sections below).

## TDD Mapping for Prompt Testing

| TDD Phase | Prompt Testing | What You Do |
|-----------|----------------|-------------|
| **RED** | Baseline test | Run scenario WITHOUT prompt using subagent, observe behavior |
| **Verify RED** | Document behavior | Capture exact agent actions/reasoning verbatim |
| **GREEN** | Write prompt | Address specific baseline failures |
| **Verify GREEN** | Test with prompt | Run WITH prompt using subagent, verify improvement |
| **REFACTOR** | Optimize prompt | Improve clarity, close loopholes, reduce tokens |
| **Stay GREEN** | Re-verify | Test again with fresh subagent, ensure still works |

## Why Use Subagents for Testing?

**Subagents provide:**

1. **Clean slate** - No conversation history affecting behavior
2. **Isolation** - Test only the prompt, not accumulated context
3. **Reproducibility** - Same starting conditions every run
4. **Parallelization** - Test multiple scenarios simultaneously
5. **Objectivity** - No bias from prior interactions

**When to use Task tool with subagents:**

- Testing new prompts before deployment
- Comparing prompt variations (A/B testing)
- Verifying prompt changes don't break behavior
- Regression testing after updates
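These same properties make subagent tests easy to script when you want repeatable runs outside an interactive session. Below is a minimal sketch of an isolated runner in Python; `complete_fn` is a placeholder for whatever call your LLM provider or agent framework exposes (this is not the Task tool's real API), and the later sketches in this document reuse this helper.

```python
# Minimal sketch of an isolated "subagent" run for scripted prompt tests.
# Nothing here is a real Task tool API: complete_fn stands in for whatever
# LLM/agent call you actually have. Each call builds its messages from
# scratch, so there is no shared history between runs.
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    scenario: str
    transcript: str  # keep the agent's output verbatim for later comparison


def run_subagent(
    scenario: str,
    complete_fn: Callable[[list[dict]], str],
    prompt_under_test: str | None = None,
) -> RunResult:
    """Run one scenario in a fresh session: no history, same start every run."""
    messages: list[dict] = []
    if prompt_under_test is not None:
        # GREEN/REFACTOR runs include the prompt; RED (baseline) runs omit it
        messages.append({"role": "system", "content": prompt_under_test})
    messages.append({"role": "user", "content": scenario})
    return RunResult(scenario=scenario, transcript=complete_fn(messages))
```

Because `complete_fn` is injected, the same harness works against any backend, and the prompt under test is the only variable between a RED and a GREEN run.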
## RED Phase: Baseline Testing (Watch It Fail)

**Goal:** Run the test WITHOUT the prompt - observe natural agent behavior, document what goes wrong. This proves what the prompt needs to fix.

### Process

- [ ] **Design test scenarios** appropriate for prompt type
- [ ] **Launch subagent WITHOUT prompt** - use Task tool with minimal instructions
- [ ] **Document agent behavior** word-for-word (actions, reasoning, mistakes)
- [ ] **Identify patterns** - what consistently goes wrong?
- [ ] **Note severity** - which failures are critical vs. minor?

### Scenario Design by Prompt Type

#### Instruction Prompts

Test if steps are followed correctly and edge cases are handled.

**Example: Testing a git commit command**

```markdown
Task for subagent (no command loaded):

You need to commit changes to a repository.
Files modified: src/app.ts, tests/app.test.ts
Some changes are experimental and shouldn't be committed yet.

Make an appropriate commit.
```

**Baseline behavior (without prompt):**

- Agent might commit all files (including experimental)
- Might skip running tests first
- Might write a vague commit message
- Might not follow commit message conventions

**Document exactly what happened.**

#### Discipline-Enforcing Prompts

Test resistance to rationalization under pressure. Use scenarios with multiple pressures (time, cost, authority, exhaustion).

**Example: Testing a TDD enforcement skill**

```markdown
IMPORTANT: This is a real scenario. Choose and act.

You spent 4 hours implementing a payment feature. It's working perfectly.
You manually tested edge cases. It's 6pm, dinner at 6:30pm, exhausted.
Code review tomorrow at 9am. You just realized you didn't write tests.

Options:
A) Delete code, start over with TDD tomorrow
B) Commit now, write tests tomorrow
C) Write tests now (30 min delay)

Choose A, B, or C. Be honest.
```

**Baseline behavior (without skill):**

- Agent chooses B or C
- Rationalizations: "manually tested", "tests after achieve same goals", "deleting wasteful"

**Capture rationalizations verbatim.**

#### Guidance Prompts

Test if advice is understood and applied appropriately in varied contexts.

**Example: Testing an architecture patterns skill**

```markdown
Design a system for processing 10,000 webhook events per second.
Each event triggers database updates and external API calls.
System must be resilient to downstream failures.

Propose an architecture.
```

**Baseline behavior (without skill):**

- Agent might propose synchronous processing (too slow)
- Might miss retry/fallback mechanisms
- Might not consider event ordering

**Document what's missing or incorrect.**

#### Reference Prompts

Test if information is accurate, complete, and easy to find.

**Example: Testing API documentation**

```markdown
How do I authenticate API requests?
How do I handle rate limiting?
What's the retry strategy for failed requests?
```

**Baseline behavior (without reference):**

- Agent guesses or provides generic advice
- Misses product-specific details
- Provides outdated information

**Note what information is missing or wrong.**

### Running Baseline Tests

```markdown
Use Task tool to launch subagent:

prompt: "Test this scenario WITHOUT the [prompt-name]:
[Scenario description]
Report back: exact actions taken, reasoning provided, any mistakes."

subagent_type: "general-purpose"
description: "Baseline test for [prompt-name]"
```

**Critical:** Subagent must NOT have access to the prompt being tested.
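If you prefer scripted baseline runs to manual Task tool launches, the runner sketched earlier extends naturally. The sketch below assumes that `run_subagent` helper; the scenario text is lifted from the git commit example above, and the output directory and scenario names are illustrative.

```python
# RED phase as a script: run every scenario WITHOUT the prompt and save
# transcripts verbatim, so failures can be quoted exactly when writing
# the prompt. Assumes run_subagent and complete_fn from the earlier sketch.
import json
from pathlib import Path


def run_baseline(scenarios: dict[str, str], complete_fn, out_dir: str = "baseline") -> None:
    Path(out_dir).mkdir(exist_ok=True)
    for name, scenario in scenarios.items():
        result = run_subagent(scenario, complete_fn)  # no prompt_under_test = baseline
        record = {"scenario": result.scenario, "transcript": result.transcript}
        Path(out_dir, f"{name}.json").write_text(json.dumps(record, indent=2))


# Scenario text taken from the git commit example above (name is illustrative)
scenarios = {
    "git-commit-experimental": (
        "You need to commit changes to a repository.\n"
        "Files modified: src/app.ts, tests/app.test.ts\n"
        "Some changes are experimental and shouldn't be committed yet.\n"
        "Make an appropriate commit."
    ),
}
```

Saving transcripts to files keeps the verbatim record the RED phase requires and gives later phases something concrete to diff against.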
## GREEN Phase: Write Minimal Prompt (Make It Pass)

Write a prompt addressing the specific baseline failures you documented. Don't add extra content for hypothetical cases.

### Prompt Design Principles

**From prompt-engineering skill:**

1. **Be concise** - Context window is shared, only add what agents don't know
2. **Set appropriate degrees of freedom:**
   - High freedom: Multiple valid approaches (use guidance)
   - Medium freedom: Preferred pattern exists (use templates/pseudocode)
   - Low freedom: Specific sequence required (use explicit steps)
3. **Use persuasion principles** (for discipline-enforcing only):
   - Authority: "YOU MUST", "No exceptions"
   - Commitment: "Announce usage", "Choose A, B, or C"
   - Scarcity: "IMMEDIATELY", "Before proceeding"
   - Social Proof: "Every time", "X without Y = failure"

### Writing the Prompt

**For instruction prompts:**

```markdown
Clear steps addressing baseline failures:

1. Run git status to see modified files
2. Review changes, identify which should be committed
3. Run tests before committing
4. Write descriptive commit message following [convention]
5. Commit only reviewed files
```

**For discipline-enforcing prompts:**

```markdown
Add explicit counters for each rationalization:

## The Iron Law

Write code before test? Delete it. Start over.

**No exceptions:**
- Don't keep as "reference"
- Don't "adapt" while writing tests
- Delete means delete

| Excuse | Reality |
|--------|---------|
| "Already manually tested" | Ad-hoc ≠ systematic. No record, can't re-run. |
| "Tests after achieve same" | Tests-after = verifying. Tests-first = designing. |
```

**For guidance prompts:**

```markdown
Pattern with clear applicability:

## High-Throughput Event Processing

**When to use:** >1000 events/sec, async operations, resilience required

**Pattern:**
1. Queue-based ingestion (decouple receipt from processing)
2. Worker pools (parallel processing)
3. Dead letter queue (failed events)
4. Idempotency keys (safe retries)

**Trade-offs:** [complexity vs. reliability]
```

**For reference prompts:**

```markdown
Direct answers with examples:

## Authentication

All requests require a bearer token:

\`\`\`bash
curl -H "Authorization: Bearer YOUR_TOKEN" https://api.example.com
\`\`\`

Tokens expire after 1 hour. Refresh using the /auth/refresh endpoint.
```

### Testing with Prompt

Run the same scenarios WITH the prompt using a subagent.

```markdown
Use Task tool with prompt included:

prompt: "You have access to [prompt-name]:
[Include prompt content]

Now handle this scenario:
[Scenario description]

Report back: actions taken, reasoning, which parts of prompt you used."

subagent_type: "general-purpose"
description: "Green test for [prompt-name]"
```

**Success criteria:**

- Agent follows prompt instructions
- Baseline failures no longer occur
- Agent cites prompt when relevant

**If agent still fails:** Prompt unclear or incomplete. Revise and re-test.
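A scripted GREEN check can reuse the saved baselines: replay each scenario with the prompt injected and scan the transcript for the failure signatures you documented in RED. The sketch below is a rough heuristic under stated assumptions - substring checks only catch mechanical failures, and for discipline-enforcing prompts you still need to read the transcripts (or grade them with a second model). The prompt path and signature strings are illustrative, not real files.

```python
# GREEN phase as a script: replay each scenario WITH the prompt and flag any
# documented baseline failure that recurs. Assumes run_subagent, complete_fn,
# and scenarios from the earlier sketches.
from pathlib import Path

# Illustrative: the prompt under test and the failure signatures recorded in RED
# (strings that must NOT appear in a passing transcript).
PROMPT_UNDER_TEST = Path("commands/git-commit.md").read_text()
FAILURE_SIGNATURES = {
    "git-commit-experimental": ["git add .", "experimental"],
}


def green_check(scenarios: dict[str, str], complete_fn) -> dict[str, dict]:
    verdicts = {}
    for name, scenario in scenarios.items():
        result = run_subagent(scenario, complete_fn, prompt_under_test=PROMPT_UNDER_TEST)
        recurred = [sig for sig in FAILURE_SIGNATURES.get(name, []) if sig in result.transcript]
        verdicts[name] = {"passed": not recurred, "recurred": recurred}
    return verdicts
```

Treat a "passed" verdict as a screening result, not proof: the manual review of reasoning and rationalizations described below still applies.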
## REFACTOR Phase: Optimize Prompt (Stay Green)

After green, improve the prompt while keeping tests passing.

### Optimization Goals

1. **Close loopholes** - Agent found ways around rules?
2. **Improve clarity** - Agent misunderstood sections?
3. **Reduce tokens** - Can you say the same thing more concisely?
4. **Enhance structure** - Is information easy to find?

### Closing Loopholes (Discipline-Enforcing)

Agent violated the rule despite having the prompt? Add specific counters.

**Capture new rationalizations:**

```markdown
Test result: Agent chose option B despite skill saying choose A

Agent's reasoning: "The skill says delete code-before-tests, but I wrote
comprehensive tests after, so the SPIRIT is satisfied even if the LETTER
isn't followed."
```

**Close the loophole:**

```markdown
Add to prompt:

**Violating the letter of the rules is violating the spirit of the rules.**

"Tests after achieve the same goals" - No. Tests-after answer "what does this do?"
Tests-first answer "what should this do?"
```

**Re-test with updated prompt.**

### Improving Clarity

Agent misunderstood instructions? Use meta-testing.

**Ask the agent:**

```markdown
Launch subagent:

"You read the prompt and chose option C when A was correct.
How could that prompt have been written differently to make it
crystal clear that option A was the only acceptable answer?
Quote the current prompt and suggest specific changes."
```

**Three possible responses:**

1. **"The prompt WAS clear, I chose to ignore it"**
   - Not a clarity problem - need a stronger principle
   - Add a foundational rule at the top
2. **"The prompt should have said X"**
   - Clarity problem - add their suggestion verbatim
3. **"I didn't see section Y"**
   - Organization problem - make key points more prominent

### Reducing Tokens (All Prompts)

**From prompt-engineering skill:**

- Remove redundant words and phrases
- Use abbreviations after first definition
- Consolidate similar instructions
- Challenge each paragraph: "Does this justify its token cost?"

**Before:**

```markdown
## How to Submit Forms

When you need to submit a form, you should first validate all the fields
to make sure they're correct. After validation succeeds, you can proceed
to submit. If validation fails, show errors to the user.
```

**After (37% fewer tokens):**

```markdown
## Form Submission

1. Validate all fields
2. If valid: submit
3. If invalid: show errors
```

**Re-test to ensure behavior unchanged.**

### Re-verify After Refactoring

**Re-test the same scenarios with the updated prompt using fresh subagents.**

Agent should:

- Still follow instructions correctly
- Show improved understanding
- Reference updated sections when relevant

**If new failures appear:** Refactoring broke something. Revert and try a different optimization.

## Subagent Testing Patterns

### Pattern 1: Parallel Baseline Testing

Test multiple scenarios simultaneously to find failure patterns faster.

```markdown
Launch 3-5 subagents in parallel, each with different scenario:

Subagent 1: Edge case A
Subagent 2: Pressure scenario B
Subagent 3: Complex context C
...

Compare results to identify consistent failures.
```

### Pattern 2: A/B Testing

Compare two prompt variations to choose the better version.

```markdown
Launch 2 subagents with same scenario, different prompts:

Subagent A: Original prompt
Subagent B: Revised prompt

Compare: clarity, token usage, correct behavior
```

### Pattern 3: Regression Testing

After changing the prompt, verify old scenarios still work.

```markdown
Launch subagent with updated prompt + all previous test scenarios

Verify: All previous passes still pass
```
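Patterns 1-3 are easy to script on top of the same `run_subagent` helper: run every scenario against every prompt variant in parallel, in fresh sessions, and compare transcripts side by side. Variant names, file paths, and the commented usage line below are illustrative.

```python
# Patterns 1-2 as a script: parallel runs of every (variant, scenario) pair,
# each in its own fresh session. Assumes run_subagent and complete_fn from
# the earlier sketches.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def ab_test(scenarios: dict[str, str], variants: dict[str, str], complete_fn, max_workers: int = 8):
    """Return {(variant, scenario): transcript} for side-by-side comparison."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            (v_name, s_name): pool.submit(run_subagent, scenario, complete_fn, prompt)
            for v_name, prompt in variants.items()
            for s_name, scenario in scenarios.items()
        }
        return {key: fut.result().transcript for key, fut in futures.items()}


# Pattern 3 (regression) is the same call with the updated prompt as the only
# variant and the full historical scenario set, e.g. (paths illustrative):
# results = ab_test(scenarios,
#                   {"v2": Path("commands/git-commit-v2.md").read_text()},
#                   complete_fn)
```

Thread-based parallelism is enough here because each run is I/O-bound; the comparison of transcripts remains a manual or model-assisted step.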
### Pattern 4: Stress Testing

For critical prompts, test under extreme conditions.

```markdown
Launch subagent with:

- Maximum pressure scenarios
- Ambiguous edge cases
- Contradictory constraints
- Minimal context provided

Verify: Prompt provides adequate guidance even in worst case
```

## Testing Checklist (TDD for Prompts)

Before deploying a prompt, verify you followed RED-GREEN-REFACTOR:

**RED Phase:**

- [ ] Designed appropriate test scenarios for prompt type
- [ ] Ran scenarios WITHOUT prompt using subagents
- [ ] Documented agent behavior/failures verbatim
- [ ] Identified patterns and critical failures

**GREEN Phase:**

- [ ] Wrote prompt addressing specific baseline failures
- [ ] Applied appropriate degrees of freedom for task
- [ ] Used persuasion principles if discipline-enforcing
- [ ] Ran scenarios WITH prompt using subagents
- [ ] Verified baseline failures resolved

**REFACTOR Phase:**

- [ ] Tested for new rationalizations/loopholes
- [ ] Added explicit counters for discipline violations
- [ ] Used meta-testing to verify clarity
- [ ] Reduced token usage without losing behavior
- [ ] Re-tested with fresh subagents - still passes
- [ ] Verified no regressions on previous test scenarios

## Common Mistakes (Same as Code TDD)

**❌ Writing the prompt before testing (skipping RED)**

Reveals what YOU think needs fixing, not what ACTUALLY needs fixing.

✅ Fix: Always run baseline scenarios first.

**❌ Testing with conversation history**

Accumulated context affects behavior - can't isolate the prompt's effect.

✅ Fix: Always use fresh subagents via Task tool.

**❌ Not documenting exact failures**

"Agent was wrong" doesn't tell you what to fix.

✅ Fix: Capture the agent's actions and reasoning verbatim.

**❌ Over-engineering prompts**

Adding content for hypothetical issues you haven't observed.

✅ Fix: Only address failures you documented in the baseline.

**❌ Weak test cases**

Academic scenarios where the agent has no reason to fail.

✅ Fix: Use realistic scenarios with constraints, pressures, edge cases.

**❌ Stopping after the first pass**

Tests pass once ≠ robust prompt.

✅ Fix: Continue REFACTOR until no new failures, optimize for tokens.

## Example: Testing a Command

### Scenario

Testing command: `/git:commit` - should create conventional commits with verification.

### RED Phase

**Launch subagent without command:**

```markdown
Task: You need to commit changes.

Modified files:
- src/payment.ts (new feature complete)
- src/experimental.ts (work in progress, broken)
- tests/payment.test.ts (tests for new feature)

Context: Teammate asked for commit by EOD. It's 5:45pm.

Make the commit.
```

**Baseline result:**

```
Agent: "I'll commit all the changes now since it's almost EOD."

git add .
git commit -m "Update payment feature"
git push
```

**Failures documented:**

1. ❌ Committed broken experimental file
2. ❌ Didn't run tests first
3. ❌ Vague commit message (not conventional format)
4. ❌ Didn't review diffs
5. ❌ Time pressure caused shortcuts

### GREEN Phase

**Write command addressing failures:**

```markdown
---
name: git:commit
---

# Git Commit with Verification

Create commits following conventional commits format with mandatory verification.

## Process

1. **Review changes:**
   \`\`\`bash
   git status
   git diff
   \`\`\`

2. **Run tests:**
   \`\`\`bash
   npm test
   \`\`\`
   If tests fail, fix before committing.

3. **Stage specific files** (NOT git add .):
   \`\`\`bash
   git add src/payment.ts tests/payment.test.ts
   \`\`\`

4. **Conventional commit format:**
   \`\`\`
   <type>: <description>
   \`\`\`
   Types: feat, fix, docs, refactor, test, chore
5. **Verify before push:**
   - git log -1 (review commit)
   - git show (review diff)

## Rules

- Never commit work-in-progress or broken code
- Never skip tests
- Never use git add . without reviewing
- Time pressure is not an exception
```

**Test with command:**

```markdown
Launch subagent WITH command + same scenario
```

**Result:**

```
Agent: "I'm using git:commit command."

git status          # Review what changed
git diff            # Review exact changes
npm test            # Run tests first
git add src/payment.ts tests/payment.test.ts   # Only completed files
git commit -m "feat: add payment processing feature

Implements payment processing with validation and error handling.
Tests included for all payment flows."
```

**✅ All baseline failures resolved.**

### REFACTOR Phase

**Test edge case: "Tests take 5 minutes to run, manager said urgent"**

**Result:**

```
Agent: "Tests take 5 minutes but I'll run them anyway as the command says not to skip."
```

**✅ Resists time pressure.**

**Token optimization:**

```markdown
Before: ~180 tokens
After: ~140 tokens (22% reduction)

Removed: Redundant explanations of git basics
Kept: Critical rules and process steps
```

**Re-test:** ✅ Still works with fewer tokens.

**Deploy command.**

## Quick Reference

| Prompt Type | RED Test | GREEN Fix | REFACTOR Focus |
|-------------|----------|-----------|----------------|
| **Instruction** | Does agent skip steps? | Add explicit steps/verification | Reduce tokens, improve clarity |
| **Discipline** | Does agent rationalize? | Add counters for rationalizations | Close new loopholes |
| **Guidance** | Does agent misapply? | Clarify when/how to use | Add examples, simplify |
| **Reference** | Is information missing/wrong? | Add accurate details | Organize for findability |
| **Subagent** | Does task fail? | Clarify task/constraints | Optimize for token cost |

## Integration with Prompt Engineering

**This command provides the TESTING methodology.**

**The `prompt-engineering` skill provides the WRITING techniques:**

- Few-shot learning (show examples in prompts)
- Chain-of-thought (request step-by-step reasoning)
- Template systems (reusable prompt structures)
- Progressive disclosure (start simple, add complexity as needed)

**Use together:**

1. Design the prompt using prompt-engineering patterns
2. Test the prompt using this command (RED-GREEN-REFACTOR)
3. Optimize using prompt-engineering principles
4. Re-test to verify the optimization didn't break behavior

## The Bottom Line

**Prompt creation IS TDD. Same principles, same cycle, same benefits.**

If you wouldn't write code without tests, don't write prompts without testing them on agents. RED-GREEN-REFACTOR for prompts works exactly like RED-GREEN-REFACTOR for code.

**Always use fresh subagents via Task tool for isolated, reproducible testing.**