Files
2025-11-30 08:42:44 +08:00

715 lines
20 KiB
Markdown

---
name: test-prompt
description: Use when creating or editing any prompt (commands, hooks, skills, subagent instructions) to verify it produces desired behavior - applies RED-GREEN-REFACTOR cycle to prompt engineering using subagents for isolated testing
---
# Testing Prompts With Subagents
Test any prompt before deployment: commands, hooks, skills, subagent instructions, or production LLM prompts.
## Overview
**Testing prompts is TDD applied to LLM instructions.**
Run scenarios without the prompt (RED - watch agent behavior), write prompt addressing failures (GREEN - watch agent comply), then close loopholes (REFACTOR - verify robustness).
**Core principle:** If you didn't watch an agent fail without the prompt, you don't know what the prompt needs to fix.
**REQUIRED BACKGROUND:**
- You MUST understand `tdd:test-driven-development` - defines RED-GREEN-REFACTOR cycle
- You SHOULD understand `prompt-engineering` skill - provides prompt optimization techniques
**Related skill:** See `test-skill` for testing discipline-enforcing skills specifically. This command covers ALL prompts.
## When to Use
Test prompts that:
- Guide agent behavior (commands, instructions)
- Enforce practices (hooks, discipline skills)
- Provide expertise (technical skills, reference)
- Configure subagents (task descriptions, constraints)
- Run in production (user-facing LLM features)
Test before deployment when:
- Prompt clarity matters
- Consistency is required
- Cost of failures is high
- Prompt will be reused
## Prompt Types & Testing Strategies
| Prompt Type | Test Focus | Example |
|-------------|------------|---------|
| **Instruction** | Does agent follow steps correctly? | Command that performs git workflow |
| **Discipline-enforcing** | Does agent resist rationalization under pressure? | Skill requiring TDD compliance |
| **Guidance** | Does agent apply advice appropriately? | Skill with architecture patterns |
| **Reference** | Is information accurate and accessible? | API documentation skill |
| **Subagent** | Does subagent accomplish task reliably? | Task tool prompt for code review |
Different types need different test scenarios (covered in sections below).
## TDD Mapping for Prompt Testing
| TDD Phase | Prompt Testing | What You Do |
|-----------|----------------|-------------|
| **RED** | Baseline test | Run scenario WITHOUT prompt using subagent, observe behavior |
| **Verify RED** | Document behavior | Capture exact agent actions/reasoning verbatim |
| **GREEN** | Write prompt | Address specific baseline failures |
| **Verify GREEN** | Test with prompt | Run WITH prompt using subagent, verify improvement |
| **REFACTOR** | Optimize prompt | Improve clarity, close loopholes, reduce tokens |
| **Stay GREEN** | Re-verify | Test again with fresh subagent, ensure still works |
## Why Use Subagents for Testing?
**Subagents provide:**
1. **Clean slate** - No conversation history affecting behavior
2. **Isolation** - Test only the prompt, not accumulated context
3. **Reproducibility** - Same starting conditions every run
4. **Parallelization** - Test multiple scenarios simultaneously
5. **Objectivity** - No bias from prior interactions
**When to use Task tool with subagents:**
- Testing new prompts before deployment
- Comparing prompt variations (A/B testing)
- Verifying prompt changes don't break behavior
- Regression testing after updates
## RED Phase: Baseline Testing (Watch It Fail)
**Goal:** Run test WITHOUT the prompt - observe natural agent behavior, document what goes wrong.
This proves what the prompt needs to fix.
### Process
- [ ] **Design test scenarios** appropriate for prompt type
- [ ] **Launch subagent WITHOUT prompt** - use Task tool with minimal instructions
- [ ] **Document agent behavior** word-for-word (actions, reasoning, mistakes)
- [ ] **Identify patterns** - what consistently goes wrong?
- [ ] **Note severity** - which failures are critical vs. minor?
### Scenario Design by Prompt Type
#### Instruction Prompts
Test if steps are followed correctly and edge cases handled.
**Example: Testing a git commit command**
```markdown
Task for subagent (no command loaded):
You need to commit changes to a repository.
Files modified: src/app.ts, tests/app.test.ts
Some changes are experimental and shouldn't be committed yet.
Make an appropriate commit.
```
**Baseline behavior (without prompt):**
- Agent might commit all files (including experimental)
- Might skip running tests first
- Might write vague commit message
- Might not follow commit message conventions
**Document exactly what happened.**
#### Discipline-Enforcing Prompts
Test resistance to rationalization under pressure. Use scenarios with multiple pressures (time, cost, authority, exhaustion).
**Example: Testing a TDD enforcement skill**
```markdown
IMPORTANT: This is a real scenario. Choose and act.
You spent 4 hours implementing a payment feature. It's working perfectly.
You manually tested edge cases. It's 6pm, dinner at 6:30pm, exhausted.
Code review tomorrow at 9am. You just realized you didn't write tests.
Options:
A) Delete code, start over with TDD tomorrow
B) Commit now, write tests tomorrow
C) Write tests now (30 min delay)
Choose A, B, or C. Be honest.
```
**Baseline behavior (without skill):**
- Agent chooses B or C
- Rationalizations: "manually tested", "tests after achieve same goals", "deleting wasteful"
**Capture rationalizations verbatim.**
#### Guidance Prompts
Test if advice is understood and applied appropriately in varied contexts.
**Example: Testing an architecture patterns skill**
```markdown
Design a system for processing 10,000 webhook events per second.
Each event triggers database updates and external API calls.
System must be resilient to downstream failures.
Propose an architecture.
```
**Baseline behavior (without skill):**
- Agent might propose synchronous processing (too slow)
- Might miss retry/fallback mechanisms
- Might not consider event ordering
**Document what's missing or incorrect.**
#### Reference Prompts
Test if information is accurate, complete, and easy to find.
**Example: Testing API documentation**
```markdown
How do I authenticate API requests?
How do I handle rate limiting?
What's the retry strategy for failed requests?
```
**Baseline behavior (without reference):**
- Agent guesses or provides generic advice
- Misses product-specific details
- Provides outdated information
**Note what information is missing or wrong.**
### Running Baseline Tests
```markdown
Use Task tool to launch subagent:
prompt: "Test this scenario WITHOUT the [prompt-name]:
[Scenario description]
Report back: exact actions taken, reasoning provided, any mistakes."
subagent_type: "general-purpose"
description: "Baseline test for [prompt-name]"
```
**Critical:** Subagent must NOT have access to the prompt being tested.
## GREEN Phase: Write Minimal Prompt (Make It Pass)
Write prompt addressing the specific baseline failures you documented. Don't add extra content for hypothetical cases.
### Prompt Design Principles
**From prompt-engineering skill:**
1. **Be concise** - Context window is shared, only add what agents don't know
2. **Set appropriate degrees of freedom:**
- High freedom: Multiple valid approaches (use guidance)
- Medium freedom: Preferred pattern exists (use templates/pseudocode)
- Low freedom: Specific sequence required (use explicit steps)
3. **Use persuasion principles** (for discipline-enforcing only):
- Authority: "YOU MUST", "No exceptions"
- Commitment: "Announce usage", "Choose A, B, or C"
- Scarcity: "IMMEDIATELY", "Before proceeding"
- Social Proof: "Every time", "X without Y = failure"
### Writing the Prompt
**For instruction prompts:**
```markdown
Clear steps addressing baseline failures:
1. Run git status to see modified files
2. Review changes, identify which should be committed
3. Run tests before committing
4. Write descriptive commit message following [convention]
5. Commit only reviewed files
```
**For discipline-enforcing prompts:**
```markdown
Add explicit counters for each rationalization:
## The Iron Law
Write code before test? Delete it. Start over.
**No exceptions:**
- Don't keep as "reference"
- Don't "adapt" while writing tests
- Delete means delete
| Excuse | Reality |
|--------|---------|
| "Already manually tested" | Ad-hoc ≠ systematic. No record, can't re-run. |
| "Tests after achieve same" | Tests-after = verifying. Tests-first = designing. |
```
**For guidance prompts:**
```markdown
Pattern with clear applicability:
## High-Throughput Event Processing
**When to use:** >1000 events/sec, async operations, resilience required
**Pattern:**
1. Queue-based ingestion (decouple receipt from processing)
2. Worker pools (parallel processing)
3. Dead letter queue (failed events)
4. Idempotency keys (safe retries)
**Trade-offs:** [complexity vs. reliability]
```
**For reference prompts:**
```markdown
Direct answers with examples:
## Authentication
All requests require bearer token:
\`\`\`bash
curl -H "Authorization: Bearer YOUR_TOKEN" https://api.example.com
\`\`\`
Tokens expire after 1 hour. Refresh using /auth/refresh endpoint.
```
### Testing with Prompt
Run same scenarios WITH prompt using subagent.
```markdown
Use Task tool with prompt included:
prompt: "You have access to [prompt-name]:
[Include prompt content]
Now handle this scenario:
[Scenario description]
Report back: actions taken, reasoning, which parts of prompt you used."
subagent_type: "general-purpose"
description: "Green test for [prompt-name]"
```
**Success criteria:**
- Agent follows prompt instructions
- Baseline failures no longer occur
- Agent cites prompt when relevant
**If agent still fails:** Prompt unclear or incomplete. Revise and re-test.
## REFACTOR Phase: Optimize Prompt (Stay Green)
After green, improve the prompt while keeping tests passing.
### Optimization Goals
1. **Close loopholes** - Agent found ways around rules?
2. **Improve clarity** - Agent misunderstood sections?
3. **Reduce tokens** - Can you say same thing more concisely?
4. **Enhance structure** - Is information easy to find?
### Closing Loopholes (Discipline-Enforcing)
Agent violated rule despite having the prompt? Add specific counters.
**Capture new rationalizations:**
```markdown
Test result: Agent chose option B despite skill saying choose A
Agent's reasoning: "The skill says delete code-before-tests, but I
wrote comprehensive tests after, so the SPIRIT is satisfied even if
the LETTER isn't followed."
```
**Close the loophole:**
```markdown
Add to prompt:
**Violating the letter of the rules is violating the spirit of the rules.**
"Tests after achieve the same goals" - No. Tests-after answer "what does
this do?" Tests-first answer "what should this do?"
```
**Re-test with updated prompt.**
### Improving Clarity
Agent misunderstood instructions? Use meta-testing.
**Ask the agent:**
```markdown
Launch subagent:
"You read the prompt and chose option C when A was correct.
How could that prompt have been written differently to make it
crystal clear that option A was the only acceptable answer?
Quote the current prompt and suggest specific changes."
```
**Three possible responses:**
1. **"The prompt WAS clear, I chose to ignore it"**
- Not clarity problem - need stronger principle
- Add foundational rule at top
2. **"The prompt should have said X"**
- Clarity problem - add their suggestion verbatim
3. **"I didn't see section Y"**
- Organization problem - make key points more prominent
### Reducing Tokens (All Prompts)
**From prompt-engineering skill:**
- Remove redundant words and phrases
- Use abbreviations after first definition
- Consolidate similar instructions
- Challenge each paragraph: "Does this justify its token cost?"
**Before:**
```markdown
## How to Submit Forms
When you need to submit a form, you should first validate all the fields
to make sure they're correct. After validation succeeds, you can proceed
to submit. If validation fails, show errors to the user.
```
**After (37% fewer tokens):**
```markdown
## Form Submission
1. Validate all fields
2. If valid: submit
3. If invalid: show errors
```
**Re-test to ensure behavior unchanged.**
### Re-verify After Refactoring
**Re-test same scenarios with updated prompt using fresh subagents.**
Agent should:
- Still follow instructions correctly
- Show improved understanding
- Reference updated sections when relevant
**If new failures appear:** Refactoring broke something. Revert and try different optimization.
## Subagent Testing Patterns
### Pattern 1: Parallel Baseline Testing
Test multiple scenarios simultaneously to find failure patterns faster.
```markdown
Launch 3-5 subagents in parallel, each with different scenario:
Subagent 1: Edge case A
Subagent 2: Pressure scenario B
Subagent 3: Complex context C
...
Compare results to identify consistent failures.
```
### Pattern 2: A/B Testing
Compare two prompt variations to choose better version.
```markdown
Launch 2 subagents with same scenario, different prompts:
Subagent A: Original prompt
Subagent B: Revised prompt
Compare: clarity, token usage, correct behavior
```
### Pattern 3: Regression Testing
After changing prompt, verify old scenarios still work.
```markdown
Launch subagent with updated prompt + all previous test scenarios
Verify: All previous passes still pass
```
### Pattern 4: Stress Testing
For critical prompts, test under extreme conditions.
```markdown
Launch subagent with:
- Maximum pressure scenarios
- Ambiguous edge cases
- Contradictory constraints
- Minimal context provided
Verify: Prompt provides adequate guidance even in worst case
```
## Testing Checklist (TDD for Prompts)
Before deploying prompt, verify you followed RED-GREEN-REFACTOR:
**RED Phase:**
- [ ] Designed appropriate test scenarios for prompt type
- [ ] Ran scenarios WITHOUT prompt using subagents
- [ ] Documented agent behavior/failures verbatim
- [ ] Identified patterns and critical failures
**GREEN Phase:**
- [ ] Wrote prompt addressing specific baseline failures
- [ ] Applied appropriate degrees of freedom for task
- [ ] Used persuasion principles if discipline-enforcing
- [ ] Ran scenarios WITH prompt using subagents
- [ ] Verified baseline failures resolved
**REFACTOR Phase:**
- [ ] Tested for new rationalizations/loopholes
- [ ] Added explicit counters for discipline violations
- [ ] Used meta-testing to verify clarity
- [ ] Reduced token usage without losing behavior
- [ ] Re-tested with fresh subagents - still passes
- [ ] Verified no regressions on previous test scenarios
## Common Mistakes (Same as Code TDD)
**❌ Writing prompt before testing (skipping RED)**
Reveals what YOU think needs fixing, not what ACTUALLY needs fixing.
✅ Fix: Always run baseline scenarios first.
**❌ Testing with conversation history**
Accumulated context affects behavior - can't isolate prompt effect.
✅ Fix: Always use fresh subagents via Task tool.
**❌ Not documenting exact failures**
"Agent was wrong" doesn't tell you what to fix.
✅ Fix: Capture agent's actions and reasoning verbatim.
**❌ Over-engineering prompts**
Adding content for hypothetical issues you haven't observed.
✅ Fix: Only address failures you documented in baseline.
**❌ Weak test cases**
Academic scenarios where agent has no reason to fail.
✅ Fix: Use realistic scenarios with constraints, pressures, edge cases.
**❌ Stopping after first pass**
Tests pass once ≠ robust prompt.
✅ Fix: Continue REFACTOR until no new failures, optimize for tokens.
## Example: Testing a Command
### Scenario
Testing command: `/git:commit` - should create conventional commits with verification.
### RED Phase
**Launch subagent without command:**
```markdown
Task: You need to commit changes.
Modified files:
- src/payment.ts (new feature complete)
- src/experimental.ts (work in progress, broken)
- tests/payment.test.ts (tests for new feature)
Context: Teammate asked for commit by EOD. It's 5:45pm.
Make the commit.
```
**Baseline result:**
```
Agent: "I'll commit all the changes now since it's almost EOD."
git add .
git commit -m "Update payment feature"
git push
```
**Failures documented:**
1. ❌ Committed broken experimental file
2. ❌ Didn't run tests first
3. ❌ Vague commit message (not conventional format)
4. ❌ Didn't review diffs
5. ❌ Time pressure caused shortcuts
### GREEN Phase
**Write command addressing failures:**
```markdown
---
name: git:commit
---
# Git Commit with Verification
Create commits following conventional commits format with mandatory verification.
## Process
1. **Review changes:**
\`\`\`bash
git status
git diff
\`\`\`
2. **Run tests:**
\`\`\`bash
npm test
\`\`\`
If tests fail, fix before committing.
3. **Stage specific files** (NOT git add .):
\`\`\`bash
git add src/payment.ts tests/payment.test.ts
\`\`\`
4. **Conventional commit format:**
\`\`\`
<type>: <description>
<body>
\`\`\`
Types: feat, fix, docs, refactor, test, chore
5. **Verify before push:**
- git log -1 (review commit)
- git show (review diff)
## Rules
- Never commit work-in-progress or broken code
- Never skip tests
- Never use git add . without reviewing
- Time pressure is not an exception
```
**Test with command:**
```markdown
Launch subagent WITH command + same scenario
```
**Result:**
```
Agent: "I'm using git:commit command."
git status # Review what changed
git diff # Review exact changes
npm test # Run tests first
git add src/payment.ts tests/payment.test.ts # Only completed files
git commit -m "feat: add payment processing feature
Implements payment processing with validation and error handling.
Tests included for all payment flows."
```
**✅ All baseline failures resolved.**
### REFACTOR Phase
**Test edge case: "Tests take 5 minutes to run, manager said urgent"**
**Result:**
```
Agent: "Tests take 5 minutes but I'll run them anyway as the command says not to skip."
```
**✅ Resists time pressure.**
**Token optimization:**
```markdown
Before: ~180 tokens
After: ~140 tokens (22% reduction)
Removed: Redundant explanations of git basics
Kept: Critical rules and process steps
```
**Re-test:** ✅ Still works with fewer tokens.
**Deploy command.**
## Quick Reference
| Prompt Type | RED Test | GREEN Fix | REFACTOR Focus |
|-------------|----------|-----------|----------------|
| **Instruction** | Does agent skip steps? | Add explicit steps/verification | Reduce tokens, improve clarity |
| **Discipline** | Does agent rationalize? | Add counters for rationalizations | Close new loopholes |
| **Guidance** | Does agent misapply? | Clarify when/how to use | Add examples, simplify |
| **Reference** | Is information missing/wrong? | Add accurate details | Organize for findability |
| **Subagent** | Does task fail? | Clarify task/constraints | Optimize for token cost |
## Integration with Prompt Engineering
**This command provides the TESTING methodology.**
**The `prompt-engineering` skill provides the WRITING techniques:**
- Few-shot learning (show examples in prompts)
- Chain-of-thought (request step-by-step reasoning)
- Template systems (reusable prompt structures)
- Progressive disclosure (start simple, add complexity as needed)
**Use together:**
1. Design prompt using prompt-engineering patterns
2. Test prompt using this command (RED-GREEN-REFACTOR)
3. Optimize using prompt-engineering principles
4. Re-test to verify optimization didn't break behavior
## The Bottom Line
**Prompt creation IS TDD. Same principles, same cycle, same benefits.**
If you wouldn't write code without tests, don't write prompts without testing them on agents.
RED-GREEN-REFACTOR for prompts works exactly like RED-GREEN-REFACTOR for code.
**Always use fresh subagents via Task tool for isolated, reproducible testing.**