7.4 KiB
Test-Driven Documentation (TDD for Skills)
Extracted from obra/superpowers-skills with attribution.
Core Concept
Writing skills IS Test-Driven Development applied to process documentation.
Test skills with subagents BEFORE writing, iterate until bulletproof.
TDD Mapping
| TDD Concept | Skill Creation |
|---|---|
| Test case | Pressure scenario with subagent |
| Production code | Skill document (SKILL.md) |
| Test fails (RED) | Agent violates rule without skill |
| Test passes (GREEN) | Agent complies with skill present |
| Refactor | Close loopholes while maintaining compliance |
The Iron Law
NO SKILL WITHOUT A FAILING TEST FIRST
No exceptions:
- Not for "simple additions"
- Not for "just documentation updates"
- Delete untested changes and start over
RED-GREEN-REFACTOR Cycle
RED Phase: Write Failing Test
-
Create pressure scenarios with subagent
- Combine multiple pressures (time + sunk cost + authority)
- Make it HARD to follow the rule
-
Run WITHOUT skill - Document exact behavior:
- What choices did agent make?
- What rationalizations used? (verbatim quotes)
- Which pressures triggered violations?
-
Identify patterns in failures
GREEN Phase: Write Minimal Skill
-
Address specific baseline failures found in RED
- Don't add hypothetical content
- Write only what fixes observed violations
-
Run WITH skill - Agent should now comply
-
Verify compliance under same pressure
REFACTOR Phase: Close Loopholes
-
Find new rationalizations - Agent found workaround?
-
Add explicit counters to skill
-
Re-test until bulletproof
Pressure Types
Time Pressure
- "You have 5 minutes to ship this"
- "Deadline is in 1 hour"
Sunk Cost
- "You've already written 200 lines of code"
- "This has been in development for 3 days"
Authority
- "Senior engineer approved this approach"
- "This is company standard"
Exhaustion
- "You've been debugging for 4 hours"
- "This is the 10th iteration"
Combine 2-3 pressures for realistic scenarios.
Bulletproofing Against Rationalization
Agents will rationalize violations. Close every loophole explicitly:
Pattern 1: Add Explicit Forbid List
❌ Bad:
Write tests before code.
✅ Good:
Write tests before code. Delete code if written first. Start over.
**No exceptions:**
- Don't keep as "reference"
- Don't "adapt" while testing
- Don't look at it
- Delete means delete
Pattern 2: Build Rationalization Table
Capture every excuse from testing:
| Excuse | Reality |
|---|---|
| "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
| "I'll test after" | Tests passing immediately prove nothing. |
| "This is different because..." | It's not. Write test first. |
Pattern 3: Create Red Flags List
## Red Flags - STOP and Start Over
Saying any of these means you're about to violate the rule:
- "Code before test"
- "I already tested it manually"
- "Tests after achieve same purpose"
- "It's about spirit not ritual"
**All mean: Delete code. Start over with TDD.**
Pattern 4: Address Spirit vs Letter
Add early in skill:
**Violating the letter of the rules IS violating the spirit of the rules.**
Cuts off entire class of rationalizations.
Testing Different Skill Types
Discipline-Enforcing Skills
Rules/requirements (TDD, verification-before-completion)
Test with:
- Academic questions: Do they understand?
- Pressure scenarios: Do they comply under stress?
- Multiple pressures combined
- Identify and counter rationalizations
Technique Skills
How-to guides (condition-based-waiting, root-cause-tracing)
Test with:
- Application scenarios: Can they apply correctly?
- Variation scenarios: Handle edge cases?
- Missing information tests: Instructions have gaps?
Pattern Skills
Mental models (reducing-complexity concepts)
Test with:
- Recognition scenarios: Recognize when pattern applies?
- Application scenarios: Use the mental model?
- Counter-examples: Know when NOT to apply?
Reference Skills
Documentation/APIs
Test with:
- Retrieval scenarios: Find the right information?
- Application scenarios: Use it correctly?
- Gap testing: Common use cases covered?
Common Rationalization Excuses
| Excuse | Reality |
|---|---|
| "Skill is obviously clear" | Clear to you ≠ clear to agents. Test it. |
| "It's just a reference" | References have gaps. Test retrieval. |
| "Testing is overkill" | Untested skills have issues. Always. |
| "I'll test if problems emerge" | Problems = agents can't use skill. Test BEFORE. |
| "Too tedious to test" | Less tedious than debugging bad skill in production. |
| "I'm confident it's good" | Overconfidence guarantees issues. Test anyway. |
Testing Workflow
# 1. RED - Create baseline
create_pressure_scenario()
run_without_skill() > baseline.txt
document_rationalizations()
# 2. GREEN - Write skill
write_minimal_skill_addressing_baseline()
run_with_skill() > with_skill.txt
verify_compliance()
# 3. REFACTOR - Close loopholes
identify_new_rationalizations()
add_explicit_counters_to_skill()
rerun_tests()
repeat_until_bulletproof()
Real Example: TDD Skill for Skills
When creating the writing-skills skill itself:
RED:
- Ran scenario: "Create a simple condition-waiting skill"
- Without TDD skill: Agent wrote skill without testing
- Rationalization: "It's simple enough, testing would be overkill"
GREEN:
- Wrote skill with Iron Law and rationalization table
- Re-ran scenario with skill present
- Agent now creates test scenarios first
REFACTOR:
- New rationalization: "I'll test after since it's just documentation"
- Added explicit counter: "Tests after = what does this do? Tests first = what SHOULD this do?"
- Re-tested: Agent complies
Deployment Checklist
For EACH skill (no batching):
RED Phase:
- Created pressure scenarios (3+ combined pressures)
- Ran WITHOUT skill - documented verbatim behavior
- Identified rationalization patterns
GREEN Phase:
- Wrote minimal skill addressing baseline failures
- Ran WITH skill - verified compliance
REFACTOR Phase:
- Identified NEW rationalizations
- Added explicit counters
- Built rationalization table
- Created red flags list
- Re-tested until bulletproof
Quality:
- Follows Anthropic best practices
- Has concrete examples
- Quick reference table included
Deploy:
- Committed to git
- Tested in production conversation
Key Insights
- Test BEFORE writing - Only way to know skill teaches right thing
- Pressure scenarios required - Agents need stress to reveal rationalizations
- Explicit > Implicit - Close every loophole with explicit forbid
- Iterate - First version never bulletproof, refactor until it is
- Same rigor as code - Skills are infrastructure, test like code
Source
Based on obra/superpowers-skills
- skills/meta/writing-skills/SKILL.md
- TDD methodology applied to documentation
Quick TDD Checklist
Before deploying any skill:
- Created and documented failing baseline test?
- Skill addresses specific observed violations?
- Tested with skill present - passes?
- Identified and countered rationalizations?
- Explicit forbid list included?
- Red flags section present?
- Re-tested until bulletproof?