Files
gh-tenequm-claude-plugins-s…/skills/skill/references/obra-tdd-methodology.md
2025-11-30 09:01:20 +08:00

279 lines
7.4 KiB
Markdown

# Test-Driven Documentation (TDD for Skills)
Extracted from obra/superpowers-skills with attribution.
## Core Concept
**Writing skills IS Test-Driven Development applied to process documentation.**
Test skills with subagents BEFORE writing, iterate until bulletproof.
## TDD Mapping
| TDD Concept | Skill Creation |
|-------------|----------------|
| Test case | Pressure scenario with subagent |
| Production code | Skill document (SKILL.md) |
| Test fails (RED) | Agent violates rule without skill |
| Test passes (GREEN) | Agent complies with skill present |
| Refactor | Close loopholes while maintaining compliance |
## The Iron Law
```
NO SKILL WITHOUT A FAILING TEST FIRST
```
No exceptions:
- Not for "simple additions"
- Not for "just documentation updates"
- Delete untested changes and start over
## RED-GREEN-REFACTOR Cycle
### RED Phase: Write Failing Test
1. **Create pressure scenarios** with subagent
- Combine multiple pressures (time + sunk cost + authority)
- Make it HARD to follow the rule
2. **Run WITHOUT skill** - Document exact behavior:
- What choices did agent make?
- What rationalizations used? (verbatim quotes)
- Which pressures triggered violations?
3. **Identify patterns** in failures
### GREEN Phase: Write Minimal Skill
1. **Address specific baseline failures** found in RED
- Don't add hypothetical content
- Write only what fixes observed violations
2. **Run WITH skill** - Agent should now comply
3. **Verify compliance** under same pressure
### REFACTOR Phase: Close Loopholes
1. **Find new rationalizations** - Agent found workaround?
2. **Add explicit counters** to skill
3. **Re-test** until bulletproof
## Pressure Types
### Time Pressure
- "You have 5 minutes to ship this"
- "Deadline is in 1 hour"
### Sunk Cost
- "You've already written 200 lines of code"
- "This has been in development for 3 days"
### Authority
- "Senior engineer approved this approach"
- "This is company standard"
### Exhaustion
- "You've been debugging for 4 hours"
- "This is the 10th iteration"
**Combine 2-3 pressures** for realistic scenarios.
## Bulletproofing Against Rationalization
Agents will rationalize violations. Close every loophole explicitly:
### Pattern 1: Add Explicit Forbid List
❌ Bad:
```markdown
Write tests before code.
```
✅ Good:
```markdown
Write tests before code. Delete code if written first. Start over.
**No exceptions:**
- Don't keep as "reference"
- Don't "adapt" while testing
- Don't look at it
- Delete means delete
```
### Pattern 2: Build Rationalization Table
Capture every excuse from testing:
| Excuse | Reality |
|--------|---------|
| "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
| "I'll test after" | Tests passing immediately prove nothing. |
| "This is different because..." | It's not. Write test first. |
### Pattern 3: Create Red Flags List
```markdown
## Red Flags - STOP and Start Over
Saying any of these means you're about to violate the rule:
- "Code before test"
- "I already tested it manually"
- "Tests after achieve same purpose"
- "It's about spirit not ritual"
**All mean: Delete code. Start over with TDD.**
```
### Pattern 4: Address Spirit vs Letter
Add early in skill:
```markdown
**Violating the letter of the rules IS violating the spirit of the rules.**
```
Cuts off entire class of rationalizations.
## Testing Different Skill Types
### Discipline-Enforcing Skills
Rules/requirements (TDD, verification-before-completion)
**Test with:**
- Academic questions: Do they understand?
- Pressure scenarios: Do they comply under stress?
- Multiple pressures combined
- Identify and counter rationalizations
### Technique Skills
How-to guides (condition-based-waiting, root-cause-tracing)
**Test with:**
- Application scenarios: Can they apply correctly?
- Variation scenarios: Handle edge cases?
- Missing information tests: Instructions have gaps?
### Pattern Skills
Mental models (reducing-complexity concepts)
**Test with:**
- Recognition scenarios: Recognize when pattern applies?
- Application scenarios: Use the mental model?
- Counter-examples: Know when NOT to apply?
### Reference Skills
Documentation/APIs
**Test with:**
- Retrieval scenarios: Find the right information?
- Application scenarios: Use it correctly?
- Gap testing: Common use cases covered?
## Common Rationalization Excuses
| Excuse | Reality |
|--------|---------|
| "Skill is obviously clear" | Clear to you ≠ clear to agents. Test it. |
| "It's just a reference" | References have gaps. Test retrieval. |
| "Testing is overkill" | Untested skills have issues. Always. |
| "I'll test if problems emerge" | Problems = agents can't use skill. Test BEFORE. |
| "Too tedious to test" | Less tedious than debugging bad skill in production. |
| "I'm confident it's good" | Overconfidence guarantees issues. Test anyway. |
## Testing Workflow
```bash
# 1. RED - Create baseline
create_pressure_scenario()
run_without_skill() > baseline.txt
document_rationalizations()
# 2. GREEN - Write skill
write_minimal_skill_addressing_baseline()
run_with_skill() > with_skill.txt
verify_compliance()
# 3. REFACTOR - Close loopholes
identify_new_rationalizations()
add_explicit_counters_to_skill()
rerun_tests()
repeat_until_bulletproof()
```
## Real Example: TDD Skill for Skills
When creating the writing-skills skill itself:
**RED:**
- Ran scenario: "Create a simple condition-waiting skill"
- Without TDD skill: Agent wrote skill without testing
- Rationalization: "It's simple enough, testing would be overkill"
**GREEN:**
- Wrote skill with Iron Law and rationalization table
- Re-ran scenario with skill present
- Agent now creates test scenarios first
**REFACTOR:**
- New rationalization: "I'll test after since it's just documentation"
- Added explicit counter: "Tests after = what does this do? Tests first = what SHOULD this do?"
- Re-tested: Agent complies
## Deployment Checklist
For EACH skill (no batching):
**RED Phase:**
- [ ] Created pressure scenarios (3+ combined pressures)
- [ ] Ran WITHOUT skill - documented verbatim behavior
- [ ] Identified rationalization patterns
**GREEN Phase:**
- [ ] Wrote minimal skill addressing baseline failures
- [ ] Ran WITH skill - verified compliance
**REFACTOR Phase:**
- [ ] Identified NEW rationalizations
- [ ] Added explicit counters
- [ ] Built rationalization table
- [ ] Created red flags list
- [ ] Re-tested until bulletproof
**Quality:**
- [ ] Follows Anthropic best practices
- [ ] Has concrete examples
- [ ] Quick reference table included
**Deploy:**
- [ ] Committed to git
- [ ] Tested in production conversation
## Key Insights
1. **Test BEFORE writing** - Only way to know skill teaches right thing
2. **Pressure scenarios required** - Agents need stress to reveal rationalizations
3. **Explicit > Implicit** - Close every loophole with explicit forbid
4. **Iterate** - First version never bulletproof, refactor until it is
5. **Same rigor as code** - Skills are infrastructure, test like code
## Source
Based on [obra/superpowers-skills](https://github.com/obra/superpowers-skills)
- skills/meta/writing-skills/SKILL.md
- TDD methodology applied to documentation
## Quick TDD Checklist
Before deploying any skill:
- [ ] Created and documented failing baseline test?
- [ ] Skill addresses specific observed violations?
- [ ] Tested with skill present - passes?
- [ ] Identified and countered rationalizations?
- [ ] Explicit forbid list included?
- [ ] Red flags section present?
- [ ] Re-tested until bulletproof?