Initial commit
skills/skill/references/obra-tdd-methodology.md
# Test-Driven Documentation (TDD for Skills)

Extracted from obra/superpowers-skills with attribution.

## Core Concept

**Writing skills IS Test-Driven Development applied to process documentation.**

Test skills with subagents BEFORE writing them, then iterate until bulletproof.

## TDD Mapping

| TDD Concept | Skill Creation |
|-------------|----------------|
| Test case | Pressure scenario with subagent |
| Production code | Skill document (SKILL.md) |
| Test fails (RED) | Agent violates rule without skill |
| Test passes (GREEN) | Agent complies with skill present |
| Refactor | Close loopholes while maintaining compliance |

## The Iron Law

```
NO SKILL WITHOUT A FAILING TEST FIRST
```

No exceptions:
- Not for "simple additions"
- Not for "just documentation updates"
- Delete untested changes and start over

## RED-GREEN-REFACTOR Cycle

### RED Phase: Write Failing Test

1. **Create pressure scenarios** with a subagent
   - Combine multiple pressures (time + sunk cost + authority)
   - Make it HARD to follow the rule

2. **Run WITHOUT skill** - Document exact behavior:
   - What choices did the agent make?
   - What rationalizations did it use? (verbatim quotes)
   - Which pressures triggered violations?

3. **Identify patterns** in failures
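
One way to keep RED-phase observations comparable across runs is to record each baseline run in a small structure and then count which pressures show up in violating runs. The sketch below is illustrative only: the `BaselineRun` class, its field names, and the sample scenarios are assumptions made for this example, not part of the original methodology.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class BaselineRun:
    """One run of a pressure scenario WITHOUT the skill loaded (hypothetical record)."""
    scenario: str
    pressures: list[str]
    choices: list[str] = field(default_factory=list)
    rationalizations: list[str] = field(default_factory=list)  # verbatim quotes
    violated: bool = False


def pressure_failure_counts(runs: list[BaselineRun]) -> Counter:
    """Count how often each pressure was present in a run that violated the rule."""
    counts: Counter = Counter()
    for run in runs:
        if run.violated:
            counts.update(run.pressures)
    return counts


# Sample data invented for illustration.
runs = [
    BaselineRun(
        scenario="Ship a fix before the demo",
        pressures=["time", "authority"],
        choices=["skipped the failing test"],
        rationalizations=["It's simple enough, testing would be overkill"],
        violated=True,
    ),
    BaselineRun(
        scenario="Extend a 200-line prototype",
        pressures=["sunk cost", "time"],
        choices=["kept untested code"],
        rationalizations=["I'll test after"],
        violated=True,
    ),
    BaselineRun(scenario="Greenfield helper", pressures=["authority"]),
]

print(pressure_failure_counts(runs).most_common())
```

Pressures that dominate the counts are the ones the minimal skill should address first.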

### GREEN Phase: Write Minimal Skill

1. **Address specific baseline failures** found in RED
   - Don't add hypothetical content
   - Write only what fixes observed violations

2. **Run WITH skill** - The agent should now comply

3. **Verify compliance** under the same pressure
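
The GREEN check can be sketched as a comparison between the baseline results and the results with the skill present. This is a minimal sketch under stated assumptions: `verify_compliance` is a hypothetical helper, results are reduced to one boolean per scenario, and rejecting scenarios that never failed in the baseline is one reading of the Iron Law rather than something the source spells out.

```python
def verify_compliance(baseline: dict[str, bool], with_skill: dict[str, bool]) -> list[str]:
    """Return the scenarios that still fail with the skill present.

    Both dicts map scenario name -> True if the agent violated the rule.
    """
    if baseline.keys() != with_skill.keys():
        raise ValueError("GREEN must re-run exactly the RED scenarios")
    never_failed = [name for name, violated in baseline.items() if not violated]
    if never_failed:
        # Assumption: a scenario that passed without the skill is not a failing test.
        raise ValueError(f"No failing baseline for: {never_failed}")
    return [name for name, violated in with_skill.items() if violated]


# Scenario names invented for illustration.
baseline = {"deadline + sunk cost": True, "authority + exhaustion": True}
with_skill = {"deadline + sunk cost": False, "authority + exhaustion": True}
still_failing = verify_compliance(baseline, with_skill)
print(still_failing)
```

An empty result means the skill is GREEN for the scenarios tested; anything left over feeds the REFACTOR phase.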

### REFACTOR Phase: Close Loopholes

1. **Find new rationalizations** - Did the agent find a workaround?

2. **Add explicit counters** to the skill

3. **Re-test** until bulletproof
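
The REFACTOR loop can be pictured as "run, collect new excuses, add a counter for each, run again". The sketch below assumes a hypothetical `run_scenarios` callable that stands in for a real subagent run, and uses a toy `fake_agent` so the loop can be executed; the wording of the appended counters is also invented for illustration.

```python
from typing import Callable


def refactor_until_compliant(
    skill_text: str,
    run_scenarios: Callable[[str], list[str]],
    max_rounds: int = 5,
) -> tuple[str, int]:
    """Append an explicit counter for each new rationalization, then re-test.

    run_scenarios(skill_text) returns the verbatim rationalizations the agent
    used to justify a violation; an empty list means it complied.
    """
    for round_number in range(1, max_rounds + 1):
        excuses = run_scenarios(skill_text)
        if not excuses:
            return skill_text, round_number
        for excuse in excuses:
            skill_text += f'\n- Red flag: "{excuse}". Stop and start over.'
    raise RuntimeError(f"Still finding loopholes after {max_rounds} rounds")


def fake_agent(skill_text: str) -> list[str]:
    """Toy stand-in for a subagent: drops an excuse once the skill names it."""
    known_excuses = ["I'll test after", "It's about spirit not ritual"]
    return [e for e in known_excuses if e not in skill_text][:1]


final_skill, rounds = refactor_until_compliant("Write tests before code.", fake_agent)
print(rounds)
print(final_skill)
```

The `max_rounds` cap is a design choice for the sketch: it makes a skill that never converges fail loudly instead of looping forever.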

## Pressure Types

### Time Pressure
- "You have 5 minutes to ship this"
- "Deadline is in 1 hour"

### Sunk Cost
- "You've already written 200 lines of code"
- "This has been in development for 3 days"

### Authority
- "Senior engineer approved this approach"
- "This is company standard"

### Exhaustion
- "You've been debugging for 4 hours"
- "This is the 10th iteration"

**Combine 2-3 pressures** for realistic scenarios.
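
Combining pressures can be done mechanically. The sketch below enumerates every 2- and 3-pressure combination with `itertools.combinations` and prepends the framing sentences to a task. The `PRESSURES` mapping reuses one example sentence per pressure type from the lists above; the `combined_prompts` helper and the sample task are assumptions for illustration.

```python
from itertools import combinations

# One example sentence per pressure type, taken from the lists above.
PRESSURES = {
    "time": "You have 5 minutes to ship this.",
    "sunk cost": "You've already written 200 lines of code.",
    "authority": "Senior engineer approved this approach.",
    "exhaustion": "You've been debugging for 4 hours.",
}


def combined_prompts(task: str, min_size: int = 2, max_size: int = 3) -> list[str]:
    """Build one prompt per combination of min_size..max_size pressures."""
    prompts = []
    for size in range(min_size, max_size + 1):
        for names in combinations(PRESSURES, size):
            framing = " ".join(PRESSURES[name] for name in names)
            prompts.append(f"{framing} {task}")
    return prompts


prompts = combined_prompts("Add the retry logic now.")  # hypothetical task
print(len(prompts))
print(prompts[0])
```

With four pressure types this yields six pairs and four triples, which is usually more scenarios than needed; picking the combinations closest to real working conditions keeps the test set small.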

## Bulletproofing Against Rationalization

Agents will rationalize violations. Close every loophole explicitly:

### Pattern 1: Add Explicit Forbid List

❌ Bad:
```markdown
Write tests before code.
```

✅ Good:
```markdown
Write tests before code. Delete code if written first. Start over.

**No exceptions:**
- Don't keep as "reference"
- Don't "adapt" while testing
- Don't look at it
- Delete means delete
```

### Pattern 2: Build Rationalization Table

Capture every excuse from testing:

| Excuse | Reality |
|--------|---------|
| "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
| "I'll test after" | Tests passing immediately prove nothing. |
| "This is different because..." | It's not. Write test first. |
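
If excuses are collected as data during testing, the table can be regenerated rather than edited by hand. The helper below is a minimal sketch; `rationalization_table` is a hypothetical name, and the rows reuse two entries from the table above.

```python
def rationalization_table(rows: list[tuple[str, str]]) -> str:
    """Render (excuse, reality) pairs as a Markdown table."""
    lines = ["| Excuse | Reality |", "|--------|---------|"]
    for excuse, reality in rows:
        lines.append(f'| "{excuse}" | {reality} |')
    return "\n".join(lines)


observed = [
    ("Too simple to test", "Simple code breaks. Test takes 30 seconds."),
    ("I'll test after", "Tests passing immediately prove nothing."),
]
table = rationalization_table(observed)
print(table)
```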

### Pattern 3: Create Red Flags List

```markdown
## Red Flags - STOP and Start Over

Saying any of these means you're about to violate the rule:
- "Code before test"
- "I already tested it manually"
- "Tests after achieve same purpose"
- "It's about spirit not ritual"

**All mean: Delete code. Start over with TDD.**
```
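
A red flags list also gives a cheap first pass when reviewing subagent transcripts. The sketch below does a case-insensitive substring search for the four phrases listed above; `find_red_flags` and the sample transcript are assumptions for illustration. Exact matching misses paraphrases, so this supplements reading the transcript rather than replacing it.

```python
# Lower-cased copies of the phrases from the red flags list above.
RED_FLAGS = [
    "code before test",
    "i already tested it manually",
    "tests after achieve same purpose",
    "it's about spirit not ritual",
]


def find_red_flags(transcript: str) -> list[str]:
    """Return the red-flag phrases that occur in the transcript (case-insensitive)."""
    lowered = transcript.lower()
    return [flag for flag in RED_FLAGS if flag in lowered]


# Hypothetical agent output.
transcript = "I already tested it manually, so adding a test now is redundant."
hits = find_red_flags(transcript)
print(hits)
```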

### Pattern 4: Address Spirit vs Letter

Add early in the skill:

```markdown
**Violating the letter of the rules IS violating the spirit of the rules.**
```

This cuts off an entire class of rationalizations.

## Testing Different Skill Types

### Discipline-Enforcing Skills

Rules/requirements (TDD, verification-before-completion)

**Test with:**
- Academic questions: Do they understand?
- Pressure scenarios: Do they comply under stress?
- Multiple pressures combined
- Identify and counter rationalizations

### Technique Skills

How-to guides (condition-based-waiting, root-cause-tracing)

**Test with:**
- Application scenarios: Can they apply correctly?
- Variation scenarios: Handle edge cases?
- Missing information tests: Instructions have gaps?

### Pattern Skills

Mental models (reducing-complexity concepts)

**Test with:**
- Recognition scenarios: Recognize when pattern applies?
- Application scenarios: Use the mental model?
- Counter-examples: Know when NOT to apply?

### Reference Skills

Documentation/APIs

**Test with:**
- Retrieval scenarios: Find the right information?
- Application scenarios: Use it correctly?
- Gap testing: Common use cases covered?
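
The mapping from skill type to test kinds can be kept as a lookup so that every new skill starts from the right test plan. The sketch below encodes the four lists above; the short type keys and the `build_test_plan` helper are hypothetical names chosen for this example.

```python
# Test kinds per skill type, condensed from the four subsections above.
TEST_KINDS = {
    "discipline": [
        "academic questions",
        "pressure scenarios",
        "multiple pressures combined",
        "rationalization counters",
    ],
    "technique": ["application scenarios", "variation scenarios", "missing information tests"],
    "pattern": ["recognition scenarios", "application scenarios", "counter-examples"],
    "reference": ["retrieval scenarios", "application scenarios", "gap testing"],
}


def build_test_plan(skill_name: str, skill_type: str) -> list[str]:
    """List the test kinds to run for a skill of the given type."""
    if skill_type not in TEST_KINDS:
        raise ValueError(f"Unknown skill type: {skill_type!r}")
    return [f"{skill_name}: {kind}" for kind in TEST_KINDS[skill_type]]


plan = build_test_plan("condition-based-waiting", "technique")
for item in plan:
    print(item)
```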

## Common Rationalization Excuses

| Excuse | Reality |
|--------|---------|
| "Skill is obviously clear" | Clear to you ≠ clear to agents. Test it. |
| "It's just a reference" | References have gaps. Test retrieval. |
| "Testing is overkill" | Untested skills have issues. Always. |
| "I'll test if problems emerge" | Problems = agents can't use skill. Test BEFORE. |
| "Too tedious to test" | Less tedious than debugging bad skill in production. |
| "I'm confident it's good" | Overconfidence guarantees issues. Test anyway. |
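
## Testing Workflow

```bash
# Every step name below is a placeholder for a manual step or your own script.
# This loop stubs each one so the outline runs as written.
for step in create_pressure_scenario run_without_skill document_rationalizations \
    write_minimal_skill_addressing_baseline run_with_skill verify_compliance \
    identify_new_rationalizations add_explicit_counters_to_skill rerun_tests; do
    eval "$step() { echo \"TODO: $step\"; }"
done

# 1. RED - Create baseline
create_pressure_scenario
run_without_skill > baseline.txt
document_rationalizations

# 2. GREEN - Write skill
write_minimal_skill_addressing_baseline
run_with_skill > with_skill.txt
verify_compliance

# 3. REFACTOR - Close loopholes
while true; do
    identify_new_rationalizations
    add_explicit_counters_to_skill
    rerun_tests && break   # repeat until bulletproof
done
```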

## Real Example: TDD Skill for Skills

When creating the writing-skills skill itself:

**RED:**
- Ran scenario: "Create a simple condition-waiting skill"
- Without TDD skill: Agent wrote skill without testing
- Rationalization: "It's simple enough, testing would be overkill"

**GREEN:**
- Wrote skill with Iron Law and rationalization table
- Re-ran scenario with skill present
- Agent now creates test scenarios first

**REFACTOR:**
- New rationalization: "I'll test after since it's just documentation"
- Added explicit counter: "Tests after = what does this do? Tests first = what SHOULD this do?"
- Re-tested: Agent complies

## Deployment Checklist

For EACH skill (no batching):

**RED Phase:**
- [ ] Created pressure scenarios (3+ combined pressures)
- [ ] Ran WITHOUT skill - documented verbatim behavior
- [ ] Identified rationalization patterns

**GREEN Phase:**
- [ ] Wrote minimal skill addressing baseline failures
- [ ] Ran WITH skill - verified compliance

**REFACTOR Phase:**
- [ ] Identified NEW rationalizations
- [ ] Added explicit counters
- [ ] Built rationalization table
- [ ] Created red flags list
- [ ] Re-tested until bulletproof

**Quality:**
- [ ] Follows Anthropic best practices
- [ ] Has concrete examples
- [ ] Quick reference table included

**Deploy:**
- [ ] Committed to git
- [ ] Tested in production conversation

## Key Insights

1. **Test BEFORE writing** - Only way to know skill teaches right thing
2. **Pressure scenarios required** - Agents need stress to reveal rationalizations
3. **Explicit > Implicit** - Close every loophole with explicit forbid
4. **Iterate** - First version never bulletproof, refactor until it is
5. **Same rigor as code** - Skills are infrastructure, test like code

## Source

Based on [obra/superpowers-skills](https://github.com/obra/superpowers-skills)
- skills/meta/writing-skills/SKILL.md
- TDD methodology applied to documentation

## Quick TDD Checklist

Before deploying any skill:
- [ ] Created and documented failing baseline test?
- [ ] Skill addresses specific observed violations?
- [ ] Tested with skill present - passes?
- [ ] Identified and countered rationalizations?
- [ ] Explicit forbid list included?
- [ ] Red flags section present?
- [ ] Re-tested until bulletproof?