gh-tenequm-claude-plugins-s…/skills/skill/references/obra-tdd-methodology.md

# Test-Driven Documentation (TDD for Skills)

Extracted from obra/superpowers-skills with attribution.

## Core Concept

**Writing skills IS Test-Driven Development applied to process documentation.**

Test skills with subagents BEFORE writing, iterate until bulletproof.

## TDD Mapping

| TDD Concept | Skill Creation |
|-------------|----------------|
| Test case | Pressure scenario with subagent |
| Production code | Skill document (SKILL.md) |
| Test fails (RED) | Agent violates rule without skill |
| Test passes (GREEN) | Agent complies with skill present |
| Refactor | Close loopholes while maintaining compliance |

## The Iron Law

```
NO SKILL WITHOUT A FAILING TEST FIRST
```

No exceptions:
- Not for "simple additions"
- Not for "just documentation updates"
- Delete untested changes and start over

## RED-GREEN-REFACTOR Cycle

### RED Phase: Write Failing Test

1. **Create pressure scenarios** with subagent
   - Combine multiple pressures (time + sunk cost + authority)
   - Make it HARD to follow the rule

2. **Run WITHOUT skill** - Document exact behavior:
   - What choices did agent make?
   - What rationalizations used? (verbatim quotes)
   - Which pressures triggered violations?

3. **Identify patterns** in failures

### GREEN Phase: Write Minimal Skill

1. **Address specific baseline failures** found in RED
   - Don't add hypothetical content
   - Write only what fixes observed violations

2. **Run WITH skill** - Agent should now comply

3. **Verify compliance** under same pressure

### REFACTOR Phase: Close Loopholes

1. **Find new rationalizations** - Agent found workaround?

2. **Add explicit counters** to skill

3. **Re-test** until bulletproof

## Pressure Types

### Time Pressure
- "You have 5 minutes to ship this"
- "Deadline is in 1 hour"

### Sunk Cost
- "You've already written 200 lines of code"
- "This has been in development for 3 days"

### Authority
- "Senior engineer approved this approach"
- "This is company standard"

### Exhaustion
- "You've been debugging for 4 hours"
- "This is the 10th iteration"

**Combine 2-3 pressures** for realistic scenarios.

## Bulletproofing Against Rationalization

Agents will rationalize violations. Close every loophole explicitly:

### Pattern 1: Add Explicit Forbid List

❌ Bad:
```markdown
Write tests before code.
```

✅ Good:
```markdown
Write tests before code. Delete code if written first. Start over.

**No exceptions:**
- Don't keep as "reference"
- Don't "adapt" while testing
- Don't look at it
- Delete means delete
```

### Pattern 2: Build Rationalization Table

Capture every excuse from testing:

| Excuse | Reality |
|--------|---------|
| "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
| "I'll test after" | Tests passing immediately prove nothing. |
| "This is different because..." | It's not. Write test first. |

### Pattern 3: Create Red Flags List

```markdown
## Red Flags - STOP and Start Over

Saying any of these means you're about to violate the rule:
- "Code before test"
- "I already tested it manually"
- "Tests after achieve same purpose"
- "It's about spirit not ritual"

**All mean: Delete code. Start over with TDD.**
```

### Pattern 4: Address Spirit vs Letter

Add early in skill:

```markdown
**Violating the letter of the rules IS violating the spirit of the rules.**
```

Cuts off entire class of rationalizations.

## Testing Different Skill Types

### Discipline-Enforcing Skills
Rules/requirements (TDD, verification-before-completion)

**Test with:**
- Academic questions: Do they understand?
- Pressure scenarios: Do they comply under stress?
- Multiple pressures combined
- Identify and counter rationalizations

### Technique Skills
How-to guides (condition-based-waiting, root-cause-tracing)

**Test with:**
- Application scenarios: Can they apply correctly?
- Variation scenarios: Handle edge cases?
- Missing information tests: Instructions have gaps?

### Pattern Skills
Mental models (reducing-complexity concepts)

**Test with:**
- Recognition scenarios: Recognize when pattern applies?
- Application scenarios: Use the mental model?
- Counter-examples: Know when NOT to apply?

### Reference Skills
Documentation/APIs

**Test with:**
- Retrieval scenarios: Find the right information?
- Application scenarios: Use it correctly?
- Gap testing: Common use cases covered?

## Common Rationalization Excuses

| Excuse | Reality |
|--------|---------|
| "Skill is obviously clear" | Clear to you ≠ clear to agents. Test it. |
| "It's just a reference" | References have gaps. Test retrieval. |
| "Testing is overkill" | Untested skills have issues. Always. |
| "I'll test if problems emerge" | Problems = agents can't use skill. Test BEFORE. |
| "Too tedious to test" | Less tedious than debugging bad skill in production. |
| "I'm confident it's good" | Overconfidence guarantees issues. Test anyway. |

## Testing Workflow

```bash
# 1. RED - Create baseline
create_pressure_scenario()
run_without_skill() > baseline.txt
document_rationalizations()

# 2. GREEN - Write skill
write_minimal_skill_addressing_baseline()
run_with_skill() > with_skill.txt
verify_compliance()

# 3. REFACTOR - Close loopholes
identify_new_rationalizations()
add_explicit_counters_to_skill()
rerun_tests()
repeat_until_bulletproof()
```

## Real Example: TDD Skill for Skills

When creating the writing-skills skill itself:

**RED:**
- Ran scenario: "Create a simple condition-waiting skill"
- Without TDD skill: Agent wrote skill without testing
- Rationalization: "It's simple enough, testing would be overkill"

**GREEN:**
- Wrote skill with Iron Law and rationalization table
- Re-ran scenario with skill present
- Agent now creates test scenarios first

**REFACTOR:**
- New rationalization: "I'll test after since it's just documentation"
- Added explicit counter: "Tests after = what does this do? Tests first = what SHOULD this do?"
- Re-tested: Agent complies

## Deployment Checklist

For EACH skill (no batching):

**RED Phase:**
- [ ] Created pressure scenarios (3+ combined pressures)
- [ ] Ran WITHOUT skill - documented verbatim behavior
- [ ] Identified rationalization patterns

**GREEN Phase:**
- [ ] Wrote minimal skill addressing baseline failures
- [ ] Ran WITH skill - verified compliance

**REFACTOR Phase:**
- [ ] Identified NEW rationalizations
- [ ] Added explicit counters
- [ ] Built rationalization table
- [ ] Created red flags list
- [ ] Re-tested until bulletproof

**Quality:**
- [ ] Follows Anthropic best practices
- [ ] Has concrete examples
- [ ] Quick reference table included

**Deploy:**
- [ ] Committed to git
- [ ] Tested in production conversation

## Key Insights

1. **Test BEFORE writing** - Only way to know skill teaches right thing
2. **Pressure scenarios required** - Agents need stress to reveal rationalizations
3. **Explicit > Implicit** - Close every loophole with explicit forbid
4. **Iterate** - First version never bulletproof, refactor until it is
5. **Same rigor as code** - Skills are infrastructure, test like code

## Source

Based on [obra/superpowers-skills](https://github.com/obra/superpowers-skills)
- skills/meta/writing-skills/SKILL.md
- TDD methodology applied to documentation

## Quick TDD Checklist

Before deploying any skill:
- [ ] Created and documented failing baseline test?
- [ ] Skill addresses specific observed violations?
- [ ] Tested with skill present - passes?
- [ ] Identified and countered rationalizations?
- [ ] Explicit forbid list included?
- [ ] Red flags section present?
- [ ] Re-tested until bulletproof?