Initial commit

2025-11-30 09:01:20 +08:00
commit 650cfdbc05
15 changed files with 4252 additions and 0 deletions
--- a/skills/skill/references/obra-tdd-methodology.md
+++ b/skills/skill/references/obra-tdd-methodology.md
@@ -0,0 +1,278 @@
+# Test-Driven Documentation (TDD for Skills)
+
+Extracted from obra/superpowers-skills with attribution.
+
+## Core Concept
+
+**Writing skills IS Test-Driven Development applied to process documentation.**
+
+Test skills with subagents BEFORE writing, iterate until bulletproof.
+
+## TDD Mapping
+
+| TDD Concept | Skill Creation |
+|-------------|----------------|
+| Test case | Pressure scenario with subagent |
+| Production code | Skill document (SKILL.md) |
+| Test fails (RED) | Agent violates rule without skill |
+| Test passes (GREEN) | Agent complies with skill present |
+| Refactor | Close loopholes while maintaining compliance |
+
+## The Iron Law
+
+```
+NO SKILL WITHOUT A FAILING TEST FIRST
+```
+
+No exceptions:
+- Not for "simple additions"
+- Not for "just documentation updates"
+- Delete untested changes and start over
+
+## RED-GREEN-REFACTOR Cycle
+
+### RED Phase: Write Failing Test
+
+1. **Create pressure scenarios** with subagent
+   - Combine multiple pressures (time + sunk cost + authority)
+   - Make it HARD to follow the rule
+
+2. **Run WITHOUT skill** - Document exact behavior:
+   - What choices did agent make?
+   - What rationalizations used? (verbatim quotes)
+   - Which pressures triggered violations?
+
+3. **Identify patterns** in failures
+
+### GREEN Phase: Write Minimal Skill
+
+1. **Address specific baseline failures** found in RED
+   - Don't add hypothetical content
+   - Write only what fixes observed violations
+
+2. **Run WITH skill** - Agent should now comply
+
+3. **Verify compliance** under same pressure
+
+### REFACTOR Phase: Close Loopholes
+
+1. **Find new rationalizations** - Agent found workaround?
+
+2. **Add explicit counters** to skill
+
+3. **Re-test** until bulletproof
+
+## Pressure Types
+
+### Time Pressure
+- "You have 5 minutes to ship this"
+- "Deadline is in 1 hour"
+
+### Sunk Cost
+- "You've already written 200 lines of code"
+- "This has been in development for 3 days"
+
+### Authority
+- "Senior engineer approved this approach"
+- "This is company standard"
+
+### Exhaustion
+- "You've been debugging for 4 hours"
+- "This is the 10th iteration"
+
+**Combine 2-3 pressures** for realistic scenarios.
+
+## Bulletproofing Against Rationalization
+
+Agents will rationalize violations. Close every loophole explicitly:
+
+### Pattern 1: Add Explicit Forbid List
+
+❌ Bad:
+```markdown
+Write tests before code.
+```
+
+✅ Good:
+```markdown
+Write tests before code. Delete code if written first. Start over.
+
+**No exceptions:**
+- Don't keep as "reference"
+- Don't "adapt" while testing
+- Don't look at it
+- Delete means delete
+```
+
+### Pattern 2: Build Rationalization Table
+
+Capture every excuse from testing:
+
+| Excuse | Reality |
+|--------|---------|
+| "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
+| "I'll test after" | Tests passing immediately prove nothing. |
+| "This is different because..." | It's not. Write test first. |
+
+### Pattern 3: Create Red Flags List
+
+```markdown
+## Red Flags - STOP and Start Over
+
+Saying any of these means you're about to violate the rule:
+- "Code before test"
+- "I already tested it manually"
+- "Tests after achieve same purpose"
+- "It's about spirit not ritual"
+
+**All mean: Delete code. Start over with TDD.**
+```
+
+### Pattern 4: Address Spirit vs Letter
+
+Add early in skill:
+
+```markdown
+**Violating the letter of the rules IS violating the spirit of the rules.**
+```
+
+Cuts off entire class of rationalizations.
+
+## Testing Different Skill Types
+
+### Discipline-Enforcing Skills
+Rules/requirements (TDD, verification-before-completion)
+
+**Test with:**
+- Academic questions: Do they understand?
+- Pressure scenarios: Do they comply under stress?
+- Multiple pressures combined
+- Identify and counter rationalizations
+
+### Technique Skills
+How-to guides (condition-based-waiting, root-cause-tracing)
+
+**Test with:**
+- Application scenarios: Can they apply correctly?
+- Variation scenarios: Handle edge cases?
+- Missing information tests: Instructions have gaps?
+
+### Pattern Skills
+Mental models (reducing-complexity concepts)
+
+**Test with:**
+- Recognition scenarios: Recognize when pattern applies?
+- Application scenarios: Use the mental model?
+- Counter-examples: Know when NOT to apply?
+
+### Reference Skills
+Documentation/APIs
+
+**Test with:**
+- Retrieval scenarios: Find the right information?
+- Application scenarios: Use it correctly?
+- Gap testing: Common use cases covered?
+
+## Common Rationalization Excuses
+
+| Excuse | Reality |
+|--------|---------|
+| "Skill is obviously clear" | Clear to you ≠ clear to agents. Test it. |
+| "It's just a reference" | References have gaps. Test retrieval. |
+| "Testing is overkill" | Untested skills have issues. Always. |
+| "I'll test if problems emerge" | Problems = agents can't use skill. Test BEFORE. |
+| "Too tedious to test" | Less tedious than debugging bad skill in production. |
+| "I'm confident it's good" | Overconfidence guarantees issues. Test anyway. |
+
+## Testing Workflow
+
+```bash
+# 1. RED - Create baseline
+create_pressure_scenario()
+run_without_skill() > baseline.txt
+document_rationalizations()
+
+# 2. GREEN - Write skill
+write_minimal_skill_addressing_baseline()
+run_with_skill() > with_skill.txt
+verify_compliance()
+
+# 3. REFACTOR - Close loopholes
+identify_new_rationalizations()
+add_explicit_counters_to_skill()
+rerun_tests()
+repeat_until_bulletproof()
+```
+
+## Real Example: TDD Skill for Skills
+
+When creating the writing-skills skill itself:
+
+**RED:**
+- Ran scenario: "Create a simple condition-waiting skill"
+- Without TDD skill: Agent wrote skill without testing
+- Rationalization: "It's simple enough, testing would be overkill"
+
+**GREEN:**
+- Wrote skill with Iron Law and rationalization table
+- Re-ran scenario with skill present
+- Agent now creates test scenarios first
+
+**REFACTOR:**
+- New rationalization: "I'll test after since it's just documentation"
+- Added explicit counter: "Tests after = what does this do? Tests first = what SHOULD this do?"
+- Re-tested: Agent complies
+
+## Deployment Checklist
+
+For EACH skill (no batching):
+
+**RED Phase:**
+- [ ] Created pressure scenarios (3+ combined pressures)
+- [ ] Ran WITHOUT skill - documented verbatim behavior
+- [ ] Identified rationalization patterns
+
+**GREEN Phase:**
+- [ ] Wrote minimal skill addressing baseline failures
+- [ ] Ran WITH skill - verified compliance
+
+**REFACTOR Phase:**
+- [ ] Identified NEW rationalizations
+- [ ] Added explicit counters
+- [ ] Built rationalization table
+- [ ] Created red flags list
+- [ ] Re-tested until bulletproof
+
+**Quality:**
+- [ ] Follows Anthropic best practices
+- [ ] Has concrete examples
+- [ ] Quick reference table included
+
+**Deploy:**
+- [ ] Committed to git
+- [ ] Tested in production conversation
+
+## Key Insights
+
+1. **Test BEFORE writing** - Only way to know skill teaches right thing
+2. **Pressure scenarios required** - Agents need stress to reveal rationalizations
+3. **Explicit > Implicit** - Close every loophole with explicit forbid
+4. **Iterate** - First version never bulletproof, refactor until it is
+5. **Same rigor as code** - Skills are infrastructure, test like code
+
+## Source
+
+Based on [obra/superpowers-skills](https://github.com/obra/superpowers-skills)
+- skills/meta/writing-skills/SKILL.md
+- TDD methodology applied to documentation
+
+## Quick TDD Checklist
+
+Before deploying any skill:
+- [ ] Created and documented failing baseline test?
+- [ ] Skill addresses specific observed violations?
+- [ ] Tested with skill present - passes?
+- [ ] Identified and countered rationalizations?
+- [ ] Explicit forbid list included?
+- [ ] Red flags section present?
+- [ ] Re-tested until bulletproof?