Files
gh-tenequm-claude-plugins-s…/skills/skill/references/obra-tdd-methodology.md
2025-11-30 09:01:20 +08:00

7.4 KiB

Test-Driven Documentation (TDD for Skills)

Extracted from obra/superpowers-skills with attribution.

Core Concept

Writing skills IS Test-Driven Development applied to process documentation.

Test skills with subagents BEFORE writing, iterate until bulletproof.

TDD Mapping

TDD Concept Skill Creation
Test case Pressure scenario with subagent
Production code Skill document (SKILL.md)
Test fails (RED) Agent violates rule without skill
Test passes (GREEN) Agent complies with skill present
Refactor Close loopholes while maintaining compliance

The Iron Law

NO SKILL WITHOUT A FAILING TEST FIRST

No exceptions:

  • Not for "simple additions"
  • Not for "just documentation updates"
  • Delete untested changes and start over

RED-GREEN-REFACTOR Cycle

RED Phase: Write Failing Test

  1. Create pressure scenarios with subagent

    • Combine multiple pressures (time + sunk cost + authority)
    • Make it HARD to follow the rule
  2. Run WITHOUT skill - Document exact behavior:

    • What choices did agent make?
    • What rationalizations used? (verbatim quotes)
    • Which pressures triggered violations?
  3. Identify patterns in failures

GREEN Phase: Write Minimal Skill

  1. Address specific baseline failures found in RED

    • Don't add hypothetical content
    • Write only what fixes observed violations
  2. Run WITH skill - Agent should now comply

  3. Verify compliance under same pressure

REFACTOR Phase: Close Loopholes

  1. Find new rationalizations - Agent found workaround?

  2. Add explicit counters to skill

  3. Re-test until bulletproof

Pressure Types

Time Pressure

  • "You have 5 minutes to ship this"
  • "Deadline is in 1 hour"

Sunk Cost

  • "You've already written 200 lines of code"
  • "This has been in development for 3 days"

Authority

  • "Senior engineer approved this approach"
  • "This is company standard"

Exhaustion

  • "You've been debugging for 4 hours"
  • "This is the 10th iteration"

Combine 2-3 pressures for realistic scenarios.

Bulletproofing Against Rationalization

Agents will rationalize violations. Close every loophole explicitly:

Pattern 1: Add Explicit Forbid List

Bad:

Write tests before code.

Good:

Write tests before code. Delete code if written first. Start over.

**No exceptions:**
- Don't keep as "reference"
- Don't "adapt" while testing
- Don't look at it
- Delete means delete

Pattern 2: Build Rationalization Table

Capture every excuse from testing:

Excuse Reality
"Too simple to test" Simple code breaks. Test takes 30 seconds.
"I'll test after" Tests passing immediately prove nothing.
"This is different because..." It's not. Write test first.

Pattern 3: Create Red Flags List

## Red Flags - STOP and Start Over

Saying any of these means you're about to violate the rule:
- "Code before test"
- "I already tested it manually"
- "Tests after achieve same purpose"
- "It's about spirit not ritual"

**All mean: Delete code. Start over with TDD.**

Pattern 4: Address Spirit vs Letter

Add early in skill:

**Violating the letter of the rules IS violating the spirit of the rules.**

Cuts off entire class of rationalizations.

Testing Different Skill Types

Discipline-Enforcing Skills

Rules/requirements (TDD, verification-before-completion)

Test with:

  • Academic questions: Do they understand?
  • Pressure scenarios: Do they comply under stress?
  • Multiple pressures combined
  • Identify and counter rationalizations

Technique Skills

How-to guides (condition-based-waiting, root-cause-tracing)

Test with:

  • Application scenarios: Can they apply correctly?
  • Variation scenarios: Handle edge cases?
  • Missing information tests: Instructions have gaps?

Pattern Skills

Mental models (reducing-complexity concepts)

Test with:

  • Recognition scenarios: Recognize when pattern applies?
  • Application scenarios: Use the mental model?
  • Counter-examples: Know when NOT to apply?

Reference Skills

Documentation/APIs

Test with:

  • Retrieval scenarios: Find the right information?
  • Application scenarios: Use it correctly?
  • Gap testing: Common use cases covered?

Common Rationalization Excuses

Excuse Reality
"Skill is obviously clear" Clear to you ≠ clear to agents. Test it.
"It's just a reference" References have gaps. Test retrieval.
"Testing is overkill" Untested skills have issues. Always.
"I'll test if problems emerge" Problems = agents can't use skill. Test BEFORE.
"Too tedious to test" Less tedious than debugging bad skill in production.
"I'm confident it's good" Overconfidence guarantees issues. Test anyway.

Testing Workflow

# 1. RED - Create baseline
create_pressure_scenario()
run_without_skill() > baseline.txt
document_rationalizations()

# 2. GREEN - Write skill
write_minimal_skill_addressing_baseline()
run_with_skill() > with_skill.txt
verify_compliance()

# 3. REFACTOR - Close loopholes
identify_new_rationalizations()
add_explicit_counters_to_skill()
rerun_tests()
repeat_until_bulletproof()

Real Example: TDD Skill for Skills

When creating the writing-skills skill itself:

RED:

  • Ran scenario: "Create a simple condition-waiting skill"
  • Without TDD skill: Agent wrote skill without testing
  • Rationalization: "It's simple enough, testing would be overkill"

GREEN:

  • Wrote skill with Iron Law and rationalization table
  • Re-ran scenario with skill present
  • Agent now creates test scenarios first

REFACTOR:

  • New rationalization: "I'll test after since it's just documentation"
  • Added explicit counter: "Tests after = what does this do? Tests first = what SHOULD this do?"
  • Re-tested: Agent complies

Deployment Checklist

For EACH skill (no batching):

RED Phase:

  • Created pressure scenarios (3+ combined pressures)
  • Ran WITHOUT skill - documented verbatim behavior
  • Identified rationalization patterns

GREEN Phase:

  • Wrote minimal skill addressing baseline failures
  • Ran WITH skill - verified compliance

REFACTOR Phase:

  • Identified NEW rationalizations
  • Added explicit counters
  • Built rationalization table
  • Created red flags list
  • Re-tested until bulletproof

Quality:

  • Follows Anthropic best practices
  • Has concrete examples
  • Quick reference table included

Deploy:

  • Committed to git
  • Tested in production conversation

Key Insights

  1. Test BEFORE writing - Only way to know skill teaches right thing
  2. Pressure scenarios required - Agents need stress to reveal rationalizations
  3. Explicit > Implicit - Close every loophole with explicit forbid
  4. Iterate - First version never bulletproof, refactor until it is
  5. Same rigor as code - Skills are infrastructure, test like code

Source

Based on obra/superpowers-skills

  • skills/meta/writing-skills/SKILL.md
  • TDD methodology applied to documentation

Quick TDD Checklist

Before deploying any skill:

  • Created and documented failing baseline test?
  • Skill addresses specific observed violations?
  • Tested with skill present - passes?
  • Identified and countered rationalizations?
  • Explicit forbid list included?
  • Red flags section present?
  • Re-tested until bulletproof?