zhongwei/gh-tenequm-claude-plugins-skill-factory

Fork 0

Files

Zhongwei Li 650cfdbc05 Initial commit

2025-11-30 09:01:20 +08:00

7.4 KiB

Raw Permalink Blame History

Test-Driven Documentation (TDD for Skills)

Extracted from obra/superpowers-skills with attribution.

Core Concept

Writing skills IS Test-Driven Development applied to process documentation.

Test skills with subagents BEFORE writing, iterate until bulletproof.

TDD Mapping

TDD Concept	Skill Creation
Test case	Pressure scenario with subagent
Production code	Skill document (SKILL.md)
Test fails (RED)	Agent violates rule without skill
Test passes (GREEN)	Agent complies with skill present
Refactor	Close loopholes while maintaining compliance

The Iron Law

NO SKILL WITHOUT A FAILING TEST FIRST

No exceptions:

Not for "simple additions"
Not for "just documentation updates"
Delete untested changes and start over

RED-GREEN-REFACTOR Cycle

RED Phase: Write Failing Test

Create pressure scenarios with subagent
- Combine multiple pressures (time + sunk cost + authority)
- Make it HARD to follow the rule
Run WITHOUT skill - Document exact behavior:
- What choices did agent make?
- What rationalizations used? (verbatim quotes)
- Which pressures triggered violations?
Identify patterns in failures

GREEN Phase: Write Minimal Skill

Address specific baseline failures found in RED
- Don't add hypothetical content
- Write only what fixes observed violations
Run WITH skill - Agent should now comply
Verify compliance under same pressure

REFACTOR Phase: Close Loopholes

Find new rationalizations - Agent found workaround?
Add explicit counters to skill
Re-test until bulletproof

Pressure Types

Time Pressure

"You have 5 minutes to ship this"
"Deadline is in 1 hour"

Sunk Cost

"You've already written 200 lines of code"
"This has been in development for 3 days"

Authority

"Senior engineer approved this approach"
"This is company standard"

Exhaustion

"You've been debugging for 4 hours"
"This is the 10th iteration"

Combine 2-3 pressures for realistic scenarios.

Bulletproofing Against Rationalization

Agents will rationalize violations. Close every loophole explicitly:

Pattern 1: Add Explicit Forbid List

❌ Bad:

Write tests before code.

✅ Good:

Write tests before code. Delete code if written first. Start over.

**No exceptions:**
- Don't keep as "reference"
- Don't "adapt" while testing
- Don't look at it
- Delete means delete

Pattern 2: Build Rationalization Table

Capture every excuse from testing:

Excuse	Reality
"Too simple to test"	Simple code breaks. Test takes 30 seconds.
"I'll test after"	Tests passing immediately prove nothing.
"This is different because..."	It's not. Write test first.

Pattern 3: Create Red Flags List

## Red Flags - STOP and Start Over

Saying any of these means you're about to violate the rule:
- "Code before test"
- "I already tested it manually"
- "Tests after achieve same purpose"
- "It's about spirit not ritual"

**All mean: Delete code. Start over with TDD.**

Pattern 4: Address Spirit vs Letter

Add early in skill:

**Violating the letter of the rules IS violating the spirit of the rules.**

Cuts off entire class of rationalizations.

Testing Different Skill Types

Discipline-Enforcing Skills

Rules/requirements (TDD, verification-before-completion)

Test with:

Academic questions: Do they understand?
Pressure scenarios: Do they comply under stress?
Multiple pressures combined
Identify and counter rationalizations

Technique Skills

How-to guides (condition-based-waiting, root-cause-tracing)

Test with:

Application scenarios: Can they apply correctly?
Variation scenarios: Handle edge cases?
Missing information tests: Instructions have gaps?

Pattern Skills

Mental models (reducing-complexity concepts)

Test with:

Recognition scenarios: Recognize when pattern applies?
Application scenarios: Use the mental model?
Counter-examples: Know when NOT to apply?

Reference Skills

Documentation/APIs

Test with:

Retrieval scenarios: Find the right information?
Application scenarios: Use it correctly?
Gap testing: Common use cases covered?

Common Rationalization Excuses

Excuse	Reality
"Skill is obviously clear"	Clear to you ≠ clear to agents. Test it.
"It's just a reference"	References have gaps. Test retrieval.
"Testing is overkill"	Untested skills have issues. Always.
"I'll test if problems emerge"	Problems = agents can't use skill. Test BEFORE.
"Too tedious to test"	Less tedious than debugging bad skill in production.
"I'm confident it's good"	Overconfidence guarantees issues. Test anyway.

Testing Workflow

# 1. RED - Create baseline
create_pressure_scenario()
run_without_skill() > baseline.txt
document_rationalizations()

# 2. GREEN - Write skill
write_minimal_skill_addressing_baseline()
run_with_skill() > with_skill.txt
verify_compliance()

# 3. REFACTOR - Close loopholes
identify_new_rationalizations()
add_explicit_counters_to_skill()
rerun_tests()
repeat_until_bulletproof()

Real Example: TDD Skill for Skills

When creating the writing-skills skill itself:

RED:

Ran scenario: "Create a simple condition-waiting skill"
Without TDD skill: Agent wrote skill without testing
Rationalization: "It's simple enough, testing would be overkill"

GREEN:

Wrote skill with Iron Law and rationalization table
Re-ran scenario with skill present
Agent now creates test scenarios first

REFACTOR:

New rationalization: "I'll test after since it's just documentation"
Added explicit counter: "Tests after = what does this do? Tests first = what SHOULD this do?"
Re-tested: Agent complies

Deployment Checklist

For EACH skill (no batching):

RED Phase:

Created pressure scenarios (3+ combined pressures)
Ran WITHOUT skill - documented verbatim behavior
Identified rationalization patterns

GREEN Phase:

Wrote minimal skill addressing baseline failures
Ran WITH skill - verified compliance

REFACTOR Phase:

Identified NEW rationalizations
Added explicit counters
Built rationalization table
Created red flags list
Re-tested until bulletproof

Quality:

Follows Anthropic best practices
Has concrete examples
Quick reference table included

Deploy:

Committed to git
Tested in production conversation

Key Insights

Test BEFORE writing - Only way to know skill teaches right thing
Pressure scenarios required - Agents need stress to reveal rationalizations
Explicit > Implicit - Close every loophole with explicit forbid
Iterate - First version never bulletproof, refactor until it is
Same rigor as code - Skills are infrastructure, test like code

Source

Based on obra/superpowers-skills

skills/meta/writing-skills/SKILL.md
TDD methodology applied to documentation

Quick TDD Checklist

Before deploying any skill:

Created and documented failing baseline test?
Skill addresses specific observed violations?
Tested with skill present - passes?
Identified and countered rationalizations?
Explicit forbid list included?
Red flags section present?
Re-tested until bulletproof?

7.4 KiB Raw Permalink Blame History

Test-Driven Documentation (TDD for Skills)

Core Concept

TDD Mapping

The Iron Law

RED-GREEN-REFACTOR Cycle

RED Phase: Write Failing Test

GREEN Phase: Write Minimal Skill

REFACTOR Phase: Close Loopholes

Pressure Types

Time Pressure

Sunk Cost

Authority

Exhaustion

Bulletproofing Against Rationalization

Pattern 1: Add Explicit Forbid List

Pattern 2: Build Rationalization Table

Pattern 3: Create Red Flags List

Pattern 4: Address Spirit vs Letter

Testing Different Skill Types

Discipline-Enforcing Skills

Technique Skills

Pattern Skills

Reference Skills

Common Rationalization Excuses

Testing Workflow

Real Example: TDD Skill for Skills

Deployment Checklist

Key Insights

Source

Quick TDD Checklist

7.4 KiB

Raw Permalink Blame History