Files
2025-11-30 09:06:38 +08:00

5.5 KiB

Testing All Skill Types

Different skill types need different test approaches:

Discipline-Enforcing Skills (rules/requirements)

Examples: TDD, hyperpowers:verification-before-completion, hyperpowers:designing-before-coding

Test with:

  • Academic questions: Do they understand the rules?
  • Pressure scenarios: Do they comply under stress?
  • Multiple pressures combined: time + sunk cost + exhaustion
  • Identify rationalizations and add explicit counters

Success criteria: Agent follows rule under maximum pressure

Technique Skills (how-to guides)

Examples: condition-based-waiting, hyperpowers:root-cause-tracing, defensive-programming

Test with:

  • Application scenarios: Can they apply the technique correctly?
  • Variation scenarios: Do they handle edge cases?
  • Missing information tests: Do instructions have gaps?

Success criteria: Agent successfully applies technique to new scenario

Pattern Skills (mental models)

Examples: reducing-complexity, information-hiding concepts

Test with:

  • Recognition scenarios: Do they recognize when pattern applies?
  • Application scenarios: Can they use the mental model?
  • Counter-examples: Do they know when NOT to apply?

Success criteria: Agent correctly identifies when/how to apply pattern

Reference Skills (documentation/APIs)

Examples: API documentation, command references, library guides

Test with:

  • Retrieval scenarios: Can they find the right information?
  • Application scenarios: Can they use what they found correctly?
  • Gap testing: Are common use cases covered?

Success criteria: Agent finds and correctly applies reference information

Common Rationalizations for Skipping Testing

Excuse Reality
"Skill is obviously clear" Clear to you ≠ clear to other agents. Test it.
"It's just a reference" References can have gaps, unclear sections. Test retrieval.
"Testing is overkill" Untested skills have issues. Always. 15 min testing saves hours.
"I'll test if problems emerge" Problems = agents can't use skill. Test BEFORE deploying.
"Too tedious to test" Testing is less tedious than debugging bad skill in production.
"I'm confident it's good" Overconfidence guarantees issues. Test anyway.
"Academic review is enough" Reading ≠ using. Test application scenarios.
"No time to test" Deploying untested skill wastes more time fixing it later.

All of these mean: Test before deploying. No exceptions.

Bulletproofing Skills Against Rationalization

Skills that enforce discipline (like TDD) need to resist rationalization. Agents are smart and will find loopholes when under pressure.

Psychology note: Understanding WHY persuasion techniques work helps you apply them systematically. See persuasion-principles.md for research foundation (Cialdini, 2021; Meincke et al., 2025) on authority, commitment, scarcity, social proof, and unity principles.

Close Every Loophole Explicitly

Don't just state the rule - forbid specific workarounds:

```markdown Write code before test? Delete it. ``` ```markdown Write code before test? Delete it. Start over.

No exceptions:

  • Don't keep it as "reference"
  • Don't "adapt" it while writing tests
  • Don't look at it
  • Delete means delete
</Good>

### Address "Spirit vs Letter" Arguments

Add foundational principle early:

```markdown
**Violating the letter of the rules is violating the spirit of the rules.**

This cuts off entire class of "I'm following the spirit" rationalizations.

Build Rationalization Table

Capture rationalizations from baseline testing (see Testing section below). Every excuse agents make goes in the table:

| Excuse | Reality |
|--------|---------|
| "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
| "I'll test after" | Tests passing immediately prove nothing. |
| "Tests after achieve same goals" | Tests-after = "what does this do?" Tests-first = "what should this do?" |

Create Red Flags List

Make it easy for agents to self-check when rationalizing:

## Red Flags - STOP and Start Over

- Code before test
- "I already manually tested it"
- "Tests after achieve the same purpose"
- "It's about spirit not ritual"
- "This is different because..."

**All of these mean: Delete code. Start over with TDD.**

Update CSO for Violation Symptoms

Add to description: symptoms of when you're ABOUT to violate the rule:

description: use when implementing any feature or bugfix, before writing implementation code

RED-GREEN-REFACTOR for Skills

Follow the TDD cycle:

RED: Write Failing Test (Baseline)

Run pressure scenario with subagent WITHOUT the skill. Document exact behavior:

  • What choices did they make?
  • What rationalizations did they use (verbatim)?
  • Which pressures triggered violations?

This is "watch the test fail" - you must see what agents naturally do before writing the skill.

GREEN: Write Minimal Skill

Write skill that addresses those specific rationalizations. Don't add extra content for hypothetical cases.

Run same scenarios WITH skill. Agent should now comply.

REFACTOR: Close Loopholes

Agent found new rationalization? Add explicit counter. Re-test until bulletproof.

REQUIRED SUB-SKILL: Use superpowers:testing-skills-with-subagents for the complete testing methodology:

  • How to write pressure scenarios
  • Pressure types (time, sunk cost, authority, exhaustion)
  • Plugging holes systematically
  • Meta-testing techniques