## Testing All Skill Types Different skill types need different test approaches: ### Discipline-Enforcing Skills (rules/requirements) **Examples:** TDD, hyperpowers:verification-before-completion, hyperpowers:designing-before-coding **Test with:** - Academic questions: Do they understand the rules? - Pressure scenarios: Do they comply under stress? - Multiple pressures combined: time + sunk cost + exhaustion - Identify rationalizations and add explicit counters **Success criteria:** Agent follows rule under maximum pressure ### Technique Skills (how-to guides) **Examples:** condition-based-waiting, hyperpowers:root-cause-tracing, defensive-programming **Test with:** - Application scenarios: Can they apply the technique correctly? - Variation scenarios: Do they handle edge cases? - Missing information tests: Do instructions have gaps? **Success criteria:** Agent successfully applies technique to new scenario ### Pattern Skills (mental models) **Examples:** reducing-complexity, information-hiding concepts **Test with:** - Recognition scenarios: Do they recognize when pattern applies? - Application scenarios: Can they use the mental model? - Counter-examples: Do they know when NOT to apply? **Success criteria:** Agent correctly identifies when/how to apply pattern ### Reference Skills (documentation/APIs) **Examples:** API documentation, command references, library guides **Test with:** - Retrieval scenarios: Can they find the right information? - Application scenarios: Can they use what they found correctly? - Gap testing: Are common use cases covered? **Success criteria:** Agent finds and correctly applies reference information ## Common Rationalizations for Skipping Testing | Excuse | Reality | |--------|---------| | "Skill is obviously clear" | Clear to you ≠ clear to other agents. Test it. | | "It's just a reference" | References can have gaps, unclear sections. Test retrieval. | | "Testing is overkill" | Untested skills have issues. Always. 15 min testing saves hours. | | "I'll test if problems emerge" | Problems = agents can't use skill. Test BEFORE deploying. | | "Too tedious to test" | Testing is less tedious than debugging bad skill in production. | | "I'm confident it's good" | Overconfidence guarantees issues. Test anyway. | | "Academic review is enough" | Reading ≠ using. Test application scenarios. | | "No time to test" | Deploying untested skill wastes more time fixing it later. | **All of these mean: Test before deploying. No exceptions.** ## Bulletproofing Skills Against Rationalization Skills that enforce discipline (like TDD) need to resist rationalization. Agents are smart and will find loopholes when under pressure. **Psychology note:** Understanding WHY persuasion techniques work helps you apply them systematically. See persuasion-principles.md for research foundation (Cialdini, 2021; Meincke et al., 2025) on authority, commitment, scarcity, social proof, and unity principles. ### Close Every Loophole Explicitly Don't just state the rule - forbid specific workarounds: ```markdown Write code before test? Delete it. ``` ```markdown Write code before test? Delete it. Start over. **No exceptions:** - Don't keep it as "reference" - Don't "adapt" it while writing tests - Don't look at it - Delete means delete ``` ### Address "Spirit vs Letter" Arguments Add foundational principle early: ```markdown **Violating the letter of the rules is violating the spirit of the rules.** ``` This cuts off entire class of "I'm following the spirit" rationalizations. ### Build Rationalization Table Capture rationalizations from baseline testing (see Testing section below). Every excuse agents make goes in the table: ```markdown | Excuse | Reality | |--------|---------| | "Too simple to test" | Simple code breaks. Test takes 30 seconds. | | "I'll test after" | Tests passing immediately prove nothing. | | "Tests after achieve same goals" | Tests-after = "what does this do?" Tests-first = "what should this do?" | ``` ### Create Red Flags List Make it easy for agents to self-check when rationalizing: ```markdown ## Red Flags - STOP and Start Over - Code before test - "I already manually tested it" - "Tests after achieve the same purpose" - "It's about spirit not ritual" - "This is different because..." **All of these mean: Delete code. Start over with TDD.** ``` ### Update CSO for Violation Symptoms Add to description: symptoms of when you're ABOUT to violate the rule: ```yaml description: use when implementing any feature or bugfix, before writing implementation code ``` ## RED-GREEN-REFACTOR for Skills Follow the TDD cycle: ### RED: Write Failing Test (Baseline) Run pressure scenario with subagent WITHOUT the skill. Document exact behavior: - What choices did they make? - What rationalizations did they use (verbatim)? - Which pressures triggered violations? This is "watch the test fail" - you must see what agents naturally do before writing the skill. ### GREEN: Write Minimal Skill Write skill that addresses those specific rationalizations. Don't add extra content for hypothetical cases. Run same scenarios WITH skill. Agent should now comply. ### REFACTOR: Close Loopholes Agent found new rationalization? Add explicit counter. Re-test until bulletproof. **REQUIRED SUB-SKILL:** Use superpowers:testing-skills-with-subagents for the complete testing methodology: - How to write pressure scenarios - Pressure types (time, sunk cost, authority, exhaustion) - Plugging holes systematically - Meta-testing techniques