168 lines
5.5 KiB
Markdown
168 lines
5.5 KiB
Markdown
## Testing All Skill Types
|
|
|
|
Different skill types need different test approaches:
|
|
|
|
### Discipline-Enforcing Skills (rules/requirements)
|
|
|
|
**Examples:** TDD, hyperpowers:verification-before-completion, hyperpowers:designing-before-coding
|
|
|
|
**Test with:**
|
|
- Academic questions: Do they understand the rules?
|
|
- Pressure scenarios: Do they comply under stress?
|
|
- Multiple pressures combined: time + sunk cost + exhaustion
|
|
- Identify rationalizations and add explicit counters
|
|
|
|
**Success criteria:** Agent follows rule under maximum pressure
|
|
|
|
### Technique Skills (how-to guides)
|
|
|
|
**Examples:** condition-based-waiting, hyperpowers:root-cause-tracing, defensive-programming
|
|
|
|
**Test with:**
|
|
- Application scenarios: Can they apply the technique correctly?
|
|
- Variation scenarios: Do they handle edge cases?
|
|
- Missing information tests: Do instructions have gaps?
|
|
|
|
**Success criteria:** Agent successfully applies technique to new scenario
|
|
|
|
### Pattern Skills (mental models)
|
|
|
|
**Examples:** reducing-complexity, information-hiding concepts
|
|
|
|
**Test with:**
|
|
- Recognition scenarios: Do they recognize when pattern applies?
|
|
- Application scenarios: Can they use the mental model?
|
|
- Counter-examples: Do they know when NOT to apply?
|
|
|
|
**Success criteria:** Agent correctly identifies when/how to apply pattern
|
|
|
|
### Reference Skills (documentation/APIs)
|
|
|
|
**Examples:** API documentation, command references, library guides
|
|
|
|
**Test with:**
|
|
- Retrieval scenarios: Can they find the right information?
|
|
- Application scenarios: Can they use what they found correctly?
|
|
- Gap testing: Are common use cases covered?
|
|
|
|
**Success criteria:** Agent finds and correctly applies reference information
|
|
|
|
## Common Rationalizations for Skipping Testing
|
|
|
|
| Excuse | Reality |
|
|
|--------|---------|
|
|
| "Skill is obviously clear" | Clear to you ≠ clear to other agents. Test it. |
|
|
| "It's just a reference" | References can have gaps, unclear sections. Test retrieval. |
|
|
| "Testing is overkill" | Untested skills have issues. Always. 15 min testing saves hours. |
|
|
| "I'll test if problems emerge" | Problems = agents can't use skill. Test BEFORE deploying. |
|
|
| "Too tedious to test" | Testing is less tedious than debugging bad skill in production. |
|
|
| "I'm confident it's good" | Overconfidence guarantees issues. Test anyway. |
|
|
| "Academic review is enough" | Reading ≠ using. Test application scenarios. |
|
|
| "No time to test" | Deploying untested skill wastes more time fixing it later. |
|
|
|
|
**All of these mean: Test before deploying. No exceptions.**
|
|
|
|
## Bulletproofing Skills Against Rationalization
|
|
|
|
Skills that enforce discipline (like TDD) need to resist rationalization. Agents are smart and will find loopholes when under pressure.
|
|
|
|
**Psychology note:** Understanding WHY persuasion techniques work helps you apply them systematically. See persuasion-principles.md for research foundation (Cialdini, 2021; Meincke et al., 2025) on authority, commitment, scarcity, social proof, and unity principles.
|
|
|
|
### Close Every Loophole Explicitly
|
|
|
|
Don't just state the rule - forbid specific workarounds:
|
|
|
|
<Bad>
|
|
```markdown
|
|
Write code before test? Delete it.
|
|
```
|
|
</Bad>
|
|
|
|
<Good>
|
|
```markdown
|
|
Write code before test? Delete it. Start over.
|
|
|
|
**No exceptions:**
|
|
- Don't keep it as "reference"
|
|
- Don't "adapt" it while writing tests
|
|
- Don't look at it
|
|
- Delete means delete
|
|
```
|
|
</Good>
|
|
|
|
### Address "Spirit vs Letter" Arguments
|
|
|
|
Add foundational principle early:
|
|
|
|
```markdown
|
|
**Violating the letter of the rules is violating the spirit of the rules.**
|
|
```
|
|
|
|
This cuts off entire class of "I'm following the spirit" rationalizations.
|
|
|
|
### Build Rationalization Table
|
|
|
|
Capture rationalizations from baseline testing (see Testing section below). Every excuse agents make goes in the table:
|
|
|
|
```markdown
|
|
| Excuse | Reality |
|
|
|--------|---------|
|
|
| "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
|
|
| "I'll test after" | Tests passing immediately prove nothing. |
|
|
| "Tests after achieve same goals" | Tests-after = "what does this do?" Tests-first = "what should this do?" |
|
|
```
|
|
|
|
### Create Red Flags List
|
|
|
|
Make it easy for agents to self-check when rationalizing:
|
|
|
|
```markdown
|
|
## Red Flags - STOP and Start Over
|
|
|
|
- Code before test
|
|
- "I already manually tested it"
|
|
- "Tests after achieve the same purpose"
|
|
- "It's about spirit not ritual"
|
|
- "This is different because..."
|
|
|
|
**All of these mean: Delete code. Start over with TDD.**
|
|
```
|
|
|
|
### Update CSO for Violation Symptoms
|
|
|
|
Add to description: symptoms of when you're ABOUT to violate the rule:
|
|
|
|
```yaml
|
|
description: use when implementing any feature or bugfix, before writing implementation code
|
|
```
|
|
|
|
## RED-GREEN-REFACTOR for Skills
|
|
|
|
Follow the TDD cycle:
|
|
|
|
### RED: Write Failing Test (Baseline)
|
|
|
|
Run pressure scenario with subagent WITHOUT the skill. Document exact behavior:
|
|
- What choices did they make?
|
|
- What rationalizations did they use (verbatim)?
|
|
- Which pressures triggered violations?
|
|
|
|
This is "watch the test fail" - you must see what agents naturally do before writing the skill.
|
|
|
|
### GREEN: Write Minimal Skill
|
|
|
|
Write skill that addresses those specific rationalizations. Don't add extra content for hypothetical cases.
|
|
|
|
Run same scenarios WITH skill. Agent should now comply.
|
|
|
|
### REFACTOR: Close Loopholes
|
|
|
|
Agent found new rationalization? Add explicit counter. Re-test until bulletproof.
|
|
|
|
**REQUIRED SUB-SKILL:** Use superpowers:testing-skills-with-subagents for the complete testing methodology:
|
|
- How to write pressure scenarios
|
|
- Pressure types (time, sunk cost, authority, exhaustion)
|
|
- Plugging holes systematically
|
|
- Meta-testing techniques
|
|
|