Files
2025-11-30 08:59:38 +08:00

9.5 KiB

Testing Skill Quality

Purpose: Pass 2 - Run gauntlet tests on each skill using subagents to identify issues requiring fixes.

Core Principle

Behavioral testing, NOT syntactic validation.

Skills are process documentation. Test if they guide agents correctly, not if they parse correctly.

What We're Testing

Effectiveness questions:

  • Does the skill actually guide agents correctly?
  • Do agents follow the skill under pressure?
  • Does the skill handle edge cases?
  • Are there gaps in guidance that leave agents stranded?

What we're NOT testing:

  • Syntax (markdown parsing, code syntax) - syntactic, not behavioral
  • Coverage (already done in Pass 1) - structural, not behavioral
  • Quality benchmarking (comparing to other skills) - comparative, not behavioral

Gauntlet Design

Priority: A → C → B

A. Pressure Scenarios (Catch Rationalizations)

Purpose: Test if skill holds up when agents want to skip it.

Pressure types:

1. Time pressure:

  • "This is urgent, we need it done quickly"
  • "Just get it working, we can improve it later"
  • "The deadline is in an hour"

2. Simplicity temptation:

  • "This seems too simple to need [skill pattern]"
  • "The example is straightforward, no need to overthink"
  • "This is a trivial case"

3. Overkill perception:

  • "The skill is designed for complex cases, this is basic"
  • "We don't need the full process for this small change"
  • "That's way more than necessary"

4. Sunk cost:

  • "I already wrote most of the code"
  • "We've invested time in this approach"
  • "Just need to finish this last part"

Design approach:

  • Combine 2-3 pressures for maximum effect
  • Example: Time pressure + simplicity + sunk cost
  • Watch for rationalizations (verbatim documentation critical)

C. Adversarial Edge Cases (Test Robustness)

Purpose: Test if skill provides guidance for corner cases.

Edge case types:

1. Principle conflicts:

  • When skill's guidelines conflict with each other
  • Example: "DRY vs. explicit" or "test-first vs. prototyping"
  • Does skill help resolve conflict?

2. Naive application failures:

  • Cases where following skill literally doesn't work
  • Example: TDD for exploratory research code
  • Does skill explain when/how to adapt?

3. Missing information:

  • Scenarios requiring knowledge skill doesn't provide
  • Does skill reference other resources?
  • Does it leave agent completely stuck?

4. Tool limitations:

  • When environment doesn't support skill's approach
  • Example: No test framework available
  • Does skill have fallback guidance?

Design approach:

  • Identify skill's core principles
  • Find situations where they conflict or fail
  • Test if skill handles gracefully

B. Real-World Complexity (Validate Utility)

Purpose: Test if skill guides toward best practices in realistic scenarios.

Complexity types:

1. Messy requirements:

  • Unclear specifications
  • Conflicting stakeholder needs
  • Evolving requirements mid-task

2. Multiple valid approaches:

  • Several solutions, all reasonable
  • Trade-offs between options
  • Does skill help choose?

3. Integration constraints:

  • Existing codebase patterns
  • Team conventions
  • Technical debt

4. Incomplete information:

  • Missing context
  • Unknown dependencies
  • Undocumented behavior

Design approach:

  • Use realistic scenarios from the domain
  • Include ambiguity and messiness
  • Test if skill provides actionable guidance

Testing Process (Per Skill)

D - Iterative Hardening:

1. Design Challenging Scenario

Pick from gauntlet categories (prioritize A → C → B):

For discipline-enforcing skills (TDD, verification-before-completion):

  • Focus on A (pressure) scenarios
  • Combine multiple pressures
  • Test rationalization resistance

For technique skills (condition-based-waiting, root-cause-tracing):

  • Focus on C (edge cases) and B (real-world)
  • Test application correctness
  • Test gap identification

For pattern skills (reducing-complexity, information-hiding):

  • Focus on C (edge cases) and B (real-world)
  • Test recognition and application
  • Test when NOT to apply

For reference skills (API docs, command references):

  • Focus on B (real-world)
  • Test information retrieval
  • Test application of retrieved info

2. Run Subagent with Current Skill

Critical: Use the Task tool to dispatch subagent.

Provide to subagent:

  • The scenario (task description)
  • Access to the skill being tested
  • Any necessary context (codebase, tools)

What NOT to provide:

  • Meta-testing instructions (don't tell them they're being tested)
  • Expected behavior (let them apply skill naturally)
  • Hints about what you're looking for

3. Observe and Document

Watch for:

Compliance:

  • Did agent follow the skill?
  • Did they reference it explicitly?
  • Did they apply patterns correctly?

Rationalizations (verbatim):

  • Exact words used to skip steps
  • Justifications for shortcuts
  • "Spirit vs. letter" arguments

Failure modes:

  • Where did skill guidance fail?
  • Where was agent left without guidance?
  • Where did naive application break?

Edge case handling:

  • Did skill provide guidance for corner cases?
  • Did agent get stuck?
  • Did they improvise (potentially incorrectly)?

4. Assess Result

Pass criteria:

  • Agent followed skill correctly
  • Skill provided sufficient guidance
  • No significant rationalizations
  • Edge cases handled appropriately

Fix needed criteria:

  • Agent skipped skill steps (with rationalization)
  • Skill had gaps leaving agent stuck
  • Edge cases not covered
  • Naive application failed

5. Document Issues

If fix needed, document specifically:

Issue category:

  • Rationalization vulnerability (A)
  • Edge case gap (C)
  • Real-world guidance gap (B)
  • Missing anti-pattern warning
  • Unclear instructions
  • Missing cross-reference

Priority:

  • Critical - Skill fails basic use cases, agents skip it consistently
  • Major - Edge cases fail, significant gaps in guidance
  • Minor - Clarity improvements, additional examples needed

Specific fixes needed:

  • "Add explicit counter for rationalization: [quote]"
  • "Add guidance for edge case: [description]"
  • "Add example for scenario: [description]"
  • "Clarify instruction: [which section]"

Testing Multiple Skills

Strategy:

Priority order:

  1. Router skills first (affects all specialist discovery)
  2. Foundational skills (prerequisites for others)
  3. Core technique skills (most frequently used)
  4. Advanced skills (expert-level)

Batch approach:

  • Test 3-5 skills at a time
  • Document results before moving to next batch
  • Allows pattern recognition across skills

Efficiency:

  • Skills that passed in previous maintenance cycles: Spot-check only
  • New skills or significantly changed: Full gauntlet
  • Minor edits: Targeted testing of changed sections

Output Format

Generate per-skill report:

# Quality Testing Results: [pack-name]

## Summary

- Total skills tested: [count]
- Passed: [count]
- Fix needed: [count]
  - Critical: [count]
  - Major: [count]
  - Minor: [count]

## Detailed Results

### [Skill 1 Name]

**Result:** [Pass / Fix needed]

[If Fix needed]

**Priority:** [Critical / Major / Minor]

**Test scenario used:** [Brief description]

**Issues identified:**

1. **Issue:** [Description]
   - **Category:** [Rationalization / Edge case / Real-world gap / etc.]
   - **Evidence:** "[Verbatim quote from subagent if applicable]"
   - **Fix needed:** [Specific action]

2. **Issue:** [Description]
   [Same format]

**Test transcript:** [Link or summary of subagent behavior]

---

### [Skill 2 Name]

**Result:** Pass

**Test scenario used:** [Brief description]

**Notes:** Skill performed well, no issues identified.

---

[Repeat for all skills]

Common Rationalizations (Meta-Testing)

When YOU are doing the testing, watch for these rationalizations:

Excuse Reality
"Skill looks good, no need to test" Looking ≠ testing. Run gauntlet.
"I'll just check the syntax" Syntactic validation ≠ behavioral. Use subagents.
"Testing is overkill for small changes" Small changes can break guidance. Test anyway.
"I'm confident this works" Confidence ≠ validation. Test behavior.
"Quality benchmarking is enough" Comparison ≠ effectiveness. Test with scenarios.

If you catch yourself thinking these → STOP. Run gauntlet with subagents.

Philosophy

D as gauntlet + B for fixes:

  • D (iterative hardening): Run challenging scenarios to identify issues
  • B (targeted fixes): Fix specific identified problems

If skill passes gauntlet → No changes needed.

The LLM is both author and judge of skill fitness. Trust the testing process.

Proceeding to Next Stage

After testing all skills:

  • Compile complete test report
  • Proceed to Pass 3 (coherence validation)
  • Test results will inform implementation fixes in Stage 4

Anti-Patterns

Anti-Pattern Why Bad Instead
Syntactic validation only Doesn't test if skill actually works Run behavioral tests with subagents
Self-assessment You can't objectively test your own work Dispatch subagents for testing
"Looks good" review Visual inspection ≠ behavioral testing Run gauntlet scenarios
Skipping pressure tests Miss rationalization vulnerabilities Use A-priority pressure scenarios
Generic test scenarios Don't reveal real issues Use domain-specific, realistic scenarios
Testing without documenting Can't track patterns or close loops Document verbatim rationalizations