Files
gh-tachyon-beep-skillpacks-…/skills/using-skillpack-maintenance/testing-skill-quality.md
2025-11-30 08:59:38 +08:00

341 lines
9.5 KiB
Markdown

# Testing Skill Quality
**Purpose:** Pass 2 - Run gauntlet tests on each skill using subagents to identify issues requiring fixes.
## Core Principle
**Behavioral testing, NOT syntactic validation.**
Skills are process documentation. Test if they guide agents correctly, not if they parse correctly.
## What We're Testing
**Effectiveness questions:**
- Does the skill actually guide agents correctly?
- Do agents follow the skill under pressure?
- Does the skill handle edge cases?
- Are there gaps in guidance that leave agents stranded?
**What we're NOT testing:**
- Syntax (markdown parsing, code syntax) - syntactic, not behavioral
- Coverage (already done in Pass 1) - structural, not behavioral
- Quality benchmarking (comparing to other skills) - comparative, not behavioral
## Gauntlet Design
**Priority: A → C → B**
### A. Pressure Scenarios (Catch Rationalizations)
**Purpose:** Test if skill holds up when agents want to skip it.
**Pressure types:**
**1. Time pressure:**
- "This is urgent, we need it done quickly"
- "Just get it working, we can improve it later"
- "The deadline is in an hour"
**2. Simplicity temptation:**
- "This seems too simple to need [skill pattern]"
- "The example is straightforward, no need to overthink"
- "This is a trivial case"
**3. Overkill perception:**
- "The skill is designed for complex cases, this is basic"
- "We don't need the full process for this small change"
- "That's way more than necessary"
**4. Sunk cost:**
- "I already wrote most of the code"
- "We've invested time in this approach"
- "Just need to finish this last part"
**Design approach:**
- Combine 2-3 pressures for maximum effect
- Example: Time pressure + simplicity + sunk cost
- Watch for rationalizations (verbatim documentation critical)
### C. Adversarial Edge Cases (Test Robustness)
**Purpose:** Test if skill provides guidance for corner cases.
**Edge case types:**
**1. Principle conflicts:**
- When skill's guidelines conflict with each other
- Example: "DRY vs. explicit" or "test-first vs. prototyping"
- Does skill help resolve conflict?
**2. Naive application failures:**
- Cases where following skill literally doesn't work
- Example: TDD for exploratory research code
- Does skill explain when/how to adapt?
**3. Missing information:**
- Scenarios requiring knowledge skill doesn't provide
- Does skill reference other resources?
- Does it leave agent completely stuck?
**4. Tool limitations:**
- When environment doesn't support skill's approach
- Example: No test framework available
- Does skill have fallback guidance?
**Design approach:**
- Identify skill's core principles
- Find situations where they conflict or fail
- Test if skill handles gracefully
### B. Real-World Complexity (Validate Utility)
**Purpose:** Test if skill guides toward best practices in realistic scenarios.
**Complexity types:**
**1. Messy requirements:**
- Unclear specifications
- Conflicting stakeholder needs
- Evolving requirements mid-task
**2. Multiple valid approaches:**
- Several solutions, all reasonable
- Trade-offs between options
- Does skill help choose?
**3. Integration constraints:**
- Existing codebase patterns
- Team conventions
- Technical debt
**4. Incomplete information:**
- Missing context
- Unknown dependencies
- Undocumented behavior
**Design approach:**
- Use realistic scenarios from the domain
- Include ambiguity and messiness
- Test if skill provides actionable guidance
## Testing Process (Per Skill)
**D - Iterative Hardening:**
### 1. Design Challenging Scenario
Pick from gauntlet categories (prioritize A → C → B):
**For discipline-enforcing skills** (TDD, verification-before-completion):
- Focus on **A (pressure)** scenarios
- Combine multiple pressures
- Test rationalization resistance
**For technique skills** (condition-based-waiting, root-cause-tracing):
- Focus on **C (edge cases)** and **B (real-world)**
- Test application correctness
- Test gap identification
**For pattern skills** (reducing-complexity, information-hiding):
- Focus on **C (edge cases)** and **B (real-world)**
- Test recognition and application
- Test when NOT to apply
**For reference skills** (API docs, command references):
- Focus on **B (real-world)**
- Test information retrieval
- Test application of retrieved info
### 2. Run Subagent with Current Skill
**Critical:** Use the Task tool to dispatch subagent.
**Provide to subagent:**
- The scenario (task description)
- Access to the skill being tested
- Any necessary context (codebase, tools)
**What NOT to provide:**
- Meta-testing instructions (don't tell them they're being tested)
- Expected behavior (let them apply skill naturally)
- Hints about what you're looking for
### 3. Observe and Document
**Watch for:**
**Compliance:**
- Did agent follow the skill?
- Did they reference it explicitly?
- Did they apply patterns correctly?
**Rationalizations (verbatim):**
- Exact words used to skip steps
- Justifications for shortcuts
- "Spirit vs. letter" arguments
**Failure modes:**
- Where did skill guidance fail?
- Where was agent left without guidance?
- Where did naive application break?
**Edge case handling:**
- Did skill provide guidance for corner cases?
- Did agent get stuck?
- Did they improvise (potentially incorrectly)?
### 4. Assess Result
**Pass criteria:**
- Agent followed skill correctly
- Skill provided sufficient guidance
- No significant rationalizations
- Edge cases handled appropriately
**Fix needed criteria:**
- Agent skipped skill steps (with rationalization)
- Skill had gaps leaving agent stuck
- Edge cases not covered
- Naive application failed
### 5. Document Issues
**If fix needed, document specifically:**
**Issue category:**
- Rationalization vulnerability (A)
- Edge case gap (C)
- Real-world guidance gap (B)
- Missing anti-pattern warning
- Unclear instructions
- Missing cross-reference
**Priority:**
- **Critical** - Skill fails basic use cases, agents skip it consistently
- **Major** - Edge cases fail, significant gaps in guidance
- **Minor** - Clarity improvements, additional examples needed
**Specific fixes needed:**
- "Add explicit counter for rationalization: [quote]"
- "Add guidance for edge case: [description]"
- "Add example for scenario: [description]"
- "Clarify instruction: [which section]"
## Testing Multiple Skills
**Strategy:**
**Priority order:**
1. Router skills first (affects all specialist discovery)
2. Foundational skills (prerequisites for others)
3. Core technique skills (most frequently used)
4. Advanced skills (expert-level)
**Batch approach:**
- Test 3-5 skills at a time
- Document results before moving to next batch
- Allows pattern recognition across skills
**Efficiency:**
- Skills that passed in previous maintenance cycles: Spot-check only
- New skills or significantly changed: Full gauntlet
- Minor edits: Targeted testing of changed sections
## Output Format
Generate per-skill report:
```
# Quality Testing Results: [pack-name]
## Summary
- Total skills tested: [count]
- Passed: [count]
- Fix needed: [count]
- Critical: [count]
- Major: [count]
- Minor: [count]
## Detailed Results
### [Skill 1 Name]
**Result:** [Pass / Fix needed]
[If Fix needed]
**Priority:** [Critical / Major / Minor]
**Test scenario used:** [Brief description]
**Issues identified:**
1. **Issue:** [Description]
- **Category:** [Rationalization / Edge case / Real-world gap / etc.]
- **Evidence:** "[Verbatim quote from subagent if applicable]"
- **Fix needed:** [Specific action]
2. **Issue:** [Description]
[Same format]
**Test transcript:** [Link or summary of subagent behavior]
---
### [Skill 2 Name]
**Result:** Pass
**Test scenario used:** [Brief description]
**Notes:** Skill performed well, no issues identified.
---
[Repeat for all skills]
```
## Common Rationalizations (Meta-Testing)
When YOU are doing the testing, watch for these rationalizations:
| Excuse | Reality |
|--------|---------|
| "Skill looks good, no need to test" | Looking ≠ testing. Run gauntlet. |
| "I'll just check the syntax" | Syntactic validation ≠ behavioral. Use subagents. |
| "Testing is overkill for small changes" | Small changes can break guidance. Test anyway. |
| "I'm confident this works" | Confidence ≠ validation. Test behavior. |
| "Quality benchmarking is enough" | Comparison ≠ effectiveness. Test with scenarios. |
**If you catch yourself thinking these → STOP. Run gauntlet with subagents.**
## Philosophy
**D as gauntlet + B for fixes:**
- **D (iterative hardening):** Run challenging scenarios to identify issues
- **B (targeted fixes):** Fix specific identified problems
If skill passes gauntlet → No changes needed.
The LLM is both author and judge of skill fitness. Trust the testing process.
## Proceeding to Next Stage
After testing all skills:
- Compile complete test report
- Proceed to Pass 3 (coherence validation)
- Test results will inform implementation fixes in Stage 4
## Anti-Patterns
| Anti-Pattern | Why Bad | Instead |
|--------------|---------|---------|
| Syntactic validation only | Doesn't test if skill actually works | Run behavioral tests with subagents |
| Self-assessment | You can't objectively test your own work | Dispatch subagents for testing |
| "Looks good" review | Visual inspection ≠ behavioral testing | Run gauntlet scenarios |
| Skipping pressure tests | Miss rationalization vulnerabilities | Use A-priority pressure scenarios |
| Generic test scenarios | Don't reveal real issues | Use domain-specific, realistic scenarios |
| Testing without documenting | Can't track patterns or close loops | Document verbatim rationalizations |