zhongwei/gh-tachyon-beep-skillpacks-plugins-meta-skillpack-maintenance

Fork 0

Files

Zhongwei Li 57790ee711 Initial commit

2025-11-30 08:59:38 +08:00

9.5 KiB

Raw Permalink Blame History

Testing Skill Quality

Purpose: Pass 2 - Run gauntlet tests on each skill using subagents to identify issues requiring fixes.

Core Principle

Behavioral testing, NOT syntactic validation.

Skills are process documentation. Test if they guide agents correctly, not if they parse correctly.

What We're Testing

Effectiveness questions:

Does the skill actually guide agents correctly?
Do agents follow the skill under pressure?
Does the skill handle edge cases?
Are there gaps in guidance that leave agents stranded?

What we're NOT testing:

Syntax (markdown parsing, code syntax) - syntactic, not behavioral
Coverage (already done in Pass 1) - structural, not behavioral
Quality benchmarking (comparing to other skills) - comparative, not behavioral

Gauntlet Design

Priority: A → C → B

A. Pressure Scenarios (Catch Rationalizations)

Purpose: Test if skill holds up when agents want to skip it.

Pressure types:

1. Time pressure:

"This is urgent, we need it done quickly"
"Just get it working, we can improve it later"
"The deadline is in an hour"

2. Simplicity temptation:

"This seems too simple to need [skill pattern]"
"The example is straightforward, no need to overthink"
"This is a trivial case"

3. Overkill perception:

"The skill is designed for complex cases, this is basic"
"We don't need the full process for this small change"
"That's way more than necessary"

4. Sunk cost:

"I already wrote most of the code"
"We've invested time in this approach"
"Just need to finish this last part"

Design approach:

Combine 2-3 pressures for maximum effect
Example: Time pressure + simplicity + sunk cost
Watch for rationalizations (verbatim documentation critical)

C. Adversarial Edge Cases (Test Robustness)

Purpose: Test if skill provides guidance for corner cases.

Edge case types:

1. Principle conflicts:

When skill's guidelines conflict with each other
Example: "DRY vs. explicit" or "test-first vs. prototyping"
Does skill help resolve conflict?

2. Naive application failures:

Cases where following skill literally doesn't work
Example: TDD for exploratory research code
Does skill explain when/how to adapt?

3. Missing information:

Scenarios requiring knowledge skill doesn't provide
Does skill reference other resources?
Does it leave agent completely stuck?

4. Tool limitations:

When environment doesn't support skill's approach
Example: No test framework available
Does skill have fallback guidance?

Design approach:

Identify skill's core principles
Find situations where they conflict or fail
Test if skill handles gracefully

B. Real-World Complexity (Validate Utility)

Purpose: Test if skill guides toward best practices in realistic scenarios.

Complexity types:

1. Messy requirements:

Unclear specifications
Conflicting stakeholder needs
Evolving requirements mid-task

2. Multiple valid approaches:

Several solutions, all reasonable
Trade-offs between options
Does skill help choose?

3. Integration constraints:

Existing codebase patterns
Team conventions
Technical debt

4. Incomplete information:

Missing context
Unknown dependencies
Undocumented behavior

Design approach:

Use realistic scenarios from the domain
Include ambiguity and messiness
Test if skill provides actionable guidance

Testing Process (Per Skill)

D - Iterative Hardening:

1. Design Challenging Scenario

Pick from gauntlet categories (prioritize A → C → B):

For discipline-enforcing skills (TDD, verification-before-completion):

Focus on A (pressure) scenarios
Combine multiple pressures
Test rationalization resistance

For technique skills (condition-based-waiting, root-cause-tracing):

Focus on C (edge cases) and B (real-world)
Test application correctness
Test gap identification

For pattern skills (reducing-complexity, information-hiding):

Focus on C (edge cases) and B (real-world)
Test recognition and application
Test when NOT to apply

For reference skills (API docs, command references):

Focus on B (real-world)
Test information retrieval
Test application of retrieved info

2. Run Subagent with Current Skill

Critical: Use the Task tool to dispatch subagent.

Provide to subagent:

The scenario (task description)
Access to the skill being tested
Any necessary context (codebase, tools)

What NOT to provide:

Meta-testing instructions (don't tell them they're being tested)
Expected behavior (let them apply skill naturally)
Hints about what you're looking for

3. Observe and Document

Watch for:

Compliance:

Did agent follow the skill?
Did they reference it explicitly?
Did they apply patterns correctly?

Rationalizations (verbatim):

Exact words used to skip steps
Justifications for shortcuts
"Spirit vs. letter" arguments

Failure modes:

Where did skill guidance fail?
Where was agent left without guidance?
Where did naive application break?

Edge case handling:

Did skill provide guidance for corner cases?
Did agent get stuck?
Did they improvise (potentially incorrectly)?

4. Assess Result

Pass criteria:

Agent followed skill correctly
Skill provided sufficient guidance
No significant rationalizations
Edge cases handled appropriately

Fix needed criteria:

Agent skipped skill steps (with rationalization)
Skill had gaps leaving agent stuck
Edge cases not covered
Naive application failed

5. Document Issues

If fix needed, document specifically:

Issue category:

Rationalization vulnerability (A)
Edge case gap (C)
Real-world guidance gap (B)
Missing anti-pattern warning
Unclear instructions
Missing cross-reference

Priority:

Critical - Skill fails basic use cases, agents skip it consistently
Major - Edge cases fail, significant gaps in guidance
Minor - Clarity improvements, additional examples needed

Specific fixes needed:

"Add explicit counter for rationalization: [quote]"
"Add guidance for edge case: [description]"
"Add example for scenario: [description]"
"Clarify instruction: [which section]"

Testing Multiple Skills

Strategy:

Priority order:

Router skills first (affects all specialist discovery)
Foundational skills (prerequisites for others)
Core technique skills (most frequently used)
Advanced skills (expert-level)

Batch approach:

Test 3-5 skills at a time
Document results before moving to next batch
Allows pattern recognition across skills

Efficiency:

Skills that passed in previous maintenance cycles: Spot-check only
New skills or significantly changed: Full gauntlet
Minor edits: Targeted testing of changed sections

Output Format

Generate per-skill report:

# Quality Testing Results: [pack-name]

## Summary

- Total skills tested: [count]
- Passed: [count]
- Fix needed: [count]
  - Critical: [count]
  - Major: [count]
  - Minor: [count]

## Detailed Results

### [Skill 1 Name]

**Result:** [Pass / Fix needed]

[If Fix needed]

**Priority:** [Critical / Major / Minor]

**Test scenario used:** [Brief description]

**Issues identified:**

1. **Issue:** [Description]
   - **Category:** [Rationalization / Edge case / Real-world gap / etc.]
   - **Evidence:** "[Verbatim quote from subagent if applicable]"
   - **Fix needed:** [Specific action]

2. **Issue:** [Description]
   [Same format]

**Test transcript:** [Link or summary of subagent behavior]

---

### [Skill 2 Name]

**Result:** Pass

**Test scenario used:** [Brief description]

**Notes:** Skill performed well, no issues identified.

---

[Repeat for all skills]

Common Rationalizations (Meta-Testing)

When YOU are doing the testing, watch for these rationalizations:

Excuse	Reality
"Skill looks good, no need to test"	Looking ≠ testing. Run gauntlet.
"I'll just check the syntax"	Syntactic validation ≠ behavioral. Use subagents.
"Testing is overkill for small changes"	Small changes can break guidance. Test anyway.
"I'm confident this works"	Confidence ≠ validation. Test behavior.
"Quality benchmarking is enough"	Comparison ≠ effectiveness. Test with scenarios.

If you catch yourself thinking these → STOP. Run gauntlet with subagents.

Philosophy

D as gauntlet + B for fixes:

D (iterative hardening): Run challenging scenarios to identify issues
B (targeted fixes): Fix specific identified problems

If skill passes gauntlet → No changes needed.

The LLM is both author and judge of skill fitness. Trust the testing process.

Proceeding to Next Stage

After testing all skills:

Compile complete test report
Proceed to Pass 3 (coherence validation)
Test results will inform implementation fixes in Stage 4

Anti-Patterns

Anti-Pattern	Why Bad	Instead
Syntactic validation only	Doesn't test if skill actually works	Run behavioral tests with subagents
Self-assessment	You can't objectively test your own work	Dispatch subagents for testing
"Looks good" review	Visual inspection ≠ behavioral testing	Run gauntlet scenarios
Skipping pressure tests	Miss rationalization vulnerabilities	Use A-priority pressure scenarios
Generic test scenarios	Don't reveal real issues	Use domain-specific, realistic scenarios
Testing without documenting	Can't track patterns or close loops	Document verbatim rationalizations

9.5 KiB Raw Permalink Blame History

Testing Skill Quality

Core Principle

What We're Testing

Gauntlet Design

A. Pressure Scenarios (Catch Rationalizations)

C. Adversarial Edge Cases (Test Robustness)

B. Real-World Complexity (Validate Utility)

Testing Process (Per Skill)

1. Design Challenging Scenario

2. Run Subagent with Current Skill

3. Observe and Document

4. Assess Result

5. Document Issues

Testing Multiple Skills

Output Format

Common Rationalizations (Meta-Testing)

Philosophy

Proceeding to Next Stage

Anti-Patterns

9.5 KiB

Raw Permalink Blame History