Initial commit
249
skills/using-skillpack-maintenance/SKILL.md
Normal file
@@ -0,0 +1,249 @@
---
name: using-skillpack-maintenance
description: Use when maintaining or enhancing existing skill packs in the skillpacks repository - systematic pack refresh through domain analysis, structure review, RED-GREEN-REFACTOR gauntlet testing, and automated quality improvements
---

# Skillpack Maintenance
|
||||
|
||||
## Overview
|
||||
|
||||
Systematic maintenance and enhancement of existing skill packs using investigative domain analysis, RED-GREEN-REFACTOR testing, and automated improvements.
|
||||
|
||||
**Core principle:** Maintenance uses behavioral testing (gauntlet with subagents), not syntactic validation. Skills are process documentation - test if they guide agents correctly, not if they parse correctly.
|
||||
|
||||
## When to Use
|
||||
|
||||
Use when:
|
||||
- Enhancing an existing skill pack (e.g., "refresh yzmir-deep-rl")
|
||||
- Improving existing SKILL.md files
|
||||
- Identifying gaps in pack coverage
|
||||
- Validating skill quality through testing
|
||||
|
||||
**Do NOT use for:**
|
||||
- Creating new skills from scratch (use superpowers:writing-skills)
|
||||
- Creating new packs from scratch (design first, then use creation workflow)
|
||||
|
||||
## The Iron Law
|
||||
|
||||
**NO SKILL CHANGES WITHOUT BEHAVIORAL TESTING**
|
||||
|
||||
Syntactic validation (does it parse?) ≠ Behavioral testing (does it work?)
|
||||
|
||||
## Common Rationalizations (from baseline testing)
|
||||
|
||||
| Excuse | Reality |
|--------|---------|
| "Syntactic validation is sufficient" | Parsing ≠ effectiveness. Test with subagents. |
| "Quality benchmarking = effectiveness" | Comparing structure ≠ testing behavior. Run gauntlet. |
| "Comprehensive coverage = working skill" | Coverage ≠ guidance quality. Test if agents follow it. |
| "Following patterns = success" | Pattern-matching ≠ validation. Behavioral testing required. |
| "I'll test if issues emerge" | Issues = broken skills in production. Test BEFORE deploying. |
|
||||
|
||||
**All of these mean: Run behavioral tests with subagents. No exceptions.**
|
||||
|
||||
## Workflow Overview
|
||||
|
||||
**Review → Discuss → [Create New Skills if Needed] → Execute**
|
||||
|
||||
1. **Investigation & Scorecard** → Load `analyzing-pack-domain.md`
|
||||
2. **Structure Review (Pass 1)** → Load `reviewing-pack-structure.md`
|
||||
3. **Content Testing (Pass 2)** → Load `testing-skill-quality.md`
|
||||
4. **Coherence Check (Pass 3)** → Validate cross-skill consistency
|
||||
5. **Discussion** → Present findings, get approval
|
||||
6. **[CONDITIONAL] Create New Skills** → If gaps identified, use `superpowers:writing-skills` for EACH gap (RED-GREEN-REFACTOR)
|
||||
7. **Execution** → Load `implementing-fixes.md`, enhance existing skills only
|
||||
8. **Commit** → Single commit with version bump
|
||||
|
||||
## Stage 1: Investigation & Scorecard
|
||||
|
||||
**Load briefing:** `analyzing-pack-domain.md`
|
||||
|
||||
**Purpose:** Establish "what this pack should cover" from first principles.
|
||||
|
||||
**Adaptive investigation (D→B→C→A):**
|
||||
1. **User-guided scope (D)** - Ask user about pack intent and boundaries
|
||||
2. **LLM knowledge analysis (B)** - Map domain comprehensively, flag if research needed
|
||||
3. **Existing pack audit (C)** - Compare current state vs. coverage map
|
||||
4. **Research if needed (A)** - Conditional: only if domain is rapidly evolving
|
||||
|
||||
**Output:** Domain coverage map, gap analysis, research currency flag
|
||||
|
||||
**Then: Load `reviewing-pack-structure.md` for scorecard**
|
||||
|
||||
**Scorecard levels:**
|
||||
- **Critical** - Pack unusable, recommend rebuild vs. enhance
|
||||
- **Major** - Significant gaps or duplicates
|
||||
- **Minor** - Organizational improvements
|
||||
- **Pass** - Structurally sound
|
||||
|
||||
**Decision gate:** Present scorecard → User decides: Proceed / Rebuild / Cancel
|
||||
|
||||
## Stage 2: Comprehensive Review
|
||||
|
||||
### Pass 1: Structure (from reviewing-pack-structure.md)
|
||||
|
||||
**Analyze:**
|
||||
- Gaps (missing skills based on coverage map)
|
||||
- Duplicates (overlapping coverage - merge/specialize/remove)
|
||||
- Organization (router accuracy, faction alignment, metadata sync)
|
||||
|
||||
**Output:** Structural issues with priorities (critical/major/minor)
|
||||
|
||||
### Pass 2: Content Quality (from testing-skill-quality.md)
|
||||
|
||||
**CRITICAL:** This is behavioral testing with subagents, not syntactic validation.
|
||||
|
||||
**Gauntlet design (A→C→B priority):**
|
||||
|
||||
**A. Pressure scenarios** - Catch rationalizations:
|
||||
- Time pressure: "This is urgent, just do it quickly"
|
||||
- Simplicity temptation: "Too simple to need the skill"
|
||||
- Overkill perception: "Skill is for complex cases, this is straightforward"
|
||||
|
||||
**C. Adversarial edge cases** - Test robustness:
|
||||
- Corner cases where skill principles conflict
|
||||
- Situations where naive application fails
|
||||
|
||||
**B. Real-world complexity** - Validate utility:
|
||||
- Messy requirements, unclear constraints
|
||||
- Multiple valid approaches
|
||||
|
||||
**Testing process per skill:**
|
||||
1. Design challenging scenario from gauntlet categories
|
||||
2. **Run subagent WITH current skill** (behavioral test)
|
||||
3. Observe: Does it follow? Where does it rationalize/fail?
|
||||
4. Document failure modes
|
||||
5. Result: Pass OR Fix needed (with specific issues listed)
|
||||
|
||||
**Philosophy:** Use the gauntlet to identify issues, then apply targeted fixes. If a skill passes the gauntlet, no changes are needed.
|
||||
|
||||
**Output:** Per-skill test results (Pass / Fix needed + priorities)
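
One way to record the per-skill results is sketched below. This is a minimal sketch, not part of the workflow itself: the `GauntletResult` class, its field names, and the `summarize` helper are hypothetical, and the subagent run is assumed to happen elsewhere. A skill counts as "Fix needed" as soon as any scenario documents a failure mode.

```python
from dataclasses import dataclass, field


@dataclass
class GauntletResult:
    """One gauntlet scenario run against one skill (hypothetical record shape)."""
    skill: str
    category: str  # "A" pressure, "C" adversarial edge case, "B" real-world complexity
    scenario: str
    failure_modes: list[str] = field(default_factory=list)

    @property
    def verdict(self) -> str:
        # Any documented failure mode means the skill needs a fix.
        return "Fix needed" if self.failure_modes else "Pass"


def summarize(results: list[GauntletResult]) -> dict[str, str]:
    """Collapse scenario results into one verdict per skill."""
    summary: dict[str, str] = {}
    for result in results:
        if result.verdict == "Fix needed":
            summary[result.skill] = "Fix needed"
        else:
            # Keep an earlier "Fix needed" verdict if one exists.
            summary.setdefault(result.skill, "Pass")
    return summary


results = [
    GauntletResult("skill-a", "A", "Urgent request, skip the process"),
    GauntletResult("skill-b", "C", "Two principles conflict",
                   failure_modes=["Rationalized skipping step 2"]),
    GauntletResult("skill-b", "B", "Messy requirements"),
]
print(summarize(results))
```

Keeping the failure modes as free text preserves the specific issues that Stage 3 asks you to present.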
|
||||
|
||||
### Pass 3: Coherence
|
||||
|
||||
**After structure/content analysis, validate pack-level coherence:**
|
||||
|
||||
1. **Cross-skill consistency** - Terminology, examples, cross-references
|
||||
2. **Router accuracy** - Does using-X router reflect current specialists?
|
||||
3. **Faction alignment** - Check FACTIONS.md, flag drift, suggest rehoming if needed
|
||||
4. **Metadata sync** - plugin.json description, skill count
|
||||
5. **Navigation** - Can users find skills easily?
|
||||
|
||||
**CRITICAL:** Update skills to reference new/enhanced skills (post-update hygiene)
|
||||
|
||||
**Output:** Coherence issues, faction drift flags
|
||||
|
||||
## Stage 3: Interactive Discussion
|
||||
|
||||
**Present findings conversationally:**
|
||||
|
||||
**Structural category:**
|
||||
- **Gaps requiring superpowers:writing-skills** (new skills needed - each requires RED-GREEN-REFACTOR)
|
||||
- Duplicates to remove/merge
|
||||
- Organization issues
|
||||
|
||||
**Content category:**
|
||||
- Skills needing enhancement (from gauntlet failures)
|
||||
- Severity levels (critical/major/minor)
|
||||
- Specific failure modes identified
|
||||
|
||||
**Coherence category:**
|
||||
- Cross-reference updates needed
|
||||
- Faction alignment issues
|
||||
- Metadata corrections
|
||||
|
||||
**Get user approval for scope of work**
|
||||
|
||||
**CRITICAL DECISION POINT:** If gaps (new skills) were identified:
|
||||
- User approves → **IMMEDIATELY use superpowers:writing-skills for EACH gap**
|
||||
- Do NOT proceed to Stage 4 until ALL new skills are created and tested
|
||||
- Each gap = separate RED-GREEN-REFACTOR cycle
|
||||
- Proceed to Stage 4 only after ALL gaps are filled
|
||||
|
||||
## Stage 4: Autonomous Execution
|
||||
|
||||
**Load briefing:** `implementing-fixes.md`
|
||||
|
||||
**PREREQUISITE CHECK:**
|
||||
- ✓ Zero gaps identified, OR
|
||||
- ✓ All gaps already filled using superpowers:writing-skills (each skill individually tested)
|
||||
|
||||
**If gaps exist and you haven't used writing-skills:** STOP. Return to Stage 3.
|
||||
|
||||
**Execute approved changes:**
|
||||
|
||||
1. **Structural fixes** - Remove/merge duplicate skills, update router
|
||||
2. **Content enhancements** - Fix gauntlet failures, add missing guidance to existing skills
|
||||
3. **Coherence improvements** - Cross-references, terminology alignment, faction voice
|
||||
4. **Version management** - Apply impact-based bump (patch/minor/major)
|
||||
5. **Git commit** - Single commit with all changes
|
||||
|
||||
**Version bump rules (impact-based):**
|
||||
- **Patch (x.y.Z)** - Low-impact: typos, formatting, minor clarifications
|
||||
- **Minor (x.Y.0)** - Medium-impact: enhanced guidance, new skills, better examples (DEFAULT)
|
||||
- **Major (X.0.0)** - High-impact: skills removed, structural changes, philosophy shifts (RARE)
|
||||
|
||||
**Commit format:**
|
||||
```
feat(meta): enhance [pack-name] - [summary]

[Detailed list of changes by category]
- Structure: [changes]
- Content: [changes]
- Coherence: [changes]

Version bump: [reason for patch/minor/major]
```
|
||||
|
||||
**Output:** Enhanced pack, commit created, summary report
|
||||
|
||||
## Briefing Files Reference
|
||||
|
||||
All briefing files are in this skill directory:
|
||||
|
||||
- `analyzing-pack-domain.md` - Investigative domain analysis (D→B→C→A)
|
||||
- `reviewing-pack-structure.md` - Structure review, scorecard, gap/duplicate analysis
|
||||
- `testing-skill-quality.md` - Gauntlet testing methodology with subagents
|
||||
- `implementing-fixes.md` - Autonomous execution, version management, git commit
|
||||
|
||||
**Load appropriate briefing at each stage.**
|
||||
|
||||
## Critical Distinctions
|
||||
|
||||
**Behavioral vs. Syntactic Testing:**
|
||||
- ❌ **Syntactic:** "Does Python code parse?" → ast.parse()
|
||||
- ✅ **Behavioral:** "Does skill guide agents correctly?" → Subagent gauntlet
|
||||
|
||||
**This workflow requires BEHAVIORAL testing.**
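
The distinction can be illustrated with ordinary code. In the sketch below, the `mean` function and both check helpers are hypothetical examples, not part of any skill pack. The source parses cleanly, so a syntactic check passes, yet the function returns the wrong answer, which only a behavioral check catches.

```python
import ast

# Hypothetical buggy helper: valid syntax, wrong behavior.
SOURCE = """
def mean(values):
    return sum(values) / (len(values) + 1)  # off-by-one bug
"""


def passes_syntactic_check(source: str) -> bool:
    """Return True if the source parses; this says nothing about correctness."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def passes_behavioral_check(source: str) -> bool:
    """Execute the code and compare its result with the expected outcome."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace["mean"]([2, 4, 6]) == 4


print(passes_syntactic_check(SOURCE))   # True
print(passes_behavioral_check(SOURCE))  # False
```

The same gap applies to skills: a SKILL.md can be well-formed and still fail to guide an agent, which is why the gauntlet runs subagents instead of parsers.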
|
||||
|
||||
**Maintenance vs. Creation:**
|
||||
- **Maintenance** (this skill): Enhancing existing SKILL.md files
|
||||
- **Creation** (superpowers:writing-skills): Writing new skills from scratch
|
||||
|
||||
**Use the right tool for the task.**
|
||||
|
||||
## Red Flags - STOP and Switch Tools
|
||||
|
||||
If you catch yourself thinking ANY of these:
|
||||
- "I'll write the new skills during execution" → NO. Use superpowers:writing-skills for EACH gap
|
||||
- "implementing-fixes.md says to create skills" → NO. That section was REMOVED. Exit and use writing-skills
|
||||
- "Token efficiency - I can just write good skills" → NO. Untested skills = broken skills
|
||||
- "I see the pattern, I can replicate it" → NO. Pattern-matching ≠ behavioral testing
|
||||
- "User wants this done quickly" → NO. Fast + untested = waste of time fixing later
|
||||
- "I'm competent, testing is overkill" → NO. Competence = following the process
|
||||
- "Gaps were approved, so I should fill them" → YES, but using writing-skills, not here
|
||||
- Validating syntax instead of behavior → Load testing-skill-quality.md
|
||||
- Skipping gauntlet testing → You're violating the Iron Law
|
||||
- Making changes without user approval → Follow Review→Discuss→Execute
|
||||
|
||||
**All of these mean: STOP. Exit workflow. Use superpowers:writing-skills.**
|
||||
|
||||
## The Bottom Line
|
||||
|
||||
**Maintaining skills requires behavioral testing, not syntactic validation.**
|
||||
|
||||
Same principle as code: you test behavior, not syntax.
|
||||
|
||||
Load briefings at each stage. Test with subagents. Get approval. Execute.
|
||||
|
||||
No shortcuts. No rationalizations.
|
||||
168
skills/using-skillpack-maintenance/analyzing-pack-domain.md
Normal file
@@ -0,0 +1,168 @@
|
||||
# Analyzing Pack Domain
|
||||
|
||||
**Purpose:** Investigative process to establish "what this pack should cover" from first principles.
|
||||
|
||||
## Adaptive Investigation Workflow
|
||||
|
||||
**Sequence: D → B → C → A (conditional)**
|
||||
|
||||
### Phase D: User-Guided Scope
|
||||
|
||||
**Ask user:**
|
||||
- "What is the intended scope and purpose of [pack-name]?"
|
||||
- "What boundaries should this pack respect?"
|
||||
- "Who is the target audience? (beginners / practitioners / experts)"
|
||||
- "What depth of coverage is expected? (overview / comprehensive / exhaustive)"
|
||||
|
||||
**Document:**
|
||||
- User's vision as baseline
|
||||
- Explicit boundaries (what's IN scope, what's OUT of scope)
|
||||
- Success criteria (what makes this pack "complete"?)
|
||||
|
||||
### Phase B: LLM Knowledge-Based Analysis
|
||||
|
||||
**Leverage model knowledge to map the domain:**
|
||||
|
||||
1. **Generate comprehensive coverage map:**
|
||||
- What are the major concepts/algorithms/techniques in this domain?
|
||||
- What are the core patterns practitioners need to know?
|
||||
- What are common implementation challenges?
|
||||
|
||||
2. **Identify structure:**
|
||||
- Foundational concepts (must understand first)
|
||||
- Core techniques (bread-and-butter patterns)
|
||||
- Advanced topics (expert-level material)
|
||||
- Cross-cutting concerns (testing, debugging, optimization)
|
||||
|
||||
3. **Flag research currency:**
|
||||
- Is this domain stable or rapidly evolving?
|
||||
- Stable examples: Design patterns, basic algorithms, established protocols
|
||||
- Evolving examples: AI/ML, security, modern web frameworks
|
||||
- If evolving → Flag for Phase A research
|
||||
|
||||
**Output:** Coverage map with categorization (foundational/core/advanced)
|
||||
|
||||
### Phase C: Existing Pack Audit
|
||||
|
||||
**Read all current skills in the pack:**
|
||||
|
||||
1. **Inventory:**
|
||||
- List all SKILL.md files
|
||||
- Note skill names and descriptions
|
||||
- Check router skill (if exists) for specialist list
|
||||
|
||||
2. **Compare against coverage map:**
|
||||
- What's covered? (existing skills matching coverage areas)
|
||||
- What's missing? (gaps in foundational/core/advanced areas)
|
||||
- What overlaps? (multiple skills covering same concept)
|
||||
- What's obsolete? (outdated approaches, deprecated patterns)
|
||||
|
||||
3. **Quality check:**
|
||||
- Are descriptions accurate?
|
||||
- Do skills match their descriptions?
|
||||
- Are there broken cross-references?
|
||||
|
||||
**Output:** Gap list, duplicate list, obsolescence flags
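
The comparison in step 2 can be sketched as a set operation. This is a minimal sketch under stated assumptions: the `audit_pack` function, the shape of its inputs, and the concept names are hypothetical, and obsolescence is not modelled because it requires reading the skills.

```python
def audit_pack(
    coverage_map: dict[str, list[str]],
    skills: dict[str, set[str]],
) -> tuple[dict[str, list[str]], dict[str, list[str]]]:
    """Compare a coverage map against the concepts each skill covers.

    coverage_map: category -> concepts the pack should cover
    skills: skill name -> concepts that skill actually covers
    Returns (gaps per category, overlapping concepts with the skills covering them).
    """
    covered_by: dict[str, list[str]] = {}
    for skill, concepts in skills.items():
        for concept in concepts:
            covered_by.setdefault(concept, []).append(skill)

    gaps = {}
    for category, concepts in coverage_map.items():
        missing = [c for c in concepts if c not in covered_by]
        if missing:
            gaps[category] = missing

    overlaps = {
        concept: sorted(names)
        for concept, names in covered_by.items()
        if len(names) > 1
    }
    return gaps, overlaps


coverage = {
    "foundational": ["value functions", "policy gradients"],
    "core": ["replay buffers"],
}
pack = {
    "skill-a": {"value functions", "replay buffers"},
    "skill-b": {"replay buffers"},
}
gaps, overlaps = audit_pack(coverage, pack)
print(gaps)      # {'foundational': ['policy gradients']}
print(overlaps)  # {'replay buffers': ['skill-a', 'skill-b']}
```

Overlaps found this way are only candidates; whether two skills are true duplicates still depends on reading their content.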
|
||||
|
||||
### Phase A: Research (Conditional)
|
||||
|
||||
**ONLY if Phase B flagged domain as rapidly evolving.**
|
||||
|
||||
**Research authoritative sources:**
|
||||
|
||||
1. **For AI/ML domains:**
|
||||
- Latest survey papers (search: "[domain] survey 2024/2025")
|
||||
- Current textbooks (check publication dates)
|
||||
- Official library documentation (PyTorch, TensorFlow, etc.)
|
||||
- Research benchmarks (Papers with Code, etc.)
|
||||
|
||||
2. **For security domains:**
|
||||
- OWASP guidelines
|
||||
- NIST standards
|
||||
- Recent CVE patterns
|
||||
- Current threat landscape
|
||||
|
||||
3. **For framework domains:**
|
||||
- Official documentation (latest version)
|
||||
- Migration guides (breaking changes)
|
||||
- Best practices (official recommendations)
|
||||
|
||||
**Update coverage map:**
|
||||
- Add new techniques/patterns
|
||||
- Flag deprecated approaches in existing skills
|
||||
- Note version-specific considerations
|
||||
|
||||
**Decision criteria for Phase A:**
|
||||
- **Skip research** for: Math, algorithms, design patterns, established protocols
|
||||
- **Run research** for: AI/ML, security, modern frameworks, evolving standards
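
The decision criteria reduce to a simple lookup. In this minimal sketch the tag vocabulary and the `needs_research` function are hypothetical; it assumes a domain described by several tags needs research if any one of them is evolving.

```python
# Tag sets taken from the criteria above; the exact spellings are an assumption.
STABLE = {"math", "algorithms", "design patterns", "established protocols"}
EVOLVING = {"ai/ml", "security", "modern frameworks", "evolving standards"}


def needs_research(domain_tags: set[str]) -> bool:
    """Return True if any tag marks the domain as rapidly evolving."""
    tags = {tag.lower() for tag in domain_tags}
    return bool(tags & EVOLVING)


print(needs_research({"algorithms"}))           # False
print(needs_research({"algorithms", "AI/ML"}))  # True
```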
|
||||
|
||||
## Outputs
|
||||
|
||||
Generate comprehensive report:
|
||||
|
||||
### 1. Domain Coverage Map
|
||||
|
||||
```
Foundational:
- [Concept 1] - [Status: Exists / Missing / Needs enhancement]
- [Concept 2] - [Status: Exists / Missing / Needs enhancement]

Core Techniques:
- [Technique 1] - [Status: ...]
- [Technique 2] - [Status: ...]

Advanced Topics:
- [Topic 1] - [Status: ...]
- [Topic 2] - [Status: ...]

Cross-Cutting:
- [Concern 1] - [Status: ...]
```
|
||||
|
||||
### 2. Current State Assessment
|
||||
|
||||
```
Existing skills: [count]
- [Skill 1 name] - covers [domain area]
- [Skill 2 name] - covers [domain area]
...
```
|
||||
|
||||
### 3. Gap Analysis
|
||||
|
||||
```
Missing (High Priority):
- [Gap 1] - foundational concept not covered
- [Gap 2] - core technique missing

Missing (Medium Priority):
- [Gap 3] - advanced topic not covered

Duplicates:
- [Skill A] and [Skill B] overlap on [topic]

Obsolete:
- [Skill C] uses deprecated approach [old pattern]
```
|
||||
|
||||
### 4. Research Currency Flag
|
||||
|
||||
```
Domain stability: [Stable / Evolving]
Research conducted: [Yes / No / Not needed]
Currency concerns: [None / List specific areas needing updates]
```
|
||||
|
||||
## Proceeding to Next Stage
|
||||
|
||||
After completing investigation, hand off to `reviewing-pack-structure.md` for scorecard generation.
|
||||
|
||||
## Common Mistakes
|
||||
|
||||
| Mistake | Fix |
|---------|-----|
| Skipping user input (Phase D) | Always start with user vision - they define scope |
| Over-relying on LLM knowledge | For evolving domains, run research (Phase A) |
| Skipping gap analysis | Compare coverage map vs. existing skills systematically |
| Treating all domains as stable | Flag AI/ML/security/frameworks for research |
| Vague gap descriptions | Be specific: "Missing TaskGroup patterns" not "async needs work" |
|
||||
334
skills/using-skillpack-maintenance/implementing-fixes.md
Normal file
@@ -0,0 +1,334 @@
|
||||
# Implementing Fixes
|
||||
|
||||
**Purpose:** Autonomous execution of approved changes with version management and git commit.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
You should have completed and gotten approval for:
|
||||
- Pass 1: Structure review (gaps, duplicates, organization)
|
||||
- Pass 2: Content testing (gauntlet results, fix priorities)
|
||||
- Pass 3: Coherence validation (cross-skill consistency, faction alignment)
|
||||
- User discussion and approval of scope
|
||||
|
||||
**Do NOT proceed without user approval of the scope of work.**
|
||||
|
||||
## Execution Workflow
|
||||
|
||||
### 1. Structural Fixes (from Pass 1)
|
||||
|
||||
**CRITICAL CHECKPOINT - New Skills:**
|
||||
|
||||
**STOP:** Did you identify gaps (new skills needed) in Pass 1?
|
||||
|
||||
**If YES → You MUST exit this workflow NOW:**
|
||||
|
||||
1. **DO NOT proceed to execution**
|
||||
2. **For EACH gap identified:**
|
||||
- Use `superpowers:writing-skills` skill
|
||||
- RED: Test scenario WITHOUT the skill
|
||||
- GREEN: Write skill addressing gaps
|
||||
- REFACTOR: Close loopholes
|
||||
- **Commit that ONE skill**
|
||||
3. **Repeat for ALL gaps** (each skill = separate RED-GREEN-REFACTOR cycle)
|
||||
4. **AFTER all new skills are tested and committed:**
|
||||
- Return to `using-skillpack-maintenance`
|
||||
- Load this briefing again
|
||||
- Continue with other structural fixes below
|
||||
|
||||
**Proceeding past this checkpoint assumes:**
|
||||
- ✓ Zero new skills needed, OR
|
||||
- ✓ All new skills already created via superpowers:writing-skills
|
||||
- ✓ You are ONLY enhancing existing skills, removing duplicates, updating router/metadata
|
||||
|
||||
**If you identified ANY gaps and haven't used superpowers:writing-skills for each:**
|
||||
**STOP. Exit now. You're violating the Iron Law: NO SKILL CHANGES WITHOUT BEHAVIORAL TESTING.**
|
||||
|
||||
---
|
||||
|
||||
**Remove duplicate skills:**
|
||||
|
||||
For skills marked for removal:
|
||||
1. Identify unique value in skill being removed
|
||||
2. Merge unique value into kept skill (if any)
|
||||
3. Delete duplicate SKILL.md and directory
|
||||
4. Remove references from router (if exists)
|
||||
5. Update cross-references in other skills
|
||||
|
||||
**Merge overlapping skills:**
|
||||
|
||||
For partial duplicates:
|
||||
1. Identify all unique content from both skills
|
||||
2. Create merged skill with comprehensive coverage
|
||||
3. Reorganize structure if needed
|
||||
4. Delete original skills
|
||||
5. Update router and cross-references
|
||||
6. Update skill name/description if needed
|
||||
|
||||
**Update router skill:**
|
||||
|
||||
If pack has using-X router:
|
||||
1. Update specialist list to reflect adds/removes
|
||||
2. Update descriptions to match current skills
|
||||
3. Verify routing logic makes sense
|
||||
4. Add cross-references as needed
|
||||
|
||||
### 2. Content Enhancements (from Pass 2)
|
||||
|
||||
**For each skill marked "Fix needed" in gauntlet testing:**
|
||||
|
||||
**Fix rationalizations (A-type issues):**
|
||||
1. Add explicit counter for each identified rationalization
|
||||
2. Update "Common Rationalizations" table
|
||||
3. Add to "Red Flags" list if applicable
|
||||
4. Strengthen "No exceptions" language
|
||||
|
||||
**Fill edge case gaps (C-type issues):**
|
||||
1. Add guidance for identified corner cases
|
||||
2. Document when/how to adapt core principles
|
||||
3. Add examples for edge case handling
|
||||
4. Cross-reference related skills if needed
|
||||
|
||||
**Enhance real-world guidance (B-type issues):**
|
||||
1. Add examples from realistic scenarios
|
||||
2. Clarify ambiguous instructions
|
||||
3. Add decision frameworks where needed
|
||||
4. Update "When to Use" section if unclear
|
||||
|
||||
**Add anti-patterns:**
|
||||
1. Document observed failure modes from testing
|
||||
2. Add ❌ WRONG / ✅ CORRECT examples
|
||||
3. Update "Common Mistakes" section
|
||||
4. Add warnings for subtle pitfalls
|
||||
|
||||
**Improve examples:**
|
||||
1. Replace weak examples with tested scenarios
|
||||
2. Ensure examples are complete and runnable
|
||||
3. Add comments explaining WHY, not just WHAT
|
||||
4. Use realistic domain context
|
||||
|
||||
### 3. Coherence Improvements (from Pass 3)
|
||||
|
||||
**Cross-reference updates:**
|
||||
|
||||
**CRITICAL:** This is post-update hygiene - ensure skills reference new/enhanced skills.
|
||||
|
||||
For each skill in pack:
|
||||
1. Identify related skills (related concepts, prerequisites, follow-ups)
|
||||
2. Add cross-references where helpful:
|
||||
- "See [skill-name] for [related concept]"
|
||||
- "**REQUIRED BACKGROUND:** [skill-name]"
|
||||
- "After mastering this, see [skill-name]"
|
||||
3. Update router cross-references
|
||||
4. Ensure bidirectional links (if A references B, should B reference A?)
|
||||
|
||||
**Terminology alignment:**
|
||||
|
||||
1. Identify terminology inconsistencies across skills
|
||||
2. Choose canonical terms (most clear/standard)
|
||||
3. Update all skills to use canonical terms
|
||||
4. Add glossary to router if needed
|
||||
|
||||
**Faction voice adjustment:**
|
||||
|
||||
For skills flagged with faction drift:
|
||||
1. Read FACTIONS.md for faction principles
|
||||
2. Adjust language/tone to match faction
|
||||
3. Realign examples with faction philosophy
|
||||
4. If severe drift: Flag for potential rehoming
|
||||
|
||||
**If rehoming recommended:**
|
||||
- Document which faction skill should move to
|
||||
- Note in commit message for manual handling
|
||||
- Don't move skills automatically (requires marketplace changes)
|
||||
|
||||
**Metadata synchronization:**
|
||||
|
||||
Update `plugin.json`:
|
||||
1. Description - ensure it matches enhanced pack content
|
||||
2. Count skills if tool supports it
|
||||
3. Verify category is appropriate
|
||||
|
||||
### 4. Version Management (Impact-Based)
|
||||
|
||||
**Assess impact of all changes:**
|
||||
|
||||
**Patch bump (x.y.Z) - Low impact:**
|
||||
- Typos fixed
|
||||
- Formatting improvements
|
||||
- Minor clarifications (< 50 words added)
|
||||
- Small example corrections
|
||||
- No new skills, no skills removed
|
||||
|
||||
**Minor bump (x.Y.0) - Medium impact (DEFAULT):**
|
||||
- Enhanced guidance (added sections, better examples)
|
||||
- New skills added
|
||||
- Improved existing skills significantly
|
||||
- Better anti-pattern coverage
|
||||
- Fixed gauntlet failures
|
||||
- Updated for current best practices
|
||||
|
||||
**Major bump (X.0.0) - High impact (RARE, use sparingly):**
|
||||
- Skills removed entirely
|
||||
- Structural reorganization
|
||||
- Philosophy shifts
|
||||
- Breaking changes to how skills work
|
||||
- Deprecated major patterns
|
||||
|
||||
**Decision logic:**
|
||||
1. Any new skills added? → Minor minimum
|
||||
2. Any skills removed? → Consider major
|
||||
3. Only fixes/clarifications? → Patch
|
||||
4. Enhanced multiple skills significantly? → Minor
|
||||
5. Changed pack philosophy? → Major
|
||||
|
||||
**Default for maintenance reviews: Minor bump**
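
The decision logic can be sketched as a small function. This is a minimal sketch, not a definitive implementation: the `ChangeSet` fields and both helpers are hypothetical, and it treats any skill removal as a major bump even though the rule above only says to consider one.

```python
from dataclasses import dataclass


@dataclass
class ChangeSet:
    """Summary of a maintenance pass (hypothetical field names)."""
    skills_added: int = 0
    skills_removed: int = 0
    skills_enhanced: int = 0
    philosophy_changed: bool = False


def choose_bump(changes: ChangeSet) -> str:
    """Map the impact of a change set to patch / minor / major."""
    # Assumption: removals always count as major here.
    if changes.philosophy_changed or changes.skills_removed > 0:
        return "major"
    if changes.skills_added > 0 or changes.skills_enhanced > 0:
        return "minor"
    # Only typos, formatting, or small clarifications remain.
    return "patch"


def bump_version(version: str, bump: str) -> str:
    """Apply a bump to an x.y.z version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    if bump == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump type: {bump}")


changes = ChangeSet(skills_enhanced=3)
print(bump_version("1.2.3", choose_bump(changes)))  # 1.3.0
```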
|
||||
|
||||
**Update version in plugin.json:**
|
||||
```json
{
  "version": "[new-version]"
}
```
|
||||
|
||||
### 5. Git Commit
|
||||
|
||||
**Single commit with all changes:**
|
||||
|
||||
**Commit message format:**
|
||||
|
||||
```
feat(meta): enhance [pack-name] - [one-line summary]

Structure changes:
- Added [count] new skills: [skill-1], [skill-2], ...
- Removed [count] duplicate skills: [skill-1], [skill-2], ...
- Merged [skill-a] + [skill-b] into [skill-merged]
- Updated router to reflect new structure

Content improvements:
- Enhanced [skill-1]: [specific improvements]
- Enhanced [skill-2]: [specific improvements]
- Fixed gauntlet failures in [skill-3]: [issues addressed]
- Added anti-patterns to [skill-4]

Coherence updates:
- Added cross-references between [count] skills
- Aligned terminology across pack
- Adjusted faction voice in [skill-name]
- Updated plugin.json metadata

Version: [old-version] → [new-version] ([patch/minor/major])
Rationale: [reason for version bump type]
```
|
||||
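
If the change lists are already collected, a message in this format could be assembled programmatically. This is a minimal sketch: the `build_commit_message` function and its parameters are hypothetical, and it assumes empty categories are simply omitted.

```python
def build_commit_message(
    pack: str,
    summary: str,
    structure: list[str],
    content: list[str],
    coherence: list[str],
    old_version: str,
    new_version: str,
    bump: str,
    rationale: str,
) -> str:
    """Assemble a commit message in the format described above."""
    sections = [
        ("Structure changes:", structure),
        ("Content improvements:", content),
        ("Coherence updates:", coherence),
    ]
    lines = [f"feat(meta): enhance {pack} - {summary}", ""]
    for heading, items in sections:
        if not items:
            continue  # assumption: skip categories with no changes
        lines.append(heading)
        lines.extend(f"- {item}" for item in items)
        lines.append("")
    lines.append(f"Version: {old_version} → {new_version} ({bump})")
    lines.append(f"Rationale: {rationale}")
    return "\n".join(lines)


message = build_commit_message(
    pack="yzmir-deep-rl",
    summary="fix gauntlet failures",
    structure=[],
    content=["Enhanced skill-a: added rationalization counters"],
    coherence=["Updated plugin.json metadata"],
    old_version="1.2.0",
    new_version="1.3.0",
    bump="minor",
    rationale="enhanced guidance in existing skills",
)
print(message)
```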
|
||||
**Commit command:**
|
||||
|
||||
```bash
git add plugins/[pack-name]/
git commit -m "$(cat <<'EOF'
feat(meta): enhance [pack-name] - [summary]

[Full message body as above]
EOF
)"
```
|
||||
|
||||
**Do NOT push** - let user decide when to push.
|
||||
|
||||
## Execution Principles
|
||||
|
||||
**Autonomous within approved scope:**
|
||||
- Execute all approved changes without asking again
|
||||
- Follow user's approved plan exactly
|
||||
- Make editorial decisions within scope
|
||||
- Ask only if something unexpected blocks progress
|
||||
|
||||
**Quality standards:**
|
||||
- All new skills follow CSO guidelines (name/description format)
|
||||
- All code examples are complete and appropriate to domain
|
||||
- All cross-references are accurate
|
||||
- Faction voice is maintained
|
||||
|
||||
**Verification before commit:**
|
||||
- Verify YAML front matter syntax in all modified skills
|
||||
- Check that all cross-references point to existing skills
|
||||
- Ensure router (if exists) references all current skills
|
||||
- Verify plugin.json has valid JSON syntax
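
These checks could be scripted along the following lines. This is a minimal sketch using only the standard library: all three helper names are hypothetical, and the front matter reader handles only flat `key: value` lines, so it is not a substitute for a real YAML parser. These are pre-commit sanity checks only and do not replace behavioral testing.

```python
import json


def plugin_json_is_valid(text: str) -> bool:
    """Return True if the plugin.json text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def front_matter_fields(skill_text: str) -> dict[str, str]:
    """Read flat `key: value` pairs between the opening and closing `---` lines."""
    lines = skill_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}  # closing '---' never found


def missing_references(referenced: set[str], existing: set[str]) -> set[str]:
    """Return referenced skill names that do not exist in the pack."""
    return referenced - existing


skill = "---\nname: skill-a\ndescription: Use when testing\n---\n# Skill A\n"
print(front_matter_fields(skill))
print(missing_references({"skill-a", "skill-z"}, {"skill-a", "skill-b"}))
```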
|
||||
|
||||
## Output After Completion
|
||||
|
||||
Provide comprehensive summary:
|
||||
|
||||
```
# Pack Enhancement Complete: [pack-name]

## Version: [old] → [new] ([type])

## Summary Statistics

- Skills added: [count]
- Skills removed: [count]
- Skills enhanced: [count]
- Skills tested and passed: [count]

## Changes by Category

### Structure
[List of structural changes]

### Content
[List of content improvements]

### Coherence
[List of coherence updates]

## Git Commit

Created commit: [commit-hash if available]
Message: [first line of commit]

Ready to push: [Yes]
```
|
||||
|
||||
## Common Mistakes
|
||||
|
||||
| Mistake | Fix |
|---------|-----|
| Proceeding without approval | Always get user approval before executing |
| Batch changes across passes | Complete one pass fully before next |
| Inconsistent faction voice | Read FACTIONS.md, maintain voice throughout |
| Broken cross-references | Verify all referenced skills exist |
| Invalid YAML | Check syntax before committing |
| Pushing automatically | Let user decide when to push |
| Vague commit messages | Be specific about what changed and why |
| Wrong version bump | Follow impact-based rules, default to minor |
|
||||
|
||||
## Anti-Patterns
|
||||
|
||||
**❌ Changing scope during execution:**
|
||||
- Don't add extra improvements not discussed
|
||||
- Don't skip approved changes because "not needed"
|
||||
- Stick to approved scope exactly
|
||||
|
||||
**❌ Sub-optimal quality:**
|
||||
- Don't write quick/dirty skills to fill gaps
|
||||
- Don't copy-paste without adapting to faction
|
||||
- Don't skip cross-references to save time
|
||||
|
||||
**❌ Incomplete commits:**
|
||||
- Don't commit partial work
|
||||
- Don't split into multiple commits
|
||||
- Single commit with all changes
|
||||
|
||||
**❌ No verification:**
|
||||
- Don't assume syntax is correct
|
||||
- Don't skip cross-reference checking
|
||||
- Verify before committing
|
||||
|
||||
## The Bottom Line
|
||||
|
||||
**Execute approved changes autonomously with high quality standards.**
|
||||
|
||||
One commit. Proper versioning. Complete summary.
|
||||
|
||||
No shortcuts. No scope creep. Professional execution.
|
||||
252
skills/using-skillpack-maintenance/reviewing-pack-structure.md
Normal file
@@ -0,0 +1,252 @@
|
||||
# Reviewing Pack Structure
|
||||
|
||||
**Purpose:** Pass 1 - Analyze pack organization, identify structural issues, generate fitness scorecard.
|
||||
|
||||
## Inputs
|
||||
|
||||
From `analyzing-pack-domain.md`:
|
||||
- Domain coverage map (what should exist)
|
||||
- Current skill inventory (what does exist)
|
||||
- Gap analysis (missing/duplicate/obsolete)
|
||||
- Research currency flag
|
||||
## Analysis Tasks

### 1. Fitness Scorecard

Generate scorecard with risk-driven prioritization:

**Critical Issues** - Pack unusable or fundamentally broken:
- Missing core foundational concepts (users can't understand basics)
- Major gaps in essential coverage (50%+ of core techniques missing)
- Router completely inaccurate or missing when needed
- Multiple skills broken or contradictory

**Decision:** For Critical issues, recommend either "Rebuild from scratch" or "Enhance existing".

Rebuild if:
- More skills missing than exist
- Fundamental philosophy is wrong
- Faction mismatch is severe

Enhance if:
- Core structure is sound, just needs additions/fixes
- Most existing skills are salvageable

**Major Issues** - Significant effectiveness reduction:
- Important gaps in core coverage (20-50% of core techniques missing)
- Multiple duplicate skills causing confusion
- Obsolete skills teaching deprecated patterns
- Faction drift across multiple skills
- Metadata significantly out of sync

**Minor Issues** - Polish and improvements:
- Small gaps in advanced topics
- Minor organizational inconsistencies
- Router descriptions slightly outdated
- Small metadata corrections needed

**Pass** - Structurally sound:
- Comprehensive coverage of foundational and core areas
- No major gaps or duplicates
- Router (if exists) is accurate
- Faction alignment is good
- Metadata is current

**Output:** Scorecard with category and specific issues listed

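The thresholds above can be sketched as a small classifier. This is a minimal illustration, not part of any pack tooling: `PackSnapshot`, its fields, and `classify` are hypothetical names, and only the 50% and 20% cut-offs come from the scorecard text. Treating a broken router as Critical on its own and more than one duplicate pair as Major are simplifying assumptions.

```python
from dataclasses import dataclass


@dataclass
class PackSnapshot:
    """Hypothetical summary of a pack, filled in from the Pass 1 inputs."""
    core_expected: int           # core techniques the coverage map says should exist
    core_missing: int            # core techniques with no corresponding skill
    router_broken: bool = False  # router completely inaccurate or missing when needed
    duplicate_pairs: int = 0     # confirmed duplicate skill pairs
    minor_issues: int = 0        # small gaps, stale descriptions, metadata nits


def classify(pack: PackSnapshot) -> str:
    """Map a snapshot to a scorecard category (assumed rules, see lead-in)."""
    missing = pack.core_missing / pack.core_expected if pack.core_expected else 0.0
    if missing >= 0.5 or pack.router_broken:
        return "Critical"
    if missing >= 0.2 or pack.duplicate_pairs > 1:
        return "Major"
    if missing > 0 or pack.duplicate_pairs == 1 or pack.minor_issues > 0:
        return "Minor"
    return "Pass"


print(classify(PackSnapshot(core_expected=10, core_missing=5)))  # Critical
```

A real review still needs judgment for the qualitative criteria (faction mismatch, contradictory skills); the sketch only covers the countable ones.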
### 2. Gap Identification

From coverage map, identify missing skills:

**Prioritize by importance:**

**High priority (foundational/core):**
- Foundational concepts users must understand
- Core techniques used frequently
- Common patterns missing from basics

**Medium priority (advanced):**
- Advanced topics for expert users
- Specialized techniques
- Edge case handling

**Low priority (nice-to-have):**
- Rare patterns
- Future-looking topics
- Experimental techniques

**For each gap:**
- Draft skill name (following naming conventions)
- Write description (following CSO guidelines)
- Estimate scope (small/medium/large skill)
- Note dependencies (what skills should be read first)

**Output:** Prioritized list of gaps with draft names/descriptions

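One way to keep the per-gap fields together and produce the prioritized list is a small record type. This is a sketch under stated assumptions: the `Gap` class, its field names, and the example skill names are hypothetical and only mirror the bullets above.

```python
from dataclasses import dataclass, field

# Lower number sorts first; labels follow the three priority tiers above.
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}


@dataclass
class Gap:
    name: str                     # draft skill name
    description: str              # draft description
    priority: str                 # "high", "medium", or "low"
    scope: str                    # "small", "medium", or "large"
    dependencies: list[str] = field(default_factory=list)  # skills to read first


def prioritize(gaps: list[Gap]) -> list[Gap]:
    """Return gaps ordered high -> medium -> low, keeping input order within a tier."""
    return sorted(gaps, key=lambda gap: PRIORITY_ORDER[gap.priority])


gaps = [
    Gap("experimental-schedulers", "Use when ...", "low", "small"),
    Gap("core-concepts", "Use when ...", "high", "large"),
    Gap("edge-case-handling", "Use when ...", "medium", "medium", ["core-concepts"]),
]
for gap in prioritize(gaps):
    print(gap.priority, gap.name)
```

Because `sorted` is stable, gaps within the same tier stay in the order they were recorded.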
### 3. Duplicate Detection

Find skills with overlapping coverage:

**Analysis process:**
1. Read all skill descriptions
2. Identify content overlap (skills covering same concepts)
3. Read overlapping skills to assess actual content
4. Determine relationship

**Duplicate types:**

**Complete duplicates** - Same content, different names:
- **Action:** Remove one, preserve unique value from both

**Partial overlap** - Significant shared content:
- **Action:** Merge into single comprehensive skill

**Specialization** - One general, one specific:
- **Action:** Keep both, clarify relationship via cross-references
- Example: "async-patterns" (general) + "asyncio-taskgroup" (specific)

**Complementary** - Different angles on same topic:
- **Action:** Keep both, strengthen cross-references
- Example: "testing-async-code" + "async-patterns-and-concurrency"

**False positive** - Similar names, different content:
- **Action:** No change, maybe clarify descriptions

**For each duplicate pair, record:**
- Classification (complete/partial/specialization/complementary/false positive)
- Recommendation (remove/merge/keep with cross-refs)
- Unique value to preserve from each skill

**Output:** Duplicate analysis with recommendations

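Step 2 of the analysis process can be pre-screened mechanically before reading the skills in full. The sketch below flags candidate pairs by word overlap (Jaccard similarity) between descriptions. It is an assumption-laden aid, not the method this workflow prescribes: the function names, the 0.3 threshold, and the sample descriptions are all made up for illustration, and classification into the duplicate types above still requires reading the skills.

```python
import re
from itertools import combinations


def tokens(text: str) -> set[str]:
    """Lowercased word set of a description."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_pairs(descriptions: dict[str, str], threshold: float = 0.3):
    """Return (skill_a, skill_b, score) for pairs whose descriptions overlap."""
    token_sets = {name: tokens(text) for name, text in descriptions.items()}
    pairs = []
    for a, b in combinations(sorted(token_sets), 2):
        score = jaccard(token_sets[a], token_sets[b])
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs


descriptions = {
    "async-patterns": "Use when writing async code with asyncio tasks",
    "asyncio-taskgroup": "Use when writing async code with asyncio TaskGroup",
    "sql-indexing": "Choosing database indexes for slow queries",
}
print(candidate_pairs(descriptions))
```

A high score only says "read these two together"; a specialization pair and a complete duplicate can score the same.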
### 4. Organization Validation

Check pack-level organization:

**Router skill validation (if exists):**
- Does router list all current specialist skills?
- Are descriptions in router accurate?
- Does routing logic make sense?
- Are there skills NOT mentioned in router?
- Are there router entries for NON-EXISTENT skills?

**Faction alignment:**
- Read FACTIONS.md for this pack's faction principles
- Check 3-5 representative skills for voice/philosophy
- Identify drift patterns
- Severity: Minor (style drift) / Major (wrong philosophy)

**Metadata validation:**
- plugin.json description matches actual content?
- Skill count is accurate?
- Category is appropriate?
- Version reflects current state?

**Navigation experience:**
- Can users find appropriate skills easily?
- Are skill names descriptive?
- Are descriptions helpful for discovery?

**Output:** Organization issues with severity

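The two router membership checks (skills not mentioned in the router, router entries for non-existent skills) reduce to set differences. A minimal sketch, assuming the skill names have already been collected by some other means; `check_router` and the example names are hypothetical.

```python
def check_router(router_entries, actual_skills):
    """Compare names listed in the router against skills that actually exist."""
    router, actual = set(router_entries), set(actual_skills)
    return {
        "unlisted_skills": sorted(actual - router),   # exist but router never mentions them
        "dangling_entries": sorted(router - actual),  # router points at nothing
    }


report = check_router(
    router_entries=["policy-gradients", "q-learning", "reward-shaping"],
    actual_skills=["policy-gradients", "q-learning", "exploration-strategies"],
)
print(report)
```

Whether the router's descriptions are accurate and its routing logic makes sense still has to be judged by reading it.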
## Generate Complete Report

Combine all analyses:

```
# Structural Review: [pack-name]

## Scorecard: [Critical / Major / Minor / Pass]

[If Critical]
Recommendation: [Rebuild from scratch / Enhance existing]
Rationale: [Specific reasons]

## Issues by Priority

### Critical Issues ([count])
- [Issue 1] - [Description]
- [Issue 2] - [Description]

### Major Issues ([count])
- [Issue 1] - [Description]
- [Issue 2] - [Description]

### Minor Issues ([count])
- [Issue 1] - [Description]
- [Issue 2] - [Description]

## Gap Analysis

### High Priority Gaps ([count])
- [Gap 1]
  - Skill name: [proposed-name]
  - Description: [draft description]
  - Scope: [small/medium/large]
  - Dependencies: [prerequisites]

### Medium Priority Gaps ([count])
[Same format]

### Low Priority Gaps ([count])
[Same format]

## Duplicate Analysis

- [Skill A] + [Skill B]
  - Type: [complete/partial/specialization/complementary]
  - Recommendation: [remove/merge/keep with cross-refs]
  - Rationale: [why]

## Organization Issues

### Router ([issues count])
- [Issue description]

### Faction Alignment ([severity])
- [Drift pattern]
- Affected skills: [list]

### Metadata ([issues count])
- [Issue description]

## Recommended Actions

**Gaps requiring superpowers:writing-skills:**
- [count] new skills needed (RED-GREEN-REFACTOR for each, outside this workflow)

**Structure fixes (for later execution phase):**
- Remove: [count] duplicate skills
- Merge: [count] partial duplicates
- Update router: [Yes/No]
```

## Decision Gate

Present scorecard and report to user:

**If Critical:**
- Explain rebuild vs. enhance trade-offs
- Get user decision before proceeding

**If Major/Minor/Pass:**
- Present findings
- Confirm user wants to proceed with Pass 2 (content testing)

## Proceeding to Next Stage

After scorecard approval:
- If proceeding → Move to `testing-skill-quality.md` (Pass 2)
- If rebuilding → Stop maintenance workflow, switch to creation workflow
- If canceling → Stop workflow

## Common Mistakes

| Mistake | Fix |
|---------|-----|
| Scorecard too lenient | Be honest: missing 50% of core = Critical |
| Vague gap descriptions | Draft actual skill names and descriptions |
| Keeping all duplicates | Duplicates confuse users - merge or remove |
| Ignoring faction drift | Faction identity matters - flag misalignment |
| Skipping metadata check | Inaccurate metadata breaks discovery |
340
skills/using-skillpack-maintenance/testing-skill-quality.md
Normal file
@@ -0,0 +1,340 @@
# Testing Skill Quality

**Purpose:** Pass 2 - Run gauntlet tests on each skill using subagents to identify issues requiring fixes.

## Core Principle

**Behavioral testing, NOT syntactic validation.**

Skills are process documentation. Test if they guide agents correctly, not if they parse correctly.

## What We're Testing

**Effectiveness questions:**
- Does the skill actually guide agents correctly?
- Do agents follow the skill under pressure?
- Does the skill handle edge cases?
- Are there gaps in guidance that leave agents stranded?

**What we're NOT testing:**
- Syntax (markdown parsing, code syntax) - syntactic, not behavioral
- Coverage (already done in Pass 1) - structural, not behavioral
- Quality benchmarking (comparing to other skills) - comparative, not behavioral

## Gauntlet Design

**Priority: A → C → B**

### A. Pressure Scenarios (Catch Rationalizations)

**Purpose:** Test if skill holds up when agents want to skip it.

**Pressure types:**

**1. Time pressure:**
- "This is urgent, we need it done quickly"
- "Just get it working, we can improve it later"
- "The deadline is in an hour"

**2. Simplicity temptation:**
- "This seems too simple to need [skill pattern]"
- "The example is straightforward, no need to overthink"
- "This is a trivial case"

**3. Overkill perception:**
- "The skill is designed for complex cases, this is basic"
- "We don't need the full process for this small change"
- "That's way more than necessary"

**4. Sunk cost:**
- "I already wrote most of the code"
- "We've invested time in this approach"
- "Just need to finish this last part"

**Design approach:**
- Combine 2-3 pressures for maximum effect
- Example: Time pressure + simplicity + sunk cost
- Watch for rationalizations and document them verbatim

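Combining 2-3 pressures can be enumerated rather than improvised. The sketch below is illustrative only: the `PRESSURES` table reuses one quoted phrase per pressure type from the lists above, while `pressure_combos`, `build_prompt`, and the sample task are hypothetical.

```python
from itertools import combinations

# One representative phrase per pressure type, taken from the lists above.
PRESSURES = {
    "time": "The deadline is in an hour.",
    "simplicity": "This is a trivial case.",
    "overkill": "We don't need the full process for this small change.",
    "sunk_cost": "I already wrote most of the code.",
}


def pressure_combos(min_size=2, max_size=3):
    """Yield every combination of 2-3 pressure types."""
    names = sorted(PRESSURES)
    for size in range(min_size, max_size + 1):
        yield from combinations(names, size)


def build_prompt(task, combo):
    """Append the chosen pressure phrases to a task description."""
    return " ".join([task] + [PRESSURES[name] for name in combo])


combos = list(pressure_combos())
print(len(combos))  # 10 (6 pairs + 4 triples)
print(build_prompt("Fix the flaky integration test.", ("time", "sunk_cost")))
```

In practice a scenario reads better when the pressures are woven into the task narrative; the enumeration just makes sure no combination is overlooked.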
### C. Adversarial Edge Cases (Test Robustness)

**Purpose:** Test if skill provides guidance for corner cases.

**Edge case types:**

**1. Principle conflicts:**
- When skill's guidelines conflict with each other
- Example: "DRY vs. explicit" or "test-first vs. prototyping"
- Does skill help resolve conflict?

**2. Naive application failures:**
- Cases where following skill literally doesn't work
- Example: TDD for exploratory research code
- Does skill explain when/how to adapt?

**3. Missing information:**
- Scenarios requiring knowledge skill doesn't provide
- Does skill reference other resources?
- Does it leave agent completely stuck?

**4. Tool limitations:**
- When environment doesn't support skill's approach
- Example: No test framework available
- Does skill have fallback guidance?

**Design approach:**
- Identify skill's core principles
- Find situations where they conflict or fail
- Test if skill handles gracefully

### B. Real-World Complexity (Validate Utility)

**Purpose:** Test if skill guides toward best practices in realistic scenarios.

**Complexity types:**

**1. Messy requirements:**
- Unclear specifications
- Conflicting stakeholder needs
- Evolving requirements mid-task

**2. Multiple valid approaches:**
- Several solutions, all reasonable
- Trade-offs between options
- Does skill help choose?

**3. Integration constraints:**
- Existing codebase patterns
- Team conventions
- Technical debt

**4. Incomplete information:**
- Missing context
- Unknown dependencies
- Undocumented behavior

**Design approach:**
- Use realistic scenarios from the domain
- Include ambiguity and messiness
- Test if skill provides actionable guidance

## Testing Process (Per Skill)

**D - Iterative Hardening:**

### 1. Design Challenging Scenario

Pick from gauntlet categories (prioritize A → C → B):

**For discipline-enforcing skills** (TDD, verification-before-completion):
- Focus on **A (pressure)** scenarios
- Combine multiple pressures
- Test rationalization resistance

**For technique skills** (condition-based-waiting, root-cause-tracing):
- Focus on **C (edge cases)** and **B (real-world)**
- Test application correctness
- Test gap identification

**For pattern skills** (reducing-complexity, information-hiding):
- Focus on **C (edge cases)** and **B (real-world)**
- Test recognition and application
- Test when NOT to apply

**For reference skills** (API docs, command references):
- Focus on **B (real-world)**
- Test information retrieval
- Test application of retrieved info

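The skill-type guidance above amounts to a lookup table. A minimal sketch: the dictionary keys and `focus_for` are hypothetical names, and falling back to the general A → C → B order for an unrecognized type is an assumption, not something the process specifies.

```python
# Gauntlet categories, as labelled in the Gauntlet Design section.
CATEGORY_NAMES = {
    "A": "pressure scenarios",
    "C": "adversarial edge cases",
    "B": "real-world complexity",
}

# Which categories to focus on for each kind of skill.
GAUNTLET_FOCUS = {
    "discipline": ["A"],
    "technique": ["C", "B"],
    "pattern": ["C", "B"],
    "reference": ["B"],
}

DEFAULT_PRIORITY = ["A", "C", "B"]  # assumed fallback: the general priority order


def focus_for(skill_type):
    """Return the gauntlet categories to emphasize for a skill type."""
    return GAUNTLET_FOCUS.get(skill_type, DEFAULT_PRIORITY)


for kind in ("discipline", "technique", "reference", "unknown"):
    print(kind, [CATEGORY_NAMES[c] for c in focus_for(kind)])
```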
### 2. Run Subagent with Current Skill

**Critical:** Use the Task tool to dispatch subagent.

**Provide to subagent:**
- The scenario (task description)
- Access to the skill being tested
- Any necessary context (codebase, tools)

**What NOT to provide:**
- Meta-testing instructions (don't tell them they're being tested)
- Expected behavior (let them apply skill naturally)
- Hints about what you're looking for

### 3. Observe and Document

**Watch for:**

**Compliance:**
- Did agent follow the skill?
- Did they reference it explicitly?
- Did they apply patterns correctly?

**Rationalizations (verbatim):**
- Exact words used to skip steps
- Justifications for shortcuts
- "Spirit vs. letter" arguments

**Failure modes:**
- Where did skill guidance fail?
- Where was agent left without guidance?
- Where did naive application break?

**Edge case handling:**
- Did skill provide guidance for corner cases?
- Did agent get stuck?
- Did they improvise (potentially incorrectly)?

### 4. Assess Result

**Pass criteria:**
- Agent followed skill correctly
- Skill provided sufficient guidance
- No significant rationalizations
- Edge cases handled appropriately

**Fix needed criteria:**
- Agent skipped skill steps (with rationalization)
- Skill had gaps leaving agent stuck
- Edge cases not covered
- Naive application failed

### 5. Document Issues

**If fix needed, document specifically:**

**Issue category:**
- Rationalization vulnerability (A)
- Edge case gap (C)
- Real-world guidance gap (B)
- Missing anti-pattern warning
- Unclear instructions
- Missing cross-reference

**Priority:**
- **Critical** - Skill fails basic use cases, agents skip it consistently
- **Major** - Edge cases fail, significant gaps in guidance
- **Minor** - Clarity improvements, additional examples needed

**Specific fixes needed:**
- "Add explicit counter for rationalization: [quote]"
- "Add guidance for edge case: [description]"
- "Add example for scenario: [description]"
- "Clarify instruction: [which section]"

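Steps 4 and 5 can be captured in a small record structure so that results stay consistent across skills. This is a sketch under assumptions: `Issue`, `SkillResult`, `summarize`, and all field names are hypothetical, and reporting a skill at the worst priority among its issues is an assumed convention.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

PRIORITIES = ["Critical", "Major", "Minor"]  # worst first


@dataclass
class Issue:
    category: str  # e.g. "Rationalization", "Edge case", "Real-world gap"
    priority: str  # one of PRIORITIES
    evidence: str  # verbatim quote from the subagent, if any
    fix: str       # specific action to take


@dataclass
class SkillResult:
    skill: str
    scenario: str
    issues: list[Issue] = field(default_factory=list)

    @property
    def result(self) -> str:
        """A skill with no documented issues passes."""
        return "Fix needed" if self.issues else "Pass"

    @property
    def priority(self) -> Optional[str]:
        """Worst priority among the issues, or None when the skill passed."""
        if not self.issues:
            return None
        return min((issue.priority for issue in self.issues), key=PRIORITIES.index)


def summarize(results: list[SkillResult]) -> dict[str, int]:
    """Count passes, failures, and failures per priority."""
    by_priority = Counter(r.priority for r in results if r.issues)
    return {
        "total": len(results),
        "passed": sum(1 for r in results if not r.issues),
        "fix_needed": sum(1 for r in results if r.issues),
        **{p: by_priority.get(p, 0) for p in PRIORITIES},
    }
```

The `evidence` field is where verbatim rationalizations belong; paraphrasing them loses the exact wording a fix needs to counter.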
## Testing Multiple Skills

**Strategy:**

**Priority order:**
1. Router skills first (affects all specialist discovery)
2. Foundational skills (prerequisites for others)
3. Core technique skills (most frequently used)
4. Advanced skills (expert-level)

**Batch approach:**
- Test 3-5 skills at a time
- Document results before moving to next batch
- Allows pattern recognition across skills

**Efficiency:**
- Skills that passed in previous maintenance cycles: Spot-check only
- New skills or significantly changed: Full gauntlet
- Minor edits: Targeted testing of changed sections

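The priority order and batch size can be combined into a simple test plan. A minimal sketch: `plan_batches`, the tier labels, and the example skill names are hypothetical; only the ordering and the 3-5 batch size come from the strategy above.

```python
TIER_ORDER = ["router", "foundational", "core", "advanced"]


def plan_batches(skills, batch_size=5):
    """Order (name, tier) pairs by tier, then split them into batches of 3-5."""
    if not 3 <= batch_size <= 5:
        raise ValueError("batch_size should be between 3 and 5")
    ordered = sorted(skills, key=lambda item: TIER_ORDER.index(item[1]))
    names = [name for name, _tier in ordered]
    return [names[i:i + batch_size] for i in range(0, len(names), batch_size)]


skills = [
    ("advanced-tuning", "advanced"),
    ("core-technique", "core"),
    ("using-the-pack", "router"),
    ("basic-concepts", "foundational"),
]
print(plan_batches(skills, batch_size=3))
# [['using-the-pack', 'basic-concepts', 'core-technique'], ['advanced-tuning']]
```

Results for one batch should be documented before the next batch starts, so the plan is a queue rather than something to run all at once.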
## Output Format

Generate per-skill report:

```
# Quality Testing Results: [pack-name]

## Summary

- Total skills tested: [count]
- Passed: [count]
- Fix needed: [count]
  - Critical: [count]
  - Major: [count]
  - Minor: [count]

## Detailed Results

### [Skill 1 Name]

**Result:** [Pass / Fix needed]

[If Fix needed]

**Priority:** [Critical / Major / Minor]

**Test scenario used:** [Brief description]

**Issues identified:**

1. **Issue:** [Description]
   - **Category:** [Rationalization / Edge case / Real-world gap / etc.]
   - **Evidence:** "[Verbatim quote from subagent if applicable]"
   - **Fix needed:** [Specific action]

2. **Issue:** [Description]
   [Same format]

**Test transcript:** [Link or summary of subagent behavior]

---

### [Skill 2 Name]

**Result:** Pass

**Test scenario used:** [Brief description]

**Notes:** Skill performed well, no issues identified.

---

[Repeat for all skills]
```

## Common Rationalizations (Meta-Testing)

When YOU are doing the testing, watch for these rationalizations:

| Excuse | Reality |
|--------|---------|
| "Skill looks good, no need to test" | Looking ≠ testing. Run gauntlet. |
| "I'll just check the syntax" | Syntactic validation ≠ behavioral. Use subagents. |
| "Testing is overkill for small changes" | Small changes can break guidance. Test anyway. |
| "I'm confident this works" | Confidence ≠ validation. Test behavior. |
| "Quality benchmarking is enough" | Comparison ≠ effectiveness. Test with scenarios. |

**If you catch yourself thinking these → STOP. Run gauntlet with subagents.**

## Philosophy

**D as gauntlet + B for fixes:**

- **D (iterative hardening):** Run challenging scenarios to identify issues
- **B (targeted fixes):** Fix specific identified problems

If skill passes gauntlet → No changes needed.

The LLM is both author and judge of skill fitness. Trust the testing process.

## Proceeding to Next Stage

After testing all skills:
- Compile complete test report
- Proceed to Pass 3 (coherence validation)
- Test results will inform implementation fixes in Stage 4

## Anti-Patterns

| Anti-Pattern | Why Bad | Instead |
|--------------|---------|---------|
| Syntactic validation only | Doesn't test if skill actually works | Run behavioral tests with subagents |
| Self-assessment | You can't objectively test your own work | Dispatch subagents for testing |
| "Looks good" review | Visual inspection ≠ behavioral testing | Run gauntlet scenarios |
| Skipping pressure tests | Miss rationalization vulnerabilities | Use A-priority pressure scenarios |
| Generic test scenarios | Don't reveal real issues | Use domain-specific, realistic scenarios |
| Testing without documenting | Can't track patterns or close loops | Document verbatim rationalizations |