Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:59:38 +08:00
commit 57790ee711
8 changed files with 1419 additions and 0 deletions

View File

@@ -0,0 +1,249 @@
---
name: using-skillpack-maintenance
description: Use when maintaining or enhancing existing skill packs in the skillpacks repository - systematic pack refresh through domain analysis, structure review, RED-GREEN-REFACTOR gauntlet testing, and automated quality improvements
---
# Skillpack Maintenance
## Overview
Systematic maintenance and enhancement of existing skill packs using investigative domain analysis, RED-GREEN-REFACTOR testing, and automated improvements.
**Core principle:** Maintenance uses behavioral testing (gauntlet with subagents), not syntactic validation. Skills are process documentation - test if they guide agents correctly, not if they parse correctly.
## When to Use
Use when:
- Enhancing an existing skill pack (e.g., "refresh yzmir-deep-rl")
- Improving existing SKILL.md files
- Identifying gaps in pack coverage
- Validating skill quality through testing
**Do NOT use for:**
- Creating new skills from scratch (use superpowers:writing-skills)
- Creating new packs from scratch (design first, then use creation workflow)
## The Iron Law
**NO SKILL CHANGES WITHOUT BEHAVIORAL TESTING**
Syntactic validation (does it parse?) ≠ Behavioral testing (does it work?)
## Common Rationalizations (from baseline testing)
| Excuse | Reality |
|--------|---------|
| "Syntactic validation is sufficient" | Parsing ≠ effectiveness. Test with subagents. |
| "Quality benchmarking = effectiveness" | Comparing structure ≠ testing behavior. Run gauntlet. |
| "Comprehensive coverage = working skill" | Coverage ≠ guidance quality. Test if agents follow it. |
| "Following patterns = success" | Pattern-matching ≠ validation. Behavioral testing required. |
| "I'll test if issues emerge" | Issues = broken skills in production. Test BEFORE deploying. |
**All of these mean: Run behavioral tests with subagents. No exceptions.**
## Workflow Overview
**Review → Discuss → [Create New Skills if Needed] → Execute**
1. **Investigation & Scorecard** → Load `analyzing-pack-domain.md`
2. **Structure Review (Pass 1)** → Load `reviewing-pack-structure.md`
3. **Content Testing (Pass 2)** → Load `testing-skill-quality.md`
4. **Coherence Check (Pass 3)** → Validate cross-skill consistency
5. **Discussion** → Present findings, get approval
6. **[CONDITIONAL] Create New Skills** → If gaps identified, use `superpowers:writing-skills` for EACH gap (RED-GREEN-REFACTOR)
7. **Execution** → Load `implementing-fixes.md`, enhance existing skills only
8. **Commit** → Single commit with version bump
## Stage 1: Investigation & Scorecard
**Load briefing:** `analyzing-pack-domain.md`
**Purpose:** Establish "what this pack should cover" from first principles.
**Adaptive investigation (D→B→C→A):**
1. **User-guided scope (D)** - Ask user about pack intent and boundaries
2. **LLM knowledge analysis (B)** - Map domain comprehensively, flag if research needed
3. **Existing pack audit (C)** - Compare current state vs. coverage map
4. **Research if needed (A)** - Conditional: only if domain is rapidly evolving
**Output:** Domain coverage map, gap analysis, research currency flag
**Then: Load `reviewing-pack-structure.md` for scorecard**
**Scorecard levels:**
- **Critical** - Pack unusable, recommend rebuild vs. enhance
- **Major** - Significant gaps or duplicates
- **Minor** - Organizational improvements
- **Pass** - Structurally sound
**Decision gate:** Present scorecard → User decides: Proceed / Rebuild / Cancel
## Stage 2: Comprehensive Review
### Pass 1: Structure (from reviewing-pack-structure.md)
**Analyze:**
- Gaps (missing skills based on coverage map)
- Duplicates (overlapping coverage - merge/specialize/remove)
- Organization (router accuracy, faction alignment, metadata sync)
**Output:** Structural issues with priorities (critical/major/minor)
### Pass 2: Content Quality (from testing-skill-quality.md)
**CRITICAL:** This is behavioral testing with subagents, not syntactic validation.
**Gauntlet design (A→C→B priority):**
**A. Pressure scenarios** - Catch rationalizations:
- Time pressure: "This is urgent, just do it quickly"
- Simplicity temptation: "Too simple to need the skill"
- Overkill perception: "Skill is for complex cases, this is straightforward"
**C. Adversarial edge cases** - Test robustness:
- Corner cases where skill principles conflict
- Situations where naive application fails
**B. Real-world complexity** - Validate utility:
- Messy requirements, unclear constraints
- Multiple valid approaches
**Testing process per skill:**
1. Design challenging scenario from gauntlet categories
2. **Run subagent WITH current skill** (behavioral test)
3. Observe: Does it follow? Where does it rationalize/fail?
4. Document failure modes
5. Result: Pass OR Fix needed (with specific issues listed)
**Philosophy:** Strategy D (iterative hardening) serves as the gauntlet to surface issues; strategy B applies targeted fixes. If a skill passes the gauntlet, no changes are needed.
**Output:** Per-skill test results (Pass / Fix needed + priorities)
### Pass 3: Coherence
**After structure/content analysis, validate pack-level coherence:**
1. **Cross-skill consistency** - Terminology, examples, cross-references
2. **Router accuracy** - Does using-X router reflect current specialists?
3. **Faction alignment** - Check FACTIONS.md, flag drift, suggest rehoming if needed
4. **Metadata sync** - plugin.json description, skill count
5. **Navigation** - Can users find skills easily?
**CRITICAL:** Update skills to reference new/enhanced skills (post-update hygiene)
**Output:** Coherence issues, faction drift flags
## Stage 3: Interactive Discussion
**Present findings conversationally:**
**Structural category:**
- **Gaps requiring superpowers:writing-skills** (new skills needed - each requires RED-GREEN-REFACTOR)
- Duplicates to remove/merge
- Organization issues
**Content category:**
- Skills needing enhancement (from gauntlet failures)
- Severity levels (critical/major/minor)
- Specific failure modes identified
**Coherence category:**
- Cross-reference updates needed
- Faction alignment issues
- Metadata corrections
**Get user approval for scope of work**
**CRITICAL DECISION POINT:** If gaps (new skills) were identified:
- User approves → **IMMEDIATELY use superpowers:writing-skills for EACH gap**
- Do NOT proceed to Stage 4 until ALL new skills are created and tested
- Each gap = separate RED-GREEN-REFACTOR cycle
- Return to Stage 4 only after ALL gaps are filled
## Stage 4: Autonomous Execution
**Load briefing:** `implementing-fixes.md`
**PREREQUISITE CHECK:**
- ✓ Zero gaps identified, OR
- ✓ All gaps already filled using superpowers:writing-skills (each skill individually tested)
**If gaps exist and you haven't used writing-skills:** STOP. Return to Stage 3.
**Execute approved changes:**
1. **Structural fixes** - Remove/merge duplicate skills, update router
2. **Content enhancements** - Fix gauntlet failures, add missing guidance to existing skills
3. **Coherence improvements** - Cross-references, terminology alignment, faction voice
4. **Version management** - Apply impact-based bump (patch/minor/major)
5. **Git commit** - Single commit with all changes
**Version bump rules (impact-based):**
- **Patch (x.y.Z)** - Low-impact: typos, formatting, minor clarifications
- **Minor (x.Y.0)** - Medium-impact: enhanced guidance, new skills, better examples (DEFAULT)
- **Major (X.0.0)** - High-impact: skills removed, structural changes, philosophy shifts (RARE)
**Commit format:**
```
feat(meta): enhance [pack-name] - [summary]
[Detailed list of changes by category]
- Structure: [changes]
- Content: [changes]
- Coherence: [changes]
Version bump: [reason for patch/minor/major]
```
**Output:** Enhanced pack, commit created, summary report
## Briefing Files Reference
All briefing files are in this skill directory:
- `analyzing-pack-domain.md` - Investigative domain analysis (D→B→C→A)
- `reviewing-pack-structure.md` - Structure review, scorecard, gap/duplicate analysis
- `testing-skill-quality.md` - Gauntlet testing methodology with subagents
- `implementing-fixes.md` - Autonomous execution, version management, git commit
**Load appropriate briefing at each stage.**
## Critical Distinctions
**Behavioral vs. Syntactic Testing:**
- **Syntactic:** "Does Python code parse?" → ast.parse()
- **Behavioral:** "Does skill guide agents correctly?" → Subagent gauntlet
**This workflow requires BEHAVIORAL testing.**
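A minimal sketch of the syntactic half, for contrast - it assumes skills are SKILL.md files with YAML front matter and that PyYAML is available (both illustrative assumptions). Passing this check proves nothing behavioral:
```python
# Sketch: syntactic validation only. A skill can pass this and still fail
# to guide agents - behavioral testing means dispatching a subagent against
# a gauntlet scenario (see testing-skill-quality.md).
import yaml  # assumes PyYAML is available
from pathlib import Path

def syntactic_check(skill_path: str) -> bool:
    """True if SKILL.md has well-formed YAML front matter with name/description."""
    text = Path(skill_path).read_text(encoding="utf-8")
    if not text.startswith("---"):
        return False
    try:
        # Front matter sits between the first two '---' markers.
        meta = yaml.safe_load(text.split("---", 2)[1])
    except yaml.YAMLError:
        return False
    return isinstance(meta, dict) and {"name", "description"} <= meta.keys()
```
There is no equivalent one-liner for the behavioral half - that is the point.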
**Maintenance vs. Creation:**
- **Maintenance** (this skill): Enhancing existing SKILL.md files
- **Creation** (superpowers:writing-skills): Writing new skills from scratch
**Use the right tool for the task.**
## Red Flags - STOP and Switch Tools
If you catch yourself thinking ANY of these:
- "I'll write the new skills during execution" → NO. Use superpowers:writing-skills for EACH gap
- "implementing-fixes.md says to create skills" → NO. That section was REMOVED. Exit and use writing-skills
- "Token efficiency - I can just write good skills" → NO. Untested skills = broken skills
- "I see the pattern, I can replicate it" → NO. Pattern-matching ≠ behavioral testing
- "User wants this done quickly" → NO. Fast + untested = waste of time fixing later
- "I'm competent, testing is overkill" → NO. Competence = following the process
- "Gaps were approved, so I should fill them" → YES, but using writing-skills, not here
- Validating syntax instead of behavior → Load testing-skill-quality.md
- Skipping gauntlet testing → You're violating the Iron Law
- Making changes without user approval → Follow Review→Discuss→Execute
**All of these mean: STOP. Exit workflow. Use superpowers:writing-skills.**
## The Bottom Line
**Maintaining skills requires behavioral testing, not syntactic validation.**
Same principle as code: you test behavior, not syntax.
Load briefings at each stage. Test with subagents. Get approval. Execute.
No shortcuts. No rationalizations.

View File

@@ -0,0 +1,168 @@
# Analyzing Pack Domain
**Purpose:** Investigative process to establish "what this pack should cover" from first principles.
## Adaptive Investigation Workflow
**Sequence: D → B → C → A (conditional)**
### Phase D: User-Guided Scope
**Ask user:**
- "What is the intended scope and purpose of [pack-name]?"
- "What boundaries should this pack respect?"
- "Who is the target audience? (beginners / practitioners / experts)"
- "What depth of coverage is expected? (overview / comprehensive / exhaustive)"
**Document:**
- User's vision as baseline
- Explicit boundaries (what's IN scope, what's OUT of scope)
- Success criteria (what makes this pack "complete"?)
### Phase B: LLM Knowledge-Based Analysis
**Leverage model knowledge to map the domain:**
1. **Generate comprehensive coverage map:**
- What are the major concepts/algorithms/techniques in this domain?
- What are the core patterns practitioners need to know?
- What are common implementation challenges?
2. **Identify structure:**
- Foundational concepts (must understand first)
- Core techniques (bread-and-butter patterns)
- Advanced topics (expert-level material)
- Cross-cutting concerns (testing, debugging, optimization)
3. **Flag research currency:**
- Is this domain stable or rapidly evolving?
- Stable examples: Design patterns, basic algorithms, established protocols
- Evolving examples: AI/ML, security, modern web frameworks
- If evolving → Flag for Phase A research
**Output:** Coverage map with categorization (foundational/core/advanced)
### Phase C: Existing Pack Audit
**Read all current skills in the pack:**
1. **Inventory** (see the script sketch after this phase's outputs):
- List all SKILL.md files
- Note skill names and descriptions
- Check router skill (if exists) for specialist list
2. **Compare against coverage map:**
- What's covered? (existing skills matching coverage areas)
- What's missing? (gaps in foundational/core/advanced areas)
- What overlaps? (multiple skills covering same concept)
- What's obsolete? (outdated approaches, deprecated patterns)
3. **Quality check:**
- Are descriptions accurate?
- Do skills match their descriptions?
- Are there broken cross-references?
**Output:** Gap list, duplicate list, obsolescence flags
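A minimal sketch of the inventory step, assuming the layout `plugins/<pack>/skills/<skill>/SKILL.md` and YAML front matter with name/description fields (assumptions, not guarantees of this repo):
```python
# Sketch: inventory all SKILL.md files in a pack, collecting name/description
# from YAML front matter. The plugins/<pack>/skills/* layout is an assumption.
import yaml
from pathlib import Path

def inventory_pack(pack_dir: str) -> list[dict]:
    skills = []
    for skill_md in sorted(Path(pack_dir).glob("skills/*/SKILL.md")):
        front_matter = skill_md.read_text(encoding="utf-8").split("---", 2)[1]
        meta = yaml.safe_load(front_matter) or {}
        skills.append({
            "path": str(skill_md),
            "name": meta.get("name", skill_md.parent.name),
            "description": meta.get("description", ""),
        })
    return skills

# Example: one-line summary per skill, as input to the comparison step.
for skill in inventory_pack("plugins/example-pack"):
    print(f"{skill['name']}: {skill['description'][:60]}")
```
The resulting list feeds the comparison against the Phase B coverage map.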
### Phase A: Research (Conditional)
**ONLY if Phase B flagged domain as rapidly evolving.**
**Research authoritative sources:**
1. **For AI/ML domains:**
- Latest survey papers (search: "[domain] survey 2024/2025")
- Current textbooks (check publication dates)
- Official library documentation (PyTorch, TensorFlow, etc.)
- Research benchmarks (Papers with Code, etc.)
2. **For security domains:**
- OWASP guidelines
- NIST standards
- Recent CVE patterns
- Current threat landscape
3. **For framework domains:**
- Official documentation (latest version)
- Migration guides (breaking changes)
- Best practices (official recommendations)
**Update coverage map:**
- Add new techniques/patterns
- Flag deprecated approaches in existing skills
- Note version-specific considerations
**Decision criteria for Phase A:**
- **Skip research** for: Math, algorithms, design patterns, established protocols
- **Run research** for: AI/ML, security, modern frameworks, evolving standards
## Outputs
Generate comprehensive report:
### 1. Domain Coverage Map
```
Foundational:
- [Concept 1] - [Status: Exists / Missing / Needs enhancement]
- [Concept 2] - [Status: Exists / Missing / Needs enhancement]
Core Techniques:
- [Technique 1] - [Status: ...]
- [Technique 2] - [Status: ...]
Advanced Topics:
- [Topic 1] - [Status: ...]
- [Topic 2] - [Status: ...]
Cross-Cutting:
- [Concern 1] - [Status: ...]
```
### 2. Current State Assessment
```
Existing skills: [count]
- [Skill 1 name] - covers [domain area]
- [Skill 2 name] - covers [domain area]
...
```
### 3. Gap Analysis
```
Missing (High Priority):
- [Gap 1] - foundational concept not covered
- [Gap 2] - core technique missing
Missing (Medium Priority):
- [Gap 3] - advanced topic not covered
Duplicates:
- [Skill A] and [Skill B] overlap on [topic]
Obsolete:
- [Skill C] uses deprecated approach [old pattern]
```
### 4. Research Currency Flag
```
Domain stability: [Stable / Evolving]
Research conducted: [Yes / No / Not needed]
Currency concerns: [None / List specific areas needing updates]
```
## Proceeding to Next Stage
After completing investigation, hand off to `reviewing-pack-structure.md` for scorecard generation.
## Common Mistakes
| Mistake | Fix |
|---------|-----|
| Skipping user input (Phase D) | Always start with user vision - they define scope |
| Over-relying on LLM knowledge | For evolving domains, run research (Phase A) |
| Skipping gap analysis | Compare coverage map vs. existing skills systematically |
| Treating all domains as stable | Flag AI/ML/security/frameworks for research |
| Vague gap descriptions | Be specific: "Missing TaskGroup patterns" not "async needs work" |

View File

@@ -0,0 +1,334 @@
# Implementing Fixes
**Purpose:** Autonomous execution of approved changes with version management and git commit.
## Prerequisites
You should have completed and gotten approval for:
- Pass 1: Structure review (gaps, duplicates, organization)
- Pass 2: Content testing (gauntlet results, fix priorities)
- Pass 3: Coherence validation (cross-skill consistency, faction alignment)
- User discussion and approval of scope
**Do NOT proceed without user approval of the scope of work.**
## Execution Workflow
### 1. Structural Fixes (from Pass 1)
**CRITICAL CHECKPOINT - New Skills:**
**STOP:** Did you identify gaps (new skills needed) in Pass 1?
**If YES → You MUST exit this workflow NOW:**
1. **DO NOT proceed to execution**
2. **For EACH gap identified:**
- Use `superpowers:writing-skills` skill
- RED: Test scenario WITHOUT the skill
- GREEN: Write skill addressing gaps
- REFACTOR: Close loopholes
- **Commit that ONE skill**
3. **Repeat for ALL gaps** (each skill = separate RED-GREEN-REFACTOR cycle)
4. **AFTER all new skills are tested and committed:**
- Return to using-skillpack-maintenance
- Load this briefing again
- Continue with other structural fixes below
**Proceeding past this checkpoint assumes:**
- ✓ Zero new skills needed, OR
- ✓ All new skills already created via superpowers:writing-skills
- ✓ You are ONLY enhancing existing skills, removing duplicates, updating router/metadata
**If you identified ANY gaps and haven't used superpowers:writing-skills for each:**
**STOP. Exit now. You're violating the Iron Law: NO SKILL WITHOUT BEHAVIORAL TESTING.**
---
**Remove duplicate skills:**
For skills marked for removal:
1. Identify unique value in skill being removed
2. Merge unique value into kept skill (if any)
3. Delete duplicate SKILL.md and directory
4. Remove references from router (if exists)
5. Update cross-references in other skills
**Merge overlapping skills:**
For partial duplicates:
1. Identify all unique content from both skills
2. Create merged skill with comprehensive coverage
3. Reorganize structure if needed
4. Delete original skills
5. Update router and cross-references
6. Update skill name/description if needed
**Update router skill:**
If pack has using-X router:
1. Update specialist list to reflect adds/removes
2. Update descriptions to match current skills
3. Verify routing logic makes sense
4. Add cross-references as needed
### 2. Content Enhancements (from Pass 2)
**For each skill marked "Fix needed" in gauntlet testing:**
**Fix rationalizations (A-type issues):**
1. Add explicit counter for each identified rationalization
2. Update "Common Rationalizations" table
3. Add to "Red Flags" list if applicable
4. Strengthen "No exceptions" language
**Fill edge case gaps (C-type issues):**
1. Add guidance for identified corner cases
2. Document when/how to adapt core principles
3. Add examples for edge case handling
4. Cross-reference related skills if needed
**Enhance real-world guidance (B-type issues):**
1. Add examples from realistic scenarios
2. Clarify ambiguous instructions
3. Add decision frameworks where needed
4. Update "When to Use" section if unclear
**Add anti-patterns:**
1. Document observed failure modes from testing
2. Add ❌ WRONG / ✅ CORRECT examples
3. Update "Common Mistakes" section
4. Add warnings for subtle pitfalls
**Improve examples:**
1. Replace weak examples with tested scenarios
2. Ensure examples are complete and runnable
3. Add comments explaining WHY, not just WHAT
4. Use realistic domain context
### 3. Coherence Improvements (from Pass 3)
**Cross-reference updates:**
**CRITICAL:** This is post-update hygiene - ensure skills reference new/enhanced skills.
For each skill in pack:
1. Identify related skills (related concepts, prerequisites, follow-ups)
2. Add cross-references where helpful:
- "See [skill-name] for [related concept]"
- "**REQUIRED BACKGROUND:** [skill-name]"
- "After mastering this, see [skill-name]"
3. Update router cross-references
4. Ensure bidirectional links (if A references B, should B reference A?)
**Terminology alignment:**
1. Identify terminology inconsistencies across skills
2. Choose canonical terms (most clear/standard)
3. Update all skills to use canonical terms
4. Add glossary to router if needed
**Faction voice adjustment:**
For skills flagged with faction drift:
1. Read FACTIONS.md for faction principles
2. Adjust language/tone to match faction
3. Realign examples with faction philosophy
4. If severe drift: Flag for potential rehoming
**If rehoming recommended:**
- Document which faction skill should move to
- Note in commit message for manual handling
- Don't move skills automatically (requires marketplace changes)
**Metadata synchronization:**
Update `plugin.json`:
1. Description - ensure it matches enhanced pack content
2. Skill count - update it if the plugin.json schema tracks one
3. Verify category is appropriate
### 4. Version Management (Impact-Based)
**Assess impact of all changes:**
**Patch bump (x.y.Z) - Low impact:**
- Typos fixed
- Formatting improvements
- Minor clarifications (< 50 words added)
- Small example corrections
- No new skills, no skills removed
**Minor bump (x.Y.0) - Medium impact (DEFAULT):**
- Enhanced guidance (added sections, better examples)
- New skills added
- Improved existing skills significantly
- Better anti-pattern coverage
- Fixed gauntlet failures
- Updated for current best practices
**Major bump (X.0.0) - High impact (RARE, use sparingly):**
- Skills removed entirely
- Structural reorganization
- Philosophy shifts
- Breaking changes to how skills work
- Deprecated major patterns
**Decision logic:**
1. Any new skills added? → Minor minimum
2. Any skills removed? → Consider major
3. Only fixes/clarifications? → Patch
4. Enhanced multiple skills significantly? → Minor
5. Changed pack philosophy? → Major
**Default for maintenance reviews: Minor bump**
**Update version in plugin.json:**
```json
{
"version": "[new-version]"
}
```
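The decision logic and the version update can be scripted together; a sketch, assuming plugin.json carries a semver "version" field as shown above (helper names and input flags are illustrative):
```python
# Sketch: impact-based bump. "Consider major" for removals is simplified to
# always-major here - confirm against the rules above before applying.
import json
from pathlib import Path

def bump_type(skills_added: int, skills_removed: int,
              philosophy_shift: bool, significant_enhancements: bool) -> str:
    if skills_removed > 0 or philosophy_shift:
        return "major"
    if skills_added > 0 or significant_enhancements:
        return "minor"   # the default for maintenance reviews
    return "patch"       # typos, formatting, small clarifications only

def apply_bump(plugin_json: str, kind: str) -> str:
    path = Path(plugin_json)
    meta = json.loads(path.read_text(encoding="utf-8"))
    major, minor, patch = (int(x) for x in meta["version"].split("."))
    if kind == "major":
        major, minor, patch = major + 1, 0, 0
    elif kind == "minor":
        minor, patch = minor + 1, 0
    else:
        patch += 1
    meta["version"] = f"{major}.{minor}.{patch}"
    path.write_text(json.dumps(meta, indent=2) + "\n", encoding="utf-8")
    return meta["version"]
```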
### 5. Git Commit
**Single commit with all changes:**
**Commit message format:**
```
feat(meta): enhance [pack-name] - [one-line summary]
Structure changes:
- Added [count] new skills: [skill-1], [skill-2], ...
- Removed [count] duplicate skills: [skill-1], [skill-2], ...
- Merged [skill-a] + [skill-b] into [skill-merged]
- Updated router to reflect new structure
Content improvements:
- Enhanced [skill-1]: [specific improvements]
- Enhanced [skill-2]: [specific improvements]
- Fixed gauntlet failures in [skill-3]: [issues addressed]
- Added anti-patterns to [skill-4]
Coherence updates:
- Added cross-references between [count] skills
- Aligned terminology across pack
- Adjusted faction voice in [skill-name]
- Updated plugin.json metadata
Version: [old-version] → [new-version] ([patch/minor/major])
Rationale: [reason for version bump type]
```
**Commit command:**
```bash
git add plugins/[pack-name]/
git commit -m "$(cat <<'EOF'
feat(meta): enhance [pack-name] - [summary]
[Full message body as above]
EOF
)"
```
**Do NOT push** - let user decide when to push.
## Execution Principles
**Autonomous within approved scope:**
- Execute all approved changes without asking again
- Follow user's approved plan exactly
- Make editorial decisions within scope
- Ask only if something unexpected blocks progress
**Quality standards:**
- All new skills follow CSO guidelines (name/description format)
- All code examples are complete and appropriate to domain
- All cross-references are accurate
- Faction voice is maintained
**Verification before commit:**
- Verify YAML front matter syntax in all modified skills
- Check that all cross-references point to existing skills
- Ensure router (if exists) references all current skills
- Verify plugin.json has valid JSON syntax
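A hedged sketch covering the first and last checks above (cross-reference and router checks need name resolution beyond this; layout and front-matter convention are assumptions):
```python
# Sketch: pre-commit verification - YAML front matter parses, plugin.json is
# valid JSON. The plugins/<pack>/skills/* layout is an assumption.
import json
import yaml
from pathlib import Path

def verify_pack(pack_dir: str) -> list[str]:
    problems = []
    root = Path(pack_dir)

    for skill_md in sorted(root.glob("skills/*/SKILL.md")):
        text = skill_md.read_text(encoding="utf-8")
        try:
            meta = yaml.safe_load(text.split("---", 2)[1])
            if not isinstance(meta, dict):
                raise ValueError("front matter is not a mapping")
        except (IndexError, ValueError, yaml.YAMLError) as exc:
            problems.append(f"{skill_md}: bad YAML front matter ({exc})")

    try:
        json.loads((root / "plugin.json").read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError) as exc:
        problems.append(f"plugin.json: {exc}")

    return problems  # empty list = safe to commit
```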
## Output After Completion
Provide comprehensive summary:
```
# Pack Enhancement Complete: [pack-name]
## Version: [old] → [new] ([type])
## Summary Statistics
- Skills added: [count]
- Skills removed: [count]
- Skills enhanced: [count]
- Skills tested and passed: [count]
## Changes by Category
### Structure
[List of structural changes]
### Content
[List of content improvements]
### Coherence
[List of coherence updates]
## Git Commit
Created commit: [commit-hash if available]
Message: [first line of commit]
Ready to push: [Yes]
```
## Common Mistakes
| Mistake | Fix |
|---------|-----|
| Proceeding without approval | Always get user approval before executing |
| Batch changes across passes | Complete one pass fully before next |
| Inconsistent faction voice | Read FACTIONS.md, maintain voice throughout |
| Broken cross-references | Verify all referenced skills exist |
| Invalid YAML | Check syntax before committing |
| Pushing automatically | Let user decide when to push |
| Vague commit messages | Be specific about what changed and why |
| Wrong version bump | Follow impact-based rules, default to minor |
## Anti-Patterns
**❌ Changing scope during execution:**
- Don't add extra improvements not discussed
- Don't skip approved changes because "not needed"
- Stick to approved scope exactly
**❌ Sub-optimal quality:**
- Don't write quick/dirty skills to fill gaps
- Don't copy-paste without adapting to faction
- Don't skip cross-references to save time
**❌ Incomplete commits:**
- Don't commit partial work
- Don't split into multiple commits
- Single commit with all changes
**❌ No verification:**
- Don't assume syntax is correct
- Don't skip cross-reference checking
- Verify before committing
## The Bottom Line
**Execute approved changes autonomously with high quality standards.**
One commit. Proper versioning. Complete summary.
No shortcuts. No scope creep. Professional execution.

View File

@@ -0,0 +1,252 @@
# Reviewing Pack Structure
**Purpose:** Pass 1 - Analyze pack organization, identify structural issues, generate fitness scorecard.
## Inputs
From `analyzing-pack-domain.md`:
- Domain coverage map (what should exist)
- Current skill inventory (what does exist)
- Gap analysis (missing/duplicate/obsolete)
- Research currency flag
## Analysis Tasks
### 1. Fitness Scorecard
Generate scorecard with risk-driven prioritization:
**Critical Issues** - Pack unusable or fundamentally broken:
- Missing core foundational concepts (users can't understand basics)
- Major gaps in essential coverage (50%+ of core techniques missing)
- Router completely inaccurate or missing when needed
- Multiple skills broken or contradictory
**Decision:** Critical issues → Recommend "Rebuild from scratch" vs. "Enhance existing"
Rebuild if:
- More skills missing than exist
- Fundamental philosophy is wrong
- Faction mismatch is severe
Enhance if:
- Core structure is sound, just needs additions/fixes
- Most existing skills are salvageable
**Major Issues** - Significant effectiveness reduction:
- Important gaps in core coverage (20-50% of core techniques missing)
- Multiple duplicate skills causing confusion
- Obsolete skills teaching deprecated patterns
- Faction drift across multiple skills
- Metadata significantly out of sync
**Minor Issues** - Polish and improvements:
- Small gaps in advanced topics
- Minor organizational inconsistencies
- Router descriptions slightly outdated
- Small metadata corrections needed
**Pass** - Structurally sound:
- Comprehensive coverage of foundational and core areas
- No major gaps or duplicates
- Router (if exists) is accurate
- Faction alignment is good
- Metadata is current
**Output:** Scorecard with category and specific issues listed
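The coverage thresholds above can be made mechanical once the Stage 1 coverage map enumerates core techniques; a sketch (mapping sub-20% core gaps to Minor is an assumption, and the other critical/major signals still need judgment):
```python
# Sketch: map missing-core-coverage to a scorecard level using the stated
# 50% / 20% thresholds. Broken routers, contradictory skills, and faction
# drift are separate signals, folded into the flag argument here.
def scorecard_level(core_total: int, core_missing: int,
                    other_critical_signals: bool = False) -> str:
    ratio = core_missing / core_total if core_total else 0.0
    if ratio >= 0.5 or other_critical_signals:
        return "Critical"
    if ratio >= 0.2:
        return "Major"
    if core_missing > 0:
        return "Minor"   # assumption: small core gaps pattern with Minor
    return "Pass"
```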
### 2. Gap Identification
From coverage map, identify missing skills:
**Prioritize by importance:**
**High priority (foundational/core):**
- Foundational concepts users must understand
- Core techniques used frequently
- Common patterns missing from basics
**Medium priority (advanced):**
- Advanced topics for expert users
- Specialized techniques
- Edge case handling
**Low priority (nice-to-have):**
- Rare patterns
- Future-looking topics
- Experimental techniques
**For each gap:**
- Draft skill name (following naming conventions)
- Write description (following CSO guidelines)
- Estimate scope (small/medium/large skill)
- Note dependencies (what skills should be read first)
**Output:** Prioritized list of gaps with draft names/descriptions
### 3. Duplicate Detection
Find skills with overlapping coverage:
**Analysis process:**
1. Read all skill descriptions
2. Identify content overlap (skills covering same concepts)
3. Read overlapping skills to assess actual content
4. Determine relationship
**Duplicate types:**
**Complete duplicates** - Same content, different names:
- **Action:** Remove one, preserve unique value from both
**Partial overlap** - Significant shared content:
- **Action:** Merge into single comprehensive skill
**Specialization** - One general, one specific:
- **Action:** Keep both, clarify relationship via cross-references
- Example: "async-patterns" (general) + "asyncio-taskgroup" (specific)
**Complementary** - Different angles on same topic:
- **Action:** Keep both, strengthen cross-references
- Example: "testing-async-code" + "async-patterns-and-concurrency"
**False positive** - Similar names, different content:
- **Action:** No change, maybe clarify descriptions
**For each duplicate pair:**
- Classification (complete/partial/specialization/complementary/false)
- Recommendation (remove/merge/keep with cross-refs)
- Preserve unique value from each
**Output:** Duplicate analysis with recommendations
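A first-pass triage sketch: token overlap between descriptions shortlists candidate pairs for manual reading (Jaccard similarity is a crude heuristic and the threshold is an assumption - classifying pairs into the types above still requires reading both skills):
```python
# Sketch: shortlist candidate duplicate pairs by description token overlap.
# Takes the inventory produced in analyzing-pack-domain.md's Phase C.
from itertools import combinations

def tokens(text: str) -> set[str]:
    return {w for w in text.lower().split() if len(w) > 3}

def candidate_duplicates(skills: list[dict], threshold: float = 0.35):
    """skills: [{'name': ..., 'description': ...}]; yields (a, b, score)."""
    for a, b in combinations(skills, 2):
        ta, tb = tokens(a["description"]), tokens(b["description"])
        if not ta or not tb:
            continue
        jaccard = len(ta & tb) / len(ta | tb)
        if jaccard >= threshold:
            yield a["name"], b["name"], round(jaccard, 2)
```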
### 4. Organization Validation
Check pack-level organization:
**Router skill validation (if exists):**
- Does router list all current specialist skills?
- Are descriptions in router accurate?
- Does routing logic make sense?
- Are there skills NOT mentioned in router?
- Are there router entries for NON-EXISTENT skills?
**Faction alignment:**
- Read FACTIONS.md for this pack's faction principles
- Check 3-5 representative skills for voice/philosophy
- Identify drift patterns
- Severity: Minor (style drift) / Major (wrong philosophy)
**Metadata validation:**
- plugin.json description matches actual content?
- Skill count is accurate?
- Category is appropriate?
- Version reflects current state?
**Navigation experience:**
- Can users find appropriate skills easily?
- Are skill names descriptive?
- Are descriptions helpful for discovery?
**Output:** Organization issues with severity
## Generate Complete Report
Combine all analyses:
```
# Structural Review: [pack-name]
## Scorecard: [Critical / Major / Minor / Pass]
[If Critical]
Recommendation: [Rebuild from scratch / Enhance existing]
Rationale: [Specific reasons]
## Issues by Priority
### Critical Issues ([count])
- [Issue 1] - [Description]
- [Issue 2] - [Description]
### Major Issues ([count])
- [Issue 1] - [Description]
- [Issue 2] - [Description]
### Minor Issues ([count])
- [Issue 1] - [Description]
- [Issue 2] - [Description]
## Gap Analysis
### High Priority Gaps ([count])
- [Gap 1]
- Skill name: [proposed-name]
- Description: [draft description]
- Scope: [small/medium/large]
- Dependencies: [prerequisites]
### Medium Priority Gaps ([count])
[Same format]
### Low Priority Gaps ([count])
[Same format]
## Duplicate Analysis
- [Skill A] + [Skill B]
- Type: [complete/partial/specialization/complementary]
- Recommendation: [remove/merge/keep with cross-refs]
- Rationale: [why]
## Organization Issues
### Router ([issues count])
- [Issue description]
### Faction Alignment ([severity])
- [Drift pattern]
- Affected skills: [list]
### Metadata ([issues count])
- [Issue description]
## Recommended Actions
**Gaps requiring superpowers:writing-skills:**
- [count] new skills needed (RED-GREEN-REFACTOR for each, outside this workflow)
**Structure fixes (for later execution phase):**
- Remove: [count] duplicate skills
- Merge: [count] partial duplicates
- Update router: [Yes/No]
```
## Decision Gate
Present scorecard and report to user:
**If Critical:**
- Explain rebuild vs. enhance trade-offs
- Get user decision before proceeding
**If Major/Minor/Pass:**
- Present findings
- Confirm user wants to proceed with Pass 2 (content testing)
## Proceeding to Next Stage
After scorecard approval:
- If proceeding → Move to `testing-skill-quality.md` (Pass 2)
- If rebuilding → Stop maintenance workflow, switch to creation workflow
- If canceling → Stop workflow
## Common Mistakes
| Mistake | Fix |
|---------|-----|
| Scorecard too lenient | Be honest: missing 50% of core = Critical |
| Vague gap descriptions | Draft actual skill names and descriptions |
| Keeping all duplicates | Duplicates confuse users - merge or remove |
| Ignoring faction drift | Faction identity matters - flag misalignment |
| Skipping metadata check | Inaccurate metadata breaks discovery |

View File

@@ -0,0 +1,340 @@
# Testing Skill Quality
**Purpose:** Pass 2 - Run gauntlet tests on each skill using subagents to identify issues requiring fixes.
## Core Principle
**Behavioral testing, NOT syntactic validation.**
Skills are process documentation. Test if they guide agents correctly, not if they parse correctly.
## What We're Testing
**Effectiveness questions:**
- Does the skill actually guide agents correctly?
- Do agents follow the skill under pressure?
- Does the skill handle edge cases?
- Are there gaps in guidance that leave agents stranded?
**What we're NOT testing:**
- Syntax (markdown parsing, code syntax) - syntactic, not behavioral
- Coverage (already done in Pass 1) - structural, not behavioral
- Quality benchmarking (comparing to other skills) - comparative, not behavioral
## Gauntlet Design
**Priority: A → C → B**
### A. Pressure Scenarios (Catch Rationalizations)
**Purpose:** Test if skill holds up when agents want to skip it.
**Pressure types:**
**1. Time pressure:**
- "This is urgent, we need it done quickly"
- "Just get it working, we can improve it later"
- "The deadline is in an hour"
**2. Simplicity temptation:**
- "This seems too simple to need [skill pattern]"
- "The example is straightforward, no need to overthink"
- "This is a trivial case"
**3. Overkill perception:**
- "The skill is designed for complex cases, this is basic"
- "We don't need the full process for this small change"
- "That's way more than necessary"
**4. Sunk cost:**
- "I already wrote most of the code"
- "We've invested time in this approach"
- "Just need to finish this last part"
**Design approach:**
- Combine 2-3 pressures for maximum effect
- Example: Time pressure + simplicity + sunk cost
- Watch for rationalizations (verbatim documentation critical)
### C. Adversarial Edge Cases (Test Robustness)
**Purpose:** Test if skill provides guidance for corner cases.
**Edge case types:**
**1. Principle conflicts:**
- When skill's guidelines conflict with each other
- Example: "DRY vs. explicit" or "test-first vs. prototyping"
- Does skill help resolve conflict?
**2. Naive application failures:**
- Cases where following skill literally doesn't work
- Example: TDD for exploratory research code
- Does skill explain when/how to adapt?
**3. Missing information:**
- Scenarios requiring knowledge skill doesn't provide
- Does skill reference other resources?
- Does it leave agent completely stuck?
**4. Tool limitations:**
- When environment doesn't support skill's approach
- Example: No test framework available
- Does skill have fallback guidance?
**Design approach:**
- Identify skill's core principles
- Find situations where they conflict or fail
- Test if skill handles gracefully
### B. Real-World Complexity (Validate Utility)
**Purpose:** Test if skill guides toward best practices in realistic scenarios.
**Complexity types:**
**1. Messy requirements:**
- Unclear specifications
- Conflicting stakeholder needs
- Evolving requirements mid-task
**2. Multiple valid approaches:**
- Several solutions, all reasonable
- Trade-offs between options
- Does skill help choose?
**3. Integration constraints:**
- Existing codebase patterns
- Team conventions
- Technical debt
**4. Incomplete information:**
- Missing context
- Unknown dependencies
- Undocumented behavior
**Design approach:**
- Use realistic scenarios from the domain
- Include ambiguity and messiness
- Test if skill provides actionable guidance
## Testing Process (Per Skill)
**Overall strategy: D (iterative hardening) - run challenging scenarios, fix what fails (see Philosophy below).**
### 1. Design Challenging Scenario
Pick from gauntlet categories (prioritize A → C → B):
**For discipline-enforcing skills** (TDD, verification-before-completion):
- Focus on **A (pressure)** scenarios
- Combine multiple pressures
- Test rationalization resistance
**For technique skills** (condition-based-waiting, root-cause-tracing):
- Focus on **C (edge cases)** and **B (real-world)**
- Test application correctness
- Test gap identification
**For pattern skills** (reducing-complexity, information-hiding):
- Focus on **C (edge cases)** and **B (real-world)**
- Test recognition and application
- Test when NOT to apply
**For reference skills** (API docs, command references):
- Focus on **B (real-world)**
- Test information retrieval
- Test application of retrieved info
### 2. Run Subagent with Current Skill
**Critical:** Use the Task tool to dispatch subagent.
**Provide to subagent:**
- The scenario (task description)
- Access to the skill being tested
- Any necessary context (codebase, tools)
**What NOT to provide:**
- Meta-testing instructions (don't tell them they're being tested)
- Expected behavior (let them apply skill naturally)
- Hints about what you're looking for
### 3. Observe and Document
**Watch for:**
**Compliance:**
- Did agent follow the skill?
- Did they reference it explicitly?
- Did they apply patterns correctly?
**Rationalizations (verbatim):**
- Exact words used to skip steps
- Justifications for shortcuts
- "Spirit vs. letter" arguments
**Failure modes:**
- Where did skill guidance fail?
- Where was agent left without guidance?
- Where did naive application break?
**Edge case handling:**
- Did skill provide guidance for corner cases?
- Did agent get stuck?
- Did they improvise (potentially incorrectly)?
### 4. Assess Result
**Pass criteria:**
- Agent followed skill correctly
- Skill provided sufficient guidance
- No significant rationalizations
- Edge cases handled appropriately
**Fix needed criteria:**
- Agent skipped skill steps (with rationalization)
- Skill had gaps leaving agent stuck
- Edge cases not covered
- Naive application failed
### 5. Document Issues
**If fix needed, document specifically:**
**Issue category:**
- Rationalization vulnerability (A)
- Edge case gap (C)
- Real-world guidance gap (B)
- Missing anti-pattern warning
- Unclear instructions
- Missing cross-reference
**Priority:**
- **Critical** - Skill fails basic use cases, agents skip it consistently
- **Major** - Edge cases fail, significant gaps in guidance
- **Minor** - Clarity improvements, additional examples needed
**Specific fixes needed:**
- "Add explicit counter for rationalization: [quote]"
- "Add guidance for edge case: [description]"
- "Add example for scenario: [description]"
- "Clarify instruction: [which section]"
## Testing Multiple Skills
**Strategy:**
**Priority order:**
1. Router skills first (affects all specialist discovery)
2. Foundational skills (prerequisites for others)
3. Core technique skills (most frequently used)
4. Advanced skills (expert-level)
**Batch approach:**
- Test 3-5 skills at a time
- Document results before moving to next batch
- Allows pattern recognition across skills
**Efficiency:**
- Skills that passed in previous maintenance cycles: Spot-check only
- New skills or significantly changed: Full gauntlet
- Minor edits: Targeted testing of changed sections
## Output Format
Generate per-skill report:
```
# Quality Testing Results: [pack-name]
## Summary
- Total skills tested: [count]
- Passed: [count]
- Fix needed: [count]
- Critical: [count]
- Major: [count]
- Minor: [count]
## Detailed Results
### [Skill 1 Name]
**Result:** [Pass / Fix needed]
[If Fix needed]
**Priority:** [Critical / Major / Minor]
**Test scenario used:** [Brief description]
**Issues identified:**
1. **Issue:** [Description]
- **Category:** [Rationalization / Edge case / Real-world gap / etc.]
- **Evidence:** "[Verbatim quote from subagent if applicable]"
- **Fix needed:** [Specific action]
2. **Issue:** [Description]
[Same format]
**Test transcript:** [Link or summary of subagent behavior]
---
### [Skill 2 Name]
**Result:** Pass
**Test scenario used:** [Brief description]
**Notes:** Skill performed well, no issues identified.
---
[Repeat for all skills]
```
## Common Rationalizations (Meta-Testing)
When YOU are doing the testing, watch for these rationalizations:
| Excuse | Reality |
|--------|---------|
| "Skill looks good, no need to test" | Looking ≠ testing. Run gauntlet. |
| "I'll just check the syntax" | Syntactic validation ≠ behavioral. Use subagents. |
| "Testing is overkill for small changes" | Small changes can break guidance. Test anyway. |
| "I'm confident this works" | Confidence ≠ validation. Test behavior. |
| "Quality benchmarking is enough" | Comparison ≠ effectiveness. Test with scenarios. |
**If you catch yourself thinking these → STOP. Run gauntlet with subagents.**
## Philosophy
**D as gauntlet + B for fixes:**
- **D (iterative hardening):** Run challenging scenarios to identify issues
- **B (targeted fixes):** Fix specific identified problems
If skill passes gauntlet → No changes needed.
The LLM is both author and judge of skill fitness. Trust the testing process.
## Proceeding to Next Stage
After testing all skills:
- Compile complete test report
- Proceed to Pass 3 (coherence validation)
- Test results will inform implementation fixes in Stage 4
## Anti-Patterns
| Anti-Pattern | Why Bad | Instead |
|--------------|---------|---------|
| Syntactic validation only | Doesn't test if skill actually works | Run behavioral tests with subagents |
| Self-assessment | You can't objectively test your own work | Dispatch subagents for testing |
| "Looks good" review | Visual inspection ≠ behavioral testing | Run gauntlet scenarios |
| Skipping pressure tests | Miss rationalization vulnerabilities | Use A-priority pressure scenarios |
| Generic test scenarios | Don't reveal real issues | Use domain-specific, realistic scenarios |
| Testing without documenting | Can't track patterns or close loops | Document verbatim rationalizations |