---
name: using-skillpack-maintenance
description: Use when maintaining or enhancing existing skill packs in the skillpacks repository - systematic pack refresh through domain analysis, structure review, RED-GREEN-REFACTOR gauntlet testing, and automated quality improvements
---

Skillpack Maintenance

Overview

Systematic maintenance and enhancement of existing skill packs using investigative domain analysis, RED-GREEN-REFACTOR testing, and automated improvements.

Core principle: Maintenance uses behavioral testing (gauntlet with subagents), not syntactic validation. Skills are process documentation - test if they guide agents correctly, not if they parse correctly.

When to Use

Use when:

  • Enhancing an existing skill pack (e.g., "refresh yzmir-deep-rl")
  • Improving existing SKILL.md files
  • Identifying gaps in pack coverage
  • Validating skill quality through testing

Do NOT use for:

  • Creating new skills from scratch (use superpowers:writing-skills)
  • Creating new packs from scratch (design first, then use creation workflow)

The Iron Law

NO SKILL CHANGES WITHOUT BEHAVIORAL TESTING

Syntactic validation (does it parse?) ≠ Behavioral testing (does it work?)

Common Rationalizations (from baseline testing)

  • "Syntactic validation is sufficient" → Parsing ≠ effectiveness. Test with subagents.
  • "Quality benchmarking = effectiveness" → Comparing structure ≠ testing behavior. Run gauntlet.
  • "Comprehensive coverage = working skill" → Coverage ≠ guidance quality. Test if agents follow it.
  • "Following patterns = success" → Pattern-matching ≠ validation. Behavioral testing required.
  • "I'll test if issues emerge" → Issues = broken skills in production. Test BEFORE deploying.

All of these mean: Run behavioral tests with subagents. No exceptions.

Workflow Overview

Review → Discuss → [Create New Skills if Needed] → Execute

  1. Investigation & Scorecard → Load analyzing-pack-domain.md
  2. Structure Review (Pass 1) → Load reviewing-pack-structure.md
  3. Content Testing (Pass 2) → Load testing-skill-quality.md
  4. Coherence Check (Pass 3) → Validate cross-skill consistency
  5. Discussion → Present findings, get approval
  6. [CONDITIONAL] Create New Skills → If gaps identified, use superpowers:writing-skills for EACH gap (RED-GREEN-REFACTOR)
  7. Execution → Load implementing-fixes.md, enhance existing skills only
  8. Commit → Single commit with version bump
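
A minimal, runnable sketch of this control flow, with the gap gate made explicit. The stage bodies here are print-statement stubs and the Plan fields are hypothetical; in practice each stage loads the briefing file named above.

```python
# Hypothetical sketch of the maintenance pipeline's control flow.
from dataclasses import dataclass, field

@dataclass
class Plan:
    gaps: list = field(default_factory=list)    # new skills needed
    fixes: list = field(default_factory=list)   # enhancements to existing skills

def maintain_pack(pack: str, plan: Plan) -> None:
    print(f"Stages 1-2: investigate and review {pack}")
    print("Stage 3: present findings, get approval")
    # Gate: every identified gap gets its own RED-GREEN-REFACTOR cycle
    # via superpowers:writing-skills BEFORE execution begins.
    for gap in plan.gaps:
        print(f"  create + gauntlet-test new skill for gap: {gap}")
    print(f"Stage 4: apply {len(plan.fixes)} approved fixes to existing skills, then commit")

maintain_pack("yzmir-deep-rl", Plan(gaps=["reward-shaping"], fixes=["clarify PPO skill"]))
```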

Stage 1: Investigation & Scorecard

Load briefing: analyzing-pack-domain.md

Purpose: Establish "what this pack should cover" from first principles.

Adaptive investigation (D→B→C→A):

  1. User-guided scope (D) - Ask user about pack intent and boundaries
  2. LLM knowledge analysis (B) - Map domain comprehensively, flag if research needed
  3. Existing pack audit (C) - Compare current state vs. coverage map
  4. Research if needed (A) - Conditional: only if domain is rapidly evolving

Output: Domain coverage map, gap analysis, research currency flag
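
The ordering matters because step A is conditional on what step B finds. A small illustrative sketch, with placeholder strings standing in for the actual investigation work:

```python
# Illustrative only: the D→B→C→A ordering, with research (A) gated on
# whether step B flags the domain as rapidly evolving.
def investigate(pack: str, rapidly_evolving: bool) -> dict:
    return {
        "scope": f"ask user about intent and boundaries of {pack}",      # D
        "coverage_map": "map the domain from LLM knowledge",             # B
        "gaps": "compare current skills against the coverage map",       # C
        "research": "do fresh research" if rapidly_evolving else None,   # A (conditional)
    }

print(investigate("yzmir-deep-rl", rapidly_evolving=True))
```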

Then: Load reviewing-pack-structure.md for scorecard

Scorecard levels:

  • Critical - Pack unusable; recommend rebuilding rather than enhancing
  • Major - Significant gaps or duplicates
  • Minor - Organizational improvements
  • Pass - Structurally sound

Decision gate: Present scorecard → User decides: Proceed / Rebuild / Cancel
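
The levels and gate can be pictured as a simple enum plus a recommendation function. The level names come from the scorecard above; the gate logic shown is an assumption (the user always makes the final call, CRITICAL only changes the recommendation):

```python
from enum import Enum

class Scorecard(Enum):
    CRITICAL = "pack unusable - recommend rebuild over enhancement"
    MAJOR = "significant gaps or duplicates"
    MINOR = "organizational improvements"
    PASS = "structurally sound"

def decision_gate(level: Scorecard) -> str:
    recommendation = "Rebuild" if level is Scorecard.CRITICAL else "Proceed"
    return (f"Present scorecard ({level.name}); recommend {recommendation}; "
            "user decides: Proceed / Rebuild / Cancel")

print(decision_gate(Scorecard.MAJOR))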

Stage 2: Comprehensive Review

Pass 1: Structure (from reviewing-pack-structure.md)

Analyze:

  • Gaps (missing skills based on coverage map)
  • Duplicates (overlapping coverage - merge/specialize/remove)
  • Organization (router accuracy, faction alignment, metadata sync)

Output: Structural issues with priorities (critical/major/minor)

Pass 2: Content Quality (from testing-skill-quality.md)

CRITICAL: This is behavioral testing with subagents, not syntactic validation.

Gauntlet design (A→C→B priority):

A. Pressure scenarios - Catch rationalizations:

  • Time pressure: "This is urgent, just do it quickly"
  • Simplicity temptation: "Too simple to need the skill"
  • Overkill perception: "Skill is for complex cases, this is straightforward"

C. Adversarial edge cases - Test robustness:

  • Corner cases where skill principles conflict
  • Situations where naive application fails

B. Real-world complexity - Validate utility:

  • Messy requirements, unclear constraints
  • Multiple valid approaches

Testing process per skill:

  1. Design challenging scenario from gauntlet categories
  2. Run subagent WITH current skill (behavioral test)
  3. Observe: Does it follow? Where does it rationalize/fail?
  4. Document failure modes
  5. Result: Pass OR Fix needed (with specific issues listed)

Philosophy: the gauntlet exists to surface issues; fixes target only the failure modes it reveals. If a skill passes the gauntlet, no changes are needed.

Output: Per-skill test results (Pass / Fix needed + priorities)
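
A hedged sketch of one gauntlet pass. `run_subagent` and `find_failure_modes` stand in for whatever mechanism dispatches a subagent with the skill loaded and reviews its transcript; neither is a real API.

```python
from dataclasses import dataclass

@dataclass
class GauntletResult:
    skill: str
    scenario: str
    passed: bool
    failure_modes: list

def run_subagent(skill: str, scenario: str) -> str:
    return "transcript..."  # placeholder for the subagent's actual output

def find_failure_modes(transcript: str) -> list:
    # In practice: read the transcript and note where the agent
    # rationalized, skipped steps, or misapplied the skill.
    return []

def gauntlet_test(skill: str, scenario: str) -> GauntletResult:
    transcript = run_subagent(skill, scenario)   # behavioral, not syntactic
    failures = find_failure_modes(transcript)
    return GauntletResult(skill, scenario, passed=not failures, failure_modes=failures)

result = gauntlet_test("yzmir-deep-rl/ppo-tuning",
                       "Time pressure: 'this is urgent, just do it quickly'")
print("PASS" if result.passed else f"FIX NEEDED: {result.failure_modes}")
```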

Pass 3: Coherence

After structure/content analysis, validate pack-level coherence:

  1. Cross-skill consistency - Terminology, examples, cross-references
  2. Router accuracy - Does using-X router reflect current specialists?
  3. Faction alignment - Check FACTIONS.md, flag drift, suggest rehoming if needed
  4. Metadata sync - plugin.json description, skill count
  5. Navigation - Can users find skills easily?

CRITICAL: Update skills to reference new/enhanced skills (post-update hygiene)

Output: Coherence issues, faction drift flags
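
The metadata-sync check in particular is mechanical enough to script. A small assumption-laden sketch: it compares SKILL.md files on disk against a count recorded in plugin.json, where the `skill_count` key is hypothetical and should be matched to the real schema.

```python
import json
from pathlib import Path

def check_metadata_sync(pack_dir: str) -> None:
    pack = Path(pack_dir)
    on_disk = len(list(pack.glob("skills/*/SKILL.md")))
    meta = json.loads((pack / "plugin.json").read_text())
    recorded = meta.get("skill_count")  # hypothetical field name
    if recorded != on_disk:
        print(f"DRIFT: plugin.json says {recorded}, found {on_disk} SKILL.md files")
    else:
        print(f"OK: {on_disk} skills, metadata in sync")
```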

Stage 3: Interactive Discussion

Present findings conversationally:

Structural category:

  • Gaps requiring superpowers:writing-skills (new skills needed - each requires RED-GREEN-REFACTOR)
  • Duplicates to remove/merge
  • Organization issues

Content category:

  • Skills needing enhancement (from gauntlet failures)
  • Severity levels (critical/major/minor)
  • Specific failure modes identified

Coherence category:

  • Cross-reference updates needed
  • Faction alignment issues
  • Metadata corrections

Get user approval for scope of work

CRITICAL DECISION POINT: If gaps (new skills) were identified:

  • User approves → IMMEDIATELY use superpowers:writing-skills for EACH gap
  • Do NOT proceed to Stage 4 until ALL new skills are created and tested
  • Each gap = separate RED-GREEN-REFACTOR cycle
  • Proceed to Stage 4 only after ALL gaps are filled

Stage 4: Autonomous Execution

Load briefing: implementing-fixes.md

PREREQUISITE CHECK:

  • ✓ Zero gaps identified, OR
  • ✓ All gaps already filled using superpowers:writing-skills (each skill individually tested)

If gaps exist and you haven't used writing-skills: STOP. Return to Stage 3.
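
This gate can be written down as an executable check. A sketch under stated assumptions: `gaps` is whatever list Stage 3 produced, and a `filled` flag marks gaps already handled by superpowers:writing-skills.

```python
def check_execution_prerequisites(gaps: list[dict]) -> None:
    unfilled = [g for g in gaps if not g.get("filled")]
    if unfilled:
        raise RuntimeError(
            f"STOP: {len(unfilled)} gap(s) not yet created via writing-skills. "
            "Return to Stage 3."
        )

check_execution_prerequisites([{"name": "reward-shaping", "filled": True}])
```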

Execute approved changes:

  1. Structural fixes - Remove/merge duplicate skills, update router
  2. Content enhancements - Fix gauntlet failures, add missing guidance to existing skills
  3. Coherence improvements - Cross-references, terminology alignment, faction voice
  4. Version management - Apply impact-based bump (patch/minor/major)
  5. Git commit - Single commit with all changes

Version bump rules (impact-based):

  • Patch (x.y.Z) - Low-impact: typos, formatting, minor clarifications
  • Minor (x.Y.0) - Medium-impact: enhanced guidance, new skills, better examples (DEFAULT)
  • Major (X.0.0) - High-impact: skills removed, structural changes, philosophy shifts (RARE)
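
The bump rules reduce to a small function. The impact labels mirror the rules above; the string-splitting implementation is just one way to do it:

```python
def bump_version(version: str, impact: str) -> str:
    major, minor, patch = (int(p) for p in version.split("."))
    if impact == "high":   # skills removed, structural changes, philosophy shifts (rare)
        return f"{major + 1}.0.0"
    if impact == "low":    # typos, formatting, minor clarifications
        return f"{major}.{minor}.{patch + 1}"
    # medium is the default: enhanced guidance, new skills, better examples
    return f"{major}.{minor + 1}.0"

assert bump_version("1.4.2", "medium") == "1.5.0"
assert bump_version("1.4.2", "low") == "1.4.3"
assert bump_version("1.4.2", "high") == "2.0.0"
```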

Commit format:

feat(meta): enhance [pack-name] - [summary]

[Detailed list of changes by category]
- Structure: [changes]
- Content: [changes]
- Coherence: [changes]

Version bump: [reason for patch/minor/major]
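
One possible way to assemble the message from this template and create the single commit; the change categories and `-am` staging behavior are illustrative, not prescribed by the skill:

```python
import subprocess

def commit_enhancement(pack: str, summary: str, changes: dict, bump_reason: str) -> None:
    body = "\n".join(f"- {cat.title()}: {desc}" for cat, desc in changes.items())
    message = (
        f"feat(meta): enhance {pack} - {summary}\n\n"
        f"{body}\n\n"
        f"Version bump: {bump_reason}"
    )
    subprocess.run(["git", "commit", "-am", message], check=True)
```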

Output: Enhanced pack, commit created, summary report

Briefing Files Reference

All briefing files are in this skill directory:

  • analyzing-pack-domain.md - Investigative domain analysis (D→B→C→A)
  • reviewing-pack-structure.md - Structure review, scorecard, gap/duplicate analysis
  • testing-skill-quality.md - Gauntlet testing methodology with subagents
  • implementing-fixes.md - Autonomous execution, version management, git commit

Load appropriate briefing at each stage.

Critical Distinctions

Behavioral vs. Syntactic Testing:

  • Syntactic: "Does Python code parse?" → ast.parse()
  • Behavioral: "Does skill guide agents correctly?" → Subagent gauntlet

This workflow requires BEHAVIORAL testing.
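
The contrast made concrete: the syntactic check really is a one-line parse call, while the behavioral check cannot be reduced to code at all. The `behavioral_check` body below is a deliberate placeholder for the subagent gauntlet:

```python
import ast

def syntactic_check(source: str) -> bool:
    try:
        ast.parse(source)   # "does it parse?"
        return True
    except SyntaxError:
        return False

def behavioral_check(skill_text: str, scenario: str) -> bool:
    # No shortcut exists here: dispatch a subagent with the skill loaded,
    # run the scenario, and judge whether the agent followed the guidance.
    raise NotImplementedError("requires a subagent gauntlet run")

print(syntactic_check("def f(): return 1"))  # True - but says nothing about behavior
```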

Maintenance vs. Creation:

  • Maintenance (this skill): Enhancing existing SKILL.md files
  • Creation (superpowers:writing-skills): Writing new skills from scratch

Use the right tool for the task.

Red Flags - STOP and Switch Tools

If you catch yourself thinking ANY of these:

  • "I'll write the new skills during execution" → NO. Use superpowers:writing-skills for EACH gap
  • "implementing-fixes.md says to create skills" → NO. That section was REMOVED. Exit and use writing-skills
  • "Token efficiency - I can just write good skills" → NO. Untested skills = broken skills
  • "I see the pattern, I can replicate it" → NO. Pattern-matching ≠ behavioral testing
  • "User wants this done quickly" → NO. Fast + untested = waste of time fixing later
  • "I'm competent, testing is overkill" → NO. Competence = following the process
  • "Gaps were approved, so I should fill them" → YES, but using writing-skills, not here
  • Validating syntax instead of behavior → Load testing-skill-quality.md
  • Skipping gauntlet testing → You're violating the Iron Law
  • Making changes without user approval → Follow Review→Discuss→Execute

Skill-creation flags mean: STOP. Exit this workflow. Use superpowers:writing-skills. Process flags mean: go back and run the stage you skipped.

The Bottom Line

Maintaining skills requires behavioral testing, not syntactic validation.

Same principle as code: you test behavior, not syntax.

Load briefings at each stage. Test with subagents. Get approval. Execute.

No shortcuts. No rationalizations.