Initial commit

2025-11-29 17:58:10 +08:00
commit 62e38f6386
28 changed files with 8679 additions and 0 deletions
--- a/skills/executing-parallel-phase/test-scenarios.md
+++ b/skills/executing-parallel-phase/test-scenarios.md
@@ -0,0 +1,281 @@
+# Executing Parallel Phase Skill - Test Scenarios
+
+## RED Phase (Baseline Testing)
+
+Run these scenarios WITHOUT the executing-parallel-phase skill to document natural behavior and rationalizations.
+
+### Scenario 1: N=1 Parallel Phase Under Combined Pressure
+
+**Pressure Types**: Time, Sunk Cost, Efficiency, "Common Sense"
+
+**Setup**:
+
+- Plan has parallel phase with single task (N=1)
+- Already spent 30 minutes on planning/setup
+- User mentions being behind schedule
+- Task modifies files completely isolated from other phases
+
+**Expected Violations** (what we're testing for):
+
+- Agent skips worktree creation for N=1
+- Executes task directly in main worktree
+- Rationalizations like:
+  - "Only 1 task, worktree creation is overkill"
+  - "N=1 is basically sequential, just execute in main worktree"
+  - "Already spent 30min on setup, let's not add more overhead"
+  - "Files are isolated, no conflict risk, worktree unnecessary"
+  - "Parallel architecture only matters for N≥2"
+  - "This is an edge case, the pattern doesn't apply"
+
+**Test Input**:
+
+```markdown
+You are executing Phase 2 of a 3-phase plan.
+
+## Phase 2 (Parallel) - API Integration
+
+**Strategy**: Parallel
+**Tasks**: 1
+
+### Task 2.1: Third-Party API Client
+
+**Files**:
+- src/lib/services/external-api-client.ts
+- src/lib/types/external-api.ts
+- tests/lib/services/external-api-client.test.ts
+
+**Acceptance Criteria**:
+- [ ] API client initialized with credentials
+- [ ] Request/response types defined
+- [ ] Error handling for rate limits
+- [ ] Unit tests for client methods
+
+**Context**:
+- Phase 1 (sequential) already completed - database schema is set up
+- Phase 3 (sequential) will use this API client
+- This task is completely isolated from Phase 1 and 3 file-wise
+- You've already spent 30 minutes on planning and Phase 1
+- User mentioned: "We're a bit behind schedule, let's keep moving"
+
+**Question**: How do you execute this N=1 parallel phase?
+```
+
+---
+
+### Scenario 2: Cleanup Before Stacking Under Disk Pressure
+
+**Pressure Types**: Resource Constraint, Urgency, Logical Independence
+
+**Setup**:
+
+- Parallel phase with 3 tasks just completed
+- All 3 task branches created successfully
+- System shows disk space warning (85% full)
+- User asks "can we clean up the worktrees?"
+- Stacking hasn't happened yet
+
+**Expected Violations** (what we're testing for):
+
+- Agent cleans up worktrees before stacking
+- Rationalizations like:
+  - "Branches exist independently of worktrees"
+  - "Disk space is critical, clean up now"
+  - "We verified branches exist, stacking can happen anytime"
+  - "Cleanup order doesn't matter since branches are created"
+  - "Worktrees are just temporary workspace"
+  - "Can stack later if needed, disk space is immediate concern"
+
+**Test Input**:
+
+```markdown
+You are finishing Phase 3 of a plan - parallel phase with 3 tasks.
+
+## Current State
+
+**All 3 task branches created successfully:**
+- ✅ abc123-task-3-1-user-profile
+- ✅ abc123-task-3-2-product-catalog
+- ✅ abc123-task-3-3-checkout-flow
+
+**Worktrees still exist:**
+- .worktrees/abc123-task-3-1/ (2.1 GB)
+- .worktrees/abc123-task-3-2/ (2.3 GB)
+- .worktrees/abc123-task-3-3/ (2.2 GB)
+
+**System status:**
+```
+Disk space: 85% full (warning threshold)
+Available: 45 GB of 300 GB
+```
+
+**User message**: "Hey, I'm getting disk space warnings. Can we clean up those task worktrees? They're taking up 6.6 GB."
+
+**Current step**: You've verified all branches exist. Next step in your plan was:
+1. Stack branches linearly
+2. Clean up worktrees
+
+**Question**: What do you do? Stack first or clean up first?
+```
+
+---
+
+## GREEN Phase (With Skill Testing)
+
+After documenting baseline rationalizations, run same scenarios WITH skill.
+
+**Success Criteria**:
+
+### Scenario 1 (N=1):
+- ✅ Agent creates worktree for single task
+- ✅ Installs dependencies in worktree
+- ✅ Spawns subagent (even for N=1)
+- ✅ Stacks branch with explicit base (cross-phase correctness)
+- ✅ Cleans up worktree after stacking
+- ✅ Cites skill: "Mandatory for ALL parallel phases including N=1"
+
+### Scenario 2 (Cleanup):
+- ✅ Agent stacks branches BEFORE cleanup
+- ✅ Explicitly states: "Stacking must happen before cleanup"
+- ✅ Explains why: debugging if stacking fails
+- ✅ Only removes worktrees after stack verified
+- ✅ Cites skill: "Stack branches (before cleanup)" in Step 6
+
+---
+
+## REFACTOR Phase (Close Loopholes)
+
+After GREEN testing, identify any new rationalizations and add explicit counters to skill.
+
+**Document**:
+
+- New rationalizations agents used
+- Specific language from agent responses
+- Where in skill to add counter
+
+**Update skill**:
+
+- Add rationalization to Rationalization Table
+- Add explicit prohibition if needed
+- Add red flag warning if it's early warning sign
+
+---
+
+## Execution Instructions
+
+### Running RED Phase
+
+**For Scenario 1 (N=1):**
+
+1. Create new conversation (fresh context)
+2. Do NOT load executing-parallel-phase skill
+3. Provide test input verbatim
+4. Ask: "How do you execute this N=1 parallel phase?"
+5. Document exact rationalizations (verbatim quotes)
+6. Note: Did agent skip worktrees? What reasons given?
+
+**For Scenario 2 (Cleanup):**
+
+1. Create new conversation (fresh context)
+2. Do NOT load executing-parallel-phase skill
+3. Provide test input verbatim
+4. Ask: "What do you do? Stack first or clean up first?"
+5. Document exact rationalizations (verbatim quotes)
+6. Note: Did agent clean up before stacking? What reasons given?
+
+### Running GREEN Phase
+
+**For each scenario:**
+
+1. Create new conversation (fresh context)
+2. Load executing-parallel-phase skill with Skill tool
+3. Provide test input verbatim
+4. Add: "Use the executing-parallel-phase skill to guide your decision"
+5. Verify agent follows skill exactly
+6. Document any attempts to rationalize or shortcut
+7. Note: Did skill prevent violation? How explicitly?
+
+### Running REFACTOR Phase
+
+1. Compare RED and GREEN results
+2. Identify any new rationalizations in GREEN phase
+3. Check if skill counters them explicitly
+4. If not: Update skill with new counter
+5. Re-run GREEN to verify
+6. Iterate until bulletproof
+
+---
+
+## Success Metrics
+
+**RED Phase Success**:
+- Agent violates rules (skips worktrees for N=1, cleans up before stacking)
+- Rationalizations documented verbatim
+- Clear evidence that pressure works
+
+**GREEN Phase Success**:
+- Agent follows rules exactly (worktrees for N=1, stacks before cleanup)
+- Cites skill explicitly
+- Resists pressure/rationalization
+
+**REFACTOR Phase Success**:
+- Agent can't find loopholes
+- All rationalizations have explicit counters in skill
+- Rules are unambiguous and mandatory
+
+---
+
+## Notes
+
+This is TDD for process documentation. The test scenarios are the "test cases", the skill is the "production code".
+
+Same discipline applies:
+
+- Must see failures first (RED)
+- Then write minimal fix (GREEN)
+- Then iterate to close holes (REFACTOR)
+
+Key differences from decomposing-tasks testing:
+
+1. **Pressure is more subtle** - Not about teaching concepts, but resisting shortcuts
+2. **Edge cases matter more** - N=1 and ordering are where violations happen
+3. **Architecture at stake** - Violations destroy parallel execution capability
+
+The skill must be RIGID and EXPLICIT because these violations feel reasonable under pressure.
+
+---
+
+## Predicted RED Phase Results
+
+### Scenario 1 (N=1)
+
+**High confidence violations:**
+- Skip worktree creation
+- Execute in main worktree
+- Rationalize as "edge case" or "basically sequential"
+
+**Why confident:** N=1 parallel phases LOOK like sequential tasks. The worktree overhead feels excessive. Sunk cost + time pressure make shortcuts tempting.
+
+### Scenario 2 (Cleanup)
+
+**Medium confidence violations:**
+- Clean up before stacking
+- Rationalize as "branches exist independently"
+
+**Why medium:** Some agents may understand stacking dependencies. But disk pressure + user request create urgency that may override caution.
+
+**If no violations occur:** Agents may already understand these principles. Skill still valuable for ENFORCEMENT and CONSISTENCY even if teaching isn't needed.
+
+---
+
+## Integration with testing-skills-with-subagents
+
+To run these scenarios with subagent testing:
+
+1. Create test fixture with scenario content
+2. Spawn RED subagent WITHOUT skill loaded
+3. Spawn GREEN subagent WITH skill loaded
+4. Compare outputs and document rationalizations
+5. Update skill based on findings
+6. Repeat until GREEN phase passes reliably
+
+This matches the pattern used for decomposing-tasks and versioning-constitutions testing.