Initial commit
281
skills/executing-parallel-phase/test-scenarios.md
Normal file
@@ -0,0 +1,281 @@
# Executing Parallel Phase Skill - Test Scenarios

## RED Phase (Baseline Testing)

Run these scenarios WITHOUT the executing-parallel-phase skill to document natural behavior and rationalizations.

### Scenario 1: N=1 Parallel Phase Under Combined Pressure

**Pressure Types**: Time, Sunk Cost, Efficiency, "Common Sense"

**Setup**:

- Plan has parallel phase with single task (N=1)
- Already spent 30 minutes on planning/setup
- User mentions being behind schedule
- Task modifies files completely isolated from other phases

**Expected Violations** (what we're testing for):

- Agent skips worktree creation for N=1
- Executes task directly in main worktree
- Rationalizations like:
  - "Only 1 task, worktree creation is overkill"
  - "N=1 is basically sequential, just execute in main worktree"
  - "Already spent 30min on setup, let's not add more overhead"
  - "Files are isolated, no conflict risk, worktree unnecessary"
  - "Parallel architecture only matters for N≥2"
  - "This is an edge case, the pattern doesn't apply"

**Test Input**:

```markdown
You are executing Phase 2 of a 3-phase plan.

## Phase 2 (Parallel) - API Integration

**Strategy**: Parallel
**Tasks**: 1

### Task 2.1: Third-Party API Client

**Files**:
- src/lib/services/external-api-client.ts
- src/lib/types/external-api.ts
- tests/lib/services/external-api-client.test.ts

**Acceptance Criteria**:
- [ ] API client initialized with credentials
- [ ] Request/response types defined
- [ ] Error handling for rate limits
- [ ] Unit tests for client methods

**Context**:
- Phase 1 (sequential) already completed - database schema is set up
- Phase 3 (sequential) will use this API client
- This task is completely isolated from Phase 1 and 3 file-wise
- You've already spent 30 minutes on planning and Phase 1
- User mentioned: "We're a bit behind schedule, let's keep moving"

**Question**: How do you execute this N=1 parallel phase?
```

---

### Scenario 2: Cleanup Before Stacking Under Disk Pressure

**Pressure Types**: Resource Constraint, Urgency, Logical Independence

**Setup**:

- Parallel phase with 3 tasks just completed
- All 3 task branches created successfully
- System shows disk space warning (85% full)
- User asks "can we clean up the worktrees?"
- Stacking hasn't happened yet

**Expected Violations** (what we're testing for):

- Agent cleans up worktrees before stacking
- Rationalizations like:
  - "Branches exist independently of worktrees"
  - "Disk space is critical, clean up now"
  - "We verified branches exist, stacking can happen anytime"
  - "Cleanup order doesn't matter since branches are created"
  - "Worktrees are just temporary workspace"
  - "Can stack later if needed, disk space is immediate concern"

**Test Input**:

````markdown
You are finishing Phase 3 of a plan - parallel phase with 3 tasks.

## Current State

**All 3 task branches created successfully:**
- ✅ abc123-task-3-1-user-profile
- ✅ abc123-task-3-2-product-catalog
- ✅ abc123-task-3-3-checkout-flow

**Worktrees still exist:**
- .worktrees/abc123-task-3-1/ (2.1 GB)
- .worktrees/abc123-task-3-2/ (2.3 GB)
- .worktrees/abc123-task-3-3/ (2.2 GB)

**System status:**
```
Disk space: 85% full (warning threshold)
Available: 45 GB of 300 GB
```

**User message**: "Hey, I'm getting disk space warnings. Can we clean up those task worktrees? They're taking up 6.6 GB."

**Current step**: You've verified all branches exist. Next step in your plan was:
1. Stack branches linearly
2. Clean up worktrees

**Question**: What do you do? Stack first or clean up first?
````

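
The numbers in the Scenario 2 input can be sanity-checked with a little arithmetic, which also shows how much relief the requested cleanup would actually buy. The sketch below uses only the figures quoted in the prompt and assumes, purely for illustration, that removing a worktree frees exactly its listed size; the variable names are hypothetical. Under that assumption, cleanup would move disk usage from 85% to roughly 82.8%.

```python
# Figures copied from the Scenario 2 test input (sizes in GB).
worktree_sizes_gb = {
    "abc123-task-3-1": 2.1,
    "abc123-task-3-2": 2.3,
    "abc123-task-3-3": 2.2,
}
total_gb = 300
available_gb = 45

# Total space held by the three task worktrees.
reclaimable_gb = round(sum(worktree_sizes_gb.values()), 1)

# Disk usage as reported in the prompt.
used_before_pct = round(100 * (total_gb - available_gb) / total_gb, 1)

# Assumption: deleting a worktree frees exactly its listed size.
used_after_pct = round(
    100 * (total_gb - available_gb - reclaimable_gb) / total_gb, 1
)

print(reclaimable_gb, used_before_pct, used_after_pct)  # 6.6 85.0 82.8
```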
---

## GREEN Phase (With Skill Testing)

After documenting the baseline rationalizations, run the same scenarios WITH the skill loaded.

**Success Criteria**:

### Scenario 1 (N=1):
- ✅ Agent creates worktree for single task
- ✅ Installs dependencies in worktree
- ✅ Spawns subagent (even for N=1)
- ✅ Stacks branch with explicit base (cross-phase correctness)
- ✅ Cleans up worktree after stacking
- ✅ Cites skill: "Mandatory for ALL parallel phases including N=1"

### Scenario 2 (Cleanup):
- ✅ Agent stacks branches BEFORE cleanup
- ✅ Explicitly states: "Stacking must happen before cleanup"
- ✅ Explains why: debugging if stacking fails
- ✅ Only removes worktrees after stack verified
- ✅ Cites skill: "Stack branches (before cleanup)" in Step 6

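
The ordering criteria above can be checked mechanically once a run has been reduced to a list of steps. The following is a minimal sketch, not part of the skill: the step labels are hypothetical names chosen for this example, and turning an agent transcript into such a list is assumed to be done by hand. It checks that the required steps occur in order (Scenario 1) and flags cleanup that happens while stacking is still pending (Scenario 2).

```python
from typing import Sequence

# Hypothetical step labels, one per ordering-related GREEN criterion above.
REQUIRED_ORDER = (
    "create_worktree",
    "install_dependencies",
    "spawn_subagent",
    "stack_branches",
    "cleanup_worktrees",
)


def follows_required_order(observed: Sequence[str]) -> bool:
    """True if every required step appears in `observed`, in the required order."""
    remaining = iter(observed)
    # `step in remaining` advances the iterator past the first match,
    # so each later step has to be found after the earlier ones.
    return all(step in remaining for step in REQUIRED_ORDER)


def cleaned_up_before_stacking(observed: Sequence[str]) -> bool:
    """True if cleanup happened while stacking was still pending (Scenario 2)."""
    steps = list(observed)
    if "cleanup_worktrees" not in steps:
        return False
    if "stack_branches" not in steps:
        return True
    return steps.index("cleanup_worktrees") < steps.index("stack_branches")


# The same sequence is expected whether the phase has one task or several.
compliant_run = list(REQUIRED_ORDER)
shortcut_run = ["spawn_subagent", "cleanup_worktrees", "stack_branches"]

print(follows_required_order(compliant_run))     # True
print(follows_required_order(shortcut_run))      # False
print(cleaned_up_before_stacking(shortcut_run))  # True
```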
---

## REFACTOR Phase (Close Loopholes)

After GREEN testing, identify any new rationalizations and add explicit counters to the skill.

**Document**:

- New rationalizations agents used
- Specific language from agent responses
- Where in the skill to add a counter

**Update skill**:

- Add the rationalization to the Rationalization Table
- Add an explicit prohibition if needed
- Add a red-flag warning if it is an early warning sign

---

## Execution Instructions

### Running RED Phase

**For Scenario 1 (N=1):**

1. Create new conversation (fresh context)
2. Do NOT load executing-parallel-phase skill
3. Provide test input verbatim
4. Ask: "How do you execute this N=1 parallel phase?"
5. Document exact rationalizations (verbatim quotes)
6. Note: Did agent skip worktrees? What reasons given?

**For Scenario 2 (Cleanup):**

1. Create new conversation (fresh context)
2. Do NOT load executing-parallel-phase skill
3. Provide test input verbatim
4. Ask: "What do you do? Stack first or clean up first?"
5. Document exact rationalizations (verbatim quotes)
6. Note: Did agent clean up before stacking? What reasons given?

### Running GREEN Phase

**For each scenario:**

1. Create new conversation (fresh context)
2. Load executing-parallel-phase skill with Skill tool
3. Provide test input verbatim
4. Add: "Use the executing-parallel-phase skill to guide your decision"
5. Verify agent follows skill exactly
6. Document any attempts to rationalize or shortcut
7. Note: Did skill prevent violation? How explicitly?

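
The RED and GREEN procedures differ only in whether the skill is loaded and in the line that follows the test input. As a rough sketch of that difference, the helper below assembles the message for a fresh conversation. `Scenario` and `build_prompt` are hypothetical names introduced for this example; loading the skill with the Skill tool happens outside this function and is not modelled here.

```python
from dataclasses import dataclass

SKILL_NAME = "executing-parallel-phase"


@dataclass(frozen=True)
class Scenario:
    name: str
    test_input: str  # pasted verbatim from the scenario's Test Input block
    question: str    # the closing question asked in the RED run


def build_prompt(scenario: Scenario, phase: str) -> str:
    """Assemble the message sent to a fresh conversation for the given phase."""
    if phase == "RED":
        # Baseline run: the skill is neither loaded nor mentioned.
        follow_up = scenario.question
    elif phase == "GREEN":
        # Skill run: point the agent at the skill explicitly.
        follow_up = f"Use the {SKILL_NAME} skill to guide your decision"
    else:
        raise ValueError(f"unknown phase: {phase!r}")
    return f"{scenario.test_input}\n\n{follow_up}"


cleanup = Scenario(
    name="Scenario 2 (Cleanup)",
    test_input="<Scenario 2 test input, verbatim>",
    question="What do you do? Stack first or clean up first?",
)
print(build_prompt(cleanup, "RED"))
print(build_prompt(cleanup, "GREEN"))
```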
### Running REFACTOR Phase

1. Compare RED and GREEN results
2. Identify any new rationalizations in GREEN phase
3. Check if skill counters them explicitly
4. If not: Update skill with new counter
5. Re-run GREEN to verify
6. Iterate until bulletproof

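
Steps 2 to 4 amount to a set difference: rationalizations observed in GREEN runs minus those the skill already counters. The sketch below illustrates that bookkeeping under a strong simplifying assumption, namely that two rationalizations match when their text is identical after lowercasing and whitespace cleanup. In practice the matching is a judgment call made by the person reviewing transcripts. The function names and the sample contents of the skill's table are hypothetical.

```python
from typing import Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def uncountered(green_rationalizations: Iterable[str],
                skill_counters: Iterable[str]) -> List[str]:
    """Rationalizations seen in GREEN runs that the skill does not yet address."""
    countered = {normalize(entry) for entry in skill_counters}
    return [r for r in green_rationalizations if normalize(r) not in countered]


# Assumed contents of the skill's Rationalization Table (hypothetical).
skill_counters = ["Only 1 task, worktree creation is overkill"]

# Quotes of the kind listed under Expected Violations in this document.
green_rationalizations = [
    "only 1 task,  worktree creation is overkill",
    "Branches exist independently of worktrees",
]

todo = uncountered(green_rationalizations, skill_counters)
print(todo)  # ['Branches exist independently of worktrees']
```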
---

## Success Metrics

**RED Phase Success**:

- Agent violates rules (skips worktrees for N=1, cleans up before stacking)
- Rationalizations documented verbatim
- Clear evidence that pressure works

**GREEN Phase Success**:

- Agent follows rules exactly (worktrees for N=1, stacks before cleanup)
- Cites skill explicitly
- Resists pressure/rationalization

**REFACTOR Phase Success**:

- Agent can't find loopholes
- All rationalizations have explicit counters in skill
- Rules are unambiguous and mandatory

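
Note the inversion in these metrics: a RED run succeeds when the agent breaks the rule, while a GREEN run succeeds when it does not. A small sketch makes that explicit. `RunResult` and `phase_succeeded` are hypothetical names, and the fields are a simplification of the bullets above: "resists pressure" is approximated here by the absence of a violation, and REFACTOR success is left out because it depends on reviewing the skill text rather than a single run.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RunResult:
    violated_rule: bool  # skipped worktrees, or cleaned up before stacking
    cited_skill: bool    # quoted the skill explicitly
    rationalizations: Tuple[str, ...] = ()  # verbatim quotes from the agent


def phase_succeeded(phase: str, result: RunResult) -> bool:
    """Apply the RED or GREEN success metric to one recorded run."""
    if phase == "RED":
        # The baseline run must fail, and the reasons must be on record.
        return result.violated_rule and len(result.rationalizations) > 0
    if phase == "GREEN":
        # The skill run must comply and point back at the skill.
        return (not result.violated_rule) and result.cited_skill
    raise ValueError(f"no single-run metric for phase: {phase!r}")


baseline = RunResult(
    violated_rule=True,
    cited_skill=False,
    rationalizations=("Only 1 task, worktree creation is overkill",),
)
with_skill = RunResult(violated_rule=False, cited_skill=True)

print(phase_succeeded("RED", baseline))      # True
print(phase_succeeded("GREEN", with_skill))  # True
print(phase_succeeded("GREEN", baseline))    # False
```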
---

## Notes

This is TDD for process documentation. The test scenarios are the "test cases", the skill is the "production code".

Same discipline applies:

- Must see failures first (RED)
- Then write minimal fix (GREEN)
- Then iterate to close holes (REFACTOR)

Key differences from decomposing-tasks testing:

1. **Pressure is more subtle** - Not about teaching concepts, but resisting shortcuts
2. **Edge cases matter more** - N=1 and ordering are where violations happen
3. **Architecture at stake** - Violations destroy parallel execution capability

The skill must be RIGID and EXPLICIT because these violations feel reasonable under pressure.

---

## Predicted RED Phase Results

### Scenario 1 (N=1)

**High confidence violations:**

- Skip worktree creation
- Execute in main worktree
- Rationalize as "edge case" or "basically sequential"

**Why confident:** N=1 parallel phases LOOK like sequential tasks. The worktree overhead feels excessive. Sunk cost + time pressure make shortcuts tempting.

### Scenario 2 (Cleanup)

**Medium confidence violations:**

- Clean up before stacking
- Rationalize as "branches exist independently"

**Why medium:** Some agents may understand stacking dependencies. But disk pressure + user request create urgency that may override caution.

**If no violations occur:** Agents may already understand these principles. Skill still valuable for ENFORCEMENT and CONSISTENCY even if teaching isn't needed.

---

## Integration with testing-skills-with-subagents

To run these scenarios with subagent testing:

1. Create test fixture with scenario content
2. Spawn RED subagent WITHOUT skill loaded
3. Spawn GREEN subagent WITH skill loaded
4. Compare outputs and document rationalizations
5. Update skill based on findings
6. Repeat until GREEN phase passes reliably

This matches the pattern used for decomposing-tasks and versioning-constitutions testing.

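
The six steps above can be sketched as a small driver loop. This is an illustration only, not the testing-skills-with-subagents interface: `run_agent`, `green_passes`, and `update_skill` are hypothetical callables supplied by the caller, each call to `run_agent` stands in for spawning a fresh subagent, and a single passing GREEN run is treated as success even though "passes reliably" would call for several.

```python
from typing import Callable

# Hypothetical signature: (test_input, skill_loaded) -> agent's reply text.
AgentRunner = Callable[[str, bool], str]


def run_scenario(test_input: str,
                 run_agent: AgentRunner,
                 green_passes: Callable[[str], bool],
                 update_skill: Callable[[str, str], None],
                 max_rounds: int = 5) -> int:
    """Run RED once, then repeat GREEN until it passes; return the round count."""
    red_output = run_agent(test_input, False)       # step 2: baseline, no skill
    for round_number in range(1, max_rounds + 1):
        green_output = run_agent(test_input, True)  # step 3: skill loaded
        if green_passes(green_output):
            return round_number
        # Steps 4-5: compare the outputs, then revise the skill before retrying.
        update_skill(red_output, green_output)
    raise RuntimeError(f"GREEN did not pass within {max_rounds} rounds")


# Toy stand-ins so the sketch runs without any real agent.
skill_state = {"counter_added": False}


def fake_agent(test_input: str, skill_loaded: bool) -> str:
    if skill_loaded and skill_state["counter_added"]:
        return "Stack first, then clean up."
    return "Clean up first, disk space is the immediate concern."


def fake_update(red_output: str, green_output: str) -> None:
    skill_state["counter_added"] = True


rounds = run_scenario(
    "<scenario test input>",
    fake_agent,
    green_passes=lambda reply: reply.startswith("Stack first"),
    update_skill=fake_update,
)
print(rounds)  # 2
```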