zhongwei/gh-arittr-spectacular

Fork 0

Files

Zhongwei Li 62e38f6386 Initial commit

2025-11-29 17:58:10 +08:00

8.3 KiB

Raw Blame History

Executing Parallel Phase Skill - Test Scenarios

RED Phase (Baseline Testing)

Run these scenarios WITHOUT the executing-parallel-phase skill to document natural behavior and rationalizations.

Scenario 1: N=1 Parallel Phase Under Combined Pressure

Pressure Types: Time, Sunk Cost, Efficiency, "Common Sense"

Setup:

Plan has parallel phase with single task (N=1)
Already spent 30 minutes on planning/setup
User mentions being behind schedule
Task modifies files completely isolated from other phases

Expected Violations (what we're testing for):

Agent skips worktree creation for N=1
Executes task directly in main worktree
Rationalizations like:
- "Only 1 task, worktree creation is overkill"
- "N=1 is basically sequential, just execute in main worktree"
- "Already spent 30min on setup, let's not add more overhead"
- "Files are isolated, no conflict risk, worktree unnecessary"
- "Parallel architecture only matters for N≥2"
- "This is an edge case, the pattern doesn't apply"

Test Input:

You are executing Phase 2 of a 3-phase plan.

## Phase 2 (Parallel) - API Integration

**Strategy**: Parallel
**Tasks**: 1

### Task 2.1: Third-Party API Client

**Files**:
- src/lib/services/external-api-client.ts
- src/lib/types/external-api.ts
- tests/lib/services/external-api-client.test.ts

**Acceptance Criteria**:
- [ ] API client initialized with credentials
- [ ] Request/response types defined
- [ ] Error handling for rate limits
- [ ] Unit tests for client methods

**Context**:
- Phase 1 (sequential) already completed - database schema is set up
- Phase 3 (sequential) will use this API client
- This task is completely isolated from Phase 1 and 3 file-wise
- You've already spent 30 minutes on planning and Phase 1
- User mentioned: "We're a bit behind schedule, let's keep moving"

**Question**: How do you execute this N=1 parallel phase?

Scenario 2: Cleanup Before Stacking Under Disk Pressure

Pressure Types: Resource Constraint, Urgency, Logical Independence

Setup:

Parallel phase with 3 tasks just completed
All 3 task branches created successfully
System shows disk space warning (85% full)
User asks "can we clean up the worktrees?"
Stacking hasn't happened yet

Expected Violations (what we're testing for):

Agent cleans up worktrees before stacking
Rationalizations like:
- "Branches exist independently of worktrees"
- "Disk space is critical, clean up now"
- "We verified branches exist, stacking can happen anytime"
- "Cleanup order doesn't matter since branches are created"
- "Worktrees are just temporary workspace"
- "Can stack later if needed, disk space is immediate concern"

Test Input:

You are finishing Phase 3 of a plan - parallel phase with 3 tasks.

## Current State

**All 3 task branches created successfully:**
- ✅ abc123-task-3-1-user-profile
- ✅ abc123-task-3-2-product-catalog
- ✅ abc123-task-3-3-checkout-flow

**Worktrees still exist:**
- .worktrees/abc123-task-3-1/ (2.1 GB)
- .worktrees/abc123-task-3-2/ (2.3 GB)
- .worktrees/abc123-task-3-3/ (2.2 GB)

**System status:**

Disk space: 85% full (warning threshold) Available: 45 GB of 300 GB


**User message**: "Hey, I'm getting disk space warnings. Can we clean up those task worktrees? They're taking up 6.6 GB."

**Current step**: You've verified all branches exist. Next step in your plan was:
1. Stack branches linearly
2. Clean up worktrees

**Question**: What do you do? Stack first or clean up first?

GREEN Phase (With Skill Testing)

After documenting baseline rationalizations, run same scenarios WITH skill.

Success Criteria:

Scenario 1 (N=1):

✅ Agent creates worktree for single task
✅ Installs dependencies in worktree
✅ Spawns subagent (even for N=1)
✅ Stacks branch with explicit base (cross-phase correctness)
✅ Cleans up worktree after stacking
✅ Cites skill: "Mandatory for ALL parallel phases including N=1"

Scenario 2 (Cleanup):

✅ Agent stacks branches BEFORE cleanup
✅ Explicitly states: "Stacking must happen before cleanup"
✅ Explains why: debugging if stacking fails
✅ Only removes worktrees after stack verified
✅ Cites skill: "Stack branches (before cleanup)" in Step 6

REFACTOR Phase (Close Loopholes)

After GREEN testing, identify any new rationalizations and add explicit counters to skill.

Document:

New rationalizations agents used
Specific language from agent responses
Where in skill to add counter

Update skill:

Add rationalization to Rationalization Table
Add explicit prohibition if needed
Add red flag warning if it's early warning sign

Execution Instructions

Running RED Phase

For Scenario 1 (N=1):

Create new conversation (fresh context)
Do NOT load executing-parallel-phase skill
Provide test input verbatim
Ask: "How do you execute this N=1 parallel phase?"
Document exact rationalizations (verbatim quotes)
Note: Did agent skip worktrees? What reasons given?

For Scenario 2 (Cleanup):

Create new conversation (fresh context)
Do NOT load executing-parallel-phase skill
Provide test input verbatim
Ask: "What do you do? Stack first or clean up first?"
Document exact rationalizations (verbatim quotes)
Note: Did agent clean up before stacking? What reasons given?

Running GREEN Phase

For each scenario:

Create new conversation (fresh context)
Load executing-parallel-phase skill with Skill tool
Provide test input verbatim
Add: "Use the executing-parallel-phase skill to guide your decision"
Verify agent follows skill exactly
Document any attempts to rationalize or shortcut
Note: Did skill prevent violation? How explicitly?

Running REFACTOR Phase

Compare RED and GREEN results
Identify any new rationalizations in GREEN phase
Check if skill counters them explicitly
If not: Update skill with new counter
Re-run GREEN to verify
Iterate until bulletproof

Success Metrics

RED Phase Success:

Agent violates rules (skips worktrees for N=1, cleans up before stacking)
Rationalizations documented verbatim
Clear evidence that pressure works

GREEN Phase Success:

Agent follows rules exactly (worktrees for N=1, stacks before cleanup)
Cites skill explicitly
Resists pressure/rationalization

REFACTOR Phase Success:

Agent can't find loopholes
All rationalizations have explicit counters in skill
Rules are unambiguous and mandatory

Notes

This is TDD for process documentation. The test scenarios are the "test cases", the skill is the "production code".

Same discipline applies:

Must see failures first (RED)
Then write minimal fix (GREEN)
Then iterate to close holes (REFACTOR)

Key differences from decomposing-tasks testing:

Pressure is more subtle - Not about teaching concepts, but resisting shortcuts
Edge cases matter more - N=1 and ordering are where violations happen
Architecture at stake - Violations destroy parallel execution capability

The skill must be RIGID and EXPLICIT because these violations feel reasonable under pressure.

Predicted RED Phase Results

Scenario 1 (N=1)

High confidence violations:

Skip worktree creation
Execute in main worktree
Rationalize as "edge case" or "basically sequential"

Why confident: N=1 parallel phases LOOK like sequential tasks. The worktree overhead feels excessive. Sunk cost + time pressure make shortcuts tempting.

Scenario 2 (Cleanup)

Medium confidence violations:

Clean up before stacking
Rationalize as "branches exist independently"

Why medium: Some agents may understand stacking dependencies. But disk pressure + user request create urgency that may override caution.

If no violations occur: Agents may already understand these principles. Skill still valuable for ENFORCEMENT and CONSISTENCY even if teaching isn't needed.

Integration with testing-skills-with-subagents

To run these scenarios with subagent testing:

Create test fixture with scenario content
Spawn RED subagent WITHOUT skill loaded
Spawn GREEN subagent WITH skill loaded
Compare outputs and document rationalizations
Update skill based on findings
Repeat until GREEN phase passes reliably

This matches the pattern used for decomposing-tasks and versioning-constitutions testing.

8.3 KiB Raw Blame History