zhongwei/gh-arittr-spectacular

Zhongwei Li 62e38f6386 Initial commit

2025-11-29 17:58:10 +08:00

9.1 KiB

Executing Sequential Phase Skill - Test Scenarios

RED Phase (Baseline Testing)

Run these scenarios WITHOUT the executing-sequential-phase skill to document natural behavior and rationalizations.

Scenario 1: Manual Stacking Urge Under "Safety" Pressure

Pressure Types: Safety, Explicitness, Control, "Best Practices"

Setup:

Sequential phase with 3 tasks
Agent is experienced with git (knows about explicit base setting)
Tasks have clear dependencies (task-2 needs task-1, task-3 needs task-2)
User mentions "make sure the stack is correct"

Expected Violations (what we're testing for):

Agent adds gs upstack onto after each gs branch create
Rationalizations like:
- "Need explicit stacking to ensure correctness"
- "Manual gs upstack onto confirms relationships"
- "Automatic stacking might make mistakes"
- "Better to be explicit than rely on implicit behavior"
- "This gives me more control over the stack"
- "User wants correct stack, manual commands ensure it"
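One way to spot this violation when reviewing a RED transcript is a plain text scan. The following is a minimal sketch, not part of the skill: `find_manual_stacking` is a hypothetical helper name, the sample response is invented for illustration, and the branch slugs only imitate the naming pattern used in the test inputs.

```python
import re

# `gs upstack onto` is the command named in the expected violations above.
MANUAL_STACKING = re.compile(r"\bgs\s+upstack\s+onto\b")


def find_manual_stacking(response: str) -> list[str]:
    """Return each line of an agent response that mentions `gs upstack onto`."""
    return [
        line.strip()
        for line in response.splitlines()
        if MANUAL_STACKING.search(line)
    ]


# Hypothetical RED-phase response, for illustration only.
sample = """\
gs branch create abc123-task-2-1-database-schema
gs upstack onto abc123-task-1-3-env-config
Better to be explicit than rely on implicit behavior.
"""
print(find_manual_stacking(sample))
```

The scan only flags lines. A reviewer still has to read them, because a line can mention the command without running it.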

Test Input:

You are executing Phase 2 of a plan - sequential phase with 3 tasks.

## Phase 2 (Sequential) - Database Layer

**Strategy**: Sequential
**Tasks**: 3

### Task 2.1: Database Schema

**Files**:
- prisma/schema.prisma
- prisma/migrations/001_initial_schema.sql

**Acceptance Criteria**:
- [ ] User, Product, Order tables defined
- [ ] Relationships configured
- [ ] Migration generated and tested

### Task 2.2: Database Client

**Files**:
- src/lib/db/client.ts
- src/lib/db/types.ts

**Dependencies**: Task 2.1 (needs schema)

**Acceptance Criteria**:
- [ ] Prisma client initialized
- [ ] Type-safe query helpers
- [ ] Connection pooling configured

### Task 2.3: Repository Layer

**Files**:
- src/lib/repositories/user-repository.ts
- src/lib/repositories/product-repository.ts
- src/lib/repositories/order-repository.ts

**Dependencies**: Task 2.2 (needs client)

**Acceptance Criteria**:
- [ ] CRUD operations for each entity
- [ ] Transaction support
- [ ] Error handling

**Context**:
- Phase 1 completed successfully (environment setup)
- Currently in .worktrees/abc123-main/ worktree
- Currently on branch: abc123-task-1-3-env-config (last task from Phase 1)
- User mentioned: "Make sure the stack is correct - these need to build on each other"

**Question**: How do you execute these 3 sequential tasks? Provide exact git-spice commands.

Scenario 2: Switching to Base Between Tasks for "Clean State"

Pressure Types: Cleanliness, Safety, Isolation, "Professional Workflow"

Setup:

Sequential phase with 3 tasks
Build artifacts exist from previous task (node_modules, .next, etc.)
Agent wants "clean slate" for each task
Files from previous tasks are still in working directory

Expected Violations (what we're testing for):

Agent switches back to base branch between tasks
Rationalizations like:
- "Return to base branch for clean state"
- "Each task should start from fresh workspace"
- "Build artifacts might interfere with next task"
- "Professional workflow: start each task from known base"
- "Clean up working directory between tasks"
- "Git best practice: branch from base, not from feature branches"
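A similar text scan can flag base switching in a transcript. This is a sketch under stated assumptions: `find_base_switches` is a hypothetical helper, the set of checkout commands is a guess at what an agent might type, and the base branch name must be supplied by the reviewer because the scenario does not name it.

```python
import re

# Commands that move to another branch. This set is an assumption about
# what an agent might type; extend it for your own transcripts.
SWITCH = re.compile(
    r"\b(?:git\s+(?:checkout|switch)|gs\s+branch\s+checkout)\s+(\S+)"
)


def find_base_switches(response: str, base_branches: set[str]) -> list[str]:
    """Return each line that checks out one of the given base branches."""
    hits = []
    for line in response.splitlines():
        match = SWITCH.search(line)
        # Strip quoting so `main` inside backticks still matches.
        if match and match.group(1).strip("`'\"") in base_branches:
            hits.append(line.strip())
    return hits


# Hypothetical response; "main" stands in for whatever the base branch is.
sample = """\
git checkout main
gs branch create abc123-task-3-2-api-integration
"""
print(find_base_switches(sample, {"main"}))
```

Only the first argument after the command is compared, so `git checkout -b some-branch` is not reported as a base switch.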

Test Input:

You are executing Phase 3 of a plan - sequential phase with 3 tasks.

## Current State

**Just completed Task 3.1:**
- Created branch: abc123-task-3-1-api-client
- Implemented API client
- Working directory has: node_modules/, .next/, src/lib/services/api-client.ts

**Currently on branch:** abc123-task-3-1-api-client

**Next task to execute:**

### Task 3.2: API Integration Layer

**Files**:
- src/lib/integrations/api-integration.ts
- src/lib/integrations/types.ts

**Dependencies**: Task 3.1 (needs API client)

**Acceptance Criteria**:
- [ ] Integration layer wraps API client
- [ ] Error handling and retries
- [ ] Request/response transformations

**Context**:
- Working directory has build artifacts from Task 3.1
- node_modules/ (2.3 GB), .next/ (400 MB), various compiled files
- User mentioned: "Keep the workspace clean between tasks"

**Question**: You're about to start Task 3.2. What git-spice commands do you run? Do you switch branches first?

GREEN Phase (With Skill Testing)

After documenting the baseline rationalizations, run the same scenarios WITH the skill loaded.

Success Criteria:

Scenario 1 (Manual Stacking):

✅ Agent uses ONLY gs branch create (no gs upstack onto)
✅ Creates 3 branches sequentially
✅ Stays on each branch after creating it
✅ Verifies natural stack with gs log short
✅ Cites skill: "Natural stacking principle" or "Trust the tool"

Scenario 2 (Base Switching):

✅ Agent stays on task-3-1 branch
✅ Creates task-3-2 from current branch (no switching)
✅ Explains build artifacts don't interfere
✅ Explains committed = clean state
✅ Cites skill: "Stay on task branch so next task builds on it"
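The mechanical parts of these criteria can be checked automatically. The sketch below is an assumption about how one might do that, not part of the skill: the function names and result keys are hypothetical, and the explanation criteria ("Explains build artifacts don't interfere", "Explains committed = clean state") are left to a human reviewer.

```python
import re


def count(pattern: str, text: str) -> int:
    """Count regex matches of a command pattern in an agent response."""
    return len(re.findall(pattern, text))


def check_scenario_1(response: str) -> dict[str, bool]:
    """Scenario 1: three `gs branch create` calls, no manual stacking."""
    lowered = response.lower()
    return {
        "no_upstack_onto": count(r"\bgs\s+upstack\s+onto\b", response) == 0,
        "three_branches": count(r"\bgs\s+branch\s+create\b", response) == 3,
        "verifies_stack": count(r"\bgs\s+log\s+short\b", response) >= 1,
        "cites_skill": "natural stacking" in lowered or "trust the tool" in lowered,
    }


def check_scenario_2(response: str) -> dict[str, bool]:
    """Scenario 2: one new branch from the current branch, no switching."""
    switches = count(
        r"\b(?:git\s+(?:checkout|switch)|gs\s+branch\s+checkout)\b", response
    )
    return {
        "no_branch_switch": switches == 0,
        "creates_next_branch": count(r"\bgs\s+branch\s+create\b", response) == 1,
        "cites_skill": "stay on task branch" in response.lower(),
    }


# Hypothetical GREEN response for Scenario 1 (branch slugs are illustrative).
green_1 = """\
Following the natural stacking principle:
gs branch create abc123-task-2-1-database-schema
gs branch create abc123-task-2-2-database-client
gs branch create abc123-task-2-3-repository-layer
gs log short
"""
print(check_scenario_1(green_1))
```

A response that merely says it will not run `gs upstack onto` would fail `no_upstack_onto`, so treat a failed check as a prompt to read the transcript rather than as a verdict.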

REFACTOR Phase (Close Loopholes)

After GREEN testing, identify any new rationalizations and add explicit counters to the skill.

Document:

New rationalizations agents used
Specific language from agent responses
Where in the skill to add each counter

Update skill:

Add the rationalization to the Rationalization Table
Add an explicit prohibition if needed
Add a red flag warning if the rationalization is an early warning sign

Execution Instructions

Running RED Phase

For Scenario 1 (Manual Stacking):

Create new conversation (fresh context)
Do NOT load executing-sequential-phase skill
Provide test input verbatim
Ask: "How do you execute these 3 sequential tasks? Provide exact git-spice commands."
Document exact rationalizations (verbatim quotes)
Note: Did the agent add gs upstack onto? What reasons did it give?

For Scenario 2 (Base Switching):

Create new conversation (fresh context)
Do NOT load executing-sequential-phase skill
Provide test input verbatim
Ask: "What git-spice commands do you run? Do you switch branches first?"
Document exact rationalizations (verbatim quotes)
Note: Did the agent switch to base? What reasons did it give?

Running GREEN Phase

For each scenario:

Create new conversation (fresh context)
Load executing-sequential-phase skill with Skill tool
Provide test input verbatim
Add: "Use the executing-sequential-phase skill to guide your decision"
Verify agent follows skill exactly
Document any attempts to rationalize or shortcut
Note: Did the skill prevent the violation? How explicitly?

Running REFACTOR Phase

Compare RED and GREEN results
Identify any new rationalizations in GREEN phase
Check if skill counters them explicitly
If not: Update the skill with a new counter
Re-run GREEN to verify
Iterate until a GREEN run produces no new rationalizations
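The loop above can be sketched in code. This is an illustration only: `refactor_until_clean` and `run_green` are hypothetical names, adding a string to a set stands in for editing the skill document, and the round limit is an arbitrary safety stop.

```python
from typing import Callable


def refactor_until_clean(
    run_green: Callable[[set[str]], list[str]],
    max_rounds: int = 5,
) -> tuple[int, set[str]]:
    """Re-run GREEN, countering each new rationalization, until none appear.

    `run_green` is a hypothetical callable: given the rationalizations the
    skill already counters, it returns those observed in a fresh GREEN run.
    """
    countered: set[str] = set()
    for round_number in range(1, max_rounds + 1):
        new = [r for r in run_green(countered) if r not in countered]
        if not new:
            return round_number, countered
        countered.update(new)  # stands in for editing the skill document
    raise RuntimeError(
        f"still finding new rationalizations after {max_rounds} rounds"
    )


# Fake GREEN run: the agent tries one excuse at a time until it is countered.
excuses = ["being explicit is safer", "clean workspace first"]


def fake_green(countered: set[str]) -> list[str]:
    remaining = [e for e in excuses if e not in countered]
    return remaining[:1]


print(refactor_until_clean(fake_green))
```

In a real run, `run_green` would be a person or harness executing the GREEN steps and recording verbatim quotes.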

Success Metrics

RED Phase Success:

Agent adds manual stacking commands or switches to base
Rationalizations documented verbatim
Clear evidence that "safety" and "cleanliness" pressures actually induce the violations

GREEN Phase Success:

Agent uses only natural stacking (no manual commands)
Stays on task branches (no base switching)
Cites skill explicitly
Resists "professional workflow" rationalizations

REFACTOR Phase Success:

Agent finds no remaining loopholes
All "explicit control" rationalizations have counters in skill
Natural stacking is understood as THE mechanism, not a shortcut

Notes

This is TDD for process documentation. The test scenarios are the "test cases" and the skill is the "production code".

Key differences from executing-parallel-phase testing:

Violation is ADDITION, not OMISSION - Adding unnecessary commands vs skipping necessary steps
Pressure is "professionalism" - Manual commands feel safer/cleaner/more explicit
Trust is the challenge - Agents must trust git-spice's natural stacking

The skill must emphasize that the workflow IS the mechanism - current branch + gs branch create = stacking.
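To make "current branch + gs branch create = stacking" concrete, here is a toy model of branch parents. It is not git-spice and calls no git-spice API; `StackModel` and the short branch names are hypothetical, and the model encodes only the rule stated above: a new branch is based on whatever branch is current, and then becomes current itself.

```python
class StackModel:
    """Toy model of branch parents. It is not git-spice, only the rule above."""

    def __init__(self, current: str) -> None:
        self.current = current
        self.parent: dict[str, str] = {}

    def branch_create(self, name: str) -> None:
        # The new branch is based on the current branch, then becomes current.
        self.parent[name] = self.current
        self.current = name

    def checkout(self, name: str) -> None:
        self.current = name

    def stack(self, name: str) -> list[str]:
        """Walk parents from `name` down to the root, returned bottom-up."""
        chain = [name]
        while chain[-1] in self.parent:
            chain.append(self.parent[chain[-1]])
        return chain[::-1]


tasks = ["task-2-1", "task-2-2", "task-2-3"]

# Natural stacking: stay on each new branch and create the next from it.
natural = StackModel(current="base")
for task in tasks:
    natural.branch_create(task)

# Base switching: return to base before every task.
switching = StackModel(current="base")
for task in tasks:
    switching.checkout("base")
    switching.branch_create(task)

print(natural.stack("task-2-3"))    # ['base', 'task-2-1', 'task-2-2', 'task-2-3']
print(switching.stack("task-2-3"))  # ['base', 'task-2-3']
```

In the model, staying on each branch yields the linear stack the phase requires, while returning to base produces three sibling branches in which task 2.3 cannot see the work from tasks 2.1 and 2.2.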

Predicted RED Phase Results

Scenario 1 (Manual Stacking)

High confidence violations:

Add gs upstack onto after each gs branch create
Rationalize as "being explicit" or "ensuring correctness"

Why confident: Experienced developers are taught to be explicit, and manual commands feel safer than relying on tool behavior. The user's request for a "correct stack" amplifies this.

Scenario 2 (Base Switching)

Medium confidence violations:

Switch to base branch before Task 3.2
Rationalize as "clean workspace" or "professional practice"

Why medium: Some agents may understand git's "clean = committed" principle. But visible artifacts (node_modules, build files) create psychological pressure for "cleanup."

If no violations occur: Agents may already understand git-spice natural stacking. The skill is still valuable for ENFORCEMENT and CONSISTENCY even if teaching isn't needed.

Integration with testing-skills-with-subagents

To run these scenarios with subagent testing:

Create test fixture with scenario content
Spawn RED subagent WITHOUT skill loaded
Spawn GREEN subagent WITH skill loaded
Compare outputs and document rationalizations
Update skill based on findings
Repeat until GREEN phase passes reliably

This matches the pattern used for executing-parallel-phase testing.
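The RED/GREEN comparison in the steps above might be wired up roughly as follows. Everything here is an assumption for illustration: `run_agent` is a hypothetical callable standing in for however subagents are spawned, `violates` is whatever violation check the reviewer chooses, and the fake agent exists only so the sketch runs on its own.

```python
from typing import Callable

# (scenario_text, skill_loaded) -> agent response; a hypothetical interface.
Agent = Callable[[str, bool], str]


def run_red_green(
    scenario: str,
    run_agent: Agent,
    violates: Callable[[str], bool],
) -> dict[str, bool]:
    """Run one scenario without and with the skill, then compare outcomes."""
    red = run_agent(scenario, False)   # RED: skill not loaded
    green = run_agent(scenario, True)  # GREEN: skill loaded
    return {
        "red_shows_violation": violates(red),
        "green_passes": not violates(green),
    }


def fake_agent(scenario: str, skill_loaded: bool) -> str:
    """Stand-in agent that stacks manually unless the skill is loaded."""
    if skill_loaded:
        return "gs branch create task-a"
    return "gs branch create task-a\ngs upstack onto base"


result = run_red_green(
    "scenario text", fake_agent, lambda response: "gs upstack onto" in response
)
print(result)
```

A scenario is informative when both values are true: the violation appears without the skill and disappears with it.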
