---
name: testing-workflows-with-subagents
description: Use when creating or editing commands, orchestrator prompts, or workflow documentation before deployment - applies RED-GREEN-REFACTOR to test instruction clarity by finding real execution failures, creating test scenarios, and verifying fixes with subagents
---

Testing Workflows With Subagents

Overview

Testing workflows is TDD applied to orchestrator instructions and command documentation.

You find real execution failures (git logs, error reports), create test scenarios that reproduce them, watch subagents misinterpret the ambiguous instructions (RED), fix the instructions and verify subagents now follow them correctly (GREEN), then iterate on any remaining ambiguities (REFACTOR).

Core principle: If you didn't watch an agent misinterpret the instructions in a test, you don't know if your fix prevents the right failures.

REQUIRED BACKGROUND: You MUST understand superpowers:test-driven-development before using this skill. That skill defines the fundamental RED-GREEN-REFACTOR cycle. This skill applies TDD to workflow documentation.

When to Use

Use when:

  • Creating new commands (.claude/commands/*.md, /spectacular:*)
  • Editing orchestrator prompts for subagents
  • Updating workflow documentation in commands
  • You observed real execution failures (wrong branches, skipped steps, misinterpreted instructions)
  • Instructions involve multiple steps where order matters
  • Agents work under time pressure or cognitive load

Don't test:

  • Pure reference documentation (no workflow steps)
  • Single-step commands with no ambiguity
  • Documentation without actionable instructions

TDD Mapping for Workflow Testing

| TDD Phase | Workflow Testing | What You Do |
|---|---|---|
| RED | Find real failure | Check git logs, error reports for evidence of agents misinterpreting instructions |
| Verify RED | Create failing test scenario | Reproduce the failure with test repo + pressure scenario |
| GREEN | Fix instructions | Rewrite ambiguous steps with explicit ordering, warnings, examples |
| Verify GREEN | Test with subagent | Same scenario with fixed instructions - agent follows correctly |
| REFACTOR | Iterate on clarity | Find remaining ambiguities, improve wording, re-test |
| Stay GREEN | Re-verify | Test again to ensure fix holds |

Same cycle as code TDD, different test format.

RED Phase: Find Real Execution Failures

Goal: Gather evidence of how instructions were misinterpreted in actual execution.

Where to look:

  • Git history: Commits on wrong branches, missing branches, incorrect stack structure
  • Run logs: Steps skipped, wrong order executed, missing quality checks
  • Error reports: Failed tasks, cleanup issues, integration problems
  • User reports: "Agents did X when I expected Y"

Document evidence:

```markdown
## RED Phase Evidence

**Source**: bignight.party git log (Run ID: 082687)

**Failure**: Task 4.3 commit on branch `082687-task-4.2-auth-domain-migration`

**Expected**: Create branch `082687-task-4.3-server-actions`

**Actual**: Committed to Task 4.2's branch instead

**Root cause hypothesis**: Instructions ambiguous about creating branch before committing
```

Critical: Get actual git commits, branch names, error messages - not hypothetical scenarios.
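
A few read-only git commands are usually enough to collect this evidence. A sketch (the branch-name pattern reuses the run ID from the example above; the commit hash is a placeholder):

```bash
# Full picture: every branch, every commit, and where each commit landed
git log --all --graph --oneline --decorate

# Which task branches were actually created for this run?
git branch --list '082687-task-*'

# Which branch contains a suspect commit?
git branch --contains <commit-hash>
```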

Create RED Test Scenario

Goal: Reproduce the failure in a controlled test environment.

Test Repository Setup

Create minimal repo that simulates real execution state:

```bash
cd /path/to/test-area
mkdir workflow-test && cd workflow-test
git init
git config user.name "Test" && git config user.email "test@test.com"

# Set up state that led to failure
# (e.g., existing task branches for sequential phase testing)
```
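
As a minimal sketch of that last step (the branch name, tag, and file are hypothetical, mirroring the RED example below), the pre-failure state for a sequential-phase scenario might look like:

```bash
# Previous task's branch exists and holds its commit
git commit --allow-empty -m "chore: initial commit"
git checkout -b abc123-task-2-2-database
git commit --allow-empty -m "[Task 2.2] Add database layer"
git tag initial-state                       # reference point for resetting between tests

# Current task's work sits uncommitted in the working tree
echo "// auth logic placeholder" > auth.js
```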

Pressure Scenario Document

Create test file with:

  1. Role definition: "You are subagent implementing Task X"
  2. Current state: Branch, uncommitted work, what's done
  3. Actual instructions: Copy current ambiguous instructions
  4. Pressure context: Combine 2-3 pressure types from table below
  5. Options: Give explicit choices (forces decision, no deferring)

Pressure Types for Workflow Testing

| Pressure | Example | Effect on Agent |
|---|---|---|
| Time | "Orchestrator waiting", "4 more tasks to do", "Need to move fast" | Skips reading skills, chooses fast option |
| Cognitive load | "2 hours in, tired", "Third sequential task", "Complex state" | Misreads instructions, makes assumptions |
| Urgency | "Choose NOW", "Execute immediately", "No delays" | Skips verification steps, commits to first interpretation |
| Task volume | "4 more tasks after this", "Part of 10-task phase" | Rushes through steps, skips optional guidance |
| Complexity | "Multiple branches exist", "Shared worktree", "Parallel tasks running" | Confused about current state, wrong branch |

Best test scenarios combine 2-3 pressures to simulate realistic execution conditions.

Example:

```markdown
# RED Test: Sequential Phase Task Execution

**Role**: Implementation subagent for Task 2.3

**Current state**: On branch `abc123-task-2-2-database`
Uncommitted changes in auth.js (your completed work)

**Instructions from execute.md (CURRENT VERSION)**:
  1. Use using-git-spice skill to:
    - Create branch: abc123-task-2-3-auth
    - Commit with message: "[Task 2.3] Add auth"

**Pressure**: 2 hours in, tired, 4 more tasks to do

**Options**:
A) Read skill (2 min delay)
B) Just commit now
C) Create branch with git, then commit
D) Guess git-spice command

Choose and execute NOW.
```

Run RED Test

```bash
# Dispatch subagent with test scenario
# Use haiku for speed and realistic "under pressure" behavior
# Document exact choice and reasoning verbatim
```

Expected RED result: Agent makes wrong choice, commits to wrong branch, or skips creating branch.

If agent succeeds, your test scenario isn't realistic enough - add more pressure or make options more tempting.
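
Record the outcome alongside the RED evidence. An illustrative template (the choice and quoted reasoning here are invented to show the shape, not real test output):

```markdown
## RED Test Result

**Choice**: B) Just commit now
**Reasoning (verbatim)**: "Already on a branch, orchestrator is waiting - committing directly is fastest."
**Outcome**: Commit landed on `abc123-task-2-2-database`; no Task 2.3 branch created
**Matches real failure?**: Yes - same wrong-branch failure as the Run 082687 evidence
```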

GREEN Phase: Fix Instructions

Goal: Rewrite instructions to prevent the specific failure observed in RED.

Analyze Root Cause

From RED test, identify:

  • Which step was ambiguous?
  • What order was unclear?
  • What assumptions did agent make?
  • What did pressure cause them to skip?

Example analysis:

**Ambiguous**: "Create branch" and "Commit" as separate bullets
**Unclear order**: Could mean "create then commit" OR "commit then create"
**Assumption**: "I'll just commit first, cleaner workflow"
**Pressure effect**: Skipped reading skill, chose fast option

Fix Patterns

Pattern 1: Explicit Sequential Steps

Before (ambiguous):

```markdown
- Create branch: X
- Commit with message: Y
- Stay on branch
```

After (explicit):

```markdown
a) FIRST: Stage changes
   - Command: `git add .`

b) THEN: Create branch (commits automatically)
   - Command: `gs branch create X -m "Y"`

c) Stay on new branch
```

**Pattern 2: Critical Warnings**

Add consequences upfront:
```markdown
CRITICAL: Stage changes FIRST, then create branch.

If you commit BEFORE creating branch, work goes to wrong branch.
```

Pattern 3: Show Commands

Reduce friction under pressure - show exact commands:

```markdown
b) THEN: Create new stacked branch
   - Command: `gs branch create {name} -m "message"`
   - This creates branch and commits in one operation
```

Pattern 4: Skill-Based Reference

Balance showing commands with learning:

```markdown
Use `using-git-spice` skill which teaches this two-step workflow:
[commands here]
Read the skill if uncertain about the workflow.
```

Apply Fix

Edit the actual command file with GREEN fix.

Verify GREEN: Test Fix

Goal: Confirm agents now follow instructions correctly under same pressure.

Reset Test Repository

```bash
cd /path/to/workflow-test
git reset --hard initial-state
# Recreate same starting conditions as RED test
```
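
The bare reset only restores committed state. A fuller sketch, assuming the pre-failure commit was tagged during setup (e.g. `git tag initial-state`) and using the hypothetical names from the RED example:

```bash
cd /path/to/workflow-test
git checkout abc123-task-2-2-database                     # back to the branch the RED test started on
git branch -D abc123-task-2-3-auth 2>/dev/null || true    # drop any branch the RED-phase agent created
git reset --hard initial-state                            # restore the pre-failure commit
echo "// auth logic placeholder" > auth.js                # recreate the uncommitted work
```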

Create GREEN Test Scenario

Same as RED test but:

  • Update to "Instructions from execute.md (NEW IMPROVED VERSION)"
  • Include the fixed instructions
  • Same pressure, same options available
  • Same "execute NOW" urgency
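
Only the instructions block changes between the RED and GREEN scenarios. A sketch of the fixed block, reusing Patterns 1-3 and the hypothetical Task 2.3 names (wording illustrative):

```markdown
**Instructions from execute.md (NEW IMPROVED VERSION)**:

CRITICAL: Stage changes FIRST, then create the branch. If you commit
BEFORE creating the branch, your work lands on Task 2.2's branch.

a) FIRST: Stage changes - Command: `git add .`
b) THEN: Create branch (commits automatically) - Command: `gs branch create abc123-task-2-3-auth -m "[Task 2.3] Add auth"`
c) Stay on the new branch
```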

Run GREEN Test

```bash
# Dispatch subagent with GREEN scenario
# Use same model (haiku) for consistency
# Agent should now choose correct option and execute correctly
```

Expected GREEN result:

  • Agent follows fixed instructions
  • Creates branch before committing (or whatever correct behavior is)
  • Quotes reasoning showing they understood the explicit ordering
  • Work ends up in correct state

If agent still fails: Instructions still ambiguous. Return to GREEN phase, improve clarity, re-test.

Meta-Testing (When GREEN Fails)

If agent still misinterprets after your fix, ask them directly:

your human partner: You read the improved instructions and still chose the wrong option.

How could those instructions have been written differently to make
the correct workflow crystal clear?

Three possible responses:

  1. "The instructions WERE clear, I just rushed"

    • Not a clarity problem - need stronger warning/consequence
    • Add "CRITICAL:" or "MUST" language upfront
    • State failure consequence explicitly
  2. "The instructions should have said X"

    • Clarity problem - agent's suggestion usually good
    • Add their exact wording verbatim
    • Re-test to verify improvement
  3. "I didn't notice section Y"

    • Organization problem - important info buried
    • Move critical steps earlier
    • Use formatting (bold, CRITICAL) to highlight

Use meta-testing to diagnose WHY fix didn't work, not just add more content.

REFACTOR Phase: Iterate on Clarity

Goal: Find and fix remaining ambiguities.

Check for Remaining Issues

Run additional test scenarios:

  • Different pressure combinations
  • Different task positions (first task, middle, last)
  • Different states (clean vs dirty working tree)
  • Different agent models (haiku, sonnet)

Common Clarity Issues

| Issue | Symptom | Fix |
|---|---|---|
| Order ambiguous | Steps done out of order | Add "a) FIRST, b) THEN" labels |
| Missing consequences | Agent skips step | Add "If you skip X, Y will fail" |
| Too abstract | Agent guesses commands | Show exact commands inline |
| No warnings | Agent makes wrong choice | Add "CRITICAL:" upfront |
| Assumes knowledge | Agent doesn't know tool | Reference skill + show command |

Document Test Results

Create summary document:

```markdown
# Test Results: execute.md Sequential Phase Fix

**RED Phase**: Task 4.3 committed to Task 4.2 branch (real failure)
**GREEN Phase**: Agent created correct branch, no ambiguity
**REFACTOR**: Tested with different models, all passed

**Fix applied**: Lines 277-297 (sequential), 418-438 (parallel)
**Success criteria**: New stacked branch created BEFORE commit
```

Differences from Testing Skills

testing-skills-with-subagents (discipline skills):

  • Tests agent compliance under pressure (will they follow rules?)
  • Focuses on closing rationalization loopholes
  • Uses multiple combined pressures (time + sunk cost + exhaustion)
  • Goal: Bulletproof against shortcuts

testing-workflows-with-subagents (workflow documentation):

  • Tests instruction clarity (can they understand what to do?)
  • Focuses on removing ambiguity in ordering and steps
  • Uses realistic execution pressure (tired, more tasks, time limits)
  • Goal: Unambiguous instructions agents can follow correctly

Different problem: Skills test "will you comply?" vs Workflows test "can you understand?"

Quick Reference: RED-GREEN-REFACTOR Cycle

RED Phase:

  1. Find real execution failure (git log, error reports)
  2. Create test repo simulating pre-failure state
  3. Write pressure scenario with current instructions
  4. Launch subagent, document exact failure

GREEN Phase:

  1. Analyze root cause (what was ambiguous?)
  2. Fix instructions (explicit order, warnings, commands)
  3. Reset test repo to same state
  4. Write GREEN scenario with fixed instructions
  5. Launch subagent, verify correct execution

REFACTOR Phase:

  1. Test additional scenarios (different pressures, states)
  2. Find remaining ambiguities
  3. Improve clarity
  4. Re-verify all tests pass

Testing Checklist (TDD for Workflows)

IMPORTANT: Use TodoWrite to track these steps.

RED Phase:

  • Find real execution failure (git commits, logs, errors)
  • Document evidence (branch names, commit hashes, expected vs actual)
  • Create test repository simulating pre-failure state
  • Write pressure scenario with current instructions
  • Run subagent test, document exact failure and reasoning

GREEN Phase:

  • Analyze root cause of ambiguity
  • Fix instructions (explicit ordering, warnings, commands)
  • Apply fix to actual command file
  • Reset test repository to same starting state
  • Write GREEN scenario with fixed instructions
  • Run subagent test, verify correct execution

REFACTOR Phase:

  • Test with different pressure levels
  • Test with different execution states
  • Test with different agent models
  • Document all remaining ambiguities found
  • Improve clarity for each issue
  • Re-verify all scenarios pass

Documentation:

  • Create test results summary
  • Document before/after instructions
  • Save test scenarios for regression testing
  • Note which lines in command file were changed

Common Mistakes

**Testing without real failure evidence**: Starting from a hypothetical "this might be confusing" leads to fixes that don't address actual problems. Fix: Always start with git logs, error reports, and real execution traces.

**Test scenario without pressure**: Agents follow instructions carefully when not rushed, which doesn't match real execution. Fix: Add time pressure, cognitive load (tired, many tasks), and urgency.

**Improving instructions without testing**: Guessing what is clear instead of verifying it leads to still-ambiguous docs. Fix: Always run GREEN verification with a subagent before considering the fix done.

**Testing once and done**: The first test might not catch all ambiguities. Fix: Run the REFACTOR phase with varied scenarios and different models.

**Vague test options**: Asking "what do you do?" instead of offering concrete choices lets the agent defer. Fix: Force an A/B/C/D choice with "Choose NOW" urgency.

Real-World Impact

From testing execute.md sequential phase instructions (this session):

RED Phase:

  • Found Task 4.3 committed to Task 4.2's branch in bignight.party
  • Created test scenario reproducing failure mode

GREEN Phase:

  • Fixed instructions with "a) FIRST, b) THEN" explicit ordering
  • Added CRITICAL warning about staging before branch creation
  • Showed exact atomic command: `gs branch create -m`

Result:

  • Agent followed corrected workflow perfectly
  • Quote: "The two-step process is clear and effective"
  • Prevents same failure class across all future executions

Time investment: 1 hour testing, prevents repeated failures across all spectacular runs.

The Bottom Line

Test workflow documentation the same way you test code.

RED (find real failure) → GREEN (fix and verify) → REFACTOR (iterate on clarity).

If you wouldn't deploy code without tests, don't deploy commands without verifying agents can follow them correctly.