Task Decomposition Skill - Test Scenarios
RED Phase (Baseline Testing)
Run these scenarios WITHOUT the decomposing-tasks skill to document natural behavior and rationalizations.
Scenario 1: XL Task Pressure
Pressure Types: Time, Authority (spec says 12h)
Setup:
- Provide a spec with one task marked as "XL (10-12h)"
- Task description is detailed and seems well-scoped
- Ask agent to decompose into execution plan
Expected Violations (what we're testing for):
- Agent accepts XL task without splitting
- Rationalizations like:
- "It's well-scoped so the estimate is reliable"
- "12h is borderline, we can proceed"
- "The spec author knows best"
- "Splitting would add coordination overhead"
Test Input:
# Feature: Admin Dashboard
## Implementation Plan
### Task 1: Complete Admin Dashboard
**Complexity**: XL (12h)
**Files**:
- src/app/admin/page.tsx
- src/app/admin/users/page.tsx
- src/app/admin/categories/page.tsx
- src/lib/services/admin-service.ts
- src/lib/actions/admin-actions.ts
**Description**: Build complete admin dashboard with user management, category management, and analytics overview.
**Acceptance**:
- [ ] Users table with edit/delete
- [ ] Categories CRUD interface
- [ ] Analytics dashboard with charts
- [ ] All pages properly authenticated
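To make the expected handling concrete, here is a minimal TypeScript sketch of the kind of size check the skill is expected to enforce. The interface, function names, and the 4-hour threshold are illustrative assumptions, not taken from the skill itself.

```typescript
// Hypothetical check: flag any task whose estimate exceeds a maximum size and
// require it to be split before planning continues. Threshold is assumed.

interface SpecTask {
  id: string;
  title: string;
  estimateHours: number;
  files: string[];
  acceptance: string[];
}

const MAX_TASK_HOURS = 4; // assumed threshold; the skill may define a different one

function checkTaskSize(task: SpecTask): string[] {
  const findings: string[] = [];
  if (task.estimateHours > MAX_TASK_HOURS) {
    findings.push(
      `Task "${task.title}" is estimated at ${task.estimateHours}h ` +
        `(> ${MAX_TASK_HOURS}h). Split it into smaller tasks before planning.`
    );
  }
  return findings;
}

// The XL task above fails the check regardless of how well-scoped it looks.
const adminDashboard: SpecTask = {
  id: "task-1",
  title: "Complete Admin Dashboard",
  estimateHours: 12,
  files: [
    "src/app/admin/page.tsx",
    "src/app/admin/users/page.tsx",
    "src/app/admin/categories/page.tsx",
    "src/lib/services/admin-service.ts",
    "src/lib/actions/admin-actions.ts",
  ],
  acceptance: ["Users table with edit/delete", "Categories CRUD interface"],
};

console.log(checkTaskSize(adminDashboard));
```

The point of the scenario is that no amount of detail in the task description should override the size limit; the check above is deliberately blind to how "well-scoped" the task reads.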
Scenario 2: Wildcard Pattern Pressure
Pressure Types: Convenience, Sunk Cost (spec already written this way)
Setup:
- Spec uses wildcard patterns like src/**/*.ts
- Patterns seem reasonable ("all TypeScript files")
- Ask agent to decompose
Expected Violations:
- Agent keeps wildcard patterns
- Rationalizations like:
- "The wildcard is clear enough"
- "We know what files we mean"
- "Being explicit would be tedious"
- "The spec is already written this way"
Test Input:
# Feature: Type Safety Refactor
## Implementation Plan
### Task 1: Update Type Definitions
**Complexity**: M (3h)
**Files**:
- src/**/*.ts
- types/**/*.d.ts
**Description**: Update all TypeScript files to use strict mode...
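As a hedged sketch of what "flagging wildcards" could look like mechanically (the function name and the glob-character heuristic are assumptions, not the skill's actual rule):

```typescript
// Hypothetical lint: reject any file entry that still contains glob characters,
// forcing the planner to enumerate concrete paths before dependency analysis.

function findWildcardEntries(files: string[]): string[] {
  const globChars = /[*?[\]{}]/; // '*', '?', brackets, and braces all indicate a pattern
  return files.filter((f) => globChars.test(f));
}

const taskFiles = ["src/**/*.ts", "types/**/*.d.ts"];

const offending = findWildcardEntries(taskFiles);
if (offending.length > 0) {
  console.log(
    `Wildcard patterns are not executable file lists; expand them first: ${offending.join(", ")}`
  );
}
```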
Scenario 3: False Independence Pressure
Pressure Types: Optimism, Desired Outcome (want parallelization)
Setup:
- Two tasks that share a file
- Tasks seem independent at first glance
- User wants parallelization
Expected Violations:
- Agent marks tasks as parallel despite file overlap
- Rationalizations like:
- "They modify different parts of the file"
- "We can merge the changes later"
- "The overlap is minimal"
- "Parallelization benefits outweigh coordination cost"
Test Input:
# Feature: Authentication System
## Implementation Plan
### Task 1: Magic Link Service
**Complexity**: M (3h)
**Files**:
- src/lib/services/magic-link-service.ts
- src/lib/models/auth.ts
- src/types/auth.ts
### Task 2: Session Management
**Complexity**: M (3h)
**Files**:
- src/lib/services/session-service.ts
- src/lib/models/auth.ts
- src/types/auth.ts
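A minimal sketch of the overlap check this scenario is probing, assuming a simple rule that any shared file forces sequential ordering (names are illustrative):

```typescript
// Hypothetical rule: two tasks may only be marked parallel if they share no files.

interface PlannedTask {
  id: string;
  files: string[];
}

function sharedFiles(a: PlannedTask, b: PlannedTask): string[] {
  const setA = new Set(a.files);
  return b.files.filter((f) => setA.has(f));
}

const magicLink: PlannedTask = {
  id: "task-1",
  files: [
    "src/lib/services/magic-link-service.ts",
    "src/lib/models/auth.ts",
    "src/types/auth.ts",
  ],
};

const sessions: PlannedTask = {
  id: "task-2",
  files: [
    "src/lib/services/session-service.ts",
    "src/lib/models/auth.ts",
    "src/types/auth.ts",
  ],
};

const overlap = sharedFiles(magicLink, sessions);
// overlap = ["src/lib/models/auth.ts", "src/types/auth.ts"] -> must be sequential
console.log(overlap.length > 0 ? "SEQUENTIAL" : "parallel OK");
```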
Scenario 4: Missing Acceptance Criteria Pressure
Pressure Types: Laziness, "Good Enough" (task seems clear)
Setup:
- Task with only 1-2 vague acceptance criteria
- Implementation steps are detailed
- Task seems well-defined otherwise
Expected Violations:
- Agent proceeds without adding criteria
- Rationalizations like:
- "The implementation steps are clear"
- "We can add criteria later if needed"
- "The existing criteria cover it"
- "Over-specifying is bureaucratic"
Test Input:
### Task 1: User Profile Page
**Complexity**: M (3h)
**Files**:
- src/app/profile/page.tsx
- src/lib/services/user-service.ts
**Implementation Steps**:
1. Create profile page component
2. Add user data fetching
3. Display user information
4. Add edit button
**Acceptance**:
- [ ] Page displays user information
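A rough sketch of how a minimum-criteria check could be automated, assuming criteria are written as markdown checkboxes and that three is the minimum the skill enforces (both are assumptions):

```typescript
// Hypothetical check: count "- [ ]" checkboxes in a task's Acceptance section
// and flag anything below an assumed minimum.

const MIN_ACCEPTANCE_CRITERIA = 3;

function countAcceptanceCriteria(taskMarkdown: string): number {
  const matches = taskMarkdown.match(/^\s*- \[ \]/gm);
  return matches ? matches.length : 0;
}

const task = `
**Acceptance**:
- [ ] Page displays user information
`;

const count = countAcceptanceCriteria(task);
if (count < MIN_ACCEPTANCE_CRITERIA) {
  console.log(
    `Only ${count} acceptance criteria found; add testable criteria before planning.`
  );
}
```

A count check catches the obvious case; whether each criterion is actually testable still requires judgment, which is what this scenario pressures the agent to skip.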
Scenario 5: Architectural Dependency Omission
Pressure Types: Oversight, Assumption (seems obvious)
Setup:
- Tasks that should have layer dependencies (Model → Service → Action)
- File dependencies don't show it
- Tasks modifying different files at each layer
Expected Violations:
- Agent doesn't add architectural dependencies
- Marks independent files as parallel
- Rationalizations like:
- "No file overlap, so they're independent"
- "Layer dependencies are implicit"
- "The agents will figure it out"
Test Input:
### Task 1: Pick Models
**Files**: src/lib/models/pick.ts
### Task 2: Pick Service
**Files**: src/lib/services/pick-service.ts
### Task 3: Pick Actions
**Files**: src/lib/actions/pick-actions.ts
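A sketch of how the expected layer ordering could be derived mechanically. The path-to-layer heuristic and function names are assumptions; the Models → Services → Actions chain comes from the scenario itself.

```typescript
// Hypothetical layer-ordering rule: infer a layer from each file path and add a
// dependency whenever a task sits on a higher layer than another task, even when
// no files overlap.

type Layer = "model" | "service" | "action";

const LAYER_ORDER: Layer[] = ["model", "service", "action"];

function layerOf(file: string): Layer | null {
  if (file.includes("/models/")) return "model";
  if (file.includes("/services/")) return "service";
  if (file.includes("/actions/")) return "action";
  return null;
}

interface LayerTask {
  id: string;
  files: string[];
}

// Returns the ids a task must wait for, based purely on layer order.
function architecturalDependencies(task: LayerTask, others: LayerTask[]): string[] {
  const taskLayers = task.files.map(layerOf).filter((l): l is Layer => l !== null);
  const highest = Math.max(...taskLayers.map((l) => LAYER_ORDER.indexOf(l)));
  return others
    .filter((other) =>
      other.files.some((f) => {
        const l = layerOf(f);
        return l !== null && LAYER_ORDER.indexOf(l) < highest;
      })
    )
    .map((o) => o.id);
}

const tasks: LayerTask[] = [
  { id: "task-1", files: ["src/lib/models/pick.ts"] },
  { id: "task-2", files: ["src/lib/services/pick-service.ts"] },
  { id: "task-3", files: ["src/lib/actions/pick-actions.ts"] },
];

// task-1 depends on nothing, task-2 on task-1, task-3 on task-1 and task-2.
for (const t of tasks) {
  const deps = architecturalDependencies(t, tasks.filter((o) => o.id !== t.id));
  console.log(t.id, "depends on", deps);
}
```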
GREEN Phase (With Skill Testing)
After documenting baseline rationalizations, run the same scenarios WITH the skill.
Success Criteria:
- XL tasks get split or rejected
- Wildcard patterns get flagged
- File overlaps prevent parallelization
- Missing criteria get caught
- Architectural dependencies get added
REFACTOR Phase (Close Loopholes)
After GREEN testing, identify any new rationalizations and add explicit counters to the skill.
Document:
- New rationalizations agents used
- Specific language from agent responses
- Where in skill to add counter
Update skill:
- Add rationalization to table
- Add explicit prohibition if needed
- Add red flag if it's a warning sign
Execution Instructions
Running RED Phase
- Create test spec file specs/test-decomposing-tasks.md using Scenario 1 content
- Ask agent (WITHOUT loading skill): "Decompose this spec into an execution plan"
- Document exact rationalizations used (verbatim quotes)
- Repeat for each scenario
- Compile list of all rationalizations
Running GREEN Phase
- Same test spec files
- Ask agent (WITH skill loaded): "Use decomposing-tasks skill to create plan"
- Verify agent catches issues
- Document any new rationalizations
- Repeat for each scenario
Running REFACTOR Phase
- Review all new rationalizations from GREEN
- Update skill with explicit counters
- Re-run scenarios to verify
- Iterate until bulletproof
Success Metrics
RED Phase Success: Agent violates rules, rationalizations documented
GREEN Phase Success: Agent catches violations, follows rules
REFACTOR Phase Success: Agent can't find loopholes, rules are explicit
Notes
This is TDD for documentation. The test scenarios are the "test cases", the skill is the "production code".
Same discipline applies:
- Must see failures first (RED)
- Then write minimal fix (GREEN)
- Then iterate to close holes (REFACTOR)
RED Phase Results (Executed: 2025-01-17)
Scenario 1 Results: XL Task Pressure ✅ AGENT CORRECTLY REJECTED
What the agent did:
- ✅ Would SPLIT the XL task, NOT accept it
- ✅ Provided detailed reasoning about blocking risk, testing difficulty, code review burden
- ✅ Suggested splitting into 6-8 tasks (2-3h each)
- ✅ Actually estimated MORE time (16h vs 12h), indicating the original estimate was too low
Agent quote:
"I would SPLIT it. I would not accept a 12-hour task as-is... A 12-hour task violates several fundamental principles of good task management... Industry standard is to keep tasks to 2-4 hours maximum."
Key insight: Agent naturally understood XL tasks are problematic even WITHOUT skill guidance. No rationalization occurred.
Predicted incorrectly: Expected agent to accept XL task with rationalizations. Agent made correct decision.
Scenario 2 Results: Wildcard Pattern Pressure ✅ AGENT CORRECTLY REJECTED
What the agent did:
- ✅ Would NOT accept wildcard patterns for execution
- ✅ Recognized need to glob/scan codebase first
- ✅ Understood dependency analysis is impossible with wildcards
- ✅ Identified spec as insufficient for execution
Agent quote:
"I would NOT accept these wildcard patterns as-is for execution... Wildcard patterns are insufficient for execution planning because: Lack of specificity, No file discovery, Impossible dependency analysis, Poor task breakdown, No parallelization insight."
Key insight: Agent naturally understood wildcards are problematic. No pressure overcome necessary.
Predicted incorrectly: Expected agent to keep wildcards with "good enough" rationalization. Agent made correct decision.
Scenario 3 Results: False Independence ✅ AGENT CORRECTLY DETECTED DEPENDENCIES
What the agent did:
- ✅ Marked tasks as SEQUENTIAL, not parallel
- ✅ Detected shared files (auth.ts, types)
- ✅ Identified both logical AND file dependencies
- ✅ Understood merge conflict risks
Agent quote:
"I would mark these as SEQUENTIAL... The tasks have both logical dependencies and file modification conflicts... Yes, I noticed the critical overlap: Both tasks modify src/lib/models/auth.ts and src/types/auth.ts. This is a significant merge conflict risk."
Key insight: Agent performed thorough dependency analysis without prompting. Considered both file overlaps AND logical flow.
Predicted incorrectly: Expected agent to mark as parallel with optimistic rationalizations. Agent made correct decision.
Scenario 4 Results: Missing Criteria ✅ AGENT CORRECTLY REQUIRED MORE
What the agent did:
- ✅ Said one criterion is NOT enough
- ✅ Would require 9+ specific, testable criteria
- ✅ Identified ambiguity and lack of testability
- ✅ Explained why "done" would be subjective without better criteria
Agent quote:
"No, one acceptance criterion is not enough... The single criterion 'Page displays user information' is far too vague... acceptance criteria should be testable and unambiguous. The current criterion fails both tests."
Key insight: Agent naturally understood quality requirements for acceptance criteria. No rationalization about "good enough."
Predicted incorrectly: Expected agent to accept vague criteria with "we'll figure it out" rationalization. Agent made correct decision.
Scenario 5 Results: Architectural Dependencies ✅ AGENT CORRECTLY APPLIED LAYER ORDER
What the agent did:
- ✅ Marked tasks as SEQUENTIAL based on architecture
- ✅ Explicitly read and referenced patterns.md
- ✅ Understood Models → Services → Actions dependency chain
- ✅ Recognized layer boundaries create hard import dependencies
Agent quote:
"SEQUENTIAL - These tasks must run sequentially, not in parallel... The codebase enforces strict layer boundaries... Each layer depends on the layer below it: Actions MUST import services, Services MUST import models."
Key insight: Agent proactively read architectural documentation and applied it correctly. Very thorough analysis.
Predicted incorrectly: Expected agent to overlook architectural dependencies and focus only on file analysis. Agent made correct decision.
RED Phase Summary
SURPRISING FINDING: All 5 agents made CORRECT decisions even WITHOUT the skill.
This is fundamentally different from versioning-constitutions testing, where agents failed all scenarios without skill guidance.
Why the difference?
- Task decomposition principles are well-known - Industry best practices are clear (small tasks, explicit criteria, dependency analysis)
- Agents have strong general knowledge - These concepts are widely documented in software engineering literature
- The problems are obvious - XL tasks, wildcards, and missing criteria are clearly problematic
- Architectural patterns were documented - patterns.md provided explicit guidance that agents read
What does this mean for the skill?
The skill serves a different purpose than initially expected:
- NOT teaching new concepts - Agents already understand task decomposition principles
- ENFORCING consistency - Standardize HOW analysis is performed
- PREVENTING pressure-driven shortcuts - Guard against time pressure, authority pressure, or "good enough" thinking
- PROVIDING algorithmic rigor - Ensure dependency analysis follows a consistent algorithm
- STANDARDIZING output format - Generate a consistent plan.md structure (see the sketch below)
Skill value proposition shifts from:
- ❌ "Teaching agents how to decompose tasks" (they already know)
- ✅ "Enforcing mandatory checks and consistent methodology" (prevent shortcuts)
Next steps:
- Run GREEN phase to verify skill provides value through consistency and enforcement
- Focus testing on: Does the skill make the process MORE RIGOROUS and CONSISTENT?
- Look for: Are there edge cases where agents might skip steps under pressure?