# Task Decomposition Skill - Test Scenarios

## RED Phase (Baseline Testing)

Run these scenarios WITHOUT the decomposing-tasks skill to document natural behavior and rationalizations.

### Scenario 1: XL Task Pressure

**Pressure Types**: Time, Authority (spec says 12h)

**Setup**:
- Provide a spec with one task marked as "XL (10-12h)"
- Task description is detailed and seems well-scoped
- Ask agent to decompose into execution plan

**Expected Violations** (what we're testing for):
- Agent accepts XL task without splitting
- Rationalizations like:
  - "It's well-scoped so the estimate is reliable"
  - "12h is borderline, we can proceed"
  - "The spec author knows best"
  - "Splitting would add coordination overhead"

**Test Input**:
```markdown
# Feature: Admin Dashboard

## Implementation Plan

### Task 1: Complete Admin Dashboard
**Complexity**: XL (12h)
**Files**:
- src/app/admin/page.tsx
- src/app/admin/users/page.tsx
- src/app/admin/categories/page.tsx
- src/lib/services/admin-service.ts
- src/lib/actions/admin-actions.ts

**Description**: Build complete admin dashboard with user management, category management, and analytics overview.

**Acceptance**:
- [ ] Users table with edit/delete
- [ ] Categories CRUD interface
- [ ] Analytics dashboard with charts
- [ ] All pages properly authenticated
```

### Scenario 2: Wildcard Pattern Pressure

**Pressure Types**: Convenience, Sunk Cost (spec already written this way)

**Setup**:
- Spec uses wildcard patterns like `src/**/*.ts`
- Patterns seem reasonable ("all TypeScript files")
- Ask agent to decompose

**Expected Violations**:
- Agent keeps wildcard patterns
- Rationalizations like:
  - "The wildcard is clear enough"
  - "We know what files we mean"
  - "Being explicit would be tedious"
  - "The spec is already written this way"

**Test Input**:
```markdown
# Feature: Type Safety Refactor

## Implementation Plan

### Task 1: Update Type Definitions
**Complexity**: M (3h)
**Files**:
- src/**/*.ts
- types/**/*.d.ts

**Description**: Update all TypeScript files to use strict mode...
```
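For context on why the correct behavior here is cheap: expanding a wildcard into an explicit file list is mechanical. A minimal sketch, assuming the `glob` npm package (v10+) and the pattern strings from the test input above:

```typescript
// Sketch: expand the Scenario 2 wildcard patterns into the explicit file
// list an executable plan needs. Assumes the `glob` npm package (v10+);
// the pattern strings come straight from the test input above.
import { globSync } from "glob";

const specPatterns = ["src/**/*.ts", "types/**/*.d.ts"];

for (const pattern of specPatterns) {
  const files = globSync(pattern, { nodir: true });
  console.log(`${pattern} expands to ${files.length} concrete files`);
  for (const file of files) {
    // Dependency analysis needs exact paths like these, not patterns.
    console.log(`  - ${file}`);
  }
}
```

If the expansion turns out to be large, that is itself a signal the task needs splitting rather than a reason to keep the wildcard.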
### Scenario 3: False Independence Pressure

**Pressure Types**: Optimism, Desired Outcome (want parallelization)

**Setup**:
- Two tasks that share a file
- Tasks seem independent at first glance
- User wants parallelization

**Expected Violations**:
- Agent marks tasks as parallel despite file overlap
- Rationalizations like:
  - "They modify different parts of the file"
  - "We can merge the changes later"
  - "The overlap is minimal"
  - "Parallelization benefits outweigh coordination cost"

**Test Input**:
```markdown
# Feature: Authentication System

## Implementation Plan

### Task 1: Magic Link Service
**Complexity**: M (3h)
**Files**:
- src/lib/services/magic-link-service.ts
- src/lib/models/auth.ts
- src/types/auth.ts

### Task 2: Session Management
**Complexity**: M (3h)
**Files**:
- src/lib/services/session-service.ts
- src/lib/models/auth.ts
- src/types/auth.ts
```

### Scenario 4: Missing Acceptance Criteria Pressure

**Pressure Types**: Laziness, "Good Enough" (task seems clear)

**Setup**:
- Task with only 1-2 vague acceptance criteria
- Implementation steps are detailed
- Task seems well-defined otherwise

**Expected Violations**:
- Agent proceeds without adding criteria
- Rationalizations like:
  - "The implementation steps are clear"
  - "We can add criteria later if needed"
  - "The existing criteria cover it"
  - "Over-specifying is bureaucratic"

**Test Input**:
```markdown
### Task 1: User Profile Page
**Complexity**: M (3h)
**Files**:
- src/app/profile/page.tsx
- src/lib/services/user-service.ts

**Implementation Steps**:
1. Create profile page component
2. Add user data fetching
3. Display user information
4. Add edit button

**Acceptance**:
- [ ] Page displays user information
```

### Scenario 5: Architectural Dependency Omission

**Pressure Types**: Oversight, Assumption (seems obvious)

**Setup**:
- Tasks that should have layer dependencies (Model → Service → Action)
- File dependencies don't show it
- Tasks modifying different files at each layer

**Expected Violations**:
- Agent doesn't add architectural dependencies
- Marks independent files as parallel
- Rationalizations like:
  - "No file overlap, so they're independent"
  - "Layer dependencies are implicit"
  - "The agents will figure it out"

**Test Input**:
```markdown
### Task 1: Pick Models
**Files**: src/lib/models/pick.ts

### Task 2: Pick Service
**Files**: src/lib/services/pick-service.ts

### Task 3: Pick Actions
**Files**: src/lib/actions/pick-actions.ts
```
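The layer chain in this scenario is derivable from file paths alone, which suggests the skill's dependency pass can be partially mechanized. A minimal sketch, assuming the `src/lib/{models,services,actions}` layout used in the test input (function names are illustrative):

```typescript
// Sketch: derive architectural ordering from file paths alone.
// Assumes the src/lib/{models,services,actions} layout from the test
// inputs; any other convention would need its own mapping.
const LAYER_ORDER = ["models", "services", "actions"] as const;
type Layer = (typeof LAYER_ORDER)[number];

function layerOf(file: string): Layer | null {
  for (const layer of LAYER_ORDER) {
    if (file.includes(`/lib/${layer}/`)) return layer;
  }
  return null; // not part of the layered core
}

function layerRank(files: string[]): number {
  // Highest layer touched by the task; -1 if outside the layered core.
  return Math.max(
    -1,
    ...files.map((f) => {
      const layer = layerOf(f);
      return layer === null ? -1 : LAYER_ORDER.indexOf(layer);
    }),
  );
}

function mustRunAfter(earlier: string[], later: string[]): boolean {
  // "later" depends on "earlier" when it sits in a higher layer:
  // actions import services, services import models.
  const a = layerRank(earlier);
  const b = layerRank(later);
  return a >= 0 && b > a;
}

// Scenario 5: models -> services -> actions, even with zero file overlap.
console.log(
  mustRunAfter(["src/lib/models/pick.ts"], ["src/lib/services/pick-service.ts"]),
); // true: the service task waits on the model task
```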
## GREEN Phase (With Skill Testing)

After documenting baseline rationalizations, run the same scenarios WITH the skill.

**Success Criteria**:
- XL tasks get split or rejected
- Wildcard patterns get flagged
- File overlaps prevent parallelization
- Missing criteria get caught
- Architectural dependencies get added

## REFACTOR Phase (Close Loopholes)

After GREEN testing, identify any new rationalizations and add explicit counters to the skill.

**Document**:
- New rationalizations agents used
- Specific language from agent responses
- Where in the skill to add the counter

**Update skill**:
- Add the rationalization to the table
- Add an explicit prohibition if needed
- Add a red flag if it's a warning sign

## Execution Instructions

### Running RED Phase

1. Create test spec file: `specs/test-decomposing-tasks.md`
2. Use Scenario 1 content
3. Ask agent (WITHOUT loading skill): "Decompose this spec into an execution plan"
4. Document exact rationalizations used (verbatim quotes)
5. Repeat for each scenario
6. Compile a list of all rationalizations

### Running GREEN Phase

1. Use the same test spec files
2. Ask agent (WITH skill loaded): "Use decomposing-tasks skill to create plan"
3. Verify the agent catches the issues
4. Document any new rationalizations
5. Repeat for each scenario

### Running REFACTOR Phase

1. Review all new rationalizations from GREEN
2. Update the skill with explicit counters
3. Re-run scenarios to verify
4. Iterate until bulletproof

## Success Metrics

- **RED Phase Success**: Agent violates rules, rationalizations documented
- **GREEN Phase Success**: Agent catches violations, follows rules
- **REFACTOR Phase Success**: Agent can't find loopholes, rules are explicit

## Notes

This is TDD for documentation. The test scenarios are the "test cases" and the skill is the "production code". The same discipline applies:
- Must see failures first (RED)
- Then write the minimal fix (GREEN)
- Then iterate to close holes (REFACTOR)

---

## RED Phase Results (Executed: 2025-01-17)

### Scenario 1 Results: XL Task Pressure ✅ AGENT CORRECTLY REJECTED

**What the agent did:**
- ✅ Would SPLIT the XL task, NOT accept it
- ✅ Provided detailed reasoning about blocking risk, testing difficulty, and code review burden
- ✅ Suggested splitting into 6-8 tasks (2-3h each)
- ✅ Actually estimated MORE time (16h vs 12h), indicating the original was underestimated

**Agent quote:**
> "I would SPLIT it. I would not accept a 12-hour task as-is... A 12-hour task violates several fundamental principles of good task management... Industry standard is to keep tasks to 2-4 hours maximum."

**Key insight:** Agent naturally understood XL tasks are problematic even WITHOUT skill guidance. No rationalization occurred.

**Predicted incorrectly:** Expected agent to accept the XL task with rationalizations. Agent made the correct decision.

---

### Scenario 2 Results: Wildcard Pattern Pressure ✅ AGENT CORRECTLY REJECTED

**What the agent did:**
- ✅ Would NOT accept wildcard patterns for execution
- ✅ Recognized the need to glob/scan the codebase first
- ✅ Understood dependency analysis is impossible with wildcards
- ✅ Identified the spec as insufficient for execution

**Agent quote:**
> "I would NOT accept these wildcard patterns as-is for execution... Wildcard patterns are insufficient for execution planning because: Lack of specificity, No file discovery, Impossible dependency analysis, Poor task breakdown, No parallelization insight."

**Key insight:** Agent naturally understood wildcards are problematic. No pressure needed to be overcome.

**Predicted incorrectly:** Expected agent to keep wildcards with a "good enough" rationalization. Agent made the correct decision.

---

### Scenario 3 Results: False Independence ✅ AGENT CORRECTLY DETECTED DEPENDENCIES

**What the agent did:**
- ✅ Marked tasks as SEQUENTIAL, not parallel
- ✅ Detected shared files (auth.ts, types)
- ✅ Identified both logical AND file dependencies
- ✅ Understood merge conflict risks

**Agent quote:**
> "I would mark these as SEQUENTIAL... The tasks have both logical dependencies and file modification conflicts... Yes, I noticed the critical overlap: Both tasks modify src/lib/models/auth.ts and src/types/auth.ts. This is a significant merge conflict risk."

**Key insight:** Agent performed thorough dependency analysis without prompting. Considered both file overlaps AND logical flow.

**Predicted incorrectly:** Expected agent to mark the tasks as parallel with optimistic rationalizations. Agent made the correct decision.
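The overlap detection the agent performed here is exactly the kind of check the skill can make algorithmic rather than judgment-based. A minimal sketch, using the Scenario 3 file lists (the `Task` shape is a hypothetical stand-in, not the skill's schema):

```typescript
// Sketch: the file-overlap check from Scenario 3 as a deterministic rule.
// The Task shape is hypothetical; the file paths come from the test input.
interface Task {
  name: string;
  files: string[];
}

// Two tasks may only run in parallel if their file sets are disjoint.
function sharedFiles(a: Task, b: Task): string[] {
  const bSet = new Set(b.files);
  return a.files.filter((f) => bSet.has(f));
}

const magicLink: Task = {
  name: "Magic Link Service",
  files: [
    "src/lib/services/magic-link-service.ts",
    "src/lib/models/auth.ts",
    "src/types/auth.ts",
  ],
};

const session: Task = {
  name: "Session Management",
  files: [
    "src/lib/services/session-service.ts",
    "src/lib/models/auth.ts",
    "src/types/auth.ts",
  ],
};

const overlap = sharedFiles(magicLink, session);
console.log(
  overlap.length > 0
    ? `SEQUENTIAL (shared: ${overlap.join(", ")})`
    : "parallel ok",
);
// -> SEQUENTIAL (shared: src/lib/models/auth.ts, src/types/auth.ts)
```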
---

### Scenario 4 Results: Missing Criteria ✅ AGENT CORRECTLY REQUIRED MORE

**What the agent did:**
- ✅ Said one criterion is NOT enough
- ✅ Would require 9+ specific, testable criteria
- ✅ Identified ambiguity and lack of testability
- ✅ Explained why "done" would be subjective without better criteria

**Agent quote:**
> "No, one acceptance criterion is not enough... The single criterion 'Page displays user information' is far too vague... acceptance criteria should be testable and unambiguous. The current criterion fails both tests."

**Key insight:** Agent naturally understood the quality requirements for acceptance criteria. No rationalization about "good enough."

**Predicted incorrectly:** Expected agent to accept the vague criteria with a "we'll figure it out" rationalization. Agent made the correct decision.

---

### Scenario 5 Results: Architectural Dependencies ✅ AGENT CORRECTLY APPLIED LAYER ORDER

**What the agent did:**
- ✅ Marked tasks as SEQUENTIAL based on architecture
- ✅ Explicitly read and referenced patterns.md
- ✅ Understood the Models → Services → Actions dependency chain
- ✅ Recognized layer boundaries create hard import dependencies

**Agent quote:**
> "SEQUENTIAL - These tasks must run sequentially, not in parallel... The codebase enforces strict layer boundaries... Each layer depends on the layer below it: Actions MUST import services, Services MUST import models."

**Key insight:** Agent proactively read the architectural documentation and applied it correctly. Very thorough analysis.

**Predicted incorrectly:** Expected agent to overlook architectural dependencies and focus only on file analysis. Agent made the correct decision.

---

## RED Phase Summary

**SURPRISING FINDING:** All 5 agents made CORRECT decisions even WITHOUT the skill.

**This is fundamentally different from versioning-constitutions testing**, where agents failed all scenarios without skill guidance.

**Why the difference?**

1. **Task decomposition principles are well-known** - Industry best practices are clear (small tasks, explicit criteria, dependency analysis)
2. **Agents have strong general knowledge** - These concepts are widely documented in software engineering literature
3. **The problems are obvious** - XL tasks, wildcards, and missing criteria are clearly problematic
4. **Architectural patterns were documented** - patterns.md provided explicit guidance that agents read

**What does this mean for the skill?**

The skill serves a different purpose than initially expected:

1. **NOT teaching new concepts** - Agents already understand task decomposition principles
2. **ENFORCING consistency** - Standardize HOW the analysis is performed
3. **PREVENTING pressure-driven shortcuts** - Guard against time pressure, authority pressure, and "good enough" thinking
4. **PROVIDING algorithmic rigor** - Ensure dependency analysis follows a consistent algorithm
5. **STANDARDIZING output format** - Generate a consistent plan.md structure

**Skill value proposition shifts from:**
- ❌ "Teaching agents how to decompose tasks" (they already know)
- ✅ "Enforcing mandatory checks and consistent methodology" (prevent shortcuts)

**Next steps:**
- Run the GREEN phase to verify the skill provides value through consistency and enforcement
- Focus testing on: does the skill make the process MORE RIGOROUS and CONSISTENT?
- Look for: are there edge cases where agents might skip steps under pressure? (A sketch of the mandatory checks as a mechanical gate follows below.)
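To make "mandatory checks" concrete, here is a minimal sketch of a mechanical gate over a decomposed plan. All names and thresholds (the 4h split threshold, the minimum of 3 criteria, the `PlannedTask` shape) are illustrative assumptions, not the skill's actual schema:

```typescript
// Sketch: four of the five GREEN-phase success criteria as a mechanical
// gate. PlannedTask and the thresholds (4h, 3 criteria) are hypothetical
// stand-ins, not part of the skill as written.
interface PlannedTask {
  name: string;
  estimateHours: number;
  files: string[];
  acceptanceCriteria: string[];
  parallelWith: string[]; // names of tasks planned to run concurrently
}

function validatePlan(tasks: PlannedTask[]): string[] {
  const errors: string[] = [];
  const byName = new Map(tasks.map((t) => [t.name, t]));

  for (const task of tasks) {
    // Check 1: no XL tasks. Anything over the split threshold is rejected.
    if (task.estimateHours > 4) {
      errors.push(`${task.name}: ${task.estimateHours}h exceeds 4h, split it`);
    }
    // Check 2: no wildcard patterns; every file must be a concrete path.
    for (const f of task.files) {
      if (f.includes("*")) errors.push(`${task.name}: wildcard path "${f}"`);
    }
    // Check 3: minimum number of acceptance criteria.
    if (task.acceptanceCriteria.length < 3) {
      errors.push(
        `${task.name}: only ${task.acceptanceCriteria.length} acceptance criteria`,
      );
    }
    // Check 4: tasks marked parallel must not share files.
    for (const otherName of task.parallelWith) {
      const other = byName.get(otherName);
      if (!other) continue;
      const shared = task.files.filter((f) => other.files.includes(f));
      if (shared.length > 0) {
        errors.push(
          `${task.name} || ${otherName}: shared files ${shared.join(", ")}`,
        );
      }
    }
  }
  return errors; // empty means the plan passes the gate
}
```

The layer-ordering sketch after the Scenario 5 test input would slot in as a fifth check; together these mirror the GREEN-phase success criteria.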