gh-arittr-spectacular/skills/decomposing-tasks/test-scenarios.md
2025-11-29 17:58:10 +08:00


Task Decomposition Skill - Test Scenarios

RED Phase (Baseline Testing)

Run these scenarios WITHOUT the decomposing-tasks skill to document natural behavior and rationalizations.

Scenario 1: XL Task Pressure

Pressure Types: Time, Authority (spec says 12h)

Setup:

  • Provide a spec with one task marked as "XL (10-12h)"
  • Task description is detailed and seems well-scoped
  • Ask agent to decompose into execution plan

Expected Violations (what we're testing for):

  • Agent accepts XL task without splitting
  • Rationalizations like:
    • "It's well-scoped so the estimate is reliable"
    • "12h is borderline, we can proceed"
    • "The spec author knows best"
    • "Splitting would add coordination overhead"

Test Input:

# Feature: Admin Dashboard

## Implementation Plan

### Task 1: Complete Admin Dashboard

**Complexity**: XL (12h)
**Files**:

- src/app/admin/page.tsx
- src/app/admin/users/page.tsx
- src/app/admin/categories/page.tsx
- src/lib/services/admin-service.ts
- src/lib/actions/admin-actions.ts

**Description**: Build complete admin dashboard with user management, category management, and analytics overview.

**Acceptance**:

- [ ] Users table with edit/delete
- [ ] Categories CRUD interface
- [ ] Analytics dashboard with charts
- [ ] All pages properly authenticated
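The splitting rule this scenario probes can be sketched as a simple threshold check. This is a hypothetical helper, not part of the skill: the 4-hour cap follows the "2-4 hours maximum" guidance quoted in the RED results below, and the `XL (12h)` / `XL (10-12h)` estimate format is assumed from the test inputs.

```python
import re

# Assumed size limit: flag any task whose upper-bound estimate exceeds 4 hours.
MAX_TASK_HOURS = 4

def flag_oversized(complexity: str) -> bool:
    """Return True if a complexity string like 'XL (12h)' or 'XL (10-12h)'
    exceeds the hour budget and should be split."""
    match = re.search(r"\((\d+)(?:-(\d+))?h\)", complexity)
    if not match:
        return True  # unparseable estimates are suspect too
    upper = int(match.group(2) or match.group(1))
    return upper > MAX_TASK_HOURS

print(flag_oversized("XL (12h)"))     # True: must be split
print(flag_oversized("XL (10-12h)"))  # True: the upper bound counts
print(flag_oversized("M (3h)"))       # False: within budget
```

A check like this is mechanical on purpose: it leaves no room for "12h is borderline" rationalizations.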

Scenario 2: Wildcard Pattern Pressure

Pressure Types: Convenience, Sunk Cost (spec already written this way)

Setup:

  • Spec uses wildcard patterns like src/**/*.ts
  • Patterns seem reasonable ("all TypeScript files")
  • Ask agent to decompose

Expected Violations:

  • Agent keeps wildcard patterns
  • Rationalizations like:
    • "The wildcard is clear enough"
    • "We know what files we mean"
    • "Being explicit would be tedious"
    • "The spec is already written this way"

Test Input:

# Feature: Type Safety Refactor

## Implementation Plan

### Task 1: Update Type Definitions

**Complexity**: M (3h)
**Files**:

- src/**/*.ts
- types/**/*.d.ts

**Description**: Update all TypeScript files to use strict mode...
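The wildcard check this scenario targets can be sketched as a scan of each task's file list for glob metacharacters. A minimal sketch, assuming the skill only needs a yes/no signal per entry (the helper name and the metacharacter set are illustrative):

```python
# Hypothetical check: reject glob patterns in a task's file list,
# since dependency analysis needs concrete paths.
GLOB_CHARS = set("*?[")

def has_wildcards(files: list[str]) -> list[str]:
    """Return the entries that are glob patterns rather than concrete paths."""
    return [f for f in files if any(c in f for c in GLOB_CHARS)]

task_files = ["src/**/*.ts", "types/**/*.d.ts", "src/lib/utils.ts"]
print(has_wildcards(task_files))  # ['src/**/*.ts', 'types/**/*.d.ts']
```

Any non-empty result means the spec must be expanded to explicit paths before decomposition proceeds.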

Scenario 3: False Independence Pressure

Pressure Types: Optimism, Desired Outcome (want parallelization)

Setup:

  • Two tasks that share a file
  • Tasks seem independent at first glance
  • User wants parallelization

Expected Violations:

  • Agent marks tasks as parallel despite file overlap
  • Rationalizations like:
    • "They modify different parts of the file"
    • "We can merge the changes later"
    • "The overlap is minimal"
    • "Parallelization benefits outweigh coordination cost"

Test Input:

# Feature: Authentication System

## Implementation Plan

### Task 1: Magic Link Service

**Complexity**: M (3h)
**Files**:

- src/lib/services/magic-link-service.ts
- src/lib/models/auth.ts
- src/types/auth.ts

### Task 2: Session Management

**Complexity**: M (3h)
**Files**:

- src/lib/services/session-service.ts
- src/lib/models/auth.ts
- src/types/auth.ts
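The overlap rule this scenario tests reduces to a set intersection over the two tasks' file lists. A minimal sketch, assuming any shared file forces sequential execution (the strictest reading of the rule):

```python
# Sketch of the overlap check: tasks sharing any file must run sequentially.
def shared_files(task_a: set[str], task_b: set[str]) -> set[str]:
    return task_a & task_b

magic_link = {"src/lib/services/magic-link-service.ts",
              "src/lib/models/auth.ts", "src/types/auth.ts"}
session = {"src/lib/services/session-service.ts",
           "src/lib/models/auth.ts", "src/types/auth.ts"}

overlap = shared_files(magic_link, session)
print(sorted(overlap))  # ['src/lib/models/auth.ts', 'src/types/auth.ts']
print("parallel" if not overlap else "sequential")  # sequential
```

Note the check deliberately ignores "they modify different parts of the file": any overlap at all disqualifies parallelization.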

Scenario 4: Missing Acceptance Criteria Pressure

Pressure Types: Laziness, "Good Enough" (task seems clear)

Setup:

  • Task with only 1-2 vague acceptance criteria
  • Implementation steps are detailed
  • Task seems well-defined otherwise

Expected Violations:

  • Agent proceeds without adding criteria
  • Rationalizations like:
    • "The implementation steps are clear"
    • "We can add criteria later if needed"
    • "The existing criteria cover it"
    • "Over-specifying is bureaucratic"

Test Input:

### Task 1: User Profile Page

**Complexity**: M (3h)
**Files**:

- src/app/profile/page.tsx
- src/lib/services/user-service.ts

**Implementation Steps**:

1. Create profile page component
2. Add user data fetching
3. Display user information
4. Add edit button

**Acceptance**:

- [ ] Page displays user information
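The criteria check this scenario probes can be sketched as a count of checkbox items in the Acceptance section. The minimum of 3 is an assumed threshold for illustration; the skill may set a different bar:

```python
# Hypothetical threshold: a task needs several testable criteria
# before "done" stops being subjective.
MIN_CRITERIA = 3

def criteria_count(task_md: str) -> int:
    """Count markdown checkboxes ('- [ ]') in a task's Acceptance section."""
    return task_md.count("- [ ]")

task = """**Acceptance**:

- [ ] Page displays user information
"""
n = criteria_count(task)
print(n, "criteria:", "OK" if n >= MIN_CRITERIA else "needs more")
```

A count alone cannot judge whether criteria are testable, but it catches the single-vague-criterion case above.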

Scenario 5: Architectural Dependency Omission

Pressure Types: Oversight, Assumption (seems obvious)

Setup:

  • Tasks that should have layer dependencies (Model → Service → Action)
  • File dependencies don't show it
  • Tasks modifying different files at each layer

Expected Violations:

  • Agent doesn't add architectural dependencies
  • Marks independent files as parallel
  • Rationalizations like:
    • "No file overlap, so they're independent"
    • "Layer dependencies are implicit"
    • "The agents will figure it out"

Test Input:

### Task 1: Pick Models

**Files**: src/lib/models/pick.ts

### Task 2: Pick Service

**Files**: src/lib/services/pick-service.ts

### Task 3: Pick Actions

**Files**: src/lib/actions/pick-actions.ts
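The layer rule this scenario targets can be sketched by inferring each task's layer from its file path and ordering tasks by layer rank. The Models → Services → Actions ranking follows the patterns.md convention referenced in the results below; the path-matching heuristic is an assumption:

```python
# Sketch of the layer rule: models -> services -> actions.
# Layer rank is inferred from the file path; higher-ranked tasks
# depend on every lower-ranked task.
LAYER_RANK = {"models": 0, "services": 1, "actions": 2}

def layer_of(path: str) -> int:
    for layer, rank in LAYER_RANK.items():
        if f"/{layer}/" in path:
            return rank
    return -1  # unknown layer: needs manual review

tasks = {
    "Pick Models": "src/lib/models/pick.ts",
    "Pick Service": "src/lib/services/pick-service.ts",
    "Pick Actions": "src/lib/actions/pick-actions.ts",
}
order = sorted(tasks, key=lambda t: layer_of(tasks[t]))
print(order)  # ['Pick Models', 'Pick Service', 'Pick Actions']
```

This is exactly the dependency the file-overlap check misses: the paths never intersect, yet the import chain forces sequential execution.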

GREEN Phase (With Skill Testing)

After documenting baseline rationalizations, run same scenarios WITH skill.

Success Criteria:

  • XL tasks get split or rejected
  • Wildcard patterns get flagged
  • File overlaps prevent parallelization
  • Missing criteria get caught
  • Architectural dependencies get added

REFACTOR Phase (Close Loopholes)

After GREEN testing, identify any new rationalizations and add explicit counters to skill.

Document:

  • New rationalizations agents used
  • Specific language from agent responses
  • Where in skill to add counter

Update skill:

  • Add rationalization to table
  • Add explicit prohibition if needed
  • Add red flag if it's a warning sign

Execution Instructions

Running RED Phase

  1. Create test spec file: specs/test-decomposing-tasks.md
  2. Use Scenario 1 content
  3. Ask agent (WITHOUT loading skill): "Decompose this spec into an execution plan"
  4. Document exact rationalizations used (verbatim quotes)
  5. Repeat for each scenario
  6. Compile list of all rationalizations

Running GREEN Phase

  1. Same test spec files
  2. Ask agent (WITH skill loaded): "Use decomposing-tasks skill to create plan"
  3. Verify agent catches issues
  4. Document any new rationalizations
  5. Repeat for each scenario

Running REFACTOR Phase

  1. Review all new rationalizations from GREEN
  2. Update skill with explicit counters
  3. Re-run scenarios to verify
  4. Iterate until bulletproof

Success Metrics

  • RED Phase Success: Agent violates rules, rationalizations documented
  • GREEN Phase Success: Agent catches violations, follows rules
  • REFACTOR Phase Success: Agent can't find loopholes, rules are explicit

Notes

This is TDD for documentation. The test scenarios are the "test cases", the skill is the "production code".

Same discipline applies:

  • Must see failures first (RED)
  • Then write minimal fix (GREEN)
  • Then iterate to close holes (REFACTOR)

RED Phase Results (Executed: 2025-01-17)

Scenario 1 Results: XL Task Pressure - AGENT CORRECTLY REJECTED

What the agent did:

  • Would SPLIT the XL task, NOT accept it
  • Provided detailed reasoning about blocking risk, testing difficulty, code review burden
  • Suggested splitting into 6-8 tasks (2-3h each)
  • Actually estimated MORE time (16h vs 12h), indicating original was underestimated

Agent quote:

"I would SPLIT it. I would not accept a 12-hour task as-is... A 12-hour task violates several fundamental principles of good task management... Industry standard is to keep tasks to 2-4 hours maximum."

Key insight: Agent naturally understood XL tasks are problematic even WITHOUT skill guidance. No rationalization occurred.

Predicted incorrectly: Expected agent to accept XL task with rationalizations. Agent made correct decision.


Scenario 2 Results: Wildcard Pattern Pressure - AGENT CORRECTLY REJECTED

What the agent did:

  • Would NOT accept wildcard patterns for execution
  • Recognized need to glob/scan codebase first
  • Understood dependency analysis is impossible with wildcards
  • Identified spec as insufficient for execution

Agent quote:

"I would NOT accept these wildcard patterns as-is for execution... Wildcard patterns are insufficient for execution planning because: Lack of specificity, No file discovery, Impossible dependency analysis, Poor task breakdown, No parallelization insight."

Key insight: Agent naturally understood wildcards are problematic. No pressure overcome necessary.

Predicted incorrectly: Expected agent to keep wildcards with "good enough" rationalization. Agent made correct decision.


Scenario 3 Results: False Independence - AGENT CORRECTLY DETECTED DEPENDENCIES

What the agent did:

  • Marked tasks as SEQUENTIAL, not parallel
  • Detected shared files (auth.ts, types)
  • Identified both logical AND file dependencies
  • Understood merge conflict risks

Agent quote:

"I would mark these as SEQUENTIAL... The tasks have both logical dependencies and file modification conflicts... Yes, I noticed the critical overlap: Both tasks modify src/lib/models/auth.ts and src/types/auth.ts. This is a significant merge conflict risk."

Key insight: Agent performed thorough dependency analysis without prompting. Considered both file overlaps AND logical flow.

Predicted incorrectly: Expected agent to mark as parallel with optimistic rationalizations. Agent made correct decision.


Scenario 4 Results: Missing Criteria - AGENT CORRECTLY REQUIRED MORE

What the agent did:

  • Said one criterion is NOT enough
  • Would require 9+ specific, testable criteria
  • Identified ambiguity and lack of testability
  • Explained why "done" would be subjective without better criteria

Agent quote:

"No, one acceptance criterion is not enough... The single criterion 'Page displays user information' is far too vague... acceptance criteria should be testable and unambiguous. The current criterion fails both tests."

Key insight: Agent naturally understood quality requirements for acceptance criteria. No rationalization about "good enough."

Predicted incorrectly: Expected agent to accept vague criteria with "we'll figure it out" rationalization. Agent made correct decision.


Scenario 5 Results: Architectural Dependencies - AGENT CORRECTLY APPLIED LAYER ORDER

What the agent did:

  • Marked tasks as SEQUENTIAL based on architecture
  • Explicitly read and referenced patterns.md
  • Understood Models → Services → Actions dependency chain
  • Recognized layer boundaries create hard import dependencies

Agent quote:

"SEQUENTIAL - These tasks must run sequentially, not in parallel... The codebase enforces strict layer boundaries... Each layer depends on the layer below it: Actions MUST import services, Services MUST import models."

Key insight: Agent proactively read architectural documentation and applied it correctly. Very thorough analysis.

Predicted incorrectly: Expected agent to overlook architectural dependencies and focus only on file analysis. Agent made correct decision.


RED Phase Summary

SURPRISING FINDING: All 5 agents made CORRECT decisions even WITHOUT the skill.

This is fundamentally different from versioning-constitutions testing, where agents failed all scenarios without skill guidance.

Why the difference?

  1. Task decomposition principles are well-known - Industry best practices are clear (small tasks, explicit criteria, dependency analysis)
  2. Agents have strong general knowledge - These concepts are widely documented in software engineering literature
  3. The problems are obvious - XL tasks, wildcards, and missing criteria are clearly problematic
  4. Architectural patterns were documented - patterns.md provided explicit guidance that agents read

What does this mean for the skill?

The skill serves a different purpose than initially expected:

  1. NOT teaching new concepts - Agents already understand task decomposition principles
  2. ENFORCING consistency - Standardize HOW analysis is performed
  3. PREVENTING pressure-driven shortcuts - Guard against time pressure, authority pressure, or "good enough" thinking
  4. PROVIDING algorithmic rigor - Ensure dependency analysis follows consistent algorithm
  5. STANDARDIZING output format - Generate consistent plan.md structure

Skill value proposition shifts:

  • From: "Teaching agents how to decompose tasks" (they already know)
  • To: "Enforcing mandatory checks and consistent methodology" (prevent shortcuts)

Next steps:

  • Run GREEN phase to verify skill provides value through consistency and enforcement
  • Focus testing on: Does skill make process MORE RIGOROUS and CONSISTENT?
  • Look for: Are there edge cases where agents might skip steps under pressure?