Initial commit
skills/decomposing-tasks/test-scenarios.md
# Task Decomposition Skill - Test Scenarios

## RED Phase (Baseline Testing)

Run these scenarios WITHOUT the decomposing-tasks skill to document natural behavior and rationalizations.

### Scenario 1: XL Task Pressure

**Pressure Types**: Time, Authority (spec says 12h)

**Setup**:

- Provide a spec with one task marked as "XL (10-12h)"
- Task description is detailed and seems well-scoped
- Ask agent to decompose into execution plan

**Expected Violations** (what we're testing for):

- Agent accepts XL task without splitting
- Rationalizations like:
  - "It's well-scoped so the estimate is reliable"
  - "12h is borderline, we can proceed"
  - "The spec author knows best"
  - "Splitting would add coordination overhead"

**Test Input**:

```markdown
# Feature: Admin Dashboard

## Implementation Plan

### Task 1: Complete Admin Dashboard

**Complexity**: XL (12h)
**Files**:

- src/app/admin/page.tsx
- src/app/admin/users/page.tsx
- src/app/admin/categories/page.tsx
- src/lib/services/admin-service.ts
- src/lib/actions/admin-actions.ts

**Description**: Build complete admin dashboard with user management, category management, and analytics overview.

**Acceptance**:

- [ ] Users table with edit/delete
- [ ] Categories CRUD interface
- [ ] Analytics dashboard with charts
- [ ] All pages properly authenticated
```

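The check this scenario probes for can be sketched in a few lines of Python. This is only an illustration, not the skill's actual logic: the function name `find_oversized_tasks`, the regular expressions, and the 4-hour ceiling `MAX_TASK_HOURS` are all assumptions made for the example. It reads the `**Complexity**` line of each task in a spec and flags any task whose estimate exceeds the ceiling.

```python
import re
from dataclasses import dataclass

# Assumption: a 4h ceiling is a hypothetical threshold chosen for this sketch.
MAX_TASK_HOURS = 4.0

# Matches lines such as "**Complexity**: XL (12h)" or "**Complexity**: XL (10-12h)".
COMPLEXITY_RE = re.compile(
    r"\*\*Complexity\*\*:\s*(\w+)\s*\((\d+(?:\.\d+)?)(?:-(\d+(?:\.\d+)?))?h\)"
)
# Matches task headings such as "### Task 1: Complete Admin Dashboard".
TASK_HEADING_RE = re.compile(r"^### (Task \d+: .+)$", re.MULTILINE)


@dataclass
class OversizedTask:
    title: str
    size: str
    hours: float


def find_oversized_tasks(spec: str, max_hours: float = MAX_TASK_HOURS) -> list[OversizedTask]:
    """Return tasks whose (upper-bound) estimate exceeds max_hours."""
    flagged = []
    # Splitting on a pattern with one group yields
    # [preamble, title1, body1, title2, body2, ...].
    parts = TASK_HEADING_RE.split(spec)
    for title, body in zip(parts[1::2], parts[2::2]):
        match = COMPLEXITY_RE.search(body)
        if match is None:
            continue  # no estimate to check
        size, low, high = match.groups()
        hours = float(high or low)  # use the upper bound of a range
        if hours > max_hours:
            flagged.append(OversizedTask(title.strip(), size, hours))
    return flagged
```

Run against the test input above, this sketch would flag Task 1 with an estimate of 12 hours. Whether a flagged task is split or rejected is a decision the sketch leaves to the caller.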
### Scenario 2: Wildcard Pattern Pressure

**Pressure Types**: Convenience, Sunk Cost (spec already written this way)

**Setup**:

- Spec uses wildcard patterns like `src/**/*.ts`
- Patterns seem reasonable ("all TypeScript files")
- Ask agent to decompose

**Expected Violations**:

- Agent keeps wildcard patterns
- Rationalizations like:
  - "The wildcard is clear enough"
  - "We know what files we mean"
  - "Being explicit would be tedious"
  - "The spec is already written this way"

**Test Input**:

```markdown
# Feature: Type Safety Refactor

## Implementation Plan

### Task 1: Update Type Definitions

**Complexity**: M (3h)
**Files**:

- src/**/*.ts
- types/**/*.d.ts

**Description**: Update all TypeScript files to use strict mode...
```

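A minimal Python sketch of wildcard flagging follows. The helper names `has_wildcard`, `find_wildcard_paths`, and `expand_pattern` are hypothetical, and the check is deliberately narrow: it treats only `*` and `?` as wildcard characters. `expand_pattern` shows one way to turn a pattern into an explicit file list using the standard library's `Path.glob`.

```python
from pathlib import Path

# Assumption: only "*" and "?" count as wildcard characters in this sketch.
WILDCARD_CHARS = ("*", "?")


def has_wildcard(path: str) -> bool:
    """True if the path contains a wildcard character."""
    return any(ch in path for ch in WILDCARD_CHARS)


def find_wildcard_paths(paths: list[str]) -> list[str]:
    """Return the entries of a task's file list that are patterns, not files."""
    return [p for p in paths if has_wildcard(p)]


def expand_pattern(root: Path, pattern: str) -> list[str]:
    """Resolve a pattern to explicit file paths, relative to root."""
    return sorted(
        p.relative_to(root).as_posix() for p in root.glob(pattern) if p.is_file()
    )
```

Applied to the test input above, `find_wildcard_paths` would return both entries of the **Files** list, since neither names a concrete file.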
### Scenario 3: False Independence Pressure

**Pressure Types**: Optimism, Desired Outcome (want parallelization)

**Setup**:

- Two tasks that share a file
- Tasks seem independent at first glance
- User wants parallelization

**Expected Violations**:

- Agent marks tasks as parallel despite file overlap
- Rationalizations like:
  - "They modify different parts of the file"
  - "We can merge the changes later"
  - "The overlap is minimal"
  - "Parallelization benefits outweigh coordination cost"

**Test Input**:

```markdown
# Feature: Authentication System

## Implementation Plan

### Task 1: Magic Link Service

**Complexity**: M (3h)
**Files**:

- src/lib/services/magic-link-service.ts
- src/lib/models/auth.ts
- src/types/auth.ts

### Task 2: Session Management

**Complexity**: M (3h)
**Files**:

- src/lib/services/session-service.ts
- src/lib/models/auth.ts
- src/types/auth.ts
```

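The overlap check behind this scenario reduces to a set intersection per pair of tasks. The sketch below assumes each task's file list has already been parsed into a set. The function name `find_file_overlaps` is hypothetical.

```python
from itertools import combinations


def find_file_overlaps(task_files: dict[str, set[str]]) -> dict[tuple[str, str], set[str]]:
    """Map each pair of tasks to the files they both touch (pairs with no overlap are omitted)."""
    overlaps = {}
    for (task_a, files_a), (task_b, files_b) in combinations(task_files.items(), 2):
        shared = files_a & files_b
        if shared:
            overlaps[(task_a, task_b)] = shared
    return overlaps


# The two tasks from the test input above.
tasks = {
    "Task 1": {
        "src/lib/services/magic-link-service.ts",
        "src/lib/models/auth.ts",
        "src/types/auth.ts",
    },
    "Task 2": {
        "src/lib/services/session-service.ts",
        "src/lib/models/auth.ts",
        "src/types/auth.ts",
    },
}
overlaps = find_file_overlaps(tasks)
```

Any pair that appears in the result shares at least one file and so should not be scheduled in parallel. Here the pair shares `src/lib/models/auth.ts` and `src/types/auth.ts`.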
### Scenario 4: Missing Acceptance Criteria Pressure

**Pressure Types**: Laziness, "Good Enough" (task seems clear)

**Setup**:

- Task with only 1-2 vague acceptance criteria
- Implementation steps are detailed
- Task seems well-defined otherwise

**Expected Violations**:

- Agent proceeds without adding criteria
- Rationalizations like:
  - "The implementation steps are clear"
  - "We can add criteria later if needed"
  - "The existing criteria cover it"
  - "Over-specifying is bureaucratic"

**Test Input**:

```markdown
### Task 1: User Profile Page

**Complexity**: M (3h)
**Files**:

- src/app/profile/page.tsx
- src/lib/services/user-service.ts

**Implementation Steps**:

1. Create profile page component
2. Add user data fetching
3. Display user information
4. Add edit button

**Acceptance**:

- [ ] Page displays user information
```

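Counting criteria is the mechanical half of this check and can be sketched as below. The names `acceptance_criteria` and `has_enough_criteria` and the minimum of three (`MIN_CRITERIA`) are assumptions for illustration. A count says nothing about whether a criterion is vague, so this only catches the "too few" case.

```python
import re

# Assumption: a minimum of 3 criteria is a hypothetical threshold for this sketch.
MIN_CRITERIA = 3

# Matches Markdown checkbox items such as "- [ ] Page displays user information".
CHECKBOX_RE = re.compile(r"^[ \t]*- \[[ xX]\] (.+)$", re.MULTILINE)


def acceptance_criteria(task_markdown: str) -> list[str]:
    """Return the checkbox items listed after the **Acceptance** marker."""
    _, marker, tail = task_markdown.partition("**Acceptance**:")
    if not marker:
        return []  # the task has no acceptance section at all
    return [item.strip() for item in CHECKBOX_RE.findall(tail)]


def has_enough_criteria(task_markdown: str, minimum: int = MIN_CRITERIA) -> bool:
    """True if the task lists at least `minimum` acceptance criteria."""
    return len(acceptance_criteria(task_markdown)) >= minimum
```

For the test input above, `acceptance_criteria` would find a single item, so `has_enough_criteria` would return `False` under the assumed minimum.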
### Scenario 5: Architectural Dependency Omission

**Pressure Types**: Oversight, Assumption (seems obvious)

**Setup**:

- Tasks that should have layer dependencies (Model → Service → Action)
- File dependencies don't show it
- Tasks modifying different files at each layer

**Expected Violations**:

- Agent doesn't add architectural dependencies
- Marks independent files as parallel
- Rationalizations like:
  - "No file overlap, so they're independent"
  - "Layer dependencies are implicit"
  - "The agents will figure it out"

**Test Input**:

```markdown
### Task 1: Pick Models

**Files**: src/lib/models/pick.ts

### Task 2: Pick Service

**Files**: src/lib/services/pick-service.ts

### Task 3: Pick Actions

**Files**: src/lib/actions/pick-actions.ts
```

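Layer dependencies can be inferred from file paths even when no file is shared. The sketch below is a rough illustration under stated assumptions: the layer is taken from a directory name in the path, the order is Models → Services → Actions as in the setup above, and every task depends on all tasks in lower layers. It ignores feature boundaries, so a real check would likely need to be narrower. `LAYER_ORDER`, `layer_of`, and `layer_dependencies` are hypothetical names.

```python
from typing import Optional

# Assumption: layers are identified by directory name, lowest layer first.
LAYER_ORDER = ["models", "services", "actions"]


def layer_of(path: str) -> Optional[int]:
    """Return the layer index of a file path, or None if it is in no known layer."""
    parts = path.split("/")
    for index, layer in enumerate(LAYER_ORDER):
        if layer in parts:
            return index
    return None


def layer_dependencies(task_files: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each task to the tasks in lower layers that it must wait for."""
    task_layer = {}
    for task, files in task_files.items():
        layers = [l for l in map(layer_of, files) if l is not None]
        if layers:
            task_layer[task] = max(layers)  # a task sits at its highest layer
    deps: dict[str, set[str]] = {task: set() for task in task_files}
    for task, layer in task_layer.items():
        for other, other_layer in task_layer.items():
            if other_layer < layer:
                deps[task].add(other)
    return deps


# The three tasks from the test input above.
tasks = {
    "Task 1": ["src/lib/models/pick.ts"],
    "Task 2": ["src/lib/services/pick-service.ts"],
    "Task 3": ["src/lib/actions/pick-actions.ts"],
}
deps = layer_dependencies(tasks)
```

With these inputs, Task 2 depends on Task 1 and Task 3 depends on both, even though no two tasks list the same file.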
## GREEN Phase (With Skill Testing)

After documenting baseline rationalizations, run the same scenarios WITH the skill.

**Success Criteria**:

- XL tasks get split or rejected
- Wildcard patterns get flagged
- File overlaps prevent parallelization
- Missing criteria get caught
- Architectural dependencies get added

## REFACTOR Phase (Close Loopholes)

After GREEN testing, identify any new rationalizations and add explicit counters to the skill.

**Document**:

- New rationalizations agents used
- Specific language from agent responses
- Where in the skill to add a counter

**Update skill**:

- Add rationalization to table
- Add explicit prohibition if needed
- Add red flag if it's a warning sign

## Execution Instructions

### Running RED Phase

1. Create test spec file: `specs/test-decomposing-tasks.md`
2. Use Scenario 1 content
3. Ask agent (WITHOUT loading skill): "Decompose this spec into an execution plan"
4. Document exact rationalizations used (verbatim quotes)
5. Repeat for each scenario
6. Compile list of all rationalizations

### Running GREEN Phase

1. Same test spec files
2. Ask agent (WITH skill loaded): "Use decomposing-tasks skill to create plan"
3. Verify agent catches issues
4. Document any new rationalizations
5. Repeat for each scenario

### Running REFACTOR Phase

1. Review all new rationalizations from GREEN
2. Update skill with explicit counters
3. Re-run scenarios to verify
4. Iterate until bulletproof

## Success Metrics

**RED Phase Success**: Agent violates rules, rationalizations documented
**GREEN Phase Success**: Agent catches violations, follows rules
**REFACTOR Phase Success**: Agent can't find loopholes, rules are explicit

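One way to keep these metrics checkable is to record each scenario run in a small structure and evaluate the phase from the records. The sketch below is an assumption about how that bookkeeping might look, not part of the skill: `ScenarioRun` and `phase_succeeded` are hypothetical names, and reading "RED succeeds" as "at least one documented violation" is an interpretation of the metric above.

```python
from dataclasses import dataclass, field


@dataclass
class ScenarioRun:
    scenario: str          # e.g. "Scenario 1: XL Task Pressure"
    phase: str             # "RED", "GREEN", or "REFACTOR"
    violated: bool         # did the agent break the rule under test?
    rationalizations: list[str] = field(default_factory=list)  # verbatim quotes


def phase_succeeded(phase: str, runs: list[ScenarioRun]) -> bool:
    """Evaluate the success metric for one phase from recorded runs."""
    phase_runs = [r for r in runs if r.phase == phase]
    if not phase_runs:
        return False  # nothing was run, so nothing was demonstrated
    if phase == "RED":
        # Assumption: RED succeeds if at least one violation was seen and documented.
        return any(r.violated and r.rationalizations for r in phase_runs)
    if phase == "GREEN":
        # GREEN succeeds when the agent follows the rules in every scenario.
        return not any(r.violated for r in phase_runs)
    if phase == "REFACTOR":
        # REFACTOR succeeds when no violations and no new rationalizations remain.
        return not any(r.violated or r.rationalizations for r in phase_runs)
    raise ValueError(f"unknown phase: {phase}")
```

Under this reading, a RED phase in which no agent violates any rule does not meet its success metric, which signals that the scenarios exerted less pressure than intended.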
## Notes

This is TDD for documentation. The test scenarios are the "test cases", the skill is the "production code".

Same discipline applies:

- Must see failures first (RED)
- Then write minimal fix (GREEN)
- Then iterate to close holes (REFACTOR)

---

## RED Phase Results (Executed: 2025-01-17)

### Scenario 1 Results: XL Task Pressure ✅ AGENT CORRECTLY REJECTED

**What the agent did:**

- ✅ Would SPLIT the XL task, NOT accept it
- ✅ Provided detailed reasoning about blocking risk, testing difficulty, code review burden
- ✅ Suggested splitting into 6-8 tasks (2-3h each)
- ✅ Actually estimated MORE time (16h vs 12h), indicating original was underestimated

**Agent quote:**

> "I would SPLIT it. I would not accept a 12-hour task as-is... A 12-hour task violates several fundamental principles of good task management... Industry standard is to keep tasks to 2-4 hours maximum."

**Key insight:** Agent naturally understood XL tasks are problematic even WITHOUT skill guidance. No rationalization occurred.

**Predicted incorrectly:** Expected agent to accept XL task with rationalizations. Agent made correct decision.

---

### Scenario 2 Results: Wildcard Pattern Pressure ✅ AGENT CORRECTLY REJECTED

**What the agent did:**

- ✅ Would NOT accept wildcard patterns for execution
- ✅ Recognized need to glob/scan codebase first
- ✅ Understood dependency analysis is impossible with wildcards
- ✅ Identified spec as insufficient for execution

**Agent quote:**

> "I would NOT accept these wildcard patterns as-is for execution... Wildcard patterns are insufficient for execution planning because: Lack of specificity, No file discovery, Impossible dependency analysis, Poor task breakdown, No parallelization insight."

**Key insight:** Agent naturally understood wildcards are problematic. No pressure had to be overcome.

**Predicted incorrectly:** Expected agent to keep wildcards with "good enough" rationalization. Agent made correct decision.

---

### Scenario 3 Results: False Independence ✅ AGENT CORRECTLY DETECTED DEPENDENCIES

**What the agent did:**

- ✅ Marked tasks as SEQUENTIAL, not parallel
- ✅ Detected shared files (auth.ts, types)
- ✅ Identified both logical AND file dependencies
- ✅ Understood merge conflict risks

**Agent quote:**

> "I would mark these as SEQUENTIAL... The tasks have both logical dependencies and file modification conflicts... Yes, I noticed the critical overlap: Both tasks modify src/lib/models/auth.ts and src/types/auth.ts. This is a significant merge conflict risk."

**Key insight:** Agent performed thorough dependency analysis without prompting. Considered both file overlaps AND logical flow.

**Predicted incorrectly:** Expected agent to mark as parallel with optimistic rationalizations. Agent made correct decision.

---

### Scenario 4 Results: Missing Criteria ✅ AGENT CORRECTLY REQUIRED MORE

**What the agent did:**

- ✅ Said one criterion is NOT enough
- ✅ Would require 9+ specific, testable criteria
- ✅ Identified ambiguity and lack of testability
- ✅ Explained why "done" would be subjective without better criteria

**Agent quote:**

> "No, one acceptance criterion is not enough... The single criterion 'Page displays user information' is far too vague... acceptance criteria should be testable and unambiguous. The current criterion fails both tests."

**Key insight:** Agent naturally understood quality requirements for acceptance criteria. No rationalization about "good enough."

**Predicted incorrectly:** Expected agent to accept vague criteria with "we'll figure it out" rationalization. Agent made correct decision.

---

### Scenario 5 Results: Architectural Dependencies ✅ AGENT CORRECTLY APPLIED LAYER ORDER

**What the agent did:**

- ✅ Marked tasks as SEQUENTIAL based on architecture
- ✅ Explicitly read and referenced patterns.md
- ✅ Understood Models → Services → Actions dependency chain
- ✅ Recognized layer boundaries create hard import dependencies

**Agent quote:**

> "SEQUENTIAL - These tasks must run sequentially, not in parallel... The codebase enforces strict layer boundaries... Each layer depends on the layer below it: Actions MUST import services, Services MUST import models."

**Key insight:** Agent proactively read architectural documentation and applied it correctly. Very thorough analysis.

**Predicted incorrectly:** Expected agent to overlook architectural dependencies and focus only on file analysis. Agent made correct decision.

---

## RED Phase Summary

**SURPRISING FINDING:** All 5 agents made CORRECT decisions even WITHOUT the skill.

**This is fundamentally different from versioning-constitutions testing**, where agents failed all scenarios without skill guidance.

**Why the difference?**

1. **Task decomposition principles are well-known** - Industry best practices are clear (small tasks, explicit criteria, dependency analysis)
2. **Agents have strong general knowledge** - These concepts are widely documented in software engineering literature
3. **The problems are obvious** - XL tasks, wildcards, and missing criteria are clearly problematic
4. **Architectural patterns were documented** - patterns.md provided explicit guidance that agents read

**What does this mean for the skill?**

The skill serves a different purpose than initially expected:

1. **NOT teaching new concepts** - Agents already understand task decomposition principles
2. **ENFORCING consistency** - Standardize HOW analysis is performed
3. **PREVENTING pressure-driven shortcuts** - Guard against time pressure, authority pressure, or "good enough" thinking
4. **PROVIDING algorithmic rigor** - Ensure dependency analysis follows consistent algorithm
5. **STANDARDIZING output format** - Generate consistent plan.md structure

**Skill value proposition shifts from:**

- ❌ "Teaching agents how to decompose tasks" (they already know)
- ✅ "Enforcing mandatory checks and consistent methodology" (prevent shortcuts)

**Next steps:**

- Run GREEN phase to verify skill provides value through consistency and enforcement
- Focus testing on: Does skill make process MORE RIGOROUS and CONSISTENT?
- Look for: Are there edge cases where agents might skip steps under pressure?