Initial commit

This commit is contained in:
Zhongwei Li
2025-11-29 17:58:10 +08:00
commit 62e38f6386
28 changed files with 8679 additions and 0 deletions

# Task Decomposition Skill - Test Scenarios
## RED Phase (Baseline Testing)
Run these scenarios WITHOUT the decomposing-tasks skill to document natural behavior and rationalizations.
### Scenario 1: XL Task Pressure
**Pressure Types**: Time, Authority (spec says 12h)
**Setup**:
- Provide a spec with one task marked as "XL (10-12h)"
- Task description is detailed and seems well-scoped
- Ask agent to decompose into execution plan
**Expected Violations** (what we're testing for):
- Agent accepts XL task without splitting
- Rationalizations like:
- "It's well-scoped so the estimate is reliable"
- "12h is borderline, we can proceed"
- "The spec author knows best"
- "Splitting would add coordination overhead"
**Test Input**:
```markdown
# Feature: Admin Dashboard
## Implementation Plan
### Task 1: Complete Admin Dashboard
**Complexity**: XL (12h)
**Files**:
- src/app/admin/page.tsx
- src/app/admin/users/page.tsx
- src/app/admin/categories/page.tsx
- src/lib/services/admin-service.ts
- src/lib/actions/admin-actions.ts
**Description**: Build complete admin dashboard with user management, category management, and analytics overview.
**Acceptance**:
- [ ] Users table with edit/delete
- [ ] Categories CRUD interface
- [ ] Analytics dashboard with charts
- [ ] All pages properly authenticated
```
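The size check this scenario probes can be mechanized as a small plan lint. The sketch below is illustrative, not part of the skill: `needs_split` is a hypothetical helper, and the 4-hour ceiling is an assumption borrowed from the "2-4 hours maximum" guidance quoted in the RED results.

```python
# Illustrative plan lint (not part of the skill): flag tasks whose
# hour estimate exceeds a ceiling. The 4h ceiling is an assumed value.
MAX_TASK_HOURS = 4

def needs_split(task):
    """Return True when a task's hour estimate exceeds MAX_TASK_HOURS."""
    return task["hours"] > MAX_TASK_HOURS

tasks = [{"name": "Complete Admin Dashboard", "hours": 12}]
oversized = [t["name"] for t in tasks if needs_split(t)]
print(oversized)  # ['Complete Admin Dashboard']
```

A check like this makes the XL violation machine-detectable instead of relying on the agent's judgment under pressure.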
### Scenario 2: Wildcard Pattern Pressure
**Pressure Types**: Convenience, Sunk Cost (spec already written this way)
**Setup**:
- Spec uses wildcard patterns like `src/**/*.ts`
- Patterns seem reasonable ("all TypeScript files")
- Ask agent to decompose
**Expected Violations**:
- Agent keeps wildcard patterns
- Rationalizations like:
- "The wildcard is clear enough"
- "We know what files we mean"
- "Being explicit would be tedious"
- "The spec is already written this way"
**Test Input**:
```markdown
# Feature: Type Safety Refactor
## Implementation Plan
### Task 1: Update Type Definitions
**Complexity**: M (3h)
**Files**:
- src/**/*.ts
- types/**/*.d.ts
**Description**: Update all TypeScript files to use strict mode...
```
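The fix this scenario expects — replacing wildcards with explicit file lists — amounts to a glob-expansion step against the real codebase. A minimal sketch, assuming a hypothetical `expand_pattern` helper; the throwaway file tree only mimics the spec's layout:

```python
import tempfile
from pathlib import Path

def expand_pattern(root, pattern):
    """Expand a glob pattern into a sorted list of concrete file paths."""
    return sorted(str(p.relative_to(root)) for p in Path(root).glob(pattern))

# Throwaway tree standing in for the real codebase.
root = Path(tempfile.mkdtemp())
for rel in ("src/lib/a.ts", "src/lib/b.ts", "src/readme.md"):
    (root / rel).parent.mkdir(parents=True, exist_ok=True)
    (root / rel).touch()

print(expand_pattern(root, "src/**/*.ts"))  # ['src/lib/a.ts', 'src/lib/b.ts']
```

Once patterns are expanded, dependency analysis operates on concrete paths rather than guesses about what a wildcard covers.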
### Scenario 3: False Independence Pressure
**Pressure Types**: Optimism, Desired Outcome (want parallelization)
**Setup**:
- Two tasks that share a file
- Tasks seem independent at first glance
- User wants parallelization
**Expected Violations**:
- Agent marks tasks as parallel despite file overlap
- Rationalizations like:
- "They modify different parts of the file"
- "We can merge the changes later"
- "The overlap is minimal"
- "Parallelization benefits outweigh coordination cost"
**Test Input**:
```markdown
# Feature: Authentication System
## Implementation Plan
### Task 1: Magic Link Service
**Complexity**: M (3h)
**Files**:
- src/lib/services/magic-link-service.ts
- src/lib/models/auth.ts
- src/types/auth.ts
### Task 2: Session Management
**Complexity**: M (3h)
**Files**:
- src/lib/services/session-service.ts
- src/lib/models/auth.ts
- src/types/auth.ts
```
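The overlap check this scenario targets is mechanical: intersect each pair of tasks' file lists and force sequencing on any non-empty intersection. A minimal sketch (`shared_files` is a hypothetical helper, not part of the skill):

```python
def shared_files(task_a, task_b):
    """Files modified by both tasks; any overlap forces sequencing."""
    return sorted(set(task_a["files"]) & set(task_b["files"]))

task1 = {"files": ["src/lib/services/magic-link-service.ts",
                   "src/lib/models/auth.ts", "src/types/auth.ts"]}
task2 = {"files": ["src/lib/services/session-service.ts",
                   "src/lib/models/auth.ts", "src/types/auth.ts"]}

overlap = shared_files(task1, task2)
print(overlap)  # ['src/lib/models/auth.ts', 'src/types/auth.ts']
print("sequential" if overlap else "parallel")  # sequential
```

Note this only catches file-level conflicts; logical dependencies with no shared files still need the architectural check from Scenario 5.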
### Scenario 4: Missing Acceptance Criteria Pressure
**Pressure Types**: Laziness, "Good Enough" (task seems clear)
**Setup**:
- Task with only 1-2 vague acceptance criteria
- Implementation steps are detailed
- Task seems well-defined otherwise
**Expected Violations**:
- Agent proceeds without adding criteria
- Rationalizations like:
- "The implementation steps are clear"
- "We can add criteria later if needed"
- "The existing criteria cover it"
- "Over-specifying is bureaucratic"
**Test Input**:
```markdown
### Task 1: User Profile Page
**Complexity**: M (3h)
**Files**:
- src/app/profile/page.tsx
- src/lib/services/user-service.ts
**Implementation Steps**:
1. Create profile page component
2. Add user data fetching
3. Display user information
4. Add edit button
**Acceptance**:
- [ ] Page displays user information
```
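A criteria check like the one this scenario probes can also be partially automated. This sketch is illustrative only: the minimum of 3 criteria and the vague-word list are assumptions, not values taken from the skill.

```python
MIN_CRITERIA = 3  # assumed minimum, not a value from the skill
VAGUE_WORDS = {"properly", "correctly", "information"}  # illustrative list

def criteria_issues(criteria):
    """Flag too-few or untestably vague acceptance criteria."""
    issues = []
    if len(criteria) < MIN_CRITERIA:
        issues.append(f"too few criteria: {len(criteria)} < {MIN_CRITERIA}")
    issues += [f"vague criterion: {c!r}" for c in criteria
               if any(w in c.lower() for w in VAGUE_WORDS)]
    return issues

print(criteria_issues(["Page displays user information"]))
```

A keyword list can only catch obvious vagueness; judging whether a criterion is genuinely testable still requires the agent's analysis.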
### Scenario 5: Architectural Dependency Omission
**Pressure Types**: Oversight, Assumption (seems obvious)
**Setup**:
- Tasks that should have layer dependencies (Model → Service → Action)
- File dependencies don't show it
- Tasks modifying different files at each layer
**Expected Violations**:
- Agent doesn't add architectural dependencies
- Marks independent files as parallel
- Rationalizations like:
- "No file overlap, so they're independent"
- "Layer dependencies are implicit"
- "The agents will figure it out"
**Test Input**:
```markdown
### Task 1: Pick Models
**Files**: src/lib/models/pick.ts
### Task 2: Pick Service
**Files**: src/lib/services/pick-service.ts
### Task 3: Pick Actions
**Files**: src/lib/actions/pick-actions.ts
```
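The Models → Services → Actions ordering this scenario expects can be inferred from paths alone. A sketch, assuming layer names map directly onto directory names (the helper names are illustrative):

```python
# Layer order mirrors the Models -> Services -> Actions chain.
LAYERS = ["models", "services", "actions"]

def layer_of(path):
    """Index of the deepest-known layer a path belongs to (-1 if none)."""
    for i, layer in enumerate(LAYERS):
        if f"/{layer}/" in path:
            return i
    return -1

def order_tasks(tasks):
    """Sort tasks lowest layer first, yielding a sequential schedule."""
    return sorted(tasks, key=lambda t: max(layer_of(f) for f in t["files"]))

tasks = [
    {"name": "Pick Actions", "files": ["src/lib/actions/pick-actions.ts"]},
    {"name": "Pick Models", "files": ["src/lib/models/pick.ts"]},
    {"name": "Pick Service", "files": ["src/lib/services/pick-service.ts"]},
]
print([t["name"] for t in order_tasks(tasks)])
# ['Pick Models', 'Pick Service', 'Pick Actions']
```

This is exactly the dependency that pure file-overlap analysis misses: the three tasks share no files, yet the import direction between layers makes them sequential.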
## GREEN Phase (With Skill Testing)
After documenting baseline rationalizations, run the same scenarios WITH the skill loaded.
**Success Criteria**:
- XL tasks get split or rejected
- Wildcard patterns get flagged
- File overlaps prevent parallelization
- Missing criteria get caught
- Architectural dependencies get added
## REFACTOR Phase (Close Loopholes)
After GREEN testing, identify any new rationalizations and add explicit counters to skill.
**Document**:
- New rationalizations agents used
- Specific language from agent responses
- Where in skill to add counter
**Update skill**:
- Add rationalization to table
- Add explicit prohibition if needed
- Add red flag if it's a warning sign
## Execution Instructions
### Running RED Phase
1. Create test spec file: `specs/test-decomposing-tasks.md`
2. Use Scenario 1 content
3. Ask agent (WITHOUT loading skill): "Decompose this spec into an execution plan"
4. Document exact rationalizations used (verbatim quotes)
5. Repeat for each scenario
6. Compile list of all rationalizations
### Running GREEN Phase
1. Use the same test spec files
2. Ask agent (WITH skill loaded): "Use decomposing-tasks skill to create plan"
3. Verify agent catches issues
4. Document any new rationalizations
5. Repeat for each scenario
### Running REFACTOR Phase
1. Review all new rationalizations from GREEN
2. Update skill with explicit counters
3. Re-run scenarios to verify
4. Iterate until bulletproof
## Success Metrics
**RED Phase Success**: Agent violates rules, rationalizations documented
**GREEN Phase Success**: Agent catches violations, follows rules
**REFACTOR Phase Success**: Agent can't find loopholes, rules are explicit
## Notes
This is TDD for documentation: the test scenarios are the "test cases," and the skill is the "production code."
Same discipline applies:
- Must see failures first (RED)
- Then write minimal fix (GREEN)
- Then iterate to close holes (REFACTOR)
---
## RED Phase Results (Executed: 2025-01-17)
### Scenario 1 Results: XL Task Pressure ✅ AGENT CORRECTLY REJECTED
**What the agent did:**
- ✅ Would SPLIT the XL task, NOT accept it
- ✅ Provided detailed reasoning about blocking risk, testing difficulty, code review burden
- ✅ Suggested splitting into 6-8 tasks (2-3h each)
- ✅ Actually estimated MORE time (16h vs 12h), indicating the original estimate was too low
**Agent quote:**
> "I would SPLIT it. I would not accept a 12-hour task as-is... A 12-hour task violates several fundamental principles of good task management... Industry standard is to keep tasks to 2-4 hours maximum."
**Key insight:** Agent naturally understood XL tasks are problematic even WITHOUT skill guidance. No rationalization occurred.
**Predicted incorrectly:** Expected agent to accept XL task with rationalizations. Agent made correct decision.
---
### Scenario 2 Results: Wildcard Pattern Pressure ✅ AGENT CORRECTLY REJECTED
**What the agent did:**
- ✅ Would NOT accept wildcard patterns for execution
- ✅ Recognized need to glob/scan codebase first
- ✅ Understood dependency analysis is impossible with wildcards
- ✅ Identified spec as insufficient for execution
**Agent quote:**
> "I would NOT accept these wildcard patterns as-is for execution... Wildcard patterns are insufficient for execution planning because: Lack of specificity, No file discovery, Impossible dependency analysis, Poor task breakdown, No parallelization insight."
**Key insight:** Agent naturally understood wildcards are problematic. No pressure needed to be overcome.
**Predicted incorrectly:** Expected agent to keep wildcards with "good enough" rationalization. Agent made correct decision.
---
### Scenario 3 Results: False Independence ✅ AGENT CORRECTLY DETECTED DEPENDENCIES
**What the agent did:**
- ✅ Marked tasks as SEQUENTIAL, not parallel
- ✅ Detected shared files (auth.ts, types)
- ✅ Identified both logical AND file dependencies
- ✅ Understood merge conflict risks
**Agent quote:**
> "I would mark these as SEQUENTIAL... The tasks have both logical dependencies and file modification conflicts... Yes, I noticed the critical overlap: Both tasks modify src/lib/models/auth.ts and src/types/auth.ts. This is a significant merge conflict risk."
**Key insight:** Agent performed thorough dependency analysis without prompting. Considered both file overlaps AND logical flow.
**Predicted incorrectly:** Expected agent to mark as parallel with optimistic rationalizations. Agent made correct decision.
---
### Scenario 4 Results: Missing Criteria ✅ AGENT CORRECTLY REQUIRED MORE
**What the agent did:**
- ✅ Said one criterion is NOT enough
- ✅ Would require 9+ specific, testable criteria
- ✅ Identified ambiguity and lack of testability
- ✅ Explained why "done" would be subjective without better criteria
**Agent quote:**
> "No, one acceptance criterion is not enough... The single criterion 'Page displays user information' is far too vague... acceptance criteria should be testable and unambiguous. The current criterion fails both tests."
**Key insight:** Agent naturally understood quality requirements for acceptance criteria. No rationalization about "good enough."
**Predicted incorrectly:** Expected agent to accept vague criteria with "we'll figure it out" rationalization. Agent made correct decision.
---
### Scenario 5 Results: Architectural Dependencies ✅ AGENT CORRECTLY APPLIED LAYER ORDER
**What the agent did:**
- ✅ Marked tasks as SEQUENTIAL based on architecture
- ✅ Explicitly read and referenced patterns.md
- ✅ Understood Models → Services → Actions dependency chain
- ✅ Recognized layer boundaries create hard import dependencies
**Agent quote:**
> "SEQUENTIAL - These tasks must run sequentially, not in parallel... The codebase enforces strict layer boundaries... Each layer depends on the layer below it: Actions MUST import services, Services MUST import models."
**Key insight:** Agent proactively read architectural documentation and applied it correctly. Very thorough analysis.
**Predicted incorrectly:** Expected agent to overlook architectural dependencies and focus only on file analysis. Agent made correct decision.
---
## RED Phase Summary
**SURPRISING FINDING:** In all 5 scenarios, the agent made the CORRECT decision even WITHOUT the skill.
**This is fundamentally different from versioning-constitutions testing**, where agents failed all scenarios without skill guidance.
**Why the difference?**
1. **Task decomposition principles are well-known** - Industry best practices are clear (small tasks, explicit criteria, dependency analysis)
2. **Agents have strong general knowledge** - These concepts are widely documented in software engineering literature
3. **The problems are obvious** - XL tasks, wildcards, and missing criteria are clearly problematic
4. **Architectural patterns were documented** - patterns.md provided explicit guidance that agents read
**What does this mean for the skill?**
The skill serves a different purpose than initially expected:
1. **NOT teaching new concepts** - Agents already understand task decomposition principles
2. **ENFORCING consistency** - Standardize HOW analysis is performed
3. **PREVENTING pressure-driven shortcuts** - Guard against time pressure, authority pressure, or "good enough" thinking
4. **PROVIDING algorithmic rigor** - Ensure dependency analysis follows consistent algorithm
5. **STANDARDIZING output format** - Generate consistent plan.md structure
**Skill value proposition shifts from:**
- ❌ "Teaching agents how to decompose tasks" (they already know)
- ✅ "Enforcing mandatory checks and consistent methodology" (prevent shortcuts)
**Next steps:**
- Run GREEN phase to verify skill provides value through consistency and enforcement
- Focus testing on: Does skill make process MORE RIGOROUS and CONSISTENT?
- Look for: Are there edge cases where agents might skip steps under pressure?