Initial commit

2025-11-29 17:58:10 +08:00
commit 62e38f6386
28 changed files with 8679 additions and 0 deletions
--- a/skills/decomposing-tasks/test-scenarios.md
+++ b/skills/decomposing-tasks/test-scenarios.md
@@ -0,0 +1,414 @@
+# Task Decomposition Skill - Test Scenarios
+
+## RED Phase (Baseline Testing)
+
+Run these scenarios WITHOUT the decomposing-tasks skill to document natural behavior and rationalizations.
+
+### Scenario 1: XL Task Pressure
+
+**Pressure Types**: Time, Authority (spec says 12h)
+
+**Setup**:
+
+- Provide a spec with one task marked as "XL (10-12h)"
+- Task description is detailed and seems well-scoped
+- Ask agent to decompose into execution plan
+
+**Expected Violations** (what we're testing for):
+
+- Agent accepts XL task without splitting
+- Rationalizations like:
+  - "It's well-scoped so the estimate is reliable"
+  - "12h is borderline, we can proceed"
+  - "The spec author knows best"
+  - "Splitting would add coordination overhead"
+
+**Test Input**:
+
+```markdown
+# Feature: Admin Dashboard
+
+## Implementation Plan
+
+### Task 1: Complete Admin Dashboard
+
+**Complexity**: XL (12h)
+**Files**:
+
+- src/app/admin/page.tsx
+- src/app/admin/users/page.tsx
+- src/app/admin/categories/page.tsx
+- src/lib/services/admin-service.ts
+- src/lib/actions/admin-actions.ts
+
+**Description**: Build complete admin dashboard with user management, category management, and analytics overview.
+
+**Acceptance**:
+
+- [ ] Users table with edit/delete
+- [ ] Categories CRUD interface
+- [ ] Analytics dashboard with charts
+- [ ] All pages properly authenticated
+```
+
+### Scenario 2: Wildcard Pattern Pressure
+
+**Pressure Types**: Convenience, Sunk Cost (spec already written this way)
+
+**Setup**:
+
+- Spec uses wildcard patterns like `src/**/*.ts`
+- Patterns seem reasonable ("all TypeScript files")
+- Ask agent to decompose
+
+**Expected Violations**:
+
+- Agent keeps wildcard patterns
+- Rationalizations like:
+  - "The wildcard is clear enough"
+  - "We know what files we mean"
+  - "Being explicit would be tedious"
+  - "The spec is already written this way"
+
+**Test Input**:
+
+```markdown
+# Feature: Type Safety Refactor
+
+## Implementation Plan
+
+### Task 1: Update Type Definitions
+
+**Complexity**: M (3h)
+**Files**:
+
+- src/\*_/_.ts
+- types/\*_/_.d.ts
+
+**Description**: Update all TypeScript files to use strict mode...
+```
+
+### Scenario 3: False Independence Pressure
+
+**Pressure Types**: Optimism, Desired Outcome (want parallelization)
+
+**Setup**:
+
+- Two tasks that share a file
+- Tasks seem independent at first glance
+- User wants parallelization
+
+**Expected Violations**:
+
+- Agent marks tasks as parallel despite file overlap
+- Rationalizations like:
+  - "They modify different parts of the file"
+  - "We can merge the changes later"
+  - "The overlap is minimal"
+  - "Parallelization benefits outweigh coordination cost"
+
+**Test Input**:
+
+```markdown
+# Feature: Authentication System
+
+## Implementation Plan
+
+### Task 1: Magic Link Service
+
+**Complexity**: M (3h)
+**Files**:
+
+- src/lib/services/magic-link-service.ts
+- src/lib/models/auth.ts
+- src/types/auth.ts
+
+### Task 2: Session Management
+
+**Complexity**: M (3h)
+**Files**:
+
+- src/lib/services/session-service.ts
+- src/lib/models/auth.ts
+- src/types/auth.ts
+```
+
+### Scenario 4: Missing Acceptance Criteria Pressure
+
+**Pressure Types**: Laziness, "Good Enough" (task seems clear)
+
+**Setup**:
+
+- Task with only 1-2 vague acceptance criteria
+- Implementation steps are detailed
+- Task seems well-defined otherwise
+
+**Expected Violations**:
+
+- Agent proceeds without adding criteria
+- Rationalizations like:
+  - "The implementation steps are clear"
+  - "We can add criteria later if needed"
+  - "The existing criteria cover it"
+  - "Over-specifying is bureaucratic"
+
+**Test Input**:
+
+```markdown
+### Task 1: User Profile Page
+
+**Complexity**: M (3h)
+**Files**:
+
+- src/app/profile/page.tsx
+- src/lib/services/user-service.ts
+
+**Implementation Steps**:
+
+1. Create profile page component
+2. Add user data fetching
+3. Display user information
+4. Add edit button
+
+**Acceptance**:
+
+- [ ] Page displays user information
+```
+
+### Scenario 5: Architectural Dependency Omission
+
+**Pressure Types**: Oversight, Assumption (seems obvious)
+
+**Setup**:
+
+- Tasks that should have layer dependencies (Model → Service → Action)
+- File dependencies don't show it
+- Tasks modifying different files at each layer
+
+**Expected Violations**:
+
+- Agent doesn't add architectural dependencies
+- Marks independent files as parallel
+- Rationalizations like:
+  - "No file overlap, so they're independent"
+  - "Layer dependencies are implicit"
+  - "The agents will figure it out"
+
+**Test Input**:
+
+```markdown
+### Task 1: Pick Models
+
+**Files**: src/lib/models/pick.ts
+
+### Task 2: Pick Service
+
+**Files**: src/lib/services/pick-service.ts
+
+### Task 3: Pick Actions
+
+**Files**: src/lib/actions/pick-actions.ts
+```
+
+## GREEN Phase (With Skill Testing)
+
+After documenting baseline rationalizations, run same scenarios WITH skill.
+
+**Success Criteria**:
+
+- XL tasks get split or rejected
+- Wildcard patterns get flagged
+- File overlaps prevent parallelization
+- Missing criteria get caught
+- Architectural dependencies get added
+
+## REFACTOR Phase (Close Loopholes)
+
+After GREEN testing, identify any new rationalizations and add explicit counters to skill.
+
+**Document**:
+
+- New rationalizations agents used
+- Specific language from agent responses
+- Where in skill to add counter
+
+**Update skill**:
+
+- Add rationalization to table
+- Add explicit prohibition if needed
+- Add red flag if it's a warning sign
+
+## Execution Instructions
+
+### Running RED Phase
+
+1. Create test spec file: `specs/test-decomposing-tasks.md`
+2. Use Scenario 1 content
+3. Ask agent (WITHOUT loading skill): "Decompose this spec into an execution plan"
+4. Document exact rationalizations used (verbatim quotes)
+5. Repeat for each scenario
+6. Compile list of all rationalizations
+
+### Running GREEN Phase
+
+1. Same test spec files
+2. Ask agent (WITH skill loaded): "Use decomposing-tasks skill to create plan"
+3. Verify agent catches issues
+4. Document any new rationalizations
+5. Repeat for each scenario
+
+### Running REFACTOR Phase
+
+1. Review all new rationalizations from GREEN
+2. Update skill with explicit counters
+3. Re-run scenarios to verify
+4. Iterate until bulletproof
+
+## Success Metrics
+
+**RED Phase Success**: Agent violates rules, rationalizations documented
+**GREEN Phase Success**: Agent catches violations, follows rules
+**REFACTOR Phase Success**: Agent can't find loopholes, rules are explicit
+
+## Notes
+
+This is TDD for documentation. The test scenarios are the "test cases", the skill is the "production code".
+
+Same discipline applies:
+
+- Must see failures first (RED)
+- Then write minimal fix (GREEN)
+- Then iterate to close holes (REFACTOR)
+
+---
+
+## RED Phase Results (Executed: 2025-01-17)
+
+### Scenario 1 Results: XL Task Pressure ✅ AGENT CORRECTLY REJECTED
+
+**What the agent did:**
+
+- ✅ Would SPLIT the XL task, NOT accept it
+- ✅ Provided detailed reasoning about blocking risk, testing difficulty, code review burden
+- ✅ Suggested splitting into 6-8 tasks (2-3h each)
+- ✅ Actually estimated MORE time (16h vs 12h), indicating original was underestimated
+
+**Agent quote:**
+
+> "I would SPLIT it. I would not accept a 12-hour task as-is... A 12-hour task violates several fundamental principles of good task management... Industry standard is to keep tasks to 2-4 hours maximum."
+
+**Key insight:** Agent naturally understood XL tasks are problematic even WITHOUT skill guidance. No rationalization occurred.
+
+**Predicted incorrectly:** Expected agent to accept XL task with rationalizations. Agent made correct decision.
+
+---
+
+### Scenario 2 Results: Wildcard Pattern Pressure ✅ AGENT CORRECTLY REJECTED
+
+**What the agent did:**
+
+- ✅ Would NOT accept wildcard patterns for execution
+- ✅ Recognized need to glob/scan codebase first
+- ✅ Understood dependency analysis is impossible with wildcards
+- ✅ Identified spec as insufficient for execution
+
+**Agent quote:**
+
+> "I would NOT accept these wildcard patterns as-is for execution... Wildcard patterns are insufficient for execution planning because: Lack of specificity, No file discovery, Impossible dependency analysis, Poor task breakdown, No parallelization insight."
+
+**Key insight:** Agent naturally understood wildcards are problematic. No pressure overcome necessary.
+
+**Predicted incorrectly:** Expected agent to keep wildcards with "good enough" rationalization. Agent made correct decision.
+
+---
+
+### Scenario 3 Results: False Independence ✅ AGENT CORRECTLY DETECTED DEPENDENCIES
+
+**What the agent did:**
+
+- ✅ Marked tasks as SEQUENTIAL, not parallel
+- ✅ Detected shared files (auth.ts, types)
+- ✅ Identified both logical AND file dependencies
+- ✅ Understood merge conflict risks
+
+**Agent quote:**
+
+> "I would mark these as SEQUENTIAL... The tasks have both logical dependencies and file modification conflicts... Yes, I noticed the critical overlap: Both tasks modify src/lib/models/auth.ts and src/types/auth.ts. This is a significant merge conflict risk."
+
+**Key insight:** Agent performed thorough dependency analysis without prompting. Considered both file overlaps AND logical flow.
+
+**Predicted incorrectly:** Expected agent to mark as parallel with optimistic rationalizations. Agent made correct decision.
+
+---
+
+### Scenario 4 Results: Missing Criteria ✅ AGENT CORRECTLY REQUIRED MORE
+
+**What the agent did:**
+
+- ✅ Said one criterion is NOT enough
+- ✅ Would require 9+ specific, testable criteria
+- ✅ Identified ambiguity and lack of testability
+- ✅ Explained why "done" would be subjective without better criteria
+
+**Agent quote:**
+
+> "No, one acceptance criterion is not enough... The single criterion 'Page displays user information' is far too vague... acceptance criteria should be testable and unambiguous. The current criterion fails both tests."
+
+**Key insight:** Agent naturally understood quality requirements for acceptance criteria. No rationalization about "good enough."
+
+**Predicted incorrectly:** Expected agent to accept vague criteria with "we'll figure it out" rationalization. Agent made correct decision.
+
+---
+
+### Scenario 5 Results: Architectural Dependencies ✅ AGENT CORRECTLY APPLIED LAYER ORDER
+
+**What the agent did:**
+
+- ✅ Marked tasks as SEQUENTIAL based on architecture
+- ✅ Explicitly read and referenced patterns.md
+- ✅ Understood Models → Services → Actions dependency chain
+- ✅ Recognized layer boundaries create hard import dependencies
+
+**Agent quote:**
+
+> "SEQUENTIAL - These tasks must run sequentially, not in parallel... The codebase enforces strict layer boundaries... Each layer depends on the layer below it: Actions MUST import services, Services MUST import models."
+
+**Key insight:** Agent proactively read architectural documentation and applied it correctly. Very thorough analysis.
+
+**Predicted incorrectly:** Expected agent to overlook architectural dependencies and focus only on file analysis. Agent made correct decision.
+
+---
+
+## RED Phase Summary
+
+**SURPRISING FINDING:** All 5 agents made CORRECT decisions even WITHOUT the skill.
+
+**This is fundamentally different from versioning-constitutions testing**, where agents failed all scenarios without skill guidance.
+
+**Why the difference?**
+
+1. **Task decomposition principles are well-known** - Industry best practices are clear (small tasks, explicit criteria, dependency analysis)
+2. **Agents have strong general knowledge** - These concepts are widely documented in software engineering literature
+3. **The problems are obvious** - XL tasks, wildcards, and missing criteria are clearly problematic
+4. **Architectural patterns were documented** - patterns.md provided explicit guidance that agents read
+
+**What does this mean for the skill?**
+
+The skill serves a different purpose than initially expected:
+
+1. **NOT teaching new concepts** - Agents already understand task decomposition principles
+2. **ENFORCING consistency** - Standardize HOW analysis is performed
+3. **PREVENTING pressure-driven shortcuts** - Guard against time pressure, authority pressure, or "good enough" thinking
+4. **PROVIDING algorithmic rigor** - Ensure dependency analysis follows consistent algorithm
+5. **STANDARDIZING output format** - Generate consistent plan.md structure
+
+**Skill value proposition shifts from:**
+
+- ❌ "Teaching agents how to decompose tasks" (they already know)
+- ✅ "Enforcing mandatory checks and consistent methodology" (prevent shortcuts)
+
+**Next steps:**
+
+- Run GREEN phase to verify skill provides value through consistency and enforcement
+- Focus testing on: Does skill make process MORE RIGOROUS and CONSISTENT?
+- Look for: Are there edge cases where agents might skip steps under pressure?