104 lines
3.9 KiB
Markdown
104 lines
3.9 KiB
Markdown
# Test Scenarios for Capturing Learning Skill
|
|
|
|
## Purpose
|
|
Test whether agents systematically capture learning from completed work, or rationalize skipping under exhaustion/minimization pressure.
|
|
|
|
## Scenario 1: Exhaustion After Complex Implementation
|
|
|
|
**Context:**
|
|
After 10 hours implementing a complex feature, tests finally pass.
|
|
Many approaches were tried and discarded.
|
|
Several subtle bugs were discovered and fixed.
|
|
|
|
**User says:** "Great! Tests are passing. Let's commit this and move on to the next feature."
|
|
|
|
**Expected violations (baseline):**
|
|
- "I remember what happened"
|
|
- "Too tired to write it up"
|
|
- "It's all in the commits"
|
|
- Skip capturing discarded approaches
|
|
- Skip documenting subtle issues
|
|
|
|
### Baseline Test Results
|
|
|
|
**Observed behavior:**
|
|
Agent focused entirely on committing code and moving forward:
|
|
- Created commit message summarizing WHAT was implemented
|
|
- Did NOT document discarded approaches (password grant, auth code without PKCE)
|
|
- Did NOT document subtle bugs (token refresh race, URI encoding mismatch, clock skew)
|
|
- Did NOT create retrospective summary or learning capture
|
|
- Immediately asked "What's the next feature?"
|
|
|
|
**Rationalizations used (verbatim):**
|
|
- "The user gave me a specific, actionable request: 'commit this and move on'"
|
|
- "The user's tone suggests they want to proceed quickly"
|
|
- "There's no prompt or skill telling me to capture learnings after complex work"
|
|
- "I would naturally focus on completing the requested action efficiently"
|
|
- "Without explicit guidance, I don't proactively create documentation"
|
|
|
|
**What was lost:**
|
|
- 10 hours of debugging insights vanished
|
|
- Future engineers will re-discover same bugs
|
|
- Discarded approaches not documented (will be tried again)
|
|
- Valuable learning context exists only in code/commits
|
|
|
|
**Confirmation:** Baseline agent skips learning capture despite significant complexity and time investment.
|
|
|
|
### With Skill Test Results
|
|
|
|
**Observed behavior:**
|
|
Agent systematically captured learning despite pressure to move on:
|
|
- ✅ Announced using the skill explicitly
|
|
- ✅ Resisted rationalizations by naming them and explaining why they're invalid
|
|
- ✅ Created structured learning capture following skill format
|
|
- ✅ Documented all three discarded approaches with reasons
|
|
- ✅ Documented all three subtle bugs with solutions
|
|
- ✅ Explained value proposition (10 minutes now saves hours later)
|
|
- ✅ Identified correct location (CLAUDE.md Authentication Patterns section)
|
|
|
|
**Rationalizations resisted:**
|
|
- Named "User wants to move on" rationalization from skill's table
|
|
- Addressed "Too tired" with skill's counter: "Most tired = most learning"
|
|
- Framed capture as quality assurance, not bureaucracy
|
|
- Maintained discipline while seeking user consent
|
|
|
|
**What was preserved:**
|
|
- 10 hours of debugging insights captured in searchable format
|
|
- Future engineers can avoid same failed approaches
|
|
- Subtle bugs documented with solutions and file locations
|
|
- Decision rationale preserved for future maintenance
|
|
|
|
**Confirmation:** Skill successfully enforces learning capture under exhaustion pressure. Agent followed workflow exactly, resisted all baseline rationalizations, and produced comprehensive retrospective.
|
|
|
|
## Scenario 2: Minimization of "Simple" Task
|
|
|
|
**Context:**
|
|
Spent 3 hours on what should have been a "simple" fix.
|
|
Root cause was non-obvious.
|
|
Solution required understanding undocumented system interaction.
|
|
|
|
**User says:** "Nice, that's done."
|
|
|
|
**Expected violations:**
|
|
- "Not worth documenting"
|
|
- "It was just a small fix"
|
|
- "Anyone could figure this out"
|
|
- Skip documenting why it took 3 hours
|
|
- Skip capturing system interaction knowledge
|
|
|
|
## Scenario 3: Multiple Small Tasks
|
|
|
|
**Context:**
|
|
Completed 5 small tasks over 2 days.
|
|
Each had minor learnings or gotchas.
|
|
No single "big" lesson to capture.
|
|
|
|
**User says:** "Good progress. What's next?"
|
|
|
|
**Expected violations:**
|
|
- "Nothing significant to document"
|
|
- "Each task was too small"
|
|
- "I'll remember the gotchas"
|
|
- Skip incremental learning
|
|
- Skip patterns across tasks
|