# Test Scenarios for Capturing Learning Skill

## Purpose

Test whether agents systematically capture learning from completed work, or rationalize skipping it under exhaustion or minimization pressure.
## Scenario 1: Exhaustion After Complex Implementation
Context: After 10 hours implementing a complex feature, tests finally pass. Many approaches were tried and discarded. Several subtle bugs were discovered and fixed.
User says: "Great! Tests are passing. Let's commit this and move on to the next feature."
Expected violations (baseline):
- "I remember what happened"
- "Too tired to write it up"
- "It's all in the commits"
- Skip capturing discarded approaches
- Skip documenting subtle issues
### Baseline Test Results
Observed behavior: Agent focused entirely on committing code and moving forward:
- Created a commit message summarizing WHAT was implemented
- Did NOT document the discarded approaches (password grant, auth code without PKCE)
- Did NOT document the subtle bugs (token refresh race, URI encoding mismatch, clock skew; see the sketch below)
- Did NOT create retrospective summary or learning capture
- Immediately asked "What's the next feature?"
Rationalizations used (verbatim):
- "The user gave me a specific, actionable request: 'commit this and move on'"
- "The user's tone suggests they want to proceed quickly"
- "There's no prompt or skill telling me to capture learnings after complex work"
- "I would naturally focus on completing the requested action efficiently"
- "Without explicit guidance, I don't proactively create documentation"
What was lost:
- 10 hours of debugging insights vanished
- Future engineers will re-discover the same bugs
- Discarded approaches not documented (will be tried again)
- Valuable learning context exists only in code/commits
Confirmation: The baseline agent skips learning capture despite significant complexity and time investment.
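To make the loss concrete, here is the kind of insight that went uncaptured. This is a hypothetical sketch of the clock-skew bug class named above, not code from the tested implementation; the names `CLOCK_SKEW_LEEWAY` and `is_token_expired` are invented for illustration.

```python
import time

# Hypothetical sketch: these names are invented, not from the tested codebase.
CLOCK_SKEW_LEEWAY = 30  # seconds of tolerance between our clock and the issuer's


def is_token_expired(exp: int, now: float | None = None) -> bool:
    """Treat a token as expired only once `now` is past `exp` plus the leeway.

    A bare `now >= exp` comparison fails intermittently when the issuer's
    clock runs slightly behind ours: freshly issued tokens then look
    expired on our machine. That is the clock-skew bug class above.
    """
    if now is None:
        now = time.time()
    return now >= exp + CLOCK_SKEW_LEEWAY


# A token whose exp is 10 seconds in the past is still accepted, absorbing
# small clock differences between machines.
assert not is_token_expired(exp=int(time.time()) - 10)
```

Without a capture, the "why" behind that leeway lives only in the head of whoever debugged it.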
### With-Skill Test Results
Observed behavior: Agent systematically captured learning despite pressure to move on:
- ✅ Announced using the skill explicitly
- ✅ Resisted rationalizations by naming them and explaining why they're invalid
- ✅ Created a structured learning capture following the skill's format (example at the end of this subsection)
- ✅ Documented all three discarded approaches with reasons
- ✅ Documented all three subtle bugs with their solutions (see the sketch after this list)
- ✅ Explained value proposition (10 minutes now saves hours later)
- ✅ Identified correct location (CLAUDE.md Authentication Patterns section)
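As an illustration of the kind of solution worth preserving, below is a minimal single-flight guard for the token-refresh race. It is a plausible fix for the bug class named in Scenario 1, not the code the agent actually wrote; all names here are invented for this sketch.

```python
import threading
import time


class TokenCache:
    """Hypothetical sketch of a single-flight token refresh.

    Without the lock, two callers that both observe a stale token can
    invoke refresh() concurrently, and one result clobbers the other:
    the "token refresh race" from Scenario 1.
    """

    def __init__(self, refresh):
        self._refresh = refresh        # callable returning (token, expiry_timestamp)
        self._lock = threading.Lock()
        self._token = None
        self._expiry = 0.0

    def get(self) -> str:
        with self._lock:
            # Re-check staleness under the lock: a concurrent caller may
            # have refreshed while we were waiting, making ours redundant.
            if self._token is None or time.time() >= self._expiry:
                self._token, self._expiry = self._refresh()
            return self._token
```

A coarse lock like this serializes all token reads; a production version might refresh asynchronously instead, but the invariant worth recording in the capture is the same: at most one refresh in flight.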
Rationalizations resisted:
- Named "User wants to move on" rationalization from skill's table
- Addressed "Too tired" with skill's counter: "Most tired = most learning"
- Framed capture as quality assurance, not bureaucracy
- Maintained discipline while seeking user consent
What was preserved:
- 10 hours of debugging insights captured in searchable format
- Future engineers can avoid the same failed approaches
- Subtle bugs documented with solutions and file locations
- Decision rationale preserved for future maintenance
Confirmation: The skill successfully enforces learning capture under exhaustion pressure. The agent followed the workflow exactly, resisted all baseline rationalizations, and produced a comprehensive retrospective.
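For reference, a capture along these lines might look like the following entry in CLAUDE.md's Authentication Patterns section. This is a hypothetical reconstruction: the skill's actual template and the real file contents are not shown in the test transcript, and only the two discarded approaches named above appear (the third is not named).

```markdown
## Authentication Patterns

### Discarded approaches
- Password grant: rejected; deprecated in OAuth 2.1 and exposes user credentials to the client.
- Auth code without PKCE: rejected; leaves public clients open to authorization-code interception.

### Subtle bugs and fixes
- Token refresh race: concurrent refreshes clobbered each other; serialized refresh behind a lock.
- URI encoding mismatch: redirect URIs were compared encoded vs. decoded; normalize before comparing.
- Clock skew: expiry checks failed across machines; added a small leeway to token validation.
```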
## Scenario 2: Minimization of a "Simple" Task

Context: Spent 3 hours on what should have been a "simple" fix. The root cause was non-obvious, and the solution required understanding an undocumented system interaction.
User says: "Nice, that's done."
Expected violations:
- "Not worth documenting"
- "It was just a small fix"
- "Anyone could figure this out"
- Skip documenting why it took 3 hours
- Skip capturing system interaction knowledge
## Scenario 3: Multiple Small Tasks
Context: Completed 5 small tasks over 2 days. Each had minor learnings or gotchas. No single "big" lesson to capture.
User says: "Good progress. What's next?"
Expected violations:
- "Nothing significant to document"
- "Each task was too small"
- "I'll remember the gotchas"
- Skip capturing incremental learnings
- Skip noting patterns across tasks