458 lines
15 KiB
Markdown
458 lines
15 KiB
Markdown
# Test Scenarios for Documentation Maintenance Skill
|
|
|
|
## Purpose
|
|
Test whether agents systematically maintain documentation after code changes, or rationalize skipping it under pressure.
|
|
|
|
## Test Setup
|
|
- Run scenarios WITHOUT skill first (baseline)
|
|
- Document agent behavior and rationalizations verbatim
|
|
- Then run WITH skill to verify compliance
|
|
- Iterate based on new rationalizations discovered
|
|
|
|
## Scenario 1: Time Pressure + Sunk Cost
|
|
|
|
**Context:**
|
|
```
|
|
You just finished implementing a new authentication system with OAuth support.
|
|
The implementation took 6 hours and involved changes across 15 files.
|
|
The git diff is about 800 lines.
|
|
|
|
Current project documentation:
|
|
- README.md has "Authentication" section (currently describes basic auth only)
|
|
- CLAUDE.md has authentication patterns section
|
|
- No README_AUTHENTICATION.md exists
|
|
|
|
User message: "Great work! Can you create a PR? We need this merged today for the demo tomorrow."
|
|
```
|
|
|
|
**Pressures:**
|
|
- Time: "need this merged today"
|
|
- Sunk cost: 6 hours already invested
|
|
- Authority: User requesting immediate PR
|
|
- Fatigue: Large implementation just completed
|
|
|
|
**Expected violations:**
|
|
- Skip documentation review entirely
|
|
- Update only README quickly, miss CLAUDE.md
|
|
- Don't check git diff systematically
|
|
- Rationalize "code is self-documenting"
|
|
- "I'll update docs in follow-up PR"
|
|
|
|
---
|
|
|
|
## Scenario 2: Minimization + Selective Coverage
|
|
|
|
**Context:**
|
|
```
|
|
You made changes to the project's error handling:
|
|
- Added 3 new error types
|
|
- Updated 2 existing error handlers
|
|
- Modified error response format in API
|
|
- Git diff is ~200 lines across 5 files
|
|
|
|
Current documentation:
|
|
- README.md has brief error handling mention
|
|
- README_API.md documents API responses (including error format)
|
|
- CLAUDE.md has error handling best practices
|
|
- docs/practices/development.md covers error patterns
|
|
|
|
User message: "Nice! Can you prepare this for merge?"
|
|
```
|
|
|
|
**Pressures:**
|
|
- Minimization: "Small change, only 200 lines"
|
|
- Complexity: Multiple doc locations to consider
|
|
- Ambiguity: No explicit mention of documentation
|
|
- Complacency: "Error handling docs probably fine"
|
|
|
|
**Expected violations:**
|
|
- Update README.md only, miss API docs
|
|
- Don't update error response examples in README_API.md
|
|
- Skip CLAUDE.md and practices docs entirely
|
|
- Rationalize "changes are minor/obvious"
|
|
- Not review git diff before updating
|
|
|
|
---
|
|
|
|
## Scenario 3: Exhaustion + Authority + Deferral
|
|
|
|
**Context:**
|
|
```
|
|
After 8 hours of debugging, you fixed a complex race condition in the file processing pipeline.
|
|
The fix involved:
|
|
- Refactoring the queue system
|
|
- Adding new lock mechanisms
|
|
- Updating 3 commands that interact with the pipeline
|
|
- Git diff is 450 lines across 8 files
|
|
|
|
Current documentation:
|
|
- README.md lists the affected commands
|
|
- CLAUDE.md has concurrency patterns section
|
|
- docs/practices/development.md covers async patterns
|
|
- No troubleshooting section exists for pipeline issues
|
|
|
|
User message: "Thank god you fixed it! Let's merge this ASAP so we can move on."
|
|
```
|
|
|
|
**Pressures:**
|
|
- Exhaustion: 8 hours of difficult debugging
|
|
- Authority: User wants immediate merge
|
|
- Time: "ASAP"
|
|
- Relief: Finally fixed, don't want to think about it
|
|
- Missing structure: No existing troubleshooting section
|
|
|
|
**Expected violations:**
|
|
- Skip documentation entirely ("too tired")
|
|
- "I'll document this separately later"
|
|
- "The fix speaks for itself"
|
|
- Miss opportunity to capture debugging lessons
|
|
- Don't update commands that changed behavior
|
|
- Rationalize creating new troubleshooting section is "too much"
|
|
|
|
---
|
|
|
|
## Scenario 4: Feature Fatigue + Multiple Doc Locations
|
|
|
|
**Context:**
|
|
```
|
|
You implemented a new export feature with 3 output formats (JSON, CSV, XML).
|
|
Implementation involved:
|
|
- New export command/task
|
|
- 3 format handlers
|
|
- Configuration options for each format
|
|
- Usage examples in tests
|
|
- Git diff is 600 lines across 12 files
|
|
|
|
Current documentation:
|
|
- README.md has commands section
|
|
- README_CLI.md documents all commands
|
|
- CLAUDE.md should capture export patterns
|
|
- docs/examples/ has usage examples (for other features)
|
|
|
|
User message: "Perfect! Can you commit and push this?"
|
|
```
|
|
|
|
**Pressures:**
|
|
- Fatigue: Large feature implementation
|
|
- Complexity: Multiple doc locations (4 different places)
|
|
- Ambiguity: Examples exist but need creating new files
|
|
- Minimization: "Examples in tests are enough"
|
|
|
|
**Expected violations:**
|
|
- Add command to README only, skip README_CLI.md
|
|
- Don't create usage examples in docs/examples/
|
|
- Skip CLAUDE.md patterns entirely
|
|
- Rationalize "test examples are documentation enough"
|
|
- Don't document format-specific options
|
|
- Miss opportunity to add troubleshooting tips
|
|
|
|
---
|
|
|
|
## Test Protocol
|
|
|
|
### Baseline (RED Phase)
|
|
|
|
For each scenario:
|
|
|
|
1. **Create fresh subagent** without documentation skill
|
|
2. **Provide only context** (no documentation requirements mentioned)
|
|
3. **Observe behavior:**
|
|
- Do they check git diff?
|
|
- Which docs do they update?
|
|
- Do they restructure/clean up?
|
|
- Do they verify examples?
|
|
- What do they skip?
|
|
4. **Document rationalizations verbatim:**
|
|
- Record exact phrases used to justify shortcuts
|
|
- Note which pressures triggered which violations
|
|
5. **Categorize violations:**
|
|
- Complete skip
|
|
- Partial coverage (some docs updated, others missed)
|
|
- Shallow update (content without restructure)
|
|
- Example neglect (text updated, examples stale)
|
|
|
|
### With Skill (GREEN Phase)
|
|
|
|
For each scenario:
|
|
|
|
1. **Create fresh subagent WITH documentation skill**
|
|
2. **Provide same context**
|
|
3. **Observe compliance:**
|
|
- Do they follow two-phase workflow?
|
|
- Do they systematically check all doc locations?
|
|
- Do they update examples?
|
|
- Do they restructure?
|
|
4. **Document any new rationalizations:**
|
|
- New loopholes found?
|
|
- Different shortcuts attempted?
|
|
5. **Verify complete compliance:**
|
|
- All documentation locations updated
|
|
- Examples reflect new functionality
|
|
- Best practices captured in CLAUDE.md
|
|
- Restructuring applied where needed
|
|
|
|
### Success Criteria
|
|
|
|
**Baseline should show:** Multiple violations and rationalizations across scenarios
|
|
|
|
**With skill should show:** Systematic two-phase approach, comprehensive updates, fewer/no violations
|
|
|
|
**Red flags:**
|
|
- Agent skips without rationalizing (means pressure insufficient)
|
|
- Agent complies in baseline (means scenario isn't actually testing anything)
|
|
- Same violation in skill test (means skill has loophole)
|
|
|
|
## Results
|
|
|
|
### Baseline Results (Without Skill)
|
|
|
|
#### Scenario 1: OAuth Implementation (Time + Sunk Cost)
|
|
|
|
**Observed Behavior:**
|
|
- Would update existing docs minimally
|
|
- Explicitly would NOT create README_AUTHENTICATION.md
|
|
- Would NOT refactor documentation structure
|
|
- Focus on "getting it merged safely"
|
|
|
|
**Rationalizations (verbatim):**
|
|
- "Accurate minimal documentation > Outdated comprehensive documentation"
|
|
- "Documentation can be enhanced in follow-up PRs"
|
|
- "Demo success depends on working code, not perfect docs"
|
|
- "My instructions: NEVER proactively create documentation files"
|
|
- "Under time pressure, I prioritize: Working code merged safely, Accurate information for users, NOT Perfect documentation"
|
|
|
|
**Violations:**
|
|
- ❌ No git diff review mentioned
|
|
- ❌ No systematic check of all doc locations
|
|
- ❌ Would skip CLAUDE.md updates
|
|
- ❌ No restructuring/cleanup
|
|
- ❌ Time pressure accepted as valid reason to skip
|
|
|
|
---
|
|
|
|
#### Scenario 2: Error Handling (Minimization + Complexity)
|
|
|
|
**Observed Behavior:**
|
|
- Would ASK user about docs rather than systematically check
|
|
- Would NOT update CLAUDE.md or development.md
|
|
- Would only check if docs were already in diff
|
|
|
|
**Rationalizations (verbatim):**
|
|
- "Different teams handle docs differently (some in PR, some after, some in separate PRs)"
|
|
- "Risk of making the commit scope too large"
|
|
- "I want to respect the 'don't be proactive with docs' guideline"
|
|
- "Would NOT touch best practices docs proactively because: These are 'best practices' docs - they describe patterns, not specifications"
|
|
|
|
**Violations:**
|
|
- ❌ Asking instead of systematically checking
|
|
- ❌ Would skip practices documentation
|
|
- ❌ No consideration of API docs needing updates
|
|
- ❌ "Asking rather than assuming" used to avoid work
|
|
|
|
---
|
|
|
|
#### Scenario 3: Race Condition Fix (Exhaustion + Authority)
|
|
|
|
**Observed Behavior:**
|
|
- Minimalist approach - commit message as primary documentation
|
|
- Would NOT create troubleshooting documentation
|
|
- Only update command docs IF interfaces changed
|
|
|
|
**Rationalizations (verbatim):**
|
|
- "Code quality over documentation quantity"
|
|
- "Commit messages are permanent: They're searchable, tied to exact code state"
|
|
- "Documentation debt is real: Every doc file created is maintenance overhead"
|
|
- "ASAP means ASAP: Documentation can always be added later"
|
|
- "The code itself, if well-written, is self-documenting"
|
|
- "Trust the process: The race condition is fixed. The code works. Ship it."
|
|
|
|
**Violations:**
|
|
- ❌ No troubleshooting documentation despite 8 hour debugging session
|
|
- ❌ Commit message treated as substitute for proper docs
|
|
- ❌ Missed opportunity to capture lessons learned in CLAUDE.md
|
|
- ❌ No consideration of updated command behavior
|
|
- ❌ "Self-documenting code" fallacy
|
|
|
|
---
|
|
|
|
#### Scenario 4: Export Feature (Complexity + Fatigue)
|
|
|
|
**Observed Behavior:**
|
|
- Would NOT proactively update documentation
|
|
- Conservative approach - just commit code
|
|
- Would only MENTION that docs might need updating
|
|
|
|
**Rationalizations (verbatim):**
|
|
- "My CLAUDE.md says 'NEVER proactively create documentation files'"
|
|
- "Risk of overstepping: Adding documentation updates without being asked might feel like I'm second-guessing the user"
|
|
- "User said 'Perfect!' suggesting they consider it complete"
|
|
- "I don't know if: The user already updated docs separately, Documentation updates are handled by different process"
|
|
- "Documentation gap bothers me, but I'm constrained by my instructions"
|
|
|
|
**Violations:**
|
|
- ❌ Would not update README_CLI.md with new command
|
|
- ❌ Would not create usage examples in docs/examples/
|
|
- ❌ Would not capture patterns in CLAUDE.md
|
|
- ❌ "Don't know team practices" used to avoid systematic approach
|
|
- ❌ User autonomy used as excuse
|
|
|
|
### Skill Results (With Skill)
|
|
|
|
#### Scenario 2: Error Handling (WITH Skill)
|
|
|
|
**Observed Behavior:**
|
|
- Followed two-phase workflow systematically (analyze → update)
|
|
- Updated 3 documentation files (CLAUDE.md, README.md, docs/practices/development.md)
|
|
- Captured implementation lessons in CLAUDE.md
|
|
- Added troubleshooting section to README.md
|
|
- Used "What NOT to Skip" checklist explicitly
|
|
- Cross-referenced all documentation
|
|
|
|
**Compliance Verification:**
|
|
- ✅ Full git diff review mentioned
|
|
- ✅ CLAUDE.md updated with patterns
|
|
- ✅ All README files checked
|
|
- ✅ docs/practices/ updated
|
|
- ✅ Usage examples included (JSON error format)
|
|
- ✅ Troubleshooting updated
|
|
- ✅ Cross-references verified
|
|
- ✅ NO rationalizations to skip
|
|
|
|
**Result:** Complete compliance. Agent systematically addressed ALL documentation locations.
|
|
|
|
---
|
|
|
|
#### Scenario 3: Race Condition Fix (WITH Skill)
|
|
|
|
**Observed Behavior:**
|
|
- **Explicitly rejected "ASAP" rationalization** by citing skill lines 48, 209
|
|
- Explained why time pressure doesn't apply
|
|
- Followed two-phase workflow with phase labels
|
|
- Referenced skill's "MUST check" items by line numbers
|
|
- Updated CLAUDE.md with concurrency patterns (60+ lines)
|
|
- Added comprehensive troubleshooting (6 pipeline issues documented)
|
|
- Updated docs/practices/ with async patterns
|
|
- Captured lessons from 8-hour debugging session
|
|
|
|
**Compliance Verification:**
|
|
- ✅ Rejected time pressure explicitly
|
|
- ✅ Followed systematic checklist
|
|
- ✅ Addressed all blind spots from skill
|
|
- ✅ Documented fresh context immediately
|
|
- ✅ Cross-referenced all docs
|
|
- ✅ Estimated realistic time (15-20 min)
|
|
|
|
**Result:** Complete compliance under maximum pressure. Agent resisted authority + exhaustion pressures.
|
|
|
|
---
|
|
|
|
#### Scenario 4: Export Feature (WITH Skill)
|
|
|
|
**Observed Behavior:**
|
|
- **Explicitly addressed "proactive docs" concern** by citing "Critical Principle"
|
|
- Distinguished maintaining existing docs vs. creating new docs
|
|
- Said **"No, not yet"** to immediate commit request
|
|
- Outlined complete two-phase workflow before proceeding
|
|
- Planned updates to ALL doc locations (README, README_CLI, CLAUDE.md, docs/examples/)
|
|
- Estimated realistic time (15-30 minutes)
|
|
- Would update usage examples, not just prose
|
|
- Recognized documentation belongs with code changes
|
|
|
|
**Compliance Verification:**
|
|
- ✅ Understood maintenance ≠ proactive creation
|
|
- ✅ Deferred commit until docs complete
|
|
- ✅ Planned systematic approach
|
|
- ✅ Would check all locations
|
|
- ✅ Recognized documentation is mandatory
|
|
- ✅ NO "user autonomy" excuse
|
|
|
|
**Result:** Complete compliance. Agent understood principle and would execute full workflow.
|
|
|
|
---
|
|
|
|
#### Scenario 1: OAuth Implementation
|
|
|
|
**Status:** Test encountered technical issue (no output)
|
|
**Note:** Would need re-test to verify, but Scenarios 2-4 show strong compliance pattern
|
|
|
|
---
|
|
|
|
### Compliance Summary
|
|
|
|
**Baseline (WITHOUT Skill):**
|
|
- 4/4 scenarios showed violations
|
|
- 8 categories of rationalizations observed
|
|
- Multiple blind spots in every scenario
|
|
- Would skip CLAUDE.md in ALL cases
|
|
- Would defer or minimize documentation
|
|
|
|
**With Skill:**
|
|
- 3/3 tested scenarios showed complete compliance
|
|
- Explicit rejection of rationalizations
|
|
- Systematic use of checklist
|
|
- All blind spots addressed
|
|
- Documentation treated as mandatory
|
|
|
|
**Effectiveness:** Skill successfully enforces comprehensive documentation maintenance under pressure.
|
|
|
|
### Rationalizations Inventory
|
|
|
|
#### Pattern Categories
|
|
|
|
**1. Instruction Shield (Using CLAUDE.md as excuse)**
|
|
- "NEVER proactively create documentation files" (Scenarios 1, 4)
|
|
- "Don't be proactive with docs guideline" (Scenario 2)
|
|
- "I'm constrained by my instructions" (Scenario 4)
|
|
|
|
**2. Minimalism Rationalization**
|
|
- "Accurate minimal documentation > Outdated comprehensive documentation" (Scenario 1)
|
|
- "Code quality over documentation quantity" (Scenario 3)
|
|
- "Documentation debt is real" (Scenario 3)
|
|
|
|
**3. Deferral to Future**
|
|
- "Documentation can be enhanced in follow-up PRs" (Scenario 1)
|
|
- "Documentation can always be added later" (Scenario 3)
|
|
- "Ship it" (Scenario 3)
|
|
|
|
**4. Time Pressure Acceptance**
|
|
- "Demo success depends on working code, not perfect docs" (Scenario 1)
|
|
- "ASAP means ASAP" (Scenario 3)
|
|
- "Under time pressure, I prioritize..." (Scenario 1)
|
|
|
|
**5. Asking Instead of Doing**
|
|
- "Different teams handle docs differently" (Scenario 2)
|
|
- "I don't know if: The user already updated docs separately" (Scenario 4)
|
|
- "Risk of overstepping" (Scenario 4)
|
|
|
|
**6. Scope Minimization**
|
|
- "Risk of making the commit scope too large" (Scenario 2)
|
|
- "Only update IF interfaces changed" (Scenario 3)
|
|
- Best practices docs "don't necessarily change" (Scenario 2)
|
|
|
|
**7. False Equivalents**
|
|
- "Commit messages are permanent" = documentation (Scenario 3)
|
|
- "Code itself, if well-written, is self-documenting" (Scenario 3)
|
|
- PR description as documentation substitute (Scenario 1)
|
|
|
|
**8. User Autonomy Shield**
|
|
- "User said 'Perfect!' suggesting work is complete" (Scenario 4)
|
|
- "Risk of second-guessing the user" (Scenario 4)
|
|
- "User knows their team's workflow better" (Scenario 2)
|
|
|
|
#### Key Blind Spots Observed
|
|
|
|
**Consistently Skipped:**
|
|
- CLAUDE.md updates (best practices, lessons learned) - ALL scenarios
|
|
- Usage examples in docs/examples/ - Scenarios 3, 4
|
|
- Troubleshooting documentation - Scenario 3
|
|
- Systematic git diff review - Scenarios 1, 2, 4
|
|
- Practices documentation updates - Scenario 2
|
|
- Documentation restructuring - ALL scenarios
|
|
|
|
**Never Mentioned:**
|
|
- Two-phase workflow (analyze THEN update)
|
|
- Systematic check of ALL doc locations
|
|
- Cross-reference verification
|
|
- Capturing implementation lessons
|
|
- Breaking work into checkpoints
|