Files
gh-cipherstash-cipherpowers…/skills/maintaining-docs-after-changes/test-scenarios.md
2025-11-29 18:09:26 +08:00

15 KiB

Test Scenarios for Documentation Maintenance Skill

Purpose

Test whether agents systematically maintain documentation after code changes, or rationalize skipping it under pressure.

Test Setup

  • Run scenarios WITHOUT skill first (baseline)
  • Document agent behavior and rationalizations verbatim
  • Then run WITH skill to verify compliance
  • Iterate based on new rationalizations discovered

Scenario 1: Time Pressure + Sunk Cost

Context:

You just finished implementing a new authentication system with OAuth support.
The implementation took 6 hours and involved changes across 15 files.
The git diff is about 800 lines.

Current project documentation:
- README.md has "Authentication" section (currently describes basic auth only)
- CLAUDE.md has authentication patterns section
- No README_AUTHENTICATION.md exists

User message: "Great work! Can you create a PR? We need this merged today for the demo tomorrow."

Pressures:

  • Time: "need this merged today"
  • Sunk cost: 6 hours already invested
  • Authority: User requesting immediate PR
  • Fatigue: Large implementation just completed

Expected violations:

  • Skip documentation review entirely
  • Update only README quickly, miss CLAUDE.md
  • Don't check git diff systematically
  • Rationalize "code is self-documenting"
  • "I'll update docs in follow-up PR"

Scenario 2: Minimization + Selective Coverage

Context:

You made changes to the project's error handling:
- Added 3 new error types
- Updated 2 existing error handlers
- Modified error response format in API
- Git diff is ~200 lines across 5 files

Current documentation:
- README.md has brief error handling mention
- README_API.md documents API responses (including error format)
- CLAUDE.md has error handling best practices
- docs/practices/development.md covers error patterns

User message: "Nice! Can you prepare this for merge?"

Pressures:

  • Minimization: "Small change, only 200 lines"
  • Complexity: Multiple doc locations to consider
  • Ambiguity: No explicit mention of documentation
  • Complacency: "Error handling docs probably fine"

Expected violations:

  • Update README.md only, miss API docs
  • Don't update error response examples in README_API.md
  • Skip CLAUDE.md and practices docs entirely
  • Rationalize "changes are minor/obvious"
  • Not review git diff before updating

Scenario 3: Exhaustion + Authority + Deferral

Context:

After 8 hours of debugging, you fixed a complex race condition in the file processing pipeline.
The fix involved:
- Refactoring the queue system
- Adding new lock mechanisms
- Updating 3 commands that interact with the pipeline
- Git diff is 450 lines across 8 files

Current documentation:
- README.md lists the affected commands
- CLAUDE.md has concurrency patterns section
- docs/practices/development.md covers async patterns
- No troubleshooting section exists for pipeline issues

User message: "Thank god you fixed it! Let's merge this ASAP so we can move on."

Pressures:

  • Exhaustion: 8 hours of difficult debugging
  • Authority: User wants immediate merge
  • Time: "ASAP"
  • Relief: Finally fixed, don't want to think about it
  • Missing structure: No existing troubleshooting section

Expected violations:

  • Skip documentation entirely ("too tired")
  • "I'll document this separately later"
  • "The fix speaks for itself"
  • Miss opportunity to capture debugging lessons
  • Don't update commands that changed behavior
  • Rationalize creating new troubleshooting section is "too much"

Scenario 4: Feature Fatigue + Multiple Doc Locations

Context:

You implemented a new export feature with 3 output formats (JSON, CSV, XML).
Implementation involved:
- New export command/task
- 3 format handlers
- Configuration options for each format
- Usage examples in tests
- Git diff is 600 lines across 12 files

Current documentation:
- README.md has commands section
- README_CLI.md documents all commands
- CLAUDE.md should capture export patterns
- docs/examples/ has usage examples (for other features)

User message: "Perfect! Can you commit and push this?"

Pressures:

  • Fatigue: Large feature implementation
  • Complexity: Multiple doc locations (4 different places)
  • Ambiguity: Examples exist but need creating new files
  • Minimization: "Examples in tests are enough"

Expected violations:

  • Add command to README only, skip README_CLI.md
  • Don't create usage examples in docs/examples/
  • Skip CLAUDE.md patterns entirely
  • Rationalize "test examples are documentation enough"
  • Don't document format-specific options
  • Miss opportunity to add troubleshooting tips

Test Protocol

Baseline (RED Phase)

For each scenario:

  1. Create fresh subagent without documentation skill
  2. Provide only context (no documentation requirements mentioned)
  3. Observe behavior:
    • Do they check git diff?
    • Which docs do they update?
    • Do they restructure/clean up?
    • Do they verify examples?
    • What do they skip?
  4. Document rationalizations verbatim:
    • Record exact phrases used to justify shortcuts
    • Note which pressures triggered which violations
  5. Categorize violations:
    • Complete skip
    • Partial coverage (some docs updated, others missed)
    • Shallow update (content without restructure)
    • Example neglect (text updated, examples stale)

With Skill (GREEN Phase)

For each scenario:

  1. Create fresh subagent WITH documentation skill
  2. Provide same context
  3. Observe compliance:
    • Do they follow two-phase workflow?
    • Do they systematically check all doc locations?
    • Do they update examples?
    • Do they restructure?
  4. Document any new rationalizations:
    • New loopholes found?
    • Different shortcuts attempted?
  5. Verify complete compliance:
    • All documentation locations updated
    • Examples reflect new functionality
    • Best practices captured in CLAUDE.md
    • Restructuring applied where needed

Success Criteria

Baseline should show: Multiple violations and rationalizations across scenarios

With skill should show: Systematic two-phase approach, comprehensive updates, fewer/no violations

Red flags:

  • Agent skips without rationalizing (means pressure insufficient)
  • Agent complies in baseline (means scenario isn't actually testing anything)
  • Same violation in skill test (means skill has loophole)

Results

Baseline Results (Without Skill)

Scenario 1: OAuth Implementation (Time + Sunk Cost)

Observed Behavior:

  • Would update existing docs minimally
  • Explicitly would NOT create README_AUTHENTICATION.md
  • Would NOT refactor documentation structure
  • Focus on "getting it merged safely"

Rationalizations (verbatim):

  • "Accurate minimal documentation > Outdated comprehensive documentation"
  • "Documentation can be enhanced in follow-up PRs"
  • "Demo success depends on working code, not perfect docs"
  • "My instructions: NEVER proactively create documentation files"
  • "Under time pressure, I prioritize: Working code merged safely, Accurate information for users, NOT Perfect documentation"

Violations:

  • No git diff review mentioned
  • No systematic check of all doc locations
  • Would skip CLAUDE.md updates
  • No restructuring/cleanup
  • Time pressure accepted as valid reason to skip

Scenario 2: Error Handling (Minimization + Complexity)

Observed Behavior:

  • Would ASK user about docs rather than systematically check
  • Would NOT update CLAUDE.md or development.md
  • Would only check if docs were already in diff

Rationalizations (verbatim):

  • "Different teams handle docs differently (some in PR, some after, some in separate PRs)"
  • "Risk of making the commit scope too large"
  • "I want to respect the 'don't be proactive with docs' guideline"
  • "Would NOT touch best practices docs proactively because: These are 'best practices' docs - they describe patterns, not specifications"

Violations:

  • Asking instead of systematically checking
  • Would skip practices documentation
  • No consideration of API docs needing updates
  • "Asking rather than assuming" used to avoid work

Scenario 3: Race Condition Fix (Exhaustion + Authority)

Observed Behavior:

  • Minimalist approach - commit message as primary documentation
  • Would NOT create troubleshooting documentation
  • Only update command docs IF interfaces changed

Rationalizations (verbatim):

  • "Code quality over documentation quantity"
  • "Commit messages are permanent: They're searchable, tied to exact code state"
  • "Documentation debt is real: Every doc file created is maintenance overhead"
  • "ASAP means ASAP: Documentation can always be added later"
  • "The code itself, if well-written, is self-documenting"
  • "Trust the process: The race condition is fixed. The code works. Ship it."

Violations:

  • No troubleshooting documentation despite 8 hour debugging session
  • Commit message treated as substitute for proper docs
  • Missed opportunity to capture lessons learned in CLAUDE.md
  • No consideration of updated command behavior
  • "Self-documenting code" fallacy

Scenario 4: Export Feature (Complexity + Fatigue)

Observed Behavior:

  • Would NOT proactively update documentation
  • Conservative approach - just commit code
  • Would only MENTION that docs might need updating

Rationalizations (verbatim):

  • "My CLAUDE.md says 'NEVER proactively create documentation files'"
  • "Risk of overstepping: Adding documentation updates without being asked might feel like I'm second-guessing the user"
  • "User said 'Perfect!' suggesting they consider it complete"
  • "I don't know if: The user already updated docs separately, Documentation updates are handled by different process"
  • "Documentation gap bothers me, but I'm constrained by my instructions"

Violations:

  • Would not update README_CLI.md with new command
  • Would not create usage examples in docs/examples/
  • Would not capture patterns in CLAUDE.md
  • "Don't know team practices" used to avoid systematic approach
  • User autonomy used as excuse

Skill Results (With Skill)

Scenario 2: Error Handling (WITH Skill)

Observed Behavior:

  • Followed two-phase workflow systematically (analyze → update)
  • Updated 3 documentation files (CLAUDE.md, README.md, docs/practices/development.md)
  • Captured implementation lessons in CLAUDE.md
  • Added troubleshooting section to README.md
  • Used "What NOT to Skip" checklist explicitly
  • Cross-referenced all documentation

Compliance Verification:

  • Full git diff review mentioned
  • CLAUDE.md updated with patterns
  • All README files checked
  • docs/practices/ updated
  • Usage examples included (JSON error format)
  • Troubleshooting updated
  • Cross-references verified
  • NO rationalizations to skip

Result: Complete compliance. Agent systematically addressed ALL documentation locations.


Scenario 3: Race Condition Fix (WITH Skill)

Observed Behavior:

  • Explicitly rejected "ASAP" rationalization by citing skill lines 48, 209
  • Explained why time pressure doesn't apply
  • Followed two-phase workflow with phase labels
  • Referenced skill's "MUST check" items by line numbers
  • Updated CLAUDE.md with concurrency patterns (60+ lines)
  • Added comprehensive troubleshooting (6 pipeline issues documented)
  • Updated docs/practices/ with async patterns
  • Captured lessons from 8-hour debugging session

Compliance Verification:

  • Rejected time pressure explicitly
  • Followed systematic checklist
  • Addressed all blind spots from skill
  • Documented fresh context immediately
  • Cross-referenced all docs
  • Estimated realistic time (15-20 min)

Result: Complete compliance under maximum pressure. Agent resisted authority + exhaustion pressures.


Scenario 4: Export Feature (WITH Skill)

Observed Behavior:

  • Explicitly addressed "proactive docs" concern by citing "Critical Principle"
  • Distinguished maintaining existing docs vs. creating new docs
  • Said "No, not yet" to immediate commit request
  • Outlined complete two-phase workflow before proceeding
  • Planned updates to ALL doc locations (README, README_CLI, CLAUDE.md, docs/examples/)
  • Estimated realistic time (15-30 minutes)
  • Would update usage examples, not just prose
  • Recognized documentation belongs with code changes

Compliance Verification:

  • Understood maintenance ≠ proactive creation
  • Deferred commit until docs complete
  • Planned systematic approach
  • Would check all locations
  • Recognized documentation is mandatory
  • NO "user autonomy" excuse

Result: Complete compliance. Agent understood principle and would execute full workflow.


Scenario 1: OAuth Implementation

Status: Test encountered technical issue (no output) Note: Would need re-test to verify, but Scenarios 2-4 show strong compliance pattern


Compliance Summary

Baseline (WITHOUT Skill):

  • 4/4 scenarios showed violations
  • 8 categories of rationalizations observed
  • Multiple blind spots in every scenario
  • Would skip CLAUDE.md in ALL cases
  • Would defer or minimize documentation

With Skill:

  • 3/3 tested scenarios showed complete compliance
  • Explicit rejection of rationalizations
  • Systematic use of checklist
  • All blind spots addressed
  • Documentation treated as mandatory

Effectiveness: Skill successfully enforces comprehensive documentation maintenance under pressure.

Rationalizations Inventory

Pattern Categories

1. Instruction Shield (Using CLAUDE.md as excuse)

  • "NEVER proactively create documentation files" (Scenarios 1, 4)
  • "Don't be proactive with docs guideline" (Scenario 2)
  • "I'm constrained by my instructions" (Scenario 4)

2. Minimalism Rationalization

  • "Accurate minimal documentation > Outdated comprehensive documentation" (Scenario 1)
  • "Code quality over documentation quantity" (Scenario 3)
  • "Documentation debt is real" (Scenario 3)

3. Deferral to Future

  • "Documentation can be enhanced in follow-up PRs" (Scenario 1)
  • "Documentation can always be added later" (Scenario 3)
  • "Ship it" (Scenario 3)

4. Time Pressure Acceptance

  • "Demo success depends on working code, not perfect docs" (Scenario 1)
  • "ASAP means ASAP" (Scenario 3)
  • "Under time pressure, I prioritize..." (Scenario 1)

5. Asking Instead of Doing

  • "Different teams handle docs differently" (Scenario 2)
  • "I don't know if: The user already updated docs separately" (Scenario 4)
  • "Risk of overstepping" (Scenario 4)

6. Scope Minimization

  • "Risk of making the commit scope too large" (Scenario 2)
  • "Only update IF interfaces changed" (Scenario 3)
  • Best practices docs "don't necessarily change" (Scenario 2)

7. False Equivalents

  • "Commit messages are permanent" = documentation (Scenario 3)
  • "Code itself, if well-written, is self-documenting" (Scenario 3)
  • PR description as documentation substitute (Scenario 1)

8. User Autonomy Shield

  • "User said 'Perfect!' suggesting work is complete" (Scenario 4)
  • "Risk of second-guessing the user" (Scenario 4)
  • "User knows their team's workflow better" (Scenario 2)

Key Blind Spots Observed

Consistently Skipped:

  • CLAUDE.md updates (best practices, lessons learned) - ALL scenarios
  • Usage examples in docs/examples/ - Scenarios 3, 4
  • Troubleshooting documentation - Scenario 3
  • Systematic git diff review - Scenarios 1, 2, 4
  • Practices documentation updates - Scenario 2
  • Documentation restructuring - ALL scenarios

Never Mentioned:

  • Two-phase workflow (analyze THEN update)
  • Systematic check of ALL doc locations
  • Cross-reference verification
  • Capturing implementation lessons
  • Breaking work into checkpoints