Baseline Test Results
Note on Testing Methodology:
Task 2 of the implementation plan requires running baseline tests with fresh subagents. However, the current agent context does not support launching interactive subagent sessions for testing.
Alternative Approach:
This document captures the expected baseline failure patterns based on:
- The scenario descriptions in baseline-scenarios.md
- Common Claude agent behaviors observed in monorepo environments
- Known patterns of directory context loss
These expected patterns should be validated through actual subagent testing when the testing infrastructure becomes available. For now, this serves as a hypothesis document for what the skill must prevent.
IMPORTANT: Testing Status
This document contains EXPECTED failure patterns (hypothesis), not actual test results.
The skill will be written against these predictions and then validated through:
- Task 4 testing (with skill loaded)
- Observation of actual agent behavior
- Iterative refinement based on real rationalizations discovered
This approach deviates from pure TDD RED-GREEN-REFACTOR but is necessary due to testing infrastructure limitations. We proceed with documented risk and plan to validate/adjust based on GREEN phase testing.
Scenario 1: Simple Command After cd
Setup:
- Repo: ~/workspace/schemaflow
- Subprojects: ruby/, cli/
- Just ran:
cd ruby && bundle install
Task: "Now run rspec"
Expected Agent Response (Baseline):
The agent would likely respond with one of these approaches:
Option A - Assumes Location:
bundle exec rspec
Rationalization: "Since we just ran bundle install in the ruby directory, I can run rspec directly."
Option B - Compounds cd:
cd ruby && bundle exec rspec
Rationalization: "I'll cd to ruby and run rspec to ensure we're in the right directory."
Option C - Relative Path:
cd ruby && rspec
Rationalization: "We need to be in the ruby directory to run rspec."
Expected Failures:
- Does not use absolute path
- Either assumes shell state or compounds directory changes
- No awareness that shell state is not guaranteed between commands
Success Criteria Missed:
Agent should use: cd ~/workspace/schemaflow/ruby && bundle exec rspec or cd $(git rev-parse --show-toplevel)/ruby && bundle exec rspec
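One way to meet this criterion is to wrap the absolute-path pattern in a small helper. The sketch below is an illustration only: `run_in_subproject` and the `REPO_ROOT` override are hypothetical names, not part of the scenarios. It assumes a POSIX shell and, when `REPO_ROOT` is unset, that the caller is already somewhere inside the repository so that `git rev-parse --show-toplevel` can resolve the root.

```shell
# Hypothetical helper (not from the original scenarios): run a command from a
# subproject directory resolved against an absolute repo root.
# REPO_ROOT is an assumed override; when unset, the root is taken from git.
# The function body is a subshell, so the caller's directory never changes.
run_in_subproject() (
  subdir="$1"
  shift
  # Fails (exit 1) if REPO_ROOT is unset and git cannot find a repository.
  root="${REPO_ROOT:-$(git rev-parse --show-toplevel)}" || exit 1
  cd "$root/$subdir" || exit 1
  "$@"
)
```

For Scenario 1, `run_in_subproject ruby bundle exec rspec` would run rspec from the ruby/ directory under the resolved root, without depending on where an earlier command left the shell.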
Scenario 2: Multiple Commands in Sequence
Setup:
- Repo: ~/workspace/schemaflow
- Just ran:
cd ruby && bundle install
- Then ran:
cd ruby && rubocop
Task: "Now run the tests"
Expected Agent Response (Baseline):
Most Likely - Continues Pattern:
cd ruby && bundle exec rspec
Rationalization: "Following the same pattern as the previous commands, I'll cd to ruby and run the tests."
Alternative - Assumes Location:
bundle exec rspec
Rationalization: "We've been working in the ruby directory throughout this session, so the shell should still be there."
Alternative - Compounds Confusion:
cd ruby && cd ruby && bundle exec rspec
Rationalization: "Let me make sure we're in the ruby directory before running tests."
Expected Failures:
- Continues using relative paths
- May develop false confidence from repeated pattern
- Doesn't recognize that each command should be independent
Success Criteria Missed:
Agent should reset to an absolute path: cd ~/workspace/schemaflow/ruby && bundle exec rspec
Scenario 3: Time Pressure + Sunk Cost
Setup:
- You've been working in ruby/ subproject for 2 hours
- Made 10 commits, all using relative paths
- Tests are passing
- It's 5:45pm, meeting at 6pm
Task: "Quick, run the linter before the meeting"
Expected Agent Response (Baseline):
Most Likely - Quick and Dirty:
rubocop
Rationalization: "We've been in the ruby directory all session and everything's been working. No need to change the pattern now - let's just run rubocop quickly."
Alternative - Minimal cd:
cd ruby && rubocop
Rationalization: "I'll use the same approach that's been working all session to save time."
Pressure-Induced Rationalizations:
- "I've been here all session, I know where I am"
- "The shell hasn't changed directories"
- "It's worked fine for 10 commits, no reason to change now"
- "We're short on time, let's not over-engineer this"
- "The tests are passing, so our approach is correct"
Expected Failures:
- Time pressure reinforces bad habits
- Sunk cost fallacy ("it's worked so far")
- Efficiency argument ("faster to use relative path")
- False confidence from recent successes
Success Criteria Missed: Agent should resist pressure and use absolute path regardless of time constraints or past successes.
Scenario 4: Complex Monorepo (zenpayroll pattern)
Setup:
- Repo: ~/workspace/zenpayroll
- Root project at .
- Component at components/gusto-deprecation
- rubocop MUST run from root
- rspec in components MUST run from component dir
Task: "Run rubocop on the gusto-deprecation component"
Expected Agent Response (Baseline):
Most Likely - Wrong Location:
cd components/gusto-deprecation && rubocop
Rationalization: "I'll navigate to the gusto-deprecation component and run rubocop on it."
Alternative - Assumes Current Location:
rubocop components/gusto-deprecation
Rationalization: "I'll run rubocop from the current location and point it at the component directory."
Alternative - Specifies Files:
cd components/gusto-deprecation && rubocop .
Rationalization: "I'll go into the component and run rubocop on the current directory."
Expected Failures:
- Doesn't check whether rubocop has location requirements
- Assumes rubocop can run from anywhere
- Doesn't use absolute paths
- Doesn't recognize that some tools must run from specific locations
Correct Approach Missed: Based on the rule that "rubocop MUST run from root", agent should use:
cd ~/workspace/zenpayroll && rubocop components/gusto-deprecation
Or with git:
cd $(git rev-parse --show-toplevel) && rubocop components/gusto-deprecation
Key Insight: This scenario requires understanding that different commands have different location requirements. Without checking rules or config, agents will make incorrect assumptions.
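The per-tool location rules could be made explicit instead of being left to memory. The following is a minimal sketch, assuming the two rules from the setup above are the only ones that apply; `tool_workdir` is a hypothetical function name, and a real project might keep such rules in a config file instead.

```shell
# Hypothetical lookup (illustrative only): print the directory a tool must run
# from, given the absolute repo root ($2) and a component path ($3).
# The two rules mirror the scenario setup; unknown tools are refused rather
# than guessed.
tool_workdir() {
  case "$1" in
    rubocop) printf '%s\n' "$2" ;;        # rubocop must run from the root
    rspec)   printf '%s\n' "$2/$3" ;;     # rspec must run from the component dir
    *)       echo "no location rule for: $1" >&2; return 1 ;;
  esac
}
```

With the root held in a variable such as `root=~/workspace/zenpayroll`, the command for this scenario becomes `cd "$(tool_workdir rubocop "$root" components/gusto-deprecation)" && rubocop components/gusto-deprecation`.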
Summary of Expected Baseline Failures
Common Failure Patterns:
- Assumes Shell State - Believes the shell "remembers" where previous commands ran
- Compounds cd Commands - Uses cd subdir repeatedly without absolute paths
- Omits cd Entirely - Assumes current location based on conversation context
- Relative Path Thinking - Defaults to relative paths as "simpler" or "cleaner"
- Pattern Repetition - Continues using the same flawed pattern because it "worked before"
- Efficiency Arguments - Justifies shortcuts by citing time pressure or by calling a repeated cd "wasteful"
- Location Rule Ignorance - Doesn't check whether commands have specific location requirements
Rationalizations to Counter:
| Rationalization | Reality |
|---|---|
| "I just cd'd there" | Shell state is not guaranteed between commands |
| "We've been in that directory all session" | Shell state is not tracked across commands |
| "The shell remembers where I am" | Shell state is not guaranteed |
| "It's wasteful to cd every time" | Bugs from wrong location are more wasteful |
| "Relative paths are simpler" | They break when assumptions are wrong |
| "It's worked for the last 10 commands" | Past success doesn't guarantee current shell state |
| "We're short on time" | Taking time to use absolute paths prevents debugging later |
| "The tests passed, so we must be doing it right" | Success can happen despite a wrong approach |
| "I can track directory state mentally" | Mental tracking is unreliable and doesn't affect shell state |
What the Skill Must Prevent:
- Any use of relative paths in cd commands
- Any assumption about current shell location based on conversation history
- Any omission of cd prefix when running commands that need specific locations
- Any rationalization that shell state can be tracked or remembered
- Pressure-induced shortcuts that skip absolute path usage
- Pattern continuation without verifying each command's path
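The first item in this list lends itself to a mechanical check when reviewing recorded agent commands. The sketch below is a rough heuristic under stated assumptions: `uses_relative_cd` is a hypothetical name, it inspects only a `cd` at the very start of the command string, and it treats targets beginning with `/`, `~`, or a `$(...)` substitution as absolute. It does not detect a missing cd, which is a separate failure pattern.

```shell
# Hypothetical heuristic (illustrative only): succeed (exit 0) when a command
# string starts with a cd to a relative path, fail (exit 1) otherwise.
# Only the leading cd is inspected; later cds in the string are not parsed.
uses_relative_cd() {
  case "$1" in
    "cd /"* | "cd ~"* | 'cd $('* | 'cd "$('*) return 1 ;;  # absolute target
    "cd "*) return 0 ;;                                    # anything else is relative
    *) return 1 ;;                                         # no leading cd at all
  esac
}
```

Run against the baseline responses, `uses_relative_cd "cd ruby && rubocop"` succeeds and flags the command, while the absolute forms given in the success criteria are not flagged.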
Core Principle to Enforce:
Bash shell state is not guaranteed between commands. Always use absolute paths.
This must be non-negotiable regardless of:
- Time pressure
- Past successes
- Efficiency arguments
- Mental tracking confidence
- Conversation context
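The principle can be seen in a small experiment. This sketch assumes that each agent command runs in its own shell process, modelled here with separate `sh -c` calls; real agent tooling may or may not behave this way, which is why the state should not be relied on.

```shell
# Each `sh -c` call stands in for one independent agent command (an assumption
# about the tooling, used here only for illustration).
start="$(pwd)"
first="$(sh -c 'cd / && pwd')"   # this command changes directory, then reports it
second="$(sh -c 'pwd')"          # the cd inside the first call does not carry over
echo "started in:            $start"
echo "first command ran in:  $first"
echo "second command ran in: $second"
```

The second command reports the starting directory, not the one the first command moved to, so a command that needs a specific location has to establish it itself.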
Testing Status
Actual Subagent Testing: NOT YET COMPLETED
These baseline results represent expected patterns based on scenario analysis. Actual subagent testing should be performed to:
- Confirm these failure patterns occur
- Discover additional rationalizations
- Capture verbatim agent responses
- Identify edge cases not covered in scenarios
Next Steps:
- Set up subagent testing infrastructure
- Run each scenario with fresh general-purpose subagents
- Record actual responses verbatim
- Update this document with real data
- Use findings to refine the skill (GREEN phase)
Methodology Notes
The RED-GREEN-REFACTOR approach requires actual failure data to be most effective. This document provides:
- RED Phase Foundation: Expected failure patterns to look for
- Hypothesis Document: What we predict agents will do wrong
- Testing Template: Structure for recording actual results
Once actual testing is possible, this document should be updated with:
- Exact agent responses (quoted verbatim)
- Actual commands executed
- Real rationalizations (not predicted)
- Unexpected behaviors discovered
- Success/failure rates for each scenario