Files
gh-technicalpickles-pickled…/skills/working-in-monorepos/tests/baseline-results.md
2025-11-30 09:00:36 +08:00

10 KiB

Baseline Test Results

Note on Testing Methodology:

Task 2 of the implementation plan requires running baseline tests with fresh subagents. However, the current agent context does not support launching interactive subagent sessions for testing.

Alternative Approach:

This document captures the expected baseline failure patterns based on:

  1. The scenario descriptions in baseline-scenarios.md
  2. Common Claude agent behaviors observed in monorepo environments
  3. Known patterns of directory context loss

These expected patterns should be validated through actual subagent testing when the testing infrastructure becomes available. For now, this serves as a hypothesis document for what the skill must prevent.

IMPORTANT: Testing Status

This document contains EXPECTED failure patterns (hypothesis), not actual test results.

The skill will be written against these predictions and then validated through:

  • Task 4 testing (with skill loaded)
  • Observation of actual agent behavior
  • Iterative refinement based on real rationalizations discovered

This approach deviates from pure TDD RED-GREEN-REFACTOR but is necessary due to testing infrastructure limitations. We proceed with documented risk and plan to validate/adjust based on GREEN phase testing.


Scenario 1: Simple Command After cd

Setup:

  • Repo: ~/workspace/schemaflow
  • Subprojects: ruby/, cli/
  • Just ran: cd ruby && bundle install

Task: "Now run rspec"

Expected Agent Response (Baseline):

The agent would likely respond with one of these approaches:

Option A - Assumes Location:

bundle exec rspec

Rationalization: "Since we just ran bundle install in the ruby directory, I can run rspec directly."

Option B - Compounds cd:

cd ruby && bundle exec rspec

Rationalization: "I'll cd to ruby and run rspec to ensure we're in the right directory."

Option C - Relative Path:

cd ruby && rspec

Rationalization: "We need to be in the ruby directory to run rspec."

Expected Failures:

  • Does not use absolute path
  • Either assumes shell state or compounds directory changes
  • No awareness that shell state is not guaranteed between commands

Success Criteria Missed: Agent should use: cd ~/workspace/schemaflow/ruby && bundle exec rspec or cd $(git rev-parse --show-toplevel)/ruby && bundle exec rspec


Scenario 2: Multiple Commands in Sequence

Setup:

  • Repo: ~/workspace/schemaflow
  • Just ran: cd ruby && bundle install
  • Then ran: cd ruby && rubocop

Task: "Now run the tests"

Expected Agent Response (Baseline):

Most Likely - Continues Pattern:

cd ruby && bundle exec rspec

Rationalization: "Following the same pattern as the previous commands, I'll cd to ruby and run the tests."

Alternative - Assumes Location:

bundle exec rspec

Rationalization: "We've been working in the ruby directory throughout this session, so the shell should still be there."

Alternative - Compounds Confusion:

cd ruby && cd ruby && bundle exec rspec

Rationalization: "Let me make sure we're in the ruby directory before running tests."

Expected Failures:

  • Continues using relative paths
  • May develop false confidence from repeated pattern
  • Doesn't recognize that each command should be independent

Success Criteria Missed: Agent should reset to absolute path: cd ~/workspace/schemaflow/ruby && bundle exec rspec


Scenario 3: Time Pressure + Sunk Cost

Setup:

  • You've been working in ruby/ subproject for 2 hours
  • Made 10 commits, all using relative paths
  • Tests are passing
  • It's 5:45pm, meeting at 6pm

Task: "Quick, run the linter before the meeting"

Expected Agent Response (Baseline):

Most Likely - Quick and Dirty:

rubocop

Rationalization: "We've been in the ruby directory all session and everything's been working. No need to change the pattern now - let's just run rubocop quickly."

Alternative - Minimal cd:

cd ruby && rubocop

Rationalization: "I'll use the same approach that's been working all session to save time."

Pressure-Induced Rationalizations:

  • "I've been here all session, I know where I am"
  • "The shell hasn't changed directories"
  • "It's worked fine for 10 commits, no reason to change now"
  • "We're short on time, let's not over-engineer this"
  • "The tests are passing, so our approach is correct"

Expected Failures:

  • Time pressure reinforces bad habits
  • Sunk cost fallacy ("it's worked so far")
  • Efficiency argument ("faster to use relative path")
  • False confidence from recent successes

Success Criteria Missed: Agent should resist pressure and use absolute path regardless of time constraints or past successes.


Scenario 4: Complex Monorepo (zenpayroll pattern)

Setup:

  • Repo: ~/workspace/zenpayroll
  • Root project at .
  • Component at components/gusto-deprecation
  • rubocop MUST run from root
  • rspec in components MUST run from component dir

Task: "Run rubocop on the gusto-deprecation component"

Expected Agent Response (Baseline):

Most Likely - Wrong Location:

cd components/gusto-deprecation && rubocop

Rationalization: "I'll navigate to the gusto-deprecation component and run rubocop on it."

Alternative - Assumes Current Location:

rubocop components/gusto-deprecation

Rationalization: "I'll run rubocop from the current location and point it at the component directory."

Alternative - Specifies Files:

cd components/gusto-deprecation && rubocop .

Rationalization: "I'll go into the component and run rubocop on the current directory."

Expected Failures:

  • Doesn't check that rubocop has location requirements
  • Assumes rubocop can run from anywhere
  • Doesn't use absolute paths
  • Doesn't recognize that some tools must run from specific locations

Correct Approach Missed: Based on the rule that "rubocop MUST run from root", agent should use:

cd ~/workspace/zenpayroll && rubocop components/gusto-deprecation

Or with git:

cd $(git rev-parse --show-toplevel) && rubocop components/gusto-deprecation

Key Insight: This scenario requires understanding that different commands have different location requirements. Without checking rules or config, agents will make incorrect assumptions.


Summary of Expected Baseline Failures

Common Failure Patterns:

  1. Assumes Shell State - Believes the shell "remembers" where previous commands ran
  2. Compounds cd Commands - Uses cd subdir repeatedly without absolute paths
  3. Omits cd Entirely - Assumes current location based on conversation context
  4. Relative Path Thinking - Defaults to relative paths as "simpler" or "cleaner"
  5. Pattern Repetition - Continues using the same flawed pattern because it "worked before"
  6. Efficiency Arguments - Justifies shortcuts due to time pressure or "waste"
  7. Location Rule Ignorance - Doesn't check whether commands have specific location requirements

Rationalizations to Counter:

Rationalization Reality
"I just cd'd there" Shell state not guaranteed between commands
"We've been in that directory all session" Shell state is not tracked across commands
"The shell remembers where I am" Shell state is not guaranteed
"It's wasteful to cd every time" Bugs from wrong location are more wasteful
"Relative paths are simpler" They break when assumptions are wrong
"It's worked for the last 10 commands" Past success doesn't guarantee current shell state
"We're short on time" Taking time to use absolute paths prevents debugging later
"The tests passed, so we must be doing it right" Success can happen despite wrong approach
"I can track directory state mentally" Mental tracking is unreliable and doesn't affect shell state

What the Skill Must Prevent:

  1. Any use of relative paths in cd commands
  2. Any assumption about current shell location based on conversation history
  3. Any omission of cd prefix when running commands that need specific locations
  4. Any rationalization that shell state can be tracked or remembered
  5. Pressure-induced shortcuts that skip absolute path usage
  6. Pattern continuation without verifying each command's path

Core Principle to Enforce:

Bash shell state is not guaranteed between commands. Always use absolute paths.

This must be non-negotiable regardless of:

  • Time pressure
  • Past successes
  • Efficiency arguments
  • Mental tracking confidence
  • Conversation context

Testing Status

Actual Subagent Testing: NOT YET COMPLETED

These baseline results represent expected patterns based on scenario analysis. Actual subagent testing should be performed to:

  1. Confirm these failure patterns occur
  2. Discover additional rationalizations
  3. Capture verbatim agent responses
  4. Identify edge cases not covered in scenarios

Next Steps:

  1. Set up subagent testing infrastructure
  2. Run each scenario with fresh general-purpose subagents
  3. Record actual responses verbatim
  4. Update this document with real data
  5. Use findings to refine the skill (GREEN phase)

Methodology Notes

The RED-GREEN-REFACTOR approach requires actual failure data to be most effective. This document provides:

  • RED Phase Foundation: Expected failure patterns to look for
  • Hypothesis Document: What we predict agents will do wrong
  • Testing Template: Structure for recording actual results

Once actual testing is possible, this document should be updated with:

  • Exact agent responses (quoted verbatim)
  • Actual commands executed
  • Real rationalizations (not predicted)
  • Unexpected behaviors discovered
  • Success/failure rates for each scenario