gh-withzombies-hyperpowers/skills/root-cause-tracing/SKILL.md
2025-11-30 09:06:38 +08:00


name: root-cause-tracing
description: Use when errors occur deep in execution - traces bugs backward through call stack to find original trigger, not just symptom

<skill_overview> Bugs manifest deep in the call stack; trace backward until you find the original trigger, then fix at source, not where error appears. </skill_overview>

<rigidity_level> MEDIUM FREEDOM - Follow the backward tracing process strictly, but adapt instrumentation and debugging techniques to your language and tools. </rigidity_level>

<quick_reference>

| Step | Action | Question |
|------|--------|----------|
| 1 | Read error completely | What failed and where? |
| 2 | Find immediate cause | What code directly threw this? |
| 3 | Trace backward one level | What called this code? |
| 4 | Keep tracing up stack | What called that? |
| 5 | Find where bad data originated | Where was invalid value created? |
| 6 | Fix at source | Address root cause |
| 7 | Add defense at each layer | Validate assumptions as backup |

Core rule: Never fix just where error appears. Fix where problem originates. </quick_reference>

<when_to_use>

  • Error happens deep in execution (not at entry point)
  • Stack trace shows long call chain
  • Unclear where invalid data originated
  • Need to find which test/code triggers problem
  • Error message points to utility/library code

Example symptoms:

  • "Database rejects empty string" ← Where did empty string come from?
  • "File not found: ''" ← Why is path empty?
  • "Invalid argument to function" ← Who passed invalid argument?
  • "Null pointer dereference" ← What should have been initialized? </when_to_use>

<the_process>

1. Observe the Symptom

Read the complete error:

Error: Invalid email format: ""
  at validateEmail (validator.ts:42)
  at UserService.create (user-service.ts:18)
  at ApiHandler.createUser (api-handler.ts:67)
  at HttpServer.handleRequest (server.ts:123)
  at TestCase.test_create_user (user.test.ts:10)

Symptom: Email validation fails on empty string
Location: Deep in validator utility

DON'T fix here yet. This might be symptom, not source.
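
To make the reading order explicit, the trace above can be split into frames, innermost first. This is a minimal sketch: `parseFrames` and `Frame` are hypothetical names, and the pattern only handles the simplified `at fn (file:line)` shape shown here, not the `file:line:column` form that real runtimes usually print.

```typescript
// Hypothetical helper: split a stack trace into frames, innermost first.
interface Frame {
  fn: string;
  file: string;
  line: number;
}

function parseFrames(trace: string): Frame[] {
  const frames: Frame[] = [];
  // Matches lines like "  at validateEmail (validator.ts:42)".
  const pattern = /^\s*at\s+(\S+)\s+\((.+):(\d+)\)\s*$/;
  for (const rawLine of trace.split("\n")) {
    const match = pattern.exec(rawLine);
    if (match) {
      frames.push({ fn: match[1], file: match[2], line: Number(match[3]) });
    }
  }
  return frames;
}

const exampleTrace = `Error: Invalid email format: ""
  at validateEmail (validator.ts:42)
  at UserService.create (user-service.ts:18)
  at ApiHandler.createUser (api-handler.ts:67)
  at HttpServer.handleRequest (server.ts:123)
  at TestCase.test_create_user (user.test.ts:10)`;

const parsedFrames = parseFrames(exampleTrace);
// parsedFrames[0] is where the error surfaced; each later entry is its caller.
parsedFrames.forEach((frame, depth) => {
  console.error(`${depth}: ${frame.fn} at ${frame.file}:${frame.line}`);
});
```

Frame 0 is the symptom. The source is usually somewhere further down the list, which is why the following steps walk it one caller at a time.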


2. Find Immediate Cause

What code directly causes this?

// validator.ts:42
function validateEmail(email: string): boolean {
  if (!email) throw new Error(`Invalid email format: "${email}"`);
  return EMAIL_REGEX.test(email);
}

Question: Why is email empty? Keep tracing.


3. Trace Backward: What Called This?

Use stack trace:

// user-service.ts:18
create(request: UserRequest): User {
  validateEmail(request.email); // Called with request.email = ""
  // ...
}

Question: Why is request.email empty? Keep tracing.


4. Keep Tracing Up the Stack

// api-handler.ts:67
async createUser(req: Request): Promise<Response> {
  const userRequest = {
    name: req.body.name,
    email: req.body.email || "", // ← FOUND IT!
  };
  return this.userService.create(userRequest);
}

Root cause found: API handler provides default empty string when email missing.


5. Identify the Pattern

Why empty string as default?

  • Misguided "safety": Thought empty string better than undefined
  • Should reject invalid request at API boundary
  • Downstream code assumes data already validated
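
The difference between the two approaches can be sketched in isolation. `CreateUserBody`, `emailWithDefault`, and `emailOrReject` are hypothetical names used only for illustration.

```typescript
// Hypothetical request body shape, for illustration only.
interface CreateUserBody {
  name?: string;
  email?: string;
}

function emailWithDefault(body: CreateUserBody): string {
  // A missing email silently becomes "", so downstream code cannot tell
  // "client sent nothing" apart from "client sent an empty string".
  return body.email || "";
}

function emailOrReject(body: CreateUserBody): string {
  // Rejecting at the boundary means the invalid value is never created.
  if (!body.email) {
    throw new Error("Email is required");
  }
  return body.email;
}

const bodyWithoutEmail: CreateUserBody = { name: "Ada" };
console.log(JSON.stringify(emailWithDefault(bodyWithoutEmail))); // prints ""
```

With the default, the failure surfaces later and deeper (in the validator). With the rejection, it surfaces at the point where the data first became invalid.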

6. Fix at Source

// api-handler.ts (SOURCE FIX)
async createUser(req: Request): Promise<Response> {
  if (!req.body.email) {
    return Response.badRequest("Email is required");
  }
  const userRequest = {
    name: req.body.name,
    email: req.body.email, // No default, already validated
  };
  return this.userService.create(userRequest);
}

7. Add Defense in Depth

After fixing source, add validation at each layer as backup:

// Layer 1: API - Reject invalid input (PRIMARY FIX)
if (!req.body.email) return Response.badRequest("Email required");

// Layer 2: Service - Validate assumptions
assert(request.email, "email must be present");

// Layer 3: Utility - Defensive check
if (!email) throw new Error("invariant violated: email empty");
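
Put together, the three layers might look like the following self-contained sketch. The types, the `invariant` helper, the `handleCreateUser` name, and the simplified regex are assumptions made for illustration; only the three checks themselves come from the layers above.

```typescript
// Hypothetical types and names; only the three checks come from the text above.
interface UserRequest {
  name: string;
  email: string;
}

interface ApiResult {
  status: number;
  body: string;
}

// Small stand-in for an assert helper, so the sketch needs no imports.
function invariant(condition: unknown, message: string): asserts condition {
  if (!condition) throw new Error(message);
}

// Layer 3: Utility - defensive check
function validateEmail(email: string): boolean {
  if (!email) throw new Error("invariant violated: email empty");
  return /^[^@\s]+@[^@\s]+$/.test(email); // simplified stand-in for EMAIL_REGEX
}

// Layer 2: Service - validate assumptions
function createUser(request: UserRequest): UserRequest {
  invariant(request.email, "email must be present");
  invariant(validateEmail(request.email), "email must be well formed");
  return request;
}

// Layer 1: API - reject invalid input (primary fix)
function handleCreateUser(body: { name: string; email?: string }): ApiResult {
  if (!body.email) return { status: 400, body: "Email required" };
  const user = createUser({ name: body.name, email: body.email });
  return { status: 201, body: `created ${user.email}` };
}

console.log(handleCreateUser({ name: "Ada" }).status); // prints 400
console.log(handleCreateUser({ name: "Ada", email: "ada@example.com" }).status); // prints 201
```

In normal operation only Layer 1 ever fires. Layers 2 and 3 exist to fail loudly if a future caller bypasses the API boundary.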

Primary fix at source. Defense is backup, not replacement. </the_process>

<debugging_approaches>

Option 1: Guide User Through Debugger

IMPORTANT: Claude cannot run interactive debuggers. Guide user through debugger commands.

"Let's use lldb to trace backward through the call stack.

Please run these commands:
  lldb target/debug/myapp
  (lldb) breakpoint set --file validator.rs --line 42
  (lldb) run

When breakpoint hits:
  (lldb) frame variable email     # Check value here
  (lldb) bt                       # See full call stack
  (lldb) up                       # Move to caller
  (lldb) frame variable request   # Check values in caller
  (lldb) up                       # Move up again
  (lldb) frame variable           # Where empty string created?

Please share:
  1. Value of 'email' at validator.rs:42
  2. Value of 'request.email' in user_service.rs
  3. Value of 'req.body.email' in api_handler.rs
  4. Where does empty string first appear?"

Option 2: Add Instrumentation (Claude CAN Do This)

When debugger not available or issue intermittent:

// Add at error location
fn validate_email(email: &str) -> Result<()> {
    eprintln!("DEBUG validate_email called:");
    eprintln!("  email: {:?}", email);
    // force_capture() records the trace even when RUST_BACKTRACE is unset
    eprintln!("  backtrace: {}", std::backtrace::Backtrace::force_capture());

    if email.is_empty() {
        return Err(Error::InvalidEmail);
    }
    // ...
}

Critical: Use eprintln!() or console.error() in tests (not logger - may be suppressed).
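
For a TypeScript codebase, the same idea could be sketched with `console.error()` and the stack captured by `new Error()`. `debugTrace` is a hypothetical helper, the `validateEmail` body is a simplified stand-in, and `Error.prototype.stack` is a runtime feature (available in Node and browsers) rather than part of the language standard.

```typescript
// Hypothetical TypeScript counterpart of the Rust instrumentation above.
function debugTrace(label: string, values: Record<string, unknown>): string {
  // new Error().stack records the call stack at this point (runtime-dependent).
  const stack = new Error().stack ?? "(no stack available)";
  const report = [
    `DEBUG ${label} called:`,
    ...Object.entries(values).map(
      ([key, value]) => `  ${key}: ${JSON.stringify(value)}`
    ),
    `  stack: ${stack}`,
  ].join("\n");
  console.error(report); // stderr rather than an application logger
  return report;
}

function validateEmail(email: string): boolean {
  debugTrace("validateEmail", { email });
  if (!email) throw new Error(`Invalid email format: "${email}"`);
  return email.includes("@"); // simplified stand-in for the real format check
}
```

Using `JSON.stringify` on each value makes an empty string show up as `""` instead of disappearing into the log line.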

Run and analyze:

# --nocapture shows output from passing tests too (cargo captures it by default)
cargo test -- --nocapture 2>&1 | grep -A 10 "DEBUG validate_email"

Look for:

  • Test file names in backtraces
  • Line numbers triggering the call
  • Patterns (same test? same parameter?) </debugging_approaches>

<finding_polluting_tests>

Finding Which Test Pollutes

When an unwanted artifact (a stray file, directory, or leftover state) appears during a test run but you don't know which test creates it:

Binary search approach:

# Run half the tests
npm test tests/first-half/*.test.ts
# Pollution appears? Yes → in first half, No → second half

# Subdivide
npm test tests/first-quarter/*.test.ts

# Continue until specific file
npm test tests/auth/login.test.ts  # ← Found it!

Or test isolation:

# Run tests one at a time (bash: globstar makes ** recurse into subdirectories)
shopt -s globstar
for test in tests/**/*.test.ts; do
  echo "Testing: $test"
  npm test "$test"
  if [ -d .git ]; then
    echo "FOUND POLLUTER: $test"
    break
  fi
done
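
The binary search can also be written down as a small function. This is a sketch under stated assumptions: `findPolluter` and `PollutionCheck` are hypothetical names, the predicate is expected to run the given files from a clean state and report whether the artifact appeared, and the search assumes a single polluting file that pollutes on its own.

```typescript
// Hypothetical predicate: runs the given test files from a clean state and
// reports whether the unwanted artifact (e.g. a stray .git directory) appeared.
type PollutionCheck = (testFiles: string[]) => boolean;

function findPolluter(
  testFiles: string[],
  pollutes: PollutionCheck
): string | undefined {
  // Nothing to search if the full set does not reproduce the pollution.
  if (testFiles.length === 0 || !pollutes(testFiles)) return undefined;

  let candidates = testFiles;
  while (candidates.length > 1) {
    const mid = Math.floor(candidates.length / 2);
    const firstHalf = candidates.slice(0, mid);
    // Assumes exactly one polluter: if the first half is clean,
    // the culprit must be in the second half.
    candidates = pollutes(firstHalf) ? firstHalf : candidates.slice(mid);
  }
  return candidates[0];
}

// Usage with a fake predicate standing in for a real test run.
const allTestFiles = [
  "tests/api/users.test.ts",
  "tests/api/orders.test.ts",
  "tests/auth/login.test.ts",
  "tests/auth/logout.test.ts",
  "tests/util/paths.test.ts",
];
const culprit = findPolluter(allTestFiles, (subset) =>
  subset.includes("tests/auth/login.test.ts")
);
console.log(culprit); // prints tests/auth/login.test.ts
```

If the pollution only happens when two tests run together, halving will not isolate it, and the one-at-a-time loop above is the safer fallback.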

</finding_polluting_tests>

Developer fixes symptom, not source

// Error appears in git utility:
fn git_init(directory: &str) {
    Command::new("git")
        .arg("init")
        .current_dir(directory)
        .status()
        .expect("failed to run git");
}

Error: "Invalid argument: empty directory"

Developer adds validation at symptom:

fn git_init(directory: &str) {
    if directory.is_empty() {
        panic!("Directory cannot be empty"); // Band-aid
    }
    Command::new("git")
        .arg("init")
        .current_dir(directory)
        .status()
        .expect("failed to run git");
}

<why_it_fails>

  • Fixes symptom, not source (where empty string created)
  • Same bug will appear elsewhere directory is used
  • Doesn't explain WHY directory was empty
  • Future code might make same mistake
  • Band-aid hides the real problem </why_it_fails>

Trace backward:

  1. git_init called with directory=""
  2. WorkspaceManager.init(projectDir="")
  3. Session.create(projectDir="")
  4. Test: Project.create(context.tempDir)
  5. SOURCE: context.tempDir="" (accessed before beforeEach!)

Fix at source:

function setupTest() {
  let _tempDir: string | undefined;

  return {
    beforeEach() {
      _tempDir = makeTempDir();
    },
    get tempDir(): string {
      if (!_tempDir) {
        throw new Error("tempDir accessed before beforeEach!");
      }
      return _tempDir;
    }
  };
}
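
A short usage sketch shows the effect of the getter. `makeTempDir` here is a stand-in that only returns a path string (a real helper would create the directory), and the top-level access mimics the buggy test.

```typescript
// Stand-in for the makeTempDir helper used above; it only returns a path
// string here, whereas a real helper would create the directory.
let tempDirCounter = 0;
function makeTempDir(): string {
  tempDirCounter += 1;
  return `/tmp/example-${tempDirCounter}`;
}

function setupTest() {
  let _tempDir: string | undefined;

  return {
    beforeEach() {
      _tempDir = makeTempDir();
    },
    get tempDir(): string {
      if (!_tempDir) {
        throw new Error("tempDir accessed before beforeEach!");
      }
      return _tempDir;
    },
  };
}

const context = setupTest();

// Accessing tempDir before beforeEach(), as the buggy test did at top level.
let earlyAccessError = "";
try {
  void context.tempDir;
} catch (error) {
  earlyAccessError = (error as Error).message;
}
console.error(earlyAccessError); // prints tempDir accessed before beforeEach!

// After the hook has run, the getter returns the real directory.
context.beforeEach();
console.error(context.tempDir); // prints /tmp/example-1
```

The empty string never exists, so there is nothing to propagate into `Project.create` or `git_init`.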

What you gain:

  • Fixes actual bug (test timing issue)
  • Prevents same mistake elsewhere
  • Clear error at source, not deep in stack
  • No empty strings propagating through system

Developer stops tracing too early

// Error in API handler
async createUser(req: Request): Promise<Response> {
  const userRequest = {
    name: req.body.name,
    email: req.body.email || "", // Suspicious!
  };
  return this.userService.create(userRequest);
}

Developer sees empty string default and "fixes" it:

email: req.body.email || "noreply@example.com"

Ships to production

Bug: Users created without email input get noreply@example.com

Database has fake emails, can't distinguish missing from real

<why_it_fails>

  • Stopped at first suspicious code
  • Didn't question WHY empty string was default
  • "Fixed" by replacing with different wrong default
  • Root cause: shouldn't accept missing email at all
  • Validation should happen at API boundary </why_it_fails>

Keep tracing to understand intent:

  1. Why was empty string default?
  2. Should email be optional or required?
  3. What does API spec say?
  4. What does database schema say?

Findings:

  • Email column is NOT NULL in database
  • API docs say email is required
  • Empty string was workaround, not design

Fix at source (validate at boundary):

async createUser(req: Request): Promise<Response> {
  // Validate at API boundary
  if (!req.body.email) {
    return Response.badRequest("Email is required");
  }

  const userRequest = {
    name: req.body.name,
    email: req.body.email, // No default needed
  };
  return this.userService.create(userRequest);
}

What you gain:

  • Validates at correct layer (API boundary)
  • Clear error message to client
  • No invalid data propagates downstream
  • Database constraints enforced
  • Matches API specification

Complex multi-layer trace to find original trigger

Problem: .git directory appearing in source code directory during tests

Symptom location:

Error: Cannot initialize git repo (repo already exists)
Location: src/workspace/git.rs:45

Developer adds check:

if Path::new(".git").exists() {
    return Err("Git already initialized");
}

Doesn't help - still appears in wrong place!

<why_it_fails>

  • Detects symptom, doesn't prevent it
  • .git still created in wrong directory
  • Doesn't explain HOW it gets there
  • Pollution still happens, just detected </why_it_fails>

Trace through multiple layers:

1. git init runs with cwd=""
   ↓ Why is cwd empty?

2. WorkspaceManager.init(projectDir="")
   ↓ Why is projectDir empty?

3. Session.create(projectDir="")
   ↓ Why was empty string passed?

4. Test: Project.create(context.tempDir)
   ↓ Why is context.tempDir empty?

5. ROOT CAUSE:
   const context = setupTest(); // tempDir="" initially
   Project.create(context.tempDir); // Accessed at top level!

   beforeEach(() => {
     context.tempDir = makeTempDir(); // Assigned here
   });

   TEST ACCESSED TEMPDIR BEFORE BEFOREEACH RAN!

Fix at source (make early access impossible):

function setupTest() {
  let _tempDir: string | undefined;

  return {
    beforeEach() {
      _tempDir = makeTempDir();
    },
    get tempDir(): string {
      if (!_tempDir) {
        throw new Error("tempDir accessed before beforeEach!");
      }
      return _tempDir;
    }
  };
}

Then add defense at each layer:

// Layer 1: Test framework (PRIMARY FIX)
// Getter throws if accessed early

// Layer 2: Project validation
fn create(directory: &str) -> Result<Self> {
    if directory.is_empty() {
        return Err("Directory cannot be empty");
    }
    // ...
}

// Layer 3: Workspace validation
fn init(path: &Path) -> Result<()> {
    if !path.exists() {
        return Err("Path must exist");
    }
    // ...
}

// Layer 4: Environment guard
fn git_init(dir: &Path) -> Result<()> {
    // During test runs, refuse to create repos outside the temp directory
    if env::var("NODE_ENV").as_deref() == Ok("test") {
        if !dir.starts_with("/tmp") {
            panic!("Refusing to git init outside test dir");
        }
    }
    // ...
}

What you gain:

  • Primary fix prevents early access (source)
  • Each layer validates assumptions (defense)
  • Clear error at source, not deep in stack
  • Environment guard stops test runs from creating repos outside the temp directory
  • Multi-layer defense catches future mistakes

<critical_rules>

Rules That Have No Exceptions

  1. Never fix just where error appears → Trace backward to find source
  2. Don't stop at first suspicious code → Keep tracing to original trigger
  3. Fix at source first → Defense is backup, not primary fix
  4. Use debugger OR instrumentation → Don't guess at call chain
  5. Add defense at each layer → After fixing source, validate assumptions throughout

Common Excuses

All of these mean: STOP. Trace backward to find source.

  • "Error is obvious here, I'll add validation" (That's a symptom fix)
  • "Stack trace shows the problem" (Shows symptom location, not source)
  • "This code should handle empty values" (Why is value empty? Find source.)
  • "Too deep to trace, I'll add defensive check" (Defense without source fix = band-aid)
  • "Multiple places could cause this" (Trace to find which one actually does) </critical_rules>

<verification_checklist> Before claiming root cause fixed:

  • Traced backward through entire call chain
  • Found where invalid data was created (not just passed)
  • Identified WHY invalid data was created (pattern/assumption)
  • Fixed at source (where bad data originates)
  • Added defense at each layer (validate assumptions)
  • Verified fix with test (reproduces original bug, passes with fix)
  • Confirmed no other code paths have same pattern

Can't check all boxes? Keep tracing backward. </verification_checklist>

This skill is called by:

  • hyperpowers:debugging-with-tools (Phase 2: Trace Backward Through Call Stack)
  • When errors occur deep in execution
  • When unclear where invalid data originated

This skill requires:

  • Stack traces or debugger access
  • Ability to add instrumentation (logging)
  • Understanding of call chain

This skill calls:

  • hyperpowers:test-driven-development (write regression test after finding source)
  • hyperpowers:verification-before-completion (verify fix works)

Detailed guides:

  • [Debugger commands by language](resources/debugger-reference.md)
  • [Instrumentation patterns](resources/instrumentation-patterns.md)
  • [Defense-in-depth examples](resources/defense-patterns.md)

When stuck:

  • Can't find source → Add instrumentation at each layer, run test
  • Stack trace unclear → Use debugger to inspect variables at each frame
  • Multiple suspects → Add instrumentation to all, find which actually executes
  • Intermittent issue → Add instrumentation and wait for reproduction