gh-withzombies-hyperpowers/skills/root-cause-tracing/SKILL.md at master

Files

Zhongwei Li ed3e4c84c3 Initial commit

2025-11-30 09:06:38 +08:00

15 KiB

Raw Permalink Blame History

name, description

name	description
root-cause-tracing	Use when errors occur deep in execution - traces bugs backward through call stack to find original trigger, not just symptom

<skill_overview> Bugs manifest deep in the call stack; trace backward until you find the original trigger, then fix at source, not where error appears. </skill_overview>

<rigidity_level> MEDIUM FREEDOM - Follow the backward tracing process strictly, but adapt instrumentation and debugging techniques to your language and tools. </rigidity_level>

<quick_reference>

Step	Action	Question
1	Read error completely	What failed and where?
2	Find immediate cause	What code directly threw this?
3	Trace backward one level	What called this code?
4	Keep tracing up stack	What called that?
5	Find where bad data originated	Where was invalid value created?
6	Fix at source	Address root cause
7	Add defense at each layer	Validate assumptions as backup

Core rule: Never fix just where error appears. Fix where problem originates. </quick_reference>

<when_to_use>

Error happens deep in execution (not at entry point)
Stack trace shows long call chain
Unclear where invalid data originated
Need to find which test/code triggers problem
Error message points to utility/library code

Example symptoms:

"Database rejects empty string" ← Where did empty string come from?
"File not found: ''" ← Why is path empty?
"Invalid argument to function" ← Who passed invalid argument?
"Null pointer dereference" ← What should have been initialized? </when_to_use>

<the_process>

1. Observe the Symptom

Read the complete error:

Error: Invalid email format: ""
  at validateEmail (validator.ts:42)
  at UserService.create (user-service.ts:18)
  at ApiHandler.createUser (api-handler.ts:67)
  at HttpServer.handleRequest (server.ts:123)
  at TestCase.test_create_user (user.test.ts:10)

Symptom: Email validation fails on empty string Location: Deep in validator utility

DON'T fix here yet. This might be symptom, not source.

2. Find Immediate Cause

What code directly causes this?

// validator.ts:42
function validateEmail(email: string): boolean {
  if (!email) throw new Error(`Invalid email format: "${email}"`);
  return EMAIL_REGEX.test(email);
}

Question: Why is email empty? Keep tracing.

3. Trace Backward: What Called This?

Use stack trace:

// user-service.ts:18
create(request: UserRequest): User {
  validateEmail(request.email); // Called with request.email = ""
  // ...
}

Question: Why is request.email empty? Keep tracing.

4. Keep Tracing Up the Stack

// api-handler.ts:67
async createUser(req: Request): Promise<Response> {
  const userRequest = {
    name: req.body.name,
    email: req.body.email || "", // ← FOUND IT!
  };
  return this.userService.create(userRequest);
}

Root cause found: API handler provides default empty string when email missing.

5. Identify the Pattern

Why empty string as default?

Misguided "safety": Thought empty string better than undefined
Should reject invalid request at API boundary
Downstream code assumes data already validated

6. Fix at Source

// api-handler.ts (SOURCE FIX)
async createUser(req: Request): Promise<Response> {
  if (!req.body.email) {
    return Response.badRequest("Email is required");
  }
  const userRequest = {
    name: req.body.name,
    email: req.body.email, // No default, already validated
  };
  return this.userService.create(userRequest);
}

7. Add Defense in Depth

After fixing source, add validation at each layer as backup:

// Layer 1: API - Reject invalid input (PRIMARY FIX)
if (!req.body.email) return Response.badRequest("Email required");

// Layer 2: Service - Validate assumptions
assert(request.email, "email must be present");

// Layer 3: Utility - Defensive check
if (!email) throw new Error("invariant violated: email empty");

Primary fix at source. Defense is backup, not replacement. </the_process>

<debugging_approaches>

Option 1: Guide User Through Debugger

IMPORTANT: Claude cannot run interactive debuggers. Guide user through debugger commands.

"Let's use lldb to trace backward through the call stack.

Please run these commands:
  lldb target/debug/myapp
  (lldb) breakpoint set --file validator.rs --line 42
  (lldb) run

When breakpoint hits:
  (lldb) frame variable email     # Check value here
  (lldb) bt                       # See full call stack
  (lldb) up                       # Move to caller
  (lldb) frame variable request   # Check values in caller
  (lldb) up                       # Move up again
  (lldb) frame variable           # Where empty string created?

Please share:
  1. Value of 'email' at validator.rs:42
  2. Value of 'request.email' in user_service.rs
  3. Value of 'req.body.email' in api_handler.rs
  4. Where does empty string first appear?"

Option 2: Add Instrumentation (Claude CAN Do This)

When debugger not available or issue intermittent:

// Add at error location
fn validate_email(email: &str) -> Result<()> {
    eprintln!("DEBUG validate_email called:");
    eprintln!("  email: {:?}", email);
    eprintln!("  backtrace: {}", std::backtrace::Backtrace::capture());

    if email.is_empty() {
        return Err(Error::InvalidEmail);
    }
    // ...
}

Critical: Use eprintln!() or console.error() in tests (not logger - may be suppressed).

Run and analyze:

cargo test 2>&1 | grep "DEBUG validate_email" -A 10

Look for:

Test file names in backtraces
Line numbers triggering the call
Patterns (same test? same parameter?) </debugging_approaches>

<finding_polluting_tests>

Finding Which Test Pollutes

When something appears during tests but you don't know which:

Binary search approach:

# Run half the tests
npm test tests/first-half/*.test.ts
# Pollution appears? Yes → in first half, No → second half

# Subdivide
npm test tests/first-quarter/*.test.ts

# Continue until specific file
npm test tests/auth/login.test.ts  ← Found it!

Or test isolation:

# Run tests one at a time
for test in tests/**/*.test.ts; do
  echo "Testing: $test"
  npm test "$test"
  if [ -d .git ]; then
    echo "FOUND POLLUTER: $test"
    break
  fi
done

</finding_polluting_tests>

Developer fixes symptom, not source


# Error appears in git utility:
fn git_init(directory: &str) {
    Command::new("git")
        .arg("init")
        .current_dir(directory)
        .run()
}
Error: "Invalid argument: empty directory"
Developer adds validation at symptom:

fn git_init(directory: &str) { if directory.is_empty() { panic!("Directory cannot be empty"); // Band-aid } Command::new("git").arg("init").current_dir(directory).run() }

<why_it_fails>

Fixes symptom, not source (where empty string created)
Same bug will appear elsewhere directory is used
Doesn't explain WHY directory was empty
Future code might make same mistake
Band-aid hides the real problem </why_it_fails>

**Trace backward:**

git_init called with directory=""
WorkspaceManager.init(projectDir="")
Session.create(projectDir="")
Test: Project.create(context.tempDir)
SOURCE: context.tempDir="" (accessed before beforeEach!)

Fix at source:

function setupTest() {
  let _tempDir: string | undefined;

  return {
    beforeEach() {
      _tempDir = makeTempDir();
    },
    get tempDir(): string {
      if (!_tempDir) {
        throw new Error("tempDir accessed before beforeEach!");
      }
      return _tempDir;
    }
  };
}

What you gain:

Fixes actual bug (test timing issue)
Prevents same mistake elsewhere
Clear error at source, not deep in stack
No empty strings propagating through system

Developer stops tracing too early


# Error in API handler
async createUser(req: Request): Promise {
  const userRequest = {
    name: req.body.name,
    email: req.body.email || "", // Suspicious!
  };
  return this.userService.create(userRequest);
}
Developer sees empty string default and "fixes" it:
email: req.body.email || "noreply@example.com"
Ships to production
Bug: Users created without email input get noreply@example.com
Database has fake emails, can't distinguish missing from real

<why_it_fails>

Stopped at first suspicious code
Didn't question WHY empty string was default
"Fixed" by replacing with different wrong default
Root cause: shouldn't accept missing email at all
Validation should happen at API boundary </why_it_fails>

**Keep tracing to understand intent:**

Why was empty string default?
Should email be optional or required?
What does API spec say?
What does database schema say?

Findings:

Email column is NOT NULL in database
API docs say email is required
Empty string was workaround, not design

Fix at source (validate at boundary):

async createUser(req: Request): Promise<Response> {
  // Validate at API boundary
  if (!req.body.email) {
    return Response.badRequest("Email is required");
  }

  const userRequest = {
    name: req.body.name,
    email: req.body.email, // No default needed
  };
  return this.userService.create(userRequest);
}

What you gain:

Validates at correct layer (API boundary)
Clear error message to client
No invalid data propagates downstream
Database constraints enforced
Matches API specification

Complex multi-layer trace to find original trigger


# Problem: .git directory appearing in source code directory during tests
Symptom location:
Error: Cannot initialize git repo (repo already exists)
Location: src/workspace/git.rs:45
Developer adds check:
if Path::new(".git").exists() {
return Err("Git already initialized");
}
Doesn't help - still appears in wrong place!

<why_it_fails>

Detects symptom, doesn't prevent it
.git still created in wrong directory
Doesn't explain HOW it gets there
Pollution still happens, just detected </why_it_fails>

**Trace through multiple layers:**

1. git init runs with cwd=""
   ↓ Why is cwd empty?

2. WorkspaceManager.init(projectDir="")
   ↓ Why is projectDir empty?

3. Session.create(projectDir="")
   ↓ Why was empty string passed?

4. Test: Project.create(context.tempDir)
   ↓ Why is context.tempDir empty?

5. ROOT CAUSE:
   const context = setupTest(); // tempDir="" initially
   Project.create(context.tempDir); // Accessed at top level!

   beforeEach(() => {
     context.tempDir = makeTempDir(); // Assigned here
   });

   TEST ACCESSED TEMPDIR BEFORE BEFOREEACH RAN!

Fix at source (make early access impossible):

function setupTest() {
  let _tempDir: string | undefined;

  return {
    beforeEach() {
      _tempDir = makeTempDir();
    },
    get tempDir(): string {
      if (!_tempDir) {
        throw new Error("tempDir accessed before beforeEach!");
      }
      return _tempDir;
    }
  };
}

Then add defense at each layer:

// Layer 1: Test framework (PRIMARY FIX)
// Getter throws if accessed early

// Layer 2: Project validation
fn create(directory: &str) -> Result<Self> {
    if directory.is_empty() {
        return Err("Directory cannot be empty");
    }
    // ...
}

// Layer 3: Workspace validation
fn init(path: &Path) -> Result<()> {
    if !path.exists() {
        return Err("Path must exist");
    }
    // ...
}

// Layer 4: Environment guard
fn git_init(dir: &Path) -> Result<()> {
    if env::var("NODE_ENV") != Ok("test".to_string()) {
        if !dir.starts_with("/tmp") {
            panic!("Refusing to git init outside test dir");
        }
    }
    // ...
}

What you gain:

Primary fix prevents early access (source)
Each layer validates assumptions (defense)
Clear error at source, not deep in stack
Environment guard prevents production pollution
Multi-layer defense catches future mistakes

<critical_rules>

Rules That Have No Exceptions

Never fix just where error appears → Trace backward to find source
Don't stop at first suspicious code → Keep tracing to original trigger
Fix at source first → Defense is backup, not primary fix
Use debugger OR instrumentation → Don't guess at call chain
Add defense at each layer → After fixing source, validate assumptions throughout

Common Excuses

All of these mean: STOP. Trace backward to find source.

"Error is obvious here, I'll add validation" (That's a symptom fix)
"Stack trace shows the problem" (Shows symptom location, not source)
"This code should handle empty values" (Why is value empty? Find source.)
"Too deep to trace, I'll add defensive check" (Defense without source fix = band-aid)
"Multiple places could cause this" (Trace to find which one actually does) </critical_rules>

<verification_checklist> Before claiming root cause fixed:

Traced backward through entire call chain
Found where invalid data was created (not just passed)
Identified WHY invalid data was created (pattern/assumption)
Fixed at source (where bad data originates)
Added defense at each layer (validate assumptions)
Verified fix with test (reproduces original bug, passes with fix)
Confirmed no other code paths have same pattern

Can't check all boxes? Keep tracing backward. </verification_checklist>

**This skill is called by:** - hyperpowers:debugging-with-tools (Phase 2: Trace Backward Through Call Stack) - When errors occur deep in execution - When unclear where invalid data originated

This skill requires:

Stack traces or debugger access
Ability to add instrumentation (logging)
Understanding of call chain

This skill calls:

hyperpowers:test-driven-development (write regression test after finding source)
hyperpowers:verification-before-completion (verify fix works)

**Detailed guides:** - [Debugger commands by language](resources/debugger-reference.md) - [Instrumentation patterns](resources/instrumentation-patterns.md) - [Defense-in-depth examples](resources/defense-patterns.md)

When stuck:

Can't find source → Add instrumentation at each layer, run test
Stack trace unclear → Use debugger to inspect variables at each frame
Multiple suspects → Add instrumentation to all, find which actually executes
Intermittent issue → Add instrumentation and wait for reproduction

15 KiB Raw Permalink Blame History

1. Observe the Symptom

2. Find Immediate Cause

3. Trace Backward: What Called This?

4. Keep Tracing Up the Stack

5. Identify the Pattern

6. Fix at Source

7. Add Defense in Depth

Option 1: Guide User Through Debugger

Option 2: Add Instrumentation (Claude CAN Do This)

Finding Which Test Pollutes

Error: "Invalid argument: empty directory"

Developer adds validation at symptom:

Developer sees empty string default and "fixes" it:

Ships to production

Bug: Users created without email input get noreply@example.com

Database has fake emails, can't distinguish missing from real

Symptom location:

Developer adds check:

Doesn't help - still appears in wrong place!

Rules That Have No Exceptions

Common Excuses

15 KiB

Raw Permalink Blame History