Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 09:06:38 +08:00
commit ed3e4c84c3
76 changed files with 20449 additions and 0 deletions

---
name: root-cause-tracing
description: Use when errors occur deep in execution - traces bugs backward through call stack to find original trigger, not just symptom
---
<skill_overview>
Bugs manifest deep in the call stack; trace backward until you find the original trigger, then fix at source, not where error appears.
</skill_overview>
<rigidity_level>
MEDIUM FREEDOM - Follow the backward tracing process strictly, but adapt instrumentation and debugging techniques to your language and tools.
</rigidity_level>
<quick_reference>
| Step | Action | Question |
|------|--------|----------|
| 1 | Read error completely | What failed and where? |
| 2 | Find immediate cause | What code directly threw this? |
| 3 | Trace backward one level | What called this code? |
| 4 | Keep tracing up stack | What called that? |
| 5 | Find where bad data originated | Where was invalid value created? |
| 6 | Fix at source | Address root cause |
| 7 | Add defense at each layer | Validate assumptions as backup |

**Core rule:** Never fix just where error appears. Fix where problem originates.
</quick_reference>
<when_to_use>
- Error happens deep in execution (not at entry point)
- Stack trace shows long call chain
- Unclear where invalid data originated
- Need to find which test/code triggers problem
- Error message points to utility/library code
**Example symptoms:**
- "Database rejects empty string" ← Where did empty string come from?
- "File not found: ''" ← Why is path empty?
- "Invalid argument to function" ← Who passed invalid argument?
- "Null pointer dereference" ← What should have been initialized?
</when_to_use>
<the_process>
## 1. Observe the Symptom
Read the complete error:
```
Error: Invalid email format: ""
    at validateEmail (validator.ts:42)
    at UserService.create (user-service.ts:18)
    at ApiHandler.createUser (api-handler.ts:67)
    at HttpServer.handleRequest (server.ts:123)
    at TestCase.test_create_user (user.test.ts:10)
```
**Symptom:** Email validation fails on empty string
**Location:** Deep in validator utility
**DON'T fix here yet.** This might be symptom, not source.
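If it helps to make the call chain explicit before tracing, the trace can be turned into an ordered list of frames. The sketch below is only an illustration: `Frame` and `parseFrames` are hypothetical names, and the pattern assumes the simplified `at fn (file:line)` shape shown above, not the full `file:line:column` form that real runtimes emit.

```typescript
// Hypothetical helper (not from any library): parse the simplified trace
// format used in this document into frames, innermost call first.
interface Frame {
  fn: string;
  file: string;
  line: number;
}

function parseFrames(trace: string): Frame[] {
  // Matches lines such as "at validateEmail (validator.ts:42)".
  const framePattern = /^\s*at (\S+) \(([^:()]+):(\d+)\)$/;
  const frames: Frame[] = [];
  for (const rawLine of trace.split("\n")) {
    const match = framePattern.exec(rawLine);
    if (match) {
      const [, fn = "", file = "", line = "0"] = match;
      frames.push({ fn, file, line: Number(line) });
    }
  }
  return frames;
}

const trace = [
  'Error: Invalid email format: ""',
  "    at validateEmail (validator.ts:42)",
  "    at UserService.create (user-service.ts:18)",
  "    at ApiHandler.createUser (api-handler.ts:67)",
  "    at HttpServer.handleRequest (server.ts:123)",
  "    at TestCase.test_create_user (user.test.ts:10)",
].join("\n");

// frames[0] is where the error surfaced; each later entry is a caller
// to inspect in turn while tracing backward.
const frames = parseFrames(trace);
```

Walking `frames` from index 0 upward gives the order in which to ask "what called this?".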
---
## 2. Find Immediate Cause
What code directly causes this?
```typescript
// validator.ts:42
function validateEmail(email: string): boolean {
  if (!email) throw new Error(`Invalid email format: "${email}"`);
  return EMAIL_REGEX.test(email);
}
```
**Question:** Why is email empty? Keep tracing.
---
## 3. Trace Backward: What Called This?
Use stack trace:
```typescript
// user-service.ts:18
create(request: UserRequest): User {
  validateEmail(request.email); // Called with request.email = ""
  // ...
}
```
**Question:** Why is `request.email` empty? Keep tracing.
---
## 4. Keep Tracing Up the Stack
```typescript
// api-handler.ts:67
async createUser(req: Request): Promise<Response> {
  const userRequest = {
    name: req.body.name,
    email: req.body.email || "", // ← FOUND IT!
  };
  return this.userService.create(userRequest);
}
```
**Root cause found:** API handler provides default empty string when email missing.
---
## 5. Identify the Pattern
**Why empty string as default?**
- Misguided "safety": Thought empty string better than undefined
- Should reject invalid request at API boundary
- Downstream code assumes data already validated
---
## 6. Fix at Source
```typescript
// api-handler.ts (SOURCE FIX)
async createUser(req: Request): Promise<Response> {
  if (!req.body.email) {
    return Response.badRequest("Email is required");
  }
  const userRequest = {
    name: req.body.name,
    email: req.body.email, // No default, already validated
  };
  return this.userService.create(userRequest);
}
```
---
## 7. Add Defense in Depth
After fixing source, add validation at each layer as backup:
```typescript
// Layer 1: API - Reject invalid input (PRIMARY FIX)
if (!req.body.email) return Response.badRequest("Email required");
// Layer 2: Service - Validate assumptions
assert(request.email, "email must be present");
// Layer 3: Utility - Defensive check
if (!email) throw new Error("invariant violated: email empty");
```
**Primary fix at source. Defense is backup, not replacement.**
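To see how the layers fit together, here is a minimal self-contained sketch. Everything in it is an assumption made for illustration: `HttpResult`, `handleCreateUser`, and `createUser` are hypothetical stand-ins for the handler and service above, and `EMAIL_REGEX` is deliberately simplistic.

```typescript
// Hypothetical stand-ins for the three layers; not the real application code.
type HttpResult = { status: number; body: string };
type UserRequest = { name: string; email: string };

const EMAIL_REGEX = /^[^\s@]+@[^\s@]+\.[^\s@]+$/; // simplistic on purpose

// Layer 3: utility. Defensive check; should never fire once layer 1 is fixed.
function validateEmail(email: string): boolean {
  if (!email) throw new Error("invariant violated: email empty");
  return EMAIL_REGEX.test(email);
}

// Layer 2: service. States its assumption about the data it receives.
function createUser(request: UserRequest): UserRequest {
  if (!request.email) throw new Error("email must be present");
  if (!validateEmail(request.email)) {
    throw new Error(`Invalid email format: "${request.email}"`);
  }
  return { name: request.name, email: request.email };
}

// Layer 1: API boundary (primary fix). Rejects the request instead of
// inventing a default value.
function handleCreateUser(body: { name: string; email?: string }): HttpResult {
  if (!body.email) return { status: 400, body: "Email is required" };
  const user = createUser({ name: body.name, email: body.email });
  return { status: 201, body: JSON.stringify(user) };
}

const missingEmail = handleCreateUser({ name: "Ada" });
const validEmail = handleCreateUser({ name: "Ada", email: "ada@example.com" });
```

With the boundary check in place, the inner checks only fire if some other caller bypasses the handler, which is exactly when you want a loud failure.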
</the_process>
<debugging_approaches>
## Option 1: Guide User Through Debugger
**IMPORTANT:** Claude cannot run interactive debuggers. Guide user through debugger commands.
```
"Let's use lldb to trace backward through the call stack.
Please run these commands:
lldb target/debug/myapp
(lldb) breakpoint set --file validator.rs --line 42
(lldb) run
When breakpoint hits:
(lldb) frame variable email # Check value here
(lldb) bt # See full call stack
(lldb) up # Move to caller
(lldb) frame variable request # Check values in caller
(lldb) up # Move up again
(lldb) frame variable # Where empty string created?
Please share:
1. Value of 'email' at validator.rs:42
2. Value of 'request.email' in user_service.rs
3. Value of 'req.body.email' in api_handler.rs
4. Where does empty string first appear?"
```
---
## Option 2: Add Instrumentation (Claude CAN Do This)
When debugger not available or issue intermittent:
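```rust
// Add at error location
fn validate_email(email: &str) -> Result<()> {
    eprintln!("DEBUG validate_email called:");
    eprintln!("  email: {:?}", email);
    // force_capture() records a backtrace even when RUST_BACKTRACE is unset;
    // capture() prints "disabled backtrace" in that case.
    eprintln!("  backtrace: {}", std::backtrace::Backtrace::force_capture());
    if email.is_empty() {
        return Err(Error::InvalidEmail);
    }
    // ...
}
```
**Critical:** Use `eprintln!()` or `console.error()` in tests (not logger - may be suppressed). `cargo test` also captures output from passing tests, so pass `-- --nocapture` to see it.
**Run and analyze:**
```bash
cargo test -- --nocapture 2>&1 | grep "DEBUG validate_email" -A 10
```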
Look for:
- Test file names in backtraces
- Line numbers triggering the call
- Patterns (same test? same parameter?)
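The same instrumentation can be sketched in TypeScript with `console.error` and `new Error().stack`. The `traced` wrapper below is a hypothetical helper written for this example, and the `validateEmail` body is a placeholder rather than the real validator.

```typescript
// Hypothetical wrapper: logs arguments and the current call stack to stderr
// every time the wrapped function is called, then delegates to it.
function traced<A extends unknown[], R>(
  label: string,
  fn: (...args: A) => R
): (...args: A) => R {
  return (...args: A): R => {
    const stack = new Error().stack ?? "(no stack available)";
    console.error(`DEBUG ${label} called:`);
    console.error(`  args: ${JSON.stringify(args)}`);
    console.error(`  stack: ${stack}`);
    return fn(...args);
  };
}

// Placeholder implementation, only here to make the sketch runnable.
function validateEmail(email: string): boolean {
  return email.length > 0;
}

const tracedValidateEmail = traced("validateEmail", validateEmail);
const emptyResult = tracedValidateEmail("");
```

Wrapping at the call site keeps the instrumentation easy to remove once the source is found.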
</debugging_approaches>
<finding_polluting_tests>
## Finding Which Test Pollutes
When something unexpected (a stray file, directory, or leaked state) appears during tests but you don't know which test creates it:
**Binary search approach:**
```bash
# Run half the tests
npm test tests/first-half/*.test.ts
# Pollution appears? Yes → in first half, No → second half
# Subdivide
npm test tests/first-quarter/*.test.ts
# Continue until specific file
npm test tests/auth/login.test.ts  # ← Found it!
```
**Or test isolation:**
```bash
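# Run tests one at a time; stop at the first one that leaves the artifact behind
shopt -s globstar      # bash 4+: lets ** match nested directories
POLLUTION="src/.git"   # example path: set this to the artifact you are hunting
for test in tests/**/*.test.ts; do
  echo "Testing: $test"
  npm test "$test"
  if [ -e "$POLLUTION" ]; then
    echo "FOUND POLLUTER: $test"
    break
  fi
done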
```
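The binary search can also be automated. The sketch below assumes exactly one polluting test that pollutes regardless of which other tests run alongside it; `findPolluter` and the `pollutes` callback are hypothetical, and in practice `pollutes` would clean the workspace, run the given subset, and report whether the artifact appeared.

```typescript
// Hypothetical bisection helper. `pollutes(subset)` must return true when
// running `subset` from a clean state leaves the unwanted artifact behind.
function findPolluter(
  tests: string[],
  pollutes: (subset: string[]) => boolean
): string | undefined {
  // Nothing to find if the full run is clean.
  if (tests.length === 0 || !pollutes(tests)) return undefined;
  let candidates = tests;
  while (candidates.length > 1) {
    const mid = Math.floor(candidates.length / 2);
    const firstHalf = candidates.slice(0, mid);
    // Pollution in the first half? Keep it. Otherwise it must be in the rest.
    candidates = pollutes(firstHalf) ? firstHalf : candidates.slice(mid);
  }
  return candidates[0];
}

// Fake runner for demonstration: pretends login.test.ts is the polluter.
const allTests = [
  "tests/a.test.ts",
  "tests/b.test.ts",
  "tests/auth/login.test.ts",
  "tests/c.test.ts",
  "tests/d.test.ts",
];
const fakeRunner = (subset: string[]): boolean =>
  subset.indexOf("tests/auth/login.test.ts") !== -1;

const polluter = findPolluter(allTests, fakeRunner);
```

This needs roughly log2(n) runs after the initial full run, compared with up to n runs for one-at-a-time isolation.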
</finding_polluting_tests>
<examples>
<example>
<scenario>Developer fixes symptom, not source</scenario>
<code>
# Error appears in git utility:
fn git_init(directory: &str) -> io::Result<ExitStatus> {
    Command::new("git")
        .arg("init")
        .current_dir(directory)
        .status()
}
# Error: "Invalid argument: empty directory"
# Developer adds validation at symptom:
fn git_init(directory: &str) -> io::Result<ExitStatus> {
    if directory.is_empty() {
        panic!("Directory cannot be empty"); // Band-aid
    }
    Command::new("git").arg("init").current_dir(directory).status()
}
</code>
<why_it_fails>
- Fixes symptom, not source (where empty string created)
- Same bug will appear elsewhere directory is used
- Doesn't explain WHY directory was empty
- Future code might make same mistake
- Band-aid hides the real problem
</why_it_fails>
<correction>
**Trace backward:**
1. git_init called with directory=""
2. WorkspaceManager.init(projectDir="")
3. Session.create(projectDir="")
4. Test: Project.create(context.tempDir)
5. **SOURCE:** context.tempDir="" (accessed before beforeEach!)
**Fix at source:**
```typescript
function setupTest() {
  let _tempDir: string | undefined;
  return {
    beforeEach() {
      _tempDir = makeTempDir();
    },
    get tempDir(): string {
      if (!_tempDir) {
        throw new Error("tempDir accessed before beforeEach!");
      }
      return _tempDir;
    }
  };
}
```
**What you gain:**
- Fixes actual bug (test timing issue)
- Prevents same mistake elsewhere
- Clear error at source, not deep in stack
- No empty strings propagating through system
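A short usage sketch shows the getter failing at the point of misuse. `makeTempDir` is stubbed here with a fixed, made-up path so the example is self-contained; a real suite would create a directory on disk.

```typescript
// Stub for illustration only: returns a fixed, hypothetical path.
function makeTempDir(): string {
  return "/tmp/test-workspace-1";
}

function setupTest() {
  let _tempDir: string | undefined;
  return {
    beforeEach(): void {
      _tempDir = makeTempDir();
    },
    get tempDir(): string {
      if (!_tempDir) {
        throw new Error("tempDir accessed before beforeEach!");
      }
      return _tempDir;
    },
  };
}

const context = setupTest();

// Simulates module-level access, before any hook has run.
let earlyAccessError = "";
try {
  void context.tempDir;
} catch (err) {
  earlyAccessError = err instanceof Error ? err.message : String(err);
}

// Simulates the test runner invoking the hook, then a test reading the value.
context.beforeEach();
const dirAfterHook = context.tempDir;
```

The early read now fails immediately with a message naming the mistake, instead of handing an empty string to code several layers down.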
</correction>
</example>
<example>
<scenario>Developer stops tracing too early</scenario>
<code>
# Error in API handler
async createUser(req: Request): Promise<Response> {
  const userRequest = {
    name: req.body.name,
    email: req.body.email || "", // Suspicious!
  };
  return this.userService.create(userRequest);
}
# Developer sees empty string default and "fixes" it:
email: req.body.email || "noreply@example.com"
# Ships to production
# Bug: Users created without email input get noreply@example.com
# Database has fake emails, can't distinguish missing from real
</code>
<why_it_fails>
- Stopped at first suspicious code
- Didn't question WHY empty string was default
- "Fixed" by replacing with different wrong default
- Root cause: shouldn't accept missing email at all
- Validation should happen at API boundary
</why_it_fails>
<correction>
**Keep tracing to understand intent:**
1. Why was empty string default?
2. Should email be optional or required?
3. What does API spec say?
4. What does database schema say?
**Findings:**
- Email column is NOT NULL in database
- API docs say email is required
- Empty string was workaround, not design
**Fix at source (validate at boundary):**
```typescript
async createUser(req: Request): Promise<Response> {
  // Validate at API boundary
  if (!req.body.email) {
    return Response.badRequest("Email is required");
  }
  const userRequest = {
    name: req.body.name,
    email: req.body.email, // No default needed
  };
  return this.userService.create(userRequest);
}
```
**What you gain:**
- Validates at correct layer (API boundary)
- Clear error message to client
- No invalid data propagates downstream
- Database constraints enforced
- Matches API specification
</correction>
</example>
<example>
<scenario>Complex multi-layer trace to find original trigger</scenario>
<code>
# Problem: .git directory appearing in source code directory during tests
# Symptom location:
Error: Cannot initialize git repo (repo already exists)
Location: src/workspace/git.rs:45
# Developer adds check:
if Path::new(".git").exists() {
return Err("Git already initialized");
}
# Doesn't help - still appears in wrong place!
</code>
<why_it_fails>
- Detects symptom, doesn't prevent it
- .git still created in wrong directory
- Doesn't explain HOW it gets there
- Pollution still happens, just detected
</why_it_fails>
<correction>
**Trace through multiple layers:**
```
1. git init runs with cwd=""
   ↓ Why is cwd empty?
2. WorkspaceManager.init(projectDir="")
   ↓ Why is projectDir empty?
3. Session.create(projectDir="")
   ↓ Why was empty string passed?
4. Test: Project.create(context.tempDir)
   ↓ Why is context.tempDir empty?
5. ROOT CAUSE:
   const context = setupTest(); // tempDir="" initially
   Project.create(context.tempDir); // Accessed at top level!
   beforeEach(() => {
     context.tempDir = makeTempDir(); // Assigned here
   });
   TEST ACCESSED TEMPDIR BEFORE BEFOREEACH RAN!
```
**Fix at source (make early access impossible):**
```typescript
function setupTest() {
  let _tempDir: string | undefined;
  return {
    beforeEach() {
      _tempDir = makeTempDir();
    },
    get tempDir(): string {
      if (!_tempDir) {
        throw new Error("tempDir accessed before beforeEach!");
      }
      return _tempDir;
    }
  };
}
```
**Then add defense at each layer:**
```rust
// Layer 1: Test framework (PRIMARY FIX)
// Getter throws if accessed early

// Layer 2: Project validation
fn create(directory: &str) -> Result<Self> {
    if directory.is_empty() {
        return Err("Directory cannot be empty");
    }
    // ...
}

// Layer 3: Workspace validation
fn init(path: &Path) -> Result<()> {
    if !path.exists() {
        return Err("Path must exist");
    }
    // ...
}

// Layer 4: Environment guard
fn git_init(dir: &Path) -> Result<()> {
    // Guard is active only under test: refuse to create repos outside the temp dir
    if env::var("NODE_ENV") == Ok("test".to_string()) {
        if !dir.starts_with("/tmp") {
            panic!("Refusing to git init outside test dir");
        }
    }
    // ...
}
```
**What you gain:**
- Primary fix prevents early access (source)
- Each layer validates assumptions (defense)
- Clear error at source, not deep in stack
- Environment guard prevents production pollution
- Multi-layer defense catches future mistakes
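For a TypeScript codebase, an environment guard of this kind might look like the sketch below. It is an illustration under stated assumptions: `isInsideDir` and `guardGitInit` are hypothetical names, the environment and temp root are passed in as parameters to keep the example self-contained, and the path check handles only absolute POSIX-style paths without `.` or `..` segments. It compares whole path segments so that a directory such as `/tmpfoo` is not mistaken for something inside `/tmp`.

```typescript
// Hypothetical helper: true when `candidate` is `root` or lies beneath it.
// Compares whole segments rather than raw string prefixes.
function isInsideDir(root: string, candidate: string): boolean {
  const rootParts = root.split("/").filter((part) => part.length > 0);
  const candidateParts = candidate.split("/").filter((part) => part.length > 0);
  if (candidateParts.length < rootParts.length) return false;
  return rootParts.every((part, index) => candidateParts[index] === part);
}

// Hypothetical guard: only active when running under test.
function guardGitInit(
  dir: string,
  env: { NODE_ENV?: string },
  tmpRoot: string
): void {
  if (env.NODE_ENV === "test" && !isInsideDir(tmpRoot, dir)) {
    throw new Error(`Refusing to git init outside test dir: "${dir}"`);
  }
}

// Under test, an empty or project-level directory is refused.
let refusedMessage = "";
try {
  guardGitInit("", { NODE_ENV: "test" }, "/tmp");
} catch (err) {
  refusedMessage = err instanceof Error ? err.message : String(err);
}
```

Because the guard checks the environment first, it adds no restriction outside test runs.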
</correction>
</example>
</examples>
<critical_rules>
## Rules That Have No Exceptions
1. **Never fix just where error appears** → Trace backward to find source
2. **Don't stop at first suspicious code** → Keep tracing to original trigger
3. **Fix at source first** → Defense is backup, not primary fix
4. **Use debugger OR instrumentation** → Don't guess at call chain
5. **Add defense at each layer** → After fixing source, validate assumptions throughout
## Common Excuses
All of these mean: **STOP. Trace backward to find source.**
- "Error is obvious here, I'll add validation" (That's a symptom fix)
- "Stack trace shows the problem" (Shows symptom location, not source)
- "This code should handle empty values" (Why is value empty? Find source.)
- "Too deep to trace, I'll add defensive check" (Defense without source fix = band-aid)
- "Multiple places could cause this" (Trace to find which one actually does)
</critical_rules>
<verification_checklist>
Before claiming root cause fixed:
- [ ] Traced backward through entire call chain
- [ ] Found where invalid data was created (not just passed)
- [ ] Identified WHY invalid data was created (pattern/assumption)
- [ ] Fixed at source (where bad data originates)
- [ ] Added defense at each layer (validate assumptions)
- [ ] Verified fix with test (reproduces original bug, passes with fix)
- [ ] Confirmed no other code paths have same pattern
**Can't check all boxes?** Keep tracing backward.
</verification_checklist>
<integration>
**This skill is called by:**
- hyperpowers:debugging-with-tools (Phase 2: Trace Backward Through Call Stack)
- When errors occur deep in execution
- When unclear where invalid data originated
**This skill requires:**
- Stack traces or debugger access
- Ability to add instrumentation (logging)
- Understanding of call chain
**This skill calls:**
- hyperpowers:test-driven-development (write regression test after finding source)
- hyperpowers:verification-before-completion (verify fix works)
</integration>
<resources>
**Detailed guides:**
- [Debugger commands by language](resources/debugger-reference.md)
- [Instrumentation patterns](resources/instrumentation-patterns.md)
- [Defense-in-depth examples](resources/defense-patterns.md)
**When stuck:**
- Can't find source → Add instrumentation at each layer, run test
- Stack trace unclear → Use debugger to inspect variables at each frame
- Multiple suspects → Add instrumentation to all, find which actually executes
- Intermittent issue → Add instrumentation and wait for reproduction
</resources>