Initial commit

2025-11-30 09:06:38 +08:00
commit ed3e4c84c3
76 changed files with 20449 additions and 0 deletions
--- a/skills/root-cause-tracing/SKILL.md
+++ b/skills/root-cause-tracing/SKILL.md
@@ -0,0 +1,566 @@
+---
+name: root-cause-tracing
+description: Use when errors occur deep in execution - traces bugs backward through call stack to find original trigger, not just symptom
+---
+
+<skill_overview>
+Bugs manifest deep in the call stack; trace backward until you find the original trigger, then fix at source, not where error appears.
+</skill_overview>
+
+<rigidity_level>
+MEDIUM FREEDOM - Follow the backward tracing process strictly, but adapt instrumentation and debugging techniques to your language and tools.
+</rigidity_level>
+
+<quick_reference>
+| Step | Action | Question |
+|------|--------|----------|
+| 1 | Read error completely | What failed and where? |
+| 2 | Find immediate cause | What code directly threw this? |
+| 3 | Trace backward one level | What called this code? |
+| 4 | Keep tracing up stack | What called that? |
+| 5 | Find where bad data originated | Where was invalid value created? |
+| 6 | Fix at source | Address root cause |
+| 7 | Add defense at each layer | Validate assumptions as backup |
+
+**Core rule:** Never fix just where error appears. Fix where problem originates.
+</quick_reference>
+
+<when_to_use>
+- Error happens deep in execution (not at entry point)
+- Stack trace shows long call chain
+- Unclear where invalid data originated
+- Need to find which test/code triggers problem
+- Error message points to utility/library code
+
+**Example symptoms:**
+- "Database rejects empty string" ← Where did empty string come from?
+- "File not found: ''" ← Why is path empty?
+- "Invalid argument to function" ← Who passed invalid argument?
+- "Null pointer dereference" ← What should have been initialized?
+</when_to_use>
+
+<the_process>
+## 1. Observe the Symptom
+
+Read the complete error:
+
+```
+Error: Invalid email format: ""
+  at validateEmail (validator.ts:42)
+  at UserService.create (user-service.ts:18)
+  at ApiHandler.createUser (api-handler.ts:67)
+  at HttpServer.handleRequest (server.ts:123)
+  at TestCase.test_create_user (user.test.ts:10)
+```
+
+**Symptom:** Email validation fails on empty string
+**Location:** Deep in validator utility
+
+**DON'T fix here yet.** This might be symptom, not source.
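
Before tracing, it can help to turn the stack trace into an ordered list of frames to visit. The sketch below is one way to do that; `parseFrames` and the `Frame` shape are hypothetical helpers, and the regex assumes frames formatted like the trace above (`at name (file:line)`), which real runtimes only approximate.

```typescript
// Hypothetical helper: turn a stack trace into frames, innermost first.
interface Frame {
  fn: string;
  file: string;
  line: number;
}

function parseFrames(trace: string): Frame[] {
  const parsed: Frame[] = [];
  // Matches lines such as "  at validateEmail (validator.ts:42)"
  const pattern = /^\s*at\s+(\S+)\s+\(([^:()]+):(\d+)\)\s*$/;
  for (const rawLine of trace.split("\n")) {
    const match = pattern.exec(rawLine);
    if (match) {
      parsed.push({ fn: match[1], file: match[2], line: Number(match[3]) });
    }
  }
  return parsed;
}

const sampleTrace = `Error: Invalid email format: ""
  at validateEmail (validator.ts:42)
  at UserService.create (user-service.ts:18)
  at ApiHandler.createUser (api-handler.ts:67)
  at HttpServer.handleRequest (server.ts:123)
  at TestCase.test_create_user (user.test.ts:10)`;

const stackFrames = parseFrames(sampleTrace);
// stackFrames[0] is where the error surfaced; each later entry is one caller up.
const visitOrder = stackFrames.map((f, i) => `${i + 1}. ${f.fn} (${f.file}:${f.line})`);
```

Working through `visitOrder` from top to bottom mirrors steps 2 to 4: start at the frame that threw, then ask at each caller where the bad value came from.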
+
+---
+
+## 2. Find Immediate Cause
+
+What code directly causes this?
+
+```typescript
+// validator.ts:42
+function validateEmail(email: string): boolean {
+  if (!email) throw new Error(`Invalid email format: "${email}"`);
+  return EMAIL_REGEX.test(email);
+}
+```
+
+**Question:** Why is email empty? Keep tracing.
+
+---
+
+## 3. Trace Backward: What Called This?
+
+Use stack trace:
+
+```typescript
+// user-service.ts:18
+create(request: UserRequest): User {
+  validateEmail(request.email); // Called with request.email = ""
+  // ...
+}
+```
+
+**Question:** Why is `request.email` empty? Keep tracing.
+
+---
+
+## 4. Keep Tracing Up the Stack
+
+```typescript
+// api-handler.ts:67
+async createUser(req: Request): Promise<Response> {
+  const userRequest = {
+    name: req.body.name,
+    email: req.body.email || "", // ← FOUND IT!
+  };
+  return this.userService.create(userRequest);
+}
+```
+
+**Root cause found:** API handler provides default empty string when email missing.
+
+---
+
+## 5. Identify the Pattern
+
+**Why empty string as default?**
+- Misguided "safety": Thought empty string better than undefined
+- Should reject invalid request at API boundary
+- Downstream code assumes data already validated
+
+---
+
+## 6. Fix at Source
+
+```typescript
+// api-handler.ts (SOURCE FIX)
+async createUser(req: Request): Promise<Response> {
+  if (!req.body.email) {
+    return Response.badRequest("Email is required");
+  }
+  const userRequest = {
+    name: req.body.name,
+    email: req.body.email, // No default, already validated
+  };
+  return this.userService.create(userRequest);
+}
+```
+
+---
+
+## 7. Add Defense in Depth
+
+After fixing source, add validation at each layer as backup:
+
+```typescript
+// Layer 1: API - Reject invalid input (PRIMARY FIX)
+if (!req.body.email) return Response.badRequest("Email required");
+
+// Layer 2: Service - Validate assumptions
+assert(request.email, "email must be present");
+
+// Layer 3: Utility - Defensive check
+if (!email) throw new Error("invariant violated: email empty");
+```
+
+**Primary fix at source. Defense is backup, not replacement.**
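
Put together, the three layers might look like the sketch below. The `UserRequest` and `ApiResponse` shapes, the function names, and the simplified `EMAIL_REGEX` are assumptions for illustration; the walkthrough does not define them.

```typescript
// Hypothetical shapes; the walkthrough does not define these types.
interface UserRequest {
  name: string;
  email: string;
}

interface ApiResponse {
  statusCode: number;
  body: string;
}

// Simplified pattern, an assumption for this sketch only.
const EMAIL_REGEX = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

// Layer 3: utility. Defensive check; should be unreachable after the source fix.
function validateEmail(email: string): boolean {
  if (!email) throw new Error("invariant violated: email empty");
  return EMAIL_REGEX.test(email);
}

// Layer 2: service. Validates its own assumption before going deeper.
function createUserRecord(request: UserRequest): UserRequest {
  if (!request.email) throw new Error("email must be present");
  if (!validateEmail(request.email)) throw new Error("email is malformed");
  return request;
}

// Layer 1: API boundary (primary fix). Rejects the request before bad data exists.
function handleCreateUser(body: { name: string; email?: string }): ApiResponse {
  if (!body.email) {
    return { statusCode: 400, body: "Email is required" };
  }
  const created = createUserRecord({ name: body.name, email: body.email });
  return { statusCode: 201, body: created.email };
}
```

With the boundary check in place, the errors in layers 2 and 3 should never fire in normal operation; if one does, it points at a new code path that bypasses the API handler.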
+</the_process>
+
+<debugging_approaches>
+## Option 1: Guide User Through Debugger
+
+**IMPORTANT:** Claude cannot run interactive debuggers. Guide user through debugger commands.
+
+```
+"Let's use lldb to trace backward through the call stack.
+
+Please run these commands:
+  lldb target/debug/myapp
+  (lldb) breakpoint set --file validator.rs --line 42
+  (lldb) run
+
+When breakpoint hits:
+  (lldb) frame variable email     # Check value here
+  (lldb) bt                       # See full call stack
+  (lldb) up                       # Move to caller
+  (lldb) frame variable request   # Check values in caller
+  (lldb) up                       # Move up again
+  (lldb) frame variable           # Where empty string created?
+
+Please share:
+  1. Value of 'email' at validator.rs:42
+  2. Value of 'request.email' in user_service.rs
+  3. Value of 'req.body.email' in api_handler.rs
+  4. Where does empty string first appear?"
+```
+
+---
+
+## Option 2: Add Instrumentation (Claude CAN Do This)
+
+When debugger not available or issue intermittent:
+
+```rust
+// Add at error location
+fn validate_email(email: &str) -> Result<()> {
+    eprintln!("DEBUG validate_email called:");
+    eprintln!("  email: {:?}", email);
+    // force_capture() records a backtrace even when RUST_BACKTRACE is unset
+    eprintln!("  backtrace: {}", std::backtrace::Backtrace::force_capture());
+
+    if email.is_empty() {
+        return Err(Error::InvalidEmail);
+    }
+    // ...
+}
+```
+
+**Critical:** Use `eprintln!()` or `console.error()` in tests (not logger - may be suppressed).
+
+**Run and analyze:**
+
+```bash
+# --nocapture keeps the test harness from hiding output of passing tests
+cargo test -- --nocapture 2>&1 | grep -A 10 "DEBUG validate_email"
+```
+
+Look for:
+- Test file names in backtraces
+- Line numbers triggering the call
+- Patterns (same test? same parameter?)
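
For the TypeScript code in the walkthrough, equivalent instrumentation might look like the sketch below. `traceCall` and the `Sink` type are hypothetical names; `new Error().stack` is widely supported but non-standard, so the exact stack format depends on the runtime.

```typescript
// Receives each debug line. Defaults to console.error, per the note above.
type Sink = (line: string) => void;

function traceCall(
  label: string,
  value: unknown,
  sink: Sink = (line) => console.error(line),
): void {
  // new Error().stack records the call chain at this point.
  const stack = new Error().stack ?? "(no stack available)";
  sink(`DEBUG ${label} called:`);
  sink(`  value: ${JSON.stringify(value)}`);
  sink(`  stack: ${stack}`);
}

// Instrumented version of the utility from the walkthrough (validation simplified).
function validateEmail(email: string, sink?: Sink): boolean {
  traceCall("validateEmail", email, sink);
  return email.length > 0;
}
```

Passing a custom sink is optional; it mainly makes the instrumentation easy to assert on or redirect to a file while investigating.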
+</debugging_approaches>
+
+<finding_polluting_tests>
+## Finding Which Test Pollutes
+
+When an unwanted artifact (a file, directory, or leftover state) appears during a test run but you don't know which test creates it:
+
+**Binary search approach:**
+
+```bash
+# Run half the tests
+npm test tests/first-half/*.test.ts
+# Pollution appears? Yes → in first half, No → second half
+
+# Subdivide
+npm test tests/first-quarter/*.test.ts
+
+# Continue until specific file
+npm test tests/auth/login.test.ts  # ← Found it!
+```
+
+**Or test isolation:**
+
+```bash
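+# Run tests one at a time
+shopt -s globstar      # bash: let ** match nested directories
+POLLUTION=".git"       # artifact that should NOT exist; adjust the path to your case
+# Start clean: the artifact must be absent before the loop begins
+for test in tests/**/*.test.ts; do
+  echo "Testing: $test"
+  npm test "$test"
+  if [ -e "$POLLUTION" ]; then
+    echo "FOUND POLLUTER: $test"
+    break
+  fi
+done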
+```
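
The binary search can also be scripted. This is a minimal sketch, assuming exactly one polluting test and a caller-supplied `pollutes` check (hypothetical) that resets state, runs the given subset, and reports whether the artifact appeared.

```typescript
// Returns the single test whose presence makes `pollutes` report true,
// or undefined if the full list is clean. Assumes exactly one polluter
// and that pollution shows up for any subset that contains it.
function findPolluter(
  tests: string[],
  pollutes: (subset: string[]) => boolean,
): string | undefined {
  if (tests.length === 0 || !pollutes(tests)) return undefined;

  let candidates = tests;
  while (candidates.length > 1) {
    const mid = Math.floor(candidates.length / 2);
    const firstHalf = candidates.slice(0, mid);
    // Pollution appears? Yes: it is in the first half. No: the second half.
    candidates = pollutes(firstHalf) ? firstHalf : candidates.slice(mid);
  }
  return candidates[0];
}
```

This needs roughly log2(n) subset runs instead of n single-file runs. If pollution only occurs when two tests interact, the single-polluter assumption breaks and the one-at-a-time loop is the safer fallback.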
+</finding_polluting_tests>
+
+<examples>
+<example>
+<scenario>Developer fixes symptom, not source</scenario>
+
+<code>
+// Error appears in git utility:
+use std::process::Command;
+
+fn git_init(directory: &str) -> std::io::Result<std::process::ExitStatus> {
+    Command::new("git")
+        .arg("init")
+        .current_dir(directory)
+        .status()
+}
+
+// Error: "Invalid argument: empty directory"
+
+// Developer adds validation at symptom:
+fn git_init(directory: &str) -> std::io::Result<std::process::ExitStatus> {
+    if directory.is_empty() {
+        panic!("Directory cannot be empty"); // Band-aid
+    }
+    Command::new("git").arg("init").current_dir(directory).status()
+}
+</code>
+
+<why_it_fails>
+- Fixes symptom, not source (where empty string created)
+- Same bug will appear wherever else the directory is used
+- Doesn't explain WHY directory was empty
+- Future code might make same mistake
+- Band-aid hides the real problem
+</why_it_fails>
+
+<correction>
+**Trace backward:**
+
+1. git_init called with directory=""
+2. WorkspaceManager.init(projectDir="")
+3. Session.create(projectDir="")
+4. Test: Project.create(context.tempDir)
+5. **SOURCE:** context.tempDir="" (accessed before beforeEach!)
+
+**Fix at source:**
+
+```typescript
+function setupTest() {
+  let _tempDir: string | undefined;
+
+  return {
+    beforeEach() {
+      _tempDir = makeTempDir();
+    },
+    get tempDir(): string {
+      if (!_tempDir) {
+        throw new Error("tempDir accessed before beforeEach!");
+      }
+      return _tempDir;
+    }
+  };
+}
+```
+
+**What you gain:**
+- Fixes actual bug (test timing issue)
+- Prevents same mistake elsewhere
+- Clear error at source, not deep in stack
+- No empty strings propagating through system
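
A quick way to see the guard work is the sketch below. `makeTempDir` here is a hypothetical stand-in that only builds a path string and never touches the file system, and the direct `beforeEach()` call stands in for what a test runner would do.

```typescript
// Hypothetical stand-in for the temp-dir helper the fix assumes.
let dirCounter = 0;
function makeTempDir(): string {
  dirCounter += 1;
  return `/tmp/test-workspace-${dirCounter}`;
}

function setupTest() {
  let _tempDir: string | undefined;

  return {
    beforeEach() {
      _tempDir = makeTempDir();
    },
    get tempDir(): string {
      if (!_tempDir) {
        throw new Error("tempDir accessed before beforeEach!");
      }
      return _tempDir;
    },
  };
}

const testContext = setupTest();

// Simulates reading tempDir at module top level, before any hook has run.
let earlyAccessError = "";
try {
  void testContext.tempDir;
} catch (err) {
  earlyAccessError = err instanceof Error ? err.message : String(err);
}

// What the test runner would do before each test.
testContext.beforeEach();
const dirAfterSetup = testContext.tempDir;
```

The early read now fails loudly at the line that made the mistake, instead of handing an empty string to `git init` several layers down.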
+</correction>
+</example>
+
+<example>
+<scenario>Developer stops tracing too early</scenario>
+
+<code>
+# Error in API handler
+async createUser(req: Request): Promise<Response> {
+  const userRequest = {
+    name: req.body.name,
+    email: req.body.email || "", // Suspicious!
+  };
+  return this.userService.create(userRequest);
+}
+
+# Developer sees empty string default and "fixes" it:
+email: req.body.email || "noreply@example.com"
+
+# Ships to production
+# Bug: Users created without email input get noreply@example.com
+# Database has fake emails, can't distinguish missing from real
+</code>
+
+<why_it_fails>
+- Stopped at first suspicious code
+- Didn't question WHY empty string was default
+- "Fixed" by replacing with different wrong default
+- Root cause: shouldn't accept missing email at all
+- Validation should happen at API boundary
+</why_it_fails>
+
+<correction>
+**Keep tracing to understand intent:**
+
+1. Why was empty string default?
+2. Should email be optional or required?
+3. What does API spec say?
+4. What does database schema say?
+
+**Findings:**
+- Email column is NOT NULL in database
+- API docs say email is required
+- Empty string was workaround, not design
+
+**Fix at source (validate at boundary):**
+
+```typescript
+async createUser(req: Request): Promise<Response> {
+  // Validate at API boundary
+  if (!req.body.email) {
+    return Response.badRequest("Email is required");
+  }
+
+  const userRequest = {
+    name: req.body.name,
+    email: req.body.email, // No default needed
+  };
+  return this.userService.create(userRequest);
+}
+```
+
+**What you gain:**
+- Validates at correct layer (API boundary)
+- Clear error message to client
+- No invalid data propagates downstream
+- Database constraints enforced
+- Matches API specification
+</correction>
+</example>
+
+<example>
+<scenario>Complex multi-layer trace to find original trigger</scenario>
+
+<code>
+# Problem: .git directory appearing in source code directory during tests
+
+# Symptom location:
+Error: Cannot initialize git repo (repo already exists)
+Location: src/workspace/git.rs:45
+
+# Developer adds check:
+if Path::new(".git").exists() {
+    return Err("Git already initialized");
+}
+
+# Doesn't help - still appears in wrong place!
+</code>
+
+<why_it_fails>
+- Detects symptom, doesn't prevent it
+- .git still created in wrong directory
+- Doesn't explain HOW it gets there
+- Pollution still happens, just detected
+</why_it_fails>
+
+<correction>
+**Trace through multiple layers:**
+
+```
+1. git init runs with cwd=""
+   ↓ Why is cwd empty?
+
+2. WorkspaceManager.init(projectDir="")
+   ↓ Why is projectDir empty?
+
+3. Session.create(projectDir="")
+   ↓ Why was empty string passed?
+
+4. Test: Project.create(context.tempDir)
+   ↓ Why is context.tempDir empty?
+
+5. ROOT CAUSE:
+   const context = setupTest(); // tempDir="" initially
+   Project.create(context.tempDir); // Accessed at top level!
+
+   beforeEach(() => {
+     context.tempDir = makeTempDir(); // Assigned here
+   });
+
+   TEST ACCESSED TEMPDIR BEFORE BEFOREEACH RAN!
+```
+
+**Fix at source (make early access impossible):**
+
+```typescript
+function setupTest() {
+  let _tempDir: string | undefined;
+
+  return {
+    beforeEach() {
+      _tempDir = makeTempDir();
+    },
+    get tempDir(): string {
+      if (!_tempDir) {
+        throw new Error("tempDir accessed before beforeEach!");
+      }
+      return _tempDir;
+    }
+  };
+}
+```
+
+**Then add defense at each layer:**
+
+```rust
+// Layer 1: Test framework (PRIMARY FIX)
+// Getter throws if accessed early
+
+// Layer 2: Project validation
+fn create(directory: &str) -> Result<Self> {
+    if directory.is_empty() {
+        return Err("Directory cannot be empty");
+    }
+    // ...
+}
+
+// Layer 3: Workspace validation
+fn init(path: &Path) -> Result<()> {
+    if !path.exists() {
+        return Err("Path must exist");
+    }
+    // ...
+}
+
+// Layer 4: Environment guard
+fn git_init(dir: &Path) -> Result<()> {
+    // Enforce only during test runs: tests may git init only inside the temp dir
+    if env::var("NODE_ENV").as_deref() == Ok("test") {
+        if !dir.starts_with("/tmp") {
+            panic!("Refusing to git init outside test dir");
+        }
+    }
+    // ...
+}
+```
+
+**What you gain:**
+- Primary fix prevents early access (source)
+- Each layer validates assumptions (defense)
+- Clear error at source, not deep in stack
+- Environment guard stops test runs from polluting directories outside the temp dir
+- Multi-layer defense catches future mistakes
+</correction>
+</example>
+</examples>
+
+<critical_rules>
+## Rules That Have No Exceptions
+
+1. **Never fix just where error appears** → Trace backward to find source
+2. **Don't stop at first suspicious code** → Keep tracing to original trigger
+3. **Fix at source first** → Defense is backup, not primary fix
+4. **Use debugger OR instrumentation** → Don't guess at call chain
+5. **Add defense at each layer** → After fixing source, validate assumptions throughout
+
+## Common Excuses
+
+All of these mean: **STOP. Trace backward to find source.**
+
+- "Error is obvious here, I'll add validation" (That's a symptom fix)
+- "Stack trace shows the problem" (Shows symptom location, not source)
+- "This code should handle empty values" (Why is value empty? Find source.)
+- "Too deep to trace, I'll add defensive check" (Defense without source fix = band-aid)
+- "Multiple places could cause this" (Trace to find which one actually does)
+</critical_rules>
+
+<verification_checklist>
+Before claiming root cause fixed:
+
+- [ ] Traced backward through entire call chain
+- [ ] Found where invalid data was created (not just passed)
+- [ ] Identified WHY invalid data was created (pattern/assumption)
+- [ ] Fixed at source (where bad data originates)
+- [ ] Added defense at each layer (validate assumptions)
+- [ ] Verified fix with test (reproduces original bug, passes with fix)
+- [ ] Confirmed no other code paths have same pattern
+
+**Can't check all boxes?** Keep tracing backward.
+</verification_checklist>
+
+<integration>
+**This skill is called by:**
+- hyperpowers:debugging-with-tools (Phase 2: Trace Backward Through Call Stack)
+- When errors occur deep in execution
+- When unclear where invalid data originated
+
+**This skill requires:**
+- Stack traces or debugger access
+- Ability to add instrumentation (logging)
+- Understanding of call chain
+
+**This skill calls:**
+- hyperpowers:test-driven-development (write regression test after finding source)
+- hyperpowers:verification-before-completion (verify fix works)
+</integration>
+
+<resources>
+**Detailed guides:**
+- [Debugger commands by language](resources/debugger-reference.md)
+- [Instrumentation patterns](resources/instrumentation-patterns.md)
+- [Defense-in-depth examples](resources/defense-patterns.md)
+
+**When stuck:**
+- Can't find source → Add instrumentation at each layer, run test
+- Stack trace unclear → Use debugger to inspect variables at each frame
+- Multiple suspects → Add instrumentation to all, find which actually executes
+- Intermittent issue → Add instrumentation and wait for reproduction
+</resources>