<overview>
The most common debugging mistake: declaring victory too early. A fix isn't complete until it's verified. This document defines what "verified" means and provides systematic approaches to proving your fix works.
</overview>

<definition>
A fix is verified when:

1. **The original issue no longer occurs**
   - The exact reproduction steps now produce correct behavior
   - Not "it seems better" - it definitively works

2. **You understand why the fix works**
   - You can explain the mechanism
   - Not "I changed X and it worked" but "X was causing Y, and changing it prevents Y"

3. **Related functionality still works**
   - You haven't broken adjacent features
   - Regression testing passes

4. **The fix works across environments**
   - Not just on your machine
   - In production-like conditions

5. **The fix is stable**
   - Works consistently, not intermittently
   - Not just "worked once" but "works reliably"

**Anything less than this is not verified.**
</definition>

<examples>
❌ **Not verified**:
- "I ran it once and it didn't crash"
- "It seems to work now"
- "The error message is gone" (but is the behavior correct?)
- "Works on my machine"

✅ **Verified**:
- "I ran the original reproduction steps 20 times - zero failures"
- "The data now saves correctly and I can retrieve it"
- "All existing tests pass, plus I added a test for this scenario"
- "Verified in dev, staging, and production environments"
</examples>

<pattern name="reproduction_verification">
|
||||
**The golden rule**: If you can't reproduce the bug, you can't verify it's fixed.
|
||||
|
||||
**Process**:
|
||||
|
||||
1. **Before fixing**: Document exact steps to reproduce
|
||||
```markdown
|
||||
Reproduction steps:
|
||||
1. Login as admin user
|
||||
2. Navigate to /settings
|
||||
3. Click "Export Data" button
|
||||
4. Observe: Error "Cannot read property 'data' of undefined"
|
||||
```
|
||||
|
||||
2. **After fixing**: Execute the same steps exactly
|
||||
```markdown
|
||||
Verification:
|
||||
1. Login as admin user ✓
|
||||
2. Navigate to /settings ✓
|
||||
3. Click "Export Data" button ✓
|
||||
4. Observe: CSV downloads successfully ✓
|
||||
```
|
||||
|
||||
3. **Test edge cases** related to the bug
|
||||
```markdown
|
||||
Additional tests:
|
||||
- Export with empty data set ✓
|
||||
- Export with 1000+ records ✓
|
||||
- Export while another request is pending ✓
|
||||
```
|
||||
|
||||
**If you can't reproduce the original bug**:
|
||||
- You don't know if your fix worked
|
||||
- Maybe it's still broken
|
||||
- Maybe your "fix" did nothing
|
||||
- Maybe you fixed a different bug
|
||||
|
||||
**Solution**: Revert your fix. If the bug comes back, you've verified your fix addressed it.
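
A minimal sketch of that check, assuming the fix is still uncommitted in your working tree and the reproduction steps are scripted in a hypothetical `./repro.sh`:

```bash
# Assumes: the fix exists as uncommitted working-tree changes, and
# ./repro.sh (hypothetical) exits non-zero when the bug reproduces.

./repro.sh && echo "With fix: repro passes"

git stash push --include-untracked    # temporarily remove the fix
./repro.sh || echo "Without fix: bug is back"

git stash pop                         # restore the fix
./repro.sh && echo "Fix restored: repro passes again"
```

If removing the fix does not bring the bug back, you may have fixed a different problem, or the environment has changed since you last reproduced it.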
</pattern>

<pattern name="regression_testing">
|
||||
**The problem**: You fix one thing, break another.
|
||||
|
||||
**Why it happens**:
|
||||
- Your fix changed shared code
|
||||
- Your fix had unintended side effects
|
||||
- Your fix broke an assumption other code relied on
|
||||
|
||||
**Protection strategy**:
|
||||
|
||||
**1. Identify adjacent functionality**
|
||||
- What else uses the code you changed?
|
||||
- What features depend on this behavior?
|
||||
- What workflows include this step?
|
||||
|
||||
**2. Test each adjacent area**
|
||||
- Manually test the happy path
|
||||
- Check error handling
|
||||
- Verify data integrity
|
||||
|
||||
**3. Run existing tests**
|
||||
- Unit tests for the module
|
||||
- Integration tests for the feature
|
||||
- End-to-end tests for the workflow
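
A command-line sketch of step 3, assuming a Node.js project with Jest-style tests; the module name, test paths, and the `test:e2e` script are placeholders for whatever your project actually uses:

```bash
# Find adjacent code: everything that imports the module the fix touched.
# ("session-store" is a placeholder module name.)
grep -rn "session-store" src/

# Run the existing tests, narrowest scope first.
npm test -- src/session-store.test.js     # unit tests for the module
npm test -- --testPathPattern=auth        # integration tests for the feature
npm run test:e2e                          # end-to-end tests (placeholder script)
```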

<example>
**Fix**: Changed how user sessions are stored (from memory to database)

**Adjacent functionality to verify**:
- Login still works ✓
- Logout still works ✓
- Session timeout still works ✓
- Concurrent logins are handled correctly ✓
- Session data persists across server restarts ✓ (new capability)
- Password reset flow still works ✓
- OAuth login still works ✓

If you only tested "login works", you missed 6 other things that could break.
</example>
</pattern>

<pattern name="test_first_debugging">
|
||||
**Strategy**: Write a failing test that reproduces the bug, then fix until the test passes.
|
||||
|
||||
**Benefits**:
|
||||
- Proves you can reproduce the bug
|
||||
- Provides automatic verification
|
||||
- Prevents regression in the future
|
||||
- Forces you to understand the bug precisely
|
||||
|
||||
**Process**:
|
||||
|
||||
1. **Write a test that reproduces the bug**
|
||||
```javascript
|
||||
test('should handle undefined user data gracefully', () => {
|
||||
const result = processUserData(undefined);
|
||||
expect(result).toBe(null); // Currently throws error
|
||||
});
|
||||
```
|
||||
|
||||
2. **Verify the test fails** (confirms it reproduces the bug)
|
||||
```
|
||||
✗ should handle undefined user data gracefully
|
||||
TypeError: Cannot read property 'name' of undefined
|
||||
```
|
||||
|
||||
3. **Fix the code**
|
||||
```javascript
|
||||
function processUserData(user) {
|
||||
if (!user) return null; // Add defensive check
|
||||
return user.name;
|
||||
}
|
||||
```
|
||||
|
||||
4. **Verify the test passes**
|
||||
```
|
||||
✓ should handle undefined user data gracefully
|
||||
```
|
||||
|
||||
5. **Test is now regression protection**
|
||||
- If someone breaks this again, the test will catch it
|
||||
|
||||
**When to use**:
|
||||
- Clear, reproducible bugs
|
||||
- Code that has test infrastructure
|
||||
- Bugs that could recur
|
||||
|
||||
**When not to use**:
|
||||
- Exploratory debugging (you don't understand the bug yet)
|
||||
- Infrastructure issues (can't easily test)
|
||||
- One-off data issues
|
||||
</pattern>
|
||||
|
||||
|
||||
<pattern name="environment_verification">
|
||||
**The trap**: "Works on my machine"
|
||||
|
||||
**Reality**: Production is different.
|
||||
|
||||
**Differences to consider**:
|
||||
|
||||
**Environment variables**:
|
||||
- `NODE_ENV=development` vs `NODE_ENV=production`
|
||||
- Different API keys
|
||||
- Different database connections
|
||||
- Different feature flags
|
||||
|
||||
**Dependencies**:
|
||||
- Different package versions (if not locked)
|
||||
- Different system libraries
|
||||
- Different Node/Python/etc versions
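
A quick way to spot dependency drift, assuming a Node.js project (these are generic node/npm commands, nothing specific to this codebase):

```bash
node --version        # compare against the version production runs
npm ls --depth=0      # list what is actually installed locally
npm outdated          # show packages that differ from declared ranges
```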

**Data**:
- Volume (100 records locally, 1M in production)
- Quality (clean test data vs messy real data)
- Edge cases (nulls, special characters, extreme values)

**Network**:
- Latency (local: 5ms, production: 200ms)
- Reliability (local: perfect, production: occasional failures)
- Firewalls, proxies, load balancers

**Verification checklist**:
```markdown
- [ ] Works locally (dev environment)
- [ ] Works in Docker container (mimics production)
- [ ] Works in staging (production-like)
- [ ] Works in production (the real test)
```
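
One way to exercise the "Docker container" step, sketched under the assumption that the app already has a Dockerfile; the image name, port, and environment values below are placeholders:

```bash
# Build the same image that ships to production.
docker build -t myapp:verify .

# Run it with production-style settings instead of dev defaults.
docker run --rm -p 3000:3000 \
  -e NODE_ENV=production \
  -e DATABASE_URL="postgres://user:pass@staging-db:5432/app" \
  myapp:verify
```

Then re-run the original reproduction steps against this container rather than against your local dev server.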

<example>
**Bug**: Batch processing fails in production but works locally

**Investigation**:
- Local: 100 test records, completes in 2 seconds
- Production: 50,000 records, times out at 30 seconds

**The difference**: Volume. Local testing didn't catch it.

**Fix verification**:
- Test locally with 50,000 records
- Verify performance in staging
- Monitor first production run
- Confirm all environments work
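
To cover the "50,000 records locally" step, one option is to seed a production-scale dataset before running the batch job; the script paths and record count below are placeholders for whatever seeding and batch entry points the project actually has:

```bash
# Seed a production-scale dataset (hypothetical seed script).
node scripts/seed-test-data.js --records 50000

# Time the batch job against that volume; it should finish
# comfortably inside the 30-second production timeout.
time node scripts/run-batch.js
```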

</example>
</pattern>

<pattern name="stability_testing">
|
||||
**The problem**: It worked once, but will it work reliably?
|
||||
|
||||
**Intermittent bugs are the worst**:
|
||||
- Hard to reproduce
|
||||
- Hard to verify fixes
|
||||
- Easy to declare fixed when they're not
|
||||
|
||||
**Verification strategies**:
|
||||
|
||||
**1. Repeated execution**
|
||||
```bash
|
||||
for i in {1..100}; do
|
||||
npm test -- specific-test.js || echo "Failed on run $i"
|
||||
done
|
||||
```
|
||||
|
||||
If it fails even once, it's not fixed.
|
||||
|
||||
**2. Stress testing**
|
||||
```javascript
|
||||
// Run many instances in parallel
|
||||
const promises = Array(50).fill().map(() =>
|
||||
processData(testInput)
|
||||
);
|
||||
|
||||
const results = await Promise.all(promises);
|
||||
// All results should be correct
|
||||
```
|
||||
|
||||
**3. Soak testing**
|
||||
- Run for extended period (hours, days)
|
||||
- Monitor for memory leaks, performance degradation
|
||||
- Ensure stability over time
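
A minimal soak-test sketch in shell, assuming the scenario can be driven by a script and the server process can be found by name; `./run-scenario.sh` and the `pgrep` pattern are placeholders:

```bash
# Run the scenario repeatedly for ~8 hours, logging failures and the
# server's resident memory over time.
SERVER_PID=$(pgrep -f "node server.js" | head -n 1)
end=$(( SECONDS + 8 * 3600 ))
while [ "$SECONDS" -lt "$end" ]; do
  ./run-scenario.sh || echo "$(date) scenario failed" >> soak.log
  ps -o rss= -p "$SERVER_PID" >> memory.log   # resident memory in KB
  sleep 60
done
```

A steadily growing memory.log means a leak, even if every individual run "passes".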

**4. Timing variations**
```javascript
// For race conditions, add random delays between steps
// (simple helper so the snippet is self-contained)
const randomDelay = (min, max) =>
  new Promise(resolve => setTimeout(resolve, min + Math.random() * (max - min)));

async function testWithRandomTiming() {
  await randomDelay(0, 100);
  triggerAction1();
  await randomDelay(0, 100);
  triggerAction2();
  await randomDelay(0, 100);
  verifyResult();
}

// Run this 1000 times
```

<example>
**Bug**: Race condition in file upload

**Weak verification**:
- Upload one file
- "It worked!"
- Ship it

**Strong verification**:
- Upload 100 files sequentially: all succeed ✓
- Upload 20 files in parallel: all succeed ✓
- Upload while navigating away: handles correctly ✓
- Upload, cancel, upload again: works ✓
- Run all tests 50 times: zero failures ✓

Now it's verified.
</example>
</pattern>

<checklist>
Copy this checklist when verifying a fix:

```markdown
### Original Issue
- [ ] Can reproduce the original bug before the fix
- [ ] Have documented exact reproduction steps

### Fix Validation
- [ ] Original reproduction steps now work correctly
- [ ] Can explain WHY the fix works
- [ ] Fix is minimal and targeted

### Regression Testing
- [ ] Adjacent feature 1: [name] works
- [ ] Adjacent feature 2: [name] works
- [ ] Adjacent feature 3: [name] works
- [ ] Existing tests pass
- [ ] Added test to prevent regression

### Environment Testing
- [ ] Works in development
- [ ] Works in staging/QA
- [ ] Works in production
- [ ] Tested with production-like data volume

### Stability Testing
- [ ] Tested multiple times (n=__): zero failures
- [ ] Tested edge cases: [list them]
- [ ] Tested under load/stress: stable

### Documentation
- [ ] Code comments explain the fix
- [ ] Commit message explains the root cause
- [ ] If needed, updated user-facing docs

### Sign-off
- [ ] I understand why this bug occurred
- [ ] I understand why this fix works
- [ ] I've verified it works in all relevant environments
- [ ] I've tested for regressions
- [ ] I'm confident this won't recur
```

**Do not merge/deploy until all checkboxes are checked.**
</checklist>

<distrust>
Your verification might be wrong if:

**1. You can't reproduce the original bug anymore**
- Maybe you forgot how
- Maybe the environment changed
- Maybe you're testing the wrong thing
- **Action**: Document reproduction steps FIRST, before fixing

**2. The fix is large or complex**
- Changed 10 files, modified 200 lines
- Too many moving parts
- **Action**: Simplify the fix, then verify each piece

**3. You're not sure why it works**
- "I changed X and the bug went away"
- But you can't explain the mechanism
- **Action**: Investigate until you understand, then verify

**4. It only works sometimes**
- "Usually works now"
- "Seems more stable"
- **Action**: Not verified. Find and fix the remaining issue

**5. You can't test in production-like conditions**
- Only tested locally
- Different data, different scale
- **Action**: Set up a staging environment or use production data in dev

**Red flag phrases**:
- "It seems to work"
- "I think it's fixed"
- "Looks good to me"
- "Can't reproduce anymore" (but you never could reliably)

**Trust-building phrases**:
- "I've verified 50 times - zero failures"
- "All tests pass including the new regression test"
- "Deployed to staging, tested for 3 days, no issues"
- "Root cause was X, fix addresses X directly, verified by Y"
</distrust>

<mindset>
**Assume your fix is wrong until proven otherwise.**

This isn't pessimism - it's professionalism.

**Questions to ask yourself**:
- "How could this fix fail?"
- "What haven't I tested?"
- "What am I assuming?"
- "Would this survive production?"

**The cost of insufficient verification**:
- Bug returns in production
- User frustration
- Lost trust
- Emergency debugging sessions
- Rollbacks

**The benefit of thorough verification**:
- Confidence in deployment
- Prevention of regressions
- Trust from the team
- Learning from the investigation

**Verification is not optional. It's the most important part of debugging.**
</mindset>