<overview>

The most common debugging mistake: declaring victory too early. A fix isn't complete until it's verified. This document defines what "verified" means and provides systematic approaches to proving your fix works.

</overview>

<definition>

A fix is verified when:

1. **The original issue no longer occurs**
   - The exact reproduction steps now produce correct behavior
   - Not "it seems better" - it definitively works

2. **You understand why the fix works**
   - You can explain the mechanism
   - Not "I changed X and it worked" but "X was causing Y, and changing it prevents Y"

3. **Related functionality still works**
   - You haven't broken adjacent features
   - Regression testing passes

4. **The fix works across environments**
   - Not just on your machine
   - In production-like conditions

5. **The fix is stable**
   - Works consistently, not intermittently
   - Not just "worked once" but "works reliably"

**Anything less than this is not verified.**

</definition>

<examples>

❌ **Not verified**:
- "I ran it once and it didn't crash"
- "It seems to work now"
- "The error message is gone" (but is the behavior correct?)
- "Works on my machine"

✅ **Verified**:
- "I ran the original reproduction steps 20 times - zero failures"
- "The data now saves correctly and I can retrieve it"
- "All existing tests pass, plus I added a test for this scenario"
- "Verified in dev, staging, and production environments"

</examples>

<pattern name="reproduction_verification">

**The golden rule**: If you can't reproduce the bug, you can't verify it's fixed.

**Process**:

1. **Before fixing**: Document exact steps to reproduce

   ```markdown
   Reproduction steps:
   1. Login as admin user
   2. Navigate to /settings
   3. Click "Export Data" button
   4. Observe: Error "Cannot read property 'data' of undefined"
   ```

2. **After fixing**: Execute the same steps exactly

   ```markdown
   Verification:
   1. Login as admin user ✓
   2. Navigate to /settings ✓
   3. Click "Export Data" button ✓
   4. Observe: CSV downloads successfully ✓
   ```

3. **Test edge cases** related to the bug

   ```markdown
   Additional tests:
   - Export with empty data set ✓
   - Export with 1000+ records ✓
   - Export while another request is pending ✓
   ```

**If you can't reproduce the original bug**:
- You don't know if your fix worked
- Maybe it's still broken
- Maybe your "fix" did nothing
- Maybe you fixed a different bug

**Solution**: Revert your fix and rerun the reproduction steps. If the bug comes back without the fix and disappears when you re-apply it, you've verified your fix addressed it.
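
This revert check can be automated. A minimal sketch, assuming git and Jest are available, the reproduction steps are captured in a hypothetical `repro.test.js`, and the fix is the only uncommitted change in the working tree:

```javascript
// Sketch of an automated revert check. repro.test.js is a hypothetical
// test file that encodes the reproduction steps.
const { execSync } = require("child_process");

function testsPass() {
  try {
    execSync("npx jest repro.test.js", { stdio: "pipe" });
    return true;
  } catch {
    return false;
  }
}

// 1. With the fix applied, the reproduction test should pass.
if (!testsPass()) throw new Error("Fix in place, but the bug still reproduces");

// 2. Stash the fix; the bug should come back.
execSync("git stash");
const bugReturns = !testsPass();

// 3. Restore the fix regardless of the outcome.
execSync("git stash pop");

if (!bugReturns) {
  throw new Error("Bug did not return without the fix - something else changed the behavior");
}
console.log("Verified: bug reproduces without the fix and is gone with it.");
```
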
</pattern>

<pattern name="regression_testing">

**The problem**: You fix one thing, break another.

**Why it happens**:
- Your fix changed shared code
- Your fix had unintended side effects
- Your fix broke an assumption other code relied on

**Protection strategy**:

**1. Identify adjacent functionality**
- What else uses the code you changed?
- What features depend on this behavior?
- What workflows include this step?

**2. Test each adjacent area**
- Manually test the happy path
- Check error handling
- Verify data integrity

**3. Run existing tests**
- Unit tests for the module
- Integration tests for the feature
- End-to-end tests for the workflow

<example>

**Fix**: Changed how user sessions are stored (from memory to database)

**Adjacent functionality to verify**:
- Login still works ✓
- Logout still works ✓
- Session timeout still works ✓
- Concurrent logins are handled correctly ✓
- Session data persists across server restarts ✓ (new capability)
- Password reset flow still works ✓
- OAuth login still works ✓

If you only tested "login works", you missed 6 other things that could break.

</example>
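
That adjacency list is worth encoding as an automated suite. A sketch under assumed names - `login`, `logout`, `sessionStore`, and the `ttlMs` option are hypothetical helpers, not a real API:

```javascript
// Hypothetical regression suite for the session-storage change above.
const { login, logout, sessionStore } = require("./auth");

describe("session storage regressions", () => {
  test("login creates a retrievable session", async () => {
    const session = await login("admin", "password");
    expect(await sessionStore.get(session.id)).toBeDefined();
  });

  test("logout destroys the session", async () => {
    const session = await login("admin", "password");
    await logout(session.id);
    expect(await sessionStore.get(session.id)).toBeUndefined();
  });

  test("sessions expire after their timeout", async () => {
    const session = await login("admin", "password", { ttlMs: 10 });
    await new Promise((resolve) => setTimeout(resolve, 50));
    expect(await sessionStore.get(session.id)).toBeUndefined();
  });

  test("concurrent logins get independent sessions", async () => {
    const [a, b] = await Promise.all([
      login("admin", "password"),
      login("admin", "password"),
    ]);
    expect(a.id).not.toBe(b.id);
  });
});
```
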
</pattern>

<pattern name="test_first_debugging">

**Strategy**: Write a failing test that reproduces the bug, then fix until the test passes.

**Benefits**:
- Proves you can reproduce the bug
- Provides automatic verification
- Prevents regression in the future
- Forces you to understand the bug precisely

**Process**:

1. **Write a test that reproduces the bug**

   ```javascript
   test('should handle undefined user data gracefully', () => {
     const result = processUserData(undefined);
     expect(result).toBe(null); // Currently throws error
   });
   ```

2. **Verify the test fails** (confirms it reproduces the bug)

   ```
   ✗ should handle undefined user data gracefully
     TypeError: Cannot read property 'name' of undefined
   ```

3. **Fix the code**

   ```javascript
   function processUserData(user) {
     if (!user) return null; // Add defensive check
     return user.name;
   }
   ```

4. **Verify the test passes**

   ```
   ✓ should handle undefined user data gracefully
   ```

5. **Test is now regression protection**
   - If someone breaks this again, the test will catch it

**When to use**:
- Clear, reproducible bugs
- Code that has test infrastructure
- Bugs that could recur

**When not to use**:
- Exploratory debugging (you don't understand the bug yet)
- Infrastructure issues (can't easily test)
- One-off data issues

</pattern>

<pattern name="environment_verification">

**The trap**: "Works on my machine"

**Reality**: Production is different.

**Differences to consider** (a fingerprint sketch follows these lists):

**Environment variables**:
- `NODE_ENV=development` vs `NODE_ENV=production`
- Different API keys
- Different database connections
- Different feature flags

**Dependencies**:
- Different package versions (if not locked)
- Different system libraries
- Different Node/Python/etc. versions

**Data**:
- Volume (100 records locally, 1M in production)
- Quality (clean test data vs messy real data)
- Edge cases (nulls, special characters, extreme values)

**Network**:
- Latency (local: 5ms, production: 200ms)
- Reliability (local: perfect, production: occasional failures)
- Firewalls, proxies, load balancers
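
One cheap way to surface these differences is to print the same environment fingerprint in every environment and diff the outputs. A minimal sketch - the fields chosen here are illustrative, not exhaustive:

```javascript
// Minimal environment fingerprint. Run in dev, staging, and production,
// then diff the outputs to spot configuration drift.
const fs = require("fs");

console.log(JSON.stringify({
  nodeVersion: process.version,
  nodeEnv: process.env.NODE_ENV || "(unset)",
  platform: process.platform,
  timezone: Intl.DateTimeFormat().resolvedOptions().timeZone,
  // Unlocked dependencies can resolve differently per environment
  dependenciesLocked: fs.existsSync("package-lock.json"),
}, null, 2));
```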

**Verification checklist**:

```markdown
- [ ] Works locally (dev environment)
- [ ] Works in Docker container (mimics production)
- [ ] Works in staging (production-like)
- [ ] Works in production (the real test)
```

<example>

**Bug**: Batch processing fails in production but works locally

**Investigation**:
- Local: 100 test records, completes in 2 seconds
- Production: 50,000 records, times out at 30 seconds

**The difference**: Volume. Local testing didn't catch it.

**Fix verification**:
- Test locally with 50,000 records
- Verify performance in staging
- Monitor first production run
- Confirm all environments work

</example>
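
A sketch of the first of those steps - reproducing production volume locally. `processBatch` and the record shape are hypothetical stand-ins; the 30-second budget comes from the production timeout above:

```javascript
// Generate production-scale synthetic data, including messy values,
// and time the batch job against the production timeout.
const { processBatch } = require("./batch"); // hypothetical module

async function volumeTest(count = 50000) {
  const records = Array.from({ length: count }, (_, i) => ({
    id: i,
    name: i % 97 === 0 ? null : `user-${i}`,         // occasional nulls
    note: i % 131 === 0 ? 'émoji ✓ & "quotes"' : "", // special characters
  }));

  const start = Date.now();
  await processBatch(records);
  const elapsed = Date.now() - start;

  console.log(`Processed ${count} records in ${elapsed}ms`);
  if (elapsed > 30000) {
    throw new Error("Exceeds the 30s production timeout");
  }
}

volumeTest().catch((err) => {
  console.error(err);
  process.exit(1);
});
```
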
</pattern>

<pattern name="stability_testing">

**The problem**: It worked once, but will it work reliably?

**Intermittent bugs are the worst**:
- Hard to reproduce
- Hard to verify fixes
- Easy to declare fixed when they're not

**Verification strategies**:

**1. Repeated execution**

```bash
for i in {1..100}; do
  npm test -- specific-test.js || echo "Failed on run $i"
done
```

If it fails even once, it's not fixed.

**2. Stress testing**

```javascript
const assert = require("assert");

// Run many instances in parallel (processData and testInput are
// whatever operation and input exposed the bug)
const promises = Array(50).fill().map(() => processData(testInput));
const results = await Promise.all(promises);

// Every result must be correct, not just the first one
results.forEach((result) => assert.deepStrictEqual(result, expectedOutput));
```

**3. Soak testing** (a sketch follows this list)
- Run for an extended period (hours, days)
- Monitor for memory leaks, performance degradation
- Ensure stability over time
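
A minimal soak harness, assuming `fn` is the async operation under test - heap readings here are a coarse leak signal, not a precise measurement:

```javascript
// Repeatedly run the operation and watch heap growth over time.
async function soak(fn, durationMs = 60 * 60 * 1000) {
  const start = Date.now();
  const baselineHeap = process.memoryUsage().heapUsed;
  let runs = 0;

  while (Date.now() - start < durationMs) {
    await fn();
    runs += 1;
    if (runs % 1000 === 0) {
      const growthMb = (process.memoryUsage().heapUsed - baselineHeap) / 1e6;
      console.log(`runs=${runs} heapGrowth=${growthMb.toFixed(1)}MB`);
    }
  }
}
```

Heap growth that climbs steadily across thousands of runs is the classic leak signature; a flat line after warm-up is what stability looks like.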

**4. Timing variations**

```javascript
// For race conditions, vary the timing between steps
const randomDelay = (min, max) =>
  new Promise((resolve) => setTimeout(resolve, min + Math.random() * (max - min)));

async function testWithRandomTiming() {
  await randomDelay(0, 100);
  triggerAction1();
  await randomDelay(0, 100);
  triggerAction2();
  await randomDelay(0, 100);
  verifyResult();
}

// Run this 1000 times
```

<example>

**Bug**: Race condition in file upload

**Weak verification**:
- Upload one file
- "It worked!"
- Ship it

**Strong verification**:
- Upload 100 files sequentially: all succeed ✓
- Upload 20 files in parallel: all succeed ✓
- Upload while navigating away: handles correctly ✓
- Upload, cancel, upload again: works ✓
- Run all tests 50 times: zero failures ✓

Now it's verified.

</example>
</pattern>

<checklist>

Copy this checklist when verifying a fix:

```markdown
### Original Issue
- [ ] Can reproduce the original bug before the fix
- [ ] Have documented exact reproduction steps

### Fix Validation
- [ ] Original reproduction steps now work correctly
- [ ] Can explain WHY the fix works
- [ ] Fix is minimal and targeted

### Regression Testing
- [ ] Adjacent feature 1: [name] works
- [ ] Adjacent feature 2: [name] works
- [ ] Adjacent feature 3: [name] works
- [ ] Existing tests pass
- [ ] Added test to prevent regression

### Environment Testing
- [ ] Works in development
- [ ] Works in staging/QA
- [ ] Works in production
- [ ] Tested with production-like data volume

### Stability Testing
- [ ] Tested multiple times (n=__): zero failures
- [ ] Tested edge cases: [list them]
- [ ] Tested under load/stress: stable

### Documentation
- [ ] Code comments explain the fix
- [ ] Commit message explains the root cause
- [ ] If needed, updated user-facing docs

### Sign-off
- [ ] I understand why this bug occurred
- [ ] I understand why this fix works
- [ ] I've verified it works in all relevant environments
- [ ] I've tested for regressions
- [ ] I'm confident this won't recur
```

**Do not merge/deploy until all checkboxes are checked.**

</checklist>

<distrust>

Your verification might be wrong if:

**1. You can't reproduce the original bug anymore**
- Maybe you forgot how
- Maybe the environment changed
- Maybe you're testing the wrong thing
- **Action**: Document reproduction steps FIRST, before fixing

**2. The fix is large or complex**
- Changed 10 files, modified 200 lines
- Too many moving parts
- **Action**: Simplify the fix, then verify each piece

**3. You're not sure why it works**
- "I changed X and the bug went away"
- But you can't explain the mechanism
- **Action**: Investigate until you understand, then verify

**4. It only works sometimes**
- "Usually works now"
- "Seems more stable"
- **Action**: Not verified. Find and fix the remaining issue

**5. You can't test in production-like conditions**
- Only tested locally
- Different data, different scale
- **Action**: Set up a staging environment or use production data in dev

**Red flag phrases**:
- "It seems to work"
- "I think it's fixed"
- "Looks good to me"
- "Can't reproduce anymore" (but you never could reliably)

**Trust-building phrases**:
- "I've verified 50 times - zero failures"
- "All tests pass, including the new regression test"
- "Deployed to staging, tested for 3 days, no issues"
- "Root cause was X, fix addresses X directly, verified by Y"

</distrust>

<mindset>

**Assume your fix is wrong until proven otherwise.**

This isn't pessimism - it's professionalism.

**Questions to ask yourself**:
- "How could this fix fail?"
- "What haven't I tested?"
- "What am I assuming?"
- "Would this survive production?"

**The cost of insufficient verification**:
- Bug returns in production
- User frustration
- Lost trust
- Emergency debugging sessions
- Rollbacks

**The benefit of thorough verification**:
- Confidence in deployment
- Prevention of regressions
- Trust from your team
- Learning from the investigation

**Verification is not optional. It's the most important part of debugging.**

</mindset>