---
name: Debugging Issues
description: Systematically debug issues with reproduction steps, error analysis, hypothesis testing, and root cause fixes. Use when investigating bugs, analyzing production incidents, or troubleshooting unexpected behavior.
---

# Debugging Issues

## Purpose
Provides systematic approaches to debugging, troubleshooting techniques, and error analysis strategies.

## When to Use
- Investigating bugs or unexpected behavior
- Analyzing error messages and stack traces
- Troubleshooting system issues
- Performance debugging
- Root cause analysis
- Production incident response

## Systematic Debugging Process

### 1. Reproduce the Issue
**Goal**: Create a consistent way to trigger the bug

**Steps:**
- [ ] Document exact steps to reproduce
- [ ] Identify required preconditions
- [ ] Note the environment (OS, browser, versions)
- [ ] Create minimal reproduction case
- [ ] Verify it reproduces consistently

**Example:**
```yaml
reproduction_steps:
  - action: "Login as admin user"
  - action: "Navigate to /dashboard"
  - action: "Click 'Export Data' button"
  - expected: "CSV file downloads"
  - actual: "Error 500 appears"
  - frequency: "Occurs every time"
```
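Environment details are easy to forget when writing up a reproduction. A minimal sketch of a helper that records them alongside the steps above; the function name `capture_environment` and the chosen fields are illustrative assumptions, not part of any existing tool:

```python
import platform
import sys


def capture_environment() -> dict:
    """Collect basic environment details to attach to a bug report."""
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "python_version": platform.python_version(),
        "executable": sys.executable,
    }


# Paste the output into the bug report next to the reproduction steps.
for key, value in capture_environment().items():
    print(f"{key}: {value}")
```

Extend the dictionary with whatever matters for the system under test, such as browser or dependency versions.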

### 2. Isolate the Problem
**Goal**: Narrow down where the issue occurs

**Techniques:**
```yaml
isolation_methods:
  Divide and Conquer:
    description: "Split system in half, test which half has issue"
    example: "Comment out half the code, see if error persists"

  Binary Search:
    description: "Use git bisect or similar to find breaking commit"
    command: "git bisect start && git bisect bad && git bisect good v1.0"

  Component Isolation:
    description: "Test each component individually"
    example: "Test database, API, frontend separately"

  Environment Comparison:
    description: "Compare working vs broken environments"
    checklist:
      - Different OS?
      - Different versions?
      - Different configurations?
      - Different data?
```
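The binary search idea is not limited to git history. A minimal sketch that applies it to any ordered list of candidates (builds, config revisions, data batches), assuming the first candidate is known good, the last is known bad, and everything after the first bad one is also bad. The names `first_bad`, `builds`, and `broken_since` are hypothetical:

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_bad(candidates: Sequence[T], is_bad: Callable[[T], bool]) -> T:
    """Return the first bad candidate using binary search.

    Assumes candidates[0] is good and candidates[-1] is bad.
    """
    lo, hi = 0, len(candidates) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid  # the breakage happened at or before mid
        else:
            lo = mid  # the breakage happened after mid
    return candidates[hi]


# Hypothetical data: in practice is_bad would run the reproduction steps.
builds = ["1.0", "1.1", "1.2", "1.3", "1.4", "1.5"]
broken_since = {"1.3", "1.4", "1.5"}
print(first_bad(builds, lambda build: build in broken_since))  # prints 1.3
```

Each check halves the remaining range, so six builds need at most three checks.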

### 3. Analyze Logs and Errors
**Goal**: Gather evidence about what's going wrong

**Log Analysis:**
```yaml
log_analysis:
  error_messages:
    - Read the full error message
    - Note the error type/code
    - Identify the failing component

  stack_traces:
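    - Start at the frame where the exception was raised (last frame in Python, first in Java and JavaScript)
    - Walk back toward the caller until you reach the first frame in your own (non-library) code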
    - Check function arguments at that point

  correlation:
    - Check logs before the error
    - Look for patterns
    - Correlate with user actions
    - Check timestamps
```
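Checking the logs before the error can be automated. A minimal sketch that yields every error line together with the lines that preceded it; the log format (timestamp, level, message) and the name `errors_with_context` are assumptions for illustration:

```python
from collections import deque
from typing import Iterable, Iterator, List


def errors_with_context(lines: Iterable[str], before: int = 3) -> Iterator[List[str]]:
    """Yield each ERROR line together with up to `before` preceding lines."""
    recent = deque(maxlen=before)  # rolling window of the most recent lines
    for line in lines:
        if " ERROR " in line:
            yield [*recent, line]
        recent.append(line)


# Hypothetical log excerpt.
sample_log = [
    "2024-05-01T10:00:00 INFO request started",
    "2024-05-01T10:00:01 DEBUG loading user 42",
    "2024-05-01T10:00:02 WARNING cache miss",
    "2024-05-01T10:00:03 ERROR export failed",
    "2024-05-01T10:00:04 INFO request finished",
]

for group in errors_with_context(sample_log, before=2):
    print("\n".join(group))
```

Because the function accepts any iterable of lines, an open file object can be passed directly without loading the whole log into memory.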

**Common Error Patterns:**
```python
# NullPointerException / AttributeError
# Usually: Accessing property of None/null object
# Fix: Add null checks or ensure object is initialized

# IndexError / ArrayIndexOutOfBoundsException
# Usually: Accessing array index that doesn't exist
# Fix: Check array length before accessing

# KeyError / Property not found
# Usually: Accessing dict/object key that doesn't exist
# Fix: Use .get() with default or check if key exists

# TypeError / Type mismatch
# Usually: Wrong type passed to function
# Fix: Validate types, add type hints

# ConnectionError / Timeout
# Usually: Network issues or service down
# Fix: Add retry logic, check service health
```
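The retry logic suggested for connection errors could look like the following. This is a minimal sketch under stated assumptions: the helper name `retry`, its parameters, and the `flaky_call` demo are hypothetical, and only `ConnectionError` and `TimeoutError` are treated as transient.

```python
import time


def retry(operation, attempts=3, base_delay=0.1,
          retry_on=(ConnectionError, TimeoutError)):
    """Call operation(), retrying transient failures with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            time.sleep(base_delay * 2 ** (attempt - 1))


attempts_seen = []


def flaky_call():
    """Hypothetical operation that fails twice, then succeeds."""
    attempts_seen.append(1)
    if len(attempts_seen) < 3:
        raise ConnectionError("service unavailable")
    return "ok"


print(retry(flaky_call, base_delay=0.01))  # prints ok
```

Retrying only makes sense for errors that can resolve on their own. Exceptions outside `retry_on`, such as a `KeyError`, propagate immediately so that real bugs are not hidden.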

### 4. Form Hypothesis
**Goal**: Develop theory about what's causing the issue

**Hypothesis Framework:**
```yaml
hypothesis_template:
  observation: "What did you observe?"
  theory: "What do you think is causing it?"
  prediction: "If theory is correct, what else would be true?"
  test: "How can you test this?"

example:
  observation: "API returns 500 error on POST /users"
  theory: "Email validation raises an unhandled exception for some valid formats"
  prediction: "If true, requests with other email formats should succeed"
  test: "Try with various email formats"
```

### 5. Test the Hypothesis
**Goal**: Verify or disprove your theory

**Testing Approaches:**
```yaml
testing_methods:
  Add Logging:
    description: "Add detailed logs around suspected area"
    example: |
      logger.debug(f"Input data: {data}")
      logger.debug(f"Validation result: {is_valid}")

  Add Breakpoints:
    description: "Pause execution to inspect state"
    tools:
      - "pdb for Python"
      - "debugger for JavaScript"
      - "gdb for C/C++"

  Change One Thing:
    description: "Modify one variable at a time"
    example: "Change input value, run again, observe result"

  Write Failing Test:
    description: "Create test that reproduces the bug"
    benefit: "Ensures fix works and prevents regression"
```
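A test that encodes the hypothesis can double as the regression test once the fix is in. The sketch below assumes a hypothetical validator `is_valid_email` whose suspected bug was a character class that rejected plus-addressed emails; the pattern shown is the corrected one, so the test fails against the old pattern and passes against this one.

```python
import re
import unittest

# Hypothetical validator under investigation. The suspected bug: the original
# character class was [\w.-], which rejected addresses such as user+tag@...
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")


def is_valid_email(email: str) -> bool:
    return EMAIL_RE.match(email) is not None


class EmailFormatRegressionTest(unittest.TestCase):
    def test_accepts_common_valid_formats(self):
        valid = ["user@example.com", "first.last@example.com", "user+tag@example.co.uk"]
        for email in valid:
            with self.subTest(email=email):  # report each format separately
                self.assertTrue(is_valid_email(email))

    def test_rejects_malformed_input(self):
        for email in ["no-at-sign", "@example.com", "user@"]:
            with self.subTest(email=email):
                self.assertFalse(is_valid_email(email))
```

Run it with `python -m unittest` from the directory containing the file. Using `subTest` changes one input at a time while still reporting every failing format in a single run.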

### 6. Implement Fix
**Goal**: Resolve the root cause

**Fix Strategies:**
```yaml
fix_approaches:
  Quick Fix:
    when: "Production is down"
    approach: "Minimal change to restore service"
    followup: "Proper fix later"

  Root Cause Fix:
    when: "Have time to do it right"
    approach: "Fix underlying cause"
    benefit: "Prevents similar bugs"

  Workaround:
    when: "Fix is complex, need temporary solution"
    approach: "Add special handling"
    document: "Explain why workaround exists"
```

### 7. Verify the Fix
**Goal**: Ensure the issue is resolved

**Verification Checklist:**
- [ ] Original bug is fixed
- [ ] No new bugs introduced
- [ ] All tests pass
- [ ] Edge cases handled
- [ ] Code reviewed
- [ ] Deployed to test environment
- [ ] Tested in production-like environment

## Debugging Techniques

### Print Debugging
```python
# Simple but effective
def calculate_total(items):
    print(f"DEBUG: items = {items}")
    total = sum(item.price for item in items)
    print(f"DEBUG: total = {total}")
    return total
```

### Interactive Debugging
```python
# Python pdb (on Python 3.7+, the built-in breakpoint() is equivalent by default)
import pdb; pdb.set_trace()

# Common commands:
# n (next) - Execute next line
# s (step) - Step into function
# c (continue) - Continue execution
# p variable - Print variable
# l (list) - Show code context
# q (quit) - Exit debugger
```

### Rubber Duck Debugging
```yaml
rubber_duck_method:
  step_1: "Get a rubber duck (or patient colleague)"
  step_2: "Explain your code line by line"
  step_3: "Explain what you expect to happen"
  step_4: "Explain what actually happens"
  step_5: "Often you'll realize the issue while explaining"
```

### Binary Search Debugging
```bash
# Find which commit introduced a bug
git bisect start
git bisect bad  # Current commit is bad
git bisect good v1.0  # v1.0 was working

# Git will checkout commits for you to test
# After each test, mark as good or bad:
git bisect good  # if works
git bisect bad   # if broken

# Git will find the problematic commit
```
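Marking commits by hand gets tedious on long histories. `git bisect run <command>` automates it by reading the command's exit code: 0 means good, 125 means skip, and other codes from 1 to 127 mean bad. A minimal sketch of a check script in Python; the function name `bisect_exit_code` and the build and test commands are placeholders for whatever builds and exercises the project:

```python
import subprocess
import sys

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def bisect_exit_code(build_cmd, test_cmd):
    """Map build and test results to a `git bisect run` exit code."""
    if subprocess.run(build_cmd).returncode != 0:
        return SKIP  # commit cannot be built, so it cannot be judged
    return GOOD if subprocess.run(test_cmd).returncode == 0 else BAD


# Placeholder commands that only demonstrate the mapping.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]
print(bisect_exit_code(passing, failing))  # prints 1
```

In a real script, replace the demo lines with `sys.exit(bisect_exit_code(build_cmd, test_cmd))`, save it under a name of your choice (for example `bisect_check.py`), and start it with `git bisect run python bisect_check.py` after marking the initial good and bad commits.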

### Adding Instrumentation
```python
# Add metrics to understand behavior
import time
from functools import wraps

def timing_decorator(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()  # monotonic, high-resolution clock for measuring intervals
        result = func(*args, **kwargs)
        duration = time.perf_counter() - start
        print(f"{func.__name__} took {duration:.2f}s")
        return result
    return wrapper

@timing_decorator
def slow_function():
    # Your code here
    pass
```

## Common Debugging Scenarios

### Performance Issues
```yaml
performance_debugging:
  profile_the_code:
    python: "python -m cProfile script.py"
    node: "node --prof script.js"

  identify_bottlenecks:
    - Look for functions called many times
    - Check for slow database queries
    - Identify memory allocations

  optimize:
    - Cache repeated calculations
    - Use more efficient algorithms
    - Add database indexes
    - Implement pagination
```
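Profiling a whole script is often too coarse. A minimal sketch that profiles a single call with the standard library's `cProfile` and `pstats` and returns a report sorted by cumulative time; `profile_call` and `slow_sum` are hypothetical names:

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=5, **kwargs):
    """Run func under cProfile and return (result, text report of the top entries)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


def slow_sum(n):
    """Stand-in for the code under investigation."""
    return sum(i * i for i in range(n))


result, report = profile_call(slow_sum, 100_000)
print(report)
```

The `ncalls` column in the report shows functions called many times, and `cumtime` shows where the time accumulates, which matches the first bottleneck check in the list above.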

### Memory Leaks
```yaml
memory_leak_debugging:
  detect:
    - Monitor memory usage over time
    - Look for steadily increasing memory
    - Check for unclosed resources

  common_causes:
    - Unclosed file handles
    - Unclosed database connections
    - Event listeners not removed
    - Circular references
    - Large objects kept alive by lingering references (caches, globals)

  fix:
    - Use context managers (with statement)
    - Explicitly close connections
    - Remove event listeners
    - Break circular references
```

### Race Conditions
```yaml
race_condition_debugging:
  symptoms:
    - Intermittent failures
    - Hard to reproduce consistently
    - Timing-dependent

  detection:
    - Add logging with timestamps
    - Use thread/process IDs in logs
    - Add artificial delays to expose timing issues

  solutions:
    - Add proper locking (mutex, semaphore)
    - Use atomic operations
    - Redesign to avoid shared state
    - Use message queues
```
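The locking solution can be sketched with a shared counter. `SafeCounter` and `hammer` are hypothetical names; the point is that the read-modify-write in `increment` happens while holding the lock:

```python
import threading


class SafeCounter:
    """Counter whose increment is protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, two threads can read the same value
        # and one of the updates is lost.
        with self._lock:
            self.value += 1


def hammer(counter, threads=8, increments=10_000):
    """Increment the counter from several threads and return the final value."""
    def work():
        for _ in range(increments):
            counter.increment()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return counter.value


print(hammer(SafeCounter()))  # prints 80000
```

A stress loop like `hammer` is also a useful detection tool: running it against the unprotected version of a class makes lost updates far more likely to surface than a single-threaded test would.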

### Database Issues
```yaml
database_debugging:
  slow_queries:
    identify: "EXPLAIN ANALYZE <query> (PostgreSQL; note that it actually executes the query)"
    solutions:
      - Add indexes
      - Optimize joins
      - Reduce data fetched
      - Use connection pooling

  deadlocks:
    detect: "Check database logs for deadlock errors"
    prevent:
      - Acquire locks in consistent order
      - Keep transactions short
      - Use appropriate isolation levels

  connection_issues:
    symptoms: "Connection refused, timeout errors"
    check:
      - Database is running
      - Connection string correct
      - Firewall/network allows connection
      - Connection pool not exhausted
```
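The effect of adding an index can be checked by comparing query plans before and after. The sketch below swaps in SQLite's `EXPLAIN QUERY PLAN` (not PostgreSQL's `EXPLAIN ANALYZE`) so that it runs with only the standard library; the table, index, and helper names are made up for illustration:

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan description for a query as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # the fourth column holds the detail text


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
query = "SELECT * FROM users WHERE email = ?"

plan_before = query_plan(conn, query, ("a@example.com",))
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = query_plan(conn, query, ("a@example.com",))

print("before:", plan_before)  # a full table scan
print("after: ", plan_after)   # a search that uses idx_users_email
```

The exact wording of the plan varies between SQLite versions, so compare the shape (scan versus indexed search) rather than the literal text.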

## Error Analysis Patterns

### Stack Trace Reading
```python
# Example stack trace
Traceback (most recent call last):
  File "app.py", line 45, in main
    process_user(user_data)
  File "services.py", line 23, in process_user
    validate_email(user_data['email'])
  File "validators.py", line 12, in validate_email
    if '@' not in email:
TypeError: argument of type 'NoneType' is not iterable

# Analysis:
# 1. Error: TypeError at line 12 in validators.py
# 2. Cause: 'email' variable is None
# 3. Origin: Likely user_data['email'] is None from services.py line 23
# 4. Fix: Add None check before validation
```
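The same reading can be done programmatically with the standard library's `traceback` module, which is handy when summarizing many errors. A minimal sketch that reproduces the failure above with simplified stand-in functions and reports the frame where the exception was raised; `raising_frame` is a hypothetical helper:

```python
import traceback


def raising_frame(exc):
    """Return (function name, line number) of the frame where exc was raised."""
    frame = traceback.extract_tb(exc.__traceback__)[-1]  # last entry = innermost frame
    return frame.name, frame.lineno


def validate_email(email):
    return "@" in email  # raises TypeError when email is None


def process_user(user_data):
    return validate_email(user_data["email"])


try:
    process_user({"email": None})
except TypeError as exc:
    location = raising_frame(exc)
    call_chain = [frame.name for frame in traceback.extract_tb(exc.__traceback__)]

print(location[0])  # prints validate_email
```

`call_chain` lists the functions from the outermost caller to the raising frame, which is the order in which Python prints a traceback.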

### Error Messages Interpretation
```yaml
error_interpretation:
  "Connection refused":
    likely_causes:
      - Service not running
      - Wrong port
      - Firewall blocking

  "Permission denied":
    likely_causes:
      - Insufficient file permissions
      - User lacks required role
      - Protected resource

  "Resource not found":
    likely_causes:
      - Typo in path/URL
      - Resource deleted
      - Wrong environment

  "Timeout":
    likely_causes:
      - Service too slow
      - Network issues
      - Infinite loop
      - Deadlock
```

## Debugging Checklist

### Before Starting
- [ ] Can you reproduce the issue?
- [ ] Do you have access to logs?
- [ ] Do you have a test environment?
- [ ] Is there a recent change that might have caused it?

### During Debugging
- [ ] Have you isolated the problem area?
- [ ] Have you checked the logs?
- [ ] Have you formed a hypothesis?
- [ ] Have you tested your hypothesis?
- [ ] Are you changing one thing at a time?

### Before Closing
- [ ] Is the original issue fixed?
- [ ] Have you written a test for this bug?
- [ ] Have you checked for similar bugs?
- [ ] Have you documented the root cause?
- [ ] Have you shared knowledge with the team?

## Production Debugging

### Safe Debugging in Production
```yaml
production_debugging:
  do:
    - Add detailed logging
    - Monitor metrics
    - Use feature flags to isolate issues
    - Take snapshots/backups before changes
    - Have rollback plan ready

  dont:
    - Don't use debugger breakpoints (freezes service)
    - Don't make changes without review
    - Don't restart services unnecessarily
    - Don't expose sensitive data in logs
```
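Detailed logging and keeping sensitive data out of logs can pull in opposite directions. One way to reconcile them is a logging filter that masks sensitive values before a record is written. This is a minimal sketch: the key names in `SENSITIVE` and the `key=value` message format are assumptions, and a real deployment would need a pattern that matches its own log format.

```python
import io
import logging
import re

# Assumed key names; extend to match the fields your services log.
SENSITIVE = re.compile(r"(password|token|api_key)=\S+", re.IGNORECASE)


class RedactingFilter(logging.Filter):
    """Mask values of sensitive key=value pairs before a record is emitted."""

    def filter(self, record):
        record.msg = SENSITIVE.sub(r"\1=[REDACTED]", record.getMessage())
        record.args = None  # the message is already fully formatted
        return True


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(RedactingFilter())

demo_logger = logging.getLogger("redaction-demo")
demo_logger.addHandler(handler)
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the demo output on this handler only

demo_logger.info("login user=%s password=%s", "alice", "hunter2")
print(stream.getvalue().strip())  # prints login user=alice password=[REDACTED]
```

Attaching the filter to the handler rather than to a single logger means every record routed through that handler is masked, including records from third-party libraries.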

### Incident Response
```yaml
incident_response:
  immediate:
    - Assess severity
    - Notify stakeholders
    - Start incident log
    - Begin mitigation

  mitigation:
    - Restore service (rollback if needed)
    - Implement workaround
    - Monitor closely

  resolution:
    - Identify root cause
    - Implement proper fix
    - Test thoroughly
    - Deploy fix

  followup:
    - Write postmortem
    - Update runbooks
    - Add monitoring/alerts
    - Share learnings
```

## Tools and Resources

### Debugging Tools
```yaml
tools_by_language:
  python:
    - "pdb - Interactive debugger"
    - "ipdb - Enhanced pdb"
    - "memory_profiler - Memory profiling"
    - "cProfile - Performance profiling"

  javascript:
    - "Chrome DevTools"
    - "Node.js debugger"
    - "VS Code debugger"

  general:
    - "Git bisect - Find breaking commit"
    - "curl - Test APIs"
    - "tcpdump - Network debugging"
    - "strace/dtrace - System call tracing"
```

---
*Use this skill when debugging issues or conducting root cause analysis*