# Debug Skill - Comprehensive Debugging Toolkit
A professional-grade debugging skill for diagnosing, reproducing, fixing, analyzing, and optimizing complex issues across the entire application stack.
## Overview
The debug skill provides systematic debugging operations that work seamlessly with the **10x-fullstack-engineer** agent to deliver cross-stack debugging expertise, production-grade strategies, and prevention-focused solutions.
## Available Operations
### 1. **diagnose** - Comprehensive Diagnosis and Root Cause Analysis
Performs systematic diagnosis across all layers of the application stack to identify root causes of complex issues.
**Usage:**
```bash
/10x-fullstack-engineer:debug diagnose issue:"Users getting 500 errors on file upload" environment:"production" logs:"logs/app.log"
```
**Parameters:**
- `issue:"description"` (required) - Problem description
- `environment:"prod|staging|dev"` (optional) - Target environment
- `logs:"path"` (optional) - Log file location
- `reproduction:"steps"` (optional) - Steps to reproduce
- `impact:"severity"` (optional) - Issue severity
**What it does:**
- Collects diagnostic data from frontend, backend, database, and infrastructure
- Analyzes symptoms and patterns across all stack layers
- Forms and tests hypotheses systematically
- Identifies root cause with supporting evidence
- Provides actionable recommendations
**Output:**
- Executive summary of issue and root cause
- Detailed diagnostic data from each layer
- Hypothesis analysis with evidence
- Root cause explanation
- Recommended immediate actions and permanent fix
- Prevention measures (monitoring, testing, documentation)
---
### 2. **reproduce** - Create Reliable Reproduction Strategies
Develops reliable strategies to reproduce issues consistently, creating test cases and reproduction documentation.
**Usage:**
```bash
/10x-fullstack-engineer:debug reproduce issue:"Payment webhook fails intermittently" environment:"staging" data:"sample-webhook-payload.json"
```
**Parameters:**
- `issue:"description"` (required) - Issue to reproduce
- `environment:"prod|staging|dev"` (optional) - Environment context
- `data:"path"` (optional) - Test data location
- `steps:"description"` (optional) - Known reproduction steps
- `reliability:"percentage"` (optional) - Current reproduction rate
**What it does:**
- Gathers environment, data, and user context
- Creates local reproduction strategy
- Develops automated test cases (unit, integration, E2E)
- Tests scenario variations and edge cases
- Verifies reproduction reliability
- Documents comprehensive reproduction guide
**Output:**
- Reproduction reliability metrics
- Prerequisites and setup instructions
- Detailed reproduction steps (manual and automated)
- Automated test case code
- Scenario variations tested
- Troubleshooting guide for reproduction issues
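
The reproduction reliability metric can be estimated by running a candidate reproduction repeatedly and counting failures. The sketch below is illustrative only and is not part of the skill: `repro_rate` and `flaky` are hypothetical names, and `flaky` stands in for whatever command or test actually triggers the issue. It prints the percentage of runs that failed.

```bash
# Run a command N times and print the percentage of runs that failed.
# Usage: repro_rate <runs> <command> [args...]
repro_rate() {
  local runs="$1" failures=0 i=0
  shift
  while [ "$i" -lt "$runs" ]; do
    "$@" >/dev/null 2>&1 || failures=$((failures + 1))
    i=$((i + 1))
  done
  echo "$((failures * 100 / runs))"
}

# Hypothetical stand-in for a flaky test: fails on every even-numbered call.
counter_file="$(mktemp)"
echo 0 > "$counter_file"
flaky() {
  local n
  n=$(( $(cat "$counter_file") + 1 ))
  echo "$n" > "$counter_file"
  [ $(( n % 2 )) -eq 1 ]
}

repro_rate 10 flaky   # prints 50
rm -f "$counter_file"
```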
---
### 3. **fix** - Implement Targeted Fixes with Verification
Implements targeted fixes with comprehensive verification, safeguards, and prevention measures.
**Usage:**
```bash
/10x-fullstack-engineer:debug fix issue:"Race condition in order processing" root_cause:"Missing transaction lock" verification:"run-integration-tests"
```
**Parameters:**
- `issue:"description"` (required) - Issue being fixed
- `root_cause:"cause"` (required) - Identified root cause
- `verification:"strategy"` (optional) - Verification approach
- `scope:"areas"` (optional) - Affected code areas
- `rollback:"plan"` (optional) - Rollback strategy
**What it does:**
- Designs appropriate fix pattern for the issue type
- Implements fix with safety measures
- Adds safeguards (validation, rate limiting, circuit breakers)
- Performs multi-level verification (unit, integration, load, production)
- Adds prevention measures (tests, monitoring, alerts)
- Documents fix and deployment plan
**Fix patterns supported:**
- Missing error handling
- Race conditions
- Memory leaks
- Missing validation
- N+1 query problems
- Configuration issues
- Infrastructure limits
**Output:**
- Detailed fix implementation with before/after code
- Safeguards added (validation, error handling, monitoring)
- Verification results at all levels
- Prevention measures (tests, alerts, documentation)
- Deployment plan with rollback strategy
- Files modified and commits made
---
### 4. **analyze-logs** - Deep Log Analysis with Pattern Detection
Performs deep log analysis with pattern detection, timeline correlation, and anomaly identification.
**Usage:**
```bash
/10x-fullstack-engineer:debug analyze-logs path:"logs/application.log" pattern:"ERROR.*timeout" timeframe:"last-24h"
```
**Parameters:**
- `path:"log-file-path"` (required) - Log file to analyze
- `pattern:"regex"` (optional) - Filter pattern
- `timeframe:"range"` (optional) - Time range to analyze
- `level:"error|warn|info"` (optional) - Log level filter
- `context:"lines"` (optional) - Context lines around matches
**What it does:**
- Discovers and filters relevant logs across all sources
- Detects error patterns and clusters similar errors
- Performs timeline analysis and event correlation
- Traces individual requests across services
- Identifies statistical anomalies and spikes
- Analyzes performance, user impact, and security issues
**Utility script:**
```bash
./commands/debug/.scripts/analyze-logs.sh \
--file logs/application.log \
--level ERROR \
--since "1 hour ago" \
--context 5
```
**Output:**
- Summary of findings with key statistics
- Top errors with frequency and patterns
- Timeline of critical events
- Request tracing through distributed system
- Anomaly detection (spikes, new errors)
- Performance analysis from logs
- User impact assessment
- Root cause analysis based on log patterns
- Recommendations for fixes and monitoring
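
The "top errors with frequency" output can be approximated with standard Unix tools. This is a minimal sketch, not the contents of `analyze-logs.sh`; it assumes a hypothetical log format of `<date> <time> <LEVEL> <message>`, and the function name `top_errors` is made up for illustration. Digits are normalized so that messages differing only in IDs or durations are clustered together.

```bash
# Rank ERROR messages by frequency.
# Assumed (hypothetical) line format: "<date> <time> <LEVEL> <message...>"
top_errors() {
  local file="$1" limit="${2:-5}"
  # Keep ERROR lines, drop the first three fields, replace digit runs with N,
  # then count identical messages and print "count<TAB>message".
  awk '$3 == "ERROR" { $1 = ""; $2 = ""; $3 = ""; sub(/^ +/, ""); gsub(/[0-9]+/, "N"); print }' "$file" \
    | sort | uniq -c | sort -rn | head -n "$limit" \
    | awk '{ count = $1; $1 = ""; sub(/^ +/, ""); print count "\t" $0 }'
}

sample="$(mktemp)"
cat > "$sample" <<'EOF'
2025-01-01 10:00:01 ERROR timeout after 3000ms on request 17
2025-01-01 10:00:02 INFO request 18 completed
2025-01-01 10:00:03 ERROR timeout after 3000ms on request 19
2025-01-01 10:00:04 ERROR connection refused
EOF

# First line of output: 2<TAB>timeout after Nms on request N
top_errors "$sample" 5
rm -f "$sample"
```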
---
### 5. **performance** - Performance Debugging and Optimization
Debugs performance issues through profiling, bottleneck identification, and targeted optimization.
**Usage:**
```bash
/10x-fullstack-engineer:debug performance component:"api-endpoint:/orders" metric:"response-time" threshold:"200ms"
```
**Parameters:**
- `component:"name"` (required) - Component to profile
- `metric:"type"` (optional) - Metric to measure (response-time, throughput, cpu, memory)
- `threshold:"value"` (optional) - Target performance threshold
- `duration:"period"` (optional) - Profiling duration
- `load:"users"` (optional) - Concurrent users for load testing
**What it does:**
- Establishes performance baseline
- Profiles application, database, and network
- Identifies bottlenecks (CPU, I/O, memory, network)
- Implements targeted optimizations (queries, caching, algorithms, async)
- Performs load testing to verify improvements
- Sets up performance monitoring
**Profiling utility script:**
```bash
./commands/debug/.scripts/profile.sh \
--app node_app \
--duration 60 \
--endpoint http://localhost:3000/api/slow
```
**Optimization strategies:**
- Query optimization (indexes, query rewriting)
- Caching (application-level, Redis)
- Code optimization (algorithms, lazy loading, pagination)
- Async optimization (parallel execution, batching)
**Output:**
- Performance baseline and after-optimization metrics
- Bottlenecks identified with evidence
- Optimizations implemented with code changes
- Load testing results
- Performance improvement percentages
- Monitoring setup (metrics, dashboards, alerts)
- Recommendations for additional optimizations
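
A performance baseline is usually expressed as latency percentiles rather than an average. The following sketch shows one way to compute a nearest-rank percentile from a list of response times; it is an illustration, not what `profile.sh` does, and the function name `percentile` is hypothetical. Timings could be collected with, for example, `curl -o /dev/null -s -w '%{time_total}\n' http://localhost:3000/api/slow` in a loop, then piped into the function.

```bash
# Nearest-rank percentile of newline-separated numbers read from stdin.
# Usage: percentile <p>   (p is an integer from 1 to 100)
percentile() {
  local p="$1"
  sort -n | awk -v p="$p" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# Ten sample response times in milliseconds (made-up data).
samples="120 80 95 300 110 105 90 100 85 115"
printf '%s\n' $samples | percentile 50   # prints 100
printf '%s\n' $samples | percentile 95   # prints 300
```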
---
### 6. **memory** - Memory Leak Detection and Optimization
Detects memory leaks, analyzes memory usage patterns, and optimizes memory consumption.
**Usage:**
```bash
/10x-fullstack-engineer:debug memory component:"background-worker" symptom:"growing-heap" duration:"6h"
```
**Parameters:**
- `component:"name"` (required) - Component to analyze
- `symptom:"type"` (optional) - Memory symptom (growing-heap, high-usage, oom)
- `duration:"period"` (optional) - Observation period
- `threshold:"max-mb"` (optional) - Memory threshold in MB
- `profile:"type"` (optional) - Profile type (heap, allocation)
**What it does:**
- Identifies memory symptoms (leaks, high usage, OOM)
- Captures memory profiles (heap snapshots, allocation tracking)
- Analyzes common leak patterns
- Implements memory optimizations
- Performs leak verification under load
- Tunes garbage collection
**Memory check utility script:**
```bash
./commands/debug/.scripts/memory-check.sh \
--app node_app \
--duration 300 \
--interval 10 \
--threshold 1024
```
**Common leak patterns detected:**
- Event listeners not removed
- Timers not cleared
- Closures holding references
- Unbounded caches
- Global variable accumulation
- Detached DOM nodes
- Infinite promise chains
**Optimization techniques:**
- Stream large data instead of loading into memory
- Use data structures that fit the access pattern (for example, a Map instead of an Array for keyed lookups)
- Paginate database queries
- Implement LRU caches with size limits
- Use weak references where appropriate
- Object pooling for frequently created objects
**Output:**
- Memory symptoms and baseline metrics
- Heap snapshot analysis
- Memory leaks identified with evidence
- Fixes implemented with before/after code
- Memory after fixes with improvement percentages
- Memory stability test results
- Garbage collection metrics
- Monitoring setup and alerts
- Recommendations for memory limits and future monitoring
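
Growth detection can be sketched as sampling a process's resident set size and fitting a trend line. The example below is an assumption about how such a check might work, not the implementation of `memory-check.sh`; `sample_rss` and `rss_slope` are hypothetical names, and `sample_rss` relies on `ps -o rss= -p <pid>` being available. A slope that stays positive over a long window suggests a leak, whereas a single spike barely moves it. A live run would look like `sample_rss "$PID" 30 10 | rss_slope`.

```bash
# Print the resident set size (KiB) of a process, once per interval (seconds).
sample_rss() {
  local pid="$1" count="$2" interval="$3" i=0
  while [ "$i" -lt "$count" ]; do
    ps -o rss= -p "$pid" | tr -d ' '
    sleep "$interval"
    i=$((i + 1))
  done
}

# Least-squares slope of the samples on stdin, in KiB per sample.
rss_slope() {
  awk '
    { n++; sx += n; sy += $1; sxy += n * $1; sxx += n * n }
    END {
      if (n < 2) { print 0; exit }
      printf "%.1f\n", (n * sxy - sx * sy) / (n * sxx - sx * sx)
    }'
}

# Fixed, made-up samples: steady growth of 10 KiB per sample.
printf '%s\n' 100 110 120 130 | rss_slope   # prints 10.0
```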
---
## Utility Scripts
The debug skill includes three utility scripts in the `.scripts/` directory:
### analyze-logs.sh
**Purpose:** Analyze log files for patterns, errors, and anomalies
**Features:**
- Pattern matching with regex
- Log level filtering
- Time-based filtering
- Context lines around matches
- Error statistics and top errors
- Time distribution analysis
- JSON output support
### profile.sh
**Purpose:** Profile application performance (CPU, memory, I/O)
**Features:**
- CPU profiling with statistics
- Memory profiling with growth detection
- I/O profiling
- Concurrent load testing
- Automated recommendations
- Comprehensive reports
### memory-check.sh
**Purpose:** Monitor memory usage and detect leaks
**Features:**
- Real-time memory monitoring
- Memory growth detection
- Leak detection with trend analysis
- ASCII memory usage charts
- Threshold alerts
- Detailed memory reports
---
## Common Debugging Workflows
### Workflow 1: Production Error Investigation
```bash
# Step 1: Diagnose the issue
/10x-fullstack-engineer:debug diagnose issue:"500 errors on checkout" environment:"production" logs:"logs/app.log"
# Step 2: Analyze logs for patterns
/10x-fullstack-engineer:debug analyze-logs path:"logs/app.log" pattern:"checkout.*ERROR" timeframe:"last-1h"
# Step 3: Reproduce locally
/10x-fullstack-engineer:debug reproduce issue:"Checkout fails with 500" environment:"staging" data:"test-checkout.json"
# Step 4: Implement fix
/10x-fullstack-engineer:debug fix issue:"Database timeout on checkout" root_cause:"Missing connection pool configuration"
```
### Workflow 2: Performance Degradation
```bash
# Step 1: Profile performance
/10x-fullstack-engineer:debug performance component:"api-endpoint:/checkout" metric:"response-time" threshold:"500ms"
# Step 2: Analyze slow queries (logged durations of 1000 ms or more)
/10x-fullstack-engineer:debug analyze-logs path:"logs/postgresql.log" pattern:"duration: [0-9]{4,}"
# Step 3: Implement optimization
/10x-fullstack-engineer:debug fix issue:"Slow checkout API" root_cause:"N+1 query on order items"
```
### Workflow 3: Memory Leak Investigation
```bash
# Step 1: Diagnose memory symptoms
/10x-fullstack-engineer:debug diagnose issue:"Memory grows over time" environment:"production"
# Step 2: Profile memory usage
/10x-fullstack-engineer:debug memory component:"background-processor" symptom:"growing-heap" duration:"1h"
# Step 3: Implement fix
/10x-fullstack-engineer:debug fix issue:"Memory leak in event handlers" root_cause:"Event listeners not removed"
```
### Workflow 4: Intermittent Failure
```bash
# Step 1: Reproduce reliably
/10x-fullstack-engineer:debug reproduce issue:"Random payment failures" environment:"staging"
# Step 2: Diagnose with reproduction
/10x-fullstack-engineer:debug diagnose issue:"Payment webhook fails intermittently" reproduction:"steps-from-reproduce"
# Step 3: Analyze timing
/10x-fullstack-engineer:debug analyze-logs path:"logs/webhooks.log" pattern:"payment.*fail" context:"10"
# Step 4: Fix race condition
/10x-fullstack-engineer:debug fix issue:"Race condition in webhook handler" root_cause:"Concurrent webhook processing"
```
---
## Integration with 10x-fullstack-engineer Agent
All debugging operations are designed to work with the **10x-fullstack-engineer** agent, which provides:
- **Cross-stack debugging expertise** - Systematic analysis across frontend, backend, database, and infrastructure
- **Systematic root cause analysis** - Hypothesis formation, testing, and evidence-based conclusions
- **Production-grade debugging strategies** - Safe, reliable approaches suitable for production environments
- **Performance and security awareness** - Considers performance impact and security implications
- **Prevention-focused mindset** - Not just fixing issues, but preventing future occurrences
The agent brings deep expertise in:
- Full-stack architecture patterns
- Performance optimization techniques
- Memory management and leak detection
- Database query optimization
- Distributed systems debugging
- Production safety and deployment strategies
---
## Debugging Best Practices
### 1. Start with Diagnosis
Always begin with `/debug diagnose` (shorthand here for `/10x-fullstack-engineer:debug diagnose`) to understand the full scope of the issue before attempting fixes.
### 2. Reproduce Reliably
Use `/debug reproduce` to create reproducible test cases. A bug that can't be reliably reproduced is hard to fix and verify.
### 3. Analyze Logs Systematically
Use `/debug analyze-logs` to find patterns and correlations. Look for:
- Error frequency and distribution
- Timeline correlation with deployments
- Anomalies and spikes
- Request tracing across services
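
Error distribution and spikes can be checked by bucketing errors per minute, which also makes it easier to line them up against deployment times. This is a rough sketch under the same hypothetical `<date> <time> <LEVEL> <message>` log format; the function name `error_spikes` and the fixed threshold are illustrative assumptions, not behavior of the skill.

```bash
# Count ERROR lines per minute and flag buckets above a threshold.
# Usage: error_spikes <file> <max-errors-per-minute>
error_spikes() {
  local file="$1" threshold="$2"
  awk '$3 == "ERROR" { print $1 "T" substr($2, 1, 5) }' "$file" \
    | sort | uniq -c \
    | awk -v t="$threshold" '{ flag = ($1 > t) ? " SPIKE" : ""; print $2, $1 flag }'
}

sample="$(mktemp)"
cat > "$sample" <<'EOF'
2025-01-01 10:00:10 ERROR timeout
2025-01-01 10:01:01 ERROR timeout
2025-01-01 10:01:20 ERROR timeout
2025-01-01 10:01:40 ERROR connection refused
2025-01-01 10:01:50 INFO recovered
EOF

# Prints:
# 2025-01-01T10:00 1
# 2025-01-01T10:01 3 SPIKE
error_spikes "$sample" 2
rm -f "$sample"
```

A fixed threshold is the simplest choice; comparing each bucket against a rolling average would adapt better to normal traffic variation.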
### 4. Profile Before Optimizing
Use `/debug performance` and `/debug memory` to identify actual bottlenecks. Don't optimize based on assumptions.
### 5. Fix with Verification
Use `/debug fix`, which includes:
- Proper error handling
- Comprehensive testing
- Monitoring and alerts
- Documentation
### 6. Add Prevention Measures
Every fix should include:
- Regression tests
- Monitoring metrics
- Alerts on thresholds
- Documentation updates
---
## Output Documentation
Each operation generates comprehensive reports in Markdown format:
- **Executive summaries** for stakeholders
- **Detailed technical analysis** for engineers
- **Code snippets** with before/after comparisons
- **Evidence and metrics** supporting conclusions
- **Actionable recommendations** with priorities
- **Next steps** with clear instructions
Reports include:
- Issue description and symptoms
- Analysis methodology and findings
- Root cause explanation with evidence
- Fixes implemented with code
- Verification results
- Prevention measures added
- Files modified and commits
- Monitoring and alerting setup
---
## Error Handling
All operations include robust error handling:
- **Insufficient information** - Lists what's needed and how to gather it
- **Cannot reproduce** - Suggests alternative debugging approaches
- **Fix verification fails** - Provides re-diagnosis steps
- **Optimization degrades performance** - Includes rollback procedures
- **Environment differences** - Helps bridge local vs production gaps
---
## Common Debugging Scenarios
### Database Performance Issues
1. Use `/debug performance` to establish a baseline
2. Use `/debug analyze-logs` on database slow query logs
3. Identify missing indexes or inefficient queries
4. Use `/debug fix` to implement optimization
5. Verify with load testing
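
For step 2, slow statements can be pulled out of a PostgreSQL log directly. The sketch below assumes statement durations are logged in the usual `duration: <ms> ms  statement: ...` form (as produced when `log_min_duration_statement` is enabled); the function name `slow_queries` and the sample lines are hypothetical. Comparing the parsed number, rather than matching digit runs with a regex, avoids false positives from long numbers inside the query text.

```bash
# List statements at or above a duration threshold, slowest first.
# Usage: slow_queries <threshold-ms> <postgres-log-file>
slow_queries() {
  local threshold_ms="$1" file="$2"
  awk -v min="$threshold_ms" '
    match($0, /duration: [0-9.]+ ms/) {
      # "duration: " is 10 characters and " ms" is 3, so the number sits between.
      val = substr($0, RSTART + 10, RLENGTH - 13)
      rest = substr($0, RSTART + RLENGTH)
      sub(/^ +/, "", rest)
      if (val + 0 >= min) print val "\t" rest
    }' "$file" | sort -rn
}

sample="$(mktemp)"
cat > "$sample" <<'EOF'
2025-01-01 10:00:01 UTC [101] LOG:  duration: 12.345 ms  statement: SELECT * FROM order_items WHERE order_id = 12345
2025-01-01 10:00:02 UTC [101] LOG:  duration: 2150.002 ms  statement: SELECT * FROM order_items
2025-01-01 10:00:03 UTC [102] LOG:  duration: 1043.900 ms  statement: SELECT * FROM orders
EOF

# Prints the 2150.002 ms and 1043.900 ms statements; the 12.345 ms one is skipped.
slow_queries 1000 "$sample"
rm -f "$sample"
```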
### Memory Leaks
1. Use `/debug diagnose` to identify symptoms
2. Use `/debug memory` to capture heap profiles
3. Identify leak patterns (event listeners, timers, caches)
4. Use `/debug fix` to implement cleanup
5. Verify with sustained load testing
### Intermittent Errors
1. Use `/debug analyze-logs` to find error patterns
2. Use `/debug reproduce` to create a reliable reproduction
3. Use `/debug diagnose` with reproduction steps
4. Identify timing or concurrency issues
5. Use `/debug fix` to implement proper synchronization
### Production Incidents
1. Use `/debug diagnose` for rapid root cause analysis
2. Use `/debug analyze-logs` on the most recent time period
3. Implement immediate mitigation (rollback, circuit breaker)
4. Use `/debug reproduce` to capture the failure as a test case that guards against recurrence
5. Use `/debug fix` for a permanent solution
### Performance Degradation
1. Use `/debug performance` to compare against baseline
2. Identify bottlenecks (CPU, I/O, memory, network)
3. Use `/debug analyze-logs` for slow operations
4. Implement targeted optimizations
5. Verify improvements with load testing
---
## Tips and Tricks
### Effective Log Analysis
- Use pattern matching to find related errors
- Look for request IDs to trace across services
- Check timestamps for correlation with deployments
- Compare error rates before and after changes
- Use context lines to understand error conditions
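
Tracing by request ID can be as simple as searching every service's log for the ID and merging the matches into one timeline. The following is a minimal sketch; it assumes each log line starts with a sortable `<date> <time>` timestamp and that file names contain no colons or spaces, and `trace_request`, `api.log`, and `worker.log` are hypothetical names.

```bash
# Merge all lines mentioning a request ID from several logs into one timeline.
# Usage: trace_request <request-id> <log-file>...
trace_request() {
  local request_id="$1"
  shift
  # -H prefixes each match with its file name, -w avoids matching longer IDs,
  # -F treats the ID as a literal string. Sort on the date and time fields.
  grep -H -w -F -- "$request_id" "$@" | sed 's/^\([^:]*\):/\1 /' | sort -k2,3
}

dir="$(mktemp -d)"
cat > "$dir/api.log" <<'EOF'
2025-01-01 10:00:01 req-42 POST /orders received
2025-01-01 10:00:05 req-42 responded 500
EOF
cat > "$dir/worker.log" <<'EOF'
2025-01-01 10:00:03 req-42 charge failed: timeout
2025-01-01 10:00:04 req-43 charge ok
EOF

# Prints the three req-42 lines in time order, each prefixed with its source file.
(cd "$dir" && trace_request req-42 api.log worker.log)
rm -rf "$dir"
```

Adding `-C 5` to the `grep` call would include context lines around each match, at the cost of a less tidy merged timeline.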
### Performance Profiling
- Profile production-like workloads
- Use realistic data sizes
- Test under sustained load, not just peak
- Profile both CPU and memory together
- Use flame graphs for visual analysis
### Memory Debugging
- Force GC between measurements for accuracy (in Node.js, start the process with `--expose-gc` and call `global.gc()`)
- Take multiple heap snapshots over time
- Look for objects that never get collected
- Check for consistent growth, not just spikes
- Verify fixes with extended monitoring
### Reproduction Strategies
- Minimize reproduction to essential steps
- Control timing with explicit delays
- Use specific test data that triggers the issue
- Document environment differences
- Aim for >80% reproduction reliability
---
## File Locations
```
plugins/10x-fullstack-engineer/commands/debug/
├── skill.md # Router/orchestrator
├── diagnose.md # Diagnosis operation
├── reproduce.md # Reproduction operation
├── fix.md # Fix implementation operation
├── analyze-logs.md # Log analysis operation
├── performance.md # Performance debugging operation
├── memory.md # Memory debugging operation
├── .scripts/
│ ├── analyze-logs.sh # Log analysis utility
│ ├── profile.sh # Performance profiling utility
│ └── memory-check.sh # Memory monitoring utility
└── README.md # This file
```
---
## Requirements
- **Node.js operations**: Node.js runtime with `--inspect` or `--prof` flags for profiling
- **Log analysis**: Standard Unix tools (awk, grep, sed), optional jq for JSON logs
- **Performance profiling**: Apache Bench (ab), k6, or Artillery for load testing
- **Memory profiling**: Chrome DevTools, clinic.js, or memwatch for Node.js
- **Database profiling**: Access to database query logs and EXPLAIN ANALYZE capability
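
A quick way to confirm these prerequisites is to check which tools are on the `PATH`. This is a small illustrative helper, not something shipped with the skill; the function name `missing_tools` is hypothetical, and the load-testing tools are alternatives, so only one of `ab`, `k6`, or `artillery` needs to be present.

```bash
# Print the name of every listed tool that is not found on the PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Required for log analysis; no output means everything was found.
missing_tools awk grep sed

# Optional or alternative tools; any names printed here are not installed.
missing_tools jq node ab k6 artillery
```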
---
## Support and Troubleshooting
If operations fail:
1. Check that required parameters are provided
2. Verify file paths and permissions
3. Ensure utility scripts are executable (`chmod +x .scripts/*.sh`)
4. Check that prerequisite tools are installed
5. Review error messages for specific issues
For complex debugging scenarios:
- Start with `/debug diagnose` for systematic analysis
- Use multiple operations in sequence for comprehensive investigation
- Leverage the 10x-fullstack-engineer agent's expertise
- Document findings and share with team
---
## Version
Debug Skill v1.0.0

---
## License
Part of the 10x-fullstack-engineer plugin for Claude Code.