Initial commit
This commit is contained in:
759
commands/debug/diagnose.md
Normal file
759
commands/debug/diagnose.md
Normal file
@@ -0,0 +1,759 @@
|
||||
# Diagnose Operation - Comprehensive Diagnosis and Root Cause Analysis
|
||||
|
||||
You are executing the **diagnose** operation to perform comprehensive diagnosis and root cause analysis for complex issues spanning multiple layers of the application stack.
|
||||
|
||||
## Parameters
|
||||
|
||||
**Received**: `$ARGUMENTS` (after removing 'diagnose' operation name)
|
||||
|
||||
Expected format: `issue:"problem description" [environment:"prod|staging|dev"] [logs:"log-location"] [reproduction:"steps"] [impact:"severity"]`
|
||||
|
||||
## Workflow
|
||||
|
||||
### 1. Issue Understanding
|
||||
|
||||
Gather and analyze comprehensive information about the issue:
|
||||
|
||||
**Information to Collect**:
|
||||
- **Symptom**: What is the observable problem? What exactly is failing?
|
||||
- **Impact**: Who is affected? How many users? Business impact?
|
||||
- **Frequency**: Consistent, intermittent, or rare? Percentage of occurrences?
|
||||
- **Environment**: Production, staging, or development? Specific regions/zones?
|
||||
- **Timeline**: When did it start? Any correlation with deployments?
|
||||
- **Recent Changes**: Deployments, config changes, infrastructure changes?
|
||||
- **Error Messages**: Complete error messages, stack traces, error codes
|
||||
|
||||
**Questions to Answer**:
|
||||
```markdown
|
||||
- What is the user experiencing?
|
||||
- What should be happening instead?
|
||||
- How widespread is the issue?
|
||||
- Is it getting worse over time?
|
||||
- Are there any patterns (time of day, user types, specific actions)?
|
||||
```
|
||||
|
||||
### 2. Data Collection Across All Layers
|
||||
|
||||
Systematically collect diagnostic data from each layer of the stack:
|
||||
|
||||
#### Frontend Diagnostics
|
||||
|
||||
**Browser Console Analysis**:
|
||||
```javascript
|
||||
// Check for JavaScript errors
|
||||
console.error logs
|
||||
console.warn logs
|
||||
|
||||
// Inspect unhandled promise rejections
|
||||
window.addEventListener('unhandledrejection', event => {
|
||||
console.error('Unhandled promise rejection:', event.reason);
|
||||
});
|
||||
|
||||
// Check for resource loading failures
|
||||
performance.getEntriesByType('resource').filter(r => r.transferSize === 0)
|
||||
```
|
||||
|
||||
**Network Request Analysis**:
|
||||
```javascript
|
||||
// Analyze failed requests
|
||||
// Open DevTools > Network tab
|
||||
// Filter: Status code 4xx, 5xx
|
||||
// Check: Request headers, payload, response body, timing
|
||||
|
||||
// Performance timing
|
||||
const perfEntries = performance.getEntriesByType('navigation')[0];
|
||||
console.log('DNS lookup:', perfEntries.domainLookupEnd - perfEntries.domainLookupStart);
|
||||
console.log('TCP connection:', perfEntries.connectEnd - perfEntries.connectStart);
|
||||
console.log('Request time:', perfEntries.responseStart - perfEntries.requestStart);
|
||||
console.log('Response time:', perfEntries.responseEnd - perfEntries.responseStart);
|
||||
```
|
||||
|
||||
**State Inspection**:
|
||||
```javascript
|
||||
// React DevTools: Component state at error time
|
||||
// Redux DevTools: Action history, state snapshots
|
||||
// Vue DevTools: Vuex state, component hierarchy
|
||||
|
||||
// Add error boundary to capture React errors
|
||||
class ErrorBoundary extends React.Component {
|
||||
componentDidCatch(error, errorInfo) {
|
||||
console.error('Component error:', {
|
||||
error: error.toString(),
|
||||
componentStack: errorInfo.componentStack,
|
||||
currentState: this.props.reduxState
|
||||
});
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
#### Backend Diagnostics
|
||||
|
||||
**Application Logs**:
|
||||
```bash
|
||||
# Real-time application logs
|
||||
tail -f logs/application.log
|
||||
|
||||
# Error logs with context
|
||||
grep -i "error\|exception\|fatal" logs/*.log -A 10 -B 5
|
||||
|
||||
# Filter by request ID to trace single request
|
||||
grep "request-id-12345" logs/*.log
|
||||
|
||||
# Find patterns in errors
|
||||
awk '/ERROR/ {print $0}' logs/application.log | sort | uniq -c | sort -rn
|
||||
|
||||
# Time-based analysis
|
||||
grep "2024-10-14 14:" logs/application.log | grep ERROR
|
||||
```
|
||||
|
||||
**System Logs**:
|
||||
```bash
|
||||
# Service logs (systemd)
|
||||
journalctl -u application-service.service -f
|
||||
journalctl -u application-service.service --since "1 hour ago"
|
||||
|
||||
# Syslog
|
||||
tail -f /var/log/syslog | grep application
|
||||
|
||||
# Kernel logs (for system-level issues)
|
||||
dmesg -T | tail -50
|
||||
```
|
||||
|
||||
**Application Metrics**:
|
||||
```bash
|
||||
# Request rate and response times
|
||||
# Check APM tools: New Relic, Datadog, Elastic APM
|
||||
|
||||
# HTTP response codes over time
|
||||
awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c
|
||||
|
||||
# Slow requests
|
||||
awk '$10 > 1000 {print $0}' /var/log/nginx/access.log
|
||||
|
||||
# Error rate calculation
|
||||
errors=$(grep -c "ERROR" logs/application.log)
|
||||
total=$(wc -l < logs/application.log)
|
||||
echo "Error rate: $(echo "scale=4; $errors / $total * 100" | bc)%"
|
||||
```
|
||||
|
||||
#### Database Diagnostics
|
||||
|
||||
**Active Queries and Locks**:
|
||||
```sql
|
||||
-- PostgreSQL: Active queries
|
||||
SELECT
|
||||
pid,
|
||||
now() - query_start AS duration,
|
||||
state,
|
||||
query
|
||||
FROM pg_stat_activity
|
||||
WHERE state != 'idle'
|
||||
ORDER BY duration DESC;
|
||||
|
||||
-- Long-running queries
|
||||
SELECT
|
||||
pid,
|
||||
now() - query_start AS duration,
|
||||
query
|
||||
FROM pg_stat_activity
|
||||
WHERE state = 'active'
|
||||
AND now() - query_start > interval '1 minute';
|
||||
|
||||
-- Blocking queries
|
||||
SELECT
|
||||
blocked_locks.pid AS blocked_pid,
|
||||
blocked_activity.usename AS blocked_user,
|
||||
blocking_locks.pid AS blocking_pid,
|
||||
blocking_activity.usename AS blocking_user,
|
||||
blocked_activity.query AS blocked_statement,
|
||||
blocking_activity.query AS blocking_statement
|
||||
FROM pg_catalog.pg_locks blocked_locks
|
||||
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
|
||||
JOIN pg_catalog.pg_locks blocking_locks
|
||||
ON blocking_locks.locktype = blocked_locks.locktype
|
||||
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
|
||||
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
|
||||
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
|
||||
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
|
||||
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
|
||||
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
|
||||
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
|
||||
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
|
||||
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
|
||||
AND blocking_locks.pid != blocked_locks.pid
|
||||
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
|
||||
WHERE NOT blocked_locks.granted;
|
||||
|
||||
-- Deadlock information (from logs)
|
||||
-- Look for "deadlock detected" in PostgreSQL logs
|
||||
```
|
||||
|
||||
**Database Performance**:
|
||||
```sql
|
||||
-- Table statistics
|
||||
SELECT
|
||||
schemaname,
|
||||
tablename,
|
||||
n_live_tup AS live_rows,
|
||||
n_dead_tup AS dead_rows,
|
||||
last_vacuum,
|
||||
last_autovacuum
|
||||
FROM pg_stat_user_tables
|
||||
ORDER BY n_dead_tup DESC;
|
||||
|
||||
-- Index usage
|
||||
SELECT
|
||||
schemaname,
|
||||
tablename,
|
||||
indexname,
|
||||
idx_scan,
|
||||
idx_tup_read,
|
||||
idx_tup_fetch
|
||||
FROM pg_stat_user_indexes
|
||||
ORDER BY idx_scan ASC;
|
||||
|
||||
-- Connection count
|
||||
SELECT
|
||||
count(*) AS connections,
|
||||
state,
|
||||
usename
|
||||
FROM pg_stat_activity
|
||||
GROUP BY state, usename;
|
||||
|
||||
-- Cache hit ratio
|
||||
SELECT
|
||||
sum(heap_blks_read) AS heap_read,
|
||||
sum(heap_blks_hit) AS heap_hit,
|
||||
sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) AS cache_hit_ratio
|
||||
FROM pg_statio_user_tables;
|
||||
```
|
||||
|
||||
**Slow Query Log Analysis**:
|
||||
```bash
|
||||
# PostgreSQL: Enable log_min_duration_statement
|
||||
# Check postgresql.conf: log_min_duration_statement = 1000 (1 second)
|
||||
|
||||
# Analyze slow queries
|
||||
grep "duration:" /var/log/postgresql/postgresql.log | awk '{print $3, $6}' | sort -rn | head -20
|
||||
```
|
||||
|
||||
#### Infrastructure Diagnostics
|
||||
|
||||
**Resource Usage**:
|
||||
```bash
|
||||
# CPU usage
|
||||
top -bn1 | head -20
|
||||
mpstat 1 5 # CPU stats every 1 second, 5 times
|
||||
|
||||
# Memory usage
|
||||
free -h
|
||||
vmstat 1 5
|
||||
|
||||
# Disk I/O
|
||||
iostat -x 1 5
|
||||
iotop -o # Only show processes doing I/O
|
||||
|
||||
# Disk space
|
||||
df -h
|
||||
du -sh /* | sort -rh | head -10
|
||||
|
||||
# Network connections
|
||||
netstat -an | grep ESTABLISHED | wc -l
|
||||
ss -s # Socket statistics
|
||||
|
||||
# Open files
|
||||
lsof | wc -l
|
||||
lsof -u application-user | wc -l
|
||||
```
|
||||
|
||||
**Container Diagnostics (Docker/Kubernetes)**:
|
||||
```bash
|
||||
# Docker container logs
|
||||
docker logs container-name --tail 100 -f
|
||||
docker stats container-name
|
||||
|
||||
# Docker container inspection
|
||||
docker inspect container-name
|
||||
docker exec container-name ps aux
|
||||
docker exec container-name df -h
|
||||
|
||||
# Kubernetes pod logs
|
||||
kubectl logs pod-name -f
|
||||
kubectl logs pod-name --previous # Previous container logs
|
||||
|
||||
# Kubernetes pod resource usage
|
||||
kubectl top pods
|
||||
kubectl describe pod pod-name
|
||||
|
||||
# Kubernetes events
|
||||
kubectl get events --sort-by='.lastTimestamp'
|
||||
```
|
||||
|
||||
**Cloud Provider Metrics**:
|
||||
```bash
|
||||
# AWS CloudWatch
|
||||
aws cloudwatch get-metric-statistics \
|
||||
--namespace AWS/EC2 \
|
||||
--metric-name CPUUtilization \
|
||||
--dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
|
||||
--start-time 2024-10-14T00:00:00Z \
|
||||
--end-time 2024-10-14T23:59:59Z \
|
||||
--period 3600 \
|
||||
--statistics Average
|
||||
|
||||
# Check application logs
|
||||
aws logs tail /aws/application/logs --follow
|
||||
|
||||
# GCP Stackdriver
|
||||
gcloud logging read "resource.type=gce_instance AND severity>=ERROR" --limit 50
|
||||
|
||||
# Azure Monitor
|
||||
az monitor metrics list --resource <resource-id> --metric "Percentage CPU"
|
||||
```
|
||||
|
||||
### 3. Hypothesis Formation
|
||||
|
||||
Based on collected data, form testable hypotheses about the root cause:
|
||||
|
||||
**Common Issue Patterns to Consider**:
|
||||
|
||||
#### Race Conditions
|
||||
**Symptoms**:
|
||||
- Intermittent failures
|
||||
- Works sometimes, fails other times
|
||||
- Timing-dependent behavior
|
||||
- "Cannot read property of undefined" on objects that should exist
|
||||
|
||||
**What to Check**:
|
||||
```javascript
|
||||
// Look for async operations without proper waiting
|
||||
async function problematic() {
|
||||
let data;
|
||||
fetchData().then(result => data = result); // ❌ Race condition
|
||||
return processData(data); // May execute before data is set
|
||||
}
|
||||
|
||||
// Proper async/await
|
||||
async function correct() {
|
||||
const data = await fetchData(); // ✅ Wait for data
|
||||
return processData(data);
|
||||
}
|
||||
|
||||
// Multiple parallel operations
|
||||
Promise.all([op1(), op2(), op3()]) // Check for interdependencies
|
||||
```
|
||||
|
||||
#### Memory Leaks
|
||||
**Symptoms**:
|
||||
- Degrading performance over time
|
||||
- Increasing memory usage
|
||||
- Eventually crashes with OOM errors
|
||||
- Slow garbage collection
|
||||
|
||||
**What to Check**:
|
||||
```javascript
|
||||
// Event listeners not removed
|
||||
componentDidMount() {
|
||||
window.addEventListener('resize', this.handleResize);
|
||||
// ❌ Missing removeEventListener in componentWillUnmount
|
||||
}
|
||||
|
||||
// Closures holding references
|
||||
function createLeak() {
|
||||
const largeData = new Array(1000000);
|
||||
return () => console.log(largeData[0]); // Holds entire array
|
||||
}
|
||||
|
||||
// Timers not cleared
|
||||
setInterval(() => fetchData(), 1000); // ❌ Never cleared
|
||||
|
||||
// Cache without eviction
|
||||
const cache = {};
|
||||
cache[key] = value; // ❌ Grows indefinitely
|
||||
```
|
||||
|
||||
#### Database Issues
|
||||
**Symptoms**:
|
||||
- Slow queries
|
||||
- Timeouts
|
||||
- Deadlocks
|
||||
- Connection pool exhausted
|
||||
|
||||
**What to Check**:
|
||||
```sql
|
||||
-- Missing indexes
|
||||
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user@example.com';
|
||||
-- Look for "Seq Scan" on large tables
|
||||
|
||||
-- N+1 queries
|
||||
-- Check if ORM is making one query per item in a loop
|
||||
|
||||
-- Long transactions
|
||||
-- Find transactions open for extended periods
|
||||
|
||||
-- Lock contention
|
||||
-- Check for blocking queries and deadlocks
|
||||
```
|
||||
|
||||
#### Network Issues
|
||||
**Symptoms**:
|
||||
- Timeouts
|
||||
- Intermittent connectivity
|
||||
- DNS resolution failures
|
||||
- SSL/TLS handshake errors
|
||||
|
||||
**What to Check**:
|
||||
```bash
|
||||
# DNS resolution
|
||||
dig api.example.com
|
||||
nslookup api.example.com
|
||||
|
||||
# Network latency
|
||||
ping api.example.com
|
||||
traceroute api.example.com
|
||||
|
||||
# TCP connection
|
||||
telnet api.example.com 443
|
||||
nc -zv api.example.com 443
|
||||
|
||||
# SSL/TLS verification
|
||||
openssl s_client -connect api.example.com:443 -servername api.example.com
|
||||
```
|
||||
|
||||
#### Authentication/Authorization
|
||||
**Symptoms**:
|
||||
- 401 Unauthorized errors
|
||||
- 403 Forbidden errors
|
||||
- Intermittent authentication failures
|
||||
- Session expired errors
|
||||
|
||||
**What to Check**:
|
||||
```javascript
|
||||
// Token expiration
|
||||
const token = jwt.decode(authToken);
|
||||
console.log('Token expires:', new Date(token.exp * 1000));
|
||||
|
||||
// Session state
|
||||
console.log('Session:', sessionStorage, localStorage);
|
||||
|
||||
// Cookie issues
|
||||
console.log('Cookies:', document.cookie);
|
||||
|
||||
// CORS issues (browser console)
|
||||
// Look for: "CORS policy: No 'Access-Control-Allow-Origin' header"
|
||||
```
|
||||
|
||||
#### Configuration Issues
|
||||
**Symptoms**:
|
||||
- Works locally, fails in environment
|
||||
- "Environment variable not set" errors
|
||||
- Connection refused errors
|
||||
- Permission denied errors
|
||||
|
||||
**What to Check**:
|
||||
```bash
|
||||
# Environment variables
|
||||
printenv | grep APPLICATION
|
||||
env | sort
|
||||
|
||||
# Configuration files
|
||||
cat config/production.json
|
||||
diff config/development.json config/production.json
|
||||
|
||||
# File permissions
|
||||
ls -la config/
|
||||
ls -la /var/application/
|
||||
|
||||
# Network configuration
|
||||
cat /etc/hosts
|
||||
cat /etc/resolv.conf
|
||||
```
|
||||
|
||||
### 4. Hypothesis Testing
|
||||
|
||||
Systematically test each hypothesis:
|
||||
|
||||
**Testing Strategy**:
|
||||
|
||||
1. **Isolation**: Test each component in isolation
|
||||
2. **Instrumentation**: Add detailed logging around suspected areas
|
||||
3. **Reproduction**: Create minimal reproduction case
|
||||
4. **Elimination**: Rule out hypotheses systematically
|
||||
|
||||
**Add Diagnostic Instrumentation**:
|
||||
```javascript
|
||||
// Detailed logging with context
|
||||
console.log('[DIAG] Before operation:', {
|
||||
timestamp: new Date().toISOString(),
|
||||
user: currentUser,
|
||||
state: JSON.stringify(currentState),
|
||||
params: params
|
||||
});
|
||||
|
||||
try {
|
||||
const result = await operation(params);
|
||||
console.log('[DIAG] Operation success:', {
|
||||
timestamp: new Date().toISOString(),
|
||||
result: result,
|
||||
duration: Date.now() - startTime
|
||||
});
|
||||
} catch (error) {
|
||||
console.error('[DIAG] Operation failed:', {
|
||||
timestamp: new Date().toISOString(),
|
||||
error: error.message,
|
||||
stack: error.stack,
|
||||
context: { user, state, params }
|
||||
});
|
||||
throw error;
|
||||
}
|
||||
|
||||
// Performance timing
|
||||
console.time('operation');
|
||||
await operation();
|
||||
console.timeEnd('operation');
|
||||
|
||||
// Memory usage tracking
|
||||
if (global.gc) {
|
||||
global.gc();
|
||||
const usage = process.memoryUsage();
|
||||
console.log('[MEMORY]', {
|
||||
heapUsed: Math.round(usage.heapUsed / 1024 / 1024) + 'MB',
|
||||
heapTotal: Math.round(usage.heapTotal / 1024 / 1024) + 'MB',
|
||||
external: Math.round(usage.external / 1024 / 1024) + 'MB'
|
||||
});
|
||||
}
|
||||
```
|
||||
|
||||
**Binary Search Debugging**:
|
||||
```javascript
|
||||
// Comment out half the code
|
||||
// Determine which half has the bug
|
||||
// Repeat until isolated
|
||||
|
||||
// Example: Large function with error
|
||||
function complexOperation() {
|
||||
// Part 1: Data fetching
|
||||
const data = fetchData();
|
||||
|
||||
// Part 2: Data processing
|
||||
const processed = processData(data);
|
||||
|
||||
// Part 3: Data validation
|
||||
const validated = validateData(processed);
|
||||
|
||||
// Part 4: Data saving
|
||||
return saveData(validated);
|
||||
}
|
||||
|
||||
// Test each part independently
|
||||
const data = fetchData();
|
||||
console.log('[TEST] Data fetched:', data); // ✅ Works
|
||||
|
||||
const processed = processData(testData);
|
||||
console.log('[TEST] Data processed:', processed); // ❌ Fails here
|
||||
// Now investigate processData() specifically
|
||||
```
|
||||
|
||||
### 5. Root Cause Identification
|
||||
|
||||
Once hypotheses are tested and narrowed down:
|
||||
|
||||
**Confirm Root Cause**:
|
||||
1. Can you consistently reproduce the issue?
|
||||
2. Does fixing this cause resolve the symptom?
|
||||
3. Are there other instances of the same issue?
|
||||
4. Does the fix have any side effects?
|
||||
|
||||
**Document Evidence**:
|
||||
- Specific code/config that causes the issue
|
||||
- Exact conditions required for issue to manifest
|
||||
- Why this causes the observed symptom
|
||||
- Related code that might have same issue
|
||||
|
||||
### 6. Impact Assessment
|
||||
|
||||
Evaluate the full impact:
|
||||
|
||||
**User Impact**:
|
||||
- Number of users affected
|
||||
- Severity of impact (blocking, degraded, minor)
|
||||
- User actions affected
|
||||
- Business metrics impacted
|
||||
|
||||
**System Impact**:
|
||||
- Performance degradation
|
||||
- Resource consumption
|
||||
- Downstream service effects
|
||||
- Data integrity concerns
|
||||
|
||||
**Risk Assessment**:
|
||||
- Can it cause data loss?
|
||||
- Can it cause security issues?
|
||||
- Can it cause cascading failures?
|
||||
- Is it getting worse over time?
|
||||
|
||||
## Output Format
|
||||
|
||||
```markdown
|
||||
# Diagnosis Report: [Issue Summary]
|
||||
|
||||
## Executive Summary
|
||||
[One-paragraph summary of issue, root cause, and recommended action]
|
||||
|
||||
## Issue Description
|
||||
|
||||
### Symptoms
|
||||
- [Observable symptom 1]
|
||||
- [Observable symptom 2]
|
||||
- [Observable symptom 3]
|
||||
|
||||
### Impact
|
||||
- **Affected Users**: [number/percentage of users]
|
||||
- **Severity**: [critical|high|medium|low]
|
||||
- **Frequency**: [always|often|sometimes|rarely - with percentage]
|
||||
- **Business Impact**: [revenue loss, user experience, etc.]
|
||||
|
||||
### Environment
|
||||
- **Environment**: [production|staging|development]
|
||||
- **Version**: [application version]
|
||||
- **Infrastructure**: [relevant infrastructure details]
|
||||
- **Region**: [if applicable]
|
||||
|
||||
### Timeline
|
||||
- **First Observed**: [date/time]
|
||||
- **Recent Changes**: [deployments, config changes]
|
||||
- **Pattern**: [time-based, load-based, user-based]
|
||||
|
||||
## Diagnostic Data Collected
|
||||
|
||||
### Frontend Analysis
|
||||
[Console errors, network requests, performance data, state inspection results]
|
||||
|
||||
### Backend Analysis
|
||||
[Application logs, error traces, system metrics, request patterns]
|
||||
|
||||
### Database Analysis
|
||||
[Query logs, lock information, performance metrics, connection pool status]
|
||||
|
||||
### Infrastructure Analysis
|
||||
[Resource usage, container logs, cloud metrics, network diagnostics]
|
||||
|
||||
## Hypothesis Analysis
|
||||
|
||||
### Hypotheses Considered
|
||||
1. **[Hypothesis 1]**: [Description]
|
||||
- **Evidence For**: [supporting evidence]
|
||||
- **Evidence Against**: [contradicting evidence]
|
||||
- **Conclusion**: [Ruled out|Confirmed|Needs more investigation]
|
||||
|
||||
2. **[Hypothesis 2]**: [Description]
|
||||
- **Evidence For**: [supporting evidence]
|
||||
- **Evidence Against**: [contradicting evidence]
|
||||
- **Conclusion**: [Ruled out|Confirmed|Needs more investigation]
|
||||
|
||||
3. **[Hypothesis 3]**: [Description]
|
||||
- **Evidence For**: [supporting evidence]
|
||||
- **Evidence Against**: [contradicting evidence]
|
||||
- **Conclusion**: [Ruled out|Confirmed|Needs more investigation]
|
||||
|
||||
## Root Cause
|
||||
|
||||
### Root Cause Identified
|
||||
[Detailed explanation of the root cause with specific code/config references]
|
||||
|
||||
### Why It Causes the Symptom
|
||||
[Technical explanation of how the root cause leads to the observed behavior]
|
||||
|
||||
### Why It Wasn't Caught Earlier
|
||||
[Explanation of why tests/monitoring didn't catch this]
|
||||
|
||||
### Related Issues
|
||||
[Any similar issues that might exist or could be fixed with similar approach]
|
||||
|
||||
## Evidence
|
||||
|
||||
### Code/Configuration
|
||||
```[language]
|
||||
[Specific code or configuration causing the issue]
|
||||
```
|
||||
|
||||
### Reproduction
|
||||
[Exact steps to reproduce the issue consistently]
|
||||
|
||||
### Verification
|
||||
[Steps taken to confirm this is the root cause]
|
||||
|
||||
## Recommended Actions
|
||||
|
||||
### Immediate Actions
|
||||
1. [Immediate action 1 - e.g., rollback, circuit breaker]
|
||||
2. [Immediate action 2]
|
||||
|
||||
### Permanent Fix
|
||||
[Description of the permanent fix needed]
|
||||
|
||||
### Prevention
|
||||
- **Monitoring**: [What monitoring to add]
|
||||
- **Testing**: [What tests to add]
|
||||
- **Code Review**: [What to look for in code reviews]
|
||||
- **Documentation**: [What to document]
|
||||
|
||||
## Next Steps
|
||||
|
||||
1. **Fix Implementation**: [Use /debug fix operation]
|
||||
2. **Verification**: [Testing strategy]
|
||||
3. **Deployment**: [Rollout plan]
|
||||
4. **Monitoring**: [What to watch]
|
||||
|
||||
## Appendices
|
||||
|
||||
### A. Detailed Logs
|
||||
[Relevant log excerpts with context]
|
||||
|
||||
### B. Metrics and Graphs
|
||||
[Performance metrics, error rates, resource usage]
|
||||
|
||||
### C. Related Tickets
|
||||
[Links to related issues or tickets]
|
||||
```
|
||||
|
||||
## Error Handling
|
||||
|
||||
**Insufficient Information**:
|
||||
If diagnosis cannot be completed due to missing information:
|
||||
1. List specific information needed
|
||||
2. Explain why each piece is important
|
||||
3. Provide instructions for gathering data
|
||||
4. Suggest interim monitoring
|
||||
|
||||
**Cannot Reproduce**:
|
||||
If issue cannot be reproduced:
|
||||
1. Document reproduction attempts
|
||||
2. Request more detailed reproduction steps
|
||||
3. Suggest environment comparison
|
||||
4. Propose production debugging approach
|
||||
|
||||
**Multiple Root Causes**:
|
||||
If multiple root causes are identified:
|
||||
1. Prioritize by impact
|
||||
2. Explain interdependencies
|
||||
3. Provide fix sequence
|
||||
4. Suggest monitoring between fixes
|
||||
|
||||
## Integration with Other Operations
|
||||
|
||||
After diagnosis is complete:
|
||||
- **For fixes**: Use `/debug fix` with identified root cause
|
||||
- **For reproduction**: Use `/debug reproduce` to create reliable test case
|
||||
- **For log analysis**: Use `/debug analyze-logs` for deeper log investigation
|
||||
- **For performance**: Use `/debug performance` if performance-related
|
||||
- **For memory**: Use `/debug memory` if memory-related
|
||||
|
||||
## Agent Utilization
|
||||
|
||||
This operation leverages the **10x-fullstack-engineer** agent for:
|
||||
- Systematic cross-layer analysis
|
||||
- Pattern recognition across stack
|
||||
- Hypothesis formation and testing
|
||||
- Production debugging expertise
|
||||
- Prevention-focused thinking
|
||||
Reference in New Issue
Block a user