760 lines
20 KiB
Markdown
760 lines
20 KiB
Markdown
# Diagnose Operation - Comprehensive Diagnosis and Root Cause Analysis
|
|
|
|
You are executing the **diagnose** operation to perform comprehensive diagnosis and root cause analysis for complex issues spanning multiple layers of the application stack.
|
|
|
|
## Parameters
|
|
|
|
**Received**: `$ARGUMENTS` (after removing 'diagnose' operation name)
|
|
|
|
Expected format: `issue:"problem description" [environment:"prod|staging|dev"] [logs:"log-location"] [reproduction:"steps"] [impact:"severity"]`
|
|
|
|
## Workflow
|
|
|
|
### 1. Issue Understanding
|
|
|
|
Gather and analyze comprehensive information about the issue:
|
|
|
|
**Information to Collect**:
|
|
- **Symptom**: What is the observable problem? What exactly is failing?
|
|
- **Impact**: Who is affected? How many users? Business impact?
|
|
- **Frequency**: Consistent, intermittent, or rare? Percentage of occurrences?
|
|
- **Environment**: Production, staging, or development? Specific regions/zones?
|
|
- **Timeline**: When did it start? Any correlation with deployments?
|
|
- **Recent Changes**: Deployments, config changes, infrastructure changes?
|
|
- **Error Messages**: Complete error messages, stack traces, error codes
|
|
|
|
**Questions to Answer**:
|
|
```markdown
|
|
- What is the user experiencing?
|
|
- What should be happening instead?
|
|
- How widespread is the issue?
|
|
- Is it getting worse over time?
|
|
- Are there any patterns (time of day, user types, specific actions)?
|
|
```
|
|
|
|
### 2. Data Collection Across All Layers
|
|
|
|
Systematically collect diagnostic data from each layer of the stack:
|
|
|
|
#### Frontend Diagnostics
|
|
|
|
**Browser Console Analysis**:
|
|
```javascript
|
|
// Check for JavaScript errors
|
|
console.error logs
|
|
console.warn logs
|
|
|
|
// Inspect unhandled promise rejections
|
|
window.addEventListener('unhandledrejection', event => {
|
|
console.error('Unhandled promise rejection:', event.reason);
|
|
});
|
|
|
|
// Check for resource loading failures
|
|
performance.getEntriesByType('resource').filter(r => r.transferSize === 0)
|
|
```
|
|
|
|
**Network Request Analysis**:
|
|
```javascript
|
|
// Analyze failed requests
|
|
// Open DevTools > Network tab
|
|
// Filter: Status code 4xx, 5xx
|
|
// Check: Request headers, payload, response body, timing
|
|
|
|
// Performance timing
|
|
const perfEntries = performance.getEntriesByType('navigation')[0];
|
|
console.log('DNS lookup:', perfEntries.domainLookupEnd - perfEntries.domainLookupStart);
|
|
console.log('TCP connection:', perfEntries.connectEnd - perfEntries.connectStart);
|
|
console.log('Request time:', perfEntries.responseStart - perfEntries.requestStart);
|
|
console.log('Response time:', perfEntries.responseEnd - perfEntries.responseStart);
|
|
```
|
|
|
|
**State Inspection**:
|
|
```javascript
|
|
// React DevTools: Component state at error time
|
|
// Redux DevTools: Action history, state snapshots
|
|
// Vue DevTools: Vuex state, component hierarchy
|
|
|
|
// Add error boundary to capture React errors
|
|
class ErrorBoundary extends React.Component {
|
|
componentDidCatch(error, errorInfo) {
|
|
console.error('Component error:', {
|
|
error: error.toString(),
|
|
componentStack: errorInfo.componentStack,
|
|
currentState: this.props.reduxState
|
|
});
|
|
}
|
|
}
|
|
```
|
|
|
|
#### Backend Diagnostics
|
|
|
|
**Application Logs**:
|
|
```bash
|
|
# Real-time application logs
|
|
tail -f logs/application.log
|
|
|
|
# Error logs with context
|
|
grep -i "error\|exception\|fatal" logs/*.log -A 10 -B 5
|
|
|
|
# Filter by request ID to trace single request
|
|
grep "request-id-12345" logs/*.log
|
|
|
|
# Find patterns in errors
|
|
awk '/ERROR/ {print $0}' logs/application.log | sort | uniq -c | sort -rn
|
|
|
|
# Time-based analysis
|
|
grep "2024-10-14 14:" logs/application.log | grep ERROR
|
|
```
|
|
|
|
**System Logs**:
|
|
```bash
|
|
# Service logs (systemd)
|
|
journalctl -u application-service.service -f
|
|
journalctl -u application-service.service --since "1 hour ago"
|
|
|
|
# Syslog
|
|
tail -f /var/log/syslog | grep application
|
|
|
|
# Kernel logs (for system-level issues)
|
|
dmesg -T | tail -50
|
|
```
|
|
|
|
**Application Metrics**:
|
|
```bash
|
|
# Request rate and response times
|
|
# Check APM tools: New Relic, Datadog, Elastic APM
|
|
|
|
# HTTP response codes over time
|
|
awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c
|
|
|
|
# Slow requests
|
|
awk '$10 > 1000 {print $0}' /var/log/nginx/access.log
|
|
|
|
# Error rate calculation
|
|
errors=$(grep -c "ERROR" logs/application.log)
|
|
total=$(wc -l < logs/application.log)
|
|
echo "Error rate: $(echo "scale=4; $errors / $total * 100" | bc)%"
|
|
```
|
|
|
|
#### Database Diagnostics
|
|
|
|
**Active Queries and Locks**:
|
|
```sql
|
|
-- PostgreSQL: Active queries
|
|
SELECT
|
|
pid,
|
|
now() - query_start AS duration,
|
|
state,
|
|
query
|
|
FROM pg_stat_activity
|
|
WHERE state != 'idle'
|
|
ORDER BY duration DESC;
|
|
|
|
-- Long-running queries
|
|
SELECT
|
|
pid,
|
|
now() - query_start AS duration,
|
|
query
|
|
FROM pg_stat_activity
|
|
WHERE state = 'active'
|
|
AND now() - query_start > interval '1 minute';
|
|
|
|
-- Blocking queries
|
|
SELECT
|
|
blocked_locks.pid AS blocked_pid,
|
|
blocked_activity.usename AS blocked_user,
|
|
blocking_locks.pid AS blocking_pid,
|
|
blocking_activity.usename AS blocking_user,
|
|
blocked_activity.query AS blocked_statement,
|
|
blocking_activity.query AS blocking_statement
|
|
FROM pg_catalog.pg_locks blocked_locks
|
|
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
|
|
JOIN pg_catalog.pg_locks blocking_locks
|
|
ON blocking_locks.locktype = blocked_locks.locktype
|
|
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
|
|
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
|
|
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
|
|
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
|
|
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
|
|
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
|
|
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
|
|
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
|
|
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
|
|
AND blocking_locks.pid != blocked_locks.pid
|
|
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
|
|
WHERE NOT blocked_locks.granted;
|
|
|
|
-- Deadlock information (from logs)
|
|
-- Look for "deadlock detected" in PostgreSQL logs
|
|
```
|
|
|
|
**Database Performance**:
|
|
```sql
|
|
-- Table statistics
|
|
SELECT
|
|
schemaname,
|
|
tablename,
|
|
n_live_tup AS live_rows,
|
|
n_dead_tup AS dead_rows,
|
|
last_vacuum,
|
|
last_autovacuum
|
|
FROM pg_stat_user_tables
|
|
ORDER BY n_dead_tup DESC;
|
|
|
|
-- Index usage
|
|
SELECT
|
|
schemaname,
|
|
tablename,
|
|
indexname,
|
|
idx_scan,
|
|
idx_tup_read,
|
|
idx_tup_fetch
|
|
FROM pg_stat_user_indexes
|
|
ORDER BY idx_scan ASC;
|
|
|
|
-- Connection count
|
|
SELECT
|
|
count(*) AS connections,
|
|
state,
|
|
usename
|
|
FROM pg_stat_activity
|
|
GROUP BY state, usename;
|
|
|
|
-- Cache hit ratio
|
|
SELECT
|
|
sum(heap_blks_read) AS heap_read,
|
|
sum(heap_blks_hit) AS heap_hit,
|
|
sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) AS cache_hit_ratio
|
|
FROM pg_statio_user_tables;
|
|
```
|
|
|
|
**Slow Query Log Analysis**:
|
|
```bash
|
|
# PostgreSQL: Enable log_min_duration_statement
|
|
# Check postgresql.conf: log_min_duration_statement = 1000 (1 second)
|
|
|
|
# Analyze slow queries
|
|
grep "duration:" /var/log/postgresql/postgresql.log | awk '{print $3, $6}' | sort -rn | head -20
|
|
```
|
|
|
|
#### Infrastructure Diagnostics
|
|
|
|
**Resource Usage**:
|
|
```bash
|
|
# CPU usage
|
|
top -bn1 | head -20
|
|
mpstat 1 5 # CPU stats every 1 second, 5 times
|
|
|
|
# Memory usage
|
|
free -h
|
|
vmstat 1 5
|
|
|
|
# Disk I/O
|
|
iostat -x 1 5
|
|
iotop -o # Only show processes doing I/O
|
|
|
|
# Disk space
|
|
df -h
|
|
du -sh /* | sort -rh | head -10
|
|
|
|
# Network connections
|
|
netstat -an | grep ESTABLISHED | wc -l
|
|
ss -s # Socket statistics
|
|
|
|
# Open files
|
|
lsof | wc -l
|
|
lsof -u application-user | wc -l
|
|
```
|
|
|
|
**Container Diagnostics (Docker/Kubernetes)**:
|
|
```bash
|
|
# Docker container logs
|
|
docker logs container-name --tail 100 -f
|
|
docker stats container-name
|
|
|
|
# Docker container inspection
|
|
docker inspect container-name
|
|
docker exec container-name ps aux
|
|
docker exec container-name df -h
|
|
|
|
# Kubernetes pod logs
|
|
kubectl logs pod-name -f
|
|
kubectl logs pod-name --previous # Previous container logs
|
|
|
|
# Kubernetes pod resource usage
|
|
kubectl top pods
|
|
kubectl describe pod pod-name
|
|
|
|
# Kubernetes events
|
|
kubectl get events --sort-by='.lastTimestamp'
|
|
```
|
|
|
|
**Cloud Provider Metrics**:
|
|
```bash
|
|
# AWS CloudWatch
|
|
aws cloudwatch get-metric-statistics \
|
|
--namespace AWS/EC2 \
|
|
--metric-name CPUUtilization \
|
|
--dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
|
|
--start-time 2024-10-14T00:00:00Z \
|
|
--end-time 2024-10-14T23:59:59Z \
|
|
--period 3600 \
|
|
--statistics Average
|
|
|
|
# Check application logs
|
|
aws logs tail /aws/application/logs --follow
|
|
|
|
# GCP Stackdriver
|
|
gcloud logging read "resource.type=gce_instance AND severity>=ERROR" --limit 50
|
|
|
|
# Azure Monitor
|
|
az monitor metrics list --resource <resource-id> --metric "Percentage CPU"
|
|
```
|
|
|
|
### 3. Hypothesis Formation
|
|
|
|
Based on collected data, form testable hypotheses about the root cause:
|
|
|
|
**Common Issue Patterns to Consider**:
|
|
|
|
#### Race Conditions
|
|
**Symptoms**:
|
|
- Intermittent failures
|
|
- Works sometimes, fails other times
|
|
- Timing-dependent behavior
|
|
- "Cannot read property of undefined" on objects that should exist
|
|
|
|
**What to Check**:
|
|
```javascript
|
|
// Look for async operations without proper waiting
|
|
async function problematic() {
|
|
let data;
|
|
fetchData().then(result => data = result); // ❌ Race condition
|
|
return processData(data); // May execute before data is set
|
|
}
|
|
|
|
// Proper async/await
|
|
async function correct() {
|
|
const data = await fetchData(); // ✅ Wait for data
|
|
return processData(data);
|
|
}
|
|
|
|
// Multiple parallel operations
|
|
Promise.all([op1(), op2(), op3()]) // Check for interdependencies
|
|
```
|
|
|
|
#### Memory Leaks
|
|
**Symptoms**:
|
|
- Degrading performance over time
|
|
- Increasing memory usage
|
|
- Eventually crashes with OOM errors
|
|
- Slow garbage collection
|
|
|
|
**What to Check**:
|
|
```javascript
|
|
// Event listeners not removed
|
|
componentDidMount() {
|
|
window.addEventListener('resize', this.handleResize);
|
|
// ❌ Missing removeEventListener in componentWillUnmount
|
|
}
|
|
|
|
// Closures holding references
|
|
function createLeak() {
|
|
const largeData = new Array(1000000);
|
|
return () => console.log(largeData[0]); // Holds entire array
|
|
}
|
|
|
|
// Timers not cleared
|
|
setInterval(() => fetchData(), 1000); // ❌ Never cleared
|
|
|
|
// Cache without eviction
|
|
const cache = {};
|
|
cache[key] = value; // ❌ Grows indefinitely
|
|
```
|
|
|
|
#### Database Issues
|
|
**Symptoms**:
|
|
- Slow queries
|
|
- Timeouts
|
|
- Deadlocks
|
|
- Connection pool exhausted
|
|
|
|
**What to Check**:
|
|
```sql
|
|
-- Missing indexes
|
|
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user@example.com';
|
|
-- Look for "Seq Scan" on large tables
|
|
|
|
-- N+1 queries
|
|
-- Check if ORM is making one query per item in a loop
|
|
|
|
-- Long transactions
|
|
-- Find transactions open for extended periods
|
|
|
|
-- Lock contention
|
|
-- Check for blocking queries and deadlocks
|
|
```
|
|
|
|
#### Network Issues
|
|
**Symptoms**:
|
|
- Timeouts
|
|
- Intermittent connectivity
|
|
- DNS resolution failures
|
|
- SSL/TLS handshake errors
|
|
|
|
**What to Check**:
|
|
```bash
|
|
# DNS resolution
|
|
dig api.example.com
|
|
nslookup api.example.com
|
|
|
|
# Network latency
|
|
ping api.example.com
|
|
traceroute api.example.com
|
|
|
|
# TCP connection
|
|
telnet api.example.com 443
|
|
nc -zv api.example.com 443
|
|
|
|
# SSL/TLS verification
|
|
openssl s_client -connect api.example.com:443 -servername api.example.com
|
|
```
|
|
|
|
#### Authentication/Authorization
|
|
**Symptoms**:
|
|
- 401 Unauthorized errors
|
|
- 403 Forbidden errors
|
|
- Intermittent authentication failures
|
|
- Session expired errors
|
|
|
|
**What to Check**:
|
|
```javascript
|
|
// Token expiration
|
|
const token = jwt.decode(authToken);
|
|
console.log('Token expires:', new Date(token.exp * 1000));
|
|
|
|
// Session state
|
|
console.log('Session:', sessionStorage, localStorage);
|
|
|
|
// Cookie issues
|
|
console.log('Cookies:', document.cookie);
|
|
|
|
// CORS issues (browser console)
|
|
// Look for: "CORS policy: No 'Access-Control-Allow-Origin' header"
|
|
```
|
|
|
|
#### Configuration Issues
|
|
**Symptoms**:
|
|
- Works locally, fails in environment
|
|
- "Environment variable not set" errors
|
|
- Connection refused errors
|
|
- Permission denied errors
|
|
|
|
**What to Check**:
|
|
```bash
|
|
# Environment variables
|
|
printenv | grep APPLICATION
|
|
env | sort
|
|
|
|
# Configuration files
|
|
cat config/production.json
|
|
diff config/development.json config/production.json
|
|
|
|
# File permissions
|
|
ls -la config/
|
|
ls -la /var/application/
|
|
|
|
# Network configuration
|
|
cat /etc/hosts
|
|
cat /etc/resolv.conf
|
|
```
|
|
|
|
### 4. Hypothesis Testing
|
|
|
|
Systematically test each hypothesis:
|
|
|
|
**Testing Strategy**:
|
|
|
|
1. **Isolation**: Test each component in isolation
|
|
2. **Instrumentation**: Add detailed logging around suspected areas
|
|
3. **Reproduction**: Create minimal reproduction case
|
|
4. **Elimination**: Rule out hypotheses systematically
|
|
|
|
**Add Diagnostic Instrumentation**:
|
|
```javascript
|
|
// Detailed logging with context
|
|
console.log('[DIAG] Before operation:', {
|
|
timestamp: new Date().toISOString(),
|
|
user: currentUser,
|
|
state: JSON.stringify(currentState),
|
|
params: params
|
|
});
|
|
|
|
try {
|
|
const result = await operation(params);
|
|
console.log('[DIAG] Operation success:', {
|
|
timestamp: new Date().toISOString(),
|
|
result: result,
|
|
duration: Date.now() - startTime
|
|
});
|
|
} catch (error) {
|
|
console.error('[DIAG] Operation failed:', {
|
|
timestamp: new Date().toISOString(),
|
|
error: error.message,
|
|
stack: error.stack,
|
|
context: { user, state, params }
|
|
});
|
|
throw error;
|
|
}
|
|
|
|
// Performance timing
|
|
console.time('operation');
|
|
await operation();
|
|
console.timeEnd('operation');
|
|
|
|
// Memory usage tracking
|
|
if (global.gc) {
|
|
global.gc();
|
|
const usage = process.memoryUsage();
|
|
console.log('[MEMORY]', {
|
|
heapUsed: Math.round(usage.heapUsed / 1024 / 1024) + 'MB',
|
|
heapTotal: Math.round(usage.heapTotal / 1024 / 1024) + 'MB',
|
|
external: Math.round(usage.external / 1024 / 1024) + 'MB'
|
|
});
|
|
}
|
|
```
|
|
|
|
**Binary Search Debugging**:
|
|
```javascript
|
|
// Comment out half the code
|
|
// Determine which half has the bug
|
|
// Repeat until isolated
|
|
|
|
// Example: Large function with error
|
|
function complexOperation() {
|
|
// Part 1: Data fetching
|
|
const data = fetchData();
|
|
|
|
// Part 2: Data processing
|
|
const processed = processData(data);
|
|
|
|
// Part 3: Data validation
|
|
const validated = validateData(processed);
|
|
|
|
// Part 4: Data saving
|
|
return saveData(validated);
|
|
}
|
|
|
|
// Test each part independently
|
|
const data = fetchData();
|
|
console.log('[TEST] Data fetched:', data); // ✅ Works
|
|
|
|
const processed = processData(testData);
|
|
console.log('[TEST] Data processed:', processed); // ❌ Fails here
|
|
// Now investigate processData() specifically
|
|
```
|
|
|
|
### 5. Root Cause Identification
|
|
|
|
Once hypotheses are tested and narrowed down:
|
|
|
|
**Confirm Root Cause**:
|
|
1. Can you consistently reproduce the issue?
|
|
2. Does fixing this cause resolve the symptom?
|
|
3. Are there other instances of the same issue?
|
|
4. Does the fix have any side effects?
|
|
|
|
**Document Evidence**:
|
|
- Specific code/config that causes the issue
|
|
- Exact conditions required for issue to manifest
|
|
- Why this causes the observed symptom
|
|
- Related code that might have same issue
|
|
|
|
### 6. Impact Assessment
|
|
|
|
Evaluate the full impact:
|
|
|
|
**User Impact**:
|
|
- Number of users affected
|
|
- Severity of impact (blocking, degraded, minor)
|
|
- User actions affected
|
|
- Business metrics impacted
|
|
|
|
**System Impact**:
|
|
- Performance degradation
|
|
- Resource consumption
|
|
- Downstream service effects
|
|
- Data integrity concerns
|
|
|
|
**Risk Assessment**:
|
|
- Can it cause data loss?
|
|
- Can it cause security issues?
|
|
- Can it cause cascading failures?
|
|
- Is it getting worse over time?
|
|
|
|
## Output Format
|
|
|
|
```markdown
|
|
# Diagnosis Report: [Issue Summary]
|
|
|
|
## Executive Summary
|
|
[One-paragraph summary of issue, root cause, and recommended action]
|
|
|
|
## Issue Description
|
|
|
|
### Symptoms
|
|
- [Observable symptom 1]
|
|
- [Observable symptom 2]
|
|
- [Observable symptom 3]
|
|
|
|
### Impact
|
|
- **Affected Users**: [number/percentage of users]
|
|
- **Severity**: [critical|high|medium|low]
|
|
- **Frequency**: [always|often|sometimes|rarely - with percentage]
|
|
- **Business Impact**: [revenue loss, user experience, etc.]
|
|
|
|
### Environment
|
|
- **Environment**: [production|staging|development]
|
|
- **Version**: [application version]
|
|
- **Infrastructure**: [relevant infrastructure details]
|
|
- **Region**: [if applicable]
|
|
|
|
### Timeline
|
|
- **First Observed**: [date/time]
|
|
- **Recent Changes**: [deployments, config changes]
|
|
- **Pattern**: [time-based, load-based, user-based]
|
|
|
|
## Diagnostic Data Collected
|
|
|
|
### Frontend Analysis
|
|
[Console errors, network requests, performance data, state inspection results]
|
|
|
|
### Backend Analysis
|
|
[Application logs, error traces, system metrics, request patterns]
|
|
|
|
### Database Analysis
|
|
[Query logs, lock information, performance metrics, connection pool status]
|
|
|
|
### Infrastructure Analysis
|
|
[Resource usage, container logs, cloud metrics, network diagnostics]
|
|
|
|
## Hypothesis Analysis
|
|
|
|
### Hypotheses Considered
|
|
1. **[Hypothesis 1]**: [Description]
|
|
- **Evidence For**: [supporting evidence]
|
|
- **Evidence Against**: [contradicting evidence]
|
|
- **Conclusion**: [Ruled out|Confirmed|Needs more investigation]
|
|
|
|
2. **[Hypothesis 2]**: [Description]
|
|
- **Evidence For**: [supporting evidence]
|
|
- **Evidence Against**: [contradicting evidence]
|
|
- **Conclusion**: [Ruled out|Confirmed|Needs more investigation]
|
|
|
|
3. **[Hypothesis 3]**: [Description]
|
|
- **Evidence For**: [supporting evidence]
|
|
- **Evidence Against**: [contradicting evidence]
|
|
- **Conclusion**: [Ruled out|Confirmed|Needs more investigation]
|
|
|
|
## Root Cause
|
|
|
|
### Root Cause Identified
|
|
[Detailed explanation of the root cause with specific code/config references]
|
|
|
|
### Why It Causes the Symptom
|
|
[Technical explanation of how the root cause leads to the observed behavior]
|
|
|
|
### Why It Wasn't Caught Earlier
|
|
[Explanation of why tests/monitoring didn't catch this]
|
|
|
|
### Related Issues
|
|
[Any similar issues that might exist or could be fixed with similar approach]
|
|
|
|
## Evidence
|
|
|
|
### Code/Configuration
|
|
```[language]
|
|
[Specific code or configuration causing the issue]
|
|
```
|
|
|
|
### Reproduction
|
|
[Exact steps to reproduce the issue consistently]
|
|
|
|
### Verification
|
|
[Steps taken to confirm this is the root cause]
|
|
|
|
## Recommended Actions
|
|
|
|
### Immediate Actions
|
|
1. [Immediate action 1 - e.g., rollback, circuit breaker]
|
|
2. [Immediate action 2]
|
|
|
|
### Permanent Fix
|
|
[Description of the permanent fix needed]
|
|
|
|
### Prevention
|
|
- **Monitoring**: [What monitoring to add]
|
|
- **Testing**: [What tests to add]
|
|
- **Code Review**: [What to look for in code reviews]
|
|
- **Documentation**: [What to document]
|
|
|
|
## Next Steps
|
|
|
|
1. **Fix Implementation**: [Use /debug fix operation]
|
|
2. **Verification**: [Testing strategy]
|
|
3. **Deployment**: [Rollout plan]
|
|
4. **Monitoring**: [What to watch]
|
|
|
|
## Appendices
|
|
|
|
### A. Detailed Logs
|
|
[Relevant log excerpts with context]
|
|
|
|
### B. Metrics and Graphs
|
|
[Performance metrics, error rates, resource usage]
|
|
|
|
### C. Related Tickets
|
|
[Links to related issues or tickets]
|
|
```
|
|
|
|
## Error Handling
|
|
|
|
**Insufficient Information**:
|
|
If diagnosis cannot be completed due to missing information:
|
|
1. List specific information needed
|
|
2. Explain why each piece is important
|
|
3. Provide instructions for gathering data
|
|
4. Suggest interim monitoring
|
|
|
|
**Cannot Reproduce**:
|
|
If issue cannot be reproduced:
|
|
1. Document reproduction attempts
|
|
2. Request more detailed reproduction steps
|
|
3. Suggest environment comparison
|
|
4. Propose production debugging approach
|
|
|
|
**Multiple Root Causes**:
|
|
If multiple root causes are identified:
|
|
1. Prioritize by impact
|
|
2. Explain interdependencies
|
|
3. Provide fix sequence
|
|
4. Suggest monitoring between fixes
|
|
|
|
## Integration with Other Operations
|
|
|
|
After diagnosis is complete:
|
|
- **For fixes**: Use `/debug fix` with identified root cause
|
|
- **For reproduction**: Use `/debug reproduce` to create reliable test case
|
|
- **For log analysis**: Use `/debug analyze-logs` for deeper log investigation
|
|
- **For performance**: Use `/debug performance` if performance-related
|
|
- **For memory**: Use `/debug memory` if memory-related
|
|
|
|
## Agent Utilization
|
|
|
|
This operation leverages the **10x-fullstack-engineer** agent for:
|
|
- Systematic cross-layer analysis
|
|
- Pattern recognition across stack
|
|
- Hypothesis formation and testing
|
|
- Production debugging expertise
|
|
- Prevention-focused thinking
|