# Diagnose Operation - Comprehensive Diagnosis and Root Cause Analysis
You are executing the **diagnose** operation to perform comprehensive diagnosis and root cause analysis for complex issues spanning multiple layers of the application stack.
## Parameters
**Received**: `$ARGUMENTS` (after removing 'diagnose' operation name)
Expected format: `issue:"problem description" [environment:"prod|staging|dev"] [logs:"log-location"] [reproduction:"steps"] [impact:"severity"]`
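
The expected format above could be parsed along these lines. This is a minimal sketch, not the operation's actual parser: `parseDiagnoseArgs` and `KNOWN_KEYS` are hypothetical names, and treating `issue` as the only required key is an assumption drawn from the bracket notation.

```javascript
// Hypothetical helper: parse key:"value" pairs from the raw argument string.
const KNOWN_KEYS = ['issue', 'environment', 'logs', 'reproduction', 'impact'];

function parseDiagnoseArgs(raw) {
  const params = {};
  // Match key:"value"; the value may contain escaped quotes (\").
  const pattern = /(\w+):"((?:[^"\\]|\\.)*)"/g;
  for (const [, key, value] of raw.matchAll(pattern)) {
    if (KNOWN_KEYS.includes(key)) {
      params[key] = value.replace(/\\"/g, '"');
    }
  }
  // Assumption: only `issue` is mandatory, since the other keys are bracketed.
  if (!params.issue) {
    throw new Error('Missing required parameter: issue:"problem description"');
  }
  return params;
}

const parsed = parseDiagnoseArgs(
  'issue:"Checkout returns 500" environment:"prod" impact:"high"'
);
console.log(parsed);
```

Unknown keys are ignored rather than rejected, so a typo in an optional key does not block the diagnosis.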
## Workflow
### 1. Issue Understanding
Gather and analyze comprehensive information about the issue:
**Information to Collect**:
- **Symptom**: What is the observable problem? What exactly is failing?
- **Impact**: Who is affected? How many users? Business impact?
- **Frequency**: Consistent, intermittent, or rare? Percentage of occurrences?
- **Environment**: Production, staging, or development? Specific regions/zones?
- **Timeline**: When did it start? Any correlation with deployments?
- **Recent Changes**: Deployments, config changes, infrastructure changes?
- **Error Messages**: Complete error messages, stack traces, error codes
**Questions to Answer**:
```markdown
- What is the user experiencing?
- What should be happening instead?
- How widespread is the issue?
- Is it getting worse over time?
- Are there any patterns (time of day, user types, specific actions)?
```
### 2. Data Collection Across All Layers
Systematically collect diagnostic data from each layer of the stack:
#### Frontend Diagnostics
**Browser Console Analysis**:
```javascript
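// Check for JavaScript errors: review console.error and console.warn
// output in DevTools > Console (filter by "Errors" and "Warnings")
// Inspect unhandled promise rejections
window.addEventListener('unhandledrejection', event => {
  console.error('Unhandled promise rejection:', event.reason);
});
// Check for resource loading failures
// (transferSize is also 0 for cache hits and for cross-origin resources
// without Timing-Allow-Origin, so confirm candidates in the Network tab)
performance.getEntriesByType('resource').filter(r => r.transferSize === 0)
```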
**Network Request Analysis**:
```javascript
// Analyze failed requests
// Open DevTools > Network tab
// Filter: Status code 4xx, 5xx
// Check: Request headers, payload, response body, timing
// Performance timing
const perfEntries = performance.getEntriesByType('navigation')[0];
console.log('DNS lookup:', perfEntries.domainLookupEnd - perfEntries.domainLookupStart);
console.log('TCP connection:', perfEntries.connectEnd - perfEntries.connectStart);
console.log('Request time:', perfEntries.responseStart - perfEntries.requestStart);
console.log('Response time:', perfEntries.responseEnd - perfEntries.responseStart);
```
**State Inspection**:
```javascript
// React DevTools: Component state at error time
// Redux DevTools: Action history, state snapshots
// Vue DevTools: Vuex state, component hierarchy
// Add error boundary to capture React errors
class ErrorBoundary extends React.Component {
  componentDidCatch(error, errorInfo) {
    console.error('Component error:', {
      error: error.toString(),
      componentStack: errorInfo.componentStack,
      currentState: this.props.reduxState
    });
  }
}
```
#### Backend Diagnostics
**Application Logs**:
```bash
# Real-time application logs
tail -f logs/application.log
# Error logs with context
grep -i "error\|exception\|fatal" logs/*.log -A 10 -B 5
# Filter by request ID to trace single request
grep "request-id-12345" logs/*.log
# Find patterns in errors
awk '/ERROR/ {print $0}' logs/application.log | sort | uniq -c | sort -rn
# Time-based analysis
grep "2024-10-14 14:" logs/application.log | grep ERROR
```
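
One caveat with `sort | uniq -c`: it counts only exact duplicates, so lines that differ by timestamp or ID rarely group. A rough sketch of grouping by normalized message follows. The function names and the sample log lines are hypothetical, and the placeholder rules (timestamps, UUIDs, numbers) are assumptions to adapt to the real log format.

```javascript
// Replace variable parts of a log line so that similar errors share one key.
function normalizeLogLine(line) {
  return line
    .replace(/\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?(?:Z|[+-]\d{2}:?\d{2})?/g, '<ts>')
    .replace(/\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b/gi, '<uuid>')
    .replace(/\b\d+\b/g, '<n>');
}

// Count ERROR lines per normalized pattern, most frequent first.
function countErrorPatterns(lines) {
  const counts = new Map();
  for (const line of lines) {
    if (!/ERROR/.test(line)) continue;
    const key = normalizeLogLine(line);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Hypothetical sample lines for illustration only.
const sampleLines = [
  '2024-10-14 14:01:02 ERROR Timeout after 3000 ms for user 42',
  '2024-10-14 14:01:09 ERROR Timeout after 3000 ms for user 77',
  '2024-10-14 14:02:00 INFO Health check ok',
  '2024-10-14 14:03:30 ERROR Connection refused',
];
const errorPatterns = countErrorPatterns(sampleLines);
console.log(errorPatterns);
```

For a real file, the lines could come from `fs.readFileSync(path, 'utf8').split('\n')`.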
**System Logs**:
```bash
# Service logs (systemd)
journalctl -u application-service.service -f
journalctl -u application-service.service --since "1 hour ago"
# Syslog
tail -f /var/log/syslog | grep application
# Kernel logs (for system-level issues)
dmesg -T | tail -50
```
**Application Metrics**:
```bash
# Request rate and response times
# Check APM tools: New Relic, Datadog, Elastic APM
# HTTP response codes over time
awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c
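# Slow requests over 1 second (assumes a custom log_format that appends
# $request_time as the last field; the default combined format has no timing)
awk '$NF > 1 {print $0}' /var/log/nginx/access.log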
# Error rate calculation
errors=$(grep -c "ERROR" logs/application.log)
total=$(wc -l < logs/application.log)
echo "Error rate: $(echo "scale=4; $errors / $total * 100" | bc)%"
```
#### Database Diagnostics
**Active Queries and Locks**:
```sql
-- PostgreSQL: Active queries
SELECT
  pid,
  now() - query_start AS duration,
  state,
  query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;
-- Long-running queries
SELECT
  pid,
  now() - query_start AS duration,
  query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '1 minute';
-- Blocking queries
SELECT
  blocked_locks.pid AS blocked_pid,
  blocked_activity.usename AS blocked_user,
  blocking_locks.pid AS blocking_pid,
  blocking_activity.usename AS blocking_user,
  blocked_activity.query AS blocked_statement,
  blocking_activity.query AS blocking_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
  AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
  AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
  AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
  AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
  AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
  AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
  AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
  AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
  AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
  AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
-- Deadlock information (from logs)
-- Look for "deadlock detected" in PostgreSQL logs
```
**Database Performance**:
```sql
-- Table statistics
SELECT
  schemaname,
  relname AS tablename,
  n_live_tup AS live_rows,
  n_dead_tup AS dead_rows,
  last_vacuum,
  last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
-- Index usage
SELECT
  schemaname,
  relname AS tablename,
  indexrelname AS indexname,
  idx_scan,
  idx_tup_read,
  idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;
-- Connection count
SELECT
  count(*) AS connections,
  state,
  usename
FROM pg_stat_activity
GROUP BY state, usename;
-- Cache hit ratio (NULLIF avoids division by zero when no blocks have been read yet)
SELECT
  sum(heap_blks_read) AS heap_read,
  sum(heap_blks_hit) AS heap_hit,
  sum(heap_blks_hit) / NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS cache_hit_ratio
FROM pg_statio_user_tables;
```
**Slow Query Log Analysis**:
```bash
# PostgreSQL: Enable log_min_duration_statement
# Check postgresql.conf: log_min_duration_statement = 1000 (1 second)
# Analyze slow queries
grep "duration:" /var/log/postgresql/postgresql.log | awk '{print $3, $6}' | sort -rn | head -20
```
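
The `awk` field positions above depend on the configured `log_line_prefix`. As a prefix-independent alternative, the `duration: ... ms` entries can be matched by pattern. This is a sketch only: `parseSlowQueries` is a hypothetical helper, the sample lines are invented, and only the first line of a multi-line statement is captured.

```javascript
// Extract "duration: N ms  statement: ..." entries and rank them by duration.
function parseSlowQueries(logText, limit = 20) {
  const pattern = /duration: (\d+(?:\.\d+)?) ms\s+(?:statement|execute [^:]*): (.*)/;
  const entries = [];
  for (const line of logText.split('\n')) {
    const match = pattern.exec(line);
    if (match) {
      entries.push({ durationMs: Number(match[1]), query: match[2].trim() });
    }
  }
  // Slowest first, capped at `limit` entries.
  return entries.sort((a, b) => b.durationMs - a.durationMs).slice(0, limit);
}

// Hypothetical log excerpt; the prefix before "LOG:" varies by configuration.
const sampleLog = [
  '2024-10-14 14:00:01 UTC [123] LOG:  duration: 1520.334 ms  statement: SELECT * FROM orders WHERE user_id = 42',
  '2024-10-14 14:00:05 UTC [124] LOG:  duration: 8450.100 ms  execute <unnamed>: UPDATE inventory SET qty = qty - 1',
  '2024-10-14 14:00:09 UTC [125] LOG:  checkpoint starting: time',
].join('\n');
const slowest = parseSlowQueries(sampleLog);
console.log(slowest);
```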
#### Infrastructure Diagnostics
**Resource Usage**:
```bash
# CPU usage
top -bn1 | head -20
mpstat 1 5 # CPU stats every 1 second, 5 times
# Memory usage
free -h
vmstat 1 5
# Disk I/O
iostat -x 1 5
iotop -o # Only show processes doing I/O
# Disk space
df -h
du -sh /* | sort -rh | head -10
# Network connections
netstat -an | grep ESTABLISHED | wc -l
ss -s # Socket statistics
# Open files
lsof | wc -l
lsof -u application-user | wc -l
```
**Container Diagnostics (Docker/Kubernetes)**:
```bash
# Docker container logs
docker logs container-name --tail 100 -f
docker stats container-name
# Docker container inspection
docker inspect container-name
docker exec container-name ps aux
docker exec container-name df -h
# Kubernetes pod logs
kubectl logs pod-name -f
kubectl logs pod-name --previous # Previous container logs
# Kubernetes pod resource usage
kubectl top pods
kubectl describe pod pod-name
# Kubernetes events
kubectl get events --sort-by='.lastTimestamp'
```
**Cloud Provider Metrics**:
```bash
# AWS CloudWatch
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time 2024-10-14T00:00:00Z \
  --end-time 2024-10-14T23:59:59Z \
  --period 3600 \
  --statistics Average
# Check application logs
aws logs tail /aws/application/logs --follow
# GCP Cloud Logging (formerly Stackdriver)
gcloud logging read "resource.type=gce_instance AND severity>=ERROR" --limit 50
# Azure Monitor
az monitor metrics list --resource <resource-id> --metric "Percentage CPU"
```
### 3. Hypothesis Formation
Based on collected data, form testable hypotheses about the root cause:
**Common Issue Patterns to Consider**:
#### Race Conditions
**Symptoms**:
- Intermittent failures
- Works sometimes, fails other times
- Timing-dependent behavior
- "Cannot read property of undefined" on objects that should exist
**What to Check**:
```javascript
// Look for async operations without proper waiting
async function problematic() {
  let data;
  fetchData().then(result => data = result); // ❌ Race condition
  return processData(data); // May execute before data is set
}
// Proper async/await
async function correct() {
  const data = await fetchData(); // ✅ Wait for data
  return processData(data);
}
// Multiple parallel operations
Promise.all([op1(), op2(), op3()]) // Check for interdependencies
```
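
Interdependent parallel operations often show up as lost updates. The sketch below reproduces one deterministically and then serializes the same operation through a promise chain. All names (`unsafeIncrement`, `createSerializer`, `demo`) are hypothetical, and the 1 ms sleep stands in for real I/O.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Read-modify-write with an await in the middle: other callers run in the gap.
async function unsafeIncrement(store) {
  const current = store.value; // read
  await sleep(1);              // simulated I/O
  store.value = current + 1;   // write based on a possibly stale read
}

// Runs tasks one at a time by chaining each onto the previous one.
function createSerializer() {
  let tail = Promise.resolve();
  return (task) => {
    const run = tail.then(task);
    tail = run.catch(() => {}); // keep the chain alive if a task fails
    return run;
  };
}

async function demo() {
  const racy = { value: 0 };
  await Promise.all([unsafeIncrement(racy), unsafeIncrement(racy), unsafeIncrement(racy)]);

  const safe = { value: 0 };
  const serialize = createSerializer();
  await Promise.all([1, 2, 3].map(() => serialize(() => unsafeIncrement(safe))));

  return { racy: racy.value, safe: safe.value };
}

const demoResult = demo();
demoResult.then((result) => console.log(result)); // → { racy: 1, safe: 3 }
```

All three racy calls read `0` before any of them writes, so two increments are lost. Serializing trades throughput for correctness, which is acceptable while testing a hypothesis.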
#### Memory Leaks
**Symptoms**:
- Degrading performance over time
- Increasing memory usage
- Eventually crashes with OOM errors
- Slow garbage collection
**What to Check**:
```javascript
// Event listeners not removed
componentDidMount() {
  window.addEventListener('resize', this.handleResize);
  // ❌ Missing removeEventListener in componentWillUnmount
}
// Closures holding references
function createLeak() {
  const largeData = new Array(1000000);
  return () => console.log(largeData[0]); // Holds entire array
}
// Timers not cleared
setInterval(() => fetchData(), 1000); // ❌ Never cleared
// Cache without eviction
const cache = {};
cache[key] = value; // ❌ Grows indefinitely
```
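
For the unbounded-cache case, one possible remedy is a size-capped cache with least-recently-used eviction. The following is a minimal sketch: `BoundedCache` is a hypothetical class, and a production system might prefer an established library or a TTL-based policy instead.

```javascript
// Size-capped cache; relies on Map preserving insertion order.
class BoundedCache {
  constructor(maxEntries) {
    this.maxEntries = maxEntries;
    this.entries = new Map();
  }

  get(key) {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key);
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key, value) {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey);
    }
  }
}

const boundedCache = new BoundedCache(2);
boundedCache.set('a', 1);
boundedCache.set('b', 2);
boundedCache.get('a');    // 'a' becomes most recently used
boundedCache.set('c', 3); // evicts 'b'
console.log([...boundedCache.entries.keys()]); // → [ 'a', 'c' ]
```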
#### Database Issues
**Symptoms**:
- Slow queries
- Timeouts
- Deadlocks
- Connection pool exhausted
**What to Check**:
```sql
-- Missing indexes
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user@example.com';
-- Look for "Seq Scan" on large tables
-- N+1 queries
-- Check if ORM is making one query per item in a loop
-- Long transactions
-- Find transactions open for extended periods
-- Lock contention
-- Check for blocking queries and deadlocks
```
#### Network Issues
**Symptoms**:
- Timeouts
- Intermittent connectivity
- DNS resolution failures
- SSL/TLS handshake errors
**What to Check**:
```bash
# DNS resolution
dig api.example.com
nslookup api.example.com
# Network latency
ping api.example.com
traceroute api.example.com
# TCP connection
telnet api.example.com 443
nc -zv api.example.com 443
# SSL/TLS verification
openssl s_client -connect api.example.com:443 -servername api.example.com
```
#### Authentication/Authorization
**Symptoms**:
- 401 Unauthorized errors
- 403 Forbidden errors
- Intermittent authentication failures
- Session expired errors
**What to Check**:
```javascript
// Token expiration
const token = jwt.decode(authToken);
console.log('Token expires:', new Date(token.exp * 1000));
// Session state
console.log('Session:', sessionStorage, localStorage);
// Cookie issues
console.log('Cookies:', document.cookie);
// CORS issues (browser console)
// Look for: "CORS policy: No 'Access-Control-Allow-Origin' header"
```
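
When no JWT library is available in the debugging context, the expiry claim can be read with Node's built-in `Buffer`. This sketch decodes the payload without verifying the signature, so it is suitable for diagnosis only. `decodeJwtPayload` and `describeTokenExpiry` are hypothetical names, and the sample token is unsigned and invented.

```javascript
// Decode a JWT payload WITHOUT verifying the signature (diagnostics only).
function decodeJwtPayload(token) {
  const parts = token.split('.');
  if (parts.length !== 3) {
    throw new Error('Expected a JWT with three dot-separated parts');
  }
  return JSON.parse(Buffer.from(parts[1], 'base64url').toString('utf8'));
}

// Report expiry relative to `nowMs`; `exp` is in seconds since the epoch.
function describeTokenExpiry(token, nowMs = Date.now()) {
  const { exp } = decodeJwtPayload(token);
  if (typeof exp !== 'number') {
    return { expiresAt: null, expired: false, secondsRemaining: null };
  }
  const secondsRemaining = exp - Math.floor(nowMs / 1000);
  return {
    expiresAt: new Date(exp * 1000).toISOString(),
    expired: secondsRemaining <= 0,
    secondsRemaining,
  };
}

// Build an unsigned sample token purely for demonstration.
const encodePart = (obj) => Buffer.from(JSON.stringify(obj)).toString('base64url');
const sampleToken = [
  encodePart({ alg: 'none' }),
  encodePart({ sub: 'user-1', exp: 1700000000 }),
  'sig',
].join('.');
const expiry = describeTokenExpiry(sampleToken, 1700000060 * 1000);
console.log(expiry);
```

Passing `nowMs` explicitly makes it easy to check clock-skew hypotheses, for example by evaluating the same token against the server's reported time.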
#### Configuration Issues
**Symptoms**:
- Works locally, fails in a deployed environment
- "Environment variable not set" errors
- Connection refused errors
- Permission denied errors
**What to Check**:
```bash
# Environment variables
printenv | grep APPLICATION
env | sort
# Configuration files
cat config/production.json
diff config/development.json config/production.json
# File permissions
ls -la config/
ls -la /var/application/
# Network configuration
cat /etc/hosts
cat /etc/resolv.conf
```
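
A plain `diff` of JSON files is sensitive to key order and formatting. A structural comparison that reports only the differing key paths can be sketched as follows. `flattenConfig` and `diffConfig` are hypothetical helpers, and the two config objects are invented examples.

```javascript
// Flatten nested objects into dotted key paths, e.g. { db: { host } } -> "db.host".
function flattenConfig(obj, prefix = '', out = {}) {
  for (const [key, value] of Object.entries(obj)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      flattenConfig(value, path, out);
    } else {
      out[path] = value;
    }
  }
  return out;
}

// Report key paths that are missing on one side or whose values differ.
function diffConfig(left, right) {
  const a = flattenConfig(left);
  const b = flattenConfig(right);
  const diff = { onlyInLeft: [], onlyInRight: [], changed: [] };
  for (const key of Object.keys(a)) {
    if (!(key in b)) diff.onlyInLeft.push(key);
    else if (JSON.stringify(a[key]) !== JSON.stringify(b[key])) diff.changed.push(key);
  }
  for (const key of Object.keys(b)) {
    if (!(key in a)) diff.onlyInRight.push(key);
  }
  return diff;
}

// Hypothetical configs for illustration.
const development = { db: { host: 'localhost', port: 5432 }, debug: true };
const production = { db: { host: 'db.internal', port: 5432, ssl: true } };
const configDiff = diffConfig(development, production);
console.log(configDiff);
```

Only key paths are reported, not values, so secrets in either file are not printed into the diagnosis output.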
### 4. Hypothesis Testing
Systematically test each hypothesis:
**Testing Strategy**:
1. **Isolation**: Test each component in isolation
2. **Instrumentation**: Add detailed logging around suspected areas
3. **Reproduction**: Create minimal reproduction case
4. **Elimination**: Rule out hypotheses systematically
**Add Diagnostic Instrumentation**:
```javascript
// Detailed logging with context
const startTime = Date.now();
console.log('[DIAG] Before operation:', {
  timestamp: new Date().toISOString(),
  user: currentUser,
  state: JSON.stringify(currentState),
  params: params
});
try {
  const result = await operation(params);
  console.log('[DIAG] Operation success:', {
    timestamp: new Date().toISOString(),
    result: result,
    duration: Date.now() - startTime
  });
} catch (error) {
  console.error('[DIAG] Operation failed:', {
    timestamp: new Date().toISOString(),
    error: error.message,
    stack: error.stack,
    context: { user: currentUser, state: currentState, params }
  });
  throw error;
}
// Performance timing
console.time('operation');
await operation();
console.timeEnd('operation');
// Memory usage tracking (Node.js; global.gc exists only when started with --expose-gc)
if (global.gc) {
  global.gc();
  const usage = process.memoryUsage();
  console.log('[MEMORY]', {
    heapUsed: Math.round(usage.heapUsed / 1024 / 1024) + 'MB',
    heapTotal: Math.round(usage.heapTotal / 1024 / 1024) + 'MB',
    external: Math.round(usage.external / 1024 / 1024) + 'MB'
  });
}
```
**Binary Search Debugging**:
```javascript
// Comment out half the code
// Determine which half has the bug
// Repeat until isolated
// Example: Large function with error
function complexOperation() {
  // Part 1: Data fetching
  const data = fetchData();
  // Part 2: Data processing
  const processed = processData(data);
  // Part 3: Data validation
  const validated = validateData(processed);
  // Part 4: Data saving
  return saveData(validated);
}
// Test each part independently
const data = fetchData();
console.log('[TEST] Data fetched:', data); // ✅ Works
const processed = processData(testData);
console.log('[TEST] Data processed:', processed); // ❌ Fails here
// Now investigate processData() specifically
```
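
The same halving idea applies to an ordered history of changes (commits, deployments, config revisions), in the spirit of `git bisect`. The sketch below assumes the list runs from known-good to known-bad, that the last item is bad, and that once a change breaks things every later one is also bad. `bisectFirstBad` and the commit IDs are hypothetical.

```javascript
// Find the first "bad" item in a list ordered from known-good to known-bad.
async function bisectFirstBad(items, isBad) {
  let low = 0;                 // everything before `low` is known good
  let high = items.length - 1; // items[high] is assumed bad
  let steps = 0;
  while (low < high) {
    const mid = Math.floor((low + high) / 2);
    steps += 1;
    if (await isBad(items[mid])) {
      high = mid;    // first bad item is at mid or earlier
    } else {
      low = mid + 1; // first bad item is after mid
    }
  }
  return { firstBad: items[low], steps };
}

// Hypothetical history where everything from 'f6' onward is broken.
const commits = ['a1', 'b2', 'c3', 'd4', 'e5', 'f6', 'g7', 'h8'];
const brokenSince = commits.indexOf('f6');
const bisectResult = bisectFirstBad(
  commits,
  (commit) => commits.indexOf(commit) >= brokenSince
);
bisectResult.then((result) => console.log(result)); // → { firstBad: 'f6', steps: 3 }
```

In practice `isBad` would deploy or check out the given revision and run the reproduction case, which is why it is allowed to be asynchronous.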
### 5. Root Cause Identification
Once hypotheses are tested and narrowed down:
**Confirm Root Cause**:
1. Can you consistently reproduce the issue?
2. Does fixing this cause resolve the symptom?
3. Are there other instances of the same issue?
4. Does the fix have any side effects?
**Document Evidence**:
- Specific code/config that causes the issue
- Exact conditions required for issue to manifest
- Why this causes the observed symptom
- Related code that might have same issue
### 6. Impact Assessment
Evaluate the full impact:
**User Impact**:
- Number of users affected
- Severity of impact (blocking, degraded, minor)
- User actions affected
- Business metrics impacted
**System Impact**:
- Performance degradation
- Resource consumption
- Downstream service effects
- Data integrity concerns
**Risk Assessment**:
- Can it cause data loss?
- Can it cause security issues?
- Can it cause cascading failures?
- Is it getting worse over time?
## Output Format
```markdown
# Diagnosis Report: [Issue Summary]
## Executive Summary
[One-paragraph summary of issue, root cause, and recommended action]
## Issue Description
### Symptoms
- [Observable symptom 1]
- [Observable symptom 2]
- [Observable symptom 3]
### Impact
- **Affected Users**: [number/percentage of users]
- **Severity**: [critical|high|medium|low]
- **Frequency**: [always|often|sometimes|rarely - with percentage]
- **Business Impact**: [revenue loss, user experience, etc.]
### Environment
- **Environment**: [production|staging|development]
- **Version**: [application version]
- **Infrastructure**: [relevant infrastructure details]
- **Region**: [if applicable]
### Timeline
- **First Observed**: [date/time]
- **Recent Changes**: [deployments, config changes]
- **Pattern**: [time-based, load-based, user-based]
## Diagnostic Data Collected
### Frontend Analysis
[Console errors, network requests, performance data, state inspection results]
### Backend Analysis
[Application logs, error traces, system metrics, request patterns]
### Database Analysis
[Query logs, lock information, performance metrics, connection pool status]
### Infrastructure Analysis
[Resource usage, container logs, cloud metrics, network diagnostics]
## Hypothesis Analysis
### Hypotheses Considered
1. **[Hypothesis 1]**: [Description]
   - **Evidence For**: [supporting evidence]
   - **Evidence Against**: [contradicting evidence]
   - **Conclusion**: [Ruled out|Confirmed|Needs more investigation]
2. **[Hypothesis 2]**: [Description]
   - **Evidence For**: [supporting evidence]
   - **Evidence Against**: [contradicting evidence]
   - **Conclusion**: [Ruled out|Confirmed|Needs more investigation]
3. **[Hypothesis 3]**: [Description]
   - **Evidence For**: [supporting evidence]
   - **Evidence Against**: [contradicting evidence]
   - **Conclusion**: [Ruled out|Confirmed|Needs more investigation]
## Root Cause
### Root Cause Identified
[Detailed explanation of the root cause with specific code/config references]
### Why It Causes the Symptom
[Technical explanation of how the root cause leads to the observed behavior]
### Why It Wasn't Caught Earlier
[Explanation of why tests/monitoring didn't catch this]
### Related Issues
[Any similar issues that might exist or could be fixed with similar approach]
## Evidence
### Code/Configuration
~~~[language]
[Specific code or configuration causing the issue]
~~~
### Reproduction
[Exact steps to reproduce the issue consistently]
### Verification
[Steps taken to confirm this is the root cause]
## Recommended Actions
### Immediate Actions
1. [Immediate action 1 - e.g., rollback, circuit breaker]
2. [Immediate action 2]
### Permanent Fix
[Description of the permanent fix needed]
### Prevention
- **Monitoring**: [What monitoring to add]
- **Testing**: [What tests to add]
- **Code Review**: [What to look for in code reviews]
- **Documentation**: [What to document]
## Next Steps
1. **Fix Implementation**: [Use /debug fix operation]
2. **Verification**: [Testing strategy]
3. **Deployment**: [Rollout plan]
4. **Monitoring**: [What to watch]
## Appendices
### A. Detailed Logs
[Relevant log excerpts with context]
### B. Metrics and Graphs
[Performance metrics, error rates, resource usage]
### C. Related Tickets
[Links to related issues or tickets]
```
## Error Handling
**Insufficient Information**:
If diagnosis cannot be completed due to missing information:
1. List specific information needed
2. Explain why each piece is important
3. Provide instructions for gathering data
4. Suggest interim monitoring
**Cannot Reproduce**:
If issue cannot be reproduced:
1. Document reproduction attempts
2. Request more detailed reproduction steps
3. Suggest environment comparison
4. Propose production debugging approach
**Multiple Root Causes**:
If multiple root causes are identified:
1. Prioritize by impact
2. Explain interdependencies
3. Provide fix sequence
4. Suggest monitoring between fixes
## Integration with Other Operations
After diagnosis is complete:
- **For fixes**: Use `/debug fix` with identified root cause
- **For reproduction**: Use `/debug reproduce` to create reliable test case
- **For log analysis**: Use `/debug analyze-logs` for deeper log investigation
- **For performance**: Use `/debug performance` if performance-related
- **For memory**: Use `/debug memory` if memory-related
## Agent Utilization
This operation leverages the **10x-fullstack-engineer** agent for:
- Systematic cross-layer analysis
- Pattern recognition across stack
- Hypothesis formation and testing
- Production debugging expertise
- Prevention-focused thinking