# Analyze Logs Operation - Deep Log Analysis

You are executing the **analyze-logs** operation to perform deep log analysis with pattern detection, timeline correlation, and anomaly identification.

## Parameters

**Received**: `$ARGUMENTS` (with the 'analyze-logs' operation name removed)

Expected format: `path:"log-file-path" [pattern:"regex-pattern"] [timeframe:"time-range"] [level:"error|warn|info"] [context:"lines-before-after"]`
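
For example, the arguments might arrive as follows (hypothetical values, shown only to illustrate the format):

```
path:"logs/application.log" level:"error" timeframe:"1 hour ago" context:"5"
```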
## Workflow

### 1. Discover and Locate Logs

Identify all relevant log sources:

**Application Logs**:
```bash
# Common log locations
ls -lh /var/log/application/
ls -lh logs/
ls -lh ~/.pm2/logs/

# Find log files
find /var/log -name "*.log" -type f
find . -name "*.log" -mtime -1  # Modified in last 24 hours

# Check log rotation
ls -lh /var/log/application/*.log*
zcat /var/log/application/app.log.*.gz  # Read rotated logs
```

**System Logs**:
```bash
# Systemd service logs
journalctl -u application.service --since "1 hour ago"
journalctl -u application.service --since "2024-10-14 14:00:00"

# Syslog
tail -f /var/log/syslog
tail -f /var/log/messages

# Kernel logs
dmesg -T
```

**Container Logs**:
```bash
# Docker
docker logs container-name --since 1h
docker logs container-name --timestamps
docker logs --tail 1000 container-name > container-logs.txt

# Kubernetes
kubectl logs pod-name -c container-name
kubectl logs pod-name --previous  # Previous container
kubectl logs -l app=myapp --all-containers=true
```

**Web Server Logs**:
```bash
# Nginx
tail -f /var/log/nginx/access.log
tail -f /var/log/nginx/error.log

# Apache
tail -f /var/log/apache2/access.log
tail -f /var/log/apache2/error.log
```

**Database Logs**:
```bash
# PostgreSQL
tail -f /var/log/postgresql/postgresql-*.log

# MySQL
tail -f /var/log/mysql/error.log
tail -f /var/log/mysql/slow-query.log

# MongoDB
tail -f /var/log/mongodb/mongod.log
```
### 2. Filter and Extract Relevant Logs

Use the `.scripts/analyze-logs.sh` utility to extract relevant log entries:

**Basic Extraction**:
```bash
# Extract errors from the last hour
./commands/debug/.scripts/analyze-logs.sh \
  --file logs/application.log \
  --level ERROR \
  --since "1 hour ago"

# Extract with pattern matching
./commands/debug/.scripts/analyze-logs.sh \
  --file logs/application.log \
  --pattern "timeout|connection.*refused" \
  --context 5

# Extract a specific timeframe
./commands/debug/.scripts/analyze-logs.sh \
  --file logs/application.log \
  --start "2024-10-14 14:00:00" \
  --end "2024-10-14 15:00:00"
```

**Manual Filtering**:
```bash
# Find errors with context
grep -i "error" logs/application.log -A 5 -B 5

# Find specific error patterns
grep -E "(timeout|refused|failed)" logs/application.log

# Find errors in a timeframe
awk '/2024-10-14 14:/ && /ERROR/ {print}' logs/application.log

# Count errors by type (assumes the error name is field 5; adjust for your format)
grep "ERROR" logs/application.log | awk '{print $5}' | sort | uniq -c | sort -rn

# Extract JSON logs with jq
cat logs/application.log | jq 'select(.level == "error")'
cat logs/application.log | jq 'select(.message | contains("timeout"))'
```
### 3. Pattern Detection

Identify patterns in log data:

#### Error Patterns

**Frequency Analysis**:
```bash
# Error frequency over time (grouped by hour)
grep "ERROR" logs/application.log | \
  awk '{print $1, $2}' | \
  cut -d: -f1 | \
  uniq -c

# Most common errors
grep "ERROR" logs/application.log | \
  awk -F'ERROR' '{print $2}' | \
  sort | uniq -c | sort -rn | head -20

# Error rate calculation
total_lines=$(wc -l < logs/application.log)
error_lines=$(grep -c "ERROR" logs/application.log)
echo "Error rate: $(echo "scale=4; $error_lines / $total_lines * 100" | bc)%"
```
**Error Clustering**:
```python
# Group similar errors
import re
from collections import Counter

def normalize_error(error_msg):
    # Replace specific values (UUIDs, dates) before the generic digit pass,
    # otherwise the digit substitution mangles them first
    error_msg = re.sub(r'[a-f0-9-]{36}', 'UUID', error_msg)
    error_msg = re.sub(r'\d{4}-\d{2}-\d{2}', 'DATE', error_msg)
    error_msg = re.sub(r'\d+', 'N', error_msg)
    return error_msg

errors = []
with open('logs/application.log') as f:
    for line in f:
        if 'ERROR' in line:
            normalized = normalize_error(line)
            errors.append(normalized)

# Count error types
error_counts = Counter(errors)
for error, count in error_counts.most_common(10):
    print(f"{count}: {error}")
```
#### Request Patterns

**Request Analysis**:
```bash
# Requests per minute ($4 is the timestamp field in the default combined log format)
awk '{print $4}' /var/log/nginx/access.log | \
  cut -d: -f1-3 | \
  uniq -c

# Most requested endpoints
awk '{print $7}' /var/log/nginx/access.log | \
  sort | uniq -c | sort -rn | head -20

# Response code distribution
awk '{print $9}' /var/log/nginx/access.log | \
  sort | uniq -c | sort -rn

# Slow requests (>1 second; assumes $request_time is logged as field 10)
awk '$10 > 1.0 {print $0}' /var/log/nginx/access.log

# Top user agents
awk -F'"' '{print $6}' /var/log/nginx/access.log | \
  sort | uniq -c | sort -rn | head -10
```
#### Performance Patterns

**Response Time Analysis**:
```bash
# Average response time (assumes $request_time is logged as field 10)
awk '{sum+=$10; count++} END {print "Average:", sum/count}' \
  /var/log/nginx/access.log

# Response time percentiles
awk '{print $10}' /var/log/nginx/access.log | \
  sort -n | \
  awk '{
    times[NR] = $1
  }
  END {
    print "P50:", times[int(NR*0.5)]
    print "P95:", times[int(NR*0.95)]
    print "P99:", times[int(NR*0.99)]
  }'

# Response time over time (hourly average)
awk '{split($4, t, ":"); hour = t[1] ":" t[2]; sum[hour] += $10; count[hour]++}
  END {
    for (h in sum) print h, sum[h]/count[h]
  }' /var/log/nginx/access.log | sort
```
### 4. Timeline Analysis

Create a timeline of events:

**Timeline Construction**:
```bash
# Merge multiple log sources by timestamp
# (assumes each file is already time-sorted and uses the same timestamp prefix)
sort -m -k1,2 \
  logs/application.log \
  logs/database.log \
  logs/nginx.log \
  > merged-timeline.log

# Extract the timeline around a specific event
event_time="2024-10-14 14:30:15"
grep "$event_time" logs/application.log -B 100 -A 100

# Timeline with multiple sources
for log in logs/*.log; do
  echo "=== $(basename "$log") ==="
  grep "$event_time" "$log" -B 10 -A 10
  echo ""
done
```
**Event Correlation**:
```python
# Correlate events across log sources
import re
from datetime import datetime, timedelta

def parse_log_line(line):
    # Extract timestamp and message
    match = re.match(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})', line)
    if match:
        timestamp = datetime.strptime(match.group(1), '%Y-%m-%d %H:%M:%S')
        return timestamp, line
    return None, None

# Load events from multiple logs
events = []
for log_file in ['app.log', 'db.log', 'nginx.log']:
    with open(f'logs/{log_file}') as f:
        for line in f:
            timestamp, message = parse_log_line(line)
            if timestamp:
                events.append((timestamp, log_file, message))

# Sort by timestamp
events.sort(key=lambda x: x[0])

# Find events within a time window
def find_related_events(target_time, window_seconds=10):
    window = timedelta(seconds=window_seconds)
    start_time = target_time - window
    end_time = target_time + window

    related = [
        event for event in events
        if start_time <= event[0] <= end_time
    ]

    return related

# Analyze an error event
error_time = datetime(2024, 10, 14, 14, 30, 15)
related = find_related_events(error_time)

for timestamp, source, message in related:
    print(f"[{source}] {timestamp}: {message.strip()}")
```
### 5. Request Tracing

Trace individual requests across services:

**Request ID Tracing**:
```bash
# Extract the request ID from an error
error_line=$(grep "ERROR" logs/application.log | head -1)
request_id=$(echo "$error_line" | grep -oP 'request_id=\K[a-f0-9-]+')

echo "Tracing request: $request_id"

# Find all log entries for this request
grep "$request_id" logs/application.log

# Across multiple services
for log in logs/*.log; do
  echo "=== $(basename "$log") ==="
  grep "$request_id" "$log"
done

# With timestamps for a timeline
grep "$request_id" logs/*.log | sort -k1,2
```
**Distributed Tracing Correlation**:
```bash
# Extract the trace ID from logs
trace_id=$(grep "ERROR" logs/application.log | \
  head -1 | \
  grep -oP 'trace_id=\K[a-f0-9]+')

# Query the distributed tracing system
# Jaeger
curl "http://jaeger:16686/api/traces/$trace_id"

# Zipkin
curl "http://zipkin:9411/api/v2/trace/$trace_id"
```
### 6. Anomaly Detection

Identify unusual patterns:

**Statistical Anomalies**:
```python
import statistics
from collections import defaultdict

# Analyze error rates per hour
hourly_errors = defaultdict(int)

with open('logs/application.log') as f:
    for line in f:
        if 'ERROR' in line:
            # Extract the hour
            hour = line[:13]  # YYYY-MM-DD HH
            hourly_errors[hour] += 1

# Calculate statistics
error_counts = list(hourly_errors.values())
mean = statistics.mean(error_counts)
stdev = statistics.stdev(error_counts)

# Find anomalies (>2 standard deviations)
print("Anomalous hours (>2 std dev from mean):")
for hour, count in sorted(hourly_errors.items()):
    z_score = (count - mean) / stdev
    if abs(z_score) > 2:
        print(f"{hour}: {count} errors (z-score: {z_score:.2f})")
```
**New Error Types**:
```bash
# Compare today's errors with the baseline (yesterday's rotated log)
grep "ERROR" logs/application.log.1 | \
  awk -F'ERROR' '{print $2}' | \
  sort -u > baseline_errors.txt

grep "ERROR" logs/application.log | \
  awk -F'ERROR' '{print $2}' | \
  sort -u > current_errors.txt

# Find new error types
comm -13 baseline_errors.txt current_errors.txt > new_errors.txt

echo "New error types detected:"
cat new_errors.txt
```
**Spike Detection**:
```python
# Detect sudden spikes in error rate
from collections import deque

def detect_spikes(values, window_size=10, threshold=3):
    """Detect values that are >threshold times the rolling average"""
    window = deque(maxlen=window_size)
    spikes = []

    for i, value in enumerate(values):
        if len(window) == window_size:
            avg = sum(window) / len(window)
            if value > avg * threshold:
                spikes.append((i, value, avg))

        window.append(value)

    return spikes

# Analyze minute-by-minute error counts
minute_errors = {}  # {minute: error_count}

with open('logs/application.log') as f:
    for line in f:
        if 'ERROR' in line:
            minute = line[:16]  # YYYY-MM-DD HH:MM
            minute_errors[minute] = minute_errors.get(minute, 0) + 1

# Detect spikes
error_counts = [minute_errors.get(m, 0) for m in sorted(minute_errors.keys())]
spikes = detect_spikes(error_counts, window_size=10, threshold=3)

print("Error spikes detected:")
for idx, value, avg in spikes:
    print(f"Minute {idx}: {value} errors (avg was {avg:.1f})")
```
### 7. Performance Analysis

Analyze performance from logs:

**Slow Query Analysis**:
```bash
# PostgreSQL slow query log
# (the duration field position depends on log_line_prefix; here it is field 13)
grep "duration:" /var/log/postgresql/postgresql.log | \
  awk '{print $13, $0}' | \
  sort -rn | \
  head -20

# Extract queries slower than 1000 ms
awk '/duration:/ && $13 > 1000 {print $0}' \
  /var/log/postgresql/postgresql.log
```
**Endpoint Performance**:
```bash
# Average response time per endpoint ($7 = path, $10 = $request_time in seconds)
awk '{endpoint[$7] += $10; count[$7]++}
  END {
    for (e in endpoint) {
      printf "%s: %.3fs\n", e, endpoint[e]/count[e]
    }
  }' /var/log/nginx/access.log | sort -t: -k2 -rn

# Slowest individual requests
awk '{print $10, $7}' /var/log/nginx/access.log | \
  sort -rn | \
  head -20
```
### 8. User Impact Analysis

Assess user-facing impact:

**Affected Users**:
```bash
# Count unique users experiencing errors
grep "ERROR" logs/application.log | \
  grep -oP 'user_id=\K[a-zA-Z0-9]+' | \
  sort -u | \
  wc -l

# Error count by user
grep "ERROR" logs/application.log | \
  grep -oP 'user_id=\K[a-zA-Z0-9]+' | \
  sort | uniq -c | sort -rn | head -20

# Error rate per user (errors as a share of that user's log lines)
grep "user_id=" logs/application.log | \
  awk 'match($0, /user_id=[a-zA-Z0-9]+/) {
    user = substr($0, RSTART + 8, RLENGTH - 8)
    total[user]++
    if (/ERROR/) errors[user]++
  }
  END {
    for (user in total) {
      printf "%s %.1f%%\n", user, errors[user] / total[user] * 100
    }
  }' | sort -k2 -rn
```
**Failed Requests**:
```bash
# 5xx errors
grep " 5[0-9][0-9] " /var/log/nginx/access.log

# Failed endpoints
awk '$9 >= 500 {print $7}' /var/log/nginx/access.log | \
  sort | uniq -c | sort -rn

# Failed request details
awk '$9 >= 500 {print $4, $7, $9, $10}' \
  /var/log/nginx/access.log
```
### 9. Resource Usage from Logs

Extract resource usage patterns:

**Memory Usage**:
```bash
# Extract memory-related logs
grep -i "memory\|heap\|oom" logs/application.log

# Parse memory usage
grep "heap_used" logs/application.log | \
  awk '{print $1, $2, $NF}' | \
  sed 's/MB$//'
```

**Connection Pool**:
```bash
# Database connection logs
grep "connection" logs/application.log | \
  grep -oP 'pool_size=\K\d+|active=\K\d+|idle=\K\d+'

# Connection exhaustion
grep "connection.*timeout\|pool.*exhausted" logs/application.log -A 5
```
### 10. Security Analysis

Look for security-related issues:

**Authentication Failures**:
```bash
# Failed login attempts
grep -i "authentication.*failed\|login.*failed" logs/application.log

# By IP address
grep "authentication.*failed" logs/application.log | \
  grep -oP 'ip=\K[\d.]+' | \
  sort | uniq -c | sort -rn

# Brute force detection (more than 10 failures from one IP)
grep "authentication.*failed" logs/application.log | \
  grep -oP 'ip=\K[\d.]+' | \
  sort | uniq -c | \
  awk '$1 > 10 {print $2, $1 " attempts"}'
```
**Suspicious Patterns**:
```bash
# SQL injection attempts
grep -iE "union.*select|drop.*table|; --" /var/log/nginx/access.log

# Path traversal attempts
grep -E "\.\./|\.\.%2F" /var/log/nginx/access.log

# XSS attempts
grep -iE "<script|javascript:|onerror=" /var/log/nginx/access.log

# Command injection attempts
grep -E ";\s*(cat|ls|wget|curl)" /var/log/nginx/access.log
```
## Output Format

```markdown
# Log Analysis Report: [Issue/Time Period]

## Summary
[High-level summary of findings]

## Analysis Period
- **Start**: [start timestamp]
- **End**: [end timestamp]
- **Duration**: [duration]
- **Log Sources**: [list of logs analyzed]
- **Total Lines**: [number of log lines]

## Key Findings

### Error Analysis
- **Total Errors**: [count]
- **Error Rate**: [percentage]%
- **Error Types**: [number of unique error types]
- **Most Common Error**: [error type] ([count] occurrences)

### Top Errors

1. **[Error Type 1]** - [count] occurrences
   \`\`\`
   [sample log line]
   \`\`\`
   - First seen: [timestamp]
   - Last seen: [timestamp]
   - Peak: [timestamp with highest frequency]

2. **[Error Type 2]** - [count] occurrences
   \`\`\`
   [sample log line]
   \`\`\`
   - [similar details]

### Patterns Detected

#### Pattern 1: [Pattern Name]
- **Description**: [what the pattern is]
- **Frequency**: [how often it occurs]
- **Impact**: [user/system impact]
- **Example**:
  \`\`\`
  [log excerpt showing pattern]
  \`\`\`

#### Pattern 2: [Pattern Name]
[similar structure]
## Timeline Analysis

### Critical Events Timeline

\`\`\`
14:25:30 [APP] Normal operation, avg response time 50ms
14:28:45 [APP] Response time increasing to 150ms
14:29:10 [DB] Connection pool usage at 90%
14:29:30 [APP] First timeout errors appear
14:29:45 [DB] Connection pool exhausted
14:30:00 [APP] Error rate spikes to 25%
14:30:15 [APP] Circuit breaker opens
14:30:30 [OPS] Auto-scaling triggers
14:32:00 [APP] New instances online
14:33:00 [APP] Error rate decreases to 5%
14:35:00 [APP] Full recovery, normal operation
\`\`\`

### Event Correlation

**Root Event**: Database connection pool exhaustion at 14:29:45

**Contributing Factors**:
- High traffic spike (+300% at 14:28:00)
- Long-running queries (>5s queries detected)
- Insufficient connection pool size (max: 20)

**Cascading Effects**:
- API timeouts (starting 14:29:30)
- Cache misses due to timeouts
- Increased load from retries
- Circuit breaker activation

## Request Tracing

### Example Failed Request

**Request ID**: req_abc123def456

**Timeline**:
\`\`\`
14:30:15.123 [NGINX] Request received: POST /api/orders
14:30:15.125 [APP] Request processing started
14:30:15.130 [APP] Database query started: SELECT orders...
14:30:20.131 [DB] Query timeout after 5s
14:30:20.135 [APP] Error: Database timeout
14:30:20.137 [APP] Response: 500 Internal Server Error
14:30:20.140 [NGINX] Response sent (5017ms)
\`\`\`

**User Impact**: Order creation failed for user_123
## Anomalies Detected

### Anomaly 1: Error Rate Spike
- **Time**: 14:30:00 - 14:35:00
- **Severity**: High
- **Details**: Error rate jumped from 0.1% to 25%
- **Affected Users**: ~500 users
- **Root Cause**: Database connection pool exhaustion

### Anomaly 2: New Error Type
- **Error**: "ConnectionPoolExhausted"
- **First Seen**: 14:29:45
- **Frequency**: 1,234 occurrences in 5 minutes
- **Status**: Previously unseen in baseline

## Performance Analysis

### Response Time Statistics
- **Average**: 150ms (baseline: 50ms)
- **P50**: 80ms
- **P95**: 500ms
- **P99**: 2000ms
- **Max**: 5000ms

### Slowest Endpoints
1. `/api/orders` - avg 450ms (1,200 requests)
2. `/api/users/profile` - avg 380ms (800 requests)
3. `/api/reports` - avg 320ms (200 requests)

### Database Performance
- **Slow Queries**: 45 queries >1s
- **Slowest Query**: 5.2s (SELECT with missing index)
- **Average Query Time**: 85ms (baseline: 25ms)

## User Impact

### Affected Users
- **Total Affected**: ~500 users
- **Error Rate by User Type**:
  - Premium users: 5% error rate
  - Free users: 30% error rate
- **Most Affected User**: user_789 (25 errors)

### Failed Operations
- **Order Creation**: 234 failures
- **Payment Processing**: 89 failures
- **Profile Updates**: 45 failures

## Resource Analysis

### Connection Pool
- **Max Size**: 20 connections
- **Peak Usage**: 20/20 (100%)
- **Average Wait Time**: 2.5s
- **Recommendation**: Increase to 50 connections

### Memory Usage
- **Average**: 450MB
- **Peak**: 890MB
- **Trend**: Stable (no leak detected)

## Security Findings

### Authentication
- **Failed Logins**: 12
- **Suspicious IPs**: 2 IPs with >5 failed attempts
- **Brute Force Attempts**: None detected

### Attack Patterns
- **SQL Injection Attempts**: 0
- **XSS Attempts**: 0
- **Path Traversal**: 0

## Root Cause Analysis

Based on log analysis:

**Primary Cause**: Database connection pool too small for traffic volume

**Contributing Factors**:
1. Traffic spike (+300%)
2. Slow queries consuming connections
3. No connection timeout configured

**Evidence**:
- Connection pool exhausted at 14:29:45
- Immediate correlation with error spike
- Recovery after auto-scaling added capacity

## Recommendations

### Immediate Actions
1. Increase database connection pool to 50
2. Add connection timeout (30s)
3. Optimize slow queries identified

### Monitoring Improvements
1. Alert on connection pool usage >80%
2. Track query duration P95
3. Monitor error rate per endpoint

### Code Changes
1. Add query timeouts to all database calls
2. Implement connection retry logic
3. Add circuit breaker for database calls

## Next Steps

1. **Fix**: Use `/debug fix` to implement connection pool increase
2. **Performance**: Use `/debug performance` to optimize slow queries
3. **Monitoring**: Add alerts for connection pool usage
## Appendices

### A. Full Error Log Excerpt
\`\`\`
[Relevant log excerpts]
\`\`\`

### B. Query Performance Data
\`\`\`sql
[Slow query details]
\`\`\`

### C. Traffic Pattern Graph
\`\`\`
[ASCII graph or description of traffic pattern]
\`\`\`
```
## Error Handling

**Logs Not Found**:
If the specified log files don't exist:
1. List available log files
2. Suggest alternative log locations
3. Provide commands to locate logs (see the sketch below)

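
A minimal fallback sketch, assuming the requested path is passed as the first argument (the variable name and locations are illustrative):

```bash
#!/usr/bin/env bash
# If the requested log file is missing, list recently modified .log files
# in a few common locations instead of failing silently.
LOG_PATH="${1:-logs/application.log}"

if [[ ! -f "$LOG_PATH" ]]; then
  echo "Log file not found: $LOG_PATH" >&2
  echo "Recently modified log files in common locations:" >&2
  for dir in ./logs /var/log "$HOME/.pm2/logs"; do
    [[ -d "$dir" ]] && find "$dir" -name "*.log" -type f -mtime -1 2>/dev/null
  done
  exit 1
fi
```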
**Logs Too Large**:
If the logs are too large to analyze in full:
1. Focus on the most recent data
2. Use sampling techniques (see the sketch below)
3. Analyze specific time windows
4. Suggest log aggregation tools

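
A sampling sketch (the 100000-line window and 1-in-50 sample rate are arbitrary starting points):

```bash
# Keep only the most recent lines, then take an every-Nth-line sample
tail -n 100000 logs/application.log > recent.log
awk 'NR % 50 == 0' recent.log > sample.log

# Errors are usually worth keeping in full even when sampling the rest
grep "ERROR" recent.log > recent-errors.log
```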
**Insufficient Context**:
If the logs lack the necessary information:
1. Document what information is missing (a quick coverage check is sketched below)
2. Suggest additional logging
3. Recommend a structured logging format
4. Propose log enrichment strategies

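
One way to gauge the gap, assuming the `request_id=` / `trace_id=` / `user_id=` key-value style used in the examples above:

```bash
# Report how many log lines carry each correlation field
total=$(wc -l < logs/application.log)
for field in request_id trace_id user_id; do
  count=$(grep -c "${field}=" logs/application.log)
  echo "$field present on $count of $total lines"
done
```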
## Integration with Other Operations

- **Before**: Use `/debug diagnose` to identify the time period to analyze
- **After**: Use `/debug fix` to address issues found in logs
- **Related**: Use `/debug performance` for performance issues
- **Related**: Use `/debug reproduce` to recreate issues found in logs
## Agent Utilization

This operation leverages the **10x-fullstack-engineer** agent for:
- Pattern recognition across large log volumes
- Correlating events across multiple log sources
- Statistical analysis and anomaly detection
- Root cause inference from log patterns
- Actionable recommendations based on findings