Analyze Logs Operation - Deep Log Analysis

You are executing the analyze-logs operation to perform deep log analysis with pattern detection, timeline correlation, and anomaly identification.

Parameters

Received: $ARGUMENTS (after the 'analyze-logs' operation name is removed)

Expected format: path:"log-file-path" [pattern:"regex-pattern"] [timeframe:"time-range"] [level:"error|warn|info"] [context:"lines-before-after"]
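
For example, an argument string might look like this (the values are illustrative, not required defaults):

path:"logs/application.log" pattern:"timeout|connection.*refused" timeframe:"last 1 hour" level:"error" context:"5"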

Workflow

1. Discover and Locate Logs

Identify all relevant log sources:

Application Logs:

# Common log locations
ls -lh /var/log/application/
ls -lh logs/
ls -lh ~/.pm2/logs/

# Find log files
find /var/log -name "*.log" -type f
find . -name "*.log" -mtime -1  # Modified in last 24 hours

# Check log rotation
ls -lh /var/log/application/*.log*
zcat /var/log/application/app.log.*.gz  # Read rotated logs

System Logs:

# Systemd service logs
journalctl -u application.service --since "1 hour ago"
journalctl -u application.service --since "2024-10-14 14:00:00"

# Syslog
tail -f /var/log/syslog
tail -f /var/log/messages

# Kernel logs
dmesg -T

Container Logs:

# Docker
docker logs container-name --since 1h
docker logs container-name --timestamps
docker logs --tail 1000 container-name > container-logs.txt

# Kubernetes
kubectl logs pod-name -c container-name
kubectl logs pod-name --previous  # Previous container
kubectl logs -l app=myapp --all-containers=true

Web Server Logs:

# Nginx
tail -f /var/log/nginx/access.log
tail -f /var/log/nginx/error.log

# Apache
tail -f /var/log/apache2/access.log
tail -f /var/log/apache2/error.log

Database Logs:

# PostgreSQL
tail -f /var/log/postgresql/postgresql-*.log

# MySQL
tail -f /var/log/mysql/error.log
tail -f /var/log/mysql/slow-query.log

# MongoDB
tail -f /var/log/mongodb/mongod.log

2. Filter and Extract Relevant Logs

Use the .scripts/analyze-logs.sh utility to extract relevant log entries:

Basic Extraction:

# Extract errors from last hour
./commands/debug/.scripts/analyze-logs.sh \
  --file logs/application.log \
  --level ERROR \
  --since "1 hour ago"

# Extract with pattern matching
./commands/debug/.scripts/analyze-logs.sh \
  --file logs/application.log \
  --pattern "timeout|connection.*refused" \
  --context 5

# Extract specific timeframe
./commands/debug/.scripts/analyze-logs.sh \
  --file logs/application.log \
  --start "2024-10-14 14:00:00" \
  --end "2024-10-14 15:00:00"

Manual Filtering:

# Find errors with context
grep -i "error" logs/application.log -A 5 -B 5

# Find specific error patterns
grep -E "(timeout|refused|failed)" logs/application.log

# Find errors in timeframe
awk '/2024-10-14 14:/ && /ERROR/ {print}' logs/application.log

# Count errors by type (assumes the error type appears in the 5th field)
grep "ERROR" logs/application.log | awk '{print $5}' | sort | uniq -c | sort -rn

# Extract JSON logs with jq
jq 'select(.level == "error")' logs/application.log
jq 'select(.message | contains("timeout"))' logs/application.log

3. Pattern Detection

Identify patterns in log data:

Error Patterns

Frequency Analysis:

# Error frequency over time
grep "ERROR" logs/application.log | \
  awk '{print $1, $2}' | \
  cut -d: -f1 | \
  uniq -c

# Most common errors
grep "ERROR" logs/application.log | \
  awk -F'ERROR' '{print $2}' | \
  sort | uniq -c | sort -rn | head -20

# Error rate calculation
total_lines=$(wc -l < logs/application.log)
error_lines=$(grep -c "ERROR" logs/application.log)
echo "Error rate: $(echo "scale=4; $error_lines / $total_lines * 100" | bc)%"

Error Clustering:

# Group similar errors
import re
from collections import Counter

def normalize_error(error_msg):
    # Replace UUIDs and dates before bare numbers; replacing digits first
    # would break the UUID and date patterns
    error_msg = re.sub(r'[a-f0-9-]{36}', 'UUID', error_msg)
    error_msg = re.sub(r'\d{4}-\d{2}-\d{2}', 'DATE', error_msg)
    error_msg = re.sub(r'\d+', 'N', error_msg)
    return error_msg

errors = []
with open('logs/application.log') as f:
    for line in f:
        if 'ERROR' in line:
            normalized = normalize_error(line)
            errors.append(normalized)

# Count error types
error_counts = Counter(errors)
for error, count in error_counts.most_common(10):
    print(f"{count}: {error}")

Request Patterns

Request Analysis:

# Requests per minute (the timestamp is field 4 in the combined log format)
awk '{print $4}' /var/log/nginx/access.log | \
  cut -d: -f1-3 | \
  uniq -c

# Most requested endpoints
awk '{print $7}' /var/log/nginx/access.log | \
  sort | uniq -c | sort -rn | head -20

# Response code distribution
awk '{print $9}' /var/log/nginx/access.log | \
  sort | uniq -c | sort -rn

# Slow requests (>1 second); assumes $request_time is logged as the 10th field
awk '$10 > 1.0 {print $0}' /var/log/nginx/access.log

# Top user agents
awk -F'"' '{print $6}' /var/log/nginx/access.log | \
  sort | uniq -c | sort -rn | head -10

Performance Patterns

Response Time Analysis:

# Average response time
awk '{sum+=$10; count++} END {print "Average:", sum/count}' \
  /var/log/nginx/access.log

# Response time percentiles
awk '{print $10}' /var/log/nginx/access.log | \
  sort -n | \
  awk '{
    times[NR] = $1
  }
  END {
    print "P50:", times[int(NR*0.5)]
    print "P95:", times[int(NR*0.95)]
    print "P99:", times[int(NR*0.99)]
  }'

# Response time over time (averaged per hour, bucketed by date:hour from field 4)
awk '{split($4, t, ":"); key = t[1] ":" t[2]; sum[key] += $10; count[key]++}
  END {for (k in sum) print k, sum[k]/count[k]}' \
  /var/log/nginx/access.log | sort

4. Timeline Analysis

Create timeline of events:

Timeline Construction:

# Merge multiple log sources by timestamp (sort -m assumes each file is already sorted)
sort -m -k1,2 \
  logs/application.log \
  logs/database.log \
  logs/nginx.log \
  > merged-timeline.log

# Extract timeline around specific event
event_time="2024-10-14 14:30:15"
grep "$event_time" logs/application.log -B 100 -A 100

# Timeline with multiple sources
for log in logs/*.log; do
  echo "=== $(basename $log) ==="
  grep "$event_time" "$log" -B 10 -A 10
  echo ""
done

Event Correlation:

# Correlate events across log sources
import re
from datetime import datetime, timedelta

def parse_log_line(line):
    # Extract timestamp and message
    match = re.match(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})', line)
    if match:
        timestamp = datetime.strptime(match.group(1), '%Y-%m-%d %H:%M:%S')
        return timestamp, line
    return None, None

# Load events from multiple logs
events = []
for log_file in ['app.log', 'db.log', 'nginx.log']:
    with open(f'logs/{log_file}') as f:
        for line in f:
            timestamp, message = parse_log_line(line)
            if timestamp:
                events.append((timestamp, log_file, message))

# Sort by timestamp
events.sort(key=lambda x: x[0])

# Find events within time window
def find_related_events(target_time, window_seconds=10):
    window = timedelta(seconds=window_seconds)
    start_time = target_time - window
    end_time = target_time + window

    related = [
        event for event in events
        if start_time <= event[0] <= end_time
    ]

    return related

# Analyze error event
error_time = datetime(2024, 10, 14, 14, 30, 15)
related = find_related_events(error_time)

for timestamp, source, message in related:
    print(f"[{source}] {timestamp}: {message.strip()}")

5. Request Tracing

Trace individual requests across services:

Request ID Tracing:

# Extract request ID from error
error_line=$(grep "ERROR" logs/application.log | head -1)
request_id=$(echo "$error_line" | grep -oP 'request_id=\K[a-f0-9-]+')

echo "Tracing request: $request_id"

# Find all log entries for this request
grep "$request_id" logs/application.log

# Across multiple services
for log in logs/*.log; do
  echo "=== $(basename $log) ==="
  grep "$request_id" "$log"
done

# With timestamps for timeline
grep "$request_id" logs/*.log | sort -k1,2

Distributed Tracing Correlation:

# Extract trace ID from logs
trace_id=$(grep "ERROR" logs/application.log | \
  head -1 | \
  grep -oP 'trace_id=\K[a-f0-9]+')

# Query distributed tracing system
# Jaeger
curl "http://jaeger:16686/api/traces/$trace_id"

# Zipkin
curl "http://zipkin:9411/api/v2/trace/$trace_id"

6. Anomaly Detection

Identify unusual patterns:

Statistical Anomalies:

import statistics
from collections import defaultdict

# Analyze error rates per hour
hourly_errors = defaultdict(int)

with open('logs/application.log') as f:
    for line in f:
        if 'ERROR' in line:
            # Extract hour
            hour = line[:13]  # YYYY-MM-DD HH
            hourly_errors[hour] += 1

# Calculate statistics (stdev needs at least two hourly buckets)
error_counts = list(hourly_errors.values())
mean = statistics.mean(error_counts)
stdev = statistics.stdev(error_counts) if len(error_counts) > 1 else 0.0

# Find anomalies (>2 standard deviations)
print("Anomalous hours (>2 std dev from mean):")
for hour, count in sorted(hourly_errors.items()):
    z_score = (count - mean) / stdev if stdev else 0.0
    if abs(z_score) > 2:
        print(f"{hour}: {count} errors (z-score: {z_score:.2f})")

New Error Types:

# Compare today's errors with baseline
grep "ERROR" logs/application.log.1 | \
  awk -F'ERROR' '{print $2}' | \
  sort -u > baseline_errors.txt

grep "ERROR" logs/application.log | \
  awk -F'ERROR' '{print $2}' | \
  sort -u > current_errors.txt

# Find new error types
comm -13 baseline_errors.txt current_errors.txt > new_errors.txt

echo "New error types detected:"
cat new_errors.txt

Spike Detection:

# Detect sudden spikes in error rate
from collections import deque

def detect_spikes(values, window_size=10, threshold=3):
    """Detect values that are >threshold times the rolling average"""
    window = deque(maxlen=window_size)
    spikes = []

    for i, value in enumerate(values):
        if len(window) == window_size:
            avg = sum(window) / len(window)
            if value > avg * threshold:
                spikes.append((i, value, avg))

        window.append(value)

    return spikes

# Analyze minute-by-minute error counts
minute_errors = {}  # {minute: error_count}

with open('logs/application.log') as f:
    for line in f:
        if 'ERROR' in line:
            minute = line[:16]  # YYYY-MM-DD HH:MM
            minute_errors[minute] = minute_errors.get(minute, 0) + 1

# Detect spikes
minutes = sorted(minute_errors.keys())
error_counts = [minute_errors[m] for m in minutes]
spikes = detect_spikes(error_counts, window_size=10, threshold=3)

print("Error spikes detected:")
for idx, value, avg in spikes:
    print(f"{minutes[idx]}: {value} errors (rolling avg was {avg:.1f})")

7. Performance Analysis

Analyze performance from logs:

Slow Query Analysis:

# PostgreSQL slow query log (the duration field position depends on log_line_prefix)
grep "duration:" /var/log/postgresql/postgresql.log | \
  awk '{print $13, $0}' | \
  sort -rn | \
  head -20

# Extract slow queries (duration > 1000 ms)
awk '/duration:/ && $13 > 1000 {print $0}' \
  /var/log/postgresql/postgresql.log

Endpoint Performance:

# Average response time per endpoint ($request_time assumed in field 10, seconds)
awk '{endpoint[$7] += $10; count[$7]++}
END {
  for (e in endpoint) {
    printf "%s: %.3fs\n", e, endpoint[e]/count[e]
  }
}' /var/log/nginx/access.log | sort -t: -k2 -rn

# Slowest endpoints
awk '{print $10, $7}' /var/log/nginx/access.log | \
  sort -rn | \
  head -20

8. User Impact Analysis

Assess user-facing impact:

Affected Users:

# Extract unique users experiencing errors
grep "ERROR" logs/application.log | \
  grep -oP 'user_id=\K[a-zA-Z0-9]+' | \
  sort -u | \
  wc -l

# Users with the most errors
grep "ERROR" logs/application.log | \
  grep -oP 'user_id=\K[a-zA-Z0-9]+' | \
  sort | uniq -c | sort -rn | head -20

# Error rate per user (errors as a share of that user's log lines)
grep "user_id=" logs/application.log | \
  awk '{
    match($0, /user_id=[a-zA-Z0-9]+/)
    user = substr($0, RSTART + 8, RLENGTH - 8)
    total[user]++
    if (/ERROR/) errors[user]++
  }
  END {
    for (user in total) {
      printf "%s %.1f%%\n", user, errors[user] / total[user] * 100
    }
  }' | sort -k2 -rn

Failed Requests:

# 5xx errors
grep " 5[0-9][0-9] " /var/log/nginx/access.log

# Failed endpoints
awk '$9 >= 500 {print $7}' /var/log/nginx/access.log | \
  sort | uniq -c | sort -rn

# Failed request details
awk '$9 >= 500 {print $4, $7, $9, $10}' \
  /var/log/nginx/access.log

9. Resource Usage from Logs

Extract resource usage patterns:

Memory Usage:

# Extract memory logs
grep -i "memory\|heap\|oom" logs/application.log

# Parse memory usage
grep "heap_used" logs/application.log | \
  awk '{print $1, $2, $NF}' | \
  sed 's/MB$//'

Connection Pool:

# Database connection logs
grep "connection" logs/application.log | \
  grep -oP 'pool_size=\K\d+|active=\K\d+|idle=\K\d+'

# Connection exhaustion
grep "connection.*timeout\|pool.*exhausted" logs/application.log -A 5

10. Security Analysis

Look for security-related issues:

Authentication Failures:

# Failed login attempts
grep -i "authentication.*failed\|login.*failed" logs/application.log

# By IP address
grep "authentication.*failed" logs/application.log | \
  grep -oP 'ip=\K[\d.]+' | \
  sort | uniq -c | sort -rn

# Brute force detection (>10 failures from a single IP)
grep "authentication.*failed" logs/application.log | \
  grep -oP 'ip=\K[\d.]+' | \
  sort | uniq -c | \
  awk '$1 > 10 {print $2, $1 " attempts"}'

Suspicious Patterns:

# SQL injection attempts
grep -iE "union.*select|drop.*table|; --" /var/log/nginx/access.log

# Path traversal attempts
grep -E "\.\./|\.\.%2F" /var/log/nginx/access.log

# XSS attempts
grep -iE "<script|javascript:|onerror=" /var/log/nginx/access.log

# Command injection attempts
grep -E ";[[:space:]]*(cat|ls|wget|curl)" /var/log/nginx/access.log

Output Format

# Log Analysis Report: [Issue/Time Period]

## Summary
[High-level summary of findings]

## Analysis Period
- **Start**: [start timestamp]
- **End**: [end timestamp]
- **Duration**: [duration]
- **Log Sources**: [list of logs analyzed]
- **Total Lines**: [number of log lines]

## Key Findings

### Error Analysis
- **Total Errors**: [count]
- **Error Rate**: [percentage]%
- **Error Types**: [number of unique error types]
- **Most Common Error**: [error type] ([count] occurrences)

### Top Errors

1. **[Error Type 1]** - [count] occurrences

[sample log line]

- First seen: [timestamp]
- Last seen: [timestamp]
- Peak: [timestamp with highest frequency]

2. **[Error Type 2]** - [count] occurrences

[sample log line]

- [similar details]

### Patterns Detected

#### Pattern 1: [Pattern Name]
- **Description**: [what pattern is]
- **Frequency**: [how often it occurs]
- **Impact**: [user/system impact]
- **Example**:

[log excerpt showing pattern]


#### Pattern 2: [Pattern Name]
[similar structure]

## Timeline Analysis

### Critical Events Timeline

\`\`\`
14:25:30 [APP] Normal operation, avg response time 50ms
14:28:45 [APP] Response time increasing to 150ms
14:29:10 [DB] Connection pool usage at 90%
14:29:30 [APP] First timeout errors appear
14:29:45 [DB] Connection pool exhausted
14:30:00 [APP] Error rate spikes to 25%
14:30:15 [APP] Circuit breaker opens
14:30:30 [OPS] Auto-scaling triggers
14:32:00 [APP] New instances online
14:33:00 [APP] Error rate decreases to 5%
14:35:00 [APP] Full recovery, normal operation
\`\`\`

### Event Correlation

**Root Event**: Database connection pool exhaustion at 14:29:45

**Contributing Factors**:
- High traffic spike (+300% at 14:28:00)
- Long-running queries (>5s queries detected)
- Insufficient connection pool size (max: 20)

**Cascading Effects**:
- API timeouts (starting 14:29:30)
- Cache misses due to timeouts
- Increased load from retries
- Circuit breaker activation

## Request Tracing

### Example Failed Request

**Request ID**: req_abc123def456

**Timeline**:
\`\`\`
14:30:15.123 [NGINX] Request received: POST /api/orders
14:30:15.125 [APP] Request processing started
14:30:15.130 [APP] Database query started: SELECT orders...
14:30:20.131 [DB] Query timeout after 5s
14:30:20.135 [APP] Error: Database timeout
14:30:20.137 [APP] Response: 500 Internal Server Error
14:30:20.140 [NGINX] Response sent (5017ms)
\`\`\`

**User Impact**: Order creation failed for user_123

## Anomalies Detected

### Anomaly 1: Error Rate Spike
- **Time**: 14:30:00 - 14:35:00
- **Severity**: High
- **Details**: Error rate jumped from 0.1% to 25%
- **Affected Users**: ~500 users
- **Root Cause**: Database connection pool exhaustion

### Anomaly 2: New Error Type
- **Error**: "ConnectionPoolExhausted"
- **First Seen**: 14:29:45
- **Frequency**: 1,234 occurrences in 5 minutes
- **Status**: Previously unseen in baseline

## Performance Analysis

### Response Time Statistics
- **Average**: 150ms (baseline: 50ms)
- **P50**: 80ms
- **P95**: 500ms
- **P99**: 2000ms
- **Max**: 5000ms

### Slowest Endpoints
1. `/api/orders` - avg 450ms (1,200 requests)
2. `/api/users/profile` - avg 380ms (800 requests)
3. `/api/reports` - avg 320ms (200 requests)

### Database Performance
- **Slow Queries**: 45 queries >1s
- **Slowest Query**: 5.2s (SELECT with missing index)
- **Average Query Time**: 85ms (baseline: 25ms)

## User Impact

### Affected Users
- **Total Affected**: ~500 users
- **Error Rate by User Type**:
  - Premium users: 5% error rate
  - Free users: 30% error rate
- **Most Affected User**: user_789 (25 errors)

### Failed Operations
- **Order Creation**: 234 failures
- **Payment Processing**: 89 failures
- **Profile Updates**: 45 failures

## Resource Analysis

### Connection Pool
- **Max Size**: 20 connections
- **Peak Usage**: 20/20 (100%)
- **Average Wait Time**: 2.5s
- **Recommendation**: Increase to 50 connections

### Memory Usage
- **Average**: 450MB
- **Peak**: 890MB
- **Trend**: Stable (no leak detected)

## Security Findings

### Authentication
- **Failed Logins**: 12
- **Suspicious IPs**: 2 IPs with >5 failed attempts
- **Brute Force Attempts**: None detected

### Attack Patterns
- **SQL Injection Attempts**: 0
- **XSS Attempts**: 0
- **Path Traversal**: 0

## Root Cause Analysis

Based on log analysis:

**Primary Cause**: Database connection pool too small for traffic volume

**Contributing Factors**:
1. Traffic spike (+300%)
2. Slow queries consuming connections
3. No connection timeout configured

**Evidence**:
- Connection pool exhausted at 14:29:45
- Immediate correlation with error spike
- Recovery after auto-scaling added capacity

## Recommendations

### Immediate Actions
1. Increase database connection pool to 50
2. Add connection timeout (30s)
3. Optimize slow queries identified

### Monitoring Improvements
1. Alert on connection pool usage >80%
2. Track query duration P95
3. Monitor error rate per endpoint

### Code Changes
1. Add query timeouts to all database calls
2. Implement connection retry logic
3. Add circuit breaker for database calls

## Next Steps

1. **Fix**: Use `/debug fix` to implement connection pool increase
2. **Performance**: Use `/debug performance` to optimize slow queries
3. **Monitoring**: Add alerts for connection pool usage

## Appendices

### A. Full Error Log Excerpt
\`\`\`
[Relevant log excerpts]
\`\`\`

### B. Query Performance Data
\`\`\`sql
[Slow query details]
\`\`\`

### C. Traffic Pattern Graph
\`\`\`
[ASCII graph or description of traffic pattern]
\`\`\`

Error Handling

Logs Not Found: If the specified log files don't exist:

  1. List available log files
  2. Suggest alternative log locations
  3. Provide commands to locate logs (see the sketch below)
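
A minimal fallback sketch, assuming the path parameter did not resolve; the search locations and limits are illustrative, not exhaustive:

# Hypothetical fallback when the requested log path is missing
log_path="logs/application.log"   # value taken from the path: parameter
if [ ! -f "$log_path" ]; then
  echo "Not found: $log_path - searching for recently modified logs"
  find . /var/log -maxdepth 3 -name "*.log" -mtime -1 2>/dev/null | head -20
  # If the service runs under systemd, journald may hold the logs instead
  command -v journalctl >/dev/null && journalctl --no-pager --since "1 hour ago" -n 50
fi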

Logs Too Large: If logs are too large to analyze:

  1. Focus on most recent data
  2. Use sampling techniques (see the sketch below)
  3. Analyze specific time windows
  4. Suggest log aggregation tools
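
A minimal sampling sketch, assuming a large plain-text log; the slice sizes, sample counts, and temp file names are illustrative:

# Work on the most recent slice, then sample it
tail -n 500000 logs/application.log > /tmp/recent.log        # recent window only
awk 'NR % 50 == 0' /tmp/recent.log > /tmp/sampled.log        # every 50th line (~2% sample)
shuf -n 10000 /tmp/recent.log > /tmp/random-sample.log       # random sample (GNU coreutils shuf)

# Or cut the analysis down to a specific time window first
awk '/2024-10-14 14:[0-5][0-9]/' logs/application.log > /tmp/window.log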

Insufficient Context: If logs lack necessary information:

  1. Document what information is missing
  2. Suggest additional logging
  3. Recommend structured logging format (see the example below)
  4. Propose log enrichment strategies
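
An illustrative contrast between unstructured and structured entries; the field names (request_id, user_id) are examples, not a required schema:

# Unstructured (hard to parse reliably):
#   2024-10-14 14:30:15 ERROR timeout for user_123 on request req_abc123
# Structured JSON, one object per line, is machine-queryable:
#   {"ts":"2024-10-14T14:30:15Z","level":"error","request_id":"req_abc123","user_id":"user_123","message":"Database timeout"}

# With structured logs, the earlier jq filters become exact lookups
jq -c 'select(.level == "error" and .request_id == "req_abc123")' logs/application.log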

Integration with Other Operations

  • Before: Use /debug diagnose to identify time period to analyze
  • After: Use /debug fix to address issues found in logs
  • Related: Use /debug performance for performance issues
  • Related: Use /debug reproduce to recreate issues found in logs

Agent Utilization

This operation leverages the 10x-fullstack-engineer agent for:

  • Pattern recognition across large log volumes
  • Correlating events across multiple log sources
  • Statistical analysis and anomaly detection
  • Root cause inference from log patterns
  • Actionable recommendations based on findings