Initial commit

2025-11-30 08:45:51 +08:00
commit 411f0d3063
5 changed files with 1125 additions and 0 deletions
--- a/commands/analyze-performance.md
+++ b/commands/analyze-performance.md
@@ -0,0 +1,602 @@
+---
+description: Analyze etcd performance metrics, latency, and identify bottlenecks
+argument-hint: "[--duration <minutes>]"
+---
+
+## Name
+etcd:analyze-performance
+
+## Synopsis
+```
+/etcd:analyze-performance [--duration <minutes>]
+```
+
+## Description
+
+The `analyze-performance` command analyzes etcd database statistics, health-check latency, and recent logs to identify latency issues, slow operations, and potential bottlenecks. It reports database size and fragmentation, commit latency from the health check, and log warnings related to disk, apply, leader, and network behavior, then recommends optimizations.
+
+Etcd performance is critical for cluster responsiveness. Slow etcd operations can cause:
+- API server timeouts
+- Slow pod creation and updates
+- Controller delays
+- Overall cluster sluggishness
+
+This command is useful for:
+- Diagnosing slow cluster operations
+- Identifying disk I/O bottlenecks
+- Detecting network latency issues
+- Capacity planning
+- Performance tuning
+
+## Prerequisites
+
+Before using this command, ensure you have:
+
+1. **OpenShift CLI (oc)**
+   - Install from: https://mirror.openshift.com/pub/openshift-v4/clients/ocp/
+   - Verify with: `oc version`
+
+2. **Active cluster connection**
+   - Must be connected to an OpenShift cluster
+   - Verify with: `oc whoami`
+
+3. **Cluster admin permissions**
+   - Required to exec into etcd pods and read their logs
+   - Verify with: `oc auth can-i create pods --subresource=exec -n openshift-etcd`
+
+4. **Running etcd pods**
+   - At least one etcd pod must be running
+   - Check with: `oc get pods -n openshift-etcd -l app=etcd`
+
+## Arguments
+
+- **--duration** (optional): Length of the log window to analyze, in minutes (default: 5)
+  - Applies to the log-based checks only; database statistics and cluster health always reflect the current state
+  - Longer windows cover more history but take longer to fetch and scan
+  - Example: `--duration 15` for 15-minute window
+
+## Implementation
+
+The command performs the following analysis:
+
+### 1. Verify Prerequisites
+
+```bash
+if ! command -v oc &> /dev/null; then
+    echo "Error: oc CLI not found"
+    exit 1
+fi
+
+if ! oc whoami &> /dev/null; then
+    echo "Error: Not connected to cluster"
+    exit 1
+fi
+
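+# Parse duration argument (default: 5 minutes); reject non-numeric values
+DURATION=5
+if [[ "$1" == "--duration" ]]; then
+    if [[ "$2" =~ ^[1-9][0-9]*$ ]]; then
+        DURATION=$2
+    else
+        echo "Error: --duration requires a positive integer (minutes)"
+        exit 1
+    fi
+fi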
+
+echo "Analyzing etcd performance (last $DURATION minutes)..."
+```
+
+### 2. Get Running Etcd Pod
+
+```bash
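+# jsonpath errors on an empty list, so discard stderr and rely on the empty-string check below
+ETCD_POD=$(oc get pods -n openshift-etcd -l app=etcd --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)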
+
+if [ -z "$ETCD_POD" ]; then
+    echo "Error: No running etcd pod found"
+    exit 1
+fi
+
+echo "Using etcd pod: $ETCD_POD"
+echo ""
+```
+
+### 3. Analyze Database Performance
+
+Get database statistics using etcdctl:
+
+```bash
+echo "==============================================="
+echo "DATABASE PERFORMANCE ANALYSIS"
+echo "==============================================="
+echo ""
+echo "Fetching database statistics..."
+
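+# Get database sizes from endpoint status
+DB_STATUS=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl endpoint status --cluster -w json 2>/dev/null)
+
+# Stop early if the status call failed; later sections parse this JSON
+if [ -z "$DB_STATUS" ]; then
+    echo "Error: Could not retrieve endpoint status"
+    exit 1
+fi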
+
+echo "Database Statistics:"
+echo "$DB_STATUS" | jq -r '.[] |
+    "Endpoint: \(.Endpoint)
+  Version: \(.Status.version)
+  DB Size: \(.Status.dbSize) bytes (\((.Status.dbSize / 1024 / 1024) | floor)MB)
+  DB In Use: \(.Status.dbSizeInUse) bytes (\((.Status.dbSizeInUse / 1024 / 1024) | floor)MB)
+  Revision: \(.Status.header.revision)
+  Raft Index: \(.Status.raftIndex)
+  Raft Term: \(.Status.raftTerm)
+  Leader: \(if .Status.leader == .Status.header.member_id then "YES" else "NO" end)
+"'
+
+echo ""
+echo "Fragmentation Analysis:"
+echo "$DB_STATUS" | jq -r '.[] |
+    if .Status.dbSize > 0 then
+        ((.Status.dbSize - .Status.dbSizeInUse) * 100 / .Status.dbSize) as $frag |
+        "Endpoint: \(.Endpoint)
+  Fragmentation: \($frag | floor)%" +
+        if $frag > 50 then
+            " - WARNING: High fragmentation detected, consider defragmentation"
+        elif $frag > 30 then
+            " - NOTICE: Moderate fragmentation"
+        else
+            " - OK"
+        end
+    else
+        "Endpoint: \(.Endpoint)
+  Fragmentation: N/A"
+    end'
+```
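+
+The fragmentation figure above is the share of allocated database space that is not in use. As a minimal sketch of the same arithmetic in plain bash (the function name `frag_percent` is hypothetical and not part of the command), using integer division so the result is floored like the `jq` output:
+
+```bash
+# Hypothetical helper: floor((dbSize - dbSizeInUse) * 100 / dbSize)
+frag_percent() {
+    local db_size=$1 in_use=$2
+    # Mirror the N/A branch of the jq filter for an empty database
+    if [ "$db_size" -le 0 ]; then
+        echo "N/A"
+        return 0
+    fi
+    echo $(( (db_size - in_use) * 100 / db_size ))
+}
+
+# Values taken from the first endpoint in Example 1
+frag_percent 94941184 51789824   # prints 45
+```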
+
+### 4. Check Cluster Health
+
+Verify etcd cluster health:
+
+```bash
+echo ""
+echo "==============================================="
+echo "CLUSTER HEALTH"
+echo "==============================================="
+echo ""
+oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl endpoint health --cluster 2>/dev/null || echo "Health check failed"
+```
+
+### 5. Analyze Logs for Performance Issues
+
+Parse etcd logs for performance warnings:
+
+```bash
+echo ""
+echo "==============================================="
+echo "LOG ANALYSIS (Last $DURATION minutes)"
+echo "==============================================="
+echo ""
+echo "Searching for performance-related warnings..."
+
+# Get recent logs
+LOGS=$(oc logs -n openshift-etcd "$ETCD_POD" -c etcd --since="${DURATION}m" 2>/dev/null)
+
+# Count slow operations
+SLOW_OPS=$(echo "$LOGS" | grep -i "slow" | wc -l)
+echo "Slow operations logged: $SLOW_OPS"
+
+if [ "$SLOW_OPS" -gt 0 ]; then
+    echo ""
+    echo "Recent slow operations (last 10):"
+    echo "$LOGS" | grep -i "slow" | tail -10
+fi
+
+echo ""
+
+# Check for disk warnings
+DISK_WARNINGS=$(echo "$LOGS" | grep -iE "disk|fdatasync|fsync" | grep -iE "slow|took|latency" | wc -l)
+echo "Disk-related warnings: $DISK_WARNINGS"
+
+if [ "$DISK_WARNINGS" -gt 0 ]; then
+    echo ""
+    echo "Disk performance warnings:"
+    echo "$LOGS" | grep -iE "disk|fdatasync|fsync" | grep -iE "slow|took|latency" | tail -5
+fi
+
+echo ""
+
+# Check for apply warnings
+APPLY_WARNINGS=$(echo "$LOGS" | grep -iE "apply.*took|slow.*apply" | wc -l)
+echo "Apply operation warnings: $APPLY_WARNINGS"
+
+if [ "$APPLY_WARNINGS" -gt 0 ]; then
+    echo ""
+    echo "Apply warnings:"
+    echo "$LOGS" | grep -iE "apply.*took|slow.*apply" | tail -5
+fi
+
+echo ""
+
+# Check for compaction info
+echo "Recent compaction operations:"
+echo "$LOGS" | grep "finished scheduled compaction" | tail -3
+if ! echo "$LOGS" | grep -q "finished scheduled compaction"; then
+    echo "  No compaction operations in this time window"
+fi
+
+echo ""
+
+# Check for snapshot operations
+echo "Snapshot operations:"
+SNAPSHOTS=$(echo "$LOGS" | grep -i "snapshot" | wc -l)
+echo "Snapshot events: $SNAPSHOTS"
+if [ "$SNAPSHOTS" -gt 0 ]; then
+    echo "$LOGS" | grep -i "snapshot" | tail -3
+fi
+```
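+
+The counting steps above share one pattern: filter the captured log text, then count the matching lines. A minimal sketch of that pattern as a reusable function (the name `count_matches` is hypothetical; it uses `grep -c` in place of `grep | wc -l`, and `|| true` because `grep` exits non-zero when nothing matches):
+
+```bash
+# Hypothetical helper: count lines in $1 matching extended regex $2, case-insensitively
+count_matches() {
+    local text=$1 pattern=$2
+    printf '%s\n' "$text" | grep -ciE -- "$pattern" || true
+}
+
+# Made-up sample lines for illustration only, not real etcd output
+SAMPLE_LOGS=$'slow write\napply took 200ms\nall good'
+count_matches "$SAMPLE_LOGS" "slow"          # prints 1
+count_matches "$SAMPLE_LOGS" "apply.*took"   # prints 1
+count_matches "$SAMPLE_LOGS" "snapshot"      # prints 0
+```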
+
+### 6. Analyze Leader Stability
+
+Check for leader changes and stability issues:
+
+```bash
+echo ""
+echo "==============================================="
+echo "LEADER STABILITY ANALYSIS"
+echo "==============================================="
+echo ""
+
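+# etcd's raft log lines use phrasings such as "changed leader", "became leader", and "lost leader";
+# match both word orders for "changed"
+LEADER_PATTERN="leader.*changed|changed leader|became leader|lost leader"
+LEADER_CHANGES=$(echo "$LOGS" | grep -iE "$LEADER_PATTERN" | wc -l)
+echo "Leader change events: $LEADER_CHANGES"
+
+if [ "$LEADER_CHANGES" -gt 0 ]; then
+    echo ""
+    echo "Leader change events:"
+    echo "$LOGS" | grep -iE "$LEADER_PATTERN"
+fi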
+
+# Check for proposal/commit issues
+echo ""
+echo "Proposal and commit operations:"
+PROPOSAL_LOGS=$(echo "$LOGS" | grep -iE "proposal|commit" | grep -iE "slow|took|failed" | wc -l)
+echo "Slow proposal/commit operations: $PROPOSAL_LOGS"
+
+if [ "$PROPOSAL_LOGS" -gt 0 ]; then
+    echo ""
+    echo "Sample slow operations:"
+    echo "$LOGS" | grep -iE "proposal|commit" | grep -iE "slow|took|failed" | tail -5
+fi
+```
+
+### 7. Analyze Network Performance
+
+Check for network-related issues:
+
+```bash
+echo ""
+echo "==============================================="
+echo "NETWORK ANALYSIS"
+echo "==============================================="
+echo ""
+
+NETWORK_ISSUES=$(echo "$LOGS" | grep -iE "network|connection|timeout|peer" | grep -iE "error|fail|slow" | wc -l)
+echo "Network-related issues: $NETWORK_ISSUES"
+
+if [ "$NETWORK_ISSUES" -gt 0 ]; then
+    echo ""
+    echo "Network issues:"
+    echo "$LOGS" | grep -iE "network|connection|timeout|peer" | grep -iE "error|fail|slow" | tail -5
+fi
+```
+
+### 8. Generate Performance Summary
+
+Create summary with recommendations:
+
+```bash
+echo ""
+echo "==============================================="
+echo "PERFORMANCE SUMMARY & RECOMMENDATIONS"
+echo "==============================================="
+echo ""
+
+ISSUES=0
+WARNINGS=0
+
+# Check fragmentation from DB status (floored to a whole percent; 0 if no data)
+MAX_FRAG=$(echo "$DB_STATUS" | jq -r '[.[] | if .Status.dbSize > 0 then ((.Status.dbSize - .Status.dbSizeInUse) * 100 / .Status.dbSize) else 0 end] | (max // 0) | floor')
+MAX_FRAG=${MAX_FRAG:-0}
+
+if [ "$MAX_FRAG" -gt 50 ]; then
+    echo "ISSUE: High database fragmentation (${MAX_FRAG}%)"
+    echo "  Recommendation: Run defragmentation on all etcd members"
+    echo "  Command: oc exec -n openshift-etcd <pod> -c etcdctl -- etcdctl defrag"
+    echo ""
+    ISSUES=$((ISSUES + 1))
+elif [ "$MAX_FRAG" -gt 30 ]; then
+    echo "WARNING: Moderate database fragmentation (${MAX_FRAG}%)"
+    echo "  Recommendation: Monitor and consider defragmentation if performance degrades"
+    echo ""
+    WARNINGS=$((WARNINGS + 1))
+fi
+
+if [ "$LEADER_CHANGES" -gt 5 ]; then
+    echo "WARNING: Frequent leader changes ($LEADER_CHANGES in last ${DURATION}m)"
+    echo "  Recommendation: Check network stability between etcd nodes"
+    echo "  - Verify network latency between control plane nodes"
+    echo "  - Check for packet loss or network congestion"
+    echo ""
+    WARNINGS=$((WARNINGS + 1))
+fi
+
+if [ "$SLOW_OPS" -gt 10 ]; then
+    echo "WARNING: High number of slow operations ($SLOW_OPS in last ${DURATION}m)"
+    echo "  Recommendation: Investigate disk I/O and workload patterns"
+    echo "  - Check disk performance with 'fio' benchmarks"
+    echo "  - Review etcd workload and consider optimization"
+    echo ""
+    WARNINGS=$((WARNINGS + 1))
+fi
+
+if [ "$DISK_WARNINGS" -gt 5 ]; then
+    echo "WARNING: Multiple disk performance warnings ($DISK_WARNINGS in last ${DURATION}m)"
+    echo "  Recommendation: Investigate disk I/O performance"
+    echo "  - Ensure etcd is using SSD/NVMe storage"
+    echo "  - Check for disk saturation or competing I/O"
+    echo "  - Verify disk benchmarks meet etcd requirements (> 50 sequential IOPS)"
+    echo ""
+    WARNINGS=$((WARNINGS + 1))
+fi
+
+# Get average DB size in whole MB (computed in jq, so bc is not required)
+AVG_DB_SIZE_MB=$(echo "$DB_STATUS" | jq -r '[.[] | .Status.dbSize] | if length > 0 then (add / length / 1024 / 1024 | floor) else 0 end')
+AVG_DB_SIZE_MB=${AVG_DB_SIZE_MB:-0}
+
+if [ "$AVG_DB_SIZE_MB" -gt 8000 ]; then
+    echo "WARNING: Large database size (${AVG_DB_SIZE_MB}MB)"
+    echo "  Recommendation: Review data retention and compaction policies"
+    echo "  - Check event retention policies"
+    echo "  - Consider more frequent compaction"
+    echo ""
+    WARNINGS=$((WARNINGS + 1))
+fi
+
+echo "Performance Metrics Summary:"
+echo "  - Database size: ${AVG_DB_SIZE_MB}MB (recommended: < 8GB)"
+echo "  - Fragmentation: ${MAX_FRAG}% (recommended: < 30%)"
+echo "  - Slow operations (${DURATION}m): $SLOW_OPS (recommended: < 10)"
+echo "  - Leader changes (${DURATION}m): $LEADER_CHANGES (recommended: < 5)"
+echo ""
+
+if [ "$ISSUES" -eq 0 ] && [ "$WARNINGS" -eq 0 ]; then
+    echo "Status: ✓ HEALTHY - Performance within acceptable ranges"
+    exit 0
+elif [ "$ISSUES" -gt 0 ]; then
+    echo "Status: ✗ CRITICAL - Found $ISSUES performance issues requiring attention"
+    exit 1
+else
+    echo "Status: ⚠ WARNING - Found $WARNINGS performance warnings"
+    exit 0
+fi
+```
+
+## Return Value
+
+- **Exit 0**: Performance is acceptable (may have warnings)
+- **Exit 1**: Critical performance issues detected, or a prerequisite check failed
+
+**Output Format**:
+- Structured sections for different performance aspects
+- Per-endpoint database statistics and counts of matching log events
+- Warnings for values exceeding thresholds
+- Recommendations for remediation
+
+## Examples
+
+### Example 1: Basic performance analysis
+```
+/etcd:analyze-performance
+```
+
+Output:
+```
+===============================================
+ETCD PERFORMANCE ANALYSIS
+===============================================
+Analyzing etcd performance (last 5 minutes)...
+Using etcd pod: etcd-dis016-p6vvv-master-0.us-central1-a.c.openshift-qe.internal
+
+===============================================
+DATABASE PERFORMANCE ANALYSIS
+===============================================
+
+Fetching database statistics...
+Database Statistics:
+Endpoint: https://10.0.0.5:2379
+  Version: 3.5.24
+  DB Size: 94941184 bytes (90MB)
+  DB In Use: 51789824 bytes (49MB)
+  Revision: 50240
+  Raft Index: 57097
+  Raft Term: 8
+  Leader: YES
+
+Endpoint: https://10.0.0.3:2379
+  Version: 3.5.24
+  DB Size: 95363072 bytes (90MB)
+  DB In Use: 51789824 bytes (49MB)
+  Revision: 50240
+  Raft Index: 57097
+  Raft Term: 8
+  Leader: NO
+
+Endpoint: https://10.0.0.6:2379
+  Version: 3.5.24
+  DB Size: 94613504 bytes (90MB)
+  DB In Use: 51834880 bytes (49MB)
+  Revision: 50240
+  Raft Index: 57097
+  Raft Term: 8
+  Leader: NO
+
+Fragmentation Analysis:
+Endpoint: https://10.0.0.5:2379
+  Fragmentation: 45% - NOTICE: Moderate fragmentation
+Endpoint: https://10.0.0.3:2379
+  Fragmentation: 45% - NOTICE: Moderate fragmentation
+Endpoint: https://10.0.0.6:2379
+  Fragmentation: 45% - NOTICE: Moderate fragmentation
+
+===============================================
+CLUSTER HEALTH
+===============================================
+
+https://10.0.0.5:2379 is healthy: successfully committed proposal: took = 9.848973ms
+https://10.0.0.3:2379 is healthy: successfully committed proposal: took = 14.309216ms
+https://10.0.0.6:2379 is healthy: successfully committed proposal: took = 14.829731ms
+
+===============================================
+LOG ANALYSIS (Last 5 minutes)
+===============================================
+
+Searching for performance-related warnings...
+Slow operations logged: 0
+Disk-related warnings: 0
+Apply operation warnings: 0
+
+Recent compaction operations:
+{"level":"info","ts":"2025-11-19T06:15:10.136401Z","caller":"mvcc/kvstore_compaction.go:72","msg":"finished scheduled compaction","compact-revision":48026,"took":"175.577699ms","hash":1330697744}
+
+===============================================
+LEADER STABILITY ANALYSIS
+===============================================
+
+Leader change events: 0
+
+===============================================
+NETWORK ANALYSIS
+===============================================
+
+Network-related issues: 0
+
+===============================================
+PERFORMANCE SUMMARY & RECOMMENDATIONS
+===============================================
+
+WARNING: Moderate database fragmentation (45%)
+  Recommendation: Monitor and consider defragmentation if performance degrades
+
+Performance Metrics Summary:
+  - Database size: 90MB (recommended: < 8GB)
+  - Fragmentation: 45% (recommended: < 30%)
+  - Slow operations (5m): 0 (recommended: < 10)
+  - Leader changes (5m): 0 (recommended: < 5)
+
+Status: ⚠ WARNING - Found 1 performance warnings
+```
+
+### Example 2: Extended analysis window
+```
+/etcd:analyze-performance --duration 30
+```
+
+## Common Performance Issues
+
+### High Database Fragmentation
+
+**Symptoms**: Database size significantly larger than in-use size (>30% fragmentation)
+
+**Investigation**:
+```bash
+# Check current fragmentation
+oc exec -n openshift-etcd <pod> -c etcdctl -- etcdctl endpoint status --cluster -w json | jq
+```
+
+**Remediation**:
+```bash
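+# Defragment one member at a time (non-leaders first, leader last).
+# Run against each etcd pod in turn; unsetting ETCDCTL_ENDPOINTS limits the call to the local member.
+oc exec -n openshift-etcd <pod> -c etcdctl -- sh -c 'unset ETCDCTL_ENDPOINTS; etcdctl --command-timeout=30s --endpoints=https://localhost:2379 defrag'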
+```
+
+**Recommendations**:
+- Schedule regular defragmentation during maintenance windows
+- Monitor fragmentation trends over time
+- Consider defragmentation when >30% fragmented
+
+### Slow Disk I/O
+
+**Symptoms**:
+- Disk-related warnings in logs (fsync, fdatasync)
+- Slow apply operations
+- High compaction times (>500ms)
+
+**Investigation**:
+```bash
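+# Check fdatasync write latency on an etcd node (assumes fio is available on the host,
+# which is not the case on a default RHCOS node; /var/tmp/fio-test is an example path)
+oc debug node/<node-name> -- chroot /host sh -c 'mkdir -p /var/tmp/fio-test && fio --name=test --rw=write --ioengine=sync --fdatasync=1 --bs=2300 --size=22m --directory=/var/tmp/fio-test'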
+```
+
+**Recommendations**:
+- Use SSD or NVMe storage for etcd
+- Ensure dedicated disks for etcd (not shared with OS)
+- Check for disk saturation or competing I/O
+- Verify disk benchmarks meet etcd requirements (> 50 sequential IOPS)
+
+### Frequent Leader Changes
+
+**Symptoms**: Multiple leader change events in logs
+
+**Investigation**:
+```bash
+# Test network latency between control plane nodes (10 probes, so the command terminates)
+oc debug node/<node1> -- ping -c 10 <node2-ip>
+
+# Check for network packet loss
+oc debug node/<node1> -- ping -c 100 <node2-ip>
+```
+
+**Recommendations**:
+- Keep etcd members on a low-latency network (same datacenter or region)
+- Check for network congestion or packet loss
+- Verify MTU settings across cluster network
+- Review network firewall rules and QoS settings
+
+### Large Database Size
+
+**Symptoms**:
+- Database size >8GB
+- Slow operations
+- High memory usage
+
+**Investigation**:
+```bash
+# Check database size across cluster
+oc exec -n openshift-etcd <pod> -c etcdctl -- etcdctl endpoint status --cluster -w table
+```
+
+**Remediation**:
+```bash
+# Check event retention settings
+oc get kubeapiserver cluster -o yaml | grep -A5 eventTTL
+
+# Review compaction settings
+oc logs -n openshift-etcd <pod> -c etcd | grep compaction
+```
+
+**Recommendations**:
+- Review event retention policies
+- Consider more frequent compaction
+- Check for key churn and unnecessary data
+- Monitor database growth trends
+
+## Security Considerations
+
+- Metrics may expose cluster operational details
+- Requires cluster-admin permissions
+- Log analysis may contain sensitive data
+- Performance data should be treated as confidential
+
+## See Also
+
+- Etcd performance guide: https://etcd.io/docs/latest/tuning/
+- OpenShift etcd docs: https://docs.openshift.com/container-platform/latest/scalability_and_performance/recommended-performance-scale-practices/
+- Related commands: `/etcd:health-check`
+
+## Notes
+
+- This command uses `etcdctl` and log analysis rather than direct metrics endpoint access
+- Performance thresholds are based on etcd upstream recommendations
+- Disk benchmarks should show > 50 sequential IOPS for etcd
+- Network latency < 50ms recommended between members
+- Analysis is point-in-time; trends require repeated checks over time
+- Compatible with etcd 3.5+ (OpenShift 4.x)
+- Log analysis window can be adjusted with `--duration` parameter
+- For production clusters, consider running during low-traffic periods
+- Health check latency is measured by actual proposal commits to the cluster