Initial commit

2025-11-30 08:45:51 +08:00
commit 411f0d3063
5 changed files with 1125 additions and 0 deletions
--- a/commands/health-check.md
+++ b/commands/health-check.md
@@ -0,0 +1,460 @@
+---
+description: Check etcd cluster health, member status, and identify issues
+argument-hint: "[--verbose]"
+---
+
+## Name
+etcd:health-check
+
+## Synopsis
+```
+/etcd:health-check [--verbose]
+```
+
+## Description
+
+The `health-check` command performs a comprehensive health check of the etcd cluster in an OpenShift environment. It examines etcd member status, cluster health, leadership, connectivity, and identifies potential issues that could affect cluster stability.
+
+Etcd is the critical key-value store that holds all cluster state for Kubernetes/OpenShift. Issues related to etcd can cause cluster-wide failures, so monitoring its health is essential.
+
+This command is useful for:
+- Diagnosing cluster control plane issues
+- Verifying etcd cluster stability
+- Identifying split-brain scenarios
+- Checking member synchronization
+- Detecting disk space issues
+- Monitoring etcd performance
+
+## Prerequisites
+
+Before using this command, ensure you have:
+
+1. **OpenShift CLI (oc)**
+   - Install from: https://mirror.openshift.com/pub/openshift-v4/clients/ocp/
+   - Verify with: `oc version`
+
+2. **Active cluster connection**
+   - Must be connected to an OpenShift cluster
+   - Verify with: `oc whoami`
+
+3. **Cluster admin permissions**
+   - Required to access etcd pods and execute commands
+   - Verify with: `oc auth can-i get pods -n openshift-etcd`
+
+4. **Healthy etcd namespace**
+   - The openshift-etcd namespace must exist
+   - At least one etcd pod must be running
+
+## Arguments
+
+- **--verbose** (optional): Enable detailed output
+  - Shows etcd member details
+  - Displays performance metrics
+  - Includes log snippets for errors
+  - Provides additional diagnostic information
+
+## Implementation
+
+The command performs the following checks:
+
+### 1. Verify Prerequisites
+
+Check if oc CLI is available and cluster is accessible:
+
+```bash
+if ! command -v oc &> /dev/null; then
+    echo "Error: oc CLI not found. Please install OpenShift CLI."
+    exit 1
+fi
+
+if ! oc whoami &> /dev/null; then
+    echo "Error: Not connected to an OpenShift cluster."
+    exit 1
+fi
+```
+
+### 2. Check Etcd Namespace and Pods
+
+Verify the etcd namespace exists and get pod status:
+
+```bash
+echo "Checking etcd namespace and pods..."
+
+if ! oc get namespace openshift-etcd &> /dev/null; then
+    echo "CRITICAL: openshift-etcd namespace not found"
+    exit 1
+fi
+
+# Get etcd pod status
+ETCD_PODS=$(oc get pods -n openshift-etcd -l app=etcd -o json)
+TOTAL_PODS=$(echo "$ETCD_PODS" | jq '.items | length')
+RUNNING_PODS=$(echo "$ETCD_PODS" | jq '[.items[] | select(.status.phase == "Running")] | length')
+
+echo "Etcd pods: $RUNNING_PODS/$TOTAL_PODS running"
+
+if [ "$RUNNING_PODS" -eq 0 ]; then
+    echo "CRITICAL: No etcd pods are running"
+    exit 1
+fi
+
+# List all etcd pods with status
+echo ""
+echo "Etcd Pod Status:"
+oc get pods -n openshift-etcd -l app=etcd -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,READY:.status.containerStatuses[0].ready,RESTARTS:.status.containerStatuses[0].restartCount,NODE:.spec.nodeName
+```
+
+### 3. Check Etcd Cluster Health
+
+Use etcdctl to check cluster health from each running etcd pod:
+
+```bash
+echo ""
+echo "Checking etcd cluster health..."
+
+# Get the first running etcd pod
+ETCD_POD=$(oc get pods -n openshift-etcd -l app=etcd --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}')
+
+if [ -z "$ETCD_POD" ]; then
+    echo "CRITICAL: No running etcd pod found"
+    exit 1
+fi
+
+# Check cluster health
+HEALTH_OUTPUT=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl endpoint health --cluster -w table 2>&1)
+
+if echo "$HEALTH_OUTPUT" | grep -q "is healthy"; then
+    echo "Cluster Health Status:"
+    echo "$HEALTH_OUTPUT"
+else
+    echo "CRITICAL: Etcd cluster health check failed"
+    echo "$HEALTH_OUTPUT"
+    exit 1
+fi
+```
+
+### 4. Check Etcd Member List
+
+List all etcd members and verify quorum:
+
+```bash
+echo ""
+echo "Checking etcd member list..."
+
+MEMBER_LIST=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl member list -w table 2>&1)
+
+echo "Etcd Members:"
+echo "$MEMBER_LIST"
+
+# Count members
+MEMBER_COUNT=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl member list -w json 2>/dev/null | jq '.members | length')
+
+echo ""
+echo "Total members: $MEMBER_COUNT"
+
+if [ "$MEMBER_COUNT" -lt 3 ]; then
+    echo "WARNING: Etcd cluster has less than 3 members (quorum at risk)"
+fi
+
+# Check for unstarted members
+UNSTARTED=$(echo "$MEMBER_LIST" | grep "unstarted" | wc -l)
+if [ "$UNSTARTED" -gt 0 ]; then
+    echo "WARNING: $UNSTARTED member(s) in unstarted state"
+fi
+```
+
+### 5. Check Etcd Leadership
+
+Verify there is a healthy leader:
+
+```bash
+echo ""
+echo "Checking etcd leadership..."
+
+ENDPOINT_STATUS=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl endpoint status --cluster -w table 2>&1)
+
+echo "Endpoint Status:"
+echo "$ENDPOINT_STATUS"
+
+# Check if there's a leader
+if echo "$ENDPOINT_STATUS" | grep -q "true"; then
+    LEADER_ENDPOINT=$(echo "$ENDPOINT_STATUS" | grep "true" | awk '{print $2}')
+    echo ""
+    echo "Leader: $LEADER_ENDPOINT"
+else
+    echo "CRITICAL: No etcd leader elected"
+    exit 1
+fi
+```
+
+### 6. Check Etcd Database Size
+
+Check database size and fragmentation:
+
+```bash
+echo ""
+echo "Checking etcd database size..."
+
+# Get database size from endpoint status
+DB_SIZE=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl endpoint status --cluster -w json 2>/dev/null)
+
+echo "$DB_SIZE" | jq -r '.[] | "Endpoint: \(.Endpoint) | DB Size: \(.Status.dbSize) bytes | DB Size in Use: \(.Status.dbSizeInUse) bytes"'
+
+# Calculate fragmentation percentage
+echo "$DB_SIZE" | jq -r '.[] |
+    if .Status.dbSize > 0 then
+        "Fragmentation: \(((.Status.dbSize - .Status.dbSizeInUse) * 100 / .Status.dbSize) | floor)%"
+    else
+        "Fragmentation: N/A"
+    end'
+
+# Warn if database is too large
+MAX_DB_SIZE=$((8 * 1024 * 1024 * 1024))  # 8GB threshold
+CURRENT_SIZE=$(echo "$DB_SIZE" | jq -r '.[0].Status.dbSize')
+
+if [ "$CURRENT_SIZE" -gt "$MAX_DB_SIZE" ]; then
+    echo "WARNING: Etcd database size ($CURRENT_SIZE bytes) exceeds recommended maximum (8GB)"
+    echo "Consider defragmentation or checking for excessive key growth"
+fi
+```
+
+### 7. Check Disk Space on Etcd Nodes
+
+Verify disk space on nodes running etcd:
+
+```bash
+echo ""
+echo "Checking disk space on etcd nodes..."
+
+for pod in $(oc get pods -n openshift-etcd -l app=etcd --field-selector=status.phase=Running -o jsonpath='{.items[*].metadata.name}'); do
+    echo "Pod: $pod"
+    oc exec -n openshift-etcd "$pod" -c etcd -- df -h /var/lib/etcd | tail -1
+
+    # Get disk usage percentage
+    DISK_USAGE=$(oc exec -n openshift-etcd "$pod" -c etcd -- df -h /var/lib/etcd | tail -1 | awk '{print $5}' | sed 's/%//')
+
+    if [ "$DISK_USAGE" -gt 80 ]; then
+        echo "WARNING: Disk usage on $pod is ${DISK_USAGE}% (threshold: 80%)"
+    fi
+    echo ""
+done
+```
+
+### 8. Check for Recent Etcd Errors
+
+Check recent logs for errors or warnings:
+
+```bash
+echo ""
+echo "Checking recent etcd logs for errors..."
+
+RECENT_ERRORS=$(oc logs -n openshift-etcd "$ETCD_POD" -c etcd --tail=100 | grep -i "error\|warn\|fatal" | tail -10)
+
+if [ -n "$RECENT_ERRORS" ]; then
+    echo "Recent errors/warnings found:"
+    echo "$RECENT_ERRORS"
+else
+    echo "No recent errors in etcd logs"
+fi
+```
+
+### 9. Check Etcd Performance Metrics (if --verbose)
+
+If verbose mode is enabled, check performance metrics:
+
+```bash
+if [ "$VERBOSE" = "true" ]; then
+    echo ""
+    echo "Checking etcd performance metrics..."
+
+    # Get metrics from etcd pod
+    METRICS=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcd -- curl -s http://localhost:2379/metrics 2>/dev/null)
+
+    # Parse key metrics
+    echo "Backend Commit Duration (p99):"
+    echo "$METRICS" | grep "etcd_disk_backend_commit_duration_seconds" | grep "quantile=\"0.99\"" | head -1
+
+    echo ""
+    echo "WAL Fsync Duration (p99):"
+    echo "$METRICS" | grep "etcd_disk_wal_fsync_duration_seconds" | grep "quantile=\"0.99\"" | head -1
+
+    echo ""
+    echo "Leader Changes:"
+    echo "$METRICS" | grep "etcd_server_leader_changes_seen_total" | head -1
+fi
+```
+
+### 10. Generate Summary Report
+
+Create a summary of findings:
+
+```bash
+echo ""
+echo "==============================================="
+echo "Etcd Health Check Summary"
+echo "==============================================="
+echo "Check Time: $(date)"
+echo "Cluster: $(oc whoami --show-server)"
+echo ""
+echo "Results:"
+echo "  Etcd Pods Running: $RUNNING_PODS/$TOTAL_PODS"
+echo "  Cluster Members: $MEMBER_COUNT"
+echo "  Leader Elected: Yes"
+echo "  Cluster Health: Healthy"
+echo ""
+
+if [ "$WARNINGS" -gt 0 ]; then
+    echo "Status: WARNING - Found $WARNINGS warnings requiring attention"
+    exit 0
+else
+    echo "Status: HEALTHY - All checks passed"
+    exit 0
+fi
+```
+
+## Return Value
+
+The command returns different exit codes:
+
+- **Exit 0**: Etcd cluster is healthy (may have warnings)
+- **Exit 1**: Critical issues detected (no running pods, no leader, health check failed)
+
+**Output Format**:
+- Human-readable report with section headers
+- Critical issues marked with "CRITICAL:"
+- Warnings marked with "WARNING:"
+- Success indicators for healthy checks
+
+## Examples
+
+### Example 1: Basic health check
+```
+/etcd:health-check
+```
+
+Output:
+```
+Checking etcd namespace and pods...
+Etcd pods: 3/3 running
+
+Etcd Pod Status:
+NAME                                     STATUS    READY  RESTARTS  NODE
+etcd-ip-10-0-21-125.us-east-2...        Running   true   0         ip-10-0-21-125
+etcd-ip-10-0-43-249.us-east-2...        Running   true   0         ip-10-0-43-249
+etcd-ip-10-0-68-109.us-east-2...        Running   true   0         ip-10-0-68-109
+
+Checking etcd cluster health...
+Cluster Health Status:
+------------------------------------------+--------+
+|                ENDPOINT                  | HEALTH |
+------------------------------------------+--------+
+| https://10.0.21.125:2379                | true   |
+| https://10.0.43.249:2379                | true   |
+| https://10.0.68.109:2379                | true   |
+------------------------------------------+--------+
+
+Checking etcd member list...
+Etcd Members:
+------------------+---------+------------------------+
+|        ID        | STATUS  |          NAME          |
+------------------+---------+------------------------+
+| 3a2b1c4d5e6f7890 | started | ip-10-0-21-125         |
+| 4b3c2d5e6f708901 | started | ip-10-0-43-249         |
+| 5c4d3e6f70890123 | started | ip-10-0-68-109         |
+------------------+---------+------------------------+
+
+Total members: 3
+
+Checking etcd leadership...
+Leader: https://10.0.21.125:2379
+
+===============================================
+Etcd Health Check Summary
+===============================================
+Status: HEALTHY - All checks passed
+```
+
+### Example 2: Verbose health check with metrics
+```
+/etcd:health-check --verbose
+```
+
+## Common Issues and Remediation
+
+### No Etcd Leader
+
+**Symptoms**: Cluster shows no leader elected
+
+**Investigation**:
+```bash
+oc logs -n openshift-etcd <etcd-pod> -c etcd | grep -i "leader"
+oc get events -n openshift-etcd
+```
+
+**Remediation**:
+- Check network connectivity between etcd members
+- Verify etcd pods are running on different nodes
+- Check for clock skew between nodes
+
+### High Database Size
+
+**Symptoms**: Database size exceeds 8GB
+
+**Investigation**:
+```bash
+oc exec -n openshift-etcd <etcd-pod> -c etcdctl -- etcdctl endpoint status -w table
+```
+
+**Remediation**:
+- Run defragmentation: `/etcd:defrag` (if command exists)
+- Check for excessive key creation (e.g., many events)
+- Review retention policies
+
+### Disk Space Issues
+
+**Symptoms**: Disk usage > 80% on etcd data directory
+
+**Investigation**:
+```bash
+oc exec -n openshift-etcd <etcd-pod> -c etcd -- df -h /var/lib/etcd
+```
+
+**Remediation**:
+- Clean up old snapshots
+- Defragment database
+- Increase disk size if needed
+
+### Member Not Started
+
+**Symptoms**: Member shows "unstarted" status
+
+**Investigation**:
+```bash
+oc logs -n openshift-etcd <etcd-pod> -c etcd
+oc describe pod -n openshift-etcd <etcd-pod>
+```
+
+**Remediation**:
+- Check pod logs for errors
+- Verify certificates are valid
+- Check network policies and firewall rules
+
+## Security Considerations
+
+- Requires cluster-admin or equivalent permissions
+- Access to etcd data allows viewing all cluster secrets
+- Etcd metrics may contain sensitive information
+- Always use secure connections when accessing etcd
+
+## See Also
+
+- Etcd documentation: https://etcd.io/docs/
+- OpenShift etcd docs: https://docs.openshift.com/container-platform/latest/backup_and_restore/control_plane_backup_and_restore/
+- Related commands: `/etcd:analyze-performance`
+
+## Notes
+
+- This command is read-only and does not modify etcd
+- Checks are performed from within etcd pods using etcdctl
+- Some checks require etcd to be running
+- Performance may vary on large clusters with many keys
+- Database size recommendations are based on upstream etcd guidance