gh-openshift-eng-ai-helpers-plugins-etcd/commands/health-check.md at master

Zhongwei Li 411f0d3063 Initial commit

2025-11-30 08:45:51 +08:00

description	argument-hint
Check etcd cluster health, member status, and identify issues	[--verbose]

Name

etcd:health-check

Synopsis

/etcd:health-check [--verbose]

Description

The health-check command performs a comprehensive health check of the etcd cluster in an OpenShift environment. It examines etcd member status, cluster health, leadership, connectivity, and identifies potential issues that could affect cluster stability.

Etcd is the critical key-value store that holds all cluster state for Kubernetes/OpenShift. Issues related to etcd can cause cluster-wide failures, so monitoring its health is essential.

This command is useful for:

- Diagnosing cluster control plane issues
- Verifying etcd cluster stability
- Identifying split-brain scenarios
- Checking member synchronization
- Detecting disk space issues
- Monitoring etcd performance

Prerequisites

Before using this command, ensure you have:

- OpenShift CLI (oc)
  - Install from: https://mirror.openshift.com/pub/openshift-v4/clients/ocp/
  - Verify with: oc version
- Active cluster connection
  - Must be connected to an OpenShift cluster
  - Verify with: oc whoami
- Cluster admin permissions
  - Required to access etcd pods and execute commands
  - Verify with: oc auth can-i get pods -n openshift-etcd
- Healthy etcd namespace
  - The openshift-etcd namespace must exist
  - At least one etcd pod must be running

Arguments

--verbose (optional): Enable detailed output
- Shows etcd member details
- Displays performance metrics
- Includes log snippets for errors
- Provides additional diagnostic information
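
The implementation steps below test a `VERBOSE` variable, but the document does not show how it gets set. A minimal sketch of the flag handling, assuming the command's arguments arrive as ordinary positional parameters; the `parse_args` name is hypothetical and not part of the command:

```shell
# Hypothetical helper: sets VERBOSE from the arguments passed to the command.
# Returns non-zero on any argument other than --verbose.
parse_args() {
    VERBOSE="false"
    for arg in "$@"; do
        case "$arg" in
            --verbose) VERBOSE="true" ;;
            *) echo "Unknown argument: $arg" >&2; return 1 ;;
        esac
    done
}

# Typical call at the top of the script:
#   parse_args "$@" || exit 1
```

Defaulting `VERBOSE` to "false" keeps the later `[ "$VERBOSE" = "true" ]` test well defined when no flag is given.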

Implementation

The command performs the following checks:

1. Verify Prerequisites

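Check that the oc CLI and jq (used by later steps to parse JSON) are available and that the cluster is accessible:

if ! command -v oc &> /dev/null; then
    echo "Error: oc CLI not found. Please install OpenShift CLI."
    exit 1
fi

# jq parses the JSON output in the steps below
if ! command -v jq &> /dev/null; then
    echo "Error: jq not found. Please install jq."
    exit 1
fi

if ! oc whoami &> /dev/null; then
    echo "Error: Not connected to an OpenShift cluster."
    exit 1
fi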

2. Check Etcd Namespace and Pods

Verify the etcd namespace exists and get pod status:

echo "Checking etcd namespace and pods..."

if ! oc get namespace openshift-etcd &> /dev/null; then
    echo "CRITICAL: openshift-etcd namespace not found"
    exit 1
fi

# Get etcd pod status
ETCD_PODS=$(oc get pods -n openshift-etcd -l app=etcd -o json)
TOTAL_PODS=$(echo "$ETCD_PODS" | jq '.items | length')
RUNNING_PODS=$(echo "$ETCD_PODS" | jq '[.items[] | select(.status.phase == "Running")] | length')

echo "Etcd pods: $RUNNING_PODS/$TOTAL_PODS running"

if [ "$RUNNING_PODS" -eq 0 ]; then
    echo "CRITICAL: No etcd pods are running"
    exit 1
fi

# List all etcd pods with status
echo ""
echo "Etcd Pod Status:"
# Quote the column spec so the shell does not treat [0] as a glob pattern
oc get pods -n openshift-etcd -l app=etcd -o custom-columns='NAME:.metadata.name,STATUS:.status.phase,READY:.status.containerStatuses[0].ready,RESTARTS:.status.containerStatuses[0].restartCount,NODE:.spec.nodeName'

3. Check Etcd Cluster Health

Use etcdctl from the first running etcd pod to check the health of every cluster endpoint:

echo ""
echo "Checking etcd cluster health..."

# Get the first running etcd pod (stderr is discarded so an empty result falls through to the check below)
ETCD_POD=$(oc get pods -n openshift-etcd -l app=etcd --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)

if [ -z "$ETCD_POD" ]; then
    echo "CRITICAL: No running etcd pod found"
    exit 1
fi

# Check cluster health. With -w table the output has a HEALTH column rather than
# the "is healthy" text, so rely on etcdctl's exit status instead of grepping.
if HEALTH_OUTPUT=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl endpoint health --cluster -w table 2>&1); then
    echo "Cluster Health Status:"
    echo "$HEALTH_OUTPUT"
else
    echo "CRITICAL: Etcd cluster health check failed"
    echo "$HEALTH_OUTPUT"
    exit 1
fi

4. Check Etcd Member List

List all etcd members and verify quorum:

echo ""
echo "Checking etcd member list..."

MEMBER_LIST=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl member list -w table 2>&1)

echo "Etcd Members:"
echo "$MEMBER_LIST"

# Count members
MEMBER_COUNT=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl member list -w json 2>/dev/null | jq '.members | length')

echo ""
echo "Total members: $MEMBER_COUNT"

# Default to 0 so the comparison does not fail if the member count could not be read
if [ "${MEMBER_COUNT:-0}" -lt 3 ]; then
    echo "WARNING: Etcd cluster has fewer than 3 members (quorum at risk)"
fi

# Check for unstarted members
UNSTARTED=$(echo "$MEMBER_LIST" | grep -c "unstarted")
if [ "$UNSTARTED" -gt 0 ]; then
    echo "WARNING: $UNSTARTED member(s) in unstarted state"
fi
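
The 3-member warning follows from Raft majority arithmetic: a cluster of n voting members needs floor(n/2) + 1 of them to agree, and can lose only the remainder. A small sketch of that calculation; the function names are illustrative and not part of the command:

```shell
# Majority needed for a cluster of $1 voting members: floor(n/2) + 1.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while a majority remains.
fault_tolerance() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 5; do
    echo "members=$n quorum=$(quorum_size "$n") tolerates=$(fault_tolerance "$n") failure(s)"
done
```

A 3-member cluster survives one failure, while a 2-member cluster survives none, which is why the check warns below 3.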

5. Check Etcd Leadership

Verify there is a healthy leader:

echo ""
echo "Checking etcd leadership..."

ENDPOINT_STATUS=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl endpoint status --cluster -w table 2>&1)

echo "Endpoint Status:"
echo "$ENDPOINT_STATUS"

# Check if there's a leader
if echo "$ENDPOINT_STATUS" | grep -q "true"; then
    LEADER_ENDPOINT=$(echo "$ENDPOINT_STATUS" | grep "true" | awk '{print $2}')
    echo ""
    echo "Leader: $LEADER_ENDPOINT"
else
    echo "CRITICAL: No etcd leader elected"
    exit 1
fi
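
Matching any "true" in the table can also pick up rows where only the IS LEARNER column is true. A sketch of a stricter parse that locates the IS LEADER column by its header; the `leader_from_table` helper and the sample table (addresses, IDs, version, sizes) are illustrative and not captured from a real cluster:

```shell
# Prints the ENDPOINT of every row whose "IS LEADER" column is true,
# reading an etcdctl-style table on stdin.
leader_from_table() {
    awk -F'|' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        /IS LEADER/ {
            # Header row: remember which fields hold the columns we need
            for (i = 1; i <= NF; i++) {
                h = trim($i)
                if (h == "ENDPOINT")  ep = i
                if (h == "IS LEADER") ld = i
            }
            next
        }
        ld && trim($ld) == "true" { print trim($ep) }'
}

SAMPLE='+--------------------------+------------------+---------+---------+-----------+------------+
|         ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER |
+--------------------------+------------------+---------+---------+-----------+------------+
| https://10.0.21.125:2379 | 3a2b1c4d5e6f7890 |  3.5.9  |  95 MB  |   true    |   false    |
| https://10.0.43.249:2379 | 4b3c2d5e6f708901 |  3.5.9  |  95 MB  |   false   |   true     |
+--------------------------+------------------+---------+---------+-----------+------------+'

LEADER_ENDPOINT=$(printf '%s\n' "$SAMPLE" | leader_from_table)
echo "Leader: $LEADER_ENDPOINT"
```

Looking the column up by name rather than by position keeps the parse working if other columns are added or reordered.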

6. Check Etcd Database Size

Check database size and fragmentation:

echo ""
echo "Checking etcd database size..."

# Get database size from endpoint status
DB_SIZE=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl endpoint status --cluster -w json 2>/dev/null)

echo "$DB_SIZE" | jq -r '.[] | "Endpoint: \(.Endpoint) | DB Size: \(.Status.dbSize) bytes | DB Size in Use: \(.Status.dbSizeInUse) bytes"'

# Calculate fragmentation percentage
echo "$DB_SIZE" | jq -r '.[] |
    if .Status.dbSize > 0 then
        "Fragmentation: \(((.Status.dbSize - .Status.dbSizeInUse) * 100 / .Status.dbSize) | floor)%"
    else
        "Fragmentation: N/A"
    end'

# Warn if any member's database is too large (checks the largest, not just the first endpoint)
MAX_DB_SIZE=$((8 * 1024 * 1024 * 1024))  # 8 GiB threshold
CURRENT_SIZE=$(echo "$DB_SIZE" | jq -r '[.[].Status.dbSize] | max // 0')

if [ "${CURRENT_SIZE:-0}" -gt "$MAX_DB_SIZE" ]; then
    echo "WARNING: Etcd database size ($CURRENT_SIZE bytes) exceeds recommended maximum (8 GiB)"
    echo "Consider defragmentation or checking for excessive key growth"
fi
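
The fragmentation figure is the share of the database file that is allocated but not in use. The same arithmetic in plain shell, as a sketch; `frag_pct` is a hypothetical helper and the byte values are made up for illustration:

```shell
# Percentage of the file not in use:
# (dbSize - dbSizeInUse) * 100 / dbSize, rounded down.
frag_pct() {
    db_size=$1
    in_use=$2
    if [ "$db_size" -gt 0 ]; then
        echo $(( (db_size - in_use) * 100 / db_size ))
    else
        echo "N/A"
    fi
}

# 100 MiB allocated, 75 MiB in use
echo "Fragmentation: $(frag_pct 104857600 78643200)%"
```

For the sample values this reports 25%, meaning a quarter of the file could be reclaimed by defragmentation.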

7. Check Disk Space on Etcd Nodes

Verify disk space on nodes running etcd:

echo ""
echo "Checking disk space on etcd nodes..."

for pod in $(oc get pods -n openshift-etcd -l app=etcd --field-selector=status.phase=Running -o jsonpath='{.items[*].metadata.name}'); do
    echo "Pod: $pod"

    # Run df once and reuse the result for display and for the threshold check
    DF_LINE=$(oc exec -n openshift-etcd "$pod" -c etcd -- df -h /var/lib/etcd | tail -1)
    echo "$DF_LINE"

    # Get disk usage percentage (defaults to 0 if df produced no output)
    DISK_USAGE=$(echo "$DF_LINE" | awk '{print $5}' | sed 's/%//')

    if [ "${DISK_USAGE:-0}" -gt 80 ]; then
        echo "WARNING: Disk usage on $pod is ${DISK_USAGE}% (threshold: 80%)"
    fi
    echo ""
done
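
The threshold check depends on pulling the Use% column out of the `df` output. A sketch of that parsing against a canned sample; the sample values and the `usage_from_df` name are illustrative only:

```shell
# Reads df output on stdin and prints the Use% value of the last line
# without the percent sign.
usage_from_df() {
    tail -1 | awk '{print $5}' | tr -d '%'
}

SAMPLE='Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p4  120G   31G   89G  26% /var/lib/etcd'

DISK_USAGE=$(printf '%s\n' "$SAMPLE" | usage_from_df)
if [ "$DISK_USAGE" -gt 80 ]; then
    echo "WARNING: disk usage is ${DISK_USAGE}%"
else
    echo "OK: disk usage is ${DISK_USAGE}%"
fi
```

Testing the parser on fixed input like this makes it easy to confirm the column index before running it against live pods.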

8. Check for Recent Etcd Errors

Check recent logs for errors or warnings:

echo ""
echo "Checking recent etcd logs for errors..."

RECENT_ERRORS=$(oc logs -n openshift-etcd "$ETCD_POD" -c etcd --tail=100 | grep -i "error\|warn\|fatal" | tail -10)

if [ -n "$RECENT_ERRORS" ]; then
    echo "Recent errors/warnings found:"
    echo "$RECENT_ERRORS"
else
    echo "No recent errors in etcd logs"
fi

9. Check Etcd Performance Metrics (if --verbose)

If verbose mode is enabled, check performance metrics:

if [ "$VERBOSE" = "true" ]; then
    echo ""
    echo "Checking etcd performance metrics..."

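    # Get metrics from etcd pod
    # Assumes the metrics endpoint is reachable at this URL from inside the container;
    # adjust the scheme, port, and certificates if your cluster differs.
    METRICS=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcd -- curl -s http://localhost:2379/metrics 2>/dev/null)

    # Parse key metrics. The disk duration metrics are Prometheus histograms,
    # so they carry no quantile label; print the _sum and _count series instead.
    echo "Backend Commit Duration (sum and count):"
    echo "$METRICS" | grep -E "^etcd_disk_backend_commit_duration_seconds_(sum|count)"

    echo ""
    echo "WAL Fsync Duration (sum and count):"
    echo "$METRICS" | grep -E "^etcd_disk_wal_fsync_duration_seconds_(sum|count)"

    echo ""
    echo "Leader Changes:"
    # Anchor the match so the "# HELP" and "# TYPE" comment lines are skipped
    echo "$METRICS" | grep "^etcd_server_leader_changes_seen_total" | head -1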
fi
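
etcd exposes its disk latency metrics as Prometheus histograms, so a quick per-pod figure that needs no query engine is the mean, `_sum / _count`. A sketch using awk on metrics text; the sample lines and the `metric_mean` helper are illustrative, not captured from a real cluster:

```shell
# Prints sum/count (in seconds) for the histogram named $1,
# reading Prometheus text format on stdin.
metric_mean() {
    awk -v name="$1" '
        $1 == (name "_sum")   { sum = $2 }
        $1 == (name "_count") { count = $2 }
        END {
            if (count > 0) printf "%.6f\n", sum / count
            else print "N/A"
        }'
}

SAMPLE='# TYPE etcd_disk_wal_fsync_duration_seconds histogram
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 900
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 1000
etcd_disk_wal_fsync_duration_seconds_sum 2.5
etcd_disk_wal_fsync_duration_seconds_count 1000'

printf '%s\n' "$SAMPLE" | metric_mean etcd_disk_wal_fsync_duration_seconds
```

The mean hides tail latency, so a true p99 still has to be estimated from the `_bucket` series, for example in the cluster's monitoring stack.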

10. Generate Summary Report

Create a summary of findings:

echo ""
echo "==============================================="
echo "Etcd Health Check Summary"
echo "==============================================="
echo "Check Time: $(date)"
echo "Cluster: $(oc whoami --show-server)"
echo ""
echo "Results:"
echo "  Etcd Pods Running: $RUNNING_PODS/$TOTAL_PODS"
echo "  Cluster Members: $MEMBER_COUNT"
echo "  Leader Elected: Yes"
echo "  Cluster Health: Healthy"
echo ""

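# WARNINGS must be incremented wherever a check above prints "WARNING:"; it defaults to 0 if unset
if [ "${WARNINGS:-0}" -gt 0 ]; then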
    echo "Status: WARNING - Found $WARNINGS warnings requiring attention"
    exit 0
else
    echo "Status: HEALTHY - All checks passed"
    exit 0
fi
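
The summary reads a `WARNINGS` counter that the earlier steps never increment. One way to maintain it, sketched with a hypothetical `warn` helper that each check could call instead of a bare `echo`:

```shell
WARNINGS=0

# Prints a warning line and counts it for the summary.
warn() {
    echo "WARNING: $*"
    WARNINGS=$((WARNINGS + 1))
}

warn "Etcd cluster has fewer than 3 members (quorum at risk)"
warn "1 member(s) in unstarted state"
echo "Warnings so far: $WARNINGS"
```

The helper must be called in the main shell rather than inside `$(...)` or a pipeline, because an increment made in a subshell is lost when the subshell exits.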

Return Value

The command returns different exit codes:

Exit 0: Etcd cluster is healthy (may have warnings)
Exit 1: Critical issues detected (no running pods, no leader, health check failed)

Output Format:

Human-readable report with section headers
Critical issues marked with "CRITICAL:"
Warnings marked with "WARNING:"
Success indicators for healthy checks

Examples

Example 1: Basic health check

/etcd:health-check

Output:

Checking etcd namespace and pods...
Etcd pods: 3/3 running

Etcd Pod Status:
NAME                                     STATUS    READY  RESTARTS  NODE
etcd-ip-10-0-21-125.us-east-2...        Running   true   0         ip-10-0-21-125
etcd-ip-10-0-43-249.us-east-2...        Running   true   0         ip-10-0-43-249
etcd-ip-10-0-68-109.us-east-2...        Running   true   0         ip-10-0-68-109

Checking etcd cluster health...
Cluster Health Status:
+------------------------------------------+--------+
|                ENDPOINT                  | HEALTH |
+------------------------------------------+--------+
| https://10.0.21.125:2379                | true   |
| https://10.0.43.249:2379                | true   |
| https://10.0.68.109:2379                | true   |
+------------------------------------------+--------+

Checking etcd member list...
Etcd Members:
+------------------+---------+------------------------+
|        ID        | STATUS  |          NAME          |
+------------------+---------+------------------------+
| 3a2b1c4d5e6f7890 | started | ip-10-0-21-125         |
| 4b3c2d5e6f708901 | started | ip-10-0-43-249         |
| 5c4d3e6f70890123 | started | ip-10-0-68-109         |
+------------------+---------+------------------------+

Total members: 3

Checking etcd leadership...
Leader: https://10.0.21.125:2379

===============================================
Etcd Health Check Summary
===============================================
Status: HEALTHY - All checks passed

Example 2: Verbose health check with metrics

/etcd:health-check --verbose

Common Issues and Remediation

No Etcd Leader

Symptoms: Cluster shows no leader elected

Investigation:

oc logs -n openshift-etcd <etcd-pod> -c etcd | grep -i "leader"
oc get events -n openshift-etcd

Remediation:

Check network connectivity between etcd members
Verify etcd pods are running on different nodes
Check for clock skew between nodes

High Database Size

Symptoms: Database size exceeds 8GB

Investigation:

oc exec -n openshift-etcd <etcd-pod> -c etcdctl -- etcdctl endpoint status -w table

Remediation:

Run defragmentation: /etcd:defrag (if command exists)
Check for excessive key creation (e.g., many events)
Review retention policies

Disk Space Issues

Symptoms: Disk usage > 80% on etcd data directory

Investigation:

oc exec -n openshift-etcd <etcd-pod> -c etcd -- df -h /var/lib/etcd

Remediation:

Clean up old snapshots
Defragment database
Increase disk size if needed

Member Not Started

Symptoms: Member shows "unstarted" status

Investigation:

oc logs -n openshift-etcd <etcd-pod> -c etcd
oc describe pod -n openshift-etcd <etcd-pod>

Remediation:

Check pod logs for errors
Verify certificates are valid
Check network policies and firewall rules

Security Considerations

Requires cluster-admin or equivalent permissions
Access to etcd data allows viewing all cluster secrets
Etcd metrics may contain sensitive information
Always use secure connections when accessing etcd

Notes

This command is read-only and does not modify etcd
Checks are performed from within etcd pods using etcdctl
Some checks require etcd to be running
Performance may vary on large clusters with many keys
Database size recommendations are based on upstream etcd guidance
