Initial commit
This commit is contained in:
460
commands/health-check.md
Normal file
460
commands/health-check.md
Normal file
@@ -0,0 +1,460 @@
|
||||
---
|
||||
description: Check etcd cluster health, member status, and identify issues
|
||||
argument-hint: "[--verbose]"
|
||||
---
|
||||
|
||||
## Name
|
||||
etcd:health-check
|
||||
|
||||
## Synopsis
|
||||
```
|
||||
/etcd:health-check [--verbose]
|
||||
```
|
||||
|
||||
## Description
|
||||
|
||||
The `health-check` command performs a comprehensive health check of the etcd cluster in an OpenShift environment. It examines etcd member status, cluster health, leadership, connectivity, and identifies potential issues that could affect cluster stability.
|
||||
|
||||
Etcd is the critical key-value store that holds all cluster state for Kubernetes/OpenShift. Issues related to etcd can cause cluster-wide failures, so monitoring its health is essential.
|
||||
|
||||
This command is useful for:
|
||||
- Diagnosing cluster control plane issues
|
||||
- Verifying etcd cluster stability
|
||||
- Identifying split-brain scenarios
|
||||
- Checking member synchronization
|
||||
- Detecting disk space issues
|
||||
- Monitoring etcd performance
|
||||
|
||||
## Prerequisites
|
||||
|
||||
Before using this command, ensure you have:
|
||||
|
||||
1. **OpenShift CLI (oc)**
|
||||
- Install from: https://mirror.openshift.com/pub/openshift-v4/clients/ocp/
|
||||
- Verify with: `oc version`
|
||||
|
||||
2. **Active cluster connection**
|
||||
- Must be connected to an OpenShift cluster
|
||||
- Verify with: `oc whoami`
|
||||
|
||||
3. **Cluster admin permissions**
|
||||
- Required to access etcd pods and execute commands
|
||||
- Verify with: `oc auth can-i get pods -n openshift-etcd`
|
||||
|
||||
4. **Healthy etcd namespace**
|
||||
- The openshift-etcd namespace must exist
|
||||
- At least one etcd pod must be running
|
||||
|
||||
## Arguments
|
||||
|
||||
- **--verbose** (optional): Enable detailed output
|
||||
- Shows etcd member details
|
||||
- Displays performance metrics
|
||||
- Includes log snippets for errors
|
||||
- Provides additional diagnostic information
|
||||
|
||||
## Implementation
|
||||
|
||||
The command performs the following checks:
|
||||
|
||||
### 1. Verify Prerequisites
|
||||
|
||||
Check if oc CLI is available and cluster is accessible:
|
||||
|
||||
```bash
|
||||
if ! command -v oc &> /dev/null; then
|
||||
echo "Error: oc CLI not found. Please install OpenShift CLI."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if ! oc whoami &> /dev/null; then
|
||||
echo "Error: Not connected to an OpenShift cluster."
|
||||
exit 1
|
||||
fi
|
||||
```
|
||||
|
||||
### 2. Check Etcd Namespace and Pods
|
||||
|
||||
Verify the etcd namespace exists and get pod status:
|
||||
|
||||
```bash
|
||||
echo "Checking etcd namespace and pods..."
|
||||
|
||||
if ! oc get namespace openshift-etcd &> /dev/null; then
|
||||
echo "CRITICAL: openshift-etcd namespace not found"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Get etcd pod status
|
||||
ETCD_PODS=$(oc get pods -n openshift-etcd -l app=etcd -o json)
|
||||
TOTAL_PODS=$(echo "$ETCD_PODS" | jq '.items | length')
|
||||
RUNNING_PODS=$(echo "$ETCD_PODS" | jq '[.items[] | select(.status.phase == "Running")] | length')
|
||||
|
||||
echo "Etcd pods: $RUNNING_PODS/$TOTAL_PODS running"
|
||||
|
||||
if [ "$RUNNING_PODS" -eq 0 ]; then
|
||||
echo "CRITICAL: No etcd pods are running"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# List all etcd pods with status
|
||||
echo ""
|
||||
echo "Etcd Pod Status:"
|
||||
oc get pods -n openshift-etcd -l app=etcd -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,READY:.status.containerStatuses[0].ready,RESTARTS:.status.containerStatuses[0].restartCount,NODE:.spec.nodeName
|
||||
```
|
||||
|
||||
### 3. Check Etcd Cluster Health
|
||||
|
||||
Use etcdctl to check cluster health from each running etcd pod:
|
||||
|
||||
```bash
|
||||
echo ""
|
||||
echo "Checking etcd cluster health..."
|
||||
|
||||
# Get the first running etcd pod
|
||||
ETCD_POD=$(oc get pods -n openshift-etcd -l app=etcd --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}')
|
||||
|
||||
if [ -z "$ETCD_POD" ]; then
|
||||
echo "CRITICAL: No running etcd pod found"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Check cluster health
|
||||
HEALTH_OUTPUT=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl endpoint health --cluster -w table 2>&1)
|
||||
|
||||
if echo "$HEALTH_OUTPUT" | grep -q "is healthy"; then
|
||||
echo "Cluster Health Status:"
|
||||
echo "$HEALTH_OUTPUT"
|
||||
else
|
||||
echo "CRITICAL: Etcd cluster health check failed"
|
||||
echo "$HEALTH_OUTPUT"
|
||||
exit 1
|
||||
fi
|
||||
```
|
||||
|
||||
### 4. Check Etcd Member List
|
||||
|
||||
List all etcd members and verify quorum:
|
||||
|
||||
```bash
|
||||
echo ""
|
||||
echo "Checking etcd member list..."
|
||||
|
||||
MEMBER_LIST=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl member list -w table 2>&1)
|
||||
|
||||
echo "Etcd Members:"
|
||||
echo "$MEMBER_LIST"
|
||||
|
||||
# Count members
|
||||
MEMBER_COUNT=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl member list -w json 2>/dev/null | jq '.members | length')
|
||||
|
||||
echo ""
|
||||
echo "Total members: $MEMBER_COUNT"
|
||||
|
||||
if [ "$MEMBER_COUNT" -lt 3 ]; then
|
||||
echo "WARNING: Etcd cluster has less than 3 members (quorum at risk)"
|
||||
fi
|
||||
|
||||
# Check for unstarted members
|
||||
UNSTARTED=$(echo "$MEMBER_LIST" | grep "unstarted" | wc -l)
|
||||
if [ "$UNSTARTED" -gt 0 ]; then
|
||||
echo "WARNING: $UNSTARTED member(s) in unstarted state"
|
||||
fi
|
||||
```
|
||||
|
||||
### 5. Check Etcd Leadership
|
||||
|
||||
Verify there is a healthy leader:
|
||||
|
||||
```bash
|
||||
echo ""
|
||||
echo "Checking etcd leadership..."
|
||||
|
||||
ENDPOINT_STATUS=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl endpoint status --cluster -w table 2>&1)
|
||||
|
||||
echo "Endpoint Status:"
|
||||
echo "$ENDPOINT_STATUS"
|
||||
|
||||
# Check if there's a leader
|
||||
if echo "$ENDPOINT_STATUS" | grep -q "true"; then
|
||||
LEADER_ENDPOINT=$(echo "$ENDPOINT_STATUS" | grep "true" | awk '{print $2}')
|
||||
echo ""
|
||||
echo "Leader: $LEADER_ENDPOINT"
|
||||
else
|
||||
echo "CRITICAL: No etcd leader elected"
|
||||
exit 1
|
||||
fi
|
||||
```
|
||||
|
||||
### 6. Check Etcd Database Size
|
||||
|
||||
Check database size and fragmentation:
|
||||
|
||||
```bash
|
||||
echo ""
|
||||
echo "Checking etcd database size..."
|
||||
|
||||
# Get database size from endpoint status
|
||||
DB_SIZE=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl endpoint status --cluster -w json 2>/dev/null)
|
||||
|
||||
echo "$DB_SIZE" | jq -r '.[] | "Endpoint: \(.Endpoint) | DB Size: \(.Status.dbSize) bytes | DB Size in Use: \(.Status.dbSizeInUse) bytes"'
|
||||
|
||||
# Calculate fragmentation percentage
|
||||
echo "$DB_SIZE" | jq -r '.[] |
|
||||
if .Status.dbSize > 0 then
|
||||
"Fragmentation: \(((.Status.dbSize - .Status.dbSizeInUse) * 100 / .Status.dbSize) | floor)%"
|
||||
else
|
||||
"Fragmentation: N/A"
|
||||
end'
|
||||
|
||||
# Warn if database is too large
|
||||
MAX_DB_SIZE=$((8 * 1024 * 1024 * 1024)) # 8GB threshold
|
||||
CURRENT_SIZE=$(echo "$DB_SIZE" | jq -r '.[0].Status.dbSize')
|
||||
|
||||
if [ "$CURRENT_SIZE" -gt "$MAX_DB_SIZE" ]; then
|
||||
echo "WARNING: Etcd database size ($CURRENT_SIZE bytes) exceeds recommended maximum (8GB)"
|
||||
echo "Consider defragmentation or checking for excessive key growth"
|
||||
fi
|
||||
```
|
||||
|
||||
### 7. Check Disk Space on Etcd Nodes
|
||||
|
||||
Verify disk space on nodes running etcd:
|
||||
|
||||
```bash
|
||||
echo ""
|
||||
echo "Checking disk space on etcd nodes..."
|
||||
|
||||
for pod in $(oc get pods -n openshift-etcd -l app=etcd --field-selector=status.phase=Running -o jsonpath='{.items[*].metadata.name}'); do
|
||||
echo "Pod: $pod"
|
||||
oc exec -n openshift-etcd "$pod" -c etcd -- df -h /var/lib/etcd | tail -1
|
||||
|
||||
# Get disk usage percentage
|
||||
DISK_USAGE=$(oc exec -n openshift-etcd "$pod" -c etcd -- df -h /var/lib/etcd | tail -1 | awk '{print $5}' | sed 's/%//')
|
||||
|
||||
if [ "$DISK_USAGE" -gt 80 ]; then
|
||||
echo "WARNING: Disk usage on $pod is ${DISK_USAGE}% (threshold: 80%)"
|
||||
fi
|
||||
echo ""
|
||||
done
|
||||
```
|
||||
|
||||
### 8. Check for Recent Etcd Errors
|
||||
|
||||
Check recent logs for errors or warnings:
|
||||
|
||||
```bash
|
||||
echo ""
|
||||
echo "Checking recent etcd logs for errors..."
|
||||
|
||||
RECENT_ERRORS=$(oc logs -n openshift-etcd "$ETCD_POD" -c etcd --tail=100 | grep -i "error\|warn\|fatal" | tail -10)
|
||||
|
||||
if [ -n "$RECENT_ERRORS" ]; then
|
||||
echo "Recent errors/warnings found:"
|
||||
echo "$RECENT_ERRORS"
|
||||
else
|
||||
echo "No recent errors in etcd logs"
|
||||
fi
|
||||
```
|
||||
|
||||
### 9. Check Etcd Performance Metrics (if --verbose)
|
||||
|
||||
If verbose mode is enabled, check performance metrics:
|
||||
|
||||
```bash
|
||||
if [ "$VERBOSE" = "true" ]; then
|
||||
echo ""
|
||||
echo "Checking etcd performance metrics..."
|
||||
|
||||
# Get metrics from etcd pod
|
||||
METRICS=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcd -- curl -s http://localhost:2379/metrics 2>/dev/null)
|
||||
|
||||
# Parse key metrics
|
||||
echo "Backend Commit Duration (p99):"
|
||||
echo "$METRICS" | grep "etcd_disk_backend_commit_duration_seconds" | grep "quantile=\"0.99\"" | head -1
|
||||
|
||||
echo ""
|
||||
echo "WAL Fsync Duration (p99):"
|
||||
echo "$METRICS" | grep "etcd_disk_wal_fsync_duration_seconds" | grep "quantile=\"0.99\"" | head -1
|
||||
|
||||
echo ""
|
||||
echo "Leader Changes:"
|
||||
echo "$METRICS" | grep "etcd_server_leader_changes_seen_total" | head -1
|
||||
fi
|
||||
```
|
||||
|
||||
### 10. Generate Summary Report
|
||||
|
||||
Create a summary of findings:
|
||||
|
||||
```bash
|
||||
echo ""
|
||||
echo "==============================================="
|
||||
echo "Etcd Health Check Summary"
|
||||
echo "==============================================="
|
||||
echo "Check Time: $(date)"
|
||||
echo "Cluster: $(oc whoami --show-server)"
|
||||
echo ""
|
||||
echo "Results:"
|
||||
echo " Etcd Pods Running: $RUNNING_PODS/$TOTAL_PODS"
|
||||
echo " Cluster Members: $MEMBER_COUNT"
|
||||
echo " Leader Elected: Yes"
|
||||
echo " Cluster Health: Healthy"
|
||||
echo ""
|
||||
|
||||
if [ "$WARNINGS" -gt 0 ]; then
|
||||
echo "Status: WARNING - Found $WARNINGS warnings requiring attention"
|
||||
exit 0
|
||||
else
|
||||
echo "Status: HEALTHY - All checks passed"
|
||||
exit 0
|
||||
fi
|
||||
```
|
||||
|
||||
## Return Value
|
||||
|
||||
The command returns different exit codes:
|
||||
|
||||
- **Exit 0**: Etcd cluster is healthy (may have warnings)
|
||||
- **Exit 1**: Critical issues detected (no running pods, no leader, health check failed)
|
||||
|
||||
**Output Format**:
|
||||
- Human-readable report with section headers
|
||||
- Critical issues marked with "CRITICAL:"
|
||||
- Warnings marked with "WARNING:"
|
||||
- Success indicators for healthy checks
|
||||
|
||||
## Examples
|
||||
|
||||
### Example 1: Basic health check
|
||||
```
|
||||
/etcd:health-check
|
||||
```
|
||||
|
||||
Output:
|
||||
```
|
||||
Checking etcd namespace and pods...
|
||||
Etcd pods: 3/3 running
|
||||
|
||||
Etcd Pod Status:
|
||||
NAME STATUS READY RESTARTS NODE
|
||||
etcd-ip-10-0-21-125.us-east-2... Running true 0 ip-10-0-21-125
|
||||
etcd-ip-10-0-43-249.us-east-2... Running true 0 ip-10-0-43-249
|
||||
etcd-ip-10-0-68-109.us-east-2... Running true 0 ip-10-0-68-109
|
||||
|
||||
Checking etcd cluster health...
|
||||
Cluster Health Status:
|
||||
+------------------------------------------+--------+
|
||||
| ENDPOINT | HEALTH |
|
||||
+------------------------------------------+--------+
|
||||
| https://10.0.21.125:2379 | true |
|
||||
| https://10.0.43.249:2379 | true |
|
||||
| https://10.0.68.109:2379 | true |
|
||||
+------------------------------------------+--------+
|
||||
|
||||
Checking etcd member list...
|
||||
Etcd Members:
|
||||
+------------------+---------+------------------------+
|
||||
| ID | STATUS | NAME |
|
||||
+------------------+---------+------------------------+
|
||||
| 3a2b1c4d5e6f7890 | started | ip-10-0-21-125 |
|
||||
| 4b3c2d5e6f708901 | started | ip-10-0-43-249 |
|
||||
| 5c4d3e6f70890123 | started | ip-10-0-68-109 |
|
||||
+------------------+---------+------------------------+
|
||||
|
||||
Total members: 3
|
||||
|
||||
Checking etcd leadership...
|
||||
Leader: https://10.0.21.125:2379
|
||||
|
||||
===============================================
|
||||
Etcd Health Check Summary
|
||||
===============================================
|
||||
Status: HEALTHY - All checks passed
|
||||
```
|
||||
|
||||
### Example 2: Verbose health check with metrics
|
||||
```
|
||||
/etcd:health-check --verbose
|
||||
```
|
||||
|
||||
## Common Issues and Remediation
|
||||
|
||||
### No Etcd Leader
|
||||
|
||||
**Symptoms**: Cluster shows no leader elected
|
||||
|
||||
**Investigation**:
|
||||
```bash
|
||||
oc logs -n openshift-etcd <etcd-pod> -c etcd | grep -i "leader"
|
||||
oc get events -n openshift-etcd
|
||||
```
|
||||
|
||||
**Remediation**:
|
||||
- Check network connectivity between etcd members
|
||||
- Verify etcd pods are running on different nodes
|
||||
- Check for clock skew between nodes
|
||||
|
||||
### High Database Size
|
||||
|
||||
**Symptoms**: Database size exceeds 8GB
|
||||
|
||||
**Investigation**:
|
||||
```bash
|
||||
oc exec -n openshift-etcd <etcd-pod> -c etcdctl -- etcdctl endpoint status -w table
|
||||
```
|
||||
|
||||
**Remediation**:
|
||||
- Run defragmentation: `/etcd:defrag` (if command exists)
|
||||
- Check for excessive key creation (e.g., many events)
|
||||
- Review retention policies
|
||||
|
||||
### Disk Space Issues
|
||||
|
||||
**Symptoms**: Disk usage > 80% on etcd data directory
|
||||
|
||||
**Investigation**:
|
||||
```bash
|
||||
oc exec -n openshift-etcd <etcd-pod> -c etcd -- df -h /var/lib/etcd
|
||||
```
|
||||
|
||||
**Remediation**:
|
||||
- Clean up old snapshots
|
||||
- Defragment database
|
||||
- Increase disk size if needed
|
||||
|
||||
### Member Not Started
|
||||
|
||||
**Symptoms**: Member shows "unstarted" status
|
||||
|
||||
**Investigation**:
|
||||
```bash
|
||||
oc logs -n openshift-etcd <etcd-pod> -c etcd
|
||||
oc describe pod -n openshift-etcd <etcd-pod>
|
||||
```
|
||||
|
||||
**Remediation**:
|
||||
- Check pod logs for errors
|
||||
- Verify certificates are valid
|
||||
- Check network policies and firewall rules
|
||||
|
||||
## Security Considerations
|
||||
|
||||
- Requires cluster-admin or equivalent permissions
|
||||
- Access to etcd data allows viewing all cluster secrets
|
||||
- Etcd metrics may contain sensitive information
|
||||
- Always use secure connections when accessing etcd
|
||||
|
||||
## See Also
|
||||
|
||||
- Etcd documentation: https://etcd.io/docs/
|
||||
- OpenShift etcd docs: https://docs.openshift.com/container-platform/latest/backup_and_restore/control_plane_backup_and_restore/
|
||||
- Related commands: `/etcd:analyze-performance`
|
||||
|
||||
## Notes
|
||||
|
||||
- This command is read-only and does not modify etcd
|
||||
- Checks are performed from within etcd pods using etcdctl
|
||||
- Some checks require etcd to be running
|
||||
- Performance may vary on large clusters with many keys
|
||||
- Database size recommendations are based on upstream etcd guidance
|
||||
Reference in New Issue
Block a user