Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:45:51 +08:00
commit 411f0d3063
5 changed files with 1125 additions and 0 deletions

@@ -0,0 +1,602 @@
---
description: Analyze etcd performance metrics, latency, and identify bottlenecks
argument-hint: "[--duration <minutes>]"
---
## Name
etcd:analyze-performance
## Synopsis
```
/etcd:analyze-performance [--duration <minutes>]
```
## Description
The `analyze-performance` command analyzes etcd performance to identify latency issues, slow operations, and potential bottlenecks. It examines database size and fragmentation, disk and commit latency warnings, leader stability, and network-related errors, and it provides recommendations for optimization.
Etcd performance is critical for cluster responsiveness. Slow etcd operations can cause:
- API server timeouts
- Slow pod creation and updates
- Controller delays
- Overall cluster sluggishness
This command is useful for:
- Diagnosing slow cluster operations
- Identifying disk I/O bottlenecks
- Detecting network latency issues
- Capacity planning
- Performance tuning
## Prerequisites
Before using this command, ensure you have:
1. **OpenShift CLI (oc)**
- Install from: https://mirror.openshift.com/pub/openshift-v4/clients/ocp/
- Verify with: `oc version`
2. **Active cluster connection**
- Must be connected to an OpenShift cluster
- Verify with: `oc whoami`
3. **Cluster admin permissions**
- Required to access etcd pods and metrics
- Verify with: `oc auth can-i get pods -n openshift-etcd`
4. **Running etcd pods**
- At least one etcd pod must be running
- Check with: `oc get pods -n openshift-etcd -l app=etcd`
## Arguments
- **--duration** (optional): Duration in minutes to analyze logs (default: 5)
- Analyzes recent logs for the specified duration
- Longer durations provide more comprehensive analysis
- Example: `--duration 15` for 15-minute window
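The `--duration` value is passed straight into `oc logs --since="${DURATION}m"`, so it is worth rejecting anything that is not a positive whole number first. The following is a minimal sketch; `parse_duration` is a hypothetical helper and is not part of the command as written.

```bash
# Hypothetical helper (not in the original command): print the value only if
# it is a positive whole number of minutes, otherwise return non-zero.
parse_duration() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;    # empty, or contains a non-digit
    esac
    [ "$1" -gt 0 ] || return 1      # reject zero
    printf '%s\n' "$1"
}

# Usage: fall back to the documented default of 5 when the value is invalid
DURATION=$(parse_duration "15m") || DURATION=5
echo "Analyzing last $DURATION minutes"
```

Falling back to the default is one possible policy; exiting with an error message would be equally reasonable.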
## Implementation
The command performs the following analysis:
### 1. Verify Prerequisites
```bash
if ! command -v oc &> /dev/null; then
echo "Error: oc CLI not found"
exit 1
fi
if ! oc whoami &> /dev/null; then
echo "Error: Not connected to cluster"
exit 1
fi
# Parse duration argument (default: 5 minutes)
DURATION=5
if [[ "$1" == "--duration" ]] && [[ -n "$2" ]]; then
DURATION=$2
fi
echo "Analyzing etcd performance (last $DURATION minutes)..."
```
### 2. Get Running Etcd Pod
```bash
ETCD_POD=$(oc get pods -n openshift-etcd -l app=etcd --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}')
if [ -z "$ETCD_POD" ]; then
echo "Error: No running etcd pod found"
exit 1
fi
echo "Using etcd pod: $ETCD_POD"
echo ""
```
### 3. Analyze Database Performance
Get database statistics using etcdctl:
```bash
echo "==============================================="
echo "DATABASE PERFORMANCE ANALYSIS"
echo "==============================================="
echo ""
echo "Fetching database statistics..."
# Get database sizes from endpoint status
DB_STATUS=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl endpoint status --cluster -w json 2>/dev/null)
echo "Database Statistics:"
echo "$DB_STATUS" | jq -r '.[] |
"Endpoint: \(.Endpoint)
Version: \(.Status.version)
DB Size: \(.Status.dbSize) bytes (\((.Status.dbSize / 1024 / 1024) | floor)MB)
DB In Use: \(.Status.dbSizeInUse) bytes (\((.Status.dbSizeInUse / 1024 / 1024) | floor)MB)
Revision: \(.Status.header.revision)
Raft Index: \(.Status.raftIndex)
Raft Term: \(.Status.raftTerm)
Leader: \(if .Status.leader == .Status.header.member_id then "YES" else "NO" end)
"'
echo ""
echo "Fragmentation Analysis:"
echo "$DB_STATUS" | jq -r '.[] |
if .Status.dbSize > 0 then
((.Status.dbSize - .Status.dbSizeInUse) * 100 / .Status.dbSize) as $frag |
"Endpoint: \(.Endpoint)
Fragmentation: \($frag | floor)%" +
if $frag > 50 then
" - WARNING: High fragmentation detected, consider defragmentation"
elif $frag > 30 then
" - NOTICE: Moderate fragmentation"
else
" - OK"
end
else
"Endpoint: \(.Endpoint)
Fragmentation: N/A"
end'
```
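As a worked example of the fragmentation formula, `(dbSize - dbSizeInUse) * 100 / dbSize`, the sketch below applies it to the first endpoint from Example 1 in this document. The helper names `frag_pct` and `frag_level` are hypothetical. Note one difference from the jq version: shell integer division truncates before the threshold comparison, whereas the jq expression compares the unrounded value.

```bash
# Hypothetical helpers mirroring the jq expression with shell integer arithmetic.
# frag_pct DB_SIZE DB_SIZE_IN_USE -> whole-number fragmentation percentage
frag_pct() {
    if [ "$1" -le 0 ]; then
        echo "N/A"
        return
    fi
    echo $(( ($1 - $2) * 100 / $1 ))
}

# frag_level PERCENT -> same thresholds as the analysis (>50 warning, >30 notice)
frag_level() {
    if [ "$1" -gt 50 ]; then echo "WARNING"
    elif [ "$1" -gt 30 ]; then echo "NOTICE"
    else echo "OK"
    fi
}

# dbSize and dbSizeInUse taken from the first endpoint in Example 1
pct=$(frag_pct 94941184 51789824)
echo "Fragmentation: ${pct}% - $(frag_level "$pct")"
```

With those inputs the difference is 43151360 bytes, which is 45% of the database file, so the endpoint lands in the NOTICE band.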
### 4. Check Cluster Health
Verify etcd cluster health:
```bash
echo ""
echo "==============================================="
echo "CLUSTER HEALTH"
echo "==============================================="
echo ""
oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl endpoint health --cluster 2>/dev/null || echo "Health check failed"
```
### 5. Analyze Logs for Performance Issues
Parse etcd logs for performance warnings:
```bash
echo ""
echo "==============================================="
echo "LOG ANALYSIS (Last $DURATION minutes)"
echo "==============================================="
echo ""
echo "Searching for performance-related warnings..."
# Get recent logs
LOGS=$(oc logs -n openshift-etcd "$ETCD_POD" -c etcd --since="${DURATION}m" 2>/dev/null)
# Count slow operations
SLOW_OPS=$(echo "$LOGS" | grep -i "slow" | wc -l)
echo "Slow operations logged: $SLOW_OPS"
if [ "$SLOW_OPS" -gt 0 ]; then
echo ""
echo "Recent slow operations (last 10):"
echo "$LOGS" | grep -i "slow" | tail -10
fi
echo ""
# Check for disk warnings
DISK_WARNINGS=$(echo "$LOGS" | grep -iE "disk|fdatasync|fsync" | grep -iE "slow|took|latency" | wc -l)
echo "Disk-related warnings: $DISK_WARNINGS"
if [ "$DISK_WARNINGS" -gt 0 ]; then
echo ""
echo "Disk performance warnings:"
echo "$LOGS" | grep -iE "disk|fdatasync|fsync" | grep -iE "slow|took|latency" | tail -5
fi
echo ""
# Check for apply warnings
APPLY_WARNINGS=$(echo "$LOGS" | grep -iE "apply.*took|slow.*apply" | wc -l)
echo "Apply operation warnings: $APPLY_WARNINGS"
if [ "$APPLY_WARNINGS" -gt 0 ]; then
echo ""
echo "Apply warnings:"
echo "$LOGS" | grep -iE "apply.*took|slow.*apply" | tail -5
fi
echo ""
# Check for compaction info
echo "Recent compaction operations:"
echo "$LOGS" | grep "finished scheduled compaction" | tail -3
if ! echo "$LOGS" | grep -q "finished scheduled compaction"; then
echo " No compaction operations in this time window"
fi
echo ""
# Check for snapshot operations
echo "Snapshot operations:"
SNAPSHOTS=$(echo "$LOGS" | grep -i "snapshot" | wc -l)
echo "Snapshot events: $SNAPSHOTS"
if [ "$SNAPSHOTS" -gt 0 ]; then
echo "$LOGS" | grep -i "snapshot" | tail -3
fi
```
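The log checks above all follow the same pattern: filter the captured log text and count the matching lines. A possible way to factor that out is sketched below. `count_matches` is a hypothetical helper, and the sample log lines are made up for illustration rather than captured from a cluster.

```bash
# Hypothetical helper: count lines of $1 matching extended regex $2, ignoring case.
# grep -c still prints 0 when nothing matches but exits 1, hence "|| true".
count_matches() {
    printf '%s\n' "$1" | grep -ciE "$2" || true
}

# Made-up sample log lines, for illustration only
SAMPLE_LOGS='{"level":"warn","msg":"slow fdatasync","took":"1.2s"}
{"level":"warn","msg":"apply request took too long","took":"350ms"}
{"level":"info","msg":"finished scheduled compaction","took":"175ms"}'

echo "Slow operations logged: $(count_matches "$SAMPLE_LOGS" 'slow')"
echo "Apply operation warnings: $(count_matches "$SAMPLE_LOGS" 'apply.*took|slow.*apply')"
echo "Leader change events: $(count_matches "$SAMPLE_LOGS" 'became leader')"
```

Using `grep -c` avoids the extra `wc -l` process. A broad pattern such as `took` matches every sample line because each one carries a `took` field, which is why the original checks combine two filters for disk warnings.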
### 6. Analyze Leader Stability
Check for leader changes and stability issues:
```bash
echo ""
echo "==============================================="
echo "LEADER STABILITY ANALYSIS"
echo "==============================================="
echo ""
LEADER_CHANGES=$(echo "$LOGS" | grep -i "leader.*changed\|became leader\|lost leader" | wc -l)
echo "Leader change events: $LEADER_CHANGES"
if [ "$LEADER_CHANGES" -gt 0 ]; then
echo ""
echo "Leader change events:"
echo "$LOGS" | grep -i "leader.*changed\|became leader\|lost leader"
fi
# Check for proposal/commit issues
echo ""
echo "Proposal and commit operations:"
PROPOSAL_LOGS=$(echo "$LOGS" | grep -iE "proposal|commit" | grep -iE "slow|took|failed" | wc -l)
echo "Slow proposal/commit operations: $PROPOSAL_LOGS"
if [ "$PROPOSAL_LOGS" -gt 0 ]; then
echo ""
echo "Sample slow operations:"
echo "$LOGS" | grep -iE "proposal|commit" | grep -iE "slow|took|failed" | tail -5
fi
```
### 7. Analyze Network Performance
Check for network-related issues:
```bash
echo ""
echo "==============================================="
echo "NETWORK ANALYSIS"
echo "==============================================="
echo ""
NETWORK_ISSUES=$(echo "$LOGS" | grep -iE "network|connection|timeout|peer" | grep -iE "error|fail|slow" | wc -l)
echo "Network-related issues: $NETWORK_ISSUES"
if [ "$NETWORK_ISSUES" -gt 0 ]; then
echo ""
echo "Network issues:"
echo "$LOGS" | grep -iE "network|connection|timeout|peer" | grep -iE "error|fail|slow" | tail -5
fi
```
### 8. Generate Performance Summary
Create summary with recommendations:
```bash
echo ""
echo "==============================================="
echo "PERFORMANCE SUMMARY & RECOMMENDATIONS"
echo "==============================================="
echo ""
ISSUES=0
WARNINGS=0
# Check fragmentation from DB status (floored to a whole percent so the messages below print e.g. "45%")
MAX_FRAG=$(echo "$DB_STATUS" | jq -r '[.[] | if .Status.dbSize > 0 then ((.Status.dbSize - .Status.dbSizeInUse) * 100 / .Status.dbSize) else 0 end] | max | floor')
if (( $(echo "$MAX_FRAG > 50" | bc -l 2>/dev/null || echo 0) )); then
echo "ISSUE: High database fragmentation (${MAX_FRAG}%)"
echo " Recommendation: Run defragmentation on all etcd members"
echo " Command: oc exec -n openshift-etcd <pod> -c etcdctl -- etcdctl defrag"
echo ""
ISSUES=$((ISSUES + 1))
elif (( $(echo "$MAX_FRAG > 30" | bc -l 2>/dev/null || echo 0) )); then
echo "WARNING: Moderate database fragmentation (${MAX_FRAG}%)"
echo " Recommendation: Monitor and consider defragmentation if performance degrades"
echo ""
WARNINGS=$((WARNINGS + 1))
fi
if [ "$LEADER_CHANGES" -gt 5 ]; then
echo "WARNING: Frequent leader changes ($LEADER_CHANGES in last ${DURATION}m)"
echo " Recommendation: Check network stability between etcd nodes"
echo " - Verify network latency between control plane nodes"
echo " - Check for packet loss or network congestion"
echo ""
WARNINGS=$((WARNINGS + 1))
fi
if [ "$SLOW_OPS" -gt 10 ]; then
echo "WARNING: High number of slow operations ($SLOW_OPS in last ${DURATION}m)"
echo " Recommendation: Investigate disk I/O and workload patterns"
echo " - Check disk performance with 'fio' benchmarks"
echo " - Review etcd workload and consider optimization"
echo ""
WARNINGS=$((WARNINGS + 1))
fi
if [ "$DISK_WARNINGS" -gt 5 ]; then
echo "WARNING: Multiple disk performance warnings ($DISK_WARNINGS in last ${DURATION}m)"
echo " Recommendation: Investigate disk I/O performance"
echo " - Ensure etcd is using SSD/NVMe storage"
echo " - Check for disk saturation or competing I/O"
echo " - Verify disk benchmarks meet etcd requirements (> 50 sequential IOPS)"
echo ""
WARNINGS=$((WARNINGS + 1))
fi
# Get average DB size
AVG_DB_SIZE=$(echo "$DB_STATUS" | jq -r '[.[] | .Status.dbSize] | add / length')
AVG_DB_SIZE_MB=$(echo "scale=0; $AVG_DB_SIZE / 1024 / 1024" | bc)
if [ "$AVG_DB_SIZE_MB" -gt 8000 ]; then
echo "WARNING: Large database size (${AVG_DB_SIZE_MB}MB)"
echo " Recommendation: Review data retention and compaction policies"
echo " - Check event retention policies"
echo " - Consider more frequent compaction"
echo ""
WARNINGS=$((WARNINGS + 1))
fi
echo "Performance Metrics Summary:"
echo " - Database size: ${AVG_DB_SIZE_MB}MB (recommended: < 8GB)"
echo " - Fragmentation: ${MAX_FRAG}% (recommended: < 30%)"
echo " - Slow operations (${DURATION}m): $SLOW_OPS (recommended: < 10)"
echo " - Leader changes (${DURATION}m): $LEADER_CHANGES (recommended: < 5)"
echo ""
if [ "$ISSUES" -eq 0 ] && [ "$WARNINGS" -eq 0 ]; then
echo "Status: ✓ HEALTHY - Performance within acceptable ranges"
exit 0
elif [ "$ISSUES" -gt 0 ]; then
echo "Status: ✗ CRITICAL - Found $ISSUES performance issues requiring attention"
exit 1
else
echo "Status: ⚠ WARNING - Found $WARNINGS performance warnings"
exit 0
fi
```
## Return Value
- **Exit 0**: Performance is acceptable (may have warnings)
- **Exit 1**: Critical performance issues detected
**Output Format**:
- Structured sections for different performance aspects
- Database statistics per endpoint and counts of matching log events
- Warnings for values exceeding thresholds
- Recommendations for remediation
## Examples
### Example 1: Basic performance analysis
```
/etcd:analyze-performance
```
Output:
```
===============================================
ETCD PERFORMANCE ANALYSIS
===============================================
Analyzing etcd performance (last 5 minutes)...
Using etcd pod: etcd-dis016-p6vvv-master-0.us-central1-a.c.openshift-qe.internal
===============================================
DATABASE PERFORMANCE ANALYSIS
===============================================
Fetching database statistics...
Database Statistics:
Endpoint: https://10.0.0.5:2379
Version: 3.5.24
DB Size: 94941184 bytes (90MB)
DB In Use: 51789824 bytes (49MB)
Revision: 50240
Raft Index: 57097
Raft Term: 8
Leader: YES
Endpoint: https://10.0.0.3:2379
Version: 3.5.24
DB Size: 95363072 bytes (90MB)
DB In Use: 51789824 bytes (49MB)
Revision: 50240
Raft Index: 57097
Raft Term: 8
Leader: NO
Endpoint: https://10.0.0.6:2379
Version: 3.5.24
DB Size: 94613504 bytes (90MB)
DB In Use: 51834880 bytes (49MB)
Revision: 50240
Raft Index: 57097
Raft Term: 8
Leader: NO
Fragmentation Analysis:
Endpoint: https://10.0.0.5:2379
Fragmentation: 45% - NOTICE: Moderate fragmentation
Endpoint: https://10.0.0.3:2379
Fragmentation: 45% - NOTICE: Moderate fragmentation
Endpoint: https://10.0.0.6:2379
Fragmentation: 45% - NOTICE: Moderate fragmentation
===============================================
CLUSTER HEALTH
===============================================
https://10.0.0.5:2379 is healthy: successfully committed proposal: took = 9.848973ms
https://10.0.0.3:2379 is healthy: successfully committed proposal: took = 14.309216ms
https://10.0.0.6:2379 is healthy: successfully committed proposal: took = 14.829731ms
===============================================
LOG ANALYSIS (Last 5 minutes)
===============================================
Searching for performance-related warnings...
Slow operations logged: 0
Disk-related warnings: 0
Apply operation warnings: 0
Recent compaction operations:
{"level":"info","ts":"2025-11-19T06:15:10.136401Z","caller":"mvcc/kvstore_compaction.go:72","msg":"finished scheduled compaction","compact-revision":48026,"took":"175.577699ms","hash":1330697744}
===============================================
LEADER STABILITY ANALYSIS
===============================================
Leader change events: 0
===============================================
NETWORK ANALYSIS
===============================================
Network-related issues: 0
===============================================
PERFORMANCE SUMMARY & RECOMMENDATIONS
===============================================
WARNING: Moderate database fragmentation (45%)
Recommendation: Monitor and consider defragmentation if performance degrades
Performance Metrics Summary:
- Database size: 90MB (recommended: < 8GB)
- Fragmentation: 45% (recommended: < 30%)
- Slow operations (5m): 0 (recommended: < 10)
- Leader changes (5m): 0 (recommended: < 5)
Status: ⚠ WARNING - Found 1 performance warnings
```
### Example 2: Extended analysis window
```
/etcd:analyze-performance --duration 30
```
## Common Performance Issues
### High Database Fragmentation
**Symptoms**: Database size significantly larger than in-use size (>30% fragmentation)
**Investigation**:
```bash
# Check current fragmentation
oc exec -n openshift-etcd <pod> -c etcdctl -- etcdctl endpoint status --cluster -w json | jq
```
**Remediation**:
```bash
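# Defragment etcd members. Note: --cluster targets every member in one invocation;
# to go one member at a time, pass a single member's URL with --endpoints instead.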
oc exec -n openshift-etcd <pod> -c etcdctl -- etcdctl defrag --cluster
```
**Recommendations**:
- Schedule regular defragmentation during maintenance windows
- Monitor fragmentation trends over time
- Consider defragmentation when >30% fragmented
### Slow Disk I/O
**Symptoms**:
- Disk-related warnings in logs (fsync, fdatasync)
- Slow apply operations
- High compaction times (>500ms)
**Investigation**:
```bash
# Check disk performance on etcd nodes
oc debug node/<node-name> -- chroot /host fio --name=test --rw=write --bs=4k --size=1G --direct=1
```
**Recommendations**:
- Use SSD or NVMe storage for etcd
- Ensure dedicated disks for etcd (not shared with OS)
- Check for disk saturation or competing I/O
- Verify disk benchmarks meet etcd requirements (> 50 sequential IOPS)
### Frequent Leader Changes
**Symptoms**: Multiple leader change events in logs
**Investigation**:
```bash
# Test network latency between control plane nodes
oc debug node/<node1> -- ping -c 10 <node2-ip>
# Check for network packet loss
oc debug node/<node1> -- ping -c 100 <node2-ip>
```
**Recommendations**:
- Ensure etcd nodes are in same datacenter/availability zone
- Check for network congestion or packet loss
- Verify MTU settings across cluster network
- Review network firewall rules and QoS settings
### Large Database Size
**Symptoms**:
- Database size >8GB
- Slow operations
- High memory usage
**Investigation**:
```bash
# Check database size across cluster
oc exec -n openshift-etcd <pod> -c etcdctl -- etcdctl endpoint status --cluster -w table
```
**Remediation**:
```bash
# Check event retention settings
oc get kubeapiserver cluster -o yaml | grep -A5 eventTTL
# Review compaction settings
oc logs -n openshift-etcd <pod> -c etcd | grep compaction
```
**Recommendations**:
- Review event retention policies
- Consider more frequent compaction
- Check for key churn and unnecessary data
- Monitor database growth trends
## Security Considerations
- Metrics may expose cluster operational details
- Requires cluster-admin permissions
- Log analysis may contain sensitive data
- Performance data should be treated as confidential
## See Also
- Etcd performance guide: https://etcd.io/docs/latest/tuning/
- OpenShift etcd docs: https://docs.openshift.com/container-platform/latest/scalability_and_performance/recommended-performance-scale-practices/
- Related commands: `/etcd:health-check`
## Notes
- This command uses `etcdctl` and log analysis rather than direct metrics endpoint access
- Performance thresholds are based on etcd upstream recommendations
- Disk benchmarks should show > 50 sequential IOPS for etcd
- Network latency < 50ms recommended between members
- Analysis is point-in-time; trends require repeated checks over time
- Compatible with etcd 3.5+ (OpenShift 4.x)
- Log analysis window can be adjusted with `--duration` parameter
- For production clusters, consider running during low-traffic periods
- Health check latency is measured by actual proposal commits to the cluster

commands/health-check.md Normal file
@@ -0,0 +1,460 @@
---
description: Check etcd cluster health, member status, and identify issues
argument-hint: "[--verbose]"
---
## Name
etcd:health-check
## Synopsis
```
/etcd:health-check [--verbose]
```
## Description
The `health-check` command performs a comprehensive health check of the etcd cluster in an OpenShift environment. It examines etcd member status, cluster health, leadership, connectivity, and identifies potential issues that could affect cluster stability.
Etcd is the critical key-value store that holds all cluster state for Kubernetes/OpenShift. Issues related to etcd can cause cluster-wide failures, so monitoring its health is essential.
This command is useful for:
- Diagnosing cluster control plane issues
- Verifying etcd cluster stability
- Identifying split-brain scenarios
- Checking member synchronization
- Detecting disk space issues
- Monitoring etcd performance
## Prerequisites
Before using this command, ensure you have:
1. **OpenShift CLI (oc)**
- Install from: https://mirror.openshift.com/pub/openshift-v4/clients/ocp/
- Verify with: `oc version`
2. **Active cluster connection**
- Must be connected to an OpenShift cluster
- Verify with: `oc whoami`
3. **Cluster admin permissions**
- Required to access etcd pods and execute commands
- Verify with: `oc auth can-i get pods -n openshift-etcd`
4. **Healthy etcd namespace**
- The openshift-etcd namespace must exist
- At least one etcd pod must be running
## Arguments
- **--verbose** (optional): Enable detailed output
- Shows etcd member details
- Displays performance metrics
- Includes log snippets for errors
- Provides additional diagnostic information
## Implementation
The command performs the following checks:
### 1. Verify Prerequisites
Check if oc CLI is available and cluster is accessible:
```bash
if ! command -v oc &> /dev/null; then
echo "Error: oc CLI not found. Please install OpenShift CLI."
exit 1
fi
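if ! oc whoami &> /dev/null; then
echo "Error: Not connected to an OpenShift cluster."
exit 1
fi
# Parse --verbose (default: off) and start the warning counter used by later checks
VERBOSE=false
if [[ "$1" == "--verbose" ]]; then
VERBOSE=true
fi
WARNINGS=0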
```
### 2. Check Etcd Namespace and Pods
Verify the etcd namespace exists and get pod status:
```bash
echo "Checking etcd namespace and pods..."
if ! oc get namespace openshift-etcd &> /dev/null; then
echo "CRITICAL: openshift-etcd namespace not found"
exit 1
fi
# Get etcd pod status
ETCD_PODS=$(oc get pods -n openshift-etcd -l app=etcd -o json)
TOTAL_PODS=$(echo "$ETCD_PODS" | jq '.items | length')
RUNNING_PODS=$(echo "$ETCD_PODS" | jq '[.items[] | select(.status.phase == "Running")] | length')
echo "Etcd pods: $RUNNING_PODS/$TOTAL_PODS running"
if [ "$RUNNING_PODS" -eq 0 ]; then
echo "CRITICAL: No etcd pods are running"
exit 1
fi
# List all etcd pods with status
echo ""
echo "Etcd Pod Status:"
oc get pods -n openshift-etcd -l app=etcd -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,READY:.status.containerStatuses[0].ready,RESTARTS:.status.containerStatuses[0].restartCount,NODE:.spec.nodeName
```
### 3. Check Etcd Cluster Health
Use etcdctl from the first running etcd pod to check the health of every cluster endpoint:
```bash
echo ""
echo "Checking etcd cluster health..."
# Get the first running etcd pod
ETCD_POD=$(oc get pods -n openshift-etcd -l app=etcd --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}')
if [ -z "$ETCD_POD" ]; then
echo "CRITICAL: No running etcd pod found"
exit 1
fi
# Check cluster health. etcdctl exits non-zero when any endpoint is unhealthy,
# so test the exit status; table output never contains the text "is healthy".
if HEALTH_OUTPUT=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl endpoint health --cluster -w table 2>&1); then
echo "Cluster Health Status:"
echo "$HEALTH_OUTPUT"
else
echo "CRITICAL: Etcd cluster health check failed"
echo "$HEALTH_OUTPUT"
exit 1
fi
```
### 4. Check Etcd Member List
List all etcd members and verify quorum:
```bash
echo ""
echo "Checking etcd member list..."
MEMBER_LIST=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl member list -w table 2>&1)
echo "Etcd Members:"
echo "$MEMBER_LIST"
# Count members
MEMBER_COUNT=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl member list -w json 2>/dev/null | jq '.members | length')
echo ""
echo "Total members: $MEMBER_COUNT"
if [ "$MEMBER_COUNT" -lt 3 ]; then
echo "WARNING: Etcd cluster has fewer than 3 members (quorum at risk)"
WARNINGS=$((WARNINGS + 1))
fi
# Check for unstarted members
UNSTARTED=$(echo "$MEMBER_LIST" | grep "unstarted" | wc -l)
if [ "$UNSTARTED" -gt 0 ]; then
echo "WARNING: $UNSTARTED member(s) in unstarted state"
WARNINGS=$((WARNINGS + 1))
fi
```
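The 3-member threshold follows from Raft majority arithmetic: a cluster of `n` members needs `floor(n / 2) + 1` of them to agree, so it can lose `n - quorum` members and keep serving writes. A small sketch of that arithmetic follows; the helper names `quorum_size` and `failure_tolerance` are hypothetical.

```bash
# Hypothetical helpers illustrating majority arithmetic for an n-member cluster.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

failure_tolerance() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 5; do
    echo "members=$n quorum=$(quorum_size "$n") tolerated_failures=$(failure_tolerance "$n")"
done
```

With 1 or 2 members the tolerated failure count is 0, which is why fewer than 3 members is reported as a warning; 3 members tolerate one failure and 5 tolerate two.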
### 5. Check Etcd Leadership
Verify there is a healthy leader:
```bash
echo ""
echo "Checking etcd leadership..."
ENDPOINT_STATUS=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl endpoint status --cluster -w table 2>&1)
echo "Endpoint Status:"
echo "$ENDPOINT_STATUS"
# Check if there's a leader
if echo "$ENDPOINT_STATUS" | grep -q "true"; then
LEADER_ENDPOINT=$(echo "$ENDPOINT_STATUS" | grep "true" | awk '{print $2}')
echo ""
echo "Leader: $LEADER_ENDPOINT"
else
echo "CRITICAL: No etcd leader elected"
exit 1
fi
```
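Matching the bare word `true` anywhere in a table row can also match other boolean columns. The sketch below selects on one specific column instead. It assumes the table places `IS LEADER` as the fifth column and `IS LEARNER` as the sixth, which should be confirmed against the header your etcdctl version prints. The sample table is illustrative and shortened, and `leader_endpoint` is a hypothetical helper.

```bash
# Hypothetical helper: read an etcdctl-style table on stdin and print the
# endpoint of rows whose IS LEADER column is true.
# Splitting on "|" makes field 1 empty, field 2 ENDPOINT, field 6 IS LEADER
# (column positions are an assumption; check the header of your output).
leader_endpoint() {
    awk -F'|' '$6 ~ /true/ { gsub(/ /, "", $2); print $2 }'
}

# Illustrative, shortened sample; the second row is a learner, not the leader
SAMPLE_STATUS='+--------------------------+------------------+---------+---------+-----------+------------+
|         ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER |
+--------------------------+------------------+---------+---------+-----------+------------+
| https://10.0.21.125:2379 | 3a2b1c4d5e6f7890 |  3.5.24 |   95 MB |      true |      false |
| https://10.0.43.249:2379 | 4b3c2d5e6f708901 |  3.5.24 |   95 MB |     false |       true |
+--------------------------+------------------+---------+---------+-----------+------------+'

echo "Leader: $(printf '%s\n' "$SAMPLE_STATUS" | leader_endpoint)"
echo "Rows containing 'true': $(printf '%s\n' "$SAMPLE_STATUS" | grep -c true)"
```

In this sample, two rows contain `true` but only one is the leader, so a column-specific match avoids reporting a learner as the leader.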
### 6. Check Etcd Database Size
Check database size and fragmentation:
```bash
echo ""
echo "Checking etcd database size..."
# Get database size from endpoint status
DB_SIZE=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcdctl -- etcdctl endpoint status --cluster -w json 2>/dev/null)
echo "$DB_SIZE" | jq -r '.[] | "Endpoint: \(.Endpoint) | DB Size: \(.Status.dbSize) bytes | DB Size in Use: \(.Status.dbSizeInUse) bytes"'
# Calculate fragmentation percentage
echo "$DB_SIZE" | jq -r '.[] |
if .Status.dbSize > 0 then
"Fragmentation: \(((.Status.dbSize - .Status.dbSizeInUse) * 100 / .Status.dbSize) | floor)%"
else
"Fragmentation: N/A"
end'
# Warn if database is too large
MAX_DB_SIZE=$((8 * 1024 * 1024 * 1024)) # 8GB threshold
CURRENT_SIZE=$(echo "$DB_SIZE" | jq -r '.[0].Status.dbSize')
if [ "$CURRENT_SIZE" -gt "$MAX_DB_SIZE" ]; then
echo "WARNING: Etcd database size ($CURRENT_SIZE bytes) exceeds recommended maximum (8GB)"
echo "Consider defragmentation or checking for excessive key growth"
WARNINGS=$((WARNINGS + 1))
fi
```
### 7. Check Disk Space on Etcd Nodes
Verify disk space on nodes running etcd:
```bash
echo ""
echo "Checking disk space on etcd nodes..."
for pod in $(oc get pods -n openshift-etcd -l app=etcd --field-selector=status.phase=Running -o jsonpath='{.items[*].metadata.name}'); do
echo "Pod: $pod"
oc exec -n openshift-etcd "$pod" -c etcd -- df -h /var/lib/etcd | tail -1
# Get disk usage percentage
DISK_USAGE=$(oc exec -n openshift-etcd "$pod" -c etcd -- df -h /var/lib/etcd | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 80 ]; then
echo "WARNING: Disk usage on $pod is ${DISK_USAGE}% (threshold: 80%)"
WARNINGS=$((WARNINGS + 1))
fi
echo ""
done
```
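The usage check depends on the fifth field of the last `df` line being a percentage. The sketch below isolates that parsing and guards against a non-numeric result before the integer comparison. `usage_pct` is a hypothetical helper and the `df` output is a made-up sample. If device names are long enough for `df` to wrap lines, the POSIX `-P` flag keeps each filesystem on one line.

```bash
# Hypothetical helper: read df output on stdin and print the Use% value of
# the last line as a bare number.
usage_pct() {
    tail -n 1 | awk '{ sub(/%$/, "", $5); print $5 }'
}

# Made-up sample output, for illustration only
SAMPLE_DF='Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p4  120G  102G   18G  85% /var/lib/etcd'

DISK_USAGE=$(printf '%s\n' "$SAMPLE_DF" | usage_pct)
case "$DISK_USAGE" in
    ''|*[!0-9]*)
        echo "Could not parse disk usage" ;;
    *)
        if [ "$DISK_USAGE" -gt 80 ]; then
            echo "WARNING: Disk usage is ${DISK_USAGE}% (threshold: 80%)"
        fi ;;
esac
```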
### 8. Check for Recent Etcd Errors
Check recent logs for errors or warnings:
```bash
echo ""
echo "Checking recent etcd logs for errors..."
RECENT_ERRORS=$(oc logs -n openshift-etcd "$ETCD_POD" -c etcd --tail=100 | grep -i "error\|warn\|fatal" | tail -10)
if [ -n "$RECENT_ERRORS" ]; then
echo "Recent errors/warnings found:"
echo "$RECENT_ERRORS"
else
echo "No recent errors in etcd logs"
fi
```
### 9. Check Etcd Performance Metrics (if --verbose)
If verbose mode is enabled, check performance metrics:
```bash
if [ "${VERBOSE:-false}" = "true" ]; then
echo ""
echo "Checking etcd performance metrics..."
# Get metrics from etcd pod
# NOTE: assumes a plain-HTTP metrics endpoint on localhost:2379; clusters that enforce TLS need the matching scheme and client certificates
METRICS=$(oc exec -n openshift-etcd "$ETCD_POD" -c etcd -- curl -s http://localhost:2379/metrics 2>/dev/null)
# These durations are Prometheus histograms (_bucket/_sum/_count series), not summaries,
# so there is no quantile="0.99" label to grep for; print the raw sum and count instead
echo "Backend Commit Duration (total seconds / commit count):"
echo "$METRICS" | grep -E "^etcd_disk_backend_commit_duration_seconds_(sum|count)"
echo ""
echo "WAL Fsync Duration (total seconds / fsync count):"
echo "$METRICS" | grep -E "^etcd_disk_wal_fsync_duration_seconds_(sum|count)"
echo ""
echo "Leader Changes:"
echo "$METRICS" | grep "^etcd_server_leader_changes_seen_total" | head -1
fi
```
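etcd publishes `etcd_disk_wal_fsync_duration_seconds` and `etcd_disk_backend_commit_duration_seconds` as Prometheus histograms: cumulative `_bucket{le="..."}` counters plus `_sum` and `_count`. An upper bound for a percentile can be read from the first bucket whose cumulative count reaches that fraction of the total. The sketch below does this with awk. It assumes buckets appear in ascending `le` order, as in the text exposition format. `histogram_quantile_bound` is a hypothetical helper and the sample values are made up.

```bash
# Hypothetical helper: print the upper bound (le) of the first bucket whose
# cumulative count reaches quantile $2 of the total for histogram $1.
# Reads Prometheus text-format metrics on stdin.
histogram_quantile_bound() {
    awk -v name="$1" -v q="$2" '
        index($0, name "_bucket{") == 1 {
            le = $0
            sub(/.*le="/, "", le)     # drop everything before the le value
            sub(/".*/, "", le)        # drop everything after the le value
            n++; bound[n] = le; cum[n] = $NF + 0
        }
        index($0, name "_count") == 1 { total = $NF + 0 }
        END {
            if (total == 0) { print "N/A"; exit }
            for (i = 1; i <= n; i++)
                if (cum[i] >= q * total) { print bound[i]; exit }
        }'
}

# Made-up sample metrics, for illustration only
SAMPLE_METRICS='etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 10
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 60
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 95
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"} 98
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.016"} 100
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 100
etcd_disk_wal_fsync_duration_seconds_sum 0.25
etcd_disk_wal_fsync_duration_seconds_count 100'

P99=$(printf '%s\n' "$SAMPLE_METRICS" | histogram_quantile_bound etcd_disk_wal_fsync_duration_seconds 0.99)
echo "WAL fsync p99 is at most ${P99}s"
```

This reports a bucket boundary rather than an interpolated value, so it is coarse. It is usually enough to tell whether fsync latency sits in the low-millisecond range or well above it.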
### 10. Generate Summary Report
Create a summary of findings:
```bash
echo ""
echo "==============================================="
echo "Etcd Health Check Summary"
echo "==============================================="
echo "Check Time: $(date)"
echo "Cluster: $(oc whoami --show-server)"
echo ""
echo "Results:"
echo " Etcd Pods Running: $RUNNING_PODS/$TOTAL_PODS"
echo " Cluster Members: $MEMBER_COUNT"
echo " Leader Elected: Yes"
echo " Cluster Health: Healthy"
echo ""
if [ "${WARNINGS:-0}" -gt 0 ]; then
echo "Status: WARNING - Found $WARNINGS warnings requiring attention"
exit 0
else
echo "Status: HEALTHY - All checks passed"
exit 0
fi
```
## Return Value
The command returns different exit codes:
- **Exit 0**: Etcd cluster is healthy (may have warnings)
- **Exit 1**: Critical issues detected (no running pods, no leader, health check failed)
**Output Format**:
- Human-readable report with section headers
- Critical issues marked with "CRITICAL:"
- Warnings marked with "WARNING:"
- Success indicators for healthy checks
## Examples
### Example 1: Basic health check
```
/etcd:health-check
```
Output:
```
Checking etcd namespace and pods...
Etcd pods: 3/3 running
Etcd Pod Status:
NAME STATUS READY RESTARTS NODE
etcd-ip-10-0-21-125.us-east-2... Running true 0 ip-10-0-21-125
etcd-ip-10-0-43-249.us-east-2... Running true 0 ip-10-0-43-249
etcd-ip-10-0-68-109.us-east-2... Running true 0 ip-10-0-68-109
Checking etcd cluster health...
Cluster Health Status:
+------------------------------------------+--------+
| ENDPOINT | HEALTH |
+------------------------------------------+--------+
| https://10.0.21.125:2379 | true |
| https://10.0.43.249:2379 | true |
| https://10.0.68.109:2379 | true |
+------------------------------------------+--------+
Checking etcd member list...
Etcd Members:
+------------------+---------+------------------------+
| ID | STATUS | NAME |
+------------------+---------+------------------------+
| 3a2b1c4d5e6f7890 | started | ip-10-0-21-125 |
| 4b3c2d5e6f708901 | started | ip-10-0-43-249 |
| 5c4d3e6f70890123 | started | ip-10-0-68-109 |
+------------------+---------+------------------------+
Total members: 3
Checking etcd leadership...
Leader: https://10.0.21.125:2379
===============================================
Etcd Health Check Summary
===============================================
Status: HEALTHY - All checks passed
```
### Example 2: Verbose health check with metrics
```
/etcd:health-check --verbose
```
## Common Issues and Remediation
### No Etcd Leader
**Symptoms**: Cluster shows no leader elected
**Investigation**:
```bash
oc logs -n openshift-etcd <etcd-pod> -c etcd | grep -i "leader"
oc get events -n openshift-etcd
```
**Remediation**:
- Check network connectivity between etcd members
- Verify etcd pods are running on different nodes
- Check for clock skew between nodes
### High Database Size
**Symptoms**: Database size exceeds 8GB
**Investigation**:
```bash
oc exec -n openshift-etcd <etcd-pod> -c etcdctl -- etcdctl endpoint status -w table
```
**Remediation**:
- Run defragmentation: `/etcd:defrag` (if command exists)
- Check for excessive key creation (e.g., many events)
- Review retention policies
### Disk Space Issues
**Symptoms**: Disk usage > 80% on etcd data directory
**Investigation**:
```bash
oc exec -n openshift-etcd <etcd-pod> -c etcd -- df -h /var/lib/etcd
```
**Remediation**:
- Clean up old snapshots
- Defragment database
- Increase disk size if needed
### Member Not Started
**Symptoms**: Member shows "unstarted" status
**Investigation**:
```bash
oc logs -n openshift-etcd <etcd-pod> -c etcd
oc describe pod -n openshift-etcd <etcd-pod>
```
**Remediation**:
- Check pod logs for errors
- Verify certificates are valid
- Check network policies and firewall rules
## Security Considerations
- Requires cluster-admin or equivalent permissions
- Access to etcd data allows viewing all cluster secrets
- Etcd metrics may contain sensitive information
- Always use secure connections when accessing etcd
## See Also
- Etcd documentation: https://etcd.io/docs/
- OpenShift etcd docs: https://docs.openshift.com/container-platform/latest/backup_and_restore/control_plane_backup_and_restore/
- Related commands: `/etcd:analyze-performance`
## Notes
- This command is read-only and does not modify etcd
- Checks are performed from within etcd pods using etcdctl
- Some checks require etcd to be running
- Performance may vary on large clusters with many keys
- Database size recommendations are based on upstream etcd guidance