description: Perform comprehensive health check on HCP cluster and report issues
argument-hint: <cluster-name> [--verbose] [--output-format json|text]

Name

hcp:cluster-health-check

Synopsis

/hcp:cluster-health-check <cluster-name> [namespace] [--verbose] [--output-format json|text]

Description

The /hcp:cluster-health-check command performs an extensive diagnostic of a Hosted Control Plane (HCP) cluster to assess its operational health, stability, and performance. It automates the validation of multiple control plane and infrastructure components to ensure the cluster is functioning as expected.

The command runs a comprehensive set of health checks covering control plane pods, etcd, node pools, networking, and infrastructure readiness. It also detects degraded or progressing states, certificate issues, and recent warning events.

Specifically, it performs the following:

  • Detects and validates the availability of oc or kubectl CLI tools and verifies cluster connectivity.
  • Checks the HostedCluster resource status (Available, Progressing, Degraded) and reports any abnormal conditions with detailed reasons and messages.
  • Validates control plane pod health in the management cluster, detecting non-running pods, crash loops, high restart counts, and image pull errors.
  • Performs etcd health checks to ensure data consistency and availability.
  • Inspects NodePools for replica mismatches, readiness, and auto-scaling configuration accuracy.
  • Evaluates infrastructure readiness (including AWS/Azure-specific validations like endpoint services).
  • Examines networking and ingress components, including Konnectivity servers and router pods.
  • Scans for recent warning events across HostedCluster, NodePool, and control plane namespaces.
  • Reviews certificate conditions to identify expiring or invalid certificates.
  • Generates a clear, color-coded summary report and optionally exports findings in JSON format for automation or CI integration.

Prerequisites

Before using this command, ensure you have:

  1. Kubernetes/OpenShift CLI: Either oc (OpenShift) or kubectl (Kubernetes)

  2. Active cluster connection: Must be connected to a running cluster

    • Verify with: oc whoami or kubectl cluster-info
    • Ensure KUBECONFIG is set if needed
  3. Sufficient permissions: Must have read access to cluster resources

    • Cluster-admin or a monitoring role is recommended for comprehensive checks
    • Minimum: ability to view HostedCluster, NodePool, Pod, and Event resources (a quick verification sketch follows this list)
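
A minimal way to confirm these prerequisites before running the full check (resource names reflect what the command queries; adjust to your RBAC setup):

# Verify cluster connectivity
oc whoami 2>/dev/null || kubectl cluster-info

# Verify read access to the resource types the health check queries
for res in hostedclusters nodepools pods events; do
    kubectl auth can-i get "$res" -n clusters
done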

Arguments

  • <cluster-name> (required): Name of the HostedCluster to check. This uniquely identifies the HCP instance whose health will be analyzed. Example: hcp-demo

  • [namespace] (optional): The namespace in which the HostedCluster resides. Defaults to clusters if not provided. Example: /hcp:cluster-health-check hcp-demo clusters-dev

  • --verbose (optional): Enable detailed output with additional context

    • Shows resource-level details
    • Includes warning conditions
    • Provides remediation suggestions
  • --output-format (optional): Output format for results

    • text (default): Human-readable text format
    • json: Machine-readable JSON format for automation

Implementation

The command performs the following health checks:

1. Determine CLI Tool and Verify Connectivity

Detect which Kubernetes CLI is available and verify cluster connection:

if command -v oc &> /dev/null; then
    CLI="oc"
elif command -v kubectl &> /dev/null; then
    CLI="kubectl"
else
    echo "Error: Neither 'oc' nor 'kubectl' CLI found. Please install one of them."
    exit 1
fi

# Verify cluster connectivity
if ! $CLI cluster-info &> /dev/null; then
    echo "Error: Not connected to a cluster. Please configure your KUBECONFIG."
    exit 1
fi

2. Initialize Health Check Report

Create a report structure to collect findings:

CLUSTER_NAME=$1
NAMESPACE=${2:-"clusters"}
CONTROL_PLANE_NS="${NAMESPACE}-${CLUSTER_NAME}"

REPORT_FILE=".work/hcp-health-check/report-${CLUSTER_NAME}-$(date +%Y%m%d-%H%M%S).txt"
mkdir -p .work/hcp-health-check

# Initialize counters
CRITICAL_ISSUES=0
WARNING_ISSUES=0
INFO_MESSAGES=0

Arguments:

  • $1 (cluster-name): Name of the HostedCluster resource to check. This is the name visible in kubectl get hostedcluster output. Required.
  • $2 (namespace): Namespace containing the HostedCluster resource. Defaults to "clusters" if not specified. Optional.
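
The --verbose and --output-format flags are not consumed by the simple $1/$2 assignment above; one way to parse them alongside the positional arguments is sketched below (VERBOSE and OUTPUT_FORMAT are illustrative variable names):

VERBOSE=false
OUTPUT_FORMAT="text"
POSITIONAL=()

# Separate flags from positional arguments
while [ $# -gt 0 ]; do
    case "$1" in
        --verbose) VERBOSE=true; shift ;;
        --output-format) OUTPUT_FORMAT="$2"; shift 2 ;;
        *) POSITIONAL+=("$1"); shift ;;
    esac
done

CLUSTER_NAME=${POSITIONAL[0]}
NAMESPACE=${POSITIONAL[1]:-clusters}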

3. Check HostedCluster Status

Verify the HostedCluster resource health:

echo "Checking HostedCluster Status..."

# Check if HostedCluster exists
if ! $CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" &> /dev/null; then
    echo "❌ CRITICAL: HostedCluster '$CLUSTER_NAME' not found in namespace '$NAMESPACE'"
    exit 1
fi

# Get HostedCluster conditions
HC_AVAILABLE=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Available") | .status')
HC_PROGRESSING=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Progressing") | .status')
HC_DEGRADED=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Degraded") | .status')

if [ "$HC_AVAILABLE" != "True" ]; then
    CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
    echo "❌ CRITICAL: HostedCluster is not Available"
    # Get reason and message
    $CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Available") | "    Reason: \(.reason)\n    Message: \(.message)"'
fi

if [ "$HC_DEGRADED" == "True" ]; then
    CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
    echo "❌ CRITICAL: HostedCluster is Degraded"
    $CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Degraded") | "    Reason: \(.reason)\n    Message: \(.message)"'
fi

if [ "$HC_PROGRESSING" == "True" ]; then
    WARNING_ISSUES=$((WARNING_ISSUES + 1))
    echo "⚠️  WARNING: HostedCluster is Progressing"
    $CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Progressing") | "    Reason: \(.reason)\n    Message: \(.message)"'
fi

# Check version and upgrade status
HC_VERSION=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.version.history[0].version // "unknown"')
echo "  HostedCluster version: $HC_VERSION"

4. Check Control Plane Pod Health

Examine control plane components in the management cluster:

echo "Checking Control Plane Pods..."

# Check if control plane namespace exists
if ! $CLI get namespace "$CONTROL_PLANE_NS" &> /dev/null; then
    CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
    echo "❌ CRITICAL: Control plane namespace '$CONTROL_PLANE_NS' not found"
else
    # Get pods that are not Running
    FAILED_CP_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r '.items[] | select(.status.phase != "Running" and .status.phase != "Succeeded") | "\(.metadata.name) [\(.status.phase)]"')

    if [ -n "$FAILED_CP_PODS" ]; then
        CRITICAL_ISSUES=$((CRITICAL_ISSUES + $(echo "$FAILED_CP_PODS" | wc -l)))
        echo "❌ CRITICAL: Control plane pods in failed state:"
        echo "$FAILED_CP_PODS" | while read pod; do
            echo "  - $pod"
        done
    fi

    # Check for pods with high restart count
    HIGH_RESTART_CP_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r '.items[] | select(any(.status.containerStatuses[]?; .restartCount > 3)) | "\(.metadata.name) [Restarts: \([.status.containerStatuses[]? | .restartCount] | max)]"')

    if [ -n "$HIGH_RESTART_CP_PODS" ]; then
        WARNING_ISSUES=$((WARNING_ISSUES + $(echo "$HIGH_RESTART_CP_PODS" | wc -l)))
        echo "⚠️  WARNING: Control plane pods with high restart count (>3):"
        echo "$HIGH_RESTART_CP_PODS" | while read pod; do
            echo "  - $pod"
        done
    fi

    # Check for CrashLoopBackOff pods
    CRASHLOOP_CP_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r '.items[] | select(any(.status.containerStatuses[]?; .state.waiting?.reason == "CrashLoopBackOff")) | .metadata.name')

    if [ -n "$CRASHLOOP_CP_PODS" ]; then
        CRITICAL_ISSUES=$((CRITICAL_ISSUES + $(echo "$CRASHLOOP_CP_PODS" | wc -l)))
        echo "❌ CRITICAL: Control plane pods in CrashLoopBackOff:"
        echo "$CRASHLOOP_CP_PODS" | while read pod; do
            echo "  - $pod"
        done
    fi

    # Check for ImagePullBackOff
    IMAGE_PULL_CP_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r '.items[] | select(any(.status.containerStatuses[]?; .state.waiting?.reason == "ImagePullBackOff" or .state.waiting?.reason == "ErrImagePull")) | .metadata.name')

    if [ -n "$IMAGE_PULL_CP_PODS" ]; then
        CRITICAL_ISSUES=$((CRITICAL_ISSUES + $(echo "$IMAGE_PULL_CP_PODS" | wc -l)))
        echo "❌ CRITICAL: Control plane pods with image pull errors:"
        echo "$IMAGE_PULL_CP_PODS" | while read pod; do
            echo "  - $pod"
        done
    fi

    # Check critical control plane components
    echo "Checking critical control plane components..."
    CRITICAL_COMPONENTS="kube-apiserver etcd kube-controller-manager kube-scheduler"
    
    for component in $CRITICAL_COMPONENTS; do
        COMPONENT_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -l app="$component" -o json 2>/dev/null | jq -r '.items[].metadata.name')
        
        if [ -z "$COMPONENT_PODS" ]; then
            # Try alternative label
            COMPONENT_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r ".items[] | select(.metadata.name | startswith(\"$component\")) | .metadata.name")
        fi
        
        if [ -z "$COMPONENT_PODS" ]; then
            WARNING_ISSUES=$((WARNING_ISSUES + 1))
            echo "⚠️  WARNING: No pods found for component: $component"
        else
            # Check if any are not running
            for pod in $COMPONENT_PODS; do
                POD_STATUS=$($CLI get pod "$pod" -n "$CONTROL_PLANE_NS" -o json | jq -r '.status.phase')
                if [ "$POD_STATUS" != "Running" ]; then
                    CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
                    echo "❌ CRITICAL: $component pod $pod is not Running (Status: $POD_STATUS)"
                fi
            done
        fi
    done
fi

5. Check Etcd Health

Verify etcd cluster health and performance:

echo "Checking Etcd Health..."

# Find etcd pods
ETCD_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -l app=etcd -o json 2>/dev/null | jq -r '.items[].metadata.name')

if [ -z "$ETCD_PODS" ]; then
    ETCD_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r '.items[] | select(.metadata.name | startswith("etcd")) | .metadata.name')
fi

if [ -n "$ETCD_PODS" ]; then
    for pod in $ETCD_PODS; do
        # Check etcd endpoint health
        ETCD_HEALTH=$($CLI exec -n "$CONTROL_PLANE_NS" "$pod" -- etcdctl endpoint health 2>/dev/null || echo "failed")
        
        if echo "$ETCD_HEALTH" | grep -q "unhealthy"; then
            CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
            echo "❌ CRITICAL: Etcd pod $pod is unhealthy"
        elif echo "$ETCD_HEALTH" | grep -q "failed"; then
            WARNING_ISSUES=$((WARNING_ISSUES + 1))
            echo "⚠️  WARNING: Could not check etcd health for pod $pod"
        fi
    done
else
    WARNING_ISSUES=$((WARNING_ISSUES + 1))
    echo "⚠️  WARNING: No etcd pods found"
fi
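
For a closer look at etcd performance (DB size, leader, raft index), endpoint status can be captured as well; a sketch, assuming etcdctl is available in the pod as above:

# Capture endpoint status for the report (non-fatal if unavailable)
for pod in $ETCD_PODS; do
    $CLI exec -n "$CONTROL_PLANE_NS" "$pod" -- etcdctl endpoint status -w table 2>/dev/null || true
done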

6. Check NodePool Status

Verify NodePool health and scaling:

echo "Checking NodePool Status..."

# Get all NodePools for this HostedCluster
NODEPOOLS=$($CLI get nodepool -n "$NAMESPACE" -l hypershift.openshift.io/hostedcluster="$CLUSTER_NAME" -o json | jq -r '.items[].metadata.name')

if [ -z "$NODEPOOLS" ]; then
    WARNING_ISSUES=$((WARNING_ISSUES + 1))
    echo "⚠️  WARNING: No NodePools found for HostedCluster $CLUSTER_NAME"
else
    for nodepool in $NODEPOOLS; do
        # Get NodePool status
        NP_REPLICAS=$($CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.spec.replicas // 0')
        NP_READY=$($CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.status.replicas // 0')
        NP_AVAILABLE=$($CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Ready") | .status')

        echo "  NodePool: $nodepool [Ready: $NP_READY/$NP_REPLICAS]"

        if [ "$NP_AVAILABLE" != "True" ]; then
            CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
            echo "❌ CRITICAL: NodePool $nodepool is not Ready"
            $CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Ready") | "    Reason: \(.reason)\n    Message: \(.message)"'
        fi

        if [ "$NP_READY" != "$NP_REPLICAS" ]; then
            WARNING_ISSUES=$((WARNING_ISSUES + 1))
            echo "⚠️  WARNING: NodePool $nodepool has mismatched replicas (Ready: $NP_READY, Desired: $NP_REPLICAS)"
        fi

        # Check for auto-scaling
        AUTOSCALING=$($CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.spec.autoScaling // empty')
        if [ -n "$AUTOSCALING" ]; then
            MIN=$($CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.spec.autoScaling.min')
            MAX=$($CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.spec.autoScaling.max')
            echo "    Auto-scaling enabled: Min=$MIN, Max=$MAX"
        fi
    done
fi

7. Check Infrastructure Status

Validate infrastructure components (AWS/Azure specific):

echo "Checking Infrastructure Status..."

# Get infrastructure platform
PLATFORM=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.spec.platform.type')
echo "  Platform: $PLATFORM"

# Check for infrastructure-related conditions
INFRA_READY=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="InfrastructureReady") | .status')

if [ "$INFRA_READY" != "True" ]; then
    CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
    echo "❌ CRITICAL: Infrastructure is not ready"
    $CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="InfrastructureReady") | "    Reason: \(.reason)\n    Message: \(.message)"'
fi

# Platform-specific checks
if [ "$PLATFORM" == "AWS" ]; then
    # Check AWS-specific resources
    echo "  Performing AWS-specific checks..."
    
    # Check if endpoint service is configured
    ENDPOINT_SERVICE=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.platform.aws.endpointService // "not-found"')
    if [ "$ENDPOINT_SERVICE" == "not-found" ]; then
        WARNING_ISSUES=$((WARNING_ISSUES + 1))
        echo "⚠️  WARNING: AWS endpoint service not configured"
    fi
fi

8. Check Network and Ingress

Verify network connectivity and ingress configuration:

echo "Checking Network and Ingress..."

# Check if connectivity is healthy (for private clusters)
CONNECTIVITY_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -l app=connectivity-server -o json 2>/dev/null | jq -r '.items[].metadata.name')

if [ -n "$CONNECTIVITY_PODS" ]; then
    for pod in $CONNECTIVITY_PODS; do
        POD_STATUS=$($CLI get pod "$pod" -n "$CONTROL_PLANE_NS" -o json | jq -r '.status.phase')
        if [ "$POD_STATUS" != "Running" ]; then
            CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
            echo "❌ CRITICAL: Connectivity server pod $pod is not Running"
        fi
    done
fi

# Check ingress/router pods
ROUTER_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -l app=router -o json 2>/dev/null | jq -r '.items[].metadata.name')

if [ -z "$ROUTER_PODS" ]; then
    ROUTER_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r '.items[] | select(.metadata.name | contains("router") or contains("ingress")) | .metadata.name')
fi

if [ -n "$ROUTER_PODS" ]; then
    for pod in $ROUTER_PODS; do
        POD_STATUS=$($CLI get pod "$pod" -n "$CONTROL_PLANE_NS" -o json | jq -r '.status.phase')
        if [ "$POD_STATUS" != "Running" ]; then
            WARNING_ISSUES=$((WARNING_ISSUES + 1))
            echo "⚠️  WARNING: Router/Ingress pod $pod is not Running"
        fi
    done
fi

9. Check Recent Events

Look for recent warning/error events:

echo "Checking Recent Events..."

# Get recent warning events for HostedCluster
HC_EVENTS=$($CLI get events -n "$NAMESPACE" --field-selector involvedObject.name="$CLUSTER_NAME",involvedObject.kind=HostedCluster,type=Warning --sort-by='.lastTimestamp' | tail -10)

if [ -n "$HC_EVENTS" ]; then
    echo "⚠️  Recent Warning Events for HostedCluster:"
    echo "$HC_EVENTS"
fi

# Get recent warning events in control plane namespace
CP_EVENTS=$($CLI get events -n "$CONTROL_PLANE_NS" --field-selector type=Warning --sort-by='.lastTimestamp' 2>/dev/null | tail -10)

if [ -n "$CP_EVENTS" ]; then
    echo "⚠️  Recent Warning Events in Control Plane:"
    echo "$CP_EVENTS"
fi

# Get recent events for NodePools
for nodepool in $NODEPOOLS; do
    NP_EVENTS=$($CLI get events -n "$NAMESPACE" --field-selector involvedObject.name="$nodepool",involvedObject.kind=NodePool,type=Warning --sort-by='.lastTimestamp' 2>/dev/null | tail -5)
    
    if [ -n "$NP_EVENTS" ]; then
        echo "⚠️  Recent Warning Events for NodePool $nodepool:"
        echo "$NP_EVENTS"
    fi
done

10. Check Certificate Status

Verify certificate validity:

echo "Checking Certificate Status..."

# Check certificate expiration warnings
CERT_CONDITIONS=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type | contains("Certificate")) | "\(.type): \(.status) - \(.message // "N/A")"')

if [ -n "$CERT_CONDITIONS" ]; then
    echo "$CERT_CONDITIONS" | while read line; do
        if echo "$line" | grep -q "False"; then
            WARNING_ISSUES=$((WARNING_ISSUES + 1))
            echo "⚠️  WARNING: $line"
        fi
    done
fi
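
Beyond the reported conditions, actual expiration dates can be inspected directly; a sketch that scans TLS secrets in the control plane namespace for certificates expiring within 30 days (secret layout may vary by platform and version):

for secret in $($CLI get secrets -n "$CONTROL_PLANE_NS" --field-selector type=kubernetes.io/tls -o jsonpath='{.items[*].metadata.name}'); do
    # Decode the leaf certificate and flag it if it expires within 30 days
    if ! $CLI get secret "$secret" -n "$CONTROL_PLANE_NS" -o jsonpath='{.data.tls\.crt}' \
        | base64 -d | openssl x509 -noout -checkend $((30*24*3600)) 2>/dev/null; then
        WARNING_ISSUES=$((WARNING_ISSUES + 1))
        echo "⚠️  WARNING: Certificate in secret $secret expires within 30 days (or could not be parsed)"
    fi
done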

11. Generate Summary Report

Create a summary of findings:

echo ""
echo "==============================================="
echo "HCP Cluster Health Check Summary"
echo "==============================================="
echo "Cluster Name: $CLUSTER_NAME"
echo "Namespace: $NAMESPACE"
echo "Control Plane Namespace: $CONTROL_PLANE_NS"
echo "Platform: $PLATFORM"
echo "Version: $HC_VERSION"
echo "Check Time: $(date)"
echo ""
echo "Results:"
echo "  Critical Issues: $CRITICAL_ISSUES"
echo "  Warnings: $WARNING_ISSUES"
echo ""

if [ $CRITICAL_ISSUES -eq 0 ] && [ $WARNING_ISSUES -eq 0 ]; then
    echo "✅ OVERALL STATUS: HEALTHY - No issues detected"
    exit 0
elif [ $CRITICAL_ISSUES -gt 0 ]; then
    echo "❌ OVERALL STATUS: CRITICAL - Immediate attention required"
    exit 1
else
    echo "⚠️  OVERALL STATUS: WARNING - Monitoring recommended"
    exit 0
fi

12. Optional: Export to JSON Format

If --output-format json is specified, export findings as JSON:

{
  "cluster": {
    "name": "my-hcp-cluster",
    "namespace": "clusters",
    "controlPlaneNamespace": "clusters-my-hcp-cluster",
    "platform": "AWS",
    "version": "4.21.0",
    "checkTime": "2025-11-11T12:00:00Z"
  },
  "summary": {
    "criticalIssues": 1,
    "warnings": 3,
    "overallStatus": "WARNING"
  },
  "findings": {
    "hostedCluster": {
      "available": true,
      "progressing": true,
      "degraded": false,
      "infrastructureReady": true
    },
    "controlPlane": {
      "failedPods": [],
      "crashLoopingPods": [],
      "highRestartPods": ["kube-controller-manager-xxx"],
      "imagePullErrors": []
    },
    "etcd": {
      "healthy": true,
      "pods": ["etcd-0", "etcd-1", "etcd-2"]
    },
    "nodePools": {
      "total": 2,
      "ready": 2,
      "details": [
        {
          "name": "workers",
          "replicas": 3,
          "ready": 3,
          "autoScaling": {
            "enabled": true,
            "min": 2,
            "max": 5
          }
        }
      ]
    },
    "network": {
      "konnectivityHealthy": true,
      "ingressHealthy": true
    },
    "events": {
      "recentWarnings": 5
    }
  }
}
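
A minimal sketch of assembling the top-level fields of this report with jq (OUTPUT_FORMAT is the illustrative flag-parsing variable from step 2; the nested findings object would be built the same way from values collected during the checks):

if [ "$OUTPUT_FORMAT" == "json" ]; then
    jq -n \
        --arg name "$CLUSTER_NAME" \
        --arg ns "$NAMESPACE" \
        --arg cpns "$CONTROL_PLANE_NS" \
        --arg platform "$PLATFORM" \
        --arg version "$HC_VERSION" \
        --argjson critical "$CRITICAL_ISSUES" \
        --argjson warnings "$WARNING_ISSUES" \
        '{
            cluster: {name: $name, namespace: $ns, controlPlaneNamespace: $cpns,
                      platform: $platform, version: $version, checkTime: (now | todate)},
            summary: {criticalIssues: $critical, warnings: $warnings,
                      overallStatus: (if $critical > 0 then "CRITICAL"
                                      elif $warnings > 0 then "WARNING"
                                      else "HEALTHY" end)}
        }' > "${REPORT_FILE%.txt}.json"
fi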

Examples

Example 1: Basic health check with default namespace

/hcp:cluster-health-check my-cluster

Output for a healthy cluster:

HCP Cluster Health Check: my-cluster (namespace: clusters)
================================================================================

OVERALL STATUS: ✅ HEALTHY

COMPONENT STATUS:
✅ HostedCluster: Available
✅ Control Plane: All pods running (0 restarts)
✅ NodePool: 3/3 nodes ready
✅ Infrastructure: AWS resources validated
✅ Network: Operational
✅ Storage/Etcd: Healthy

No critical issues found. Cluster is operating normally.

DIAGNOSTIC COMMANDS:
Run these commands for detailed information:

kubectl get hostedcluster my-cluster -n clusters -o yaml
kubectl get pods -n clusters-my-cluster
kubectl get nodepool -n clusters

Output with warnings:

HCP Cluster Health Check: test-cluster (namespace: clusters)
================================================================================

OVERALL STATUS: ⚠️  WARNING

COMPONENT STATUS:
✅ HostedCluster: Available
⚠️  Control Plane: 1 pod restarting
✅ NodePool: 2/2 nodes ready
✅ Infrastructure: Validated
✅ Network: Operational
⚠️  Storage/Etcd: High latency detected

ISSUES FOUND:

[WARNING] Control Plane Pod Restarting
- Component: kube-controller-manager
- Location: clusters-test-cluster namespace
- Restarts: 3 in the last hour
- Impact: May cause temporary API instability
- Recommended Action:
  kubectl logs -n clusters-test-cluster kube-controller-manager-xxx --previous
  Check for configuration issues or resource constraints

[WARNING] Etcd High Latency
- Component: etcd
- Metrics: Backend commit duration > 100ms
- Impact: Slower cluster operations
- Recommended Action:
  Check management cluster node performance
  Review etcd disk I/O with: kubectl exec -n clusters-test-cluster etcd-0 -- etcdctl endpoint status

DIAGNOSTIC COMMANDS:

1. Check HostedCluster details:
   kubectl get hostedcluster test-cluster -n clusters -o yaml

2. View control plane events:
   kubectl get events -n clusters-test-cluster --sort-by='.lastTimestamp' | tail -20

3. Check etcd metrics:
   kubectl exec -n clusters-test-cluster etcd-0 -- etcdctl endpoint health

Example 2: Health check with custom namespace

/hcp:cluster-health-check production-cluster prod-clusters

Performs health check on cluster "production-cluster" in the "prod-clusters" namespace.

Example 3: Verbose health check

/hcp:cluster-health-check my-cluster --verbose

Example 4: JSON output for automation

/hcp:cluster-health-check my-cluster --output-format json

Return Value

The command returns a structured health report containing:

  • OVERALL STATUS: Health summary (✅ Healthy / ⚠️ Warning / ❌ Critical)
  • COMPONENT STATUS: Status of each checked component with visual indicators
  • ISSUES FOUND: Detailed list of problems with:
    • Severity level (Critical/Warning/Info)
    • Component location
    • Impact assessment
    • Recommended actions
  • DIAGNOSTIC COMMANDS: kubectl commands for further investigation

Common Issues and Remediation

Degraded Cluster Operators

Symptoms: Cluster operators showing Degraded=True or Available=False

Investigation:

oc get clusteroperator <operator-name> -o yaml
oc logs -n openshift-<operator-namespace> -l app=<operator-name>

Remediation: Check operator logs and events for specific errors

Nodes Not Ready

Symptoms: Nodes in NotReady state

Investigation:

oc describe node <node-name>
oc get events --field-selector involvedObject.name=<node-name>

Remediation: Common causes include network issues, disk pressure, or kubelet problems
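
To see which condition is driving NotReady, the node conditions can be scanned directly (run with a kubeconfig for the hosted cluster; a sketch):

oc get nodes -o json | jq -r '
  .items[] | .metadata.name as $n | .status.conditions[]
  | select((.type == "Ready" and .status != "True")
           or ((.type | endswith("Pressure")) and .status == "True"))
  | "\($n): \(.type)=\(.status) (\(.reason))"'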

Pods in CrashLoopBackOff

Symptoms: Pods continuously restarting

Investigation:

oc logs <pod-name> -n <namespace> --previous
oc describe pod <pod-name> -n <namespace>

Remediation: Check application logs, resource limits, and configuration

ImagePullBackOff Errors

Symptoms: Pods unable to pull container images

Investigation:

oc describe pod <pod-name> -n <namespace>

Remediation: Verify image name, registry credentials, and network connectivity

Security Considerations

  • Read-only access: This command only reads cluster state; it makes no modifications
  • Sensitive data: Be cautious when sharing reports as they may contain cluster topology information
  • RBAC requirements: Ensure user has appropriate permissions for all resource types checked

Notes

  • This command requires appropriate RBAC permissions to view HostedCluster, NodePool, and Pod resources
  • The command provides diagnostic guidance but does not automatically remediate issues
  • For critical production issues, always follow your organization's incident response procedures
  • Regular health checks (daily or before changes) help catch issues early
  • Some checks may be skipped if user lacks sufficient permissions