Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:45:56 +08:00
commit 9fc59ddec1
11 changed files with 3200 additions and 0 deletions

---
description: Perform comprehensive health check on HCP cluster and report issues
argument-hint: <cluster-name> [namespace] [--verbose] [--output-format json|text]
---
## Name
hcp:cluster-health-check
## Synopsis
```
/hcp:cluster-health-check <cluster-name> [namespace] [--verbose] [--output-format json|text]
```
## Description
The `/hcp:cluster-health-check` command performs an extensive diagnostic of a Hosted Control Plane (HCP) cluster to assess its operational health, stability, and performance. It automates the validation of multiple control plane and infrastructure components to ensure the cluster is functioning as expected.
The command runs a comprehensive set of health checks covering control plane pods, etcd, node pools, networking, and infrastructure readiness. It also detects degraded or progressing states, certificate issues, and recent warning events.
Specifically, it performs the following:
- Detects and validates the availability of oc or kubectl CLI tools and verifies cluster connectivity.
- Checks the HostedCluster resource status (Available, Progressing, Degraded) and reports any abnormal conditions with detailed reasons and messages.
- Validates control plane pod health in the management cluster, detecting non-running pods, crash loops, high restart counts, and image pull errors.
- Performs etcd health checks to ensure data consistency and availability.
- Inspects NodePools for replica mismatches, readiness, and auto-scaling configuration accuracy.
- Evaluates infrastructure readiness (including AWS/Azure-specific validations like endpoint services).
- Examines networking and ingress components, including Konnectivity servers and router pods.
- Scans for recent warning events across HostedCluster, NodePool, and control plane namespaces.
- Reviews certificate conditions to identify expiring or invalid certificates.
- Generates a clear, color-coded summary report and optionally exports findings in JSON format for automation or CI integration.
## Prerequisites
Before using this command, ensure you have:
1. **Kubernetes/OpenShift CLI**: Either `oc` (OpenShift) or `kubectl` (Kubernetes)
- Install `oc` from: <https://mirror.openshift.com/pub/openshift-v4/clients/ocp/>
- Or install `kubectl` from: <https://kubernetes.io/docs/tasks/tools/>
- Verify with: `oc version` or `kubectl version`
2. **Active cluster connection**: Must be connected to a running cluster
- Verify with: `oc whoami` or `kubectl cluster-info`
- Ensure KUBECONFIG is set if needed
3. **Sufficient permissions**: Must have read access to cluster resources
- Cluster-admin or monitoring role recommended for comprehensive checks
- Minimum: ability to view nodes, pods, and cluster operators
## Arguments
- **<cluster-name>** (required): Name of the HostedCluster to check. This uniquely identifies the HCP instance whose health will be analyzed. Example: `hcp-demo`
- **[namespace]** (optional): The namespace in which the HostedCluster resides. Defaults to `clusters` if not provided. Example: `/hcp:cluster-health-check hcp-demo clusters-dev`
- **--verbose** (optional): Enable detailed output with additional context
- Shows resource-level details
- Includes warning conditions
- Provides remediation suggestions
- **--output-format** (optional): Output format for results
- `text` (default): Human-readable text format
- `json`: Machine-readable JSON format for automation
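If implemented as a standalone script, the flags above could be parsed with a loop like the following sketch (the flag names come from this document; the function name and structure are illustrative):

```shell
# Minimal sketch of parsing the positional arguments and flags documented above.
parse_args() {
  VERBOSE=false
  OUTPUT_FORMAT="text"
  POSITIONAL=()
  while [ $# -gt 0 ]; do
    case "$1" in
      --verbose) VERBOSE=true; shift ;;
      --output-format) OUTPUT_FORMAT="${2:?--output-format requires a value}"; shift 2 ;;
      --output-format=*) OUTPUT_FORMAT="${1#*=}"; shift ;;
      -*) echo "Unknown flag: $1" >&2; return 2 ;;
      *) POSITIONAL+=("$1"); shift ;;
    esac
  done
  CLUSTER_NAME="${POSITIONAL[0]:-}"
  NAMESPACE="${POSITIONAL[1]:-clusters}"
}

# Example:
parse_args my-cluster --verbose --output-format json
echo "cluster=$CLUSTER_NAME ns=$NAMESPACE verbose=$VERBOSE format=$OUTPUT_FORMAT"
# → cluster=my-cluster ns=clusters verbose=true format=json
```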
## Implementation
The command performs the following health checks:
### 1. Determine CLI Tool and Verify Connectivity
Detect which Kubernetes CLI is available and verify cluster connection:
```bash
if command -v oc &> /dev/null; then
CLI="oc"
elif command -v kubectl &> /dev/null; then
CLI="kubectl"
else
echo "Error: Neither 'oc' nor 'kubectl' CLI found. Please install one of them."
exit 1
fi
# Verify cluster connectivity
if ! $CLI cluster-info &> /dev/null; then
echo "Error: Not connected to a cluster. Please configure your KUBECONFIG."
exit 1
fi
```
### 2. Initialize Health Check Report
Create a report structure to collect findings:
```bash
CLUSTER_NAME=$1
NAMESPACE=${2:-"clusters"}
CONTROL_PLANE_NS="${NAMESPACE}-${CLUSTER_NAME}"
REPORT_FILE=".work/hcp-health-check/report-${CLUSTER_NAME}-$(date +%Y%m%d-%H%M%S).txt"
mkdir -p .work/hcp-health-check
# Initialize counters
CRITICAL_ISSUES=0
WARNING_ISSUES=0
INFO_MESSAGES=0
```
Arguments:
- $1 (cluster-name): Name of the HostedCluster resource to check. This is the name visible in `kubectl get hostedcluster` output. Required.
- $2 (namespace): Namespace containing the HostedCluster resource. Defaults to "clusters" if not specified. Optional.
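The snippet above initializes `REPORT_FILE`, but the checks that follow only write to stdout. One way to make the report file accumulate findings is a trio of helpers like these (an illustrative addition, not part of the original script):

```shell
# Illustrative helpers: record a finding to stdout AND the report file,
# and bump the matching counter. Assumes REPORT_FILE and the counters
# from the snippet above are already initialized.
report_critical() {
  CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
  echo "❌ CRITICAL: $*" | tee -a "$REPORT_FILE"
}
report_warning() {
  WARNING_ISSUES=$((WARNING_ISSUES + 1))
  echo "⚠️ WARNING: $*" | tee -a "$REPORT_FILE"
}
report_info() {
  INFO_MESSAGES=$((INFO_MESSAGES + 1))
  echo "ℹ️ INFO: $*" | tee -a "$REPORT_FILE"
}
```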
### 3. Check HostedCluster Status
Verify the HostedCluster resource health:
```bash
echo "Checking HostedCluster Status..."
# Check if HostedCluster exists
if ! $CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" &> /dev/null; then
echo "❌ CRITICAL: HostedCluster '$CLUSTER_NAME' not found in namespace '$NAMESPACE'"
exit 1
fi
# Get HostedCluster conditions
HC_AVAILABLE=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Available") | .status')
HC_PROGRESSING=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Progressing") | .status')
HC_DEGRADED=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Degraded") | .status')
if [ "$HC_AVAILABLE" != "True" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
echo "❌ CRITICAL: HostedCluster is not Available"
# Get reason and message
$CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Available") | " Reason: \(.reason)\n Message: \(.message)"'
fi
if [ "$HC_DEGRADED" == "True" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
echo "❌ CRITICAL: HostedCluster is Degraded"
$CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Degraded") | " Reason: \(.reason)\n Message: \(.message)"'
fi
if [ "$HC_PROGRESSING" == "True" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + 1))
echo "⚠️ WARNING: HostedCluster is Progressing"
$CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Progressing") | " Reason: \(.reason)\n Message: \(.message)"'
fi
# Check version and upgrade status
HC_VERSION=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.version.history[0].version // "unknown"')
echo " HostedCluster version: $HC_VERSION"
```
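The three condition lookups above each re-fetch the full HostedCluster object. A possible refactor (a sketch, not the original implementation) fetches the JSON once and reads each condition from the cached copy:

```shell
# Fetch the HostedCluster JSON once and read each condition from the cached
# copy, instead of calling the API server once per condition.
HC_JSON=$(${CLI:-oc} get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json 2>/dev/null || echo '{}')

# get_condition <json> <condition-type> -> prints that condition's status
get_condition() {
  echo "$1" | jq -r --arg t "$2" \
    '.status.conditions[]? | select(.type == $t) | .status'
}

HC_AVAILABLE=$(get_condition "$HC_JSON" "Available")
HC_PROGRESSING=$(get_condition "$HC_JSON" "Progressing")
HC_DEGRADED=$(get_condition "$HC_JSON" "Degraded")
```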
### 4. Check Control Plane Pod Health
Examine control plane components in the management cluster:
```bash
echo "Checking Control Plane Pods..."
# Check if control plane namespace exists
if ! $CLI get namespace "$CONTROL_PLANE_NS" &> /dev/null; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
echo "❌ CRITICAL: Control plane namespace '$CONTROL_PLANE_NS' not found"
else
# Get pods that are not Running
FAILED_CP_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r '.items[] | select(.status.phase != "Running" and .status.phase != "Succeeded") | "\(.metadata.name) [\(.status.phase)]"')
if [ -n "$FAILED_CP_PODS" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + $(echo "$FAILED_CP_PODS" | wc -l)))
echo "❌ CRITICAL: Control plane pods in failed state:"
echo "$FAILED_CP_PODS" | while read pod; do
echo " - $pod"
done
fi
# Check for pods with high restart count
HIGH_RESTART_CP_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r '.items[] | select(.status.containerStatuses[]? | .restartCount > 3) | "\(.metadata.name) [Restarts: \(.status.containerStatuses[0].restartCount)]"')
if [ -n "$HIGH_RESTART_CP_PODS" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + $(echo "$HIGH_RESTART_CP_PODS" | wc -l)))
echo "⚠️ WARNING: Control plane pods with high restart count (>3):"
echo "$HIGH_RESTART_CP_PODS" | while read pod; do
echo " - $pod"
done
fi
# Check for CrashLoopBackOff pods
CRASHLOOP_CP_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r '.items[] | select(.status.containerStatuses[]? | .state.waiting?.reason == "CrashLoopBackOff") | .metadata.name')
if [ -n "$CRASHLOOP_CP_PODS" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + $(echo "$CRASHLOOP_CP_PODS" | wc -l)))
echo "❌ CRITICAL: Control plane pods in CrashLoopBackOff:"
echo "$CRASHLOOP_CP_PODS" | while read pod; do
echo " - $pod"
done
fi
# Check for ImagePullBackOff
IMAGE_PULL_CP_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r '.items[] | select(.status.containerStatuses[]? | .state.waiting?.reason == "ImagePullBackOff" or .state.waiting?.reason == "ErrImagePull") | .metadata.name')
if [ -n "$IMAGE_PULL_CP_PODS" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + $(echo "$IMAGE_PULL_CP_PODS" | wc -l)))
echo "❌ CRITICAL: Control plane pods with image pull errors:"
echo "$IMAGE_PULL_CP_PODS" | while read pod; do
echo " - $pod"
done
fi
# Check critical control plane components
echo "Checking critical control plane components..."
CRITICAL_COMPONENTS="kube-apiserver etcd kube-controller-manager kube-scheduler"
for component in $CRITICAL_COMPONENTS; do
COMPONENT_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -l app="$component" -o json 2>/dev/null | jq -r '.items[].metadata.name')
if [ -z "$COMPONENT_PODS" ]; then
# Try alternative label
COMPONENT_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r ".items[] | select(.metadata.name | startswith(\"$component\")) | .metadata.name")
fi
if [ -z "$COMPONENT_PODS" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + 1))
echo "⚠️ WARNING: No pods found for component: $component"
else
# Check if any are not running
for pod in $COMPONENT_PODS; do
POD_STATUS=$($CLI get pod "$pod" -n "$CONTROL_PLANE_NS" -o json | jq -r '.status.phase')
if [ "$POD_STATUS" != "Running" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
echo "❌ CRITICAL: $component pod $pod is not Running (Status: $POD_STATUS)"
fi
done
fi
done
fi
```
### 5. Check Etcd Health
Verify etcd cluster health and performance:
```bash
echo "Checking Etcd Health..."
# Find etcd pods
ETCD_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -l app=etcd -o json 2>/dev/null | jq -r '.items[].metadata.name')
if [ -z "$ETCD_PODS" ]; then
ETCD_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r '.items[] | select(.metadata.name | startswith("etcd")) | .metadata.name')
fi
if [ -n "$ETCD_PODS" ]; then
for pod in $ETCD_PODS; do
# Check etcd endpoint health
ETCD_HEALTH=$($CLI exec -n "$CONTROL_PLANE_NS" "$pod" -- etcdctl endpoint health 2>/dev/null || echo "failed")
if echo "$ETCD_HEALTH" | grep -q "unhealthy"; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
echo "❌ CRITICAL: Etcd pod $pod is unhealthy"
elif echo "$ETCD_HEALTH" | grep -q "failed"; then
WARNING_ISSUES=$((WARNING_ISSUES + 1))
echo "⚠️ WARNING: Could not check etcd health for pod $pod"
fi
done
else
WARNING_ISSUES=$((WARNING_ISSUES + 1))
echo "⚠️ WARNING: No etcd pods found"
fi
```
### 6. Check NodePool Status
Verify NodePool health and scaling:
```bash
echo "Checking NodePool Status..."
# Get all NodePools for this HostedCluster
NODEPOOLS=$($CLI get nodepool -n "$NAMESPACE" -l hypershift.openshift.io/hostedcluster="$CLUSTER_NAME" -o json | jq -r '.items[].metadata.name')
if [ -z "$NODEPOOLS" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + 1))
echo "⚠️ WARNING: No NodePools found for HostedCluster $CLUSTER_NAME"
else
for nodepool in $NODEPOOLS; do
# Get NodePool status
NP_REPLICAS=$($CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.spec.replicas // 0')
NP_READY=$($CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.status.replicas // 0')
NP_AVAILABLE=$($CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Ready") | .status')
echo " NodePool: $nodepool [Ready: $NP_READY/$NP_REPLICAS]"
if [ "$NP_AVAILABLE" != "True" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
echo "❌ CRITICAL: NodePool $nodepool is not Ready"
$CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Ready") | " Reason: \(.reason)\n Message: \(.message)"'
fi
if [ "$NP_READY" != "$NP_REPLICAS" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + 1))
echo "⚠️ WARNING: NodePool $nodepool has mismatched replicas (Ready: $NP_READY, Desired: $NP_REPLICAS)"
fi
# Check for auto-scaling
AUTOSCALING=$($CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.spec.autoScaling // empty')
if [ -n "$AUTOSCALING" ]; then
MIN=$($CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.spec.autoScaling.min')
MAX=$($CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.spec.autoScaling.max')
echo " Auto-scaling enabled: Min=$MIN, Max=$MAX"
fi
done
fi
```
### 7. Check Infrastructure Status
Validate infrastructure components (AWS/Azure specific):
```bash
echo "Checking Infrastructure Status..."
# Get infrastructure platform
PLATFORM=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.spec.platform.type')
echo " Platform: $PLATFORM"
# Check for infrastructure-related conditions
INFRA_READY=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="InfrastructureReady") | .status')
if [ "$INFRA_READY" != "True" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
echo "❌ CRITICAL: Infrastructure is not ready"
$CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="InfrastructureReady") | " Reason: \(.reason)\n Message: \(.message)"'
fi
# Platform-specific checks
if [ "$PLATFORM" == "AWS" ]; then
# Check AWS-specific resources
echo " Performing AWS-specific checks..."
# Check if endpoint service is configured
ENDPOINT_SERVICE=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.platform.aws.endpointService // "not-found"')
if [ "$ENDPOINT_SERVICE" == "not-found" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + 1))
echo "⚠️ WARNING: AWS endpoint service not configured"
fi
fi
```
### 8. Check Network and Ingress
Verify network connectivity and ingress configuration:
```bash
echo "Checking Network and Ingress..."
# Check Konnectivity server health (for private clusters)
KONNECTIVITY_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -l app=konnectivity-server -o json 2>/dev/null | jq -r '.items[].metadata.name')
if [ -z "$KONNECTIVITY_PODS" ]; then
# Fall back to matching by pod name (label conventions vary by version)
KONNECTIVITY_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r '.items[] | select(.metadata.name | contains("konnectivity")) | .metadata.name')
fi
if [ -n "$KONNECTIVITY_PODS" ]; then
for pod in $KONNECTIVITY_PODS; do
POD_STATUS=$($CLI get pod "$pod" -n "$CONTROL_PLANE_NS" -o json | jq -r '.status.phase')
if [ "$POD_STATUS" != "Running" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
echo "❌ CRITICAL: Konnectivity server pod $pod is not Running"
fi
done
fi
# Check ingress/router pods
ROUTER_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -l app=router -o json 2>/dev/null | jq -r '.items[].metadata.name')
if [ -z "$ROUTER_PODS" ]; then
ROUTER_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r '.items[] | select(.metadata.name | contains("router") or contains("ingress")) | .metadata.name')
fi
if [ -n "$ROUTER_PODS" ]; then
for pod in $ROUTER_PODS; do
POD_STATUS=$($CLI get pod "$pod" -n "$CONTROL_PLANE_NS" -o json | jq -r '.status.phase')
if [ "$POD_STATUS" != "Running" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + 1))
echo "⚠️ WARNING: Router/Ingress pod $pod is not Running"
fi
done
fi
```
### 9. Check Recent Events
Look for recent warning/error events:
```bash
echo "Checking Recent Events..."
# Get recent warning events for HostedCluster
HC_EVENTS=$($CLI get events -n "$NAMESPACE" --field-selector involvedObject.name="$CLUSTER_NAME",involvedObject.kind=HostedCluster,type=Warning --sort-by='.lastTimestamp' | tail -10)
if [ -n "$HC_EVENTS" ]; then
echo "⚠️ Recent Warning Events for HostedCluster:"
echo "$HC_EVENTS"
fi
# Get recent warning events in control plane namespace
CP_EVENTS=$($CLI get events -n "$CONTROL_PLANE_NS" --field-selector type=Warning --sort-by='.lastTimestamp' 2>/dev/null | tail -10)
if [ -n "$CP_EVENTS" ]; then
echo "⚠️ Recent Warning Events in Control Plane:"
echo "$CP_EVENTS"
fi
# Get recent events for NodePools
for nodepool in $NODEPOOLS; do
NP_EVENTS=$($CLI get events -n "$NAMESPACE" --field-selector involvedObject.name="$nodepool",involvedObject.kind=NodePool,type=Warning --sort-by='.lastTimestamp' 2>/dev/null | tail -5)
if [ -n "$NP_EVENTS" ]; then
echo "⚠️ Recent Warning Events for NodePool $nodepool:"
echo "$NP_EVENTS"
fi
done
```
### 10. Check Certificate Status
Verify certificate validity:
```bash
echo "Checking Certificate Status..."
# Check certificate expiration warnings
CERT_CONDITIONS=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type | contains("Certificate")) | "\(.type): \(.status) - \(.message // "N/A")"')
if [ -n "$CERT_CONDITIONS" ]; then
# Read via a here-string: piping into `while read` would run the loop in a
# subshell and discard the WARNING_ISSUES updates
while read -r line; do
if echo "$line" | grep -q "False"; then
WARNING_ISSUES=$((WARNING_ISSUES + 1))
echo "⚠️ WARNING: $line"
fi
done <<< "$CERT_CONDITIONS"
fi
```
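The condition scan above only surfaces what controllers have already reported. To inspect a certificate's remaining lifetime directly, a hedged sketch follows; the secret name is a placeholder (actual secret names vary by HyperShift version), and `days_until` is a small self-contained helper:

```shell
# days_until <date-string> -> whole days from now until that date.
# Uses GNU `date -d`; on macOS/BSD substitute gdate.
days_until() {
  local expiry_epoch now_epoch
  expiry_epoch=$(date -d "$1" +%s) || return 1
  now_epoch=$(date +%s)
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Illustrative: decode the notAfter field of a TLS secret in the control
# plane namespace. "kube-apiserver-cert" is a PLACEHOLDER secret name.
CERT_NOT_AFTER=$($CLI get secret kube-apiserver-cert -n "$CONTROL_PLANE_NS" \
  -o jsonpath='{.data.tls\.crt}' 2>/dev/null | base64 -d 2>/dev/null \
  | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)

if [ -n "$CERT_NOT_AFTER" ]; then
  DAYS_LEFT=$(days_until "$CERT_NOT_AFTER")
  if [ "$DAYS_LEFT" -lt 30 ]; then
    WARNING_ISSUES=$((WARNING_ISSUES + 1))
    echo "⚠️ WARNING: certificate expires in $DAYS_LEFT days ($CERT_NOT_AFTER)"
  fi
fi
```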
### 11. Generate Summary Report
Create a summary of findings:
```bash
echo ""
echo "==============================================="
echo "HCP Cluster Health Check Summary"
echo "==============================================="
echo "Cluster Name: $CLUSTER_NAME"
echo "Namespace: $NAMESPACE"
echo "Control Plane Namespace: $CONTROL_PLANE_NS"
echo "Platform: $PLATFORM"
echo "Version: $HC_VERSION"
echo "Check Time: $(date)"
echo ""
echo "Results:"
echo " Critical Issues: $CRITICAL_ISSUES"
echo " Warnings: $WARNING_ISSUES"
echo ""
if [ $CRITICAL_ISSUES -eq 0 ] && [ $WARNING_ISSUES -eq 0 ]; then
echo "✅ OVERALL STATUS: HEALTHY - No issues detected"
exit 0
elif [ $CRITICAL_ISSUES -gt 0 ]; then
echo "❌ OVERALL STATUS: CRITICAL - Immediate attention required"
exit 1
else
echo "⚠️ OVERALL STATUS: WARNING - Monitoring recommended"
exit 0
fi
```
### 12. Optional: Export to JSON Format
If `--output-format json` is specified, export findings as JSON:
```json
{
"cluster": {
"name": "my-hcp-cluster",
"namespace": "clusters",
"controlPlaneNamespace": "clusters-my-hcp-cluster",
"platform": "AWS",
"version": "4.21.0",
"checkTime": "2025-11-11T12:00:00Z"
},
"summary": {
"criticalIssues": 1,
"warnings": 3,
"overallStatus": "WARNING"
},
"findings": {
"hostedCluster": {
"available": true,
"progressing": true,
"degraded": false,
"infrastructureReady": true
},
"controlPlane": {
"failedPods": [],
"crashLoopingPods": [],
"highRestartPods": ["kube-controller-manager-xxx"],
"imagePullErrors": []
},
"etcd": {
"healthy": true,
"pods": ["etcd-0", "etcd-1", "etcd-2"]
},
"nodePools": {
"total": 2,
"ready": 2,
"details": [
{
"name": "workers",
"replicas": 3,
"ready": 3,
"autoScaling": {
"enabled": true,
"min": 2,
"max": 5
}
}
]
},
"network": {
"konnectivityHealthy": true,
"ingressHealthy": true
},
"events": {
"recentWarnings": 5
}
}
}
```
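If `--output-format json` is selected, the report can be assembled from the collected shell variables with `jq` (already a dependency of the checks above). A minimal sketch covering only the `cluster` and `summary` portions of the schema shown above:

```shell
# Illustrative: assemble the "cluster" and "summary" portions of the JSON
# report from the shell variables collected earlier; jq --arg/--argjson
# keeps values safely quoted and typed.
build_json_summary() {
  jq -n \
    --arg name "$CLUSTER_NAME" \
    --arg namespace "$NAMESPACE" \
    --argjson critical "${CRITICAL_ISSUES:-0}" \
    --argjson warnings "${WARNING_ISSUES:-0}" \
    '{
      cluster: { name: $name, namespace: $namespace },
      summary: {
        criticalIssues: $critical,
        warnings: $warnings,
        overallStatus: (if $critical > 0 then "CRITICAL"
                        elif $warnings > 0 then "WARNING"
                        else "HEALTHY" end)
      }
    }'
}

# Example:
CLUSTER_NAME="my-hcp-cluster"
NAMESPACE="clusters"
CRITICAL_ISSUES=1
WARNING_ISSUES=3
build_json_summary
```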
## Examples
### Example 1: Basic health check with default namespace
```bash
/hcp:cluster-health-check my-cluster
```
Output for a healthy cluster:
```text
HCP Cluster Health Check: my-cluster (namespace: clusters)
================================================================================
OVERALL STATUS: ✅ HEALTHY
COMPONENT STATUS:
✅ HostedCluster: Available
✅ Control Plane: All pods running (0 restarts)
✅ NodePool: 3/3 nodes ready
✅ Infrastructure: AWS resources validated
✅ Network: Operational
✅ Storage/Etcd: Healthy
No critical issues found. Cluster is operating normally.
DIAGNOSTIC COMMANDS:
Run these commands for detailed information:
kubectl get hostedcluster my-cluster -n clusters -o yaml
kubectl get pods -n clusters-my-cluster
kubectl get nodepool -n clusters
```
Output with warnings:
```text
HCP Cluster Health Check: test-cluster (namespace: clusters)
================================================================================
OVERALL STATUS: ⚠️ WARNING
COMPONENT STATUS:
✅ HostedCluster: Available
⚠️ Control Plane: 1 pod restarting
✅ NodePool: 2/2 nodes ready
✅ Infrastructure: Validated
✅ Network: Operational
⚠️ Storage/Etcd: High latency detected
ISSUES FOUND:
[WARNING] Control Plane Pod Restarting
- Component: kube-controller-manager
- Location: clusters-test-cluster namespace
- Restarts: 3 in the last hour
- Impact: May cause temporary API instability
- Recommended Action:
kubectl logs -n clusters-test-cluster kube-controller-manager-xxx --previous
Check for configuration issues or resource constraints
[WARNING] Etcd High Latency
- Component: etcd
- Metrics: Backend commit duration > 100ms
- Impact: Slower cluster operations
- Recommended Action:
Check management cluster node performance
Review etcd disk I/O with: kubectl exec -n clusters-test-cluster etcd-0 -- etcdctl endpoint status
DIAGNOSTIC COMMANDS:
1. Check HostedCluster details:
kubectl get hostedcluster test-cluster -n clusters -o yaml
2. View control plane events:
kubectl get events -n clusters-test-cluster --sort-by='.lastTimestamp' | tail -20
3. Check etcd metrics:
kubectl exec -n clusters-test-cluster etcd-0 -- etcdctl endpoint health
```
### Example 2: Health check with custom namespace
```bash
/hcp:cluster-health-check production-cluster prod-clusters
```
Performs health check on cluster "production-cluster" in the "prod-clusters" namespace.
### Example 3: Verbose health check
```bash
/hcp:cluster-health-check my-cluster --verbose
```
### Example 4: JSON output for automation
```bash
/hcp:cluster-health-check my-cluster --output-format json
```
## Return Value
The command returns a structured health report containing:
- **OVERALL STATUS**: Health summary (Healthy ✅ / Warning ⚠️ / Critical ❌)
- **COMPONENT STATUS**: Status of each checked component with visual indicators
- **ISSUES FOUND**: Detailed list of problems with:
- Severity level (Critical/Warning/Info)
- Component location
- Impact assessment
- Recommended actions
- **DIAGNOSTIC COMMANDS**: kubectl commands for further investigation
## Common Issues and Remediation
### Degraded Cluster Operators
**Symptoms**: Cluster operators showing Degraded=True or Available=False
**Investigation**:
```bash
oc get clusteroperator <operator-name> -o yaml
oc logs -n openshift-<operator-namespace> -l app=<operator-name>
```
**Remediation**: Check operator logs and events for specific errors
### Nodes Not Ready
**Symptoms**: Nodes in NotReady state
**Investigation**:
```bash
oc describe node <node-name>
oc get events --field-selector involvedObject.name=<node-name>
```
**Remediation**: Common causes include network issues, disk pressure, or kubelet problems
### Pods in CrashLoopBackOff
**Symptoms**: Pods continuously restarting
**Investigation**:
```bash
oc logs <pod-name> -n <namespace> --previous
oc describe pod <pod-name> -n <namespace>
```
**Remediation**: Check application logs, resource limits, and configuration
### ImagePullBackOff Errors
**Symptoms**: Pods unable to pull container images
**Investigation**:
```bash
oc describe pod <pod-name> -n <namespace>
```
**Remediation**: Verify image name, registry credentials, and network connectivity
## Security Considerations
- **Read-only access**: This command only reads cluster state, no modifications
- **Sensitive data**: Be cautious when sharing reports as they may contain cluster topology information
- **RBAC requirements**: Ensure user has appropriate permissions for all resource types checked
## Notes
- This command requires appropriate RBAC permissions to view HostedCluster, NodePool, and Pod resources
- The command provides diagnostic guidance but does not automatically remediate issues
- For critical production issues, always follow your organization's incident response procedures
- Regular health checks (daily or before changes) help catch issues early
- Some checks may be skipped if user lacks sufficient permissions

commands/generate.md

---
description: Generate ready-to-execute hypershift cluster creation commands from natural language descriptions
argument-hint: <provider> <cluster-description>
---
## Name
hcp:generate
## Synopsis
```
/hcp:generate <provider> <cluster-description>
```
## Description
The `hcp:generate` command translates natural language descriptions into precise, ready-to-execute `hypershift create cluster` commands. It supports multiple cloud providers and platforms, each with their specific requirements and best practices.
**Important**: This command **generates commands for you to run** - it does not provision clusters directly.
This command is particularly useful for:
- Generating complete, copy-paste-ready hypershift commands with proper parameters
- Applying provider-specific best practices and configurations automatically
- Handling complex parameter validation and smart defaults
- Providing interactive prompts for missing critical information
- Learning proper hypershift command syntax and options
## Key Features
- **Multi-Provider Support** - AWS, Azure, KubeVirt, OpenStack, PowerVS, and Agent providers
- **Smart Analysis** - Extracts platform, configuration, and requirements from natural language
- **Interactive Prompts** - Asks for missing critical information with helpful guidance
- **Provider Expertise** - Applies platform-specific best practices and configurations
- **Security Validation** - Ensures safe parameter handling and credential management
- **Namespace Management** - Implements best practices for cluster isolation
## Implementation
The `hcp:generate` command runs in multiple phases:
### 🎯 Phase 1: Load Provider-Specific Implementation Guidance
Invoke the appropriate skill based on provider using the Skill tool:
- **Provider: `aws`** → Invoke `hcp-create-aws` skill
- Loads AWS-specific requirements and configurations
- Provides STS credentials handling
- Offers region and availability zone guidance
- Handles IAM roles and VPC configuration
- **Provider: `azure`** → Invoke `hcp-create-azure` skill
- Loads Azure-specific requirements (self-managed control plane only)
- Provides resource group and location guidance
- Handles identity configuration options
- Manages virtual network integration
- **Provider: `kubevirt`** → Invoke `hcp-create-kubevirt` skill
- Loads KubeVirt-specific network conflict prevention
- Provides virtual machine configuration guidance
- Handles storage class requirements
- Manages IPv4/IPv6 CIDR validation
- **Provider: `openstack`** → Invoke `hcp-create-openstack` skill
- Loads OpenStack-specific requirements
- Provides floating IP network guidance
- Handles flavor selection and custom images
- Manages network topology configuration
- **Provider: `powervs`** → Invoke `hcp-create-powervs` skill
- Loads PowerVS/IBM Cloud specific requirements
- Provides region and zone guidance
- Handles IBM Cloud API key configuration
- Manages processor and memory specifications
- **Provider: `agent`** → Invoke `hcp-create-agent` skill
- Loads bare metal and edge deployment guidance
- Provides pre-provisioned agent requirements
- Handles manual network configuration
- Manages disconnected environment setup
### 📝 Phase 2: Parse Arguments & Detect Context
Parse command arguments:
- **Required:** `provider`, `cluster-description`
- Parse natural language description for:
- Environment type (development, production, disconnected)
- Special requirements (FIPS, architecture, storage)
- Resource constraints (cost-optimization, performance)
- Network requirements
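The parsing in this phase is performed by the model rather than by code, but the kinds of signals it extracts can be illustrated with a simple keyword scan (purely illustrative; the real command interprets descriptions with far more context):

```shell
# Purely illustrative keyword scan for the signals Phase 2 extracts.
classify_description() {
  local desc
  desc=$(echo "$1" | tr '[:upper:]' '[:lower:]')
  case "$desc" in
    *prod*)       echo "environment=production" ;;
    *dev*|*test*) echo "environment=development" ;;
    *)            echo "environment=unspecified" ;;
  esac
  case "$desc" in
    *fips*) echo "fips=true" ;;
  esac
  case "$desc" in
    *cheap*|*cost*) echo "profile=cost-optimized" ;;
    *perf*|*fast*)  echo "profile=performance" ;;
  esac
}

classify_description "development cluster for testing new features"
# → environment=development
```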
### ⚙️ Phase 3: Apply Provider-Specific Defaults and Validation
**Universal requirements (applied to ALL clusters):**
- **Namespace strategy:** Generate unique namespace based on cluster name (`{cluster-name}-ns`)
- **Security validation:** Scan for credentials or sensitive data
- **Release image:** Always include `--release-image` with proper version
**Provider-specific validation:**
- Network conflict prevention (especially KubeVirt)
- Credential requirements validation
- Region/zone availability checks
- Resource limit validation
### 💬 Phase 4: Interactive Prompts (Provider-Guided)
Each provider skill guides the collection of missing information:
**Common prompts for all providers:**
- Cluster name (if not specified)
- Pull secret path
- OpenShift version/release image
- Base domain (where applicable)
**Provider-specific prompts:**
- AWS: STS credentials, IAM role ARN, region selection
- Azure: Identity configuration method, resource group name
- KubeVirt: Management cluster network CIDRs, storage classes
- OpenStack: External network UUID, flavor selection
- PowerVS: IBM Cloud resource group, processor specifications
- Agent: Agent namespace, pre-provisioned agent details
### 🔒 Phase 5: Security and Configuration Validation
**Security checks:**
- Scan all inputs for credentials, API keys, or secrets
- Validate that sensitive information uses placeholder values
- Ensure proper credential file references
**Configuration validation:**
- Verify required parameters are present
- Validate parameter combinations and dependencies
- Check for common misconfigurations
### ✅ Phase 6: Generate Command
Based on provider skill guidance:
- Construct complete `hypershift create cluster` command
- Apply smart defaults for optional parameters
- Include all required provider-specific flags
- Format command for copy-paste execution
### 📤 Phase 7: Return Result
Display to user:
- **Summary:** Brief description of what will be created
- **Generated Command:** Complete, executable command
- **Key Decisions:** Explanation of important choices made
- **Next Steps:** What to do after running the command
- **Provider-specific notes:** Any special considerations
## Usage Examples
1. **Create AWS development cluster**:
```
/hcp:generate aws "development cluster for testing new features"
```
→ Invokes `hcp-create-aws` skill, prompts for AWS-specific details
2. **Create production KubeVirt cluster**:
```
/hcp:generate kubevirt "production cluster with high availability"
```
→ Invokes `hcp-create-kubevirt` skill, handles network conflict prevention
3. **Create cost-optimized Azure cluster**:
```
/hcp:generate azure "small cluster for dev work, minimize costs"
```
→ Invokes `hcp-create-azure` skill, applies cost optimization
4. **Create disconnected agent cluster**:
```
/hcp:generate agent "airgapped cluster for secure environment"
```
→ Invokes `hcp-create-agent` skill, handles disconnected requirements
5. **Create FIPS-enabled OpenStack cluster**:
```
/hcp:generate openstack "production cluster with FIPS compliance"
```
→ Invokes `hcp-create-openstack` skill, applies FIPS configuration
6. **Create ARM-based PowerVS cluster**:
```
/hcp:generate powervs "arm64 cluster for multi-arch testing"
```
→ Invokes `hcp-create-powervs` skill, handles ARM architecture
## Arguments
- **$1 provider** *(required)*
Cloud provider or platform to use.
**Options:** `aws` | `azure` | `kubevirt` | `openstack` | `powervs` | `agent`
- **$2 cluster-description** *(required)*
Natural language description of the desired cluster.
Use quotes for multi-word descriptions: `"production cluster with HA"`
**Description should include:**
- Environment type (development, production, testing)
- Special requirements (FIPS, architecture, storage)
- Resource preferences (cost-optimized, high-performance)
- Network requirements (disconnected, private)
## Return Value
- **Summary**: Brief description of the cluster that will be created
- **Generated Command**: Complete `hypershift create cluster` command
- **Key Decisions**: Explanation of choices made during generation
- **Next Steps**: Instructions for executing the command and post-creation tasks
**Example output:**
````
## Summary
Creating a development AWS hosted cluster with basic configuration.

## Generated Command
```bash
hypershift create cluster aws \
  --name dev-cluster \
  --namespace dev-cluster-ns \
  --region us-east-1 \
  --instance-type m5.large \
  --pull-secret /path/to/pull-secret.json \
  --node-pool-replicas 2 \
  --zones us-east-1a,us-east-1b \
  --control-plane-availability-policy SingleReplica \
  --sts-creds /path/to/sts-creds.json \
  --role-arn arn:aws:iam::123456789:role/hypershift-role \
  --base-domain example.com \
  --release-image quay.io/openshift-release-dev/ocp-release:4.18.0-multi
```

## Key Decisions
- Used SingleReplica for development (cost-effective)
- Selected 2 zones for basic redundancy
- m5.large instances balance cost and performance for dev workloads
- **Unique namespace**: `dev-cluster-ns` for better isolation and disaster recovery

## Next Steps
1. Ensure your pull secret file exists at the specified path
2. Verify AWS credentials are configured
3. Confirm STS credentials file is accessible
4. Run the command above to create your cluster
````
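After running the generated command, creation progress can be followed with standard `oc` queries against the HostedCluster and NodePool resources (a sketch — the resource names below match the example output above):

```
# Watch the HostedCluster until its Available condition is True
oc get hostedcluster dev-cluster -n dev-cluster-ns -w

# Confirm NodePool replicas are provisioning and becoming ready
oc get nodepool -n dev-cluster-ns
```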
## Error Handling
### Invalid Provider
**Scenario:** User specifies unsupported provider.
**Action:**
```
Invalid provider "gcp". Supported providers: aws, azure, kubevirt, openstack, powervs, agent
Did you mean "aws"?
```
### Missing Description
**Scenario:** User doesn't provide cluster description.
**Action:**
```
Cluster description is required. Please describe what kind of cluster you want.
Examples:
- "development cluster for testing"
- "production cluster with high availability"
- "cost-optimized cluster for demos"
Usage: /hcp:generate aws "development cluster for testing"
```
### Ambiguous Requirements
**Scenario:** Description is too vague or contradictory.
**Action:**
```
The description "fast and cheap cluster" has conflicting requirements.
Let me help clarify:
1. Performance-optimized (higher costs, better resources)
2. Cost-optimized (lower costs, minimal resources)
3. Balanced (moderate costs and performance)
Which approach do you prefer?
```
### Provider-Specific Errors
**Scenario:** Provider-specific validation fails.
**Action:**
- Forward to appropriate skill for specialized error handling
- Provide provider-specific guidance and solutions
- Offer alternative configurations when possible
## Best Practices
1. **Be descriptive:** Include environment type, requirements, and constraints
2. **Specify architecture:** Mention if you need ARM64 or specific architectures
3. **Include network needs:** Specify if disconnected or special networking required
4. **Mention compliance:** Include FIPS, security, or regulatory requirements
5. **Consider costs:** Specify if cost optimization is important
6. **Plan for growth:** Mention if cluster needs to scale or handle specific workloads
## Anti-Patterns to Avoid
❌ **Vague descriptions**
```
/hcp:generate aws "cluster"
```
✅ Be specific: "development cluster for API testing with minimal resources"
❌ **Conflicting requirements**
```
/hcp:generate aws "high-performance cluster but very cheap"
```
✅ Be realistic: "balanced cluster optimizing for cost while maintaining decent performance"
❌ **Provider mismatches**
```
/hcp:generate azure "cluster for my on-premises lab"
```
✅ Use appropriate provider: "kubevirt cluster for my on-premises lab"
## See Also
- `hypershift create cluster --help` - Official hypershift CLI documentation
- Provider-specific skills:
- `hcp-create-aws` - AWS-specific guidance
- `hcp-create-azure` - Azure-specific guidance
- `hcp-create-kubevirt` - KubeVirt-specific guidance
- `hcp-create-openstack` - OpenStack-specific guidance
- `hcp-create-powervs` - PowerVS-specific guidance
- `hcp-create-agent` - Agent provider guidance
## Skills Reference
The following skills are automatically invoked by this command based on provider:
**Provider-specific skills:**
- **hcp-create-aws** - AWS provider implementation details
- **hcp-create-azure** - Azure provider implementation details
- **hcp-create-kubevirt** - KubeVirt provider implementation details
- **hcp-create-openstack** - OpenStack provider implementation details
- **hcp-create-powervs** - PowerVS provider implementation details
- **hcp-create-agent** - Agent provider implementation details
To view skill details:
```bash
ls plugins/hypershift/skills/
cat plugins/hypershift/skills/hcp-create-aws/SKILL.md
cat plugins/hypershift/skills/hcp-create-kubevirt/SKILL.md
# ... etc for other providers
```