Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:45:56 +08:00
commit 9fc59ddec1
11 changed files with 3200 additions and 0 deletions

---
description: Perform comprehensive health check on HCP cluster and report issues
argument-hint: <cluster-name> [namespace] [--verbose] [--output-format json|text]
---
## Name
hcp:cluster-health-check
## Synopsis
```
/hcp:cluster-health-check <cluster-name> [namespace] [--verbose] [--output-format json|text]
```
## Description
The `/hcp:cluster-health-check` command performs an extensive diagnostic of a Hosted Control Plane (HCP) cluster to assess its operational health, stability, and performance. It automates the validation of multiple control plane and infrastructure components to ensure the cluster is functioning as expected.
The command runs a comprehensive set of health checks covering control plane pods, etcd, node pools, networking, and infrastructure readiness. It also detects degraded or progressing states, certificate issues, and recent warning events.
Specifically, it performs the following:
- Detects and validates the availability of oc or kubectl CLI tools and verifies cluster connectivity.
- Checks the HostedCluster resource status (Available, Progressing, Degraded) and reports any abnormal conditions with detailed reasons and messages.
- Validates control plane pod health in the management cluster, detecting non-running pods, crash loops, high restart counts, and image pull errors.
- Performs etcd health checks to ensure data consistency and availability.
- Inspects NodePools for replica mismatches, readiness, and auto-scaling configuration accuracy.
- Evaluates infrastructure readiness (including AWS/Azure-specific validations like endpoint services).
- Examines networking and ingress components, including Konnectivity servers and router pods.
- Scans for recent warning events across HostedCluster, NodePool, and control plane namespaces.
- Reviews certificate conditions to identify expiring or invalid certificates.
- Generates a clear, color-coded summary report and optionally exports findings in JSON format for automation or CI integration.
## Prerequisites
Before using this command, ensure you have:
1. **Kubernetes/OpenShift CLI**: Either `oc` (OpenShift) or `kubectl` (Kubernetes)
- Install `oc` from: <https://mirror.openshift.com/pub/openshift-v4/clients/ocp/>
- Or install `kubectl` from: <https://kubernetes.io/docs/tasks/tools/>
- Verify with: `oc version` or `kubectl version`
2. **Active cluster connection**: Must be connected to a running cluster
- Verify with: `oc whoami` or `kubectl cluster-info`
- Ensure KUBECONFIG is set if needed
3. **Sufficient permissions**: Must have read access to cluster resources
- Cluster-admin or monitoring role recommended for comprehensive checks
- Minimum: ability to view nodes, pods, and cluster operators
## Arguments
- **<cluster-name>** (required): Name of the HostedCluster to check. This uniquely identifies the HCP instance whose health will be analyzed. Example: `hcp-demo`
- **[namespace]** (optional): The namespace in which the HostedCluster resides. Defaults to `clusters` if not provided. Example: `/hcp:cluster-health-check hcp-demo clusters-dev`
- **--verbose** (optional): Enable detailed output with additional context
- Shows resource-level details
- Includes warning conditions
- Provides remediation suggestions
- **--output-format** (optional): Output format for results
- `text` (default): Human-readable text format
- `json`: Machine-readable JSON format for automation
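If implemented as a standalone script, the flags above could be parsed with a loop like the following sketch (the flag names come from this document; the function name and structure are illustrative):

```shell
# Minimal sketch of parsing the positional arguments and flags documented above.
parse_args() {
  VERBOSE=false
  OUTPUT_FORMAT="text"
  POSITIONAL=()
  while [ $# -gt 0 ]; do
    case "$1" in
      --verbose) VERBOSE=true; shift ;;
      --output-format) OUTPUT_FORMAT="${2:?--output-format requires a value}"; shift 2 ;;
      --output-format=*) OUTPUT_FORMAT="${1#*=}"; shift ;;
      -*) echo "Unknown flag: $1" >&2; return 2 ;;
      *) POSITIONAL+=("$1"); shift ;;
    esac
  done
  CLUSTER_NAME="${POSITIONAL[0]:-}"
  NAMESPACE="${POSITIONAL[1]:-clusters}"
}

# Example:
parse_args my-cluster --verbose --output-format json
echo "cluster=$CLUSTER_NAME ns=$NAMESPACE verbose=$VERBOSE format=$OUTPUT_FORMAT"
# → cluster=my-cluster ns=clusters verbose=true format=json
```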
## Implementation
The command performs the following health checks:
### 1. Determine CLI Tool and Verify Connectivity
Detect which Kubernetes CLI is available and verify cluster connection:
```bash
if command -v oc &> /dev/null; then
CLI="oc"
elif command -v kubectl &> /dev/null; then
CLI="kubectl"
else
echo "Error: Neither 'oc' nor 'kubectl' CLI found. Please install one of them."
exit 1
fi
# Verify cluster connectivity
if ! $CLI cluster-info &> /dev/null; then
echo "Error: Not connected to a cluster. Please configure your KUBECONFIG."
exit 1
fi
```
### 2. Initialize Health Check Report
Create a report structure to collect findings:
```bash
CLUSTER_NAME=$1
NAMESPACE=${2:-"clusters"}
CONTROL_PLANE_NS="${NAMESPACE}-${CLUSTER_NAME}"
REPORT_FILE=".work/hcp-health-check/report-${CLUSTER_NAME}-$(date +%Y%m%d-%H%M%S).txt"
mkdir -p .work/hcp-health-check
# Initialize counters
CRITICAL_ISSUES=0
WARNING_ISSUES=0
INFO_MESSAGES=0
```
Arguments:
- $1 (cluster-name): Name of the HostedCluster resource to check. This is the name visible in `kubectl get hostedcluster` output. Required.
- $2 (namespace): Namespace containing the HostedCluster resource. Defaults to "clusters" if not specified. Optional.
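The snippet above initializes `REPORT_FILE`, but the checks that follow only write to stdout. One way to make the report file accumulate findings is a trio of helpers like these (an illustrative addition, not part of the original script):

```shell
# Illustrative helpers: record a finding to stdout AND the report file,
# and bump the matching counter. Assumes REPORT_FILE and the counters
# from the snippet above are already initialized.
report_critical() {
  CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
  echo "❌ CRITICAL: $*" | tee -a "$REPORT_FILE"
}
report_warning() {
  WARNING_ISSUES=$((WARNING_ISSUES + 1))
  echo "⚠️ WARNING: $*" | tee -a "$REPORT_FILE"
}
report_info() {
  INFO_MESSAGES=$((INFO_MESSAGES + 1))
  echo "ℹ️ INFO: $*" | tee -a "$REPORT_FILE"
}
```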
### 3. Check HostedCluster Status
Verify the HostedCluster resource health:
```bash
echo "Checking HostedCluster Status..."
# Check if HostedCluster exists
if ! $CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" &> /dev/null; then
echo "❌ CRITICAL: HostedCluster '$CLUSTER_NAME' not found in namespace '$NAMESPACE'"
exit 1
fi
# Get HostedCluster conditions
HC_AVAILABLE=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Available") | .status')
HC_PROGRESSING=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Progressing") | .status')
HC_DEGRADED=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Degraded") | .status')
if [ "$HC_AVAILABLE" != "True" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
echo "❌ CRITICAL: HostedCluster is not Available"
# Get reason and message
$CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Available") | " Reason: \(.reason)\n Message: \(.message)"'
fi
if [ "$HC_DEGRADED" == "True" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
echo "❌ CRITICAL: HostedCluster is Degraded"
$CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Degraded") | " Reason: \(.reason)\n Message: \(.message)"'
fi
if [ "$HC_PROGRESSING" == "True" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + 1))
echo "⚠️ WARNING: HostedCluster is Progressing"
$CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Progressing") | " Reason: \(.reason)\n Message: \(.message)"'
fi
# Check version and upgrade status
HC_VERSION=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.version.history[0].version // "unknown"')
echo " HostedCluster version: $HC_VERSION"
```
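The three condition lookups above each re-fetch the full HostedCluster object. A possible refactor (a sketch, not the original implementation) fetches the JSON once and reads each condition from the cached copy:

```shell
# Fetch the HostedCluster JSON once and read each condition from the cached
# copy, instead of calling the API server once per condition.
HC_JSON=$(${CLI:-oc} get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json 2>/dev/null || echo '{}')

# get_condition <json> <condition-type> -> prints that condition's status
get_condition() {
  echo "$1" | jq -r --arg t "$2" \
    '.status.conditions[]? | select(.type == $t) | .status'
}

HC_AVAILABLE=$(get_condition "$HC_JSON" "Available")
HC_PROGRESSING=$(get_condition "$HC_JSON" "Progressing")
HC_DEGRADED=$(get_condition "$HC_JSON" "Degraded")
```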
### 4. Check Control Plane Pod Health
Examine control plane components in the management cluster:
```bash
echo "Checking Control Plane Pods..."
# Check if control plane namespace exists
if ! $CLI get namespace "$CONTROL_PLANE_NS" &> /dev/null; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
echo "❌ CRITICAL: Control plane namespace '$CONTROL_PLANE_NS' not found"
else
# Get pods that are not Running
FAILED_CP_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r '.items[] | select(.status.phase != "Running" and .status.phase != "Succeeded") | "\(.metadata.name) [\(.status.phase)]"')
if [ -n "$FAILED_CP_PODS" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + $(echo "$FAILED_CP_PODS" | wc -l)))
echo "❌ CRITICAL: Control plane pods in failed state:"
echo "$FAILED_CP_PODS" | while read pod; do
echo " - $pod"
done
fi
# Check for pods with high restart count
HIGH_RESTART_CP_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r '.items[] | select(.status.containerStatuses[]? | .restartCount > 3) | "\(.metadata.name) [Restarts: \(.status.containerStatuses[0].restartCount)]"')
if [ -n "$HIGH_RESTART_CP_PODS" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + $(echo "$HIGH_RESTART_CP_PODS" | wc -l)))
echo "⚠️ WARNING: Control plane pods with high restart count (>3):"
echo "$HIGH_RESTART_CP_PODS" | while read pod; do
echo " - $pod"
done
fi
# Check for CrashLoopBackOff pods
CRASHLOOP_CP_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r '.items[] | select(.status.containerStatuses[]? | .state.waiting?.reason == "CrashLoopBackOff") | .metadata.name')
if [ -n "$CRASHLOOP_CP_PODS" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + $(echo "$CRASHLOOP_CP_PODS" | wc -l)))
echo "❌ CRITICAL: Control plane pods in CrashLoopBackOff:"
echo "$CRASHLOOP_CP_PODS" | while read pod; do
echo " - $pod"
done
fi
# Check for ImagePullBackOff
IMAGE_PULL_CP_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r '.items[] | select(.status.containerStatuses[]? | .state.waiting?.reason == "ImagePullBackOff" or .state.waiting?.reason == "ErrImagePull") | .metadata.name')
if [ -n "$IMAGE_PULL_CP_PODS" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + $(echo "$IMAGE_PULL_CP_PODS" | wc -l)))
echo "❌ CRITICAL: Control plane pods with image pull errors:"
echo "$IMAGE_PULL_CP_PODS" | while read pod; do
echo " - $pod"
done
fi
# Check critical control plane components
echo "Checking critical control plane components..."
CRITICAL_COMPONENTS="kube-apiserver etcd kube-controller-manager kube-scheduler"
for component in $CRITICAL_COMPONENTS; do
COMPONENT_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -l app="$component" -o json 2>/dev/null | jq -r '.items[].metadata.name')
if [ -z "$COMPONENT_PODS" ]; then
# Try alternative label
COMPONENT_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r ".items[] | select(.metadata.name | startswith(\"$component\")) | .metadata.name")
fi
if [ -z "$COMPONENT_PODS" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + 1))
echo "⚠️ WARNING: No pods found for component: $component"
else
# Check if any are not running
for pod in $COMPONENT_PODS; do
POD_STATUS=$($CLI get pod "$pod" -n "$CONTROL_PLANE_NS" -o json | jq -r '.status.phase')
if [ "$POD_STATUS" != "Running" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
echo "❌ CRITICAL: $component pod $pod is not Running (Status: $POD_STATUS)"
fi
done
fi
done
fi
```
### 5. Check Etcd Health
Verify etcd cluster health and performance:
```bash
echo "Checking Etcd Health..."
# Find etcd pods
ETCD_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -l app=etcd -o json 2>/dev/null | jq -r '.items[].metadata.name')
if [ -z "$ETCD_PODS" ]; then
ETCD_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r '.items[] | select(.metadata.name | startswith("etcd")) | .metadata.name')
fi
if [ -n "$ETCD_PODS" ]; then
for pod in $ETCD_PODS; do
# Check etcd endpoint health
ETCD_HEALTH=$($CLI exec -n "$CONTROL_PLANE_NS" "$pod" -- etcdctl endpoint health 2>/dev/null || echo "failed")
if echo "$ETCD_HEALTH" | grep -q "unhealthy"; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
echo "❌ CRITICAL: Etcd pod $pod is unhealthy"
elif echo "$ETCD_HEALTH" | grep -q "failed"; then
WARNING_ISSUES=$((WARNING_ISSUES + 1))
echo "⚠️ WARNING: Could not check etcd health for pod $pod"
fi
done
else
WARNING_ISSUES=$((WARNING_ISSUES + 1))
echo "⚠️ WARNING: No etcd pods found"
fi
```
### 6. Check NodePool Status
Verify NodePool health and scaling:
```bash
echo "Checking NodePool Status..."
# Get all NodePools for this HostedCluster
NODEPOOLS=$($CLI get nodepool -n "$NAMESPACE" -l hypershift.openshift.io/hostedcluster="$CLUSTER_NAME" -o json | jq -r '.items[].metadata.name')
if [ -z "$NODEPOOLS" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + 1))
echo "⚠️ WARNING: No NodePools found for HostedCluster $CLUSTER_NAME"
else
for nodepool in $NODEPOOLS; do
# Get NodePool status
NP_REPLICAS=$($CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.spec.replicas // 0')
NP_READY=$($CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.status.replicas // 0')
NP_AVAILABLE=$($CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Ready") | .status')
echo " NodePool: $nodepool [Ready: $NP_READY/$NP_REPLICAS]"
if [ "$NP_AVAILABLE" != "True" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
echo "❌ CRITICAL: NodePool $nodepool is not Ready"
$CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="Ready") | " Reason: \(.reason)\n Message: \(.message)"'
fi
if [ "$NP_READY" != "$NP_REPLICAS" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + 1))
echo "⚠️ WARNING: NodePool $nodepool has mismatched replicas (Ready: $NP_READY, Desired: $NP_REPLICAS)"
fi
# Check for auto-scaling
AUTOSCALING=$($CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.spec.autoScaling // empty')
if [ -n "$AUTOSCALING" ]; then
MIN=$($CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.spec.autoScaling.min')
MAX=$($CLI get nodepool "$nodepool" -n "$NAMESPACE" -o json | jq -r '.spec.autoScaling.max')
echo " Auto-scaling enabled: Min=$MIN, Max=$MAX"
fi
done
fi
```
### 7. Check Infrastructure Status
Validate infrastructure components (AWS/Azure specific):
```bash
echo "Checking Infrastructure Status..."
# Get infrastructure platform
PLATFORM=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.spec.platform.type')
echo " Platform: $PLATFORM"
# Check for infrastructure-related conditions
INFRA_READY=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="InfrastructureReady") | .status')
if [ "$INFRA_READY" != "True" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
echo "❌ CRITICAL: Infrastructure is not ready"
$CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type=="InfrastructureReady") | " Reason: \(.reason)\n Message: \(.message)"'
fi
# Platform-specific checks
if [ "$PLATFORM" == "AWS" ]; then
# Check AWS-specific resources
echo " Performing AWS-specific checks..."
# Check if endpoint service is configured
ENDPOINT_SERVICE=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.platform.aws.endpointService // "not-found"')
if [ "$ENDPOINT_SERVICE" == "not-found" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + 1))
echo "⚠️ WARNING: AWS endpoint service not configured"
fi
fi
```
### 8. Check Network and Ingress
Verify network connectivity and ingress configuration:
```bash
echo "Checking Network and Ingress..."
# Check Konnectivity server health (for private clusters)
KONNECTIVITY_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -l app=konnectivity-server -o json 2>/dev/null | jq -r '.items[].metadata.name')
if [ -z "$KONNECTIVITY_PODS" ]; then
# Fall back to matching by pod name (label conventions vary by version)
KONNECTIVITY_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r '.items[] | select(.metadata.name | contains("konnectivity")) | .metadata.name')
fi
if [ -n "$KONNECTIVITY_PODS" ]; then
for pod in $KONNECTIVITY_PODS; do
POD_STATUS=$($CLI get pod "$pod" -n "$CONTROL_PLANE_NS" -o json | jq -r '.status.phase')
if [ "$POD_STATUS" != "Running" ]; then
CRITICAL_ISSUES=$((CRITICAL_ISSUES + 1))
echo "❌ CRITICAL: Konnectivity server pod $pod is not Running"
fi
done
fi
# Check ingress/router pods
ROUTER_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -l app=router -o json 2>/dev/null | jq -r '.items[].metadata.name')
if [ -z "$ROUTER_PODS" ]; then
ROUTER_PODS=$($CLI get pods -n "$CONTROL_PLANE_NS" -o json | jq -r '.items[] | select(.metadata.name | contains("router") or contains("ingress")) | .metadata.name')
fi
if [ -n "$ROUTER_PODS" ]; then
for pod in $ROUTER_PODS; do
POD_STATUS=$($CLI get pod "$pod" -n "$CONTROL_PLANE_NS" -o json | jq -r '.status.phase')
if [ "$POD_STATUS" != "Running" ]; then
WARNING_ISSUES=$((WARNING_ISSUES + 1))
echo "⚠️ WARNING: Router/Ingress pod $pod is not Running"
fi
done
fi
```
### 9. Check Recent Events
Look for recent warning/error events:
```bash
echo "Checking Recent Events..."
# Get recent warning events for HostedCluster
HC_EVENTS=$($CLI get events -n "$NAMESPACE" --field-selector involvedObject.name="$CLUSTER_NAME",involvedObject.kind=HostedCluster,type=Warning --sort-by='.lastTimestamp' | tail -10)
if [ -n "$HC_EVENTS" ]; then
echo "⚠️ Recent Warning Events for HostedCluster:"
echo "$HC_EVENTS"
fi
# Get recent warning events in control plane namespace
CP_EVENTS=$($CLI get events -n "$CONTROL_PLANE_NS" --field-selector type=Warning --sort-by='.lastTimestamp' 2>/dev/null | tail -10)
if [ -n "$CP_EVENTS" ]; then
echo "⚠️ Recent Warning Events in Control Plane:"
echo "$CP_EVENTS"
fi
# Get recent events for NodePools
for nodepool in $NODEPOOLS; do
NP_EVENTS=$($CLI get events -n "$NAMESPACE" --field-selector involvedObject.name="$nodepool",involvedObject.kind=NodePool,type=Warning --sort-by='.lastTimestamp' 2>/dev/null | tail -5)
if [ -n "$NP_EVENTS" ]; then
echo "⚠️ Recent Warning Events for NodePool $nodepool:"
echo "$NP_EVENTS"
fi
done
```
### 10. Check Certificate Status
Verify certificate validity:
```bash
echo "Checking Certificate Status..."
# Check certificate expiration warnings
CERT_CONDITIONS=$($CLI get hostedcluster "$CLUSTER_NAME" -n "$NAMESPACE" -o json | jq -r '.status.conditions[] | select(.type | contains("Certificate")) | "\(.type): \(.status) - \(.message // "N/A")"')
if [ -n "$CERT_CONDITIONS" ]; then
# Read via a here-string: piping into `while read` would run the loop in a
# subshell and discard the WARNING_ISSUES updates
while read -r line; do
if echo "$line" | grep -q "False"; then
WARNING_ISSUES=$((WARNING_ISSUES + 1))
echo "⚠️ WARNING: $line"
fi
done <<< "$CERT_CONDITIONS"
fi
```
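The condition scan above only surfaces what controllers have already reported. To inspect a certificate's remaining lifetime directly, a hedged sketch follows; the secret name is a placeholder (actual secret names vary by HyperShift version), and `days_until` is a small self-contained helper:

```shell
# days_until <date-string> -> whole days from now until that date.
# Uses GNU `date -d`; on macOS/BSD substitute gdate.
days_until() {
  local expiry_epoch now_epoch
  expiry_epoch=$(date -d "$1" +%s) || return 1
  now_epoch=$(date +%s)
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Illustrative: decode the notAfter field of a TLS secret in the control
# plane namespace. "kube-apiserver-cert" is a PLACEHOLDER secret name.
CERT_NOT_AFTER=$($CLI get secret kube-apiserver-cert -n "$CONTROL_PLANE_NS" \
  -o jsonpath='{.data.tls\.crt}' 2>/dev/null | base64 -d 2>/dev/null \
  | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)

if [ -n "$CERT_NOT_AFTER" ]; then
  DAYS_LEFT=$(days_until "$CERT_NOT_AFTER")
  if [ "$DAYS_LEFT" -lt 30 ]; then
    WARNING_ISSUES=$((WARNING_ISSUES + 1))
    echo "⚠️ WARNING: certificate expires in $DAYS_LEFT days ($CERT_NOT_AFTER)"
  fi
fi
```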
### 11. Generate Summary Report
Create a summary of findings:
```bash
echo ""
echo "==============================================="
echo "HCP Cluster Health Check Summary"
echo "==============================================="
echo "Cluster Name: $CLUSTER_NAME"
echo "Namespace: $NAMESPACE"
echo "Control Plane Namespace: $CONTROL_PLANE_NS"
echo "Platform: $PLATFORM"
echo "Version: $HC_VERSION"
echo "Check Time: $(date)"
echo ""
echo "Results:"
echo " Critical Issues: $CRITICAL_ISSUES"
echo " Warnings: $WARNING_ISSUES"
echo ""
if [ $CRITICAL_ISSUES -eq 0 ] && [ $WARNING_ISSUES -eq 0 ]; then
echo "✅ OVERALL STATUS: HEALTHY - No issues detected"
exit 0
elif [ $CRITICAL_ISSUES -gt 0 ]; then
echo "❌ OVERALL STATUS: CRITICAL - Immediate attention required"
exit 1
else
echo "⚠️ OVERALL STATUS: WARNING - Monitoring recommended"
exit 0
fi
```
### 12. Optional: Export to JSON Format
If `--output-format json` is specified, export findings as JSON:
```json
{
"cluster": {
"name": "my-hcp-cluster",
"namespace": "clusters",
"controlPlaneNamespace": "clusters-my-hcp-cluster",
"platform": "AWS",
"version": "4.21.0",
"checkTime": "2025-11-11T12:00:00Z"
},
"summary": {
"criticalIssues": 1,
"warnings": 3,
"overallStatus": "WARNING"
},
"findings": {
"hostedCluster": {
"available": true,
"progressing": true,
"degraded": false,
"infrastructureReady": true
},
"controlPlane": {
"failedPods": [],
"crashLoopingPods": [],
"highRestartPods": ["kube-controller-manager-xxx"],
"imagePullErrors": []
},
"etcd": {
"healthy": true,
"pods": ["etcd-0", "etcd-1", "etcd-2"]
},
"nodePools": {
"total": 2,
"ready": 2,
"details": [
{
"name": "workers",
"replicas": 3,
"ready": 3,
"autoScaling": {
"enabled": true,
"min": 2,
"max": 5
}
}
]
},
"network": {
"konnectivityHealthy": true,
"ingressHealthy": true
},
"events": {
"recentWarnings": 5
}
}
}
```
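If `--output-format json` is selected, the report can be assembled from the collected shell variables with `jq` (already a dependency of the checks above). A minimal sketch covering only the `cluster` and `summary` portions of the schema shown above:

```shell
# Illustrative: assemble the "cluster" and "summary" portions of the JSON
# report from the shell variables collected earlier; jq --arg/--argjson
# keeps values safely quoted and typed.
build_json_summary() {
  jq -n \
    --arg name "$CLUSTER_NAME" \
    --arg namespace "$NAMESPACE" \
    --argjson critical "${CRITICAL_ISSUES:-0}" \
    --argjson warnings "${WARNING_ISSUES:-0}" \
    '{
      cluster: { name: $name, namespace: $namespace },
      summary: {
        criticalIssues: $critical,
        warnings: $warnings,
        overallStatus: (if $critical > 0 then "CRITICAL"
                        elif $warnings > 0 then "WARNING"
                        else "HEALTHY" end)
      }
    }'
}

# Example:
CLUSTER_NAME="my-hcp-cluster"
NAMESPACE="clusters"
CRITICAL_ISSUES=1
WARNING_ISSUES=3
build_json_summary
```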
## Examples
### Example 1: Basic health check with default namespace
```bash
/hcp:cluster-health-check my-cluster
```
Output for a healthy cluster:
```text
HCP Cluster Health Check: my-cluster (namespace: clusters)
================================================================================
OVERALL STATUS: ✅ HEALTHY
COMPONENT STATUS:
✅ HostedCluster: Available
✅ Control Plane: All pods running (0 restarts)
✅ NodePool: 3/3 nodes ready
✅ Infrastructure: AWS resources validated
✅ Network: Operational
✅ Storage/Etcd: Healthy
No critical issues found. Cluster is operating normally.
DIAGNOSTIC COMMANDS:
Run these commands for detailed information:
kubectl get hostedcluster my-cluster -n clusters -o yaml
kubectl get pods -n clusters-my-cluster
kubectl get nodepool -n clusters
```
Output with warnings:
```text
HCP Cluster Health Check: test-cluster (namespace: clusters)
================================================================================
OVERALL STATUS: ⚠️ WARNING
COMPONENT STATUS:
✅ HostedCluster: Available
⚠️ Control Plane: 1 pod restarting
✅ NodePool: 2/2 nodes ready
✅ Infrastructure: Validated
✅ Network: Operational
⚠️ Storage/Etcd: High latency detected
ISSUES FOUND:
[WARNING] Control Plane Pod Restarting
- Component: kube-controller-manager
- Location: clusters-test-cluster namespace
- Restarts: 3 in the last hour
- Impact: May cause temporary API instability
- Recommended Action:
kubectl logs -n clusters-test-cluster kube-controller-manager-xxx --previous
Check for configuration issues or resource constraints
[WARNING] Etcd High Latency
- Component: etcd
- Metrics: Backend commit duration > 100ms
- Impact: Slower cluster operations
- Recommended Action:
Check management cluster node performance
Review etcd disk I/O with: kubectl exec -n clusters-test-cluster etcd-0 -- etcdctl endpoint status
DIAGNOSTIC COMMANDS:
1. Check HostedCluster details:
kubectl get hostedcluster test-cluster -n clusters -o yaml
2. View control plane events:
kubectl get events -n clusters-test-cluster --sort-by='.lastTimestamp' | tail -20
3. Check etcd metrics:
kubectl exec -n clusters-test-cluster etcd-0 -- etcdctl endpoint health
```
### Example 2: Health check with custom namespace
```bash
/hcp:cluster-health-check production-cluster prod-clusters
```
Performs health check on cluster "production-cluster" in the "prod-clusters" namespace.
### Example 3: Verbose health check
```bash
/hcp:cluster-health-check my-cluster --verbose
```
### Example 4: JSON output for automation
```bash
/hcp:cluster-health-check my-cluster --output-format json
```
## Return Value
The command returns a structured health report containing:
- **OVERALL STATUS**: Health summary (Healthy ✅ / Warning ⚠️ / Critical ❌)
- **COMPONENT STATUS**: Status of each checked component with visual indicators
- **ISSUES FOUND**: Detailed list of problems with:
- Severity level (Critical/Warning/Info)
- Component location
- Impact assessment
- Recommended actions
- **DIAGNOSTIC COMMANDS**: kubectl commands for further investigation
## Common Issues and Remediation
### Degraded Cluster Operators
**Symptoms**: Cluster operators showing Degraded=True or Available=False
**Investigation**:
```bash
oc get clusteroperator <operator-name> -o yaml
oc logs -n openshift-<operator-namespace> -l app=<operator-name>
```
**Remediation**: Check operator logs and events for specific errors
### Nodes Not Ready
**Symptoms**: Nodes in NotReady state
**Investigation**:
```bash
oc describe node <node-name>
oc get events --field-selector involvedObject.name=<node-name>
```
**Remediation**: Common causes include network issues, disk pressure, or kubelet problems
### Pods in CrashLoopBackOff
**Symptoms**: Pods continuously restarting
**Investigation**:
```bash
oc logs <pod-name> -n <namespace> --previous
oc describe pod <pod-name> -n <namespace>
```
**Remediation**: Check application logs, resource limits, and configuration
### ImagePullBackOff Errors
**Symptoms**: Pods unable to pull container images
**Investigation**:
```bash
oc describe pod <pod-name> -n <namespace>
```
**Remediation**: Verify image name, registry credentials, and network connectivity
## Security Considerations
- **Read-only access**: This command only reads cluster state, no modifications
- **Sensitive data**: Be cautious when sharing reports as they may contain cluster topology information
- **RBAC requirements**: Ensure user has appropriate permissions for all resource types checked
## Notes
- This command requires appropriate RBAC permissions to view HostedCluster, NodePool, and Pod resources
- The command provides diagnostic guidance but does not automatically remediate issues
- For critical production issues, always follow your organization's incident response procedures
- Regular health checks (daily or before changes) help catch issues early
- Some checks may be skipped if user lacks sufficient permissions

commands/generate.md

---
description: Generate ready-to-execute hypershift cluster creation commands from natural language descriptions
argument-hint: <provider> <cluster-description>
---
## Name
hcp:generate
## Synopsis
```
/hcp:generate <provider> <cluster-description>
```
## Description
The `hcp:generate` command translates natural language descriptions into precise, ready-to-execute `hypershift create cluster` commands. It supports multiple cloud providers and platforms, each with their specific requirements and best practices.
**Important**: This command **generates commands for you to run** - it does not provision clusters directly.
This command is particularly useful for:
- Generating complete, copy-paste-ready hypershift commands with proper parameters
- Applying provider-specific best practices and configurations automatically
- Handling complex parameter validation and smart defaults
- Providing interactive prompts for missing critical information
- Learning proper hypershift command syntax and options
## Key Features
- **Multi-Provider Support** - AWS, Azure, KubeVirt, OpenStack, PowerVS, and Agent providers
- **Smart Analysis** - Extracts platform, configuration, and requirements from natural language
- **Interactive Prompts** - Asks for missing critical information with helpful guidance
- **Provider Expertise** - Applies platform-specific best practices and configurations
- **Security Validation** - Ensures safe parameter handling and credential management
- **Namespace Management** - Implements best practices for cluster isolation
## Implementation
The `hcp:generate` command runs in multiple phases:
### 🎯 Phase 1: Load Provider-Specific Implementation Guidance
Invoke the appropriate skill based on provider using the Skill tool:
- **Provider: `aws`** → Invoke `hcp-create-aws` skill
- Loads AWS-specific requirements and configurations
- Provides STS credentials handling
- Offers region and availability zone guidance
- Handles IAM roles and VPC configuration
- **Provider: `azure`** → Invoke `hcp-create-azure` skill
- Loads Azure-specific requirements (self-managed control plane only)
- Provides resource group and location guidance
- Handles identity configuration options
- Manages virtual network integration
- **Provider: `kubevirt`** → Invoke `hcp-create-kubevirt` skill
- Loads KubeVirt-specific network conflict prevention
- Provides virtual machine configuration guidance
- Handles storage class requirements
- Manages IPv4/IPv6 CIDR validation
- **Provider: `openstack`** → Invoke `hcp-create-openstack` skill
- Loads OpenStack-specific requirements
- Provides floating IP network guidance
- Handles flavor selection and custom images
- Manages network topology configuration
- **Provider: `powervs`** → Invoke `hcp-create-powervs` skill
- Loads PowerVS/IBM Cloud specific requirements
- Provides region and zone guidance
- Handles IBM Cloud API key configuration
- Manages processor and memory specifications
- **Provider: `agent`** → Invoke `hcp-create-agent` skill
- Loads bare metal and edge deployment guidance
- Provides pre-provisioned agent requirements
- Handles manual network configuration
- Manages disconnected environment setup
### 📝 Phase 2: Parse Arguments & Detect Context
Parse command arguments:
- **Required:** `provider`, `cluster-description`
- Parse natural language description for:
- Environment type (development, production, disconnected)
- Special requirements (FIPS, architecture, storage)
- Resource constraints (cost-optimization, performance)
- Network requirements
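The parsing in this phase is performed by the model rather than by code, but the kinds of signals it extracts can be illustrated with a simple keyword scan (purely illustrative; the real command interprets descriptions with far more context):

```shell
# Purely illustrative keyword scan for the signals Phase 2 extracts.
classify_description() {
  local desc
  desc=$(echo "$1" | tr '[:upper:]' '[:lower:]')
  case "$desc" in
    *prod*)       echo "environment=production" ;;
    *dev*|*test*) echo "environment=development" ;;
    *)            echo "environment=unspecified" ;;
  esac
  case "$desc" in
    *fips*) echo "fips=true" ;;
  esac
  case "$desc" in
    *cheap*|*cost*) echo "profile=cost-optimized" ;;
    *perf*|*fast*)  echo "profile=performance" ;;
  esac
}

classify_description "development cluster for testing new features"
# → environment=development
```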
### ⚙️ Phase 3: Apply Provider-Specific Defaults and Validation
**Universal requirements (applied to ALL clusters):**
- **Namespace strategy:** Generate unique namespace based on cluster name (`{cluster-name}-ns`)
- **Security validation:** Scan for credentials or sensitive data
- **Release image:** Always include `--release-image` with proper version
**Provider-specific validation:**
- Network conflict prevention (especially KubeVirt)
- Credential requirements validation
- Region/zone availability checks
- Resource limit validation
### 💬 Phase 4: Interactive Prompts (Provider-Guided)
Each provider skill guides the collection of missing information:
**Common prompts for all providers:**
- Cluster name (if not specified)
- Pull secret path
- OpenShift version/release image
- Base domain (where applicable)
**Provider-specific prompts:**
- AWS: STS credentials, IAM role ARN, region selection
- Azure: Identity configuration method, resource group name
- KubeVirt: Management cluster network CIDRs, storage classes
- OpenStack: External network UUID, flavor selection
- PowerVS: IBM Cloud resource group, processor specifications
- Agent: Agent namespace, pre-provisioned agent details
### 🔒 Phase 5: Security and Configuration Validation
**Security checks:**
- Scan all inputs for credentials, API keys, or secrets
- Validate that sensitive information uses placeholder values
- Ensure proper credential file references
**Configuration validation:**
- Verify required parameters are present
- Validate parameter combinations and dependencies
- Check for common misconfigurations
### ✅ Phase 6: Generate Command
Based on provider skill guidance:
- Construct complete `hypershift create cluster` command
- Apply smart defaults for optional parameters
- Include all required provider-specific flags
- Format command for copy-paste execution
### 📤 Phase 7: Return Result
Display to user:
- **Summary:** Brief description of what will be created
- **Generated Command:** Complete, executable command
- **Key Decisions:** Explanation of important choices made
- **Next Steps:** What to do after running the command
- **Provider-specific notes:** Any special considerations
## Usage Examples
1. **Create AWS development cluster**:
```
/hcp:generate aws "development cluster for testing new features"
```
→ Invokes `hcp-create-aws` skill, prompts for AWS-specific details
2. **Create production KubeVirt cluster**:
```
/hcp:generate kubevirt "production cluster with high availability"
```
→ Invokes `hcp-create-kubevirt` skill, handles network conflict prevention
3. **Create cost-optimized Azure cluster**:
```
/hcp:generate azure "small cluster for dev work, minimize costs"
```
→ Invokes `hcp-create-azure` skill, applies cost optimization
4. **Create disconnected agent cluster**:
```
/hcp:generate agent "airgapped cluster for secure environment"
```
→ Invokes `hcp-create-agent` skill, handles disconnected requirements
5. **Create FIPS-enabled OpenStack cluster**:
```
/hcp:generate openstack "production cluster with FIPS compliance"
```
→ Invokes `hcp-create-openstack` skill, applies FIPS configuration
6. **Create ARM-based PowerVS cluster**:
```
/hcp:generate powervs "arm64 cluster for multi-arch testing"
```
→ Invokes `hcp-create-powervs` skill, handles ARM architecture
## Arguments
- **$1 provider** *(required)*
Cloud provider or platform to use.
**Options:** `aws` | `azure` | `kubevirt` | `openstack` | `powervs` | `agent`
- **$2 cluster-description** *(required)*
Natural language description of the desired cluster.
Use quotes for multi-word descriptions: `"production cluster with HA"`
**Description should include:**
- Environment type (development, production, testing)
- Special requirements (FIPS, architecture, storage)
- Resource preferences (cost-optimized, high-performance)
- Network requirements (disconnected, private)
## Return Value
- **Summary**: Brief description of the cluster that will be created
- **Generated Command**: Complete `hypershift create cluster` command
- **Key Decisions**: Explanation of choices made during generation
- **Next Steps**: Instructions for executing the command and post-creation tasks
**Example output:**
````
## Summary
Creating a development AWS hosted cluster with basic configuration.

## Generated Command
```bash
hypershift create cluster aws \
  --name dev-cluster \
  --namespace dev-cluster-ns \
  --region us-east-1 \
  --instance-type m5.large \
  --pull-secret /path/to/pull-secret.json \
  --node-pool-replicas 2 \
  --zones us-east-1a,us-east-1b \
  --control-plane-availability-policy SingleReplica \
  --sts-creds /path/to/sts-creds.json \
  --role-arn arn:aws:iam::123456789:role/hypershift-role \
  --base-domain example.com \
  --release-image quay.io/openshift-release-dev/ocp-release:4.18.0-multi
```

## Key Decisions
- Used SingleReplica for development (cost-effective)
- Selected 2 zones for basic redundancy
- m5.large instances balance cost and performance for dev workloads
- **Unique namespace**: `dev-cluster-ns` for better isolation and disaster recovery

## Next Steps
1. Ensure your pull secret file exists at the specified path
2. Verify AWS credentials are configured
3. Confirm STS credentials file is accessible
4. Run the command above to create your cluster
````
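After running the generated command, creation progress can be followed with standard `oc` queries against the HostedCluster and NodePool resources (a sketch — the resource names below match the example output above):

```
# Watch the HostedCluster until its Available condition is True
oc get hostedcluster dev-cluster -n dev-cluster-ns -w

# Confirm NodePool replicas are provisioning and becoming ready
oc get nodepool -n dev-cluster-ns
```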
## Error Handling
### Invalid Provider
**Scenario:** User specifies unsupported provider.
**Action:**
```
Invalid provider "gcp". Supported providers: aws, azure, kubevirt, openstack, powervs, agent
Did you mean "aws"?
```
### Missing Description
**Scenario:** User doesn't provide cluster description.
**Action:**
```
Cluster description is required. Please describe what kind of cluster you want.
Examples:
- "development cluster for testing"
- "production cluster with high availability"
- "cost-optimized cluster for demos"
Usage: /hcp:generate aws "development cluster for testing"
```
### Ambiguous Requirements
**Scenario:** Description is too vague or contradictory.
**Action:**
```
The description "fast and cheap cluster" has conflicting requirements.
Let me help clarify:
1. Performance-optimized (higher costs, better resources)
2. Cost-optimized (lower costs, minimal resources)
3. Balanced (moderate costs and performance)
Which approach do you prefer?
```
### Provider-Specific Errors
**Scenario:** Provider-specific validation fails.
**Action:**
- Forward to appropriate skill for specialized error handling
- Provide provider-specific guidance and solutions
- Offer alternative configurations when possible
## Best Practices
1. **Be descriptive:** Include environment type, requirements, and constraints
2. **Specify architecture:** Mention if you need ARM64 or specific architectures
3. **Include network needs:** Specify if disconnected or special networking required
4. **Mention compliance:** Include FIPS, security, or regulatory requirements
5. **Consider costs:** Specify if cost optimization is important
6. **Plan for growth:** Mention if cluster needs to scale or handle specific workloads
## Anti-Patterns to Avoid
❌ **Vague descriptions**
```
/hcp:generate aws "cluster"
```
✅ Be specific: "development cluster for API testing with minimal resources"
❌ **Conflicting requirements**
```
/hcp:generate aws "high-performance cluster but very cheap"
```
✅ Be realistic: "balanced cluster optimizing for cost while maintaining decent performance"
❌ **Provider mismatches**
```
/hcp:generate azure "cluster for my on-premises lab"
```
✅ Use appropriate provider: "kubevirt cluster for my on-premises lab"
## See Also
- `hypershift create cluster --help` - Official hypershift CLI documentation
- Provider-specific skills:
- `hcp-create-aws` - AWS-specific guidance
- `hcp-create-azure` - Azure-specific guidance
- `hcp-create-kubevirt` - KubeVirt-specific guidance
- `hcp-create-openstack` - OpenStack-specific guidance
- `hcp-create-powervs` - PowerVS-specific guidance
- `hcp-create-agent` - Agent provider guidance
## Skills Reference
The following skills are automatically invoked by this command based on provider:
**Provider-specific skills:**
- **hcp-create-aws** - AWS provider implementation details
- **hcp-create-azure** - Azure provider implementation details
- **hcp-create-kubevirt** - KubeVirt provider implementation details
- **hcp-create-openstack** - OpenStack provider implementation details
- **hcp-create-powervs** - PowerVS provider implementation details
- **hcp-create-agent** - Agent provider implementation details
To view skill details:
```bash
ls plugins/hypershift/skills/
cat plugins/hypershift/skills/hcp-create-aws/SKILL.md
cat plugins/hypershift/skills/hcp-create-kubevirt/SKILL.md
# ... etc for other providers
```