---
name: Prow Job Analyze Metal Install Failure
description: Analyze OpenShift bare metal installation failures in Prow CI jobs using dev-scripts artifacts. Use for jobs with "metal" in the name, and for debugging Metal3/Ironic provisioning, installation, or dev-scripts setup failures. You may also use the prow-job-analyze-install-failure skill with this one.
---

# Prow Job Analyze Metal Install Failure

This skill helps debug OpenShift bare metal installation failures in CI jobs by analyzing dev-scripts logs, libvirt console logs, sosreports, and other metal-specific artifacts.

## When to Use This Skill

Use this skill when:

- A bare metal CI job fails with the "install should succeed" test failure
- The job name contains "metal" or "baremetal"
- You need to debug Metal3/Ironic provisioning issues
- You need to analyze dev-scripts setup failures

This skill is invoked by the main `prow-job-analyze-install-failure` skill when it detects a metal job.

## Metal Installation Overview

Metal IPI jobs use **dev-scripts** (https://github.com/openshift-metal3/dev-scripts) with **Metal3** and **Ironic** to install OpenShift:

- **dev-scripts**: Framework for setting up and installing OpenShift on bare metal
- **Metal3**: Kubernetes-native interface to Ironic
- **Ironic**: Bare metal provisioning service

The installation process has multiple layers:

1. **dev-scripts setup**: Configures the hypervisor, sets up Ironic/Metal3, builds the installer
2. **Ironic provisioning**: Provisions bare metal nodes (or VMs acting as bare metal)
3. **OpenShift installation**: Standard installer runs on the provisioned nodes

Failures can occur at any layer, so analysis must check all of them.

## Network Architecture (CRITICAL for Understanding IPv6/Disconnected Jobs)

**IMPORTANT**: The term "disconnected" refers to the cluster nodes, NOT the hypervisor.

### Hypervisor (dev-scripts host)

- **HAS** full internet access
- Downloads packages, container images, and dependencies from the public internet
- Runs dev-scripts Ansible playbooks that download tools (Go, installer, etc.)
- Hosts a local mirror registry to serve the cluster

### Cluster VMs/Nodes

- Run in a **private IPv6-only network** (when IP_STACK=v6)
- **NO** direct internet access (truly disconnected)
- Pull container images from the hypervisor's local mirror registry
- Access to hypervisor services only (registry, DNS, etc.)

### Common Misconception

When analyzing failures in "metal-ipi-ovn-ipv6" jobs:

- ❌ WRONG: "The hypervisor cannot access the internet, so downloads fail"
- ✅ CORRECT: "The hypervisor has internet access. If downloads fail, it's likely due to the remote service being unavailable, not network restrictions"

### Implications for Failure Analysis

1. **Dev-scripts failures** (steps 01-05): If external downloads fail, check whether the remote service/URL is down or has removed the resource
2. **Installation failures** (step 06+): If cluster nodes cannot pull images, check the local mirror registry on the hypervisor
3. **HTTP 403/404 errors during dev-scripts**: Usually mean the resource was removed from the upstream source, not that the network is restricted

## Prerequisites

1. **gcloud CLI Installation**
   - Check if installed: `which gcloud`
   - If not installed, provide instructions for the user's platform
   - Installation guide: https://cloud.google.com/sdk/docs/install

2. **gcloud Authentication (Optional)**
   - The `test-platform-results` bucket is publicly accessible
   - No authentication is required for read access
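
If helpful, a minimal preflight sketch like the following can confirm the CLI is present before any downloads are attempted (the exact commands here are illustrative, not required by the skill):

```bash
# Preflight sketch: verify the gcloud CLI is installed before fetching artifacts.
# No authentication step is included because test-platform-results is publicly readable.
if command -v gcloud >/dev/null 2>&1; then
  gcloud version | head -n 1
else
  echo "gcloud CLI not found - see https://cloud.google.com/sdk/docs/install" >&2
fi
```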

## Input Format

The user will provide:

1. **Build ID** - Extracted by the main skill
2. **Bucket path** - Extracted by the main skill
3. **Target name** - Extracted by the main skill
4. **Working directory** - Already created by the main skill

## Metal-Specific Artifacts

Metal jobs produce several diagnostic archives:

### OFCIR Acquisition Logs

- **Location**: `{target}/ofcir-acquire/`
- **Purpose**: Shows the OFCIR host acquisition process
- **Contains**:
  - `build-log.txt`: Log showing pool, provider, and host details
  - `artifacts/junit_metal_setup.xml`: JUnit with test `[sig-metal] should get working host from infra provider`
- **Critical for**: Determining if the job failed to acquire a host before installation started
- **Key information**:
  - Pool name (e.g., "cipool-ironic-cluster-el9", "cipool-ibmcloud")
  - Provider (e.g., "ironic", "equinix", "aws", "ibmcloud")
  - Host name and details

### Dev-scripts Logs

- **Location**: `{target}/baremetalds-devscripts-setup/artifacts/root/dev-scripts/logs/`
- **Purpose**: Shows the installation setup process and cluster installation
- **Contains**: Numbered log files showing each setup step (requirements, host config, Ironic setup, installer build, cluster creation). **Note**: dev-scripts invokes the installer, so installer logs (`.openshift_install*.log`) will also be present in the devscripts folders.
- **Critical for**: Early failures before cluster creation, Ironic/Metal3 setup issues, installation failures

### libvirt-logs.tar

- **Location**: `{target}/baremetalds-devscripts-gather/artifacts/`
- **Purpose**: VM/node console logs showing the boot sequence
- **Contains**: Console output from bootstrap and master VMs/nodes
- **Critical for**: Boot failures, Ignition errors, kernel panics, network configuration issues

### sosreport

- **Location**: `{target}/baremetalds-devscripts-gather/artifacts/`
- **Purpose**: Hypervisor system diagnostics
- **Contains**: Hypervisor logs, system configuration, diagnostic command output
- **Useful for**: Hypervisor-level issues, not typically needed for VM boot problems

### squid-logs.tar

- **Location**: `{target}/baremetalds-devscripts-gather/artifacts/`
- **Purpose**: Squid proxy logs for inbound CI access to the cluster
- **Contains**: Logs showing the CI system's inbound connections to the cluster under test. **Note**: The squid proxy runs on the hypervisor for INBOUND access (CI → cluster), NOT for outbound access (cluster → registry).
- **Critical for**: Debugging CI access issues to the cluster, particularly in IPv6/disconnected environments

## Implementation Steps

### Step 1: Check OFCIR Acquisition

1. **Download OFCIR logs**

   ```bash
   gcloud storage cp gs://test-platform-results/{bucket-path}/artifacts/{target}/ofcir-acquire/build-log.txt .work/prow-job-analyze-install-failure/{build_id}/logs/ofcir-build-log.txt --no-user-output-enabled 2>&1 || echo "OFCIR build log not found"
   gcloud storage cp gs://test-platform-results/{bucket-path}/artifacts/{target}/ofcir-acquire/artifacts/junit_metal_setup.xml .work/prow-job-analyze-install-failure/{build_id}/logs/junit_metal_setup.xml --no-user-output-enabled 2>&1 || echo "OFCIR JUnit not found"
   ```

2. **Check junit_metal_setup.xml for acquisition failure**
   - Read the JUnit file
   - Look for the test case: `[sig-metal] should get working host from infra provider`
   - If the test failed, OFCIR failed to acquire a host
   - This means installation never started - the failure is in host acquisition

3. **Extract OFCIR details from build-log.txt**
   - Parse the JSON in the build log to extract:
     - `pool`: The OFCIR pool name
     - `provider`: The infrastructure provider
     - `name`: The host name allocated
   - Save these for the final report

4. **If OFCIR acquisition failed**
   - Stop analysis - installation never started
   - Report: "OFCIR host acquisition failed"
   - Include pool and provider information
   - Suggest: Check OFCIR pool availability and provider status
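
A quick extraction sketch for this step is shown below; it assumes `build-log.txt` contains the JSON fields listed in item 3 and that a `<failure>` element in the JUnit marks a failed acquisition, so verify both against the actual files:

```bash
# Sketch: surface the OFCIR acquisition result from the downloaded files.
logs=.work/prow-job-analyze-install-failure/{build_id}/logs

# A <failure> element in the JUnit means host acquisition failed.
grep -q "<failure" "$logs/junit_metal_setup.xml" 2>/dev/null \
  && echo "OFCIR host acquisition FAILED - installation never started" \
  || echo "OFCIR host acquisition succeeded (or JUnit not found)"

# Pull pool/provider/name key-value pairs out of the build log for the final report.
grep -oE '"(pool|provider|name)"[[:space:]]*:[[:space:]]*"[^"]*"' "$logs/ofcir-build-log.txt" 2>/dev/null
```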

### Step 2: Download Dev-Scripts Logs

1. **Download the dev-scripts logs directory**

   ```bash
   gcloud storage cp -r gs://test-platform-results/{bucket-path}/artifacts/{target}/baremetalds-devscripts-setup/artifacts/root/dev-scripts/logs/ .work/prow-job-analyze-install-failure/{build_id}/logs/devscripts/ --no-user-output-enabled
   ```

2. **Handle missing dev-scripts logs gracefully**
   - Some metal jobs may not have dev-scripts artifacts
   - If missing, note this in the analysis and proceed with the other artifacts

### Step 3: Download libvirt Console Logs

1. **Find and download libvirt-logs.tar**

   ```bash
   gcloud storage ls -r gs://test-platform-results/{bucket-path}/artifacts/ 2>&1 | grep "libvirt-logs\.tar$"
   gcloud storage cp {full-gcs-path-to-libvirt-logs.tar} .work/prow-job-analyze-install-failure/{build_id}/logs/ --no-user-output-enabled
   ```

2. **Extract the libvirt logs**

   ```bash
   tar -xf .work/prow-job-analyze-install-failure/{build_id}/logs/libvirt-logs.tar -C .work/prow-job-analyze-install-failure/{build_id}/logs/
   ```

### Step 4: Download Optional Artifacts

1. **Download sosreport (optional)**

   ```bash
   gcloud storage ls -r gs://test-platform-results/{bucket-path}/artifacts/ 2>&1 | grep "sosreport.*\.tar\.xz$"
   gcloud storage cp {full-gcs-path-to-sosreport} .work/prow-job-analyze-install-failure/{build_id}/logs/ --no-user-output-enabled
   tar -xf .work/prow-job-analyze-install-failure/{build_id}/logs/sosreport-{name}.tar.xz -C .work/prow-job-analyze-install-failure/{build_id}/logs/
   ```

2. **Download squid-logs (optional, for IPv6/disconnected jobs)**

   ```bash
   gcloud storage ls -r gs://test-platform-results/{bucket-path}/artifacts/ 2>&1 | grep "squid-logs.*\.tar$"
   gcloud storage cp {full-gcs-path-to-squid-logs} .work/prow-job-analyze-install-failure/{build_id}/logs/ --no-user-output-enabled
   tar -xf .work/prow-job-analyze-install-failure/{build_id}/logs/squid-logs-{name}.tar -C .work/prow-job-analyze-install-failure/{build_id}/logs/
   ```

### Step 5: Analyze Dev-Scripts Logs

**Check the dev-scripts logs FIRST** - they show what happened during setup and installation.

1. **Read the dev-scripts logs in order**
   - Logs are numbered sequentially, showing the setup steps
   - **Note**: dev-scripts invokes the installer, so you'll find `.openshift_install*.log` files in the devscripts directories
   - Look for the first error or failure

2. **Key errors to look for**:
   - **Host configuration failures**: Networking, DNS, storage setup issues
   - **Ironic/Metal3 setup issues**: BMC connectivity, provisioning network, node registration failures
   - **Installer build failures**: Problems building the OpenShift installer binary
   - **Install-config validation errors**: Invalid configuration before cluster creation
   - **Installation failures**: Check the installer logs (`.openshift_install*.log`) present in the devscripts folders

3. **Important distinction**:
   - If the failure is in the dev-scripts setup logs (01-05), the problem is in the setup process
   - If the failure is in the installer logs or 06_create_cluster, the problem is in the cluster installation (also analyzed by the main skill)

4. **Save the dev-scripts analysis**:
   - Save findings to: `.work/prow-job-analyze-install-failure/{build_id}/analysis/devscripts-summary.txt`
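
A triage sketch for locating the first failing step is shown below; the error patterns are illustrative (expect some false positives) and the path assumes the logs were downloaded as in Step 2:

```bash
# Sketch: walk the numbered dev-scripts logs in order and report the first one
# containing an obvious error marker.
logs=.work/prow-job-analyze-install-failure/{build_id}/logs/devscripts
for f in "$logs"/[0-9]*.log; do
  [ -e "$f" ] || continue    # skip if the glob matched nothing
  if grep -qiE "error|fatal|failed" "$f"; then
    echo "First log with errors: $f"
    grep -inE "error|fatal|failed" "$f" | head -n 5
    break
  fi
done
```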

### Step 6: Analyze libvirt Console Logs

**Console logs are CRITICAL for metal failures during cluster creation.**

1. **Find the console logs**

   ```bash
   find .work/prow-job-analyze-install-failure/{build_id}/logs/ -name "*console*.log"
   ```

   - Look for patterns like `{cluster-name}-bootstrap_console.log`, `{cluster-name}-master-{N}_console.log`

2. **Analyze the console logs for boot/provisioning issues**:
   - **Kernel boot failures or panics**: Look for "panic", "kernel", "oops"
   - **Ignition failures**: Look for "ignition", "config fetch failed", "Ignition failed"
   - **Network configuration issues**: Look for "dhcp", "network unreachable", "DNS", "timeout"
   - **Disk mounting failures**: Look for "mount", "disk", "filesystem"
   - **Service startup failures**: Look for systemd errors, service failures

3. **Console logs show the complete boot sequence**:
   - As if you were watching a physical console
   - They show kernel messages, Ignition provisioning, and CoreOS startup
   - Critical for understanding what happened before the system was fully booted

4. **Save the console log analysis**:
   - Save findings to: `.work/prow-job-analyze-install-failure/{build_id}/analysis/console-summary.txt`

### Step 7: Analyze sosreport (If Downloaded)

**Only needed for hypervisor-level issues.**

1. **Check the sosreport for hypervisor diagnostics**:
   - `var/log/messages` - Hypervisor system log
   - `sos_commands/` - Output of diagnostic commands
   - `etc/libvirt/` - Libvirt configuration

2. **Look for hypervisor-level issues**:
   - Libvirt errors
   - Network configuration problems on the hypervisor
   - Resource constraints (CPU, memory, disk)

### Step 8: Analyze squid-logs (If Downloaded)

**Important for debugging CI access to the cluster.**

1. **Check the squid proxy logs**:
   - Look for failed connections from CI to the cluster
   - Look for HTTP errors or blocked requests
   - Check for patterns of CI test framework access issues

2. **Common issues**:
   - CI unable to connect to the cluster API
   - Proxy configuration errors blocking CI access
   - Network routing issues between CI and the cluster
   - **Note**: These logs are for INBOUND access (CI → cluster), not for the cluster's outbound access to registries
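
Before assembling the report, a grep pass over the console logs can surface the signatures listed in Step 6. This is only a sketch: the pattern list is illustrative, and the paths assume the libvirt archive was extracted as in Step 3:

```bash
# Triage sketch: grep each console log for common boot-failure signatures.
# Expect some false positives (e.g. benign "timeout" or "error" lines).
logs=.work/prow-job-analyze-install-failure/{build_id}/logs
find "$logs" -name "*console*.log" | while read -r log; do
  echo "=== $log ==="
  grep -inE "panic|oops|ignition.*(fail|error)|config fetch failed|network unreachable|dhcp.*fail" "$log" \
    | head -n 10
done
```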

### Step 9: Generate Metal-Specific Analysis Report

1. **Create a comprehensive metal analysis report**:

   ```
   Metal Installation Failure Analysis
   ====================================

   Job: {job-name}
   Build ID: {build_id}
   Prow URL: {original-url}
   Installation Method: dev-scripts + Metal3 + Ironic

   OFCIR Host Acquisition
   ----------------------
   Pool: {pool name from OFCIR build log}
   Provider: {provider from OFCIR build log}
   Host: {host name from OFCIR build log}
   Status: {Success or Failure}
   {If OFCIR acquisition failed, note that installation never started}

   Dev-Scripts Analysis
   --------------------
   {Summary of dev-scripts logs}

   Key Findings:
   - {First error in dev-scripts setup}
   - {Related errors}

   If dev-scripts failed: The problem is in the setup process (host config, Ironic, installer build)
   If dev-scripts succeeded: The problem is in cluster installation (see main analysis)

   Console Logs Analysis
   ---------------------
   {Summary of VM/node console logs}

   Bootstrap Node:
   - {Boot sequence status}
   - {Ignition status}
   - {Network configuration}
   - {Key errors}

   Master Nodes:
   - {Status for each master}
   - {Key errors}

   Hypervisor Diagnostics (sosreport)
   -----------------------------------
   {Summary of sosreport findings, if applicable}

   Proxy Logs (squid)
   ------------------
   {Summary of proxy logs, if applicable}
   Note: Squid logs show CI access to the cluster, not the cluster's registry access

   Metal-Specific Recommended Steps
   ---------------------------------
   Based on the failure:

   For dev-scripts setup failures:
   - Review host configuration (networking, DNS, storage)
   - Check Ironic/Metal3 setup logs for BMC/provisioning issues
   - Verify the installer build completed successfully
   - Check installer logs in the devscripts folders

   For console boot failures:
   - Check Ignition configuration and network connectivity
   - Review kernel boot messages for hardware issues
   - Verify network configuration (DHCP, DNS, routing)

   For CI access issues:
   - Check squid proxy logs for failed CI connections to the cluster
   - Verify network routing between CI and the cluster
   - Check the proxy configuration

   Artifacts Location
   ------------------
   Dev-scripts logs: .work/prow-job-analyze-install-failure/{build_id}/logs/devscripts/
   Console logs: .work/prow-job-analyze-install-failure/{build_id}/logs/
   sosreport: .work/prow-job-analyze-install-failure/{build_id}/logs/sosreport-*/
   squid logs: .work/prow-job-analyze-install-failure/{build_id}/logs/squid-logs-*/
   ```

2. **Save the report**:
   - Save to: `.work/prow-job-analyze-install-failure/{build_id}/analysis/metal-analysis.txt`
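
One way to seed the report file is sketched below; the heredoc only writes the header, the `{braces}` are placeholders to be filled from the earlier steps, and the exact commands are illustrative:

```bash
# Sketch: create the analysis directory and seed the report header.
out=.work/prow-job-analyze-install-failure/{build_id}/analysis
mkdir -p "$out"
cat > "$out/metal-analysis.txt" <<'EOF'
Metal Installation Failure Analysis
====================================

Job: {job-name}
Build ID: {build_id}
Prow URL: {original-url}
Installation Method: dev-scripts + Metal3 + Ironic
EOF
```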

### Step 10: Return Metal Analysis to Main Skill

1. **Provide a summary to the main skill**:
   - Brief summary of metal-specific findings
   - Indication of whether the failure was in dev-scripts setup or cluster installation
   - Key error messages and recommended actions

## Common Metal Failure Patterns

| Issue | Symptoms | Where to Look |
|-------|----------|---------------|
| **Dev-scripts host config** | Early failure before cluster creation | Dev-scripts logs (host configuration step) |
| **Ironic/Metal3 setup** | Provisioning failures, BMC errors | Dev-scripts logs (Ironic setup), Ironic logs |
| **Node boot failure** | VMs/nodes won't boot | Console logs (kernel, boot sequence) |
| **Ignition failure** | Nodes boot but don't provision | Console logs (Ignition messages) |
| **Network config** | DHCP failures, DNS issues | Console logs (network messages), dev-scripts host config |
| **CI access issues** | Tests can't connect to cluster | squid logs (proxy logs for CI → cluster access) |
| **Hypervisor issues** | Resource constraints, libvirt errors | sosreport (system logs, libvirt config) |

## Tips

- **Check dev-scripts logs FIRST**: They show setup and installation (dev-scripts invokes the installer)
- **Installer logs in devscripts**: Look for `.openshift_install*.log` files in the devscripts directories
- **Console logs are critical**: They show the actual boot sequence, like a physical console
- **Ironic/Metal3 errors** often appear in the dev-scripts setup logs
- **Squid logs are for CI access**: They show inbound CI → cluster access, not outbound cluster → registry
- **Boot vs. provisioning**: Boot failures appear in console logs, provisioning failures in Ironic logs
- **Layer distinction**: Separate dev-scripts setup from Ironic provisioning from OpenShift installation
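
For example, a quick way to surface the installer logs mentioned in the tips above (assuming the artifacts were downloaded to the working directory used throughout this skill):

```bash
# Locate installer logs that dev-scripts leaves alongside its own numbered logs.
find .work/prow-job-analyze-install-failure/{build_id}/logs/devscripts/ -name ".openshift_install*.log" 2>/dev/null
```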