name: Prow Job Analyze Metal Install Failure
description: Analyze OpenShift bare metal installation failures in Prow CI jobs using dev-scripts artifacts. Use for jobs with "metal" in the name and for debugging Metal3/Ironic provisioning, installation, or dev-scripts setup failures. You may also use the prow-job-analyze-install-failure skill together with this one.

Prow Job Analyze Metal Install Failure

This skill helps debug OpenShift bare metal installation failures in CI jobs by analyzing dev-scripts logs, libvirt console logs, sosreports, and other metal-specific artifacts.

When to Use This Skill

Use this skill when:

  • A bare metal CI job fails with an "install should succeed" test failure
  • The job name contains "metal" or "baremetal"
  • You need to debug Metal3/Ironic provisioning issues
  • You need to analyze dev-scripts setup failures

This skill is invoked by the main prow-job-analyze-install-failure skill when it detects a metal job.

Metal Installation Overview

Metal IPI jobs use dev-scripts (https://github.com/openshift-metal3/dev-scripts) with Metal3 and Ironic to install OpenShift:

  • dev-scripts: Framework for setting up and installing OpenShift on bare metal
  • Metal3: Kubernetes-native interface to Ironic
  • Ironic: Bare metal provisioning service

The installation process has multiple layers:

  1. dev-scripts setup: Configures hypervisor, sets up Ironic/Metal3, builds installer
  2. Ironic provisioning: Provisions bare metal nodes (or VMs acting as bare metal)
  3. OpenShift installation: Standard installer runs on provisioned nodes

Failures can occur at any layer, so analysis must check all of them.

Network Architecture (CRITICAL for Understanding IPv6/Disconnected Jobs)

IMPORTANT: The term "disconnected" refers to the cluster nodes, NOT the hypervisor.

Hypervisor (dev-scripts host)

  • HAS full internet access
  • Downloads packages, container images, and dependencies from the public internet
  • Runs dev-scripts Ansible playbooks that download tools (Go, installer, etc.)
  • Hosts a local mirror registry to serve the cluster

Cluster VMs/Nodes

  • Run in a private IPv6-only network (when IP_STACK=v6)
  • NO direct internet access (truly disconnected)
  • Pull container images from the hypervisor's local mirror registry
  • Access to hypervisor services only (registry, DNS, etc.)

Common Misconception

When analyzing failures in "metal-ipi-ovn-ipv6" jobs:

  • WRONG: "The hypervisor cannot access the internet, so downloads fail"
  • CORRECT: "The hypervisor has internet access. If downloads fail, it's likely due to the remote service being unavailable, not network restrictions"

Implications for Failure Analysis

  1. Dev-scripts failures (steps 01-05): If external downloads fail, check if the remote service/URL is down or has removed the resource
  2. Installation failures (step 06+): If cluster nodes cannot pull images, check the local mirror registry on the hypervisor
  3. HTTP 403/404 errors during dev-scripts: Usually means the resource was removed from the upstream source, not that the network is restricted
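
If a dev-scripts step reports a failed download, a quick scan like the one below can confirm whether it was an HTTP 403/404 from the upstream source. This is a minimal sketch: it assumes the dev-scripts logs have already been downloaded as described in Step 2, and the error strings are common examples, not an exhaustive list.

    # Hypothetical scan: surface upstream HTTP errors in the downloaded dev-scripts logs
    grep -rinE 'HTTP (403|404)|403 Forbidden|404 Not Found' \
      .work/prow-job-analyze-install-failure/{build_id}/logs/devscripts/ | head -20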

Prerequisites

  1. gcloud CLI Installation

  2. gcloud Authentication (Optional)

    • The test-platform-results bucket is publicly accessible
    • No authentication is required for read access
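
Because the bucket is public, a simple listing is enough to confirm read access (a sketch that assumes only that the gcloud CLI is installed; no login is required):

    # Should list bucket contents without any prior `gcloud auth login`
    gcloud storage ls gs://test-platform-results/ | head -5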

Input Format

The user will provide:

  1. Build ID - Extracted by the main skill
  2. Bucket path - Extracted by the main skill
  3. Target name - Extracted by the main skill
  4. Working directory - Already created by main skill

Metal-Specific Artifacts

Metal jobs produce several diagnostic archives:

OFCIR Acquisition Logs

  • Location: {target}/ofcir-acquire/
  • Purpose: Shows the OFCIR host acquisition process
  • Contains:
    • build-log.txt: Log showing pool, provider, and host details
    • artifacts/junit_metal_setup.xml: JUnit with test [sig-metal] should get working host from infra provider
  • Critical for: Determining if the job failed to acquire a host before installation started
  • Key information:
    • Pool name (e.g., "cipool-ironic-cluster-el9", "cipool-ibmcloud")
    • Provider (e.g., "ironic", "equinix", "aws", "ibmcloud")
    • Host name and details

Dev-scripts Logs

  • Location: {target}/baremetalds-devscripts-setup/artifacts/root/dev-scripts/logs/
  • Purpose: Shows installation setup process and cluster installation
  • Contains: Numbered log files showing each setup step (requirements, host config, Ironic setup, installer build, cluster creation). Note: dev-scripts invokes the installer, so installer logs (.openshift_install*.log) will also be present in the devscripts folders.
  • Critical for: Early failures before cluster creation, Ironic/Metal3 setup issues, installation failures

libvirt-logs.tar

  • Location: {target}/baremetalds-devscripts-gather/artifacts/
  • Purpose: VM/node console logs showing boot sequence
  • Contains: Console output from bootstrap and master VMs/nodes
  • Critical for: Boot failures, Ignition errors, kernel panics, network configuration issues

sosreport

  • Location: {target}/baremetalds-devscripts-gather/artifacts/
  • Purpose: Hypervisor system diagnostics
  • Contains: Hypervisor logs, system configuration, diagnostic command output
  • Useful for: Hypervisor-level issues, not typically needed for VM boot problems

squid-logs.tar

  • Location: {target}/baremetalds-devscripts-gather/artifacts/
  • Purpose: Squid proxy logs for inbound CI access to the cluster
  • Contains: Logs showing CI system's inbound connections to the cluster under test. Note: The squid proxy runs on the hypervisor for INBOUND access (CI → cluster), NOT for outbound access (cluster → registry).
  • Critical for: Debugging CI access issues to the cluster, particularly in IPv6/disconnected environments

Implementation Steps

Step 1: Check OFCIR Acquisition

  1. Download OFCIR logs

    gcloud storage cp gs://test-platform-results/{bucket-path}/artifacts/{target}/ofcir-acquire/build-log.txt .work/prow-job-analyze-install-failure/{build_id}/logs/ofcir-build-log.txt --no-user-output-enabled 2>&1 || echo "OFCIR build log not found"
    gcloud storage cp gs://test-platform-results/{bucket-path}/artifacts/{target}/ofcir-acquire/artifacts/junit_metal_setup.xml .work/prow-job-analyze-install-failure/{build_id}/logs/junit_metal_setup.xml --no-user-output-enabled 2>&1 || echo "OFCIR JUnit not found"
    
  2. Check junit_metal_setup.xml for acquisition failure

    • Read the JUnit file
    • Look for test case: [sig-metal] should get working host from infra provider
    • If the test failed, OFCIR failed to acquire a host
    • This means installation never started - the failure is in host acquisition
  3. Extract OFCIR details from build-log.txt

    • Parse the JSON in the build log (see the sketch after this list) to extract:
      • pool: The OFCIR pool name
      • provider: The infrastructure provider
      • name: The host name allocated
    • Save these for the final report
  4. If OFCIR acquisition failed

    • Stop analysis - installation never started
    • Report: "OFCIR host acquisition failed"
    • Include pool and provider information
    • Suggest: Check OFCIR pool availability and provider status
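
A minimal sketch for extracting the pool, provider, and host name from the OFCIR build log, as mentioned in item 3 above. It assumes the build log contains a single-line JSON object with those fields; the exact log layout may differ between jobs.

    # Pull the first JSON object out of the OFCIR build log and read the key fields
    grep -o '{.*}' .work/prow-job-analyze-install-failure/{build_id}/logs/ofcir-build-log.txt | head -1 | \
      jq -r '"pool: \(.pool)\nprovider: \(.provider)\nhost: \(.name)"'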

Step 2: Download Dev-Scripts Logs

  1. Download dev-scripts logs directory

    gcloud storage cp -r gs://test-platform-results/{bucket-path}/artifacts/{target}/baremetalds-devscripts-setup/artifacts/root/dev-scripts/logs/ .work/prow-job-analyze-install-failure/{build_id}/logs/devscripts/ --no-user-output-enabled
    
  2. Handle missing dev-scripts logs gracefully

    • Some metal jobs may not have dev-scripts artifacts
    • If missing, note this in the analysis and proceed with other artifacts
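
One way to keep this step tolerant of missing artifacts (a sketch; substitute the real bucket path, target, and build ID):

    # Download dev-scripts logs, but continue the analysis if they do not exist
    if ! gcloud storage cp -r gs://test-platform-results/{bucket-path}/artifacts/{target}/baremetalds-devscripts-setup/artifacts/root/dev-scripts/logs/ \
        .work/prow-job-analyze-install-failure/{build_id}/logs/devscripts/ --no-user-output-enabled 2>/dev/null; then
      echo "dev-scripts logs not found; continuing with other artifacts"
    fi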

Step 3: Download libvirt Console Logs

  1. Find and download libvirt-logs.tar

    gcloud storage ls -r gs://test-platform-results/{bucket-path}/artifacts/ 2>&1 | grep "libvirt-logs\.tar$"
    gcloud storage cp {full-gcs-path-to-libvirt-logs.tar} .work/prow-job-analyze-install-failure/{build_id}/logs/ --no-user-output-enabled
    
  2. Extract libvirt logs

    tar -xf .work/prow-job-analyze-install-failure/{build_id}/logs/libvirt-logs.tar -C .work/prow-job-analyze-install-failure/{build_id}/logs/
    

Step 4: Download Optional Artifacts

  1. Download sosreport (optional)

    gcloud storage ls -r gs://test-platform-results/{bucket-path}/artifacts/ 2>&1 | grep "sosreport.*\.tar\.xz$"
    gcloud storage cp {full-gcs-path-to-sosreport} .work/prow-job-analyze-install-failure/{build_id}/logs/ --no-user-output-enabled
    tar -xf .work/prow-job-analyze-install-failure/{build_id}/logs/sosreport-{name}.tar.xz -C .work/prow-job-analyze-install-failure/{build_id}/logs/
    
  2. Download squid-logs (optional, for IPv6/disconnected jobs)

    gcloud storage ls -r gs://test-platform-results/{bucket-path}/artifacts/ 2>&1 | grep "squid-logs.*\.tar$"
    gcloud storage cp {full-gcs-path-to-squid-logs} .work/prow-job-analyze-install-failure/{build_id}/logs/ --no-user-output-enabled
    tar -xf .work/prow-job-analyze-install-failure/{build_id}/logs/squid-logs-{name}.tar -C .work/prow-job-analyze-install-failure/{build_id}/logs/
    

Step 5: Analyze Dev-Scripts Logs

Check dev-scripts logs FIRST - they show what happened during setup and installation.

  1. Read dev-scripts logs in order

    • Logs are numbered sequentially showing setup steps
    • Note: dev-scripts invokes the installer, so you'll find .openshift_install*.log files in the devscripts directories
    • Look for the first error or failure (a scan sketch follows this list)
  2. Key errors to look for:

    • Host configuration failures: Networking, DNS, storage setup issues
    • Ironic/Metal3 setup issues: BMC connectivity, provisioning network, node registration failures
    • Installer build failures: Problems building the OpenShift installer binary
    • Install-config validation errors: Invalid configuration before cluster creation
    • Installation failures: Check installer logs (.openshift_install*.log) present in devscripts folders
  3. Important distinction:

    • If failure is in dev-scripts setup logs (01-05), the problem is in the setup process
    • If failure is in installer logs or 06_create_cluster, the problem is in the cluster installation (also analyzed by main skill)
  4. Save dev-scripts analysis:

    • Save findings to: .work/prow-job-analyze-install-failure/{build_id}/analysis/devscripts-summary.txt
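
A rough first-pass scan over the numbered logs, as referenced above. It assumes the logs were downloaded as in Step 2; the error keywords are illustrative only.

    # Print the first matching error line from each dev-scripts log, in step order
    for f in $(find .work/prow-job-analyze-install-failure/{build_id}/logs/devscripts/ -type f | sort); do
      hit=$(grep -inE 'error|fatal|fail' "$f" | head -1)
      [ -n "$hit" ] && echo "$f: $hit"
    done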

Step 6: Analyze libvirt Console Logs

Console logs are CRITICAL for metal failures during cluster creation.

  1. Find console logs

    find .work/prow-job-analyze-install-failure/{build_id}/logs/ -name "*console*.log"
    
    • Look for patterns like {cluster-name}-bootstrap_console.log, {cluster-name}-master-{N}_console.log
  2. Analyze console logs for boot/provisioning issues (see the pattern scan after this list):

    • Kernel boot failures or panics: Look for "panic", "kernel", "oops"
    • Ignition failures: Look for "ignition", "config fetch failed", "Ignition failed"
    • Network configuration issues: Look for "dhcp", "network unreachable", "DNS", "timeout"
    • Disk mounting failures: Look for "mount", "disk", "filesystem"
    • Service startup failures: Look for systemd errors, service failures
  3. Console logs show the complete boot sequence:

    • As if you were watching a physical console
    • Shows kernel messages, Ignition provisioning, CoreOS startup
    • Critical for understanding what happened before the system was fully booted
  4. Save console log analysis:

    • Save findings to: .work/prow-job-analyze-install-failure/{build_id}/analysis/console-summary.txt
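
A minimal pattern scan matching the categories in item 2 above. The keyword list is illustrative, and the command assumes the console logs were extracted as in Step 3.

    # Surface boot, Ignition, network, and disk errors from the extracted console logs
    find .work/prow-job-analyze-install-failure/{build_id}/logs/ -name '*console*.log' | \
      xargs grep -inE 'panic|ignition|config fetch failed|dhcp|network unreachable|timeout|mount' | head -50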

Step 7: Analyze sosreport (If Downloaded)

Only needed for hypervisor-level issues.

  1. Check sosreport for hypervisor diagnostics:

    • var/log/messages - Hypervisor system log
    • sos_commands/ - Output of diagnostic commands
    • etc/libvirt/ - Libvirt configuration
  2. Look for hypervisor-level issues:

    • Libvirt errors
    • Network configuration problems on hypervisor
    • Resource constraints (CPU, memory, disk)
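
Example checks against an extracted sosreport (a sketch; the archive directory name varies per host, and the paths follow the standard sosreport layout listed above):

    # Locate the extracted sosreport and scan the hypervisor system log
    sos_dir=$(find .work/prow-job-analyze-install-failure/{build_id}/logs/ -maxdepth 1 -type d -name 'sosreport-*' | head -1)
    grep -iE 'libvirt.*error|out of memory|no space left' "$sos_dir/var/log/messages" | head -20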

Step 8: Analyze squid-logs (If Downloaded)

Important for debugging CI access to the cluster.

  1. Check squid proxy logs:

    • Look for failed connections from CI to the cluster
    • Look for HTTP errors or blocked requests
    • Check patterns of CI test framework access issues
  2. Common issues:

    • CI unable to connect to cluster API
    • Proxy configuration errors blocking CI access
    • Network routing issues between CI and cluster
    • Note: These logs are for INBOUND access (CI → cluster), not for cluster's outbound access to registries
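
A quick way to surface failed or denied requests in the extracted proxy logs (a sketch; squid log file names and formats vary with configuration):

    # Denied requests and 4xx/5xx responses in the squid access logs
    grep -rhE 'TCP_DENIED|/(4|5)[0-9][0-9] ' \
      .work/prow-job-analyze-install-failure/{build_id}/logs/squid-logs-*/ | head -20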

Step 9: Generate Metal-Specific Analysis Report

  1. Create comprehensive metal analysis report:

    Metal Installation Failure Analysis
    ====================================
    
    Job: {job-name}
    Build ID: {build_id}
    Prow URL: {original-url}
    
    Installation Method: dev-scripts + Metal3 + Ironic
    
    OFCIR Host Acquisition
    ----------------------
    Pool: {pool name from OFCIR build log}
    Provider: {provider from OFCIR build log}
    Host: {host name from OFCIR build log}
    Status: {Success or Failure}
    
    {If OFCIR acquisition failed, note that installation never started}
    
    Dev-Scripts Analysis
    --------------------
    {Summary of dev-scripts logs}
    
    Key Findings:
    - {First error in dev-scripts setup}
    - {Related errors}
    
    If dev-scripts failed: The problem is in the setup process (host config, Ironic, installer build)
    If dev-scripts succeeded: The problem is in cluster installation (see main analysis)
    
    Console Logs Analysis
    ---------------------
    {Summary of VM/node console logs}
    
    Bootstrap Node:
    - {Boot sequence status}
    - {Ignition status}
    - {Network configuration}
    - {Key errors}
    
    Master Nodes:
    - {Status for each master}
    - {Key errors}
    
    Hypervisor Diagnostics (sosreport)
    -----------------------------------
    {Summary of sosreport findings, if applicable}
    
    Proxy Logs (squid)
    ------------------
    {Summary of proxy logs, if applicable}
    Note: Squid logs show CI access to the cluster, not cluster's registry access
    
    Metal-Specific Recommended Steps
    ---------------------------------
    Based on the failure:
    
    For dev-scripts setup failures:
    - Review host configuration (networking, DNS, storage)
    - Check Ironic/Metal3 setup logs for BMC/provisioning issues
    - Verify installer build completed successfully
    - Check installer logs in devscripts folders
    
    For console boot failures:
    - Check Ignition configuration and network connectivity
    - Review kernel boot messages for hardware issues
    - Verify network configuration (DHCP, DNS, routing)
    
    For CI access issues:
    - Check squid proxy logs for failed CI connections to cluster
    - Verify network routing between CI and cluster
    - Check proxy configuration
    
    Artifacts Location
    ------------------
    Dev-scripts logs: .work/prow-job-analyze-install-failure/{build_id}/logs/devscripts/
    Console logs: .work/prow-job-analyze-install-failure/{build_id}/logs/
    sosreport: .work/prow-job-analyze-install-failure/{build_id}/logs/sosreport-*/
    squid logs: .work/prow-job-analyze-install-failure/{build_id}/logs/squid-logs-*/
    
  2. Save report:

    • Save to: .work/prow-job-analyze-install-failure/{build_id}/analysis/metal-analysis.txt

Step 10: Return Metal Analysis to Main Skill

  1. Provide summary to main skill:
    • Brief summary of metal-specific findings
    • Indication of whether failure was in dev-scripts setup or cluster installation
    • Key error messages and recommended actions

Common Metal Failure Patterns

Each entry lists the issue, its symptoms, and where to look:

  • Dev-scripts host config: early failure before cluster creation. Look in dev-scripts logs (host configuration step).
  • Ironic/Metal3 setup: provisioning failures, BMC errors. Look in dev-scripts logs (Ironic setup) and Ironic logs.
  • Node boot failure: VMs/nodes won't boot. Look in console logs (kernel, boot sequence).
  • Ignition failure: nodes boot but don't provision. Look in console logs (Ignition messages).
  • Network config: DHCP failures, DNS issues. Look in console logs (network messages) and dev-scripts host config.
  • CI access issues: tests can't connect to the cluster. Look in squid logs (proxy logs for CI → cluster access).
  • Hypervisor issues: resource constraints, libvirt errors. Look in sosreport (system logs, libvirt config).

Tips

  • Check dev-scripts logs FIRST: They show setup and installation (dev-scripts invokes the installer)
  • Installer logs in devscripts: Look for .openshift_install*.log files in devscripts directories
  • Console logs are critical: They show the actual boot sequence like a physical console
  • Ironic/Metal3 errors often appear in dev-scripts setup logs
  • Squid logs are for CI access: They show inbound CI → cluster access, not outbound cluster → registry
  • Boot vs. provisioning: Boot failures appear in console logs, provisioning failures in Ironic logs
  • Layer distinction: Separate dev-scripts setup from Ironic provisioning from OpenShift installation