Files
gh-openshift-eng-ai-helpers…/commands/add-debug-wait.md
2025-11-30 08:45:38 +08:00

15 KiB
Raw Permalink Blame History

description, argument-hint
description argument-hint
Add a wait step to a CI workflow for debugging test failures <workflow-or-job-name> [timeout]

Name

ci:add-debug-wait

Synopsis

/ci:add-debug-wait <workflow-or-job-name> [timeout]

Description

The ci:add-debug-wait command adds a wait step to a CI job/workflow for debugging test failures.

What it does:

  1. Takes job name, OCP version, and optional timeout as input
  2. Finds and edits the job config or workflow file
  3. Adds - ref: wait before the last test step (with optional timeout configuration)
  4. Commits and pushes the change
  5. Gives you a GitHub link to create the PR

That's it! Simple, fast, and automated.

Implementation

The command performs the following steps:

Step 1: Gather Required Information

Prompt user for (in this order):

  1. Workflow/Job Name: (from command argument $1 or prompt)

    Workflow or job name: <user-input>
    Example: aws-c2s-ipi-disc-priv-fips-f7
    Example: baremetalds-two-node-arbiter-e2e-openshift-test-private-tests
    
  2. Timeout (optional, from command argument $2):

    Wait timeout in hours (optional, default: 3h):
    Examples: "1h", "2h", "8h", "24h", "72h"
    Valid range: 1h to 72h
    
    • If not provided, uses the wait step's default behavior (3 hours)
    • Format: Integer followed by 'h' (e.g., "1h", "2h", "8h")
    • Valid range: 1h to 72h (maximum enforced by wait step's timeout setting)
    • Will be normalized to Go duration format (e.g., "8h" → "8h0m0s")
    • This will be set as the timeout: property on the wait step in the workflow/job YAML
  3. OCP Version: (prompt - REQUIRED for searching job configs)

    OCP version for debugging (e.g., 4.18, 4.19, 4.20, 4.21, 4.22):
    

    This is used to:

    • Search the correct job config file (e.g., release-4.21)
    • Document which version needs debugging
    • Add context to the PR
  4. OpenShift Release Repo Path: (prompt if not in current directory)

    Path to openshift/release repository:
    Default: ~/repos/openshift-release
    

Step 2: Validate Environment

Silently validate (no user prompts):

cd <repo-path>

# Check 1: Repository exists and is correct
git remote -v | grep "openshift/release" || exit 1

# Skip repo update - work with current state
# User can manually update their repo if needed

Step 3: Search for Job/Test Configuration

Priority 1: Search job configs first (more specific and targeted):

cd <repo-path>

# Search for job config files matching the OCP version
# The job name could be in various config files, so search broadly
grep -r "as: ${job_name}" ci-operator/config/ --include="*release-${ocp_version}*.yaml" -l

Example searches:

  • For aws-c2s-ipi-disc-priv-fips-f7 and OCP 4.21:
    grep -r "as: aws-c2s-ipi-disc-priv-fips-f7" ci-operator/config/ --include="*release-4.21*.yaml" -l
    

Handle job config search results:

  • 1 file found:

    ✅ Found job configuration:
    ${file_path}
    
    Type: Job configuration file
    
    Proceeding with job config modification...
    

    → Continue to Step 4a: Analyze Job Configuration

  • Multiple files found:

    Found ${count} matching job config files:
    
    1. ci-operator/config/.../release-4.21__amd64-nightly.yaml
    2. ci-operator/config/.../release-4.21__arm64-nightly.yaml
    3. ci-operator/config/.../release-4.21__ppc64le-nightly.yaml
    
    Select file (1-${count}) or 'q' to quit:
    

    Prompt user to select which file to modify, then continue to Step 4a: Analyze Job Configuration

  • 0 files found:

      No job config found for: ${job_name} (OCP ${ocp_version})
    
    Searching for workflow files instead...
    

    → Continue to Priority 2 below

Priority 2: Search workflow files (if job config not found):

cd <repo-path>

# Search for workflow files
find ci-operator/step-registry -type f -name "*${workflow_name}*workflow*.yaml"

Handle workflow search results:

  • 0 files found:

    ❌ No job config or workflow file found for: ${job_name}
    
    Suggestions:
    1. Check spelling of job/workflow name
    2. Verify OCP version (${ocp_version})
    3. Try with partial name
    4. Search manually:
       - Job configs: grep -r "as: ${job_name}" ci-operator/config/
       - Workflows: find ci-operator/step-registry -name "*workflow*.yaml" | grep <partial-name>
    
  • 1 file found:

    ✅ Found workflow file:
    ${file_path}
    
    Type: Workflow file
    
    Proceeding with workflow modification...
    

    → Continue to Step 4b: Analyze Workflow File

  • Multiple files found:

    Found ${count} matching workflow files:
    
    1. ci-operator/step-registry/.../workflow1.yaml
    2. ci-operator/step-registry/.../workflow2.yaml
    3. ci-operator/step-registry/.../workflow3.yaml
    
    Select file (1-${count}) or 'q' to quit:
    

    Prompt user to select which file to modify, then continue to Step 4b: Analyze Workflow File

Step 4a: Analyze Job Configuration

Read and parse the job config YAML:

# Find the specific test definition
grep -A 30 "as: ${job_name}" <job-config-file>

Check for:

  1. Has steps: section
  2. Has test: section inside steps
  3. Does NOT already have - ref: wait

Example current structure:

- as: aws-c2s-ipi-disc-priv-fips-f7
  cron: 36 16 3,12,19,26 * *
  steps:
    cluster_profile: aws-c2s-qe
    env:
      BASE_DOMAIN: qe.devcluster.openshift.com
      FIPS_ENABLED: "true"
    test:
    - chain: openshift-e2e-test-qe
    workflow: cucushift-installer-rehearse-aws-c2s-ipi-disconnected-private

If wait already exists:

  Wait step already configured in job config

Current test section:
  test:
  - ref: wait
  - chain: openshift-e2e-test-qe

No changes needed. The job is already set up for debugging.

If no test section found:

  Job config found but no test: section

This job uses only the workflow's test steps.
Searching for the workflow: ${workflow_name}

→ Fall back to searching for workflow (Priority 2 in Step 3)

→ Continue to Step 5a: Show Diff for Job Config

Step 4b: Analyze Workflow File

Read and parse the workflow YAML:

cat <workflow-file>

Check for:

  1. Has workflow: section
  2. Has test: section
  3. Does NOT already have - ref: wait

Example current structure:

workflow:
  as: baremetalds-two-node-arbiter-upgrade
  steps:
    pre:
      - chain: baremetalds-ipi-pre
    test:
      - chain: baremetalds-ipi-test
    post:
      - chain: baremetalds-ipi-post

If wait already exists:

  Wait step already configured in workflow

Current test section:
  test:
    - ref: wait
    - chain: baremetalds-ipi-test

No changes needed. The workflow is already set up for debugging.

If no test section exists:

  Workflow has no test: section

This workflow is provision/deprovision only.
The test steps must be defined in the job config.

Please provide the full job name to modify the job config instead.

→ Exit or prompt for job name

→ Continue to Step 5b: Modify Workflow File

Step 5a: Modify Job Config File

Edit the job config file directly - no confirmation needed:

# Add wait step before the last test step
# If timeout is provided, add it as a step property
# See Step 6 for the YAML modification algorithm

Two scenarios:

  1. Without custom timeout (uses wait step's built-in default of 3h):

    test:
    - ref: wait
    - chain: openshift-e2e-test-qe
    

    Note: No timeout or best_effort needed - the wait step will use its default TIMEOUT env var (3 hours)

  2. With custom timeout (user provided timeout parameter):

    test:
    - ref: wait
      timeout: 8h0m0s
      best_effort: true
    - chain: openshift-e2e-test-qe
    

    Note: best_effort: true is required when timeout is customized to prevent the wait step from failing the job if it times out

Show brief confirmation:

✅ Modified: ${job_name} (OCP ${ocp_version})
   File: <job-config-file-path>
   Added: - ref: wait${timeout:+ (timeout: ${timeout})}

Step 5b: Modify Workflow File

Edit the workflow file directly - no confirmation needed:

# Add wait step before the last test step
# If timeout is provided, add it as a step property
# See Step 6 for the YAML modification algorithm

Two scenarios:

  1. Without custom timeout (uses wait step's built-in default of 3h):

    test:
    - ref: wait
    - chain: baremetalds-ipi-test
    

    Note: No timeout or best_effort needed - the wait step will use its default TIMEOUT env var (3 hours)

  2. With custom timeout (user provided timeout parameter):

    test:
    - ref: wait
      timeout: 8h0m0s
      best_effort: true
    - chain: baremetalds-ipi-test
    

    Note: best_effort: true is required when timeout is customized to prevent the wait step from failing the job if it times out

Show brief confirmation:

✅ Modified: ${workflow_name} workflow
   File: <workflow-file-path>
   Added: - ref: wait${timeout:+ (timeout: ${timeout})}
   ⚠️  Impact: Affects ALL jobs using this workflow

Step 6: Create Branch and Commit

Branch naming:

debug-${workflow_name}-${ocp_version}-$(date +%Y%m%d)

Example: debug-baremetalds-two-node-arbiter-4.21-20250131

Git operations:

# Create branch
git checkout -b "${branch_name}"

# Modify the file (add wait step using the implementation below)
# Add '- ref: wait' as the first step in the test: section

# Stage change
git add <workflow-file>

# Commit
git commit -m "[Debug] Add wait step to ${workflow_name} for OCP ${ocp_version}

This adds a wait step to enable debugging of test failures in OCP ${ocp_version}.

The wait step pauses the workflow before tests run, allowing QE to:
- SSH into the test environment
- Inspect system state and logs
- Debug configuration issues
- Investigate test failures

OCP Version: ${ocp_version}
Workflow: ${workflow_name}"

YAML Modification Algorithm:

The modification process for both job configs and workflow files follows the same pattern:

  1. Locate the target: Find the test: section

    • For job configs: Within the specific job definition (- as: ${job_name})
    • For workflows: At the workflow level
  2. Find test steps: Identify all steps (lines with - ref: or - chain:)

  3. Check for duplicates: Ensure - ref: wait doesn't already exist

  4. Insert wait step: Add before the last test step with matching indentation

  5. Handle timeout:

    • Without timeout: Add simple - ref: wait
    • With timeout: Add as multi-line with timeout and best_effort properties

Example transformation:

Before:

test:
- chain: openshift-e2e-test-qe

After (without timeout):

test:
- ref: wait
- chain: openshift-e2e-test-qe

After (with timeout=8h):

test:
- ref: wait
  timeout: 8h0m0s
  best_effort: true
- chain: openshift-e2e-test-qe

Critical constraints:

  • Preserve exact YAML indentation (typically 2 spaces per level)
  • Insert BEFORE the last step, not after
  • When timeout is set, best_effort: true is required to prevent job failure
  • Normalize timeout format to Go duration (e.g., "8h" → "8h0m0s")

Auto-push the branch:

git push origin "${branch_name}"

Display GitHub PR creation link:

✅ Changes pushed successfully!

Create PR here:
https://github.com/openshift/release/compare/master...${branch_name}

Branch: ${branch_name}
Job: ${job_name}
OCP: ${ocp_version}

⚠️  Remember to close PR after debugging (DO NOT MERGE)

That's it! Simple and clean.

Error Handling

Error: Repository Not Found

❌ Error: Repository not found at ${repo_path}

Please provide the correct path to openshift/release repository.

To clone:
git clone https://github.com/openshift/release.git

Error: Not in openshift/release Repo

❌ Error: This doesn't appear to be the openshift/release repository

Remote URL: ${current_remote}
Expected: github.com/openshift/release

Please navigate to the correct repository.

Error: Workflow File Not Found

❌ Error: Workflow file not found

Searched for: *${workflow_name}*workflow*.yaml
Location: ci-operator/step-registry/

Suggestions:
1. Verify the workflow name
2. Try a partial match
3. Search manually: find ci-operator/step-registry -name "*workflow*.yaml"

Error: Wait Step Already Exists

  Wait step already configured in this workflow

No action needed - you can proceed with debugging using the existing wait step.

Error: Invalid OCP Version

❌ Invalid OCP version: ${version}

Valid versions: 4.18, 4.19, 4.20, 4.21, 4.22, master

Please provide a valid version.

Error: Invalid Timeout Format

❌ Invalid timeout format: ${timeout}

Valid format: Integer followed by 'h' (e.g., "1h", "2h", "8h", "24h", "72h")
Valid range: 1h to 72h

Examples:
- "1h" (1 hour)
- "8h" (8 hours)
- "24h" (24 hours)
- "72h" (72 hours, maximum)

Please provide a valid timeout in hours.

Note: Timeout Normalization

When a user provides a timeout like "8h", the implementation should normalize it to the standard Go duration format "8h0m0s" for consistency with existing configurations in the codebase.

Return Value

  • Success: PR URL and debugging instructions
  • Error: Error message with suggestions for resolution
  • Format: Text output with emoji indicators for status

Examples

Example 1: Without Timeout (Default 3h)

/ci:add-debug-wait aws-ipi-f7-longduration-workload

Prompts for: OCP version (4.21), repo path

Result:

test:
- ref: wait
- chain: openshift-e2e-test-qe

Returns: PR creation link

Example 2: With Custom Timeout

/ci:add-debug-wait aws-ipi-f7-longduration-workload 8h

Prompts for: OCP version (4.21), repo path

Result:

test:
- ref: wait
  timeout: 8h0m0s
  best_effort: true
- chain: openshift-e2e-test-qe

Returns: PR creation link with timeout info

Example 3: Workflow File

/ci:add-debug-wait baremetalds-two-node-arbiter-upgrade 24h

Behavior: Searches job config first, falls back to workflow if not found. Warns that workflow changes affect ALL jobs using it.

Returns: PR creation link

Arguments

  • $1 (workflow-or-job-name): The name of the CI workflow or job to add the wait step to (required)
  • $2 (timeout): Optional timeout in hours (1h-72h). Examples: "1h", "8h", "24h", "72h". If not provided, uses wait step's default (3h)

Notes

Best Practices for QE

Before Running Command:

  • Confirm test is actually failing
  • Check existing debug PRs
  • Know which OCP version is affected

During Debugging:

  • 📝 Take detailed notes
  • 💾 Save logs and screenshots
  • 🔍 Document root cause
  • 📊 Record all findings

After Debugging:

  • Document findings
  • Close the debug PR
  • Delete the branch
  • Share learnings with team
  • Create fix PR if needed

Future Enhancements

Consider adding companion commands:

  • /ci:close-debug-pr - Lists open debug PRs, prompts for findings, closes PR
  • /ci:list-debug-prs - Show all open debug PRs
  • /ci:revert-debug-pr - Revert a debug PR that was merged by mistake