gh-openshift-eng-ai-helpers-plugins-ci/commands/add-debug-wait.md at master

Zhongwei Li d09f471f51 Initial commit

2025-11-30 08:45:38 +08:00

description: Add a wait step to a CI workflow for debugging test failures
argument-hint: <workflow-or-job-name> [timeout]

Name

ci:add-debug-wait

Synopsis

/ci:add-debug-wait <workflow-or-job-name> [timeout]

Description

The ci:add-debug-wait command adds a wait step to a CI job/workflow for debugging test failures.

What it does:

- Takes job name, OCP version, and optional timeout as input
- Finds and edits the job config or workflow file
- Adds - ref: wait before the last test step (with optional timeout configuration)
- Commits and pushes the change
- Gives you a GitHub link to create the PR

That's it! Simple, fast, and automated.

Implementation

The command performs the following steps:

Step 1: Gather Required Information

Prompt user for (in this order):

Workflow/Job Name: (from command argument $1 or prompt)

Workflow or job name: <user-input>
Example: aws-c2s-ipi-disc-priv-fips-f7
Example: baremetalds-two-node-arbiter-e2e-openshift-test-private-tests

Timeout: (optional, from command argument $2)

```
Wait timeout in hours (optional, default: 3h):
Examples: "1h", "2h", "8h", "24h", "72h"
Valid range: 1h to 72h
```
- If not provided, the wait step's default behavior applies (3 hours)
- Format: Integer followed by 'h' (e.g., "1h", "2h", "8h")
- Valid range: 1h to 72h (maximum enforced by wait step's timeout setting)
- Normalized to Go duration format (e.g., "8h" → "8h0m0s")
- Set as the timeout: property on the wait step in the workflow/job YAML

OCP Version: (prompt - REQUIRED for searching job configs)

```
OCP version for debugging (e.g., 4.18, 4.19, 4.20, 4.21, 4.22):
```
This is used to:
- Search the correct job config file (e.g., release-4.21)
- Document which version needs debugging
- Add context to the PR

OpenShift Release Repo Path: (prompt if not in current directory)

Path to openshift/release repository:
Default: ~/repos/openshift-release

Step 2: Validate Environment

Silently validate (no user prompts):

cd <repo-path>

# Check 1: Repository exists and is correct
git remote -v | grep -q "openshift/release" || exit 1

# Skip repo update - work with current state
# User can manually update their repo if needed
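
The check above can be split so that each failure maps to one of the messages listed under Error Handling. The following is a minimal sketch, not the command's actual implementation; `is_release_remote` and `validate_repo` are hypothetical helper names, and the accepted URL forms (HTTPS and SSH, with or without `.git`) are an assumption.

```shell
# is_release_remote: hypothetical helper that checks a single remote URL.
# Accepts HTTPS and SSH forms, with or without a trailing ".git".
is_release_remote() {
  case "$1" in
    *github.com[:/]openshift/release|*github.com[:/]openshift/release.git) return 0 ;;
    *) return 1 ;;
  esac
}

# validate_repo: hypothetical wrapper around the silent checks in this step.
# Prints an error to stderr and returns non-zero when a check fails.
validate_repo() {
  local repo_path="$1" url
  # Check 1a: the path must be inside a Git working tree.
  if ! git -C "$repo_path" rev-parse --git-dir >/dev/null 2>&1; then
    echo "Error: Repository not found at ${repo_path}" >&2
    return 1
  fi
  # Check 1b: at least one configured remote must point at openshift/release.
  for url in $(git -C "$repo_path" remote -v | awk '{print $2}' | sort -u); do
    if is_release_remote "$url"; then
      return 0
    fi
  done
  echo "Error: This doesn't appear to be the openshift/release repository" >&2
  return 1
}
```

Unlike the plain grep, the URL check does not accept a repository whose name merely starts with `release` (for example a hypothetical `openshift/release-tools`).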

Step 3: Search for Job/Test Configuration

Priority 1: Search job configs first (more specific and targeted):

cd <repo-path>

# Search for job config files matching the OCP version
# The job name could be in various config files, so search broadly
grep -r "as: ${job_name}" ci-operator/config/ --include="*release-${ocp_version}*.yaml" -l

Example searches:

For aws-c2s-ipi-disc-priv-fips-f7 and OCP 4.21:

grep -r "as: aws-c2s-ipi-disc-priv-fips-f7" ci-operator/config/ --include="*release-4.21*.yaml" -l

Handle job config search results:

1 file found:

✅ Found job configuration:
${file_path}

Type: Job configuration file

Proceeding with job config modification...

→ Continue to Step 4a: Analyze Job Configuration

Multiple files found:

Found ${count} matching job config files:

1. ci-operator/config/.../release-4.21__amd64-nightly.yaml
2. ci-operator/config/.../release-4.21__arm64-nightly.yaml
3. ci-operator/config/.../release-4.21__ppc64le-nightly.yaml

Select file (1-${count}) or 'q' to quit:

Prompt user to select which file to modify, then continue to Step 4a: Analyze Job Configuration

0 files found:

ℹ️  No job config found for: ${job_name} (OCP ${ocp_version})

Searching for workflow files instead...

→ Continue to Priority 2 below
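
The three outcomes above (one match, several matches, none) could be dispatched roughly as follows. This is a sketch under assumptions: `find_job_configs` and `classify_matches` are hypothetical helper names, and the search uses a fixed-string match (`-F`) rather than the regular-expression match shown earlier.

```shell
# find_job_configs: hypothetical wrapper around the Priority 1 grep.
# Prints matching file paths, one per line, or nothing when there is no match.
find_job_configs() {
  local repo_path="$1" job_name="$2" ocp_version="$3"
  # -r recurse, -l print file names only, -F treat the pattern literally.
  grep -rlF --include="*release-${ocp_version}*.yaml" -- \
    "as: ${job_name}" "${repo_path}/ci-operator/config/" 2>/dev/null | sort
}

# classify_matches: reads paths on stdin and prints none, single, or multiple.
classify_matches() {
  local count
  # grep -c prints the number of non-empty input lines (0 when there are none).
  count=$(grep -c . || true)
  case "$count" in
    0) echo "none" ;;      # fall through to the workflow search (Priority 2)
    1) echo "single" ;;    # continue to Step 4a
    *) echo "multiple" ;;  # prompt the user to pick a file, then Step 4a
  esac
}

# Usage:
#   find_job_configs "$repo_path" "$job_name" "$ocp_version" | classify_matches
```

Like the original grep, this also matches longer job names that share the same prefix, so a "multiple" result may include files for a different job.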

Priority 2: Search workflow files (if job config not found):

cd <repo-path>

# Search for workflow files
find ci-operator/step-registry -type f -name "*${workflow_name}*workflow*.yaml"

Handle workflow search results:

0 files found:

❌ No job config or workflow file found for: ${job_name}

Suggestions:
1. Check spelling of job/workflow name
2. Verify OCP version (${ocp_version})
3. Try with partial name
4. Search manually:
   - Job configs: grep -r "as: ${job_name}" ci-operator/config/
   - Workflows: find ci-operator/step-registry -name "*workflow*.yaml" | grep <partial-name>

1 file found:

✅ Found workflow file:
${file_path}

Type: Workflow file

Proceeding with workflow modification...

→ Continue to Step 4b: Analyze Workflow File

Multiple files found:

Found ${count} matching workflow files:

1. ci-operator/step-registry/.../workflow1.yaml
2. ci-operator/step-registry/.../workflow2.yaml
3. ci-operator/step-registry/.../workflow3.yaml

Select file (1-${count}) or 'q' to quit:

Prompt user to select which file to modify, then continue to Step 4b: Analyze Workflow File

Step 4a: Analyze Job Configuration

Read and parse the job config YAML:

# Find the specific test definition
grep -A 30 "as: ${job_name}" <job-config-file>

Check for:

✅ Has steps: section
✅ Has test: section inside steps
❌ Does NOT already have - ref: wait

Example current structure:

- as: aws-c2s-ipi-disc-priv-fips-f7
  cron: 36 16 3,12,19,26 * *
  steps:
    cluster_profile: aws-c2s-qe
    env:
      BASE_DOMAIN: qe.devcluster.openshift.com
      FIPS_ENABLED: "true"
    test:
    - chain: openshift-e2e-test-qe
    workflow: cucushift-installer-rehearse-aws-c2s-ipi-disconnected-private

If wait already exists:

ℹ️  Wait step already configured in job config

Current test section:
  test:
  - ref: wait
  - chain: openshift-e2e-test-qe

No changes needed. The job is already set up for debugging.

If no test section found:

ℹ️  Job config found but no test: section

This job uses only the workflow's test steps.
Searching for the workflow: ${workflow_name}

→ Fall back to searching for workflow (Priority 2 in Step 3)

→ Otherwise, continue to Step 5a: Modify Job Config File

Step 4b: Analyze Workflow File

Read and parse the workflow YAML:

cat <workflow-file>

Check for:

✅ Has workflow: section
✅ Has test: section
❌ Does NOT already have - ref: wait

Example current structure:

workflow:
  as: baremetalds-two-node-arbiter-upgrade
  steps:
    pre:
      - chain: baremetalds-ipi-pre
    test:
      - chain: baremetalds-ipi-test
    post:
      - chain: baremetalds-ipi-post

If wait already exists:

ℹ️  Wait step already configured in workflow

Current test section:
  test:
    - ref: wait
    - chain: baremetalds-ipi-test

No changes needed. The workflow is already set up for debugging.

If no test section exists:

ℹ️  Workflow has no test: section

This workflow is provision/deprovision only.
The test steps must be defined in the job config.

Please provide the full job name to modify the job config instead.

→ Exit or prompt for job name

→ Otherwise, continue to Step 5b: Modify Workflow File

Step 5a: Modify Job Config File

Edit the job config file directly - no confirmation needed:

# Add wait step before the last test step
# If timeout is provided, add it as a step property
# See Step 6 for the YAML modification algorithm

Two scenarios:

Without custom timeout (uses wait step's built-in default of 3h):
```
test:
- ref: wait
- chain: openshift-e2e-test-qe
```
Note: No timeout or best_effort needed - the wait step will use its default TIMEOUT env var (3 hours)

With custom timeout (user provided timeout parameter):
```
test:
- ref: wait
  timeout: 8h0m0s
  best_effort: true
- chain: openshift-e2e-test-qe
```
Note: best_effort: true is required when timeout is customized to prevent the wait step from failing the job if it times out

Show brief confirmation:

✅ Modified: ${job_name} (OCP ${ocp_version})
   File: <job-config-file-path>
   Added: - ref: wait${timeout:+ (timeout: ${timeout})}

Step 5b: Modify Workflow File

Edit the workflow file directly - no confirmation needed:

# Add wait step before the last test step
# If timeout is provided, add it as a step property
# See Step 6 for the YAML modification algorithm

Two scenarios:

Without custom timeout (uses wait step's built-in default of 3h):
```
test:
- ref: wait
- chain: baremetalds-ipi-test
```
Note: No timeout or best_effort needed - the wait step will use its default TIMEOUT env var (3 hours)

With custom timeout (user provided timeout parameter):
```
test:
- ref: wait
  timeout: 8h0m0s
  best_effort: true
- chain: baremetalds-ipi-test
```
Note: best_effort: true is required when timeout is customized to prevent the wait step from failing the job if it times out

Show brief confirmation:

✅ Modified: ${workflow_name} workflow
   File: <workflow-file-path>
   Added: - ref: wait${timeout:+ (timeout: ${timeout})}
   ⚠️  Impact: Affects ALL jobs using this workflow

Step 6: Create Branch and Commit

Branch naming:

debug-${workflow_name}-${ocp_version}-$(date +%Y%m%d)

Example: debug-baremetalds-two-node-arbiter-4.21-20250131
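
A small helper makes the naming rule reproducible. This is only a sketch: `debug_branch_name` is a hypothetical name, and the optional third argument (a fixed date stamp) is an addition for testing that the original command does not describe.

```shell
# debug_branch_name: hypothetical helper that builds the branch name.
# The optional third argument overrides today's date stamp (YYYYMMDD).
debug_branch_name() {
  local name="$1" ocp_version="$2" stamp="${3:-$(date +%Y%m%d)}"
  echo "debug-${name}-${ocp_version}-${stamp}"
}

# Usage:
#   branch_name=$(debug_branch_name "$workflow_name" "$ocp_version")
```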

Git operations:

# Create branch
git checkout -b "${branch_name}"

# Modify the file (add wait step using the algorithm below)
# Insert '- ref: wait' before the last step in the test: section

# Stage change (the job config or workflow file that was modified)
git add <modified-file>

# Commit
git commit -m "[Debug] Add wait step to ${workflow_name} for OCP ${ocp_version}

This adds a wait step to enable debugging of test failures in OCP ${ocp_version}.

The wait step pauses the job before the last test step runs, allowing QE to:
- SSH into the test environment
- Inspect system state and logs
- Debug configuration issues
- Investigate test failures

OCP Version: ${ocp_version}
Workflow: ${workflow_name}"

YAML Modification Algorithm:

The modification process for both job configs and workflow files follows the same pattern:

1. Locate the target: Find the test: section
   - For job configs: Within the specific job definition (- as: ${job_name})
   - For workflows: At the workflow level
2. Find test steps: Identify all steps (lines with - ref: or - chain:)
3. Check for duplicates: Ensure - ref: wait doesn't already exist
4. Insert wait step: Add before the last test step with matching indentation
5. Handle timeout:
   - Without timeout: Add simple - ref: wait
   - With timeout: Add as multi-line with timeout and best_effort properties

Example transformation:

Before:

test:
- chain: openshift-e2e-test-qe

After (without timeout):

test:
- ref: wait
- chain: openshift-e2e-test-qe

After (with timeout=8h):

test:
- ref: wait
  timeout: 8h0m0s
  best_effort: true
- chain: openshift-e2e-test-qe

Critical constraints:

- Preserve exact YAML indentation (typically 2 spaces per level)
- Insert BEFORE the last step, not after
- When timeout is set, best_effort: true is required to prevent job failure
- Normalize timeout format to Go duration (e.g., "8h" → "8h0m0s")
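
The algorithm and constraints above can be approximated with a two-pass, line-based edit. The following is a minimal sketch, not the command's actual implementation: `insert_wait_step` is a hypothetical helper, it assumes space-indented YAML, and it treats every list item at the step indentation as a step (not only `- ref:` and `- chain:` lines). An empty job name selects the first test: section in the file, which is how a workflow file would be handled.

```shell
# insert_wait_step FILE JOB_NAME [TIMEOUT]
# Hypothetical helper. JOB_NAME may be empty for a workflow file. TIMEOUT,
# when given, is expected to be normalized already (for example 8h0m0s).
# Returns 0 on success, 2 if a wait step already exists, 1 on other errors.
insert_wait_step() {
  local file="$1" job_name="$2" timeout="${3:-}" target

  # Pass 1: find the line number of the last step in the target test: section.
  target=$(awk -v job="$job_name" '
    function indent(s,   t) { t = s; sub(/^ +/, "", t); return length(s) - length(t) }
    BEGIN { in_job = (job == ""); item_indent = -1 }
    done { next }
    in_test {
      if ($0 ~ /^ *$/) next
      ind = indent($0)
      is_item = ($0 ~ /^ *- /)
      # The first list item fixes the indentation that marks a step.
      if (is_item && item_indent < 0 && ind >= test_indent) item_indent = ind
      if (is_item && ind == item_indent) {
        if ($0 ~ /^ *- ref: wait *$/) dup = 1
        last = NR
        next
      }
      if (ind > test_indent) next   # property or nested content of a step
      in_test = 0; done = 1; next   # a sibling key ends the section
    }
    /^ *- as: / {
      name = $0; sub(/^ *- as: */, "", name); sub(/ *$/, "", name)
      in_job = (name == job)
      next
    }
    in_job && /^ *test: *$/ { in_test = 1; test_indent = indent($0) }
    END { if (dup) print "dup"; else if (last) print last }
  ' "$file") || return 1

  case "$target" in
    dup) echo "Wait step already configured" >&2; return 2 ;;
    "")  echo "No test steps found" >&2; return 1 ;;
  esac

  # Pass 2: print the wait step before the last step, reusing its indentation.
  awk -v n="$target" -v timeout="$timeout" '
    NR == n {
      pad = $0; sub(/-.*$/, "", pad)   # leading spaces of the last step
      print pad "- ref: wait"
      if (timeout != "") {
        print pad "  timeout: " timeout
        print pad "  best_effort: true"
      }
    }
    { print }
  ' "$file" > "${file}.tmp" && mv "${file}.tmp" "$file"
}
```

A line-based edit is used here because round-tripping the file through a YAML parser would typically rewrite formatting and comments, which conflicts with the constraint to preserve exact indentation. The sketch does not handle tabs, flow-style lists, or comment lines at or below the indentation of test:.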

Step 7: Push and Show GitHub Link

Auto-push the branch:

git push origin "${branch_name}"

Display GitHub PR creation link:

✅ Changes pushed successfully!

Create PR here:
https://github.com/openshift/release/compare/master...${branch_name}

Branch: ${branch_name}
Job: ${job_name}
OCP: ${ocp_version}

⚠️  Remember to close PR after debugging (DO NOT MERGE)

That's it! Simple and clean.

Error Handling

Error: Repository Not Found

❌ Error: Repository not found at ${repo_path}

Please provide the correct path to openshift/release repository.

To clone:
git clone https://github.com/openshift/release.git

Error: Not in openshift/release Repo

❌ Error: This doesn't appear to be the openshift/release repository

Remote URL: ${current_remote}
Expected: github.com/openshift/release

Please navigate to the correct repository.

Error: Workflow File Not Found

❌ Error: Workflow file not found

Searched for: *${workflow_name}*workflow*.yaml
Location: ci-operator/step-registry/

Suggestions:
1. Verify the workflow name
2. Try a partial match
3. Search manually: find ci-operator/step-registry -name "*workflow*.yaml"

Error: Wait Step Already Exists

ℹ️  Wait step already configured in this workflow

No action needed - you can proceed with debugging using the existing wait step.

Error: Invalid OCP Version

❌ Invalid OCP version: ${version}

Valid versions: 4.18, 4.19, 4.20, 4.21, 4.22, master

Please provide a valid version.
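
A version check matching this message could look like the sketch below. `is_valid_ocp_version` is a hypothetical helper, and the accepted values are copied from the message above; how the command actually maintains this list is not specified.

```shell
# is_valid_ocp_version: hypothetical helper; the accepted values mirror the
# "Valid versions" line of the error message and are otherwise an assumption.
is_valid_ocp_version() {
  case "$1" in
    4.18|4.19|4.20|4.21|4.22|master) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   is_valid_ocp_version "$ocp_version" || echo "Invalid OCP version: $ocp_version" >&2
```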

Error: Invalid Timeout Format

❌ Invalid timeout format: ${timeout}

Valid format: Integer followed by 'h' (e.g., "1h", "2h", "8h", "24h", "72h")
Valid range: 1h to 72h

Examples:
- "1h" (1 hour)
- "8h" (8 hours)
- "24h" (24 hours)
- "72h" (72 hours, maximum)

Please provide a valid timeout in hours.

Note: Timeout Normalization

When a user provides a timeout like "8h", the implementation should normalize it to the standard Go duration format "8h0m0s" for consistency with existing configurations in the codebase.
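
Validation and normalization can be combined in one step. The following is a minimal sketch, assuming the format and range rules stated above; `normalize_timeout` is a hypothetical helper name, and rejecting leading zeros (such as "08h") is an added assumption.

```shell
# normalize_timeout: hypothetical helper, not part of the original command.
# Accepts "<integer>h" within 1h..72h and prints the Go-style duration.
normalize_timeout() {
  local input="$1" hours
  # Require a trailing literal 'h'.
  case "$input" in
    *h) hours="${input%h}" ;;
    *)  echo "Invalid timeout format: ${input}" >&2; return 1 ;;
  esac
  # Require one or two digits without a leading zero.
  case "$hours" in
    [1-9]|[1-9][0-9]) ;;
    *) echo "Invalid timeout format: ${input}" >&2; return 1 ;;
  esac
  # Enforce the documented maximum of 72 hours.
  if [ "$hours" -gt 72 ]; then
    echo "Invalid timeout format: ${input}" >&2
    return 1
  fi
  echo "${hours}h0m0s"
}

# Usage:
#   timeout=$(normalize_timeout "8h") || exit 1
```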

Return Value

Success: PR URL and debugging instructions
Error: Error message with suggestions for resolution
Format: Text output with emoji indicators for status

Examples

Example 1: Without Timeout (Default 3h)

/ci:add-debug-wait aws-ipi-f7-longduration-workload

Prompts for: OCP version (4.21), repo path

Result:

test:
- ref: wait
- chain: openshift-e2e-test-qe

Returns: PR creation link

Example 2: With Custom Timeout

/ci:add-debug-wait aws-ipi-f7-longduration-workload 8h