| description | argument-hint |
|---|---|
| Add a wait step to a CI workflow for debugging test failures | <workflow-or-job-name> [timeout] |
Name
ci:add-debug-wait
Synopsis
/ci:add-debug-wait <workflow-or-job-name> [timeout]
Description
The ci:add-debug-wait command adds a wait step to a CI job/workflow for debugging test failures.
What it does:
- Takes job name, OCP version, and optional timeout as input
- Finds and edits the job config or workflow file
- Adds "- ref: wait" before the last test step (with optional timeout configuration)
- Commits and pushes the change
- Gives you a GitHub link to create the PR
That's it! Simple, fast, and automated.
Implementation
The command performs the following steps:
Step 1: Gather Required Information
Prompt user for (in this order):
- Workflow/Job Name: (from command argument $1 or prompt)
  Workflow or job name: <user-input>
  Example: aws-c2s-ipi-disc-priv-fips-f7
  Example: baremetalds-two-node-arbiter-e2e-openshift-test-private-tests
- Timeout (optional, from command argument $2):
  Wait timeout in hours (optional, default: 3h):
  Examples: "1h", "2h", "8h", "24h", "72h"
  Valid range: 1h to 72h
  - If not provided, uses the wait step's default behavior (3 hours)
  - Format: Integer followed by 'h' (e.g., "1h", "2h", "8h")
  - Valid range: 1h to 72h (maximum enforced by wait step's timeout setting)
  - Will be normalized to Go duration format (e.g., "8h" → "8h0m0s")
  - This will be set as the timeout: property on the wait step in the workflow/job YAML
- OCP Version: (prompt - REQUIRED for searching job configs)
  OCP version for debugging (e.g., 4.18, 4.19, 4.20, 4.21, 4.22):
  This is used to:
  - Search the correct job config file (e.g., release-4.21)
  - Document which version needs debugging
  - Add context to the PR
- OpenShift Release Repo Path: (prompt if not in current directory)
  Path to openshift/release repository:
  Default: ~/repos/openshift-release
Step 2: Validate Environment
Silently validate (no user prompts):
cd <repo-path>
# Check 1: Repository exists and is correct
git remote -v | grep "openshift/release" || exit 1
# Skip repo update - work with current state
# User can manually update their repo if needed
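The grep in the check above accepts any remote whose URL merely contains "openshift/release", so a repository whose name only starts with "release" would also pass. A stricter check might look like the sketch below. The helper name is_release_remote is hypothetical, and the accepted URL shapes (HTTPS or SSH GitHub remotes, with or without a .git suffix) are assumptions.

```shell
# Hypothetical helper: reads `git remote -v` output on stdin and succeeds
# only if some remote URL ends in openshift/release (optionally with .git).
is_release_remote() {
    grep -Eq 'github\.com[:/]openshift/release(\.git)?[[:space:]]'
}
```

It could be used as: git -C "<repo-path>" remote -v | is_release_remote || exit 1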
Step 3: Search for Job/Test Configuration
Priority 1: Search job configs first (more specific and targeted):
cd <repo-path>
# Search for job config files matching the OCP version
# The job name could be in various config files, so search broadly
grep -r "as: ${job_name}" ci-operator/config/ --include="*release-${ocp_version}*.yaml" -l
Example searches:
- For aws-c2s-ipi-disc-priv-fips-f7 and OCP 4.21:
  grep -r "as: aws-c2s-ipi-disc-priv-fips-f7" ci-operator/config/ --include="*release-4.21*.yaml" -l
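The search above can be wrapped in a small function. This is only a sketch: the name find_job_configs is hypothetical, it uses find with grep -l rather than grep -r --include, and it anchors the pattern so that a job named my-job does not also match my-job-extra. It assumes the job name contains no regular-expression metacharacters.

```shell
# Hypothetical helper: list job config files for an OCP version that
# define a test with exactly the given name.
#   $1 = path to the openshift/release checkout
#   $2 = job name
#   $3 = OCP version, e.g. 4.21
find_job_configs() {
    find "$1/ci-operator/config" -type f -name "*release-$3*.yaml" \
        -exec grep -l -E -e "^[[:space:]]*- as: $2[[:space:]]*\$" {} + | sort
}
```

The output is one path per line, sorted so that a numbered selection menu stays stable between runs. An empty output means no job config matched.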
Handle job config search results:
- 1 file found:
  ✅ Found job configuration: ${file_path}
  Type: Job configuration file
  Proceeding with job config modification...
  → Continue to Step 4a: Analyze Job Configuration
- Multiple files found:
  Found ${count} matching job config files:
  1. ci-operator/config/.../release-4.21__amd64-nightly.yaml
  2. ci-operator/config/.../release-4.21__arm64-nightly.yaml
  3. ci-operator/config/.../release-4.21__ppc64le-nightly.yaml
  Select file (1-${count}) or 'q' to quit:
  Prompt user to select which file to modify, then continue to Step 4a: Analyze Job Configuration
- 0 files found:
  ℹ️ No job config found for: ${job_name} (OCP ${ocp_version})
  Searching for workflow files instead...
  → Continue to Priority 2 below
Priority 2: Search workflow files (if job config not found):
cd <repo-path>
# Search for workflow files
find ci-operator/step-registry -type f -name "*${workflow_name}*workflow*.yaml"
Handle workflow search results:
- 0 files found:
  ❌ No job config or workflow file found for: ${job_name}
  Suggestions:
  1. Check spelling of job/workflow name
  2. Verify OCP version (${ocp_version})
  3. Try with partial name
  4. Search manually:
     - Job configs: grep -r "as: ${job_name}" ci-operator/config/
     - Workflows: find ci-operator/step-registry -name "*workflow*.yaml" | grep <partial-name>
- 1 file found:
  ✅ Found workflow file: ${file_path}
  Type: Workflow file
  Proceeding with workflow modification...
  → Continue to Step 4b: Analyze Workflow File
- Multiple files found:
  Found ${count} matching workflow files:
  1. ci-operator/step-registry/.../workflow1.yaml
  2. ci-operator/step-registry/.../workflow2.yaml
  3. ci-operator/step-registry/.../workflow3.yaml
  Select file (1-${count}) or 'q' to quit:
  Prompt user to select which file to modify, then continue to Step 4b: Analyze Workflow File
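The 0 / 1 / multiple handling is the same for job configs and workflow files, so it can be sketched once. The helpers below are hypothetical (match_count and pick_match are not part of the command) and assume the search results are held as a newline-separated list of paths.

```shell
# Hypothetical helper: print how many paths the list in $1 contains.
match_count() {
    if [ -z "$1" ]; then
        echo 0
    else
        printf '%s\n' "$1" | grep -c ''
    fi
}

# Hypothetical helper: print the path for a 1-based menu choice in $2.
# Fails for 'q', non-numeric input, or a number outside 1..count.
pick_match() {
    case "$2" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$2" -ge 1 ] && [ "$2" -le "$(match_count "$1")" ] || return 1
    printf '%s\n' "$1" | sed -n "${2}p"
}
```

A caller would branch on match_count (0: fall through or report the error, 1: use the single path, more: show the menu) and pass the user's answer to pick_match.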
Step 4a: Analyze Job Configuration
Read and parse the job config YAML:
# Find the specific test definition
grep -A 30 "as: ${job_name}" <job-config-file>
Check for:
- ✅ Has steps: section
- ✅ Has test: section inside steps
- ❌ Does NOT already have - ref: wait
Example current structure:
- as: aws-c2s-ipi-disc-priv-fips-f7
  cron: 36 16 3,12,19,26 * *
  steps:
    cluster_profile: aws-c2s-qe
    env:
      BASE_DOMAIN: qe.devcluster.openshift.com
      FIPS_ENABLED: "true"
    test:
    - chain: openshift-e2e-test-qe
    workflow: cucushift-installer-rehearse-aws-c2s-ipi-disconnected-private
If wait already exists:
ℹ️ Wait step already configured in job config
Current test section:
test:
- ref: wait
- chain: openshift-e2e-test-qe
No changes needed. The job is already set up for debugging.
If no test section found:
ℹ️ Job config found but no test: section
This job uses only the workflow's test steps.
Searching for the workflow: ${workflow_name}
→ Fall back to searching for workflow (Priority 2 in Step 3)
→ If all checks pass, continue to Step 5a: Modify Job Config File
Step 4b: Analyze Workflow File
Read and parse the workflow YAML:
cat <workflow-file>
Check for:
- ✅ Has workflow: section
- ✅ Has test: section
- ❌ Does NOT already have - ref: wait
Example current structure:
workflow:
  as: baremetalds-two-node-arbiter-upgrade
  steps:
    pre:
    - chain: baremetalds-ipi-pre
    test:
    - chain: baremetalds-ipi-test
    post:
    - chain: baremetalds-ipi-post
If wait already exists:
ℹ️ Wait step already configured in workflow
Current test section:
test:
- ref: wait
- chain: baremetalds-ipi-test
No changes needed. The workflow is already set up for debugging.
If no test section exists:
ℹ️ Workflow has no test: section
This workflow is provision/deprovision only.
The test steps must be defined in the job config.
Please provide the full job name to modify the job config instead.
→ Exit or prompt for job name
→ If all checks pass, continue to Step 5b: Modify Workflow File
Step 5a: Modify Job Config File
Edit the job config file directly - no confirmation needed:
# Add wait step before the last test step
# If timeout is provided, add it as a step property
# See Step 6 for the YAML modification algorithm
Two scenarios:
- Without custom timeout (uses wait step's built-in default of 3h):
  test:
  - ref: wait
  - chain: openshift-e2e-test-qe
  Note: No timeout or best_effort needed - the wait step will use its default TIMEOUT env var (3 hours)
- With custom timeout (user provided timeout parameter):
  test:
  - ref: wait
    timeout: 8h0m0s
    best_effort: true
  - chain: openshift-e2e-test-qe
  Note: best_effort: true is required when timeout is customized to prevent the wait step from failing the job if it times out
Show brief confirmation:
✅ Modified: ${job_name} (OCP ${ocp_version})
File: <job-config-file-path>
Added: - ref: wait${timeout:+ (timeout: ${timeout})}
Step 5b: Modify Workflow File
Edit the workflow file directly - no confirmation needed:
# Add wait step before the last test step
# If timeout is provided, add it as a step property
# See Step 6 for the YAML modification algorithm
Two scenarios:
- Without custom timeout (uses wait step's built-in default of 3h):
  test:
  - ref: wait
  - chain: baremetalds-ipi-test
  Note: No timeout or best_effort needed - the wait step will use its default TIMEOUT env var (3 hours)
- With custom timeout (user provided timeout parameter):
  test:
  - ref: wait
    timeout: 8h0m0s
    best_effort: true
  - chain: baremetalds-ipi-test
  Note: best_effort: true is required when timeout is customized to prevent the wait step from failing the job if it times out
Show brief confirmation:
✅ Modified: ${workflow_name} workflow
File: <workflow-file-path>
Added: - ref: wait${timeout:+ (timeout: ${timeout})}
⚠️ Impact: Affects ALL jobs using this workflow
Step 6: Create Branch and Commit
Branch naming:
debug-${workflow_name}-${ocp_version}-$(date +%Y%m%d)
Example: debug-baremetalds-two-node-arbiter-4.21-20250131
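The naming scheme can be wrapped in a helper so the date stamp is injectable. This is a minimal sketch; debug_branch_name and its optional third argument are hypothetical conveniences, not part of the command.

```shell
# Hypothetical helper: build the debug branch name.
#   $1 = workflow or job name
#   $2 = OCP version
#   $3 = optional date stamp (YYYYMMDD); defaults to today's date
debug_branch_name() {
    printf 'debug-%s-%s-%s\n' "$1" "$2" "${3:-$(date +%Y%m%d)}"
}
```

For instance, debug_branch_name baremetalds-two-node-arbiter 4.21 20250131 prints the example name shown above.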
Git operations:
# Create branch
git checkout -b "${branch_name}"
# Modify the file (add the wait step using the YAML modification algorithm below)
# Add '- ref: wait' before the last step in the test: section
# Stage change
git add <modified-file>
# Commit
git commit -m "[Debug] Add wait step to ${workflow_name} for OCP ${ocp_version}
This adds a wait step to enable debugging of test failures in OCP ${ocp_version}.
The wait step pauses the job before the last test step runs, allowing QE to:
- SSH into the test environment
- Inspect system state and logs
- Debug configuration issues
- Investigate test failures
OCP Version: ${ocp_version}
Workflow: ${workflow_name}"
YAML Modification Algorithm:
The modification process for both job configs and workflow files follows the same pattern:
- Locate the target: Find the test: section
  - For job configs: Within the specific job definition (- as: ${job_name})
  - For workflows: At the workflow level
- Find test steps: Identify all steps (lines with - ref: or - chain:)
- Check for duplicates: Ensure - ref: wait doesn't already exist
- Insert wait step: Add before the last test step with matching indentation
- Handle timeout:
  - Without timeout: Add simple - ref: wait
  - With timeout: Add as multi-line with timeout and best_effort properties
Example transformation:
Before:
test:
- chain: openshift-e2e-test-qe
After (without timeout):
test:
- ref: wait
- chain: openshift-e2e-test-qe
After (with timeout=8h):
test:
- ref: wait
  timeout: 8h0m0s
  best_effort: true
- chain: openshift-e2e-test-qe
Critical constraints:
- Preserve exact YAML indentation (typically 2 spaces per level)
- Insert BEFORE the last step, not after
- When timeout is set, best_effort: true is required to prevent job failure
- Normalize timeout format to Go duration (e.g., "8h" → "8h0m0s")
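One way to apply the algorithm and constraints above is a line-based awk pass, sketched below. The function name insert_wait_step and its exit codes are hypothetical. It assumes block-style YAML indented with spaces, a job entry written exactly as "- as: <name>", and a wait step written exactly as "- ref: wait". It is not a general YAML parser.

```shell
# Hypothetical helper: print FILE with a wait step inserted before the
# last step of a test: list.
#   $1 = YAML file
#   $2 = job name, or "" for a workflow file (first test: key is used)
#   $3 = Go duration such as 8h0m0s, or "" for the default timeout
# Exit status: 0 ok, 1 no usable test: list found, 2 wait already present.
insert_wait_step() {
    awk -v job="$2" -v timeout="$3" '
    # Number of leading spaces on a line.
    function indent_of(s,    n) {
        n = 0
        while (substr(s, n + 1, 1) == " ") n++
        return n
    }
    { line[NR] = $0 }
    END {
        start = 0; job_indent = -1
        if (job != "") {
            # Locate the job entry.
            for (i = 1; i <= NR; i++) {
                if (substr(line[i], indent_of(line[i]) + 1) == "- as: " job) {
                    start = i; job_indent = indent_of(line[i]); break
                }
            }
            if (start == 0) exit 1
        }
        # Find the first test: key inside the job (or in the whole file).
        test_line = 0
        for (i = start + 1; i <= NR; i++) {
            if (line[i] ~ /^ *$/) continue
            ind = indent_of(line[i])
            if (ind <= job_indent) break
            if (substr(line[i], ind + 1) == "test:") {
                test_line = i; test_indent = ind; break
            }
        }
        if (test_line == 0) exit 1
        # Walk the list items that belong to test: and remember the last one.
        item_indent = -1; last = 0
        for (i = test_line + 1; i <= NR; i++) {
            if (line[i] ~ /^ *$/) continue
            ind = indent_of(line[i]); text = substr(line[i], ind + 1)
            if (item_indent < 0 && text ~ /^- / && ind >= test_indent) item_indent = ind
            if (item_indent < 0 || ind < item_indent) break
            if (ind == item_indent) {
                if (text !~ /^- /) break          # sibling key such as workflow:
                if (text == "- ref: wait") exit 2 # duplicate check
                last = i
            }
        }
        if (last == 0) exit 1
        # Reuse the indentation of the existing step for the new lines.
        pad = substr(line[last], 1, item_indent)
        for (i = 1; i <= NR; i++) {
            if (i == last) {
                print pad "- ref: wait"
                if (timeout != "") {
                    print pad "  timeout: " timeout
                    print pad "  best_effort: true"
                }
            }
            print line[i]
        }
    }' "$1"
}
```

Because the function writes to stdout and prints nothing on failure, a caller can redirect the result to a temporary file and replace the original only when the exit status is 0.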
Step 7: Push and Show GitHub Link
Auto-push the branch:
git push origin "${branch_name}"
Display GitHub PR creation link:
✅ Changes pushed successfully!
Create PR here:
https://github.com/openshift/release/compare/master...${branch_name}
Branch: ${branch_name}
Job: ${job_name}
OCP: ${ocp_version}
⚠️ Remember to close PR after debugging (DO NOT MERGE)
That's it! Simple and clean.
Error Handling
Error: Repository Not Found
❌ Error: Repository not found at ${repo_path}
Please provide the correct path to openshift/release repository.
To clone:
git clone https://github.com/openshift/release.git
Error: Not in openshift/release Repo
❌ Error: This doesn't appear to be the openshift/release repository
Remote URL: ${current_remote}
Expected: github.com/openshift/release
Please navigate to the correct repository.
Error: Workflow File Not Found
❌ Error: Workflow file not found
Searched for: *${workflow_name}*workflow*.yaml
Location: ci-operator/step-registry/
Suggestions:
1. Verify the workflow name
2. Try a partial match
3. Search manually: find ci-operator/step-registry -name "*workflow*.yaml"
Error: Wait Step Already Exists
ℹ️ Wait step already configured in this workflow
No action needed - you can proceed with debugging using the existing wait step.
Error: Invalid OCP Version
❌ Invalid OCP version: ${version}
Valid versions: 4.18, 4.19, 4.20, 4.21, 4.22, master
Please provide a valid version.
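The version check behind this message could be as small as the sketch below. The helper name is_valid_ocp_version is hypothetical, and the accepted values are simply the ones listed in the message above; a real implementation would need the list updated as releases change.

```shell
# Hypothetical helper: succeed only for the versions listed in the error message.
is_valid_ocp_version() {
    case "$1" in
        4.18|4.19|4.20|4.21|4.22|master) return 0 ;;
        *) return 1 ;;
    esac
}
```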
Error: Invalid Timeout Format
❌ Invalid timeout format: ${timeout}
Valid format: Integer followed by 'h' (e.g., "1h", "2h", "8h", "24h", "72h")
Valid range: 1h to 72h
Examples:
- "1h" (1 hour)
- "8h" (8 hours)
- "24h" (24 hours)
- "72h" (72 hours, maximum)
Please provide a valid timeout in hours.
Note: Timeout Normalization
When a user provides a timeout like "8h", the implementation should normalize it to the standard Go duration format "8h0m0s" for consistency with existing configurations in the codebase.
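A minimal sketch of the validation and normalization described above, assuming whole hours only and no leading zeros. The helper name normalize_timeout is hypothetical.

```shell
# Hypothetical helper: validate "<N>h" with 1 <= N <= 72 and print the
# Go duration form; return 1 for anything else.
normalize_timeout() {
    _input=${1-}
    _hours=${_input%h}
    # No trailing "h": the suffix removal left the value unchanged.
    [ "$_hours" != "$_input" ] || return 1
    # One or two digits, no leading zero, so 0h and 08h are rejected.
    case "$_hours" in
        [1-9]|[1-9][0-9]) ;;
        *) return 1 ;;
    esac
    [ "$_hours" -le 72 ] || return 1
    printf '%sh0m0s\n' "$_hours"
}
```

With this sketch, normalize_timeout 8h prints 8h0m0s, while inputs such as 0h, 73h, 8, or 8m fail and would lead to the invalid timeout message.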
Return Value
- Success: PR URL and debugging instructions
- Error: Error message with suggestions for resolution
- Format: Text output with emoji indicators for status
Examples
Example 1: Without Timeout (Default 3h)
/ci:add-debug-wait aws-ipi-f7-longduration-workload
Prompts for: OCP version (4.21), repo path
Result:
test:
- ref: wait
- chain: openshift-e2e-test-qe
Returns: PR creation link
Example 2: With Custom Timeout
/ci:add-debug-wait aws-ipi-f7-longduration-workload 8h
Prompts for: OCP version (4.21), repo path
Result:
test:
- ref: wait
  timeout: 8h0m0s
  best_effort: true
- chain: openshift-e2e-test-qe
Returns: PR creation link with timeout info
Example 3: Workflow File
/ci:add-debug-wait baremetalds-two-node-arbiter-upgrade 24h
Behavior: Searches job config first, falls back to workflow if not found. Warns that workflow changes affect ALL jobs using it.
Returns: PR creation link
Arguments
- $1 (workflow-or-job-name): The name of the CI workflow or job to add the wait step to (required)
- $2 (timeout): Optional timeout in hours (1h-72h). Examples: "1h", "8h", "24h", "72h". If not provided, uses wait step's default (3h)
Notes
Best Practices for QE
Before Running Command:
- ✅ Confirm test is actually failing
- ✅ Check existing debug PRs
- ✅ Know which OCP version is affected
During Debugging:
- 📝 Take detailed notes
- 💾 Save logs and screenshots
- 🔍 Document root cause
- 📊 Record all findings
After Debugging:
- ✅ Document findings
- ✅ Close the debug PR
- ✅ Delete the branch
- ✅ Share learnings with team
- ✅ Create fix PR if needed
Future Enhancements
Consider adding companion commands:
- /ci:close-debug-pr - Lists open debug PRs, prompts for findings, closes PR
- /ci:list-debug-prs - Show all open debug PRs
- /ci:revert-debug-pr - Revert a debug PR that was merged by mistake