---
description: Add a wait step to a CI workflow for debugging test failures
argument-hint: <workflow-or-job-name> [timeout]
---

## Name

ci:add-debug-wait

## Synopsis

```
/ci:add-debug-wait <workflow-or-job-name> [timeout]
```

## Description

The `ci:add-debug-wait` command adds a `wait` step to a CI job/workflow for debugging test failures.

**What it does:**

1. Takes job name, OCP version, and optional timeout as input
2. Finds and edits the job config or workflow file
3. Adds `- ref: wait` before the last test step (with optional timeout configuration)
4. Commits and pushes the change
5. Gives you a GitHub link to create the PR

**That's it!** Simple, fast, and automated.

## Implementation

The command performs the following steps:

### Step 1: Gather Required Information

**Prompt user for** (in this order):

1. **Workflow/Job Name**: (from command argument $1 or prompt)
   ```
   Workflow or job name:
   Example: aws-c2s-ipi-disc-priv-fips-f7
   Example: baremetalds-two-node-arbiter-e2e-openshift-test-private-tests
   ```

2. **Timeout** (optional, from command argument $2):
   ```
   Wait timeout in hours (optional, default: 3h):
   Examples: "1h", "2h", "8h", "24h", "72h"
   Valid range: 1h to 72h
   ```
   - If not provided, uses the wait step's default behavior (3 hours)
   - Format: Integer followed by 'h' (e.g., "1h", "2h", "8h")
   - Valid range: 1h to 72h (maximum enforced by wait step's timeout setting)
   - Will be normalized to Go duration format (e.g., "8h" → "8h0m0s")
   - This will be set as the `timeout:` property on the wait step in the workflow/job YAML

3. **OCP Version**: (prompt - REQUIRED for searching job configs)
   ```
   OCP version for debugging (e.g., 4.18, 4.19, 4.20, 4.21, 4.22):
   ```
   This is used to:
   - Search the correct job config file (e.g., release-4.21)
   - Document which version needs debugging
   - Add context to the PR

4. **OpenShift Release Repo Path**: (prompt if not in current directory)
   ```
   Path to openshift/release repository:
   Default: ~/repos/openshift-release
   ```

### Step 2: Validate Environment

**Silently validate** (no user prompts):

```bash
cd "${repo_path}"

# Check 1: Repository exists and is correct
git remote -v | grep "openshift/release" || exit 1

# Skip repo update - work with current state
# User can manually update their repo if needed
```

### Step 3: Search for Job/Test Configuration

**Priority 1: Search job configs first** (more specific and targeted):

```bash
cd "${repo_path}"

# Search for job config files matching the OCP version
# The job name could be in various config files, so search broadly
grep -r "as: ${job_name}" ci-operator/config/ --include="*release-${ocp_version}*.yaml" -l
```

**Example searches**:

- For `aws-c2s-ipi-disc-priv-fips-f7` and OCP 4.21:
  ```bash
  grep -r "as: aws-c2s-ipi-disc-priv-fips-f7" ci-operator/config/ --include="*release-4.21*.yaml" -l
  ```

**Handle job config search results**:

- **1 file found**:
  ```
  ✅ Found job configuration: ${file_path}
  Type: Job configuration file

  Proceeding with job config modification...
  ```
  → Continue to **Step 4a: Analyze Job Configuration**

- **Multiple files found**:
  ```
  Found ${count} matching job config files:

  1. ci-operator/config/.../release-4.21__amd64-nightly.yaml
  2. ci-operator/config/.../release-4.21__arm64-nightly.yaml
  3. ci-operator/config/.../release-4.21__ppc64le-nightly.yaml

  Select file (1-${count}) or 'q' to quit:
  ```
  **Prompt user to select** which file to modify, then continue to **Step 4a: Analyze Job Configuration**

- **0 files found**:
  ```
  ℹ️ No job config found for: ${job_name} (OCP ${ocp_version})
  Searching for workflow files instead...
  ```
  → Continue to **Priority 2** below

**Priority 2: Search workflow files** (if job config not found):

```bash
cd "${repo_path}"

# Search for workflow files
find ci-operator/step-registry -type f -name "*${workflow_name}*workflow*.yaml"
```

**Handle workflow search results**:

- **0 files found**:
  ```
  ❌ No job config or workflow file found for: ${job_name}

  Suggestions:
  1. Check spelling of job/workflow name
  2. Verify OCP version (${ocp_version})
  3. Try with partial name
  4. Search manually:
     - Job configs: grep -r "as: ${job_name}" ci-operator/config/
     - Workflows: find ci-operator/step-registry -name "*workflow*.yaml" | grep "${workflow_name}"
  ```

- **1 file found**:
  ```
  ✅ Found workflow file: ${file_path}
  Type: Workflow file

  Proceeding with workflow modification...
  ```
  → Continue to **Step 4b: Analyze Workflow File**

- **Multiple files found**:
  ```
  Found ${count} matching workflow files:

  1. ci-operator/step-registry/.../workflow1.yaml
  2. ci-operator/step-registry/.../workflow2.yaml
  3. ci-operator/step-registry/.../workflow3.yaml

  Select file (1-${count}) or 'q' to quit:
  ```
  **Prompt user to select** which file to modify, then continue to **Step 4b: Analyze Workflow File**

### Step 4a: Analyze Job Configuration

**Read and parse the job config YAML**:

```bash
# Find the specific test definition
grep -A 30 "as: ${job_name}" "${file_path}"
```

**Check for**:

1. ✅ Has `steps:` section
2. ✅ Has `test:` section inside steps
3. ❌ Does NOT already have `- ref: wait`

**Example current structure**:

```yaml
- as: aws-c2s-ipi-disc-priv-fips-f7
  cron: 36 16 3,12,19,26 * *
  steps:
    cluster_profile: aws-c2s-qe
    env:
      BASE_DOMAIN: qe.devcluster.openshift.com
      FIPS_ENABLED: "true"
    test:
    - chain: openshift-e2e-test-qe
    workflow: cucushift-installer-rehearse-aws-c2s-ipi-disconnected-private
```

**If wait already exists**:

```
ℹ️ Wait step already configured in job config

Current test section:
  test:
  - ref: wait
  - chain: openshift-e2e-test-qe

No changes needed. The job is already set up for debugging.
```

**If no test section found**:

```
ℹ️ Job config found but no test: section

This job uses only the workflow's test steps.
Searching for the workflow: ${workflow_name}
```

→ Fall back to searching for workflow (Priority 2 in Step 3)

→ Otherwise, continue to **Step 5a: Modify Job Config File**

### Step 4b: Analyze Workflow File

**Read and parse the workflow YAML**:

```bash
cat "${file_path}"
```

**Check for**:

1. ✅ Has `workflow:` section
2. ✅ Has `test:` section
3. ❌ Does NOT already have `- ref: wait`

**Example current structure**:

```yaml
workflow:
  as: baremetalds-two-node-arbiter-upgrade
  steps:
    pre:
    - chain: baremetalds-ipi-pre
    test:
    - chain: baremetalds-ipi-test
    post:
    - chain: baremetalds-ipi-post
```

**If wait already exists**:

```
ℹ️ Wait step already configured in workflow

Current test section:
  test:
  - ref: wait
  - chain: baremetalds-ipi-test

No changes needed. The workflow is already set up for debugging.
```

**If no test section exists**:

```
ℹ️ Workflow has no test: section

This workflow is provision/deprovision only.
The test steps must be defined in the job config.

Please provide the full job name to modify the job config instead.
```

→ Exit or prompt for job name

→ Otherwise, continue to **Step 5b: Modify Workflow File**

### Step 5a: Modify Job Config File

**Edit the job config file directly** - no confirmation needed:

```bash
# Add wait step before the last test step
# If timeout is provided, add it as a step property
# See Step 6 for the YAML modification algorithm
```

**Two scenarios**:

1. **Without custom timeout** (uses wait step's built-in default of 3h):
   ```yaml
   test:
   - ref: wait
   - chain: openshift-e2e-test-qe
   ```
   Note: No timeout or best_effort needed - the wait step will use its default TIMEOUT env var (3 hours)

2. **With custom timeout** (user provided timeout parameter):
   ```yaml
   test:
   - ref: wait
     timeout: 8h0m0s
     best_effort: true
   - chain: openshift-e2e-test-qe
   ```
   Note: `best_effort: true` is required when timeout is customized to prevent the wait step from failing the job if it times out

**Show brief confirmation**:

```
✅ Modified: ${job_name} (OCP ${ocp_version})
File: ${file_path}
Added: - ref: wait${timeout:+ (timeout: ${timeout})}
```

### Step 5b: Modify Workflow File

**Edit the workflow file directly** - no confirmation needed:

```bash
# Add wait step before the last test step
# If timeout is provided, add it as a step property
# See Step 6 for the YAML modification algorithm
```

**Two scenarios**:

1. **Without custom timeout** (uses wait step's built-in default of 3h):
   ```yaml
   test:
   - ref: wait
   - chain: baremetalds-ipi-test
   ```
   Note: No timeout or best_effort needed - the wait step will use its default TIMEOUT env var (3 hours)

2. **With custom timeout** (user provided timeout parameter):
   ```yaml
   test:
   - ref: wait
     timeout: 8h0m0s
     best_effort: true
   - chain: baremetalds-ipi-test
   ```
   Note: `best_effort: true` is required when timeout is customized to prevent the wait step from failing the job if it times out

**Show brief confirmation**:

```
✅ Modified: ${workflow_name} workflow
File: ${file_path}
Added: - ref: wait${timeout:+ (timeout: ${timeout})}
⚠️ Impact: Affects ALL jobs using this workflow
```

### Step 6: Create Branch and Commit

**Branch naming**:

```
debug-${workflow_name}-${ocp_version}-$(date +%Y%m%d)
```

Example: `debug-baremetalds-two-node-arbiter-4.21-20250131`

**Git operations**:

```bash
# Create branch
git checkout -b "${branch_name}"

# Modify the file (add wait step using the algorithm below)
# Add '- ref: wait' before the last step in the test: section

# Stage change
git add "${file_path}"

# Commit
git commit -m "[Debug] Add wait step to ${workflow_name} for OCP ${ocp_version}

This adds a wait step to enable debugging of test failures in OCP ${ocp_version}.
The wait step pauses the workflow before the last test step runs, allowing QE to:
- SSH into the test environment
- Inspect system state and logs
- Debug configuration issues
- Investigate test failures

OCP Version: ${ocp_version}
Workflow: ${workflow_name}"
```

**YAML Modification Algorithm**:

The modification process for both job configs and workflow files follows the same pattern:

1. **Locate the target**: Find the `test:` section
   - For job configs: Within the specific job definition (`- as: ${job_name}`)
   - For workflows: At the workflow level
2. **Find test steps**: Identify all steps (lines with `- ref:` or `- chain:`)
3. **Check for duplicates**: Ensure `- ref: wait` doesn't already exist
4. **Insert wait step**: Add before the **last** test step with matching indentation
5. **Handle timeout**:
   - Without timeout: Add simple `- ref: wait`
   - With timeout: Add as multi-line with `timeout` and `best_effort` properties

**Example transformation:**

Before:

```yaml
test:
- chain: openshift-e2e-test-qe
```

After (without timeout):

```yaml
test:
- ref: wait
- chain: openshift-e2e-test-qe
```

After (with timeout=8h):

```yaml
test:
- ref: wait
  timeout: 8h0m0s
  best_effort: true
- chain: openshift-e2e-test-qe
```

**Critical constraints:**

- Preserve exact YAML indentation (typically 2 spaces per level)
- Insert BEFORE the last step, not after
- When timeout is set, `best_effort: true` is required to prevent job failure
- Normalize timeout format to Go duration (e.g., "8h" → "8h0m0s")

### Step 7: Push and Show GitHub Link

**Auto-push the branch**:

```bash
git push origin "${branch_name}"
```

**Display GitHub PR creation link**:

```
✅ Changes pushed successfully!

Create PR here:
https://github.com/openshift/release/compare/master...${branch_name}

Branch: ${branch_name}
Job: ${job_name}
OCP: ${ocp_version}

⚠️ Remember to close PR after debugging (DO NOT MERGE)
```

That's it! Simple and clean.
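The YAML modification algorithm from Step 6 could be sketched as a line-based shell helper. This is only an illustration under stated assumptions: `insert_wait_step` is a hypothetical name, the file is indented with spaces, and only the first `test:` section (after `- as: JOB` when a job name is given) is considered. It does not parse YAML in general.

```bash
# Usage: insert_wait_step FILE [TIMEOUT] [JOB]   (hypothetical helper)
# Prints FILE with a wait step inserted before the last step of the first
# test: section; when JOB is given, only the block after "- as: JOB" is searched.
# Exit status: 0 inserted, 2 wait step already present, 3 no test steps found.
insert_wait_step() {
  awk -v timeout="$2" -v job="$3" '
    function indent_of(s,    t) { t = s; sub(/^ +/, "", t); return length(s) - length(t) }
    function strip(s) { sub(/^ +/, "", s); sub(/ +$/, "", s); return s }
    { lines[NR] = $0 }
    END {
      in_scope = (job == "")
      found_test = 0; item_indent = -1; last_item = 0; has_wait = 0
      for (i = 1; i <= NR; i++) {
        text = strip(lines[i]); ind = indent_of(lines[i])
        if (text == "") continue
        if (!in_scope) {                       # skip until the requested job
          if (text == ("- as: " job)) in_scope = 1
          continue
        }
        if (!found_test) {                     # look for the test: key
          if (text == "test:") found_test = 1
          else if (job != "" && text ~ /^- as: /) break   # reached the next job
          continue
        }
        if (item_indent < 0) {                 # first line after test:
          if (text !~ /^- /) break             # test: has no list items
          item_indent = ind
        }
        # A shallower line, or a non-item line at item depth, ends the section
        if (ind < item_indent || (ind == item_indent && text !~ /^- /)) break
        if (ind == item_indent) {
          if (text == "- ref: wait") has_wait = 1
          last_item = i                        # remember the last step seen
        }
      }
      if (!last_item) { print "no test steps found" > "/dev/stderr"; exit 3 }
      if (has_wait) { print "wait step already present" > "/dev/stderr"; exit 2 }
      pad = ""
      for (k = 0; k < item_indent; k++) pad = pad " "
      for (i = 1; i <= NR; i++) {
        if (i == last_item) {                  # insert before the last step
          print pad "- ref: wait"
          if (timeout != "") {
            print pad "  timeout: " timeout
            print pad "  best_effort: true"
          }
        }
        print lines[i]
      }
    }
  ' "$1"
}
```

For example, `insert_wait_step "${file_path}" 8h0m0s "${job_name}" > "${file_path}.new"` would write the modified config to a new file, and passing an empty second argument inserts a plain `- ref: wait`. Working on lines rather than re-serializing the YAML is a deliberate choice here, since it leaves the rest of the file's indentation and comments untouched.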
### Error Handling

**Error: Repository Not Found**

```
❌ Error: Repository not found at ${repo_path}

Please provide the correct path to openshift/release repository.
To clone: git clone https://github.com/openshift/release.git
```

**Error: Not in openshift/release Repo**

```
❌ Error: This doesn't appear to be the openshift/release repository

Remote URL: ${current_remote}
Expected: github.com/openshift/release

Please navigate to the correct repository.
```

**Error: Workflow File Not Found**

```
❌ Error: Workflow file not found

Searched for: *${workflow_name}*workflow*.yaml
Location: ci-operator/step-registry/

Suggestions:
1. Verify the workflow name
2. Try a partial match
3. Search manually: find ci-operator/step-registry -name "*workflow*.yaml"
```

**Error: Wait Step Already Exists**

```
ℹ️ Wait step already configured in this workflow

No action needed - you can proceed with debugging using the existing wait step.
```

**Error: Invalid OCP Version**

```
❌ Invalid OCP version: ${version}

Valid versions: 4.18, 4.19, 4.20, 4.21, 4.22, master

Please provide a valid version.
```

**Error: Invalid Timeout Format**

```
❌ Invalid timeout format: ${timeout}

Valid format: Integer followed by 'h' (e.g., "1h", "2h", "8h", "24h", "72h")
Valid range: 1h to 72h

Examples:
- "1h" (1 hour)
- "8h" (8 hours)
- "24h" (24 hours)
- "72h" (72 hours, maximum)

Please provide a valid timeout in hours.
```

**Note: Timeout Normalization**

When a user provides a timeout like "8h", the implementation should normalize it to the standard Go duration format "8h0m0s" for consistency with existing configurations in the codebase.
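A minimal sketch of the timeout validation and normalization described above, assuming a hypothetical helper named `normalize_timeout` (the name and the exact error wording are illustrative, not part of the command). It accepts an integer from 1 to 72 followed by `h` and prints the Go duration form.

```bash
# Hypothetical helper: validate "<N>h" with 1 <= N <= 72 and print the
# Go duration form, e.g. 8h -> 8h0m0s. Returns 1 on invalid input.
# Note: sets the plain shell variables "input" and "hours".
normalize_timeout() {
  input="$1"
  case "$input" in
    *h) hours="${input%h}" ;;   # strip the trailing unit
    *)  echo "Invalid timeout format: ${input}" >&2; return 1 ;;
  esac
  case "$hours" in
    # Empty, non-digit characters, or a leading zero (this also rejects 0h)
    ''|*[!0-9]*|0*)
      echo "Invalid timeout format: ${input}" >&2; return 1 ;;
    # Three or more digits can never be within 1..72
    ???*)
      echo "Timeout out of range (1h to 72h): ${input}" >&2; return 1 ;;
  esac
  if [ "$hours" -gt 72 ]; then
    echo "Timeout out of range (1h to 72h): ${input}" >&2
    return 1
  fi
  printf '%sh0m0s\n' "$hours"
}
```

Calling `timeout=$(normalize_timeout "$2") || exit 1` before editing the YAML would reject values such as `73h`, `8`, or `1.5h` early, so only a normalized value reaches the `timeout:` property.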
## Return Value

- **Success**: PR URL and debugging instructions
- **Error**: Error message with suggestions for resolution
- **Format**: Text output with emoji indicators for status

## Examples

### Example 1: Without Timeout (Default 3h)

```bash
/ci:add-debug-wait aws-ipi-f7-longduration-workload
```

Prompts for: OCP version (4.21), repo path

Result:

```yaml
test:
- ref: wait
- chain: openshift-e2e-test-qe
```

Returns: PR creation link

### Example 2: With Custom Timeout

```bash
/ci:add-debug-wait aws-ipi-f7-longduration-workload 8h
```

Prompts for: OCP version (4.21), repo path

Result:

```yaml
test:
- ref: wait
  timeout: 8h0m0s
  best_effort: true
- chain: openshift-e2e-test-qe
```

Returns: PR creation link with timeout info

### Example 3: Workflow File

```bash
/ci:add-debug-wait baremetalds-two-node-arbiter-upgrade 24h
```

Behavior: Searches job config first, falls back to workflow if not found. Warns that workflow changes affect ALL jobs using it.

Returns: PR creation link

## Arguments

- **$1** (workflow-or-job-name): The name of the CI workflow or job to add the wait step to (required)
- **$2** (timeout): Optional timeout in hours (1h-72h). Examples: "1h", "8h", "24h", "72h". If not provided, uses wait step's default (3h)

## Notes

### Best Practices for QE

**Before Running Command**:

- ✅ Confirm test is actually failing
- ✅ Check existing debug PRs
- ✅ Know which OCP version is affected

**During Debugging**:

- 📝 Take detailed notes
- 💾 Save logs and screenshots
- 🔍 Document root cause
- 📊 Record all findings

**After Debugging**:

- ✅ Document findings
- ✅ Close the debug PR
- ✅ Delete the branch
- ✅ Share learnings with team
- ✅ Create fix PR if needed

### Future Enhancements

Consider adding companion commands:

- `/ci:close-debug-pr` - Lists open debug PRs, prompts for findings, closes PR
- `/ci:list-debug-prs` - Show all open debug PRs
- `/ci:revert-debug-pr` - Revert a debug PR that was merged by mistake