---
description: Add a wait step to a CI workflow for debugging test failures
argument-hint: <workflow-or-job-name> [timeout]
---

## Name
ci:add-debug-wait

## Synopsis
```
/ci:add-debug-wait <workflow-or-job-name> [timeout]
```

## Description

The `ci:add-debug-wait` command adds a `wait` step to a CI job config or workflow file so the run pauses and the test environment can be inspected while debugging test failures.

**What it does:**
1. Takes a workflow or job name, an OCP version, and an optional timeout as input
2. Finds and edits the job config or workflow file
3. Adds `- ref: wait` before the last test step (with optional timeout configuration)
4. Commits and pushes the change
5. Gives you a GitHub link to create the PR

**That's it!** Simple, fast, and automated.

## Implementation

The command performs the following steps:

### Step 1: Gather Required Information

**Prompt user for** (in this order):

1. **Workflow/Job Name**: (from command argument $1 or prompt)
   ```
   Workflow or job name: <user-input>
   Example: aws-c2s-ipi-disc-priv-fips-f7
   Example: baremetalds-two-node-arbiter-e2e-openshift-test-private-tests
   ```

2. **Timeout** (optional, from command argument $2):
   ```
   Wait timeout in hours (optional, default: 3h):
   Examples: "1h", "2h", "8h", "24h", "72h"
   Valid range: 1h to 72h
   ```
   - If not provided, uses the wait step's default behavior (3 hours)
   - Format: Integer followed by 'h' (e.g., "1h", "2h", "8h")
   - Valid range: 1h to 72h (maximum enforced by wait step's timeout setting)
   - Will be normalized to Go duration format (e.g., "8h" → "8h0m0s")
   - This will be set as the `timeout:` property on the wait step in the workflow/job YAML

3. **OCP Version**: (prompt - REQUIRED for searching job configs)
   ```
   OCP version for debugging (e.g., 4.18, 4.19, 4.20, 4.21, 4.22):
   ```
   This is used to:
   - Search the correct job config file (e.g., release-4.21)
   - Document which version needs debugging
   - Add context to the PR

4. **OpenShift Release Repo Path**: (prompt if not in current directory)
   ```
   Path to openshift/release repository:
   Default: ~/repos/openshift-release
   ```

### Step 2: Validate Environment

**Silently validate** (no user prompts):

```bash
cd <repo-path>

# Check 1: Repository exists and is correct
# -q keeps the check silent; only the exit status is used
git remote -v | grep -q "openshift/release" || exit 1

# Skip repo update - work with current state
# User can manually update their repo if needed
```

### Step 3: Search for Job/Test Configuration

**Priority 1: Search job configs first** (more specific and targeted):

```bash
cd <repo-path>

# Search for job config files matching the OCP version
# The job name could be in various config files, so search broadly
grep -r "as: ${job_name}" ci-operator/config/ --include="*release-${ocp_version}*.yaml" -l
```

**Example searches**:
- For `aws-c2s-ipi-disc-priv-fips-f7` and OCP 4.21:
  ```bash
  grep -r "as: aws-c2s-ipi-disc-priv-fips-f7" ci-operator/config/ --include="*release-4.21*.yaml" -l
  ```

**Handle job config search results**:

- **1 file found**:
  ```
  ✅ Found job configuration:
     ${file_path}

  Type: Job configuration file

  Proceeding with job config modification...
  ```
  → Continue to **Step 4a: Analyze Job Configuration**

- **Multiple files found**:
  ```
  Found ${count} matching job config files:

  1. ci-operator/config/.../release-4.21__amd64-nightly.yaml
  2. ci-operator/config/.../release-4.21__arm64-nightly.yaml
  3. ci-operator/config/.../release-4.21__ppc64le-nightly.yaml

  Select file (1-${count}) or 'q' to quit:
  ```

  **Prompt user to select** which file to modify, then continue to **Step 4a: Analyze Job Configuration**

- **0 files found**:
  ```
  ℹ️ No job config found for: ${job_name} (OCP ${ocp_version})

  Searching for workflow files instead...
  ```
  → Continue to **Priority 2** below

**Priority 2: Search workflow files** (if job config not found):

```bash
cd <repo-path>

# Search for workflow files
find ci-operator/step-registry -type f -name "*${workflow_name}*workflow*.yaml"
```

**Handle workflow search results**:

- **0 files found**:
  ```
  ❌ No job config or workflow file found for: ${job_name}

  Suggestions:
  1. Check spelling of job/workflow name
  2. Verify OCP version (${ocp_version})
  3. Try with partial name
  4. Search manually:
     - Job configs: grep -r "as: ${job_name}" ci-operator/config/
     - Workflows: find ci-operator/step-registry -name "*workflow*.yaml" | grep <partial-name>
  ```

- **1 file found**:
  ```
  ✅ Found workflow file:
     ${file_path}

  Type: Workflow file

  Proceeding with workflow modification...
  ```
  → Continue to **Step 4b: Analyze Workflow File**

- **Multiple files found**:
  ```
  Found ${count} matching workflow files:

  1. ci-operator/step-registry/.../workflow1.yaml
  2. ci-operator/step-registry/.../workflow2.yaml
  3. ci-operator/step-registry/.../workflow3.yaml

  Select file (1-${count}) or 'q' to quit:
  ```

  **Prompt user to select** which file to modify, then continue to **Step 4b: Analyze Workflow File**
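
The three-way handling of search results in this step (no match, one match, several matches) could be sketched as a small shell helper. `classify_matches` is a hypothetical name introduced here for illustration; it is not part of the command, and the prompting itself is left out.

```bash
#!/usr/bin/env bash
# classify_matches is a hypothetical helper used only for illustration.
# It reads candidate file paths on stdin, one per line, and prints
# "none", "single", or "multiple".
classify_matches() {
  local count
  # grep -c prints the number of non-empty lines; it exits non-zero when
  # that number is 0, so "|| true" keeps the substitution from failing.
  count="$(grep -c . || true)"
  case "$count" in
    0) echo "none" ;;
    1) echo "single" ;;
    *) echo "multiple" ;;
  esac
}

# Usage sketch: feed it the output of the job config search from this step.
# matches="$(grep -r "as: ${job_name}" ci-operator/config/ \
#   --include="*release-${ocp_version}*.yaml" -l)"
# printf '%s\n' "$matches" | classify_matches
```

A caller would then branch on the printed word: fall through to the workflow search on `none`, proceed directly on `single`, and show the numbered selection prompt on `multiple`.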

### Step 4a: Analyze Job Configuration

**Read and parse the job config YAML**:

```bash
# Find the specific test definition
grep -A 30 "as: ${job_name}" <job-config-file>
```

**Check for**:
1. ✅ Has `steps:` section
2. ✅ Has `test:` section inside steps
3. ❌ Does NOT already have `- ref: wait`

**Example current structure**:
```yaml
- as: aws-c2s-ipi-disc-priv-fips-f7
  cron: 36 16 3,12,19,26 * *
  steps:
    cluster_profile: aws-c2s-qe
    env:
      BASE_DOMAIN: qe.devcluster.openshift.com
      FIPS_ENABLED: "true"
    test:
    - chain: openshift-e2e-test-qe
    workflow: cucushift-installer-rehearse-aws-c2s-ipi-disconnected-private
```

**If wait already exists**:
```
ℹ️ Wait step already configured in job config

Current test section:
  test:
  - ref: wait
  - chain: openshift-e2e-test-qe

No changes needed. The job is already set up for debugging.
```

**If no test section found**:
```
ℹ️ Job config found but no test: section

This job uses only the workflow's test steps.
Searching for the workflow: ${workflow_name}
```
→ Fall back to searching for workflow (Priority 2 in Step 3)

→ Otherwise, continue to **Step 5a: Modify Job Config File**

### Step 4b: Analyze Workflow File

**Read and parse the workflow YAML**:

```bash
cat <workflow-file>
```

**Check for**:
1. ✅ Has `workflow:` section
2. ✅ Has `test:` section
3. ❌ Does NOT already have `- ref: wait`

**Example current structure**:
```yaml
workflow:
  as: baremetalds-two-node-arbiter-upgrade
  steps:
    pre:
    - chain: baremetalds-ipi-pre
    test:
    - chain: baremetalds-ipi-test
    post:
    - chain: baremetalds-ipi-post
```

**If wait already exists**:
```
ℹ️ Wait step already configured in workflow

Current test section:
  test:
  - ref: wait
  - chain: baremetalds-ipi-test

No changes needed. The workflow is already set up for debugging.
```

**If no test section exists**:
```
ℹ️ Workflow has no test: section

This workflow is provision/deprovision only.
The test steps must be defined in the job config.

Please provide the full job name to modify the job config instead.
```
→ Exit or prompt for job name

→ Otherwise, continue to **Step 5b: Modify Workflow File**

### Step 5a: Modify Job Config File

**Edit the job config file directly** - no confirmation needed:

```bash
# Add wait step before the last test step
# If timeout is provided, add it as a step property
# See Step 6 for the YAML modification algorithm
```

**Two scenarios**:

1. **Without custom timeout** (uses wait step's built-in default of 3h):
   ```yaml
   test:
   - ref: wait
   - chain: openshift-e2e-test-qe
   ```
   Note: No timeout or best_effort needed - the wait step will use its default TIMEOUT env var (3 hours)

2. **With custom timeout** (user provided timeout parameter):
   ```yaml
   test:
   - ref: wait
     timeout: 8h0m0s
     best_effort: true
   - chain: openshift-e2e-test-qe
   ```
   Note: `best_effort: true` is required when timeout is customized to prevent the wait step from failing the job if it times out

**Show brief confirmation**:
```
✅ Modified: ${job_name} (OCP ${ocp_version})
   File: <job-config-file-path>
   Added: - ref: wait${timeout:+ (timeout: ${timeout})}
```

### Step 5b: Modify Workflow File

**Edit the workflow file directly** - no confirmation needed:

```bash
# Add wait step before the last test step
# If timeout is provided, add it as a step property
# See Step 6 for the YAML modification algorithm
```

**Two scenarios**:

1. **Without custom timeout** (uses wait step's built-in default of 3h):
   ```yaml
   test:
   - ref: wait
   - chain: baremetalds-ipi-test
   ```
   Note: No timeout or best_effort needed - the wait step will use its default TIMEOUT env var (3 hours)

2. **With custom timeout** (user provided timeout parameter):
   ```yaml
   test:
   - ref: wait
     timeout: 8h0m0s
     best_effort: true
   - chain: baremetalds-ipi-test
   ```
   Note: `best_effort: true` is required when timeout is customized to prevent the wait step from failing the job if it times out

**Show brief confirmation**:
```
✅ Modified: ${workflow_name} workflow
   File: <workflow-file-path>
   Added: - ref: wait${timeout:+ (timeout: ${timeout})}
   ⚠️ Impact: Affects ALL jobs using this workflow
```

### Step 6: Create Branch and Commit

**Branch naming**:
```
debug-${workflow_name}-${ocp_version}-$(date +%Y%m%d)
```

Example: `debug-baremetalds-two-node-arbiter-4.21-20250131`

**Git operations**:
```bash
# Create branch
git checkout -b "${branch_name}"

# Modify the file (add wait step using the YAML modification algorithm below)
# Add '- ref: wait' before the last step in the test: section

# Stage change (the job config or workflow file modified in Step 5a/5b)
git add <modified-file>

# Commit
git commit -m "[Debug] Add wait step to ${workflow_name} for OCP ${ocp_version}

This adds a wait step to enable debugging of test failures in OCP ${ocp_version}.

The wait step pauses the workflow before tests run, allowing QE to:
- SSH into the test environment
- Inspect system state and logs
- Debug configuration issues
- Investigate test failures

OCP Version: ${ocp_version}
Workflow: ${workflow_name}"
```

**YAML Modification Algorithm**:

The modification process for both job configs and workflow files follows the same pattern:

1. **Locate the target**: Find the `test:` section
   - For job configs: Within the specific job definition (`- as: ${job_name}`)
   - For workflows: At the workflow level

2. **Find test steps**: Identify all steps (lines with `- ref:` or `- chain:`)

3. **Check for duplicates**: Ensure `- ref: wait` doesn't already exist

4. **Insert wait step**: Add before the **last** test step with matching indentation

5. **Handle timeout**:
   - Without timeout: Add simple `- ref: wait`
   - With timeout: Add as multi-line with `timeout` and `best_effort` properties

**Example transformation:**

Before:
```yaml
test:
- chain: openshift-e2e-test-qe
```

After (without timeout):
```yaml
test:
- ref: wait
- chain: openshift-e2e-test-qe
```

After (with timeout=8h):
```yaml
test:
- ref: wait
  timeout: 8h0m0s
  best_effort: true
- chain: openshift-e2e-test-qe
```

**Critical constraints:**
- Preserve exact YAML indentation (typically 2 spaces per level)
- Insert BEFORE the last step, not after
- When timeout is set, `best_effort: true` is required to prevent job failure
- Normalize timeout format to Go duration (e.g., "8h" → "8h0m0s")
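
The algorithm and constraints above could be sketched as a line-based shell helper built on `awk`. This is an illustration under stated assumptions, not the command's actual implementation: `insert_wait_step` and its argument order are hypothetical, the YAML is assumed to be block-style and indented with spaces, step items are assumed to start with `- ` at a common indentation, and the timeout is assumed to be already normalized.

```bash
#!/usr/bin/env bash
# insert_wait_step FILE [JOB_NAME] [TIMEOUT]
# Hypothetical helper, shown for illustration only. Prints FILE to stdout with
# a wait step inserted before the last step of the target test: section.
# Assumes block-style YAML indented with spaces; pass "" as JOB_NAME for
# workflow files. TIMEOUT is expected to be already normalized (e.g. 8h0m0s).
insert_wait_step() {
  local file="$1" job="${2:-}" timeout="${3:-}"
  awk -v job="$job" -v timeout="$timeout" '
    function indent_of(s) { match(s, /^ */); return RLENGTH }
    function trimmed(s)   { sub(/^ +/, "", s); return s }

    # Pass 1: find the last step of the first matching test: section.
    NR == FNR {
      ind = indent_of($0); body = trimmed($0)
      if (body == "") next
      if (in_test) {
        if (step_ind < 0 && body ~ /^- /) step_ind = ind
        if (step_ind >= 0 && ind == step_ind && body ~ /^- /) {
          last_step = FNR
          if (body == "- ref: wait") has_wait = 1   # duplicate check
          next
        }
        if (step_ind >= 0 && ind > step_ind) next   # property of a step
        in_test = 0; done = 1                       # section has ended
      }
      if (done) next
      if (job != "") {
        if (body == "- as: " job) { in_job = 1; job_ind = ind; next }
        if (in_job && ind <= job_ind && body ~ /^- /) in_job = 0
      }
      if ((job == "" || in_job) && body == "test:") { in_test = 1; step_ind = -1 }
      next
    }

    # Pass 2: print every line, adding the wait step before the last step.
    FNR == last_step && !has_wait {
      pad = substr($0, 1, step_ind)
      print pad "- ref: wait"
      if (timeout != "") {
        print pad "  timeout: " timeout
        print pad "  best_effort: true"
      }
    }
    { print }
  ' "$file" "$file"
}
```

Reading the file twice keeps the logic simple: the first pass only records where the last step sits and whether a wait step is already present, and the second pass copies the file through unchanged apart from the inserted lines. If a wait step already exists or no `test:` section is found, the output is identical to the input. A YAML-aware tool would be more robust for unusual layouts such as flow-style lists.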

### Step 7: Push and Show GitHub Link

**Auto-push the branch**:
```bash
git push origin "${branch_name}"
```

**Display GitHub PR creation link**:
```
✅ Changes pushed successfully!

Create PR here:
https://github.com/openshift/release/compare/master...${branch_name}

Branch: ${branch_name}
Job: ${job_name}
OCP: ${ocp_version}

⚠️ Remember to close PR after debugging (DO NOT MERGE)
```

That's it! Simple and clean.

### Error Handling

**Error: Repository Not Found**
```
❌ Error: Repository not found at ${repo_path}

Please provide the correct path to openshift/release repository.

To clone:
  git clone https://github.com/openshift/release.git
```

**Error: Not in openshift/release Repo**
```
❌ Error: This doesn't appear to be the openshift/release repository

Remote URL: ${current_remote}
Expected: github.com/openshift/release

Please navigate to the correct repository.
```

**Error: Workflow File Not Found**
```
❌ Error: Workflow file not found

Searched for: *${workflow_name}*workflow*.yaml
Location: ci-operator/step-registry/

Suggestions:
1. Verify the workflow name
2. Try a partial match
3. Search manually: find ci-operator/step-registry -name "*workflow*.yaml"
```

**Error: Wait Step Already Exists**
```
ℹ️ Wait step already configured in this workflow

No action needed - you can proceed with debugging using the existing wait step.
```

**Error: Invalid OCP Version**
```
❌ Invalid OCP version: ${version}

Valid versions: 4.18, 4.19, 4.20, 4.21, 4.22, master

Please provide a valid version.
```
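
The check behind the invalid-version error could look like the sketch below. `is_valid_ocp_version` is a hypothetical name, and the accepted values are simply copied from the error message; a real implementation would need that list updated as releases change.

```bash
#!/usr/bin/env bash
# is_valid_ocp_version is a hypothetical helper for illustration.
# The accepted values mirror the "Valid versions" line of the error message.
is_valid_ocp_version() {
  case "$1" in
    4.18|4.19|4.20|4.21|4.22|master) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage sketch: print the error and stop when the version is not recognized.
# is_valid_ocp_version "$ocp_version" || { echo "Invalid OCP version: $ocp_version"; exit 1; }
```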

**Error: Invalid Timeout Format**
```
❌ Invalid timeout format: ${timeout}

Valid format: Integer followed by 'h' (e.g., "1h", "2h", "8h", "24h", "72h")
Valid range: 1h to 72h

Examples:
  - "1h" (1 hour)
  - "8h" (8 hours)
  - "24h" (24 hours)
  - "72h" (72 hours, maximum)

Please provide a valid timeout in hours.
```

### Note: Timeout Normalization

When a user provides a timeout like "8h", the implementation should normalize it to the standard Go duration format "8h0m0s" for consistency with existing configurations in the codebase.
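
A minimal sketch of that normalization, combined with the documented format and 1h to 72h range check, is shown below. `normalize_timeout` is a hypothetical name, and rejecting leading zeros such as "08h" is an assumption of this sketch rather than a documented rule.

```bash
#!/usr/bin/env bash
# normalize_timeout is a hypothetical helper for illustration.
# Prints the Go-style duration for a valid "<N>h" value (1h to 72h) and
# returns 1 for anything else, so the caller can show the format error.
normalize_timeout() {
  local input="$1" hours
  # Accept one or two digits followed by a literal "h". Leading zeros are
  # rejected here, which is an assumption of this sketch.
  case "$input" in
    [1-9]h|[1-9][0-9]h) ;;
    *) return 1 ;;
  esac
  hours="${input%h}"
  # Enforce the documented upper bound of 72 hours.
  [ "$hours" -le 72 ] || return 1
  printf '%sh0m0s\n' "$hours"
}

# Usage sketch:
# timeout="$(normalize_timeout "$2")" || echo "Invalid timeout format: $2"
```

Restricting the pattern to at most two digits keeps the numeric comparison safe from very long inputs before the range check runs.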

## Return Value

- **Success**: PR creation link, along with the branch, job, and OCP version
- **Error**: Error message with suggestions for resolution
- **Format**: Text output with emoji indicators for status

## Examples

### Example 1: Without Timeout (Default 3h)

```bash
/ci:add-debug-wait aws-ipi-f7-longduration-workload
```

Prompts for: OCP version (4.21), repo path

Result:
```yaml
test:
- ref: wait
- chain: openshift-e2e-test-qe
```

Returns: PR creation link

### Example 2: With Custom Timeout

```bash
/ci:add-debug-wait aws-ipi-f7-longduration-workload 8h
```

Prompts for: OCP version (4.21), repo path

Result:
```yaml
test:
- ref: wait
  timeout: 8h0m0s
  best_effort: true
- chain: openshift-e2e-test-qe
```

Returns: PR creation link with timeout info

### Example 3: Workflow File

```bash
/ci:add-debug-wait baremetalds-two-node-arbiter-upgrade 24h
```

Behavior: Searches job config first, falls back to workflow if not found. Warns that workflow changes affect ALL jobs using it.

Returns: PR creation link

## Arguments

- **$1** (workflow-or-job-name): The name of the CI workflow or job to add the wait step to (required)
- **$2** (timeout): Optional timeout in hours (1h-72h). Examples: "1h", "8h", "24h", "72h". If not provided, uses wait step's default (3h)

## Notes

### Best Practices for QE

**Before Running Command**:
- ✅ Confirm test is actually failing
- ✅ Check existing debug PRs
- ✅ Know which OCP version is affected

**During Debugging**:
- 📝 Take detailed notes
- 💾 Save logs and screenshots
- 🔍 Document root cause
- 📊 Record all findings

**After Debugging**:
- ✅ Document findings
- ✅ Close the debug PR
- ✅ Delete the branch
- ✅ Share learnings with team
- ✅ Create fix PR if needed

### Future Enhancements

Consider adding companion commands:
- `/ci:close-debug-pr` - Lists open debug PRs, prompts for findings, closes PR
- `/ci:list-debug-prs` - Show all open debug PRs
- `/ci:revert-debug-pr` - Revert a debug PR that was merged by mistake