gh-openshift-eng-ai-helpers…/commands/add-debug-wait.md
2025-11-30 08:45:38 +08:00

---
description: Add a wait step to a CI workflow for debugging test failures
argument-hint: <workflow-or-job-name> [timeout]
---
## Name
ci:add-debug-wait
## Synopsis
```
/ci:add-debug-wait <workflow-or-job-name> [timeout]
```
## Description
The `ci:add-debug-wait` command adds a `wait` step to a CI job/workflow for debugging test failures.
**What it does:**
1. Takes job name, OCP version, and optional timeout as input
2. Finds and edits the job config or workflow file
3. Adds `- ref: wait` before the last test step (with optional timeout configuration)
4. Commits and pushes the change
5. Gives you a GitHub link to create the PR
**That's it!** Simple, fast, and automated.
## Implementation
The command performs the following steps:
### Step 1: Gather Required Information
**Prompt user for** (in this order):
1. **Workflow/Job Name**: (from command argument $1 or prompt)
```
Workflow or job name: <user-input>
Example: aws-c2s-ipi-disc-priv-fips-f7
Example: baremetalds-two-node-arbiter-e2e-openshift-test-private-tests
```
2. **Timeout** (optional, from command argument $2):
```
Wait timeout in hours (optional, default: 3h):
Examples: "1h", "2h", "8h", "24h", "72h"
Valid range: 1h to 72h
```
- If not provided, uses the wait step's default behavior (3 hours)
- Format: Integer followed by 'h' (e.g., "1h", "2h", "8h")
- Valid range: 1h to 72h (maximum enforced by wait step's timeout setting)
- Will be normalized to Go duration format (e.g., "8h" → "8h0m0s")
- This will be set as the `timeout:` property on the wait step in the workflow/job YAML
3. **OCP Version**: (prompt - REQUIRED for searching job configs)
```
OCP version for debugging (e.g., 4.18, 4.19, 4.20, 4.21, 4.22):
```
This is used to:
- Search the correct job config file (e.g., release-4.21)
- Document which version needs debugging
- Add context to the PR
4. **OpenShift Release Repo Path**: (prompt if not in current directory)
```
Path to openshift/release repository:
Default: ~/repos/openshift-release
```
### Step 2: Validate Environment
**Silently validate** (no user prompts):
```bash
cd <repo-path>
# Check 1: Repository exists and is correct
git remote -v | grep -q "openshift/release" || exit 1
# Skip repo update - work with current state
# User can manually update their repo if needed
```
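The `grep` in the check above accepts any remote whose URL merely contains `openshift/release`, including a repository whose name only starts with `release`. A stricter variant might look like the sketch below. `has_release_remote` is a hypothetical helper name, and the assumption that a valid remote URL ends in `openshift/release` or `openshift/release.git` is not something the command itself specifies.

```bash
# Hypothetical helper: reads `git remote -v` output on stdin and succeeds
# if any remote URL ends in openshift/release (optionally with .git).
has_release_remote() {
  local re='[:/]openshift/release(\.git)?$'
  local name url _
  while read -r name url _; do
    # Each line looks like: <name> <url> (fetch|push)
    if [[ "$url" =~ $re ]]; then
      return 0
    fi
  done
  return 1
}
```

It could be used as `git -C "$repo_path" remote -v | has_release_remote || exit 1`.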
### Step 3: Search for Job/Test Configuration
**Priority 1: Search job configs first** (more specific and targeted):
```bash
cd <repo-path>
# Search for job config files matching the OCP version
# The job name could be in various config files, so search broadly
grep -r "as: ${job_name}" ci-operator/config/ --include="*release-${ocp_version}*.yaml" -l
```
**Example searches**:
- For `aws-c2s-ipi-disc-priv-fips-f7` and OCP 4.21:
```bash
grep -r "as: aws-c2s-ipi-disc-priv-fips-f7" ci-operator/config/ --include="*release-4.21*.yaml" -l
```
**Handle job config search results**:
- **1 file found**:
```
✅ Found job configuration:
${file_path}
Type: Job configuration file
Proceeding with job config modification...
```
→ Continue to **Step 4a: Analyze Job Configuration**
- **Multiple files found**:
```
Found ${count} matching job config files:
1. ci-operator/config/.../release-4.21__amd64-nightly.yaml
2. ci-operator/config/.../release-4.21__arm64-nightly.yaml
3. ci-operator/config/.../release-4.21__ppc64le-nightly.yaml
Select file (1-${count}) or 'q' to quit:
```
**Prompt user to select** which file to modify, then continue to **Step 4a: Analyze Job Configuration**
- **0 files found**:
```
No job config found for: ${job_name} (OCP ${ocp_version})
Searching for workflow files instead...
```
→ Continue to **Priority 2** below
**Priority 2: Search workflow files** (if job config not found):
```bash
cd <repo-path>
# Search for workflow files
find ci-operator/step-registry -type f -name "*${workflow_name}*workflow*.yaml"
```
**Handle workflow search results**:
- **0 files found**:
```
❌ No job config or workflow file found for: ${job_name}
Suggestions:
1. Check spelling of job/workflow name
2. Verify OCP version (${ocp_version})
3. Try with partial name
4. Search manually:
- Job configs: grep -r "as: ${job_name}" ci-operator/config/
- Workflows: find ci-operator/step-registry -name "*workflow*.yaml" | grep <partial-name>
```
- **1 file found**:
```
✅ Found workflow file:
${file_path}
Type: Workflow file
Proceeding with workflow modification...
```
→ Continue to **Step 4b: Analyze Workflow File**
- **Multiple files found**:
```
Found ${count} matching workflow files:
1. ci-operator/step-registry/.../workflow1.yaml
2. ci-operator/step-registry/.../workflow2.yaml
3. ci-operator/step-registry/.../workflow3.yaml
Select file (1-${count}) or 'q' to quit:
```
**Prompt user to select** which file to modify, then continue to **Step 4b: Analyze Workflow File**
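The two-tier lookup in this step could be sketched as a single function. This is an illustration only: `find_debug_target` is a hypothetical name, and the output convention (a type label on the first line, then one path per line) is an assumption made for the example.

```bash
# Hypothetical helper: prints "job-config" or "workflow" on the first line,
# followed by one matching path per line. Returns 1 when nothing matches.
find_debug_target() {
  local repo="$1" name="$2" version="$3" matches
  # Priority 1: job configs for the requested OCP version.
  # -F treats the job name literally; note that "as: foo" also matches
  # "as: foo-bar", so callers may still need to filter the results.
  matches=$(grep -rlF --include="*release-${version}*.yaml" \
    -e "as: ${name}" "${repo}/ci-operator/config/" 2>/dev/null)
  if [[ -n "$matches" ]]; then
    printf 'job-config\n%s\n' "$matches"
    return 0
  fi
  # Priority 2: workflow files in the step registry.
  matches=$(find "${repo}/ci-operator/step-registry" -type f \
    -name "*${name}*workflow*.yaml" 2>/dev/null)
  if [[ -n "$matches" ]]; then
    printf 'workflow\n%s\n' "$matches"
    return 0
  fi
  return 1
}
```

The caller can count the path lines to decide between the single-match, multiple-match, and no-match branches described above.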
### Step 4a: Analyze Job Configuration
**Read and parse the job config YAML**:
```bash
# Find the specific test definition
grep -A 30 "as: ${job_name}" <job-config-file>
```
**Check for**:
1. ✅ Has `steps:` section
2. ✅ Has `test:` section inside steps
3. ❌ Does NOT already have `- ref: wait`
**Example current structure**:
```yaml
- as: aws-c2s-ipi-disc-priv-fips-f7
  cron: 36 16 3,12,19,26 * *
  steps:
    cluster_profile: aws-c2s-qe
    env:
      BASE_DOMAIN: qe.devcluster.openshift.com
      FIPS_ENABLED: "true"
    test:
    - chain: openshift-e2e-test-qe
    workflow: cucushift-installer-rehearse-aws-c2s-ipi-disconnected-private
```
**If wait already exists**:
```
Wait step already configured in job config
Current test section:
test:
- ref: wait
- chain: openshift-e2e-test-qe
No changes needed. The job is already set up for debugging.
```
**If no test section found**:
```
Job config found but no test: section
This job uses only the workflow's test steps.
Searching for the workflow: ${workflow_name}
```
→ Fall back to searching for the workflow (Priority 2 in Step 3)
**Otherwise** (a `test:` section exists and has no wait step) → Continue to **Step 5a: Modify Job Config File**
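The duplicate check in this step has to look only inside the selected job's stanza, because another job in the same file may already contain a wait step. A minimal sketch follows, assuming space-indented YAML in which each job starts with a `- as: <name>` line. `job_has_wait` is a hypothetical helper name.

```bash
# Hypothetical helper: exit status 0 means the named job's stanza already
# contains "- ref: wait"; 1 means it does not (or the job was not found).
job_has_wait() {
  local file="$1" job="$2"
  awk -v job="$job" '
    # Number of leading spaces on a line.
    function indent_of(s,   t) { t = s; sub(/^ */, "", t); return length(s) - length(t) }
    /^ *- as: / {
      name = $0
      sub(/^ *- as: */, "", name)
      sub(/ +$/, "", name)
      ind = indent_of($0)
      # A "- as:" at the same or a shallower indent closes the current stanza;
      # deeper ones are inline steps that belong to the job.
      if (in_job && ind <= job_indent) in_job = 0
      if (!in_job && name == job) { in_job = 1; job_indent = ind }
      next
    }
    in_job && /^ *- ref: wait *$/ { found = 1 }
    END { exit (found ? 0 : 1) }
  ' "$file"
}
```

For example, `job_has_wait "$config_file" "$job_name" && echo "Wait step already configured"`.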
### Step 4b: Analyze Workflow File
**Read and parse the workflow YAML**:
```bash
cat <workflow-file>
```
**Check for**:
1. ✅ Has `workflow:` section
2. ✅ Has `test:` section
3. ❌ Does NOT already have `- ref: wait`
**Example current structure**:
```yaml
workflow:
  as: baremetalds-two-node-arbiter-upgrade
  steps:
    pre:
    - chain: baremetalds-ipi-pre
    test:
    - chain: baremetalds-ipi-test
    post:
    - chain: baremetalds-ipi-post
```
**If wait already exists**:
```
Wait step already configured in workflow
Current test section:
test:
- ref: wait
- chain: baremetalds-ipi-test
No changes needed. The workflow is already set up for debugging.
```
**If no test section exists**:
```
Workflow has no test: section
This workflow is provision/deprovision only.
The test steps must be defined in the job config.
Please provide the full job name to modify the job config instead.
```
→ Exit or prompt for job name
**Otherwise** (a `test:` section exists and has no wait step) → Continue to **Step 5b: Modify Workflow File**
### Step 5a: Modify Job Config File
**Edit the job config file directly** - no confirmation needed:
```bash
# Add wait step before the last test step
# If timeout is provided, add it as a step property
# See Step 6 for the YAML modification algorithm
```
**Two scenarios**:
1. **Without custom timeout** (uses wait step's built-in default of 3h):
```yaml
test:
- ref: wait
- chain: openshift-e2e-test-qe
```
Note: No timeout or best_effort needed - the wait step will use its default TIMEOUT env var (3 hours)
2. **With custom timeout** (user provided timeout parameter):
```yaml
test:
- ref: wait
  timeout: 8h0m0s
  best_effort: true
- chain: openshift-e2e-test-qe
```
Note: `best_effort: true` is required when timeout is customized to prevent the wait step from failing the job if it times out
**Show brief confirmation**:
```
✅ Modified: ${job_name} (OCP ${ocp_version})
File: <job-config-file-path>
Added: - ref: wait${timeout:+ (timeout: ${timeout})}
```
### Step 5b: Modify Workflow File
**Edit the workflow file directly** - no confirmation needed:
```bash
# Add wait step before the last test step
# If timeout is provided, add it as a step property
# See Step 6 for the YAML modification algorithm
```
**Two scenarios**:
1. **Without custom timeout** (uses wait step's built-in default of 3h):
```yaml
test:
- ref: wait
- chain: baremetalds-ipi-test
```
Note: No timeout or best_effort needed - the wait step will use its default TIMEOUT env var (3 hours)
2. **With custom timeout** (user provided timeout parameter):
```yaml
test:
- ref: wait
  timeout: 8h0m0s
  best_effort: true
- chain: baremetalds-ipi-test
```
Note: `best_effort: true` is required when timeout is customized to prevent the wait step from failing the job if it times out
**Show brief confirmation**:
```
✅ Modified: ${workflow_name} workflow
File: <workflow-file-path>
Added: - ref: wait${timeout:+ (timeout: ${timeout})}
⚠️ Impact: Affects ALL jobs using this workflow
```
### Step 6: Create Branch and Commit
**Branch naming**:
```
debug-${workflow_name}-${ocp_version}-$(date +%Y%m%d)
```
Example: `debug-baremetalds-two-node-arbiter-4.21-20250131`
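The naming pattern can be wrapped in a small function so the date is computed once, when the branch is created. `make_branch_name` is a hypothetical helper name used only for this sketch.

```bash
# Hypothetical helper: builds debug-<name>-<version>-<YYYYMMDD>.
make_branch_name() {
  local name="$1" version="$2"
  printf 'debug-%s-%s-%s\n' "$name" "$version" "$(date +%Y%m%d)"
}
```

A caller could then run `branch_name=$(make_branch_name "$workflow_name" "$ocp_version")` and, if desired, confirm the result with `git check-ref-format --branch "$branch_name"` before creating the branch.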
**Git operations**:
```bash
# Create branch
git checkout -b "${branch_name}"
# Modify the file (add the wait step using the algorithm below)
# Insert '- ref: wait' before the last step in the test: section
# Stage change
git add <modified-file>
# Commit
git commit -m "[Debug] Add wait step to ${workflow_name} for OCP ${ocp_version}

This adds a wait step to enable debugging of test failures in OCP ${ocp_version}.
The wait step pauses the job before the last test step runs, allowing QE to:
- SSH into the test environment
- Inspect system state and logs
- Debug configuration issues
- Investigate test failures

OCP Version: ${ocp_version}
Workflow: ${workflow_name}"
```
**YAML Modification Algorithm**:
The modification process for both job configs and workflow files follows the same pattern:
1. **Locate the target**: Find the `test:` section
- For job configs: Within the specific job definition (`- as: ${job_name}`)
- For workflows: At the workflow level
2. **Find test steps**: Identify all steps (lines with `- ref:` or `- chain:`)
3. **Check for duplicates**: Ensure `- ref: wait` doesn't already exist
4. **Insert wait step**: Add before the **last** test step with matching indentation
5. **Handle timeout**:
- Without timeout: Add simple `- ref: wait`
- With timeout: Add as multi-line with `timeout` and `best_effort` properties
**Example transformation:**
Before:
```yaml
test:
- chain: openshift-e2e-test-qe
```
After (without timeout):
```yaml
test:
- ref: wait
- chain: openshift-e2e-test-qe
```
After (with timeout=8h):
```yaml
test:
- ref: wait
  timeout: 8h0m0s
  best_effort: true
- chain: openshift-e2e-test-qe
```
**Critical constraints:**
- Preserve exact YAML indentation (typically 2 spaces per level)
- Insert BEFORE the last step, not after
- When timeout is set, `best_effort: true` is required to prevent job failure
- Normalize timeout format to Go duration (e.g., "8h" → "8h0m0s")
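The algorithm and constraints above can be sketched with `awk`, which avoids re-serializing the YAML and so leaves existing indentation untouched. This is a sketch under stated assumptions, not a definitive implementation: `insert_wait_step` is a hypothetical name, the file is assumed to be space-indented, the exit codes (1 for "no test steps found", 2 for "wait step already present") are invented for the example, and the timeout is expected to be already normalized.

```bash
# Hypothetical helper: prints the modified YAML on stdout.
# $1 = file, $2 = job name (empty for workflow files), $3 = timeout (optional)
insert_wait_step() {
  local file="$1" job="$2" timeout="$3"
  awk -v job="$job" -v timeout="$timeout" '
    function indent_of(s,   t) { t = s; sub(/^ */, "", t); return length(s) - length(t) }
    { line[NR] = $0 }
    END {
      in_scope = (job == "")   # workflow files are in scope from the start
      in_test = 0; item_indent = -1; last = 0
      for (i = 1; i <= NR; i++) {
        s = line[i]
        if (s ~ /^ *$/) continue
        ind = indent_of(s)
        if (in_test) {
          is_item = (s ~ /^ *- /)
          if (item_indent < 0) {
            if (!is_item) break            # test: has no list items
            item_indent = ind
          }
          # A shallower line, or a sibling key, ends the test: section.
          if (ind < item_indent || (ind == item_indent && !is_item)) break
          if (ind == item_indent) {
            if (s ~ /^ *- ref: wait *$/) exit 2   # duplicate
            last = i                              # start of the latest step
          }
        } else if (in_scope && s ~ /^ *test: *$/) {
          in_test = 1
        } else if (job != "" && s ~ /^ *- as: /) {
          name = s; sub(/^ *- as: */, "", name); sub(/ +$/, "", name)
          if (in_scope) { if (ind <= job_indent) break }
          else if (name == job) { in_scope = 1; job_indent = ind }
        }
      }
      if (last == 0) exit 1
      pad = substr(line[last], 1, item_indent)    # reuse the existing indent
      for (i = 1; i <= NR; i++) {
        if (i == last) {
          print pad "- ref: wait"
          if (timeout != "") {
            print pad "  timeout: " timeout
            print pad "  best_effort: true"
          }
        }
        print line[i]
      }
    }
  ' "$file"
}
```

A caller might write the result to a temporary file and replace the original only on success, for example `insert_wait_step "$f" "$job_name" "8h0m0s" > "$f.tmp" && mv "$f.tmp" "$f"`.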
### Step 7: Push and Show GitHub Link
**Auto-push the branch**:
```bash
git push origin "${branch_name}"
```
**Display GitHub PR creation link**:
```
✅ Changes pushed successfully!
Create PR here:
https://github.com/openshift/release/compare/master...${branch_name}
Branch: ${branch_name}
Job: ${job_name}
OCP: ${ocp_version}
⚠️ Remember to close PR after debugging (DO NOT MERGE)
```
That's it! Simple and clean.
### Error Handling
**Error: Repository Not Found**
```
❌ Error: Repository not found at ${repo_path}
Please provide the correct path to openshift/release repository.
To clone:
git clone https://github.com/openshift/release.git
```
**Error: Not in openshift/release Repo**
```
❌ Error: This doesn't appear to be the openshift/release repository
Remote URL: ${current_remote}
Expected: github.com/openshift/release
Please navigate to the correct repository.
```
**Error: Workflow File Not Found**
```
❌ Error: Workflow file not found
Searched for: *${workflow_name}*workflow*.yaml
Location: ci-operator/step-registry/
Suggestions:
1. Verify the workflow name
2. Try a partial match
3. Search manually: find ci-operator/step-registry -name "*workflow*.yaml"
```
**Error: Wait Step Already Exists**
```
Wait step already configured in this workflow
No action needed - you can proceed with debugging using the existing wait step.
```
**Error: Invalid OCP Version**
```
❌ Invalid OCP version: ${version}
Valid versions: 4.18, 4.19, 4.20, 4.21, 4.22, master
Please provide a valid version.
```
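The check behind this message might be sketched as follows. `is_valid_ocp_version` is a hypothetical helper name, and accepting any `4.<minor>` value instead of only the versions listed in the message is an assumption made so the sketch does not need updating for each release.

```bash
# Hypothetical helper: succeeds for "master" or a 4.<minor> version string.
is_valid_ocp_version() {
  local re='^(4\.[0-9]+|master)$'
  [[ "$1" =~ $re ]]
}
```

For example, `is_valid_ocp_version "$ocp_version" || echo "Invalid OCP version: ${ocp_version}"`.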
**Error: Invalid Timeout Format**
```
❌ Invalid timeout format: ${timeout}
Valid format: Integer followed by 'h' (e.g., "1h", "2h", "8h", "24h", "72h")
Valid range: 1h to 72h
Examples:
- "1h" (1 hour)
- "8h" (8 hours)
- "24h" (24 hours)
- "72h" (72 hours, maximum)
Please provide a valid timeout in hours.
```
**Note: Timeout Normalization**
When a user provides a timeout like "8h", the implementation should normalize it to the standard Go duration format "8h0m0s" for consistency with existing configurations in the codebase.
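Validation and normalization can be handled together. The sketch below assumes the documented format (an integer followed by `h`) and range (1h to 72h). `normalize_timeout` is a hypothetical helper name, and rejecting leading zeros such as `08h` is an extra assumption, not a documented rule.

```bash
# Hypothetical helper: prints the Go-style duration for inputs such as "8h".
# Returns 1 when the input is malformed or outside the 1h to 72h range.
normalize_timeout() {
  local input="$1"
  local re='^([1-9][0-9]?)h$'
  # Accept only one or two digits (no leading zero) followed by a literal h.
  [[ "$input" =~ $re ]] || return 1
  local hours="${BASH_REMATCH[1]}"
  # The regex already guarantees hours >= 1; enforce the upper bound here.
  (( hours <= 72 )) || return 1
  printf '%sh0m0s\n' "$hours"
}
```

With this sketch, `normalize_timeout 8h` prints `8h0m0s`, while `normalize_timeout 73h` and `normalize_timeout 90m` both fail with status 1.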
## Return Value
- **Success**: GitHub link for creating the PR, plus debugging instructions
- **Error**: Error message with suggestions for resolution
- **Format**: Text output with emoji indicators for status
## Examples
### Example 1: Without Timeout (Default 3h)
```bash
/ci:add-debug-wait aws-ipi-f7-longduration-workload
```
Prompts for: OCP version (4.21), repo path
Result:
```yaml
test:
- ref: wait
- chain: openshift-e2e-test-qe
```
Returns: PR creation link
### Example 2: With Custom Timeout
```bash
/ci:add-debug-wait aws-ipi-f7-longduration-workload 8h
```
Prompts for: OCP version (4.21), repo path
Result:
```yaml
test:
- ref: wait
  timeout: 8h0m0s
  best_effort: true
- chain: openshift-e2e-test-qe
```
Returns: PR creation link with timeout info
### Example 3: Workflow File
```bash
/ci:add-debug-wait baremetalds-two-node-arbiter-upgrade 24h
```
Behavior: Searches job config first, falls back to workflow if not found. Warns that workflow changes affect ALL jobs using it.
Returns: PR creation link
## Arguments
- **$1** (workflow-or-job-name): The name of the CI workflow or job to add the wait step to (required)
- **$2** (timeout): Optional timeout in hours (1h-72h). Examples: "1h", "8h", "24h", "72h". If not provided, uses wait step's default (3h)
## Notes
### Best Practices for QE
**Before Running Command**:
- ✅ Confirm test is actually failing
- ✅ Check existing debug PRs
- ✅ Know which OCP version is affected
**During Debugging**:
- 📝 Take detailed notes
- 💾 Save logs and screenshots
- 🔍 Document root cause
- 📊 Record all findings
**After Debugging**:
- ✅ Document findings
- ✅ Close the debug PR
- ✅ Delete the branch
- ✅ Share learnings with team
- ✅ Create fix PR if needed
### Future Enhancements
Consider adding companion commands:
- `/ci:close-debug-pr` - Lists open debug PRs, prompts for findings, closes PR
- `/ci:list-debug-prs` - Show all open debug PRs
- `/ci:revert-debug-pr` - Revert a debug PR that was merged by mistake