Initial commit

Zhongwei Li
2025-11-30 08:45:38 +08:00
commit d09f471f51
14 changed files with 2241 additions and 0 deletions

commands/add-debug-wait.md Normal file

@@ -0,0 +1,623 @@
---
description: Add a wait step to a CI workflow for debugging test failures
argument-hint: <workflow-or-job-name> [timeout]
---
## Name
ci:add-debug-wait
## Synopsis
```
/ci:add-debug-wait <workflow-or-job-name> [timeout]
```
## Description
The `ci:add-debug-wait` command adds a `wait` step to a CI job/workflow for debugging test failures.
**What it does:**
1. Takes job name, OCP version, and optional timeout as input
2. Finds and edits the job config or workflow file
3. Adds `- ref: wait` before the last test step (with optional timeout configuration)
4. Commits and pushes the change
5. Gives you a GitHub link to create the PR
**That's it!** Simple, fast, and automated.
## Implementation
The command performs the following steps:
### Step 1: Gather Required Information
**Prompt user for** (in this order):
1. **Workflow/Job Name**: (from command argument $1 or prompt)
```
Workflow or job name: <user-input>
Example: aws-c2s-ipi-disc-priv-fips-f7
Example: baremetalds-two-node-arbiter-e2e-openshift-test-private-tests
```
2. **Timeout** (optional, from command argument $2):
```
Wait timeout in hours (optional, default: 3h):
Examples: "1h", "2h", "8h", "24h", "72h"
Valid range: 1h to 72h
```
- If not provided, uses the wait step's default behavior (3 hours)
- Format: Integer followed by 'h' (e.g., "1h", "2h", "8h")
- Valid range: 1h to 72h (maximum enforced by wait step's timeout setting)
- Will be normalized to Go duration format (e.g., "8h" → "8h0m0s")
- This will be set as the `timeout:` property on the wait step in the workflow/job YAML
3. **OCP Version**: (prompt - REQUIRED for searching job configs)
```
OCP version for debugging (e.g., 4.18, 4.19, 4.20, 4.21, 4.22):
```
This is used to:
- Search the correct job config file (e.g., release-4.21)
- Document which version needs debugging
- Add context to the PR
4. **OpenShift Release Repo Path**: (prompt if not in current directory)
```
Path to openshift/release repository:
Default: ~/repos/openshift-release
```
### Step 2: Validate Environment
**Silently validate** (no user prompts):
```bash
cd <repo-path>
# Check 1: Repository exists and is correct
git remote -v | grep -q "openshift/release" || exit 1
# Skip repo update - work with current state
# User can manually update their repo if needed
```
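The substring `grep` above also accepts remotes such as `openshift/release-tools`. A stricter check can be sketched as follows. The function names `is_release_remote` and `repo_has_release_remote` are hypothetical helpers introduced for illustration, and the sketch assumes the remote is hosted on github.com.

```bash
# is_release_remote <url>
# Succeeds only when the URL points at github.com/openshift/release
# (HTTPS or SSH form, with or without a trailing ".git").
is_release_remote() {
  local url="${1%.git}"
  url="${url%/}"
  case "$url" in
    *github.com[:/]openshift/release) return 0 ;;
    *) return 1 ;;
  esac
}

# repo_has_release_remote <repo-path>
# Succeeds when any configured remote of the repository passes the check above.
repo_has_release_remote() {
  local repo="$1" url
  for url in $(git -C "$repo" remote -v 2>/dev/null | awk '{print $2}'); do
    if is_release_remote "$url"; then
      return 0
    fi
  done
  return 1
}
```

A caller could run `repo_has_release_remote "$repo_path" || exit 1` in place of the plain `grep`.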
### Step 3: Search for Job/Test Configuration
**Priority 1: Search job configs first** (more specific and targeted):
```bash
cd <repo-path>
# Search for job config files matching the OCP version
# The job name could be in various config files, so search broadly
grep -r "as: ${job_name}" ci-operator/config/ --include="*release-${ocp_version}*.yaml" -l
```
**Example searches**:
- For `aws-c2s-ipi-disc-priv-fips-f7` and OCP 4.21:
```bash
grep -r "as: aws-c2s-ipi-disc-priv-fips-f7" ci-operator/config/ --include="*release-4.21*.yaml" -l
```
**Handle job config search results**:
- **1 file found**:
```
✅ Found job configuration:
${file_path}
Type: Job configuration file
Proceeding with job config modification...
```
→ Continue to **Step 4a: Analyze Job Configuration**
- **Multiple files found**:
```
Found ${count} matching job config files:
1. ci-operator/config/.../release-4.21__amd64-nightly.yaml
2. ci-operator/config/.../release-4.21__arm64-nightly.yaml
3. ci-operator/config/.../release-4.21__ppc64le-nightly.yaml
Select file (1-${count}) or 'q' to quit:
```
**Prompt user to select** which file to modify, then continue to **Step 4a: Analyze Job Configuration**
- **0 files found**:
```
No job config found for: ${job_name} (OCP ${ocp_version})
Searching for workflow files instead...
```
→ Continue to **Priority 2** below
**Priority 2: Search workflow files** (if job config not found):
```bash
cd <repo-path>
# Search for workflow files
find ci-operator/step-registry -type f -name "*${workflow_name}*workflow*.yaml"
```
**Handle workflow search results**:
- **0 files found**:
```
❌ No job config or workflow file found for: ${job_name}
Suggestions:
1. Check spelling of job/workflow name
2. Verify OCP version (${ocp_version})
3. Try with partial name
4. Search manually:
- Job configs: grep -r "as: ${job_name}" ci-operator/config/
- Workflows: find ci-operator/step-registry -name "*workflow*.yaml" | grep <partial-name>
```
- **1 file found**:
```
✅ Found workflow file:
${file_path}
Type: Workflow file
Proceeding with workflow modification...
```
→ Continue to **Step 4b: Analyze Workflow File**
- **Multiple files found**:
```
Found ${count} matching workflow files:
1. ci-operator/step-registry/.../workflow1.yaml
2. ci-operator/step-registry/.../workflow2.yaml
3. ci-operator/step-registry/.../workflow3.yaml
Select file (1-${count}) or 'q' to quit:
```
**Prompt user to select** which file to modify, then continue to **Step 4b: Analyze Workflow File**
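The two-priority search can be sketched as one helper that reports what kind of file it found. The function name `find_debug_target` and the `job-config:` / `workflow:` output prefixes are assumptions made for this example, not part of the command. Like the `grep` above, it matches the job name as a prefix, so the caller still has to handle multiple results.

```bash
# find_debug_target <repo-path> <job-or-workflow-name> <ocp-version>
# Prints one "job-config:<file>" or "workflow:<file>" line per match.
# Exit status: 0 when at least one file matched, 1 otherwise.
find_debug_target() {
  local repo="$1" name="$2" version="$3" matches

  # Priority 1: job configs for the requested OCP version.
  # -F treats the job name as a literal string rather than a regex.
  matches=$(grep -rlF --include="*release-${version}*.yaml" \
    -e "as: ${name}" "${repo}/ci-operator/config" 2>/dev/null | sort || true)
  if [ -n "$matches" ]; then
    printf '%s\n' "$matches" | sed 's/^/job-config:/'
    return 0
  fi

  # Priority 2: workflow files in the step registry.
  matches=$(find "${repo}/ci-operator/step-registry" -type f \
    -name "*${name}*workflow*.yaml" 2>/dev/null | sort || true)
  if [ -n "$matches" ]; then
    printf '%s\n' "$matches" | sed 's/^/workflow:/'
    return 0
  fi

  return 1
}
```

Counting the output lines (for example with `wc -l`) distinguishes the single-match case from the multiple-match case that needs a user prompt.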
### Step 4a: Analyze Job Configuration
**Read and parse the job config YAML**:
```bash
# Find the specific test definition
grep -A 30 "as: ${job_name}" <job-config-file>
```
**Check for**:
1. ✅ Has `steps:` section
2. ✅ Has `test:` section inside steps
3. ❌ Does NOT already have `- ref: wait`
**Example current structure**:
```yaml
- as: aws-c2s-ipi-disc-priv-fips-f7
  cron: 36 16 3,12,19,26 * *
  steps:
    cluster_profile: aws-c2s-qe
    env:
      BASE_DOMAIN: qe.devcluster.openshift.com
      FIPS_ENABLED: "true"
    test:
    - chain: openshift-e2e-test-qe
    workflow: cucushift-installer-rehearse-aws-c2s-ipi-disconnected-private
```
**If wait already exists**:
```
Wait step already configured in job config
Current test section:
test:
- ref: wait
- chain: openshift-e2e-test-qe
No changes needed. The job is already set up for debugging.
```
**If no test section found**:
```
Job config found but no test: section
This job uses only the workflow's test steps.
Searching for the workflow: ${workflow_name}
```
→ Fall back to the workflow search (Priority 2 in Step 3)
**If the `test:` section exists and has no wait step**:
→ Continue to **Step 5a: Modify Job Config File**
### Step 4b: Analyze Workflow File
**Read and parse the workflow YAML**:
```bash
cat <workflow-file>
```
**Check for**:
1. ✅ Has `workflow:` section
2. ✅ Has `test:` section
3. ❌ Does NOT already have `- ref: wait`
**Example current structure**:
```yaml
workflow:
  as: baremetalds-two-node-arbiter-upgrade
  steps:
    pre:
    - chain: baremetalds-ipi-pre
    test:
    - chain: baremetalds-ipi-test
    post:
    - chain: baremetalds-ipi-post
```
**If wait already exists**:
```
Wait step already configured in workflow
Current test section:
test:
- ref: wait
- chain: baremetalds-ipi-test
No changes needed. The workflow is already set up for debugging.
```
**If no test section exists**:
```
Workflow has no test: section
This workflow is provision/deprovision only.
The test steps must be defined in the job config.
Please provide the full job name to modify the job config instead.
```
→ Exit or prompt for job name
**If the `test:` section exists and has no wait step**:
→ Continue to **Step 5b: Modify Workflow File**
### Step 5a: Modify Job Config File
**Edit the job config file directly** - no confirmation needed:
```bash
# Add wait step before the last test step
# If timeout is provided, add it as a step property
# See Step 6 for the YAML modification algorithm
```
**Two scenarios**:
1. **Without custom timeout** (uses wait step's built-in default of 3h):
```yaml
test:
- ref: wait
- chain: openshift-e2e-test-qe
```
Note: No timeout or best_effort needed - the wait step will use its default TIMEOUT env var (3 hours)
2. **With custom timeout** (user provided timeout parameter):
```yaml
test:
- ref: wait
  timeout: 8h0m0s
  best_effort: true
- chain: openshift-e2e-test-qe
```
Note: `best_effort: true` is required when timeout is customized to prevent the wait step from failing the job if it times out
**Show brief confirmation**:
```
✅ Modified: ${job_name} (OCP ${ocp_version})
File: <job-config-file-path>
Added: - ref: wait${timeout:+ (timeout: ${timeout})}
```
### Step 5b: Modify Workflow File
**Edit the workflow file directly** - no confirmation needed:
```bash
# Add wait step before the last test step
# If timeout is provided, add it as a step property
# See Step 6 for the YAML modification algorithm
```
**Two scenarios**:
1. **Without custom timeout** (uses wait step's built-in default of 3h):
```yaml
test:
- ref: wait
- chain: baremetalds-ipi-test
```
Note: No timeout or best_effort needed - the wait step will use its default TIMEOUT env var (3 hours)
2. **With custom timeout** (user provided timeout parameter):
```yaml
test:
- ref: wait
  timeout: 8h0m0s
  best_effort: true
- chain: baremetalds-ipi-test
```
Note: `best_effort: true` is required when timeout is customized to prevent the wait step from failing the job if it times out
**Show brief confirmation**:
```
✅ Modified: ${workflow_name} workflow
File: <workflow-file-path>
Added: - ref: wait${timeout:+ (timeout: ${timeout})}
⚠️ Impact: Affects ALL jobs using this workflow
```
### Step 6: Create Branch and Commit
**Branch naming**:
```
debug-${workflow_name}-${ocp_version}-$(date +%Y%m%d)
```
Example: `debug-baremetalds-two-node-arbiter-4.21-20250131`
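The naming scheme can be wrapped in a small helper so the date is injectable, which makes the result reproducible. The function name `debug_branch_name` and its optional third argument are assumptions made for this sketch.

```bash
# debug_branch_name <workflow-or-job-name> <ocp-version> [YYYYMMDD]
# Prints the branch name; the date defaults to today when omitted.
debug_branch_name() {
  local name="$1" version="$2" stamp="${3:-$(date +%Y%m%d)}"
  printf 'debug-%s-%s-%s\n' "$name" "$version" "$stamp"
}
```

For example, `debug_branch_name baremetalds-two-node-arbiter 4.21 20250131` reproduces the branch name shown above.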
**Git operations**:
```bash
# Create branch
git checkout -b "${branch_name}"
# Modify the file (add the wait step using the algorithm below)
# Insert '- ref: wait' before the last step in the test: section
# Stage the modified job config or workflow file
git add <modified-file>
# Commit
git commit -m "[Debug] Add wait step to ${workflow_name} for OCP ${ocp_version}

This adds a wait step to enable debugging of test failures in OCP ${ocp_version}.
The wait step pauses the job before the last test step runs, allowing QE to:
- SSH into the test environment
- Inspect system state and logs
- Debug configuration issues
- Investigate test failures

OCP Version: ${ocp_version}
Workflow: ${workflow_name}"
```
**YAML Modification Algorithm**:
The modification process for both job configs and workflow files follows the same pattern:
1. **Locate the target**: Find the `test:` section
- For job configs: Within the specific job definition (`- as: ${job_name}`)
- For workflows: At the workflow level
2. **Find test steps**: Identify all steps (lines with `- ref:` or `- chain:`)
3. **Check for duplicates**: Ensure `- ref: wait` doesn't already exist
4. **Insert wait step**: Add before the **last** test step with matching indentation
5. **Handle timeout**:
- Without timeout: Add simple `- ref: wait`
- With timeout: Add as multi-line with `timeout` and `best_effort` properties
**Example transformation:**
Before:
```yaml
test:
- chain: openshift-e2e-test-qe
```
After (without timeout):
```yaml
test:
- ref: wait
- chain: openshift-e2e-test-qe
```
After (with timeout=8h):
```yaml
test:
- ref: wait
  timeout: 8h0m0s
  best_effort: true
- chain: openshift-e2e-test-qe
```
**Critical constraints:**
- Preserve exact YAML indentation (typically 2 spaces per level)
- Insert BEFORE the last step, not after
- When timeout is set, `best_effort: true` is required to prevent job failure
- Normalize timeout format to Go duration (e.g., "8h" → "8h0m0s")
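The algorithm above can be sketched as a two-pass `awk` script wrapped in a shell function. This is a minimal illustration under stated assumptions, not the command's actual implementation: the function name `insert_wait_step` is hypothetical, the job is identified by a `- as: <name>` line, steps before the `test:` section are `ref`/`chain` entries rather than inline `- as:` steps, and the timeout is already normalized. A YAML-aware tool would be more robust than line matching.

```bash
# insert_wait_step <yaml-file> <job-name|""> [go-duration]
# Prints the modified YAML on stdout; the input file is left untouched.
# Pass "" as the job name for a workflow file (no "- as:" job entries).
# Exit status: 0 = wait step inserted, 1 = nothing inserted
# (no test: section found, or "- ref: wait" already present).
insert_wait_step() {
  local file="$1" job="$2" timeout="${3:-}"
  awk -v job="$job" -v timeout="$timeout" '
    BEGIN { in_job = (job == ""); step_indent = -1 }

    # Pass 1 (first read): find the last step of the target test: section.
    NR == FNR {
      if ($0 ~ /^ *$/) next
      ind = 0
      while (substr($0, ind + 1, 1) == " ") ind++
      item = ($0 ~ /^ *- /)
      if (in_test) {
        if (item && step_indent < 0 && ind >= test_indent) {
          step_indent = ind
          pad = substr($0, 1, ind)        # reuse the exact indentation
        }
        if (item && ind == step_indent) {
          last = FNR
          if ($0 ~ /^ *- ref: wait *$/) has_wait = 1
          next
        }
        if (ind > test_indent) next       # property of the current step
        in_test = 0                       # sibling key or next job ends the section
      }
      if ($0 ~ /^ *- as: /) {
        name = $0
        sub(/^ *- as: */, "", name)
        sub(/ +$/, "", name)
        in_job = (name == job)
      } else if (in_job && !found && $0 ~ /^ *test: *$/) {
        in_test = 1; found = 1; test_indent = ind
      }
      next
    }

    # Pass 2 (second read): copy lines, emitting the wait step before the last step.
    FNR == last && !has_wait {
      print pad "- ref: wait"
      if (timeout != "") {
        print pad "  timeout: " timeout
        print pad "  best_effort: true"
      }
    }
    { print }
    END { exit (last > 0 && !has_wait) ? 0 : 1 }
  ' "$file" "$file"
}
```

Because the function writes to stdout, a caller can redirect the output to a temporary file and replace the original only when the exit status is 0.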
### Step 7: Push and Show GitHub Link
**Auto-push the branch**:
```bash
git push origin "${branch_name}"
```
**Display GitHub PR creation link**:
```
✅ Changes pushed successfully!
Create PR here:
https://github.com/openshift/release/compare/master...${branch_name}
Branch: ${branch_name}
Job: ${job_name}
OCP: ${ocp_version}
⚠️ Remember to close PR after debugging (DO NOT MERGE)
```
That's it! Simple and clean.
### Error Handling
**Error: Repository Not Found**
```
❌ Error: Repository not found at ${repo_path}
Please provide the correct path to openshift/release repository.
To clone:
git clone https://github.com/openshift/release.git
```
**Error: Not in openshift/release Repo**
```
❌ Error: This doesn't appear to be the openshift/release repository
Remote URL: ${current_remote}
Expected: github.com/openshift/release
Please navigate to the correct repository.
```
**Error: Workflow File Not Found**
```
❌ Error: Workflow file not found
Searched for: *${workflow_name}*workflow*.yaml
Location: ci-operator/step-registry/
Suggestions:
1. Verify the workflow name
2. Try a partial match
3. Search manually: find ci-operator/step-registry -name "*workflow*.yaml"
```
**Error: Wait Step Already Exists**
```
Wait step already configured in this workflow
No action needed - you can proceed with debugging using the existing wait step.
```
**Error: Invalid OCP Version**
```
❌ Invalid OCP version: ${version}
Valid versions: 4.18, 4.19, 4.20, 4.21, 4.22, master
Please provide a valid version.
```
**Error: Invalid Timeout Format**
```
❌ Invalid timeout format: ${timeout}
Valid format: Integer followed by 'h' (e.g., "1h", "2h", "8h", "24h", "72h")
Valid range: 1h to 72h
Examples:
- "1h" (1 hour)
- "8h" (8 hours)
- "24h" (24 hours)
- "72h" (72 hours, maximum)
Please provide a valid timeout in hours.
```
**Note: Timeout Normalization**
When a user provides a timeout like "8h", the implementation should normalize it to the standard Go duration format "8h0m0s" for consistency with existing configurations in the codebase.
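Validation and normalization can be combined in one step. The sketch below assumes the rules stated earlier (an integer followed by `h`, range 1h to 72h); the function name `normalize_timeout` is hypothetical, and rejecting leading zeros such as `08h` is a choice made for this example.

```bash
# normalize_timeout <timeout>
# Prints the Go-style duration (e.g. "8h" becomes "8h0m0s").
# Returns 1 when the input is not an integer number of hours in 1..72.
normalize_timeout() {
  local input="$1" hours
  case "$input" in
    *h) hours="${input%h}" ;;
    *)  return 1 ;;
  esac
  # Reject empty values, leading zeros, non-digits, and 3+ digit values.
  case "$hours" in
    ''|0*|*[!0-9]*|???*) return 1 ;;
  esac
  if [ "$hours" -gt 72 ]; then
    return 1
  fi
  printf '%sh0m0s\n' "$hours"
}
```

A caller might use `timeout=$(normalize_timeout "$2")` and print the invalid-format error above when the function returns non-zero.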
## Return Value
- **Success**: GitHub link for creating the PR, plus debugging instructions
- **Error**: Error message with suggestions for resolution
- **Format**: Text output with emoji indicators for status
## Examples
### Example 1: Without Timeout (Default 3h)
```bash
/ci:add-debug-wait aws-ipi-f7-longduration-workload
```
Prompts for: OCP version (4.21), repo path
Result:
```yaml
test:
- ref: wait
- chain: openshift-e2e-test-qe
```
Returns: PR creation link
### Example 2: With Custom Timeout
```bash
/ci:add-debug-wait aws-ipi-f7-longduration-workload 8h
```
Prompts for: OCP version (4.21), repo path
Result:
```yaml
test:
- ref: wait
  timeout: 8h0m0s
  best_effort: true
- chain: openshift-e2e-test-qe
```
Returns: PR creation link with timeout info
### Example 3: Workflow File
```bash
/ci:add-debug-wait baremetalds-two-node-arbiter-upgrade 24h
```
Behavior: Searches job config first, falls back to workflow if not found. Warns that workflow changes affect ALL jobs using it.
Returns: PR creation link
## Arguments
- **$1** (workflow-or-job-name): The name of the CI workflow or job to add the wait step to (required)
- **$2** (timeout): Optional timeout in hours (1h-72h). Examples: "1h", "8h", "24h", "72h". If not provided, uses wait step's default (3h)
## Notes
### Best Practices for QE
**Before Running Command**:
- ✅ Confirm test is actually failing
- ✅ Check existing debug PRs
- ✅ Know which OCP version is affected
**During Debugging**:
- 📝 Take detailed notes
- 💾 Save logs and screenshots
- 🔍 Document root cause
- 📊 Record all findings
**After Debugging**:
- ✅ Document findings
- ✅ Close the debug PR
- ✅ Delete the branch
- ✅ Share learnings with team
- ✅ Create fix PR if needed
### Future Enhancements
Consider adding companion commands:
- `/ci:close-debug-pr` - Lists open debug PRs, prompts for findings, closes PR
- `/ci:list-debug-prs` - Show all open debug PRs
- `/ci:revert-debug-pr` - Revert a debug PR that was merged by mistake