Initial commit

2025-11-30 08:45:38 +08:00
commit d09f471f51
14 changed files with 2241 additions and 0 deletions
--- a/commands/add-debug-wait.md
+++ b/commands/add-debug-wait.md
@@ -0,0 +1,623 @@
+---
+description: Add a wait step to a CI workflow for debugging test failures
+argument-hint: <workflow-or-job-name> [timeout]
+---
+
+## Name
+ci:add-debug-wait
+
+## Synopsis
+```
+/ci:add-debug-wait <workflow-or-job-name> [timeout]
+```
+
+## Description
+
+The `ci:add-debug-wait` command adds a `wait` step to a CI job/workflow for debugging test failures.
+
+**What it does:**
+1. Takes job name, OCP version, and optional timeout as input
+2. Finds and edits the job config or workflow file
+3. Adds `- ref: wait` before the last test step (with optional timeout configuration)
+4. Commits and pushes the change
+5. Gives you a GitHub link to create the PR
+
+**That's it!** Simple, fast, and automated.
+
+## Implementation
+
+The command performs the following steps:
+
+### Step 1: Gather Required Information
+
+**Prompt user for** (in this order):
+
+1. **Workflow/Job Name**: (from command argument $1 or prompt)
+   ```
+   Workflow or job name: <user-input>
+   Example: aws-c2s-ipi-disc-priv-fips-f7
+   Example: baremetalds-two-node-arbiter-e2e-openshift-test-private-tests
+   ```
+
+2. **Timeout** (optional, from command argument $2):
+   ```
+   Wait timeout in hours (optional, default: 3h):
+   Examples: "1h", "2h", "8h", "24h", "72h"
+   Valid range: 1h to 72h
+   ```
+   - If not provided, uses the wait step's default behavior (3 hours)
+   - Format: Integer followed by 'h' (e.g., "1h", "2h", "8h")
+   - Valid range: 1h to 72h (maximum enforced by wait step's timeout setting)
+   - Will be normalized to Go duration format (e.g., "8h" → "8h0m0s")
+   - This will be set as the `timeout:` property on the wait step in the workflow/job YAML
+
+3. **OCP Version**: (prompt - REQUIRED for searching job configs)
+   ```
+   OCP version for debugging (e.g., 4.18, 4.19, 4.20, 4.21, 4.22):
+   ```
+   This is used to:
+   - Search the correct job config file (e.g., release-4.21)
+   - Document which version needs debugging
+   - Add context to the PR
+
+4. **OpenShift Release Repo Path**: (prompt if not in current directory)
+   ```
+   Path to openshift/release repository:
+   Default: ~/repos/openshift-release
+   ```
+
+### Step 2: Validate Environment
+
+**Silently validate** (no user prompts):
+
+```bash
+cd <repo-path>
+
+# Check 1: Repository exists and is correct
+git remote -v | grep "openshift/release" || exit 1
+
+# Skip repo update - work with current state
+# User can manually update their repo if needed
+```
+
+### Step 3: Search for Job/Test Configuration
+
+**Priority 1: Search job configs first** (more specific and targeted):
+
+```bash
+cd <repo-path>
+
+# Search for job config files matching the OCP version
+# The job name could be in various config files, so search broadly
+grep -r "as: ${job_name}" ci-operator/config/ --include="*release-${ocp_version}*.yaml" -l
+```
+
+**Example searches**:
+- For `aws-c2s-ipi-disc-priv-fips-f7` and OCP 4.21:
+  ```bash
+  grep -r "as: aws-c2s-ipi-disc-priv-fips-f7" ci-operator/config/ --include="*release-4.21*.yaml" -l
+  ```
+
+**Handle job config search results**:
+
+- **1 file found**:
+  ```
+  ✅ Found job configuration:
+  ${file_path}
+
+  Type: Job configuration file
+
+  Proceeding with job config modification...
+  ```
+  → Continue to **Step 4a: Analyze Job Configuration**
+
+- **Multiple files found**:
+  ```
+  Found ${count} matching job config files:
+
+  1. ci-operator/config/.../release-4.21__amd64-nightly.yaml
+  2. ci-operator/config/.../release-4.21__arm64-nightly.yaml
+  3. ci-operator/config/.../release-4.21__ppc64le-nightly.yaml
+
+  Select file (1-${count}) or 'q' to quit:
+  ```
+
+  **Prompt user to select** which file to modify, then continue to **Step 4a: Analyze Job Configuration**
+
+- **0 files found**:
+  ```
+  ℹ️  No job config found for: ${job_name} (OCP ${ocp_version})
+
+  Searching for workflow files instead...
+  ```
+  → Continue to **Priority 2** below
+
+**Priority 2: Search workflow files** (if job config not found):
+
+```bash
+cd <repo-path>
+
+# Search for workflow files
+find ci-operator/step-registry -type f -name "*${workflow_name}*workflow*.yaml"
+```
+
+**Handle workflow search results**:
+
+- **0 files found**:
+  ```
+  ❌ No job config or workflow file found for: ${job_name}
+
+  Suggestions:
+  1. Check spelling of job/workflow name
+  2. Verify OCP version (${ocp_version})
+  3. Try with partial name
+  4. Search manually:
+     - Job configs: grep -r "as: ${job_name}" ci-operator/config/
+     - Workflows: find ci-operator/step-registry -name "*workflow*.yaml" | grep <partial-name>
+  ```
+
+- **1 file found**:
+  ```
+  ✅ Found workflow file:
+  ${file_path}
+
+  Type: Workflow file
+
+  Proceeding with workflow modification...
+  ```
+  → Continue to **Step 4b: Analyze Workflow File**
+
+- **Multiple files found**:
+  ```
+  Found ${count} matching workflow files:
+
+  1. ci-operator/step-registry/.../workflow1.yaml
+  2. ci-operator/step-registry/.../workflow2.yaml
+  3. ci-operator/step-registry/.../workflow3.yaml
+
+  Select file (1-${count}) or 'q' to quit:
+  ```
+
+  **Prompt user to select** which file to modify, then continue to **Step 4b: Analyze Workflow File**
+
+### Step 4a: Analyze Job Configuration
+
+**Read and parse the job config YAML**:
+
+```bash
+# Find the specific test definition
+grep -A 30 "as: ${job_name}" <job-config-file>
+```
+
+**Check for**:
+1. ✅ Has `steps:` section
+2. ✅ Has `test:` section inside steps
+3. ❌ Does NOT already have `- ref: wait`
+
+**Example current structure**:
+```yaml
+- as: aws-c2s-ipi-disc-priv-fips-f7
+  cron: 36 16 3,12,19,26 * *
+  steps:
+    cluster_profile: aws-c2s-qe
+    env:
+      BASE_DOMAIN: qe.devcluster.openshift.com
+      FIPS_ENABLED: "true"
+    test:
+    - chain: openshift-e2e-test-qe
+    workflow: cucushift-installer-rehearse-aws-c2s-ipi-disconnected-private
+```
+
+**If wait already exists**:
+```
+ℹ️  Wait step already configured in job config
+
+Current test section:
+  test:
+  - ref: wait
+  - chain: openshift-e2e-test-qe
+
+No changes needed. The job is already set up for debugging.
+```
+
+**If no test section found**:
+```
+ℹ️  Job config found but no test: section
+
+This job uses only the workflow's test steps.
+Searching for the workflow: ${workflow_name}
+```
+→ Fall back to searching for workflow (Priority 2 in Step 3)
+
+→ Continue to **Step 5a: Show Diff for Job Config**
+
+### Step 4b: Analyze Workflow File
+
+**Read and parse the workflow YAML**:
+
+```bash
+cat <workflow-file>
+```
+
+**Check for**:
+1. ✅ Has `workflow:` section
+2. ✅ Has `test:` section
+3. ❌ Does NOT already have `- ref: wait`
+
+**Example current structure**:
+```yaml
+workflow:
+  as: baremetalds-two-node-arbiter-upgrade
+  steps:
+    pre:
+      - chain: baremetalds-ipi-pre
+    test:
+      - chain: baremetalds-ipi-test
+    post:
+      - chain: baremetalds-ipi-post
+```
+
+**If wait already exists**:
+```
+ℹ️  Wait step already configured in workflow
+
+Current test section:
+  test:
+    - ref: wait
+    - chain: baremetalds-ipi-test
+
+No changes needed. The workflow is already set up for debugging.
+```
+
+**If no test section exists**:
+```
+ℹ️  Workflow has no test: section
+
+This workflow is provision/deprovision only.
+The test steps must be defined in the job config.
+
+Please provide the full job name to modify the job config instead.
+```
+→ Exit or prompt for job name
+
+→ Continue to **Step 5b: Modify Workflow File**
+
+### Step 5a: Modify Job Config File
+
+**Edit the job config file directly** - no confirmation needed:
+
+```bash
+# Add wait step before the last test step
+# If timeout is provided, add it as a step property
+# See Step 6 for the YAML modification algorithm
+```
+
+**Two scenarios**:
+
+1. **Without custom timeout** (uses wait step's built-in default of 3h):
+   ```yaml
+   test:
+   - ref: wait
+   - chain: openshift-e2e-test-qe
+   ```
+   Note: No timeout or best_effort needed - the wait step will use its default TIMEOUT env var (3 hours)
+
+2. **With custom timeout** (user provided timeout parameter):
+   ```yaml
+   test:
+   - ref: wait
+     timeout: 8h0m0s
+     best_effort: true
+   - chain: openshift-e2e-test-qe
+   ```
+   Note: `best_effort: true` is required when timeout is customized to prevent the wait step from failing the job if it times out
+
+**Show brief confirmation**:
+```
+✅ Modified: ${job_name} (OCP ${ocp_version})
+   File: <job-config-file-path>
+   Added: - ref: wait${timeout:+ (timeout: ${timeout})}
+```
+
+### Step 5b: Modify Workflow File
+
+**Edit the workflow file directly** - no confirmation needed:
+
+```bash
+# Add wait step before the last test step
+# If timeout is provided, add it as a step property
+# See Step 6 for the YAML modification algorithm
+```
+
+**Two scenarios**:
+
+1. **Without custom timeout** (uses wait step's built-in default of 3h):
+   ```yaml
+   test:
+   - ref: wait
+   - chain: baremetalds-ipi-test
+   ```
+   Note: No timeout or best_effort needed - the wait step will use its default TIMEOUT env var (3 hours)
+
+2. **With custom timeout** (user provided timeout parameter):
+   ```yaml
+   test:
+   - ref: wait
+     timeout: 8h0m0s
+     best_effort: true
+   - chain: baremetalds-ipi-test
+   ```
+   Note: `best_effort: true` is required when timeout is customized to prevent the wait step from failing the job if it times out
+
+**Show brief confirmation**:
+```
+✅ Modified: ${workflow_name} workflow
+   File: <workflow-file-path>
+   Added: - ref: wait${timeout:+ (timeout: ${timeout})}
+   ⚠️  Impact: Affects ALL jobs using this workflow
+```
+
+### Step 6: Create Branch and Commit
+
+**Branch naming**:
+```
+debug-${workflow_name}-${ocp_version}-$(date +%Y%m%d)
+```
+
+Example: `debug-baremetalds-two-node-arbiter-4.21-20250131`
+
+**Git operations**:
+```bash
+# Create branch
+git checkout -b "${branch_name}"
+
+# Modify the file (add wait step using the implementation below)
+# Add '- ref: wait' as the first step in the test: section
+
+# Stage change
+git add <workflow-file>
+
+# Commit
+git commit -m "[Debug] Add wait step to ${workflow_name} for OCP ${ocp_version}
+
+This adds a wait step to enable debugging of test failures in OCP ${ocp_version}.
+
+The wait step pauses the workflow before tests run, allowing QE to:
+- SSH into the test environment
+- Inspect system state and logs
+- Debug configuration issues
+- Investigate test failures
+
+OCP Version: ${ocp_version}
+Workflow: ${workflow_name}"
+```
+
+**YAML Modification Algorithm**:
+
+The modification process for both job configs and workflow files follows the same pattern:
+
+1. **Locate the target**: Find the `test:` section
+   - For job configs: Within the specific job definition (`- as: ${job_name}`)
+   - For workflows: At the workflow level
+
+2. **Find test steps**: Identify all steps (lines with `- ref:` or `- chain:`)
+
+3. **Check for duplicates**: Ensure `- ref: wait` doesn't already exist
+
+4. **Insert wait step**: Add before the **last** test step with matching indentation
+
+5. **Handle timeout**:
+   - Without timeout: Add simple `- ref: wait`
+   - With timeout: Add as multi-line with `timeout` and `best_effort` properties
+
+**Example transformation:**
+
+Before:
+```yaml
+test:
+- chain: openshift-e2e-test-qe
+```
+
+After (without timeout):
+```yaml
+test:
+- ref: wait
+- chain: openshift-e2e-test-qe
+```
+
+After (with timeout=8h):
+```yaml
+test:
+- ref: wait
+  timeout: 8h0m0s
+  best_effort: true
+- chain: openshift-e2e-test-qe
+```
+
+**Critical constraints:**
+- Preserve exact YAML indentation (typically 2 spaces per level)
+- Insert BEFORE the last step, not after
+- When timeout is set, `best_effort: true` is required to prevent job failure
+- Normalize timeout format to Go duration (e.g., "8h" → "8h0m0s")
+
+### Step 7: Push and Show GitHub Link
+
+**Auto-push the branch**:
+```bash
+git push origin "${branch_name}"
+```
+
+**Display GitHub PR creation link**:
+```
+✅ Changes pushed successfully!
+
+Create PR here:
+https://github.com/openshift/release/compare/master...${branch_name}
+
+Branch: ${branch_name}
+Job: ${job_name}
+OCP: ${ocp_version}
+
+⚠️  Remember to close PR after debugging (DO NOT MERGE)
+```
+
+That's it! Simple and clean.
+
+### Error Handling
+
+**Error: Repository Not Found**
+```
+❌ Error: Repository not found at ${repo_path}
+
+Please provide the correct path to openshift/release repository.
+
+To clone:
+git clone https://github.com/openshift/release.git
+```
+
+**Error: Not in openshift/release Repo**
+```
+❌ Error: This doesn't appear to be the openshift/release repository
+
+Remote URL: ${current_remote}
+Expected: github.com/openshift/release
+
+Please navigate to the correct repository.
+```
+
+**Error: Workflow File Not Found**
+```
+❌ Error: Workflow file not found
+
+Searched for: *${workflow_name}*workflow*.yaml
+Location: ci-operator/step-registry/
+
+Suggestions:
+1. Verify the workflow name
+2. Try a partial match
+3. Search manually: find ci-operator/step-registry -name "*workflow*.yaml"
+```
+
+**Error: Wait Step Already Exists**
+```
+ℹ️  Wait step already configured in this workflow
+
+No action needed - you can proceed with debugging using the existing wait step.
+```
+
+**Error: Invalid OCP Version**
+```
+❌ Invalid OCP version: ${version}
+
+Valid versions: 4.18, 4.19, 4.20, 4.21, 4.22, master
+
+Please provide a valid version.
+```
+
+### Error: Invalid Timeout Format
+```
+❌ Invalid timeout format: ${timeout}
+
+Valid format: Integer followed by 'h' (e.g., "1h", "2h", "8h", "24h", "72h")
+Valid range: 1h to 72h
+
+Examples:
+- "1h" (1 hour)
+- "8h" (8 hours)
+- "24h" (24 hours)
+- "72h" (72 hours, maximum)
+
+Please provide a valid timeout in hours.
+```
+
+### Note: Timeout Normalization
+
+When a user provides a timeout like "8h", the implementation should normalize it to the standard Go duration format "8h0m0s" for consistency with existing configurations in the codebase.
+
+## Return Value
+
+- **Success**: PR URL and debugging instructions
+- **Error**: Error message with suggestions for resolution
+- **Format**: Text output with emoji indicators for status
+
+## Examples
+
+### Example 1: Without Timeout (Default 3h)
+
+```bash
+/ci:add-debug-wait aws-ipi-f7-longduration-workload
+```
+
+Prompts for: OCP version (4.21), repo path
+
+Result:
+```yaml
+test:
+- ref: wait
+- chain: openshift-e2e-test-qe
+```
+
+Returns: PR creation link
+
+### Example 2: With Custom Timeout
+
+```bash
+/ci:add-debug-wait aws-ipi-f7-longduration-workload 8h
+```
+
+Prompts for: OCP version (4.21), repo path
+
+Result:
+```yaml
+test:
+- ref: wait
+  timeout: 8h0m0s
+  best_effort: true
+- chain: openshift-e2e-test-qe
+```
+
+Returns: PR creation link with timeout info
+
+### Example 3: Workflow File
+
+```bash
+/ci:add-debug-wait baremetalds-two-node-arbiter-upgrade 24h
+```
+
+Behavior: Searches job config first, falls back to workflow if not found. Warns that workflow changes affect ALL jobs using it.
+
+Returns: PR creation link
+
+## Arguments
+
+- **$1** (workflow-or-job-name): The name of the CI workflow or job to add the wait step to (required)
+- **$2** (timeout): Optional timeout in hours (1h-72h). Examples: "1h", "8h", "24h", "72h". If not provided, uses wait step's default (3h)
+
+## Notes
+
+### Best Practices for QE
+
+**Before Running Command**:
+- ✅ Confirm test is actually failing
+- ✅ Check existing debug PRs
+- ✅ Know which OCP version is affected
+
+**During Debugging**:
+- 📝 Take detailed notes
+- 💾 Save logs and screenshots
+- 🔍 Document root cause
+- 📊 Record all findings
+
+**After Debugging**:
+- ✅ Document findings
+- ✅ Close the debug PR
+- ✅ Delete the branch
+- ✅ Share learnings with team
+- ✅ Create fix PR if needed
+
+### Future Enhancements
+
+Consider adding companion commands:
+- `/ci:close-debug-pr` - Lists open debug PRs, prompts for findings, closes PR
+- `/ci:list-debug-prs` - Show all open debug PRs
+- `/ci:revert-debug-pr` - Revert a debug PR that was merged by mistake