gh-openshift-eng-ai-helpers…/commands/debug.md
2025-11-30 08:46:11 +08:00


---
description: Debug OLM issues using must-gather logs and source code analysis
argument-hint: <issue-description> <must-gather-path> [olm-version]
---
## Name
olm:debug
## Synopsis
```
/olm:debug <issue-description> <must-gather-path> [olm-version]
```
## Description
The `olm:debug` command analyzes OLM (Operator Lifecycle Manager) issues by correlating must-gather logs with the matching OLM source code. It determines the OCP version from the must-gather data, checks out the corresponding release branch of the relevant OLM repositories, queries Jira for known bugs in the OCPBUGS project (OLM component), and writes an analysis report with log excerpts, code references, and candidate bug matches.
## Arguments
- **$1** (required): Issue description - A brief description of the OLM issue being investigated
- **$2** (required): Must-gather path - Absolute or relative path to the must-gather log directory
- **$3** (optional): OLM version - Either `olmv0` (default) or `olmv1`
  - `olmv0`: Uses operator-framework-olm repository
  - `olmv1`: Uses operator-framework-operator-controller and cluster-olm-operator repositories
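
As a rough illustration of the argument rules above, the following Python sketch applies the documented default and maps each OLM version to its repositories. The names `parse_args` and `REPOS_BY_OLM_VERSION` are hypothetical and not part of the command; checking that the path exists is left to Phase 1.

```python
from pathlib import Path

# Repository names come from the Arguments section; the constant and
# function names are hypothetical, chosen only for this sketch.
REPOS_BY_OLM_VERSION = {
    "olmv0": ["operator-framework-olm"],
    "olmv1": ["operator-framework-operator-controller", "cluster-olm-operator"],
}


def parse_args(argv):
    """Return (issue, must_gather_path, olm_version) or raise ValueError."""
    if len(argv) < 2 or not argv[0].strip():
        raise ValueError(
            "usage: /olm:debug <issue-description> <must-gather-path> [olm-version]"
        )
    issue = argv[0].strip()
    must_gather = Path(argv[1]).expanduser()
    # The third argument is optional and defaults to olmv0.
    olm_version = argv[2] if len(argv) > 2 else "olmv0"
    if olm_version not in REPOS_BY_OLM_VERSION:
        raise ValueError(f"olm-version must be one of {sorted(REPOS_BY_OLM_VERSION)}")
    return issue, must_gather, olm_version
```

Rejecting an unknown third argument early avoids cloning the wrong repositories later in the run.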
## Implementation
### Phase 1: Environment Setup and Validation
1. **Validate arguments**
   - Check that issue description is provided
   - Verify must-gather path exists and is accessible
   - Set OLM version to `olmv0` if not specified
2. **Parse must-gather logs to determine OCP version**
   - Look for version information in must-gather logs
   - Common locations:
     - `cluster-scoped-resources/core/nodes/*.yaml` - check node annotations
     - `cluster-scoped-resources/config.openshift.io/clusterversions/*.yaml` - check `status.desired.version`
   - Extract the OCP minor version (e.g., `4.14`, `4.15`, `4.16`)
   - Determine corresponding branch name (e.g., `release-4.14`)
3. **Create working directory**
   - Use `.work/olm-debug/<timestamp>/` for temporary files
   - Create subdirectories: `repos/`, `analysis/`, `logs/`
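
The branch derivation and working-directory layout from steps 2 and 3 could be sketched as follows. The function names and the timestamp format are assumptions made for illustration; the document only specifies the `release-<ocp-version>` pattern and the three subdirectories.

```python
import re
import time
from pathlib import Path


def release_branch(ocp_version):
    """Map an OCP version string such as '4.14.12' to 'release-4.14'."""
    match = re.match(r"^(\d+)\.(\d+)", ocp_version.strip())
    if match is None:
        raise ValueError(f"unrecognized OCP version: {ocp_version!r}")
    return f"release-{match.group(1)}.{match.group(2)}"


def make_work_dir(base=".work/olm-debug", timestamp=None):
    """Create <base>/<timestamp>/ with repos/, analysis/ and logs/ inside it."""
    # The timestamp format is an assumption; any unique, sortable value works.
    stamp = timestamp or time.strftime("%Y%m%d-%H%M%S")
    root = Path(base) / stamp
    for name in ("repos", "analysis", "logs"):
        (root / name).mkdir(parents=True, exist_ok=True)
    return root
```

Only the major and minor components are kept because release branches are cut per minor version, so patch-level versions such as `4.14.12` map to the same branch.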
### Phase 2: Repository Setup
4. **Clone appropriate repositories based on OLM version**

   **For olmv0:**
   - Clone `https://github.com/openshift/operator-framework-olm.git`
   - Checkout branch `release-<ocp-version>` (e.g., `release-4.14`)
   - If branch doesn't exist, try `main` or `master` branch

   **For olmv1:**
   - Clone `https://github.com/openshift/operator-framework-operator-controller.git`
   - Clone `https://github.com/openshift/cluster-olm-operator.git`
   - For each repo, checkout branch `release-<ocp-version>`
   - If branch doesn't exist, try `main` or `master` branch
5. **Verify repository setup**
   - Confirm branches are checked out successfully
   - List key directories to understand codebase structure
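
One way to express the clone-with-fallback behaviour is shown below. This is a sketch under assumptions: the helper names are hypothetical, and the shallow clone (`--depth 1`) is a choice made here to keep downloads small, not something the command prescribes. The `run` parameter exists so the logic can be exercised without network access.

```python
import subprocess


def branch_candidates(release):
    """Release branch first, then the documented fallbacks."""
    return [release, "main", "master"]


def clone_with_fallback(url, dest, release, run=subprocess.run):
    """Try each candidate branch in order; return the branch that was cloned."""
    for branch in branch_candidates(release):
        result = run(
            ["git", "clone", "--depth", "1", "--branch", branch, url, dest],
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            return branch
    raise RuntimeError(
        f"could not clone {url} on any of {branch_candidates(release)}"
    )
```

A caller can compare the returned branch with the requested release branch and warn about a version mismatch when a fallback was used.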
### Phase 3: Log Analysis
6. **Extract relevant OLM logs from must-gather**
   - For olmv0, look for:
     - `namespaces/openshift-operator-lifecycle-manager/` logs
     - OLM operator logs: `pods/catalog-operator-*/`, `pods/olm-operator-*/`
     - CSV (ClusterServiceVersion) resources
     - Subscription resources
     - InstallPlan resources
   - For olmv1, look for:
     - `namespaces/openshift-operator-controller/` logs
     - Operator controller logs
     - ClusterExtension resources
     - Catalog resources (named `ClusterCatalog` in newer releases)
7. **Identify error patterns and relevant logs**
   - Search for ERROR, WARN, FATAL level logs
   - Extract stack traces
   - Identify failed reconciliations
   - Note timestamps of issues
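
Step 7 can be approximated with a small line filter. The sketch below assumes two common shapes, a `level=error` key-value pair and a bare uppercase `ERROR`/`WARN`/`FATAL` token, plus an RFC 3339-style timestamp at the start of the line; actual OLM log formats may differ, so treat the patterns as a starting point.

```python
import re

# Assumed formats: 'level=error' style pairs, or bare uppercase level tokens.
LEVEL_RE = re.compile(
    r"level=(error|warn|warning|fatal)\b|\b(ERROR|WARN|WARNING|FATAL)\b"
)
# Assumed RFC 3339-like timestamp prefix at the start of each line.
TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?)")


def find_problem_lines(lines):
    """Return (timestamp or None, level, text) for each warning or error line."""
    hits = []
    for line in lines:
        match = LEVEL_RE.search(line)
        if match is None:
            continue
        level = (match.group(1) or match.group(2)).lower()
        if level == "warning":
            level = "warn"
        stamp = TIMESTAMP_RE.match(line)
        hits.append((stamp.group(1) if stamp else None, level, line.rstrip("\n")))
    return hits
```

Matching on the level field rather than on the word "error" anywhere in the line avoids flagging informational messages that merely mention errors.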
### Phase 4: Known Bug Search in Jira
8. **Query Jira for known OLM bugs**
   - Search OCPBUGS project with component "olm"
   - Use Jira REST API or web scraping to fetch bugs
   - Query parameters:
     - Project: `OCPBUGS`
     - Component: `olm`
     - Affects Version: Matches the OCP version (e.g., `4.14.0`, `4.15.0`)
     - Status: Open, In Progress, or Recently Resolved
   - API endpoint example (URL-encode the JQL before sending; version fields accept `=` and `in`, not the `~` text operator):
     ```
     https://issues.redhat.com/rest/api/2/search?jql=project=OCPBUGS AND component=olm AND affectedVersion="4.14.0"
     ```
9. **Match errors with known bugs**
   - Extract error messages and keywords from logs
   - Search for matching patterns in Jira bug summaries and descriptions
   - Look for similar symptoms in bug reports
   - Identify potential matches based on:
     - Error message similarity
     - Affected OCP version
     - Component affected (catalog-operator, olm-operator, etc.)
     - Symptom descriptions
10. **Categorize and prioritize matches**
    - High priority: Exact error message match with same OCP version
    - Medium priority: Similar symptoms with same component
    - Low priority: Related issues in same version range
    - Note bugs that have patches or workarounds available
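
The query construction and the priority tiers in this phase might look like the sketch below. It uses `=` on `affectedVersion` because JQL version fields do not accept the `~` operator. The bug dictionary keys (`text`, `versions`, `component`) and both function names are hypothetical, and the Medium tier is simplified to a component match since "similar symptoms" needs fuzzier comparison than shown here.

```python
from urllib.parse import urlencode

JIRA_SEARCH_URL = "https://issues.redhat.com/rest/api/2/search"


def build_search_url(affected_version, max_results=50):
    """Return a URL-encoded Jira search URL for OLM bugs in OCPBUGS."""
    jql = (
        "project = OCPBUGS AND component = olm "
        f'AND affectedVersion = "{affected_version}"'
    )
    return f"{JIRA_SEARCH_URL}?{urlencode({'jql': jql, 'maxResults': max_results})}"


def minor_of(version):
    """Reduce '4.14.0' to '4.14' so patch levels compare equal."""
    return ".".join(version.split(".")[:2])


def match_priority(error_message, ocp_version, component, bug):
    """Classify one bug dict (hypothetical keys: text, versions, component)."""
    same_version = minor_of(ocp_version) in {minor_of(v) for v in bug["versions"]}
    if same_version and error_message.lower() in bug["text"].lower():
        return "High"      # error text found verbatim, same OCP minor version
    if component == bug["component"]:
        return "Medium"    # simplification: same component stands in for similar symptoms
    if same_version:
        return "Low"       # only the version range lines up
    return None
```

Bugs that return `None` can be dropped before the report is written, which keeps `known-bugs.md` focused on plausible matches.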
### Phase 5: Code Correlation
11. **Map errors to source code**
    - Search cloned repositories for:
      - Error messages found in logs
      - Function names from stack traces
      - Related controllers and reconcilers
    - Use grep/ripgrep to find relevant code sections
12. **Analyze relevant code sections**
    - Read the source code around identified errors
    - Understand the reconciliation logic
    - Identify potential root causes
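
Where grep or ripgrep is not available, the search in step 11 can be approximated in Python. This sketch assumes Go sources are the target and skips `.git` and `vendor` directories to reduce duplicate hits; both choices, and the name `search_sources`, are assumptions rather than part of the command.

```python
import os


def search_sources(root, needle, suffixes=(".go",)):
    """Return sorted (relative_path, line_number, line_text) hits for needle."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in (".git", "vendor")]
        for name in filenames:
            if not name.endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for lineno, line in enumerate(handle, start=1):
                    if needle in line:
                        hits.append(
                            (os.path.relpath(path, root), lineno, line.strip())
                        )
    return sorted(hits)
```

The returned path and line number pairs are the same information `code-references.md` is meant to record.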
### Phase 6: Analysis and Recommendations
13. **Generate detailed analysis report**
    - Summary of the issue
    - OCP and OLM version information
    - Timeline of events from logs
    - Known bugs section with Jira links
    - Relevant code sections with explanations
    - Potential root causes
    - Recommended debugging steps
    - Suggested fixes or workarounds
14. **Create output files**
    - `analysis.md`: Detailed analysis report
    - `relevant-logs.txt`: Extracted relevant log entries
    - `code-references.md`: Links to relevant source code sections with line numbers
    - `known-bugs.md`: List of potentially related Jira bugs with match confidence
### Error Handling
- **Must-gather path not found**: Provide clear error message with expected path format
- **Unable to determine OCP version**: Ask user to provide OCP version manually
- **Repository clone failures**: Check network connectivity, provide manual clone instructions
- **Branch not found**: Fall back to main/master branch and warn user about version mismatch
- **No relevant logs found**: Provide guidance on what logs to look for manually
- **Jira access failures**: Continue with analysis if Jira is unavailable; note in report that known bug search was skipped
- **Jira authentication required**: Provide instructions for setting up Jira credentials if needed
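
The two Jira rules above (continue without Jira, and point the user at credentials when authentication is required) could be handled by a thin wrapper like this one. It assumes the lookup is done with `urllib`, so failures surface as `urllib.error.HTTPError` or another `OSError`; the wrapper name and the note wording are hypothetical.

```python
import urllib.error


def search_known_bugs(fetch):
    """Call fetch() and degrade gracefully; return (bugs, note_for_report)."""
    try:
        return fetch(), None
    except urllib.error.HTTPError as exc:
        if exc.code in (401, 403):
            return [], "Known bug search skipped: Jira authentication required."
        return [], f"Known bug search skipped: Jira returned HTTP {exc.code}."
    except OSError as exc:
        # URLError, timeouts and DNS failures are all OSError subclasses.
        return [], f"Known bug search skipped: Jira unavailable ({exc})."
```

The returned note can be copied into the analysis report so readers know the known-bug section is incomplete rather than empty.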
## Return Value
The command generates the following outputs in `.work/olm-debug/<timestamp>/`:
- **analysis.md**: Comprehensive analysis report including:
  - Issue summary
  - Version information (OCP, OLM)
  - Log analysis with timeline
  - Known bugs section with links to matching Jira issues
  - Code correlation and root cause analysis
  - Recommendations
- **relevant-logs.txt**: Extracted relevant log entries from must-gather
- **code-references.md**: Links to relevant source code files with line numbers
- **known-bugs.md**: List of potentially related Jira bugs including:
  - Bug ID and link (e.g., OCPBUGS-12345)
  - Bug summary and status
  - Match confidence (High/Medium/Low)
  - Affected versions
  - Available workarounds or patches
- **repos/**: Cloned repository directories for further manual investigation
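
To make the `known-bugs.md` contents concrete, here is one possible rendering. The Markdown layout and the dictionary keys (`key`, `summary`, `status`, `confidence`, `versions`, `workaround`) are assumptions for illustration; the command does not define an exact file format.

```python
CONFIDENCE_ORDER = {"High": 0, "Medium": 1, "Low": 2}


def render_known_bugs(bugs):
    """Render bug dicts (hypothetical keys) as Markdown, highest confidence first."""
    lines = ["# Known Bugs", ""]
    for bug in sorted(bugs, key=lambda b: CONFIDENCE_ORDER[b["confidence"]]):
        key = bug["key"]
        lines += [
            f"## [{key}](https://issues.redhat.com/browse/{key})",
            f"- Summary: {bug['summary']}",
            f"- Status: {bug['status']}",
            f"- Match confidence: {bug['confidence']}",
            f"- Affected versions: {', '.join(bug['versions'])}",
            f"- Workaround or patch: {bug.get('workaround') or 'none noted'}",
            "",
        ]
    return "\n".join(lines)
```

Sorting by confidence puts the most likely matches at the top, which is where a reader verifying the suggestions will start.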
## Examples
1. **Basic usage with olmv0 (default)**:
```
/olm:debug "CSV stuck in pending state" /path/to/must-gather
```
2. **Debug olmv1 issue**:
```
/olm:debug "ClusterExtension installation failing" /path/to/must-gather olmv1
```
3. **Debug with detailed issue description**:
```
/olm:debug "Operator upgrade from v1.0 to v2.0 fails with dependency resolution error" ~/Downloads/must-gather.local.123456 olmv0
```
## Notes
- The command requires `git` to be installed for cloning repositories
- Network access is required to clone from GitHub and access Jira
- Large must-gather archives may take time to process
- The analysis is based on pattern matching and may require manual verification
- For private repositories, ensure GitHub credentials are configured
- Jira access to https://issues.redhat.com/ may require authentication for full access
- Known bug matching is based on text similarity and may produce false positives
- Always verify suggested bug matches by reading the full bug description
## See Also
- OLM Documentation: https://olm.operatorframework.io/
- OpenShift OLM: https://docs.openshift.com/container-platform/latest/operators/understanding/olm/olm-understanding-olm.html
- Must-gather documentation: https://docs.openshift.com/container-platform/latest/support/gathering-cluster-data.html
- OCPBUGS Jira Project: https://issues.redhat.com/projects/OCPBUGS/
- Jira REST API: https://docs.atlassian.com/jira-software/REST/latest/