--- description: Debug OLM issues using must-gather logs and source code analysis argument-hint: [olm-version] --- ## Name olm:debug ## Synopsis ``` /olm:debug [olm-version] ``` ## Description The `olm:debug` command analyzes OLM (Operator Lifecycle Manager) issues by correlating must-gather logs with the appropriate OLM source code. It automatically determines the OCP version from the must-gather logs, checks out the corresponding branch from the relevant OLM repositories, queries Jira for known bugs in the OCPBUGS project (OLM component), and provides detailed analysis and debugging insights. ## Arguments - **$1** (required): Issue description - A brief description of the OLM issue being investigated - **$2** (required): Must-gather path - Absolute or relative path to the must-gather log directory - **$3** (optional): OLM version - Either `olmv0` (default) or `olmv1` - `olmv0`: Uses operator-framework-olm repository - `olmv1`: Uses operator-framework-operator-controller and cluster-olm-operator repositories ## Implementation ### Phase 1: Environment Setup and Validation 1. **Validate arguments** - Check that issue description is provided - Verify must-gather path exists and is accessible - Set OLM version to `olmv0` if not specified 2. **Parse must-gather logs to determine OCP version** - Look for version information in must-gather logs - Common locations: - `cluster-scoped-resources/core/nodes/*.yaml` - check node annotations - `cluster-scoped-resources/config.openshift.io/clusterversions/*.yaml` - Extract OCP version (e.g., `4.14`, `4.15`, `4.16`) - Determine corresponding branch name (e.g., `release-4.14`) 3. **Create working directory** - Use `.work/olm-debug//` for temporary files - Create subdirectories: `repos/`, `analysis/`, `logs/` ### Phase 2: Repository Setup 4. **Clone appropriate repositories based on OLM version** **For olmv0:** - Clone `https://github.com/openshift/operator-framework-olm.git` - Checkout branch `release-` (e.g., `release-4.14`) - If branch doesn't exist, try `main` or `master` branch **For olmv1:** - Clone `https://github.com/openshift/operator-framework-operator-controller.git` - Clone `https://github.com/openshift/cluster-olm-operator.git` - For each repo, checkout branch `release-` - If branch doesn't exist, try `main` or `master` branch 5. **Verify repository setup** - Confirm branches are checked out successfully - List key directories to understand codebase structure ### Phase 3: Log Analysis 6. **Extract relevant OLM logs from must-gather** - For olmv0, look for: - `namespaces/openshift-operator-lifecycle-manager/` logs - OLM operator logs: `pods/catalog-operator-*/`, `pods/olm-operator-*/` - CSV (ClusterServiceVersion) resources - Subscription resources - InstallPlan resources - For olmv1, look for: - `namespaces/openshift-operator-controller/` logs - Operator controller logs - ClusterExtension resources - Catalog resources 7. **Identify error patterns and relevant logs** - Search for ERROR, WARN, FATAL level logs - Extract stack traces - Identify failed reconciliations - Note timestamps of issues ### Phase 4: Known Bug Search in Jira 8. **Query Jira for known OLM bugs** - Search OCPBUGS project with component "olm" - Use Jira REST API or web scraping to fetch bugs - Query parameters: - Project: `OCPBUGS` - Component: `olm` - Affects Version: Matches the OCP version (e.g., `4.14.0`, `4.15.0`) - Status: Open, In Progress, or Recently Resolved - API endpoint example: ``` https://issues.redhat.com/rest/api/2/search?jql=project=OCPBUGS AND component=olm AND affectedVersion~"4.14" ``` 9. **Match errors with known bugs** - Extract error messages and keywords from logs - Search for matching patterns in Jira bug summaries and descriptions - Look for similar symptoms in bug reports - Identify potential matches based on: - Error message similarity - Affected OCP version - Component affected (catalog-operator, olm-operator, etc.) - Symptom descriptions 10. **Categorize and prioritize matches** - High priority: Exact error message match with same OCP version - Medium priority: Similar symptoms with same component - Low priority: Related issues in same version range - Note bugs that have patches or workarounds available ### Phase 5: Code Correlation 11. **Map errors to source code** - Search cloned repositories for: - Error messages found in logs - Function names from stack traces - Related controllers and reconcilers - Use grep/ripgrep to find relevant code sections 12. **Analyze relevant code sections** - Read the source code around identified errors - Understand the reconciliation logic - Identify potential root causes ### Phase 6: Analysis and Recommendations 13. **Generate detailed analysis report** - Summary of the issue - OCP and OLM version information - Timeline of events from logs - Known bugs section with Jira links - Relevant code sections with explanations - Potential root causes - Recommended debugging steps - Suggested fixes or workarounds 14. **Create output files** - `analysis.md`: Detailed analysis report - `relevant-logs.txt`: Extracted relevant log entries - `code-references.md`: Links to relevant source code sections with line numbers - `known-bugs.md`: List of potentially related Jira bugs with match confidence ### Error Handling - **Must-gather path not found**: Provide clear error message with expected path format - **Unable to determine OCP version**: Ask user to provide OCP version manually - **Repository clone failures**: Check network connectivity, provide manual clone instructions - **Branch not found**: Fall back to main/master branch and warn user about version mismatch - **No relevant logs found**: Provide guidance on what logs to look for manually - **Jira access failures**: Continue with analysis if Jira is unavailable; note in report that known bug search was skipped - **Jira authentication required**: Provide instructions for setting up Jira credentials if needed ## Return Value The command generates the following outputs in `.work/olm-debug//`: - **analysis.md**: Comprehensive analysis report including: - Issue summary - Version information (OCP, OLM) - Log analysis with timeline - Known bugs section with links to matching Jira issues - Code correlation and root cause analysis - Recommendations - **relevant-logs.txt**: Extracted relevant log entries from must-gather - **code-references.md**: Links to relevant source code files with line numbers - **known-bugs.md**: List of potentially related Jira bugs including: - Bug ID and link (e.g., OCPBUGS-12345) - Bug summary and status - Match confidence (High/Medium/Low) - Affected versions - Available workarounds or patches - **repos/**: Cloned repository directories for further manual investigation ## Examples 1. **Basic usage with olmv0 (default)**: ``` /olm:debug "CSV stuck in pending state" /path/to/must-gather ``` 2. **Debug olmv1 issue**: ``` /olm:debug "ClusterExtension installation failing" /path/to/must-gather olmv1 ``` 3. **Debug with detailed issue description**: ``` /olm:debug "Operator upgrade from v1.0 to v2.0 fails with dependency resolution error" ~/Downloads/must-gather.local.123456 olmv0 ``` ## Notes - The command requires `git` to be installed for cloning repositories - Network access is required to clone from GitHub and access Jira - Large must-gather archives may take time to process - The analysis is based on pattern matching and may require manual verification - For private repositories, ensure GitHub credentials are configured - Jira access to https://issues.redhat.com/ may require authentication for full access - Known bug matching is based on text similarity and may produce false positives - Always verify suggested bug matches by reading the full bug description ## See Also - OLM Documentation: https://olm.operatorframework.io/ - OpenShift OLM: https://docs.openshift.com/container-platform/latest/operators/understanding/olm/olm-understanding-olm.html - Must-gather documentation: https://docs.openshift.com/container-platform/latest/support/gathering-cluster-data.html - OCPBUGS Jira Project: https://issues.redhat.com/projects/OCPBUGS/ - Jira REST API: https://docs.atlassian.com/jira-software/REST/latest/