Name
olm:diagnose
Synopsis
/olm:diagnose [operator-name] [namespace] [--fix] [--cluster]
Description
The olm:diagnose command diagnoses common OLM and operator issues, including orphaned CRDs, stuck namespaces, failed installations, and catalog source problems. It can optionally attempt to fix detected issues automatically.
This command helps you:
- Detect and clean up orphaned CRDs from deleted operators
- Fix namespaces stuck in Terminating state
- Identify and resolve failed operator installations
- Detect conflicting OperatorGroups
- Check catalog source health
- Identify resources preventing clean uninstallation
- Generate comprehensive troubleshooting reports
Implementation
The command performs the following steps:
- Parse Arguments:
  - `$1`: Operator name (optional) - Specific operator to diagnose
  - `$2`: Namespace (optional) - Specific namespace to check
  - `$3+`: Flags (optional):
    - `--fix`: Automatically attempt to fix detected issues (requires confirmation)
    - `--cluster`: Run cluster-wide diagnostics (catalog sources, global CRDs, etc.)
- Prerequisites Check:
  - Verify `oc` CLI is installed: `which oc`
  - Verify cluster access: `oc whoami`
  - Check if user has cluster-admin or sufficient privileges
  - Warn if running without `--fix` flag (dry-run mode)
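  A minimal sketch of the privilege probe, assuming `oc auth can-i` is used to test the permissions the later steps rely on (the exact set of checks is an assumption, not specified by this command):

  ```bash
  # Illustrative pre-flight check; $check word-splits into verb/resource/flags on purpose.
  for check in "get crd" "get csv --all-namespaces" "patch namespace"; do
    oc auth can-i $check >/dev/null 2>&1 || echo "Missing permission: oc auth can-i $check"
  done
  ```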
- Determine Scope:
  - Operator-specific: If operator name provided, focus on that operator
  - Namespace-specific: If namespace provided, check all operators in that namespace
  - Cluster-wide: If `--cluster` flag or no arguments, check entire cluster
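  As an illustration only (the variable names are hypothetical), the scope decision could be expressed as:

  ```bash
  # Hypothetical scope resolution from the parsed arguments.
  if [[ "$*" == *"--cluster"* || ( -z "$OPERATOR_NAME" && -z "$NAMESPACE" ) ]]; then
    SCOPE="cluster"
  elif [[ -n "$OPERATOR_NAME" ]]; then
    SCOPE="operator"    # optionally narrowed to $NAMESPACE when both are given
  else
    SCOPE="namespace"
  fi
  ```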
- Scan for Orphaned CRDs:
  - Get all CRDs in the cluster:
    ```bash
    oc get crd -o json
    ```
  - For each CRD, check if there's a corresponding operator:
    - Look for CSVs that own this CRD
    - Look for active Subscriptions related to this CRD
  - Identify orphaned CRDs (no owning operator found):
    ```bash
    # Find CRDs without active operators
    # This is a simplified check - actual implementation should verify operator ownership
    oc get crd -o json | jq -r '.items[] | select(.metadata.annotations["operators.coreos.com/owner"] // "" | length == 0) | .metadata.name'
    ```
  - Check if CRs exist for orphaned CRDs:
    ```bash
    oc get <crd-kind> --all-namespaces --ignore-not-found
    ```
  - Report findings:
    ```
    ⚠️ Orphaned CRDs Detected

    The following CRDs have no active operator:
      - certificates.cert-manager.io (3 CR instances in 2 namespaces)
      - issuers.cert-manager.io (5 CR instances in 3 namespaces)

    These CRDs may be leftovers from uninstalled operators.

    [If --fix flag:]
    Do you want to delete these CRDs and their CRs? (yes/no)
    WARNING: This will delete all custom resources of these types!
    ```
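  As a sketch of the ownership verification mentioned above (an assumption about how it could be done, not the command's defined logic), each CSV lists the CRDs it owns under `.spec.customresourcedefinitions.owned`:

  ```bash
  # Does any CSV in the cluster claim ownership of this CRD?
  CRD_NAME="certificates.cert-manager.io"   # example value
  oc get csv --all-namespaces -o json \
    | jq -r --arg crd "$CRD_NAME" \
        '.items[] | select(.spec.customresourcedefinitions.owned[]?.name == $crd)
         | "\(.metadata.namespace)/\(.metadata.name)"'
  ```
  An empty result for a CRD that still has CR instances is a strong orphan signal.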
- Check for Stuck Namespaces:
  - Get all namespaces in Terminating state:
    ```bash
    oc get namespaces -o json | jq -r '.items[] | select(.status.phase=="Terminating") | .metadata.name'
    ```
  - For each stuck namespace:
    - Check remaining resources:
      ```bash
      oc api-resources --verbs=list --namespaced -o name | \
        xargs -n 1 oc get --show-kind --ignore-not-found -n {namespace}
      ```
    - Check namespace finalizers:
      ```bash
      oc get namespace {namespace} -o jsonpath='{.metadata.finalizers}'
      ```
    - Identify blocking resources
  - Report findings:
    ```
    ❌ Stuck Namespace Detected

    Namespace: {namespace}
    State: Terminating (stuck for {duration})

    Blocking resources:
      - CustomResourceDefinition: {crd-name} (finalizer: {finalizer})
      - ServiceAccount: {sa-name} (token secret)

    Finalizers on namespace:
      - kubernetes

    [If --fix flag:]
    Attempted fixes:
      1. Delete remaining resources
      2. Remove finalizers from CRs
      3. Patch namespace to remove finalizers (CAUTION)

    WARNING: Force-deleting namespace can cause cluster instability.
    ```
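  A sketch of the finalizer-related fixes referenced above, assuming the blocking object is a custom resource with a dangling finalizer (placeholders in angle brackets; the same caution as in the report applies):

  ```bash
  # Strip finalizers from a stuck custom resource.
  oc patch <crd-kind>/<cr-name> -n <namespace> --type=merge -p '{"metadata":{"finalizers":[]}}'

  # Last resort: finalize the namespace through the API (can orphan resources).
  oc get namespace <namespace> -o json \
    | jq '.spec.finalizers = []' \
    | oc replace --raw "/api/v1/namespaces/<namespace>/finalize" -f -
  ```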
- Scan for Failed Operator Installations:
  - Get all CSVs not in "Succeeded" phase:
    ```bash
    oc get csv --all-namespaces -o json | \
      jq -r '.items[] | select(.status.phase != "Succeeded") | "\(.metadata.namespace)/\(.metadata.name): \(.status.phase)"'
    ```
  - For each failed CSV:
    - Get failure reason: `.status.reason`
    - Get failure message: `.status.message`
    - Check related InstallPlan status
    - Check deployment status
    - Check recent events
  - Report findings:
    ```
    ❌ Failed Operator Installation

    Operator: {operator-name}
    Namespace: {namespace}
    CSV: {csv-name}
    Phase: Failed
    Reason: {reason}
    Message: {message}

    Related InstallPlan: {installplan-name} (Phase: {phase})

    Recent Events:
      - {timestamp} Warning: {event-message}

    Troubleshooting suggestions:
      - Check operator logs: oc logs -n {namespace} deployment/{deployment}
      - Check image pull issues: oc describe pod -n {namespace}
      - Verify catalog source health
      - Check RBAC permissions
    ```
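  For illustration, the per-CSV detail gathering could look like the following (the namespace and CSV name are example values, not output of this command):

  ```bash
  NS="cert-manager-operator"; CSV="cert-manager.v1.14.0"   # example values
  oc get csv "$CSV" -n "$NS" -o jsonpath='{.status.reason}{"\n"}{.status.message}{"\n"}'
  oc get installplan -n "$NS"
  oc get events -n "$NS" --sort-by=.lastTimestamp | tail -n 20
  ```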
- Check for Conflicting OperatorGroups:
  - Get all OperatorGroups per namespace:
    ```bash
    oc get operatorgroup --all-namespaces -o json
    ```
  - Identify namespaces with multiple OperatorGroups (conflict):
    ```bash
    oc get operatorgroup --all-namespaces -o json | \
      jq -r '.items | group_by(.metadata.namespace) | .[] | select(length > 1) | .[0].metadata.namespace'
    ```
  - Check for OperatorGroups with overlapping target namespaces
  - Report findings:
    ```
    ⚠️ Conflicting OperatorGroups Detected

    Namespace: {namespace}
    OperatorGroups: {count}
      - {og-1} (targets: {target-namespaces-1})
      - {og-2} (targets: {target-namespaces-2})

    Multiple OperatorGroups in a namespace can cause conflicts.
    Only one OperatorGroup should exist per namespace.

    [If --fix flag:]
    Keep which OperatorGroup? (1/2)
    ```
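  The overlap check is not spelled out above; one plausible way to surface each OperatorGroup's targets for comparison (an assumption, not defined behavior of this command):

  ```bash
  # List every OperatorGroup with its target namespaces so overlaps stand out.
  oc get operatorgroup --all-namespaces -o json \
    | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name): targets=\(.spec.targetNamespaces // ["<all>"] | join(","))"'
  ```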
- Verify Catalog Source Health (if `--cluster` flag):
  - Get all CatalogSources:
    ```bash
    oc get catalogsource -n openshift-marketplace -o json
    ```
  - For each catalog:
    - Check status: `.status.connectionState.lastObservedState`
    - Check pod status
    - Check last update time
    - Verify grpc connection
  - Report findings:
    ```
    🔍 Catalog Source Health Check

    ✓ redhat-operators: READY (last updated: 2h ago)
    ✓ certified-operators: READY (last updated: 3h ago)
    ✓ community-operators: READY (last updated: 1h ago)
    ❌ custom-catalog: CONNECTION_FAILED (pod: CrashLoopBackOff)

    [If issues found:]
    Unhealthy Catalog: custom-catalog
    Pod: custom-catalog-abc123 (Status: CrashLoopBackOff)

    To troubleshoot:
      oc logs -n openshift-marketplace custom-catalog-abc123
      oc describe catalogsource custom-catalog -n openshift-marketplace
    ```
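  A compact way to produce the per-catalog state shown above could be (illustrative; the `olm.catalogSource` pod label is set by OLM on catalog pods):

  ```bash
  # Summarize each CatalogSource's last observed connection state.
  oc get catalogsource -n openshift-marketplace -o json \
    | jq -r '.items[] | "\(.metadata.name): \(.status.connectionState.lastObservedState // "UNKNOWN")"'

  # Cross-check the pods backing each catalog.
  oc get pods -n openshift-marketplace -l olm.catalogSource
  ```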
- Check for Subscription/CSV Mismatches:
  - Get all Subscriptions:
    ```bash
    oc get subscription --all-namespaces -o json
    ```
  - For each Subscription:
    - Compare `installedCSV` with `currentCSV`
    - Check if CSV exists
    - Verify CSV phase
  - Report findings:
    ```
    ⚠️ Subscription/CSV Mismatch

    Operator: {operator-name}
    Namespace: {namespace}
    Installed CSV: {installed-csv}
    Current CSV: {current-csv}

    CSV {installed-csv} not found in namespace.
    This may indicate a failed installation or upgrade.

    Suggested fix:
      oc delete subscription {operator-name} -n {namespace}
      /olm:install {operator-name} {namespace}
    ```
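  The comparison itself could be scripted along these lines (a sketch; the field paths follow the Subscription status API):

  ```bash
  # Print Subscriptions whose installedCSV differs from currentCSV.
  oc get subscription --all-namespaces -o json \
    | jq -r '.items[]
        | select(.status.installedCSV != .status.currentCSV)
        | "\(.metadata.namespace)/\(.metadata.name): installed=\(.status.installedCSV // "none") current=\(.status.currentCSV // "none")"'
  ```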
- Check for Pending Manual Approvals:
  - Find all unapproved InstallPlans:
    ```bash
    oc get installplan --all-namespaces -o json | \
      jq -r '.items[] | select(.spec.approved==false)'
    ```
  - Report findings:
    ```
    ℹ️ Pending Manual Approvals

    The following operators have pending InstallPlans requiring approval:

    - Operator: openshift-cert-manager-operator
      Namespace: cert-manager-operator
      InstallPlan: install-abc123
      Target Version: v1.14.0
      To approve: /olm:approve openshift-cert-manager-operator cert-manager-operator

    - Operator: external-secrets-operator
      Namespace: eso-operator
      InstallPlan: install-def456
      Target Version: v0.11.0
      To approve: /olm:approve external-secrets-operator eso-operator
    ```
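  Approval is normally done with `/olm:approve`, but the underlying operation is expected to be a patch on the InstallPlan (shown here as an assumption about that command's mechanics, using the example names from the report above):

  ```bash
  oc patch installplan install-abc123 -n cert-manager-operator \
    --type merge -p '{"spec":{"approved":true}}'
  ```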
- Generate Comprehensive Report:
  ```
  ═══════════════════════════════════════════════════════════
  OLM HEALTH CHECK REPORT
  ═══════════════════════════════════════════════════════════
  Scan Scope: [Operator-specific | Namespace | Cluster-wide]
  Scan Time: {timestamp}

  ✓ HEALTHY CHECKS: {count}
    - Catalog sources operational
    - No conflicting OperatorGroups
    - All CSVs in Succeeded phase

  ⚠️ WARNINGS: {count}
    - {warning-count} orphaned CRDs detected
    - {warning-count} pending manual approvals

  ❌ ERRORS: {count}
    - {error-count} stuck namespaces
    - {error-count} failed operator installations
    - {error-count} unhealthy catalog sources

  ═══════════════════════════════════════════════════════════
  DETAILED FINDINGS
  ═══════════════════════════════════════════════════════════
  [Details for each finding...]

  ═══════════════════════════════════════════════════════════
  RECOMMENDATIONS
  ═══════════════════════════════════════════════════════════
  1. Clean up orphaned CRDs: /olm:diagnose --fix
  2. Fix stuck namespace: /olm:diagnose {namespace} --fix
  3. Approve pending upgrades: /olm:approve {operator-name}

  For more details on troubleshooting, see:
  https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html/operators/administrator-tasks#olm-troubleshooting-operator-issues
  ```
- Auto-Fix Issues (if `--fix` flag):
  - For each detected issue, ask for confirmation
  - Attempt fixes based on issue type:
    - Orphaned CRDs: Delete CRs first, then CRDs
    - Stuck namespaces: Delete remaining resources, remove finalizers
    - Failed installations: Restart by deleting and recreating
    - Conflicting OperatorGroups: Remove unwanted OperatorGroup
    - Unhealthy catalogs: Restart catalog pod
  - Display results of each fix attempt
  - Generate final summary
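  As a sketch of two of the fix types above (not the command's exact implementation; placeholders in angle brackets):

  ```bash
  # Orphaned CRD cleanup: delete remaining CRs first, then the CRD itself.
  oc delete <crd-kind> --all --all-namespaces
  oc delete crd <crd-name>

  # Unhealthy catalog: delete the catalog pod so OLM recreates it.
  oc delete pod -n openshift-marketplace -l olm.catalogSource=<catalog-name>
  ```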
Return Value
- Success: Report generated with findings
- Issues Found: Detailed report with warnings and errors
- Fixed: Issues resolved (if `--fix` flag used)
- Format: Structured report showing:
  - Summary of health checks
  - Detailed findings for each issue
  - Recommendations and next steps
  - Links to documentation
Examples
- Check specific operator:
  ```
  /olm:diagnose openshift-cert-manager-operator
  ```
- Cluster-wide health check:
  ```
  /olm:diagnose --cluster
  ```
- Diagnose and fix issues:
  ```
  /olm:diagnose openshift-cert-manager-operator cert-manager-operator --fix
  ```
- Full cluster scan with auto-fix:
  ```
  /olm:diagnose --cluster --fix
  ```
Arguments
- `$1` (operator-name): Name of specific operator to diagnose (optional)
  - If not provided, checks all operators (or cluster-wide with `--cluster`)
  - Example: "openshift-cert-manager-operator"
- `$2` (namespace): Specific namespace to check (optional)
  - If not provided with operator-name, searches all namespaces
  - Example: "cert-manager-operator"
- `$3+` (flags): Optional flags
  - `--fix`: Attempt to automatically fix detected issues
    - Prompts for confirmation before each fix
    - Use with caution in production environments
  - `--cluster`: Run cluster-wide diagnostics
    - Checks catalog sources
    - Scans for orphaned CRDs across all namespaces
    - Identifies global issues
Troubleshooting
- Permission denied:
  ```bash
  # Check required permissions
  oc auth can-i get crd
  oc auth can-i get csv --all-namespaces
  oc auth can-i patch namespace
  ```
- Unable to fix stuck namespace:
  - Some resources may require manual intervention
  - Check API service availability:
    ```bash
    oc get apiservice
    ```
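  Unavailable aggregated API services are a common blocker for namespace deletion; a quick way to spot them (illustrative):

  ```bash
  # List APIServices that are not reporting Available=True.
  oc get apiservice -o json \
    | jq -r '.items[] | select(any(.status.conditions[]?; .type=="Available" and .status!="True")) | .metadata.name'
  ```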
- CRDs won't delete:
  ```bash
  # Check for remaining CRs
  oc get <crd-kind> --all-namespaces

  # Check for finalizers
  oc get crd <crd-name> -o jsonpath='{.metadata.finalizers}'
  ```
- Catalog source issues persist:
  ```bash
  # Restart catalog pod
  oc delete pod -n openshift-marketplace <catalog-pod>

  # Check catalog source definition
  oc get catalogsource <catalog-name> -n openshift-marketplace -o yaml
  ```
Related Commands
- `/olm:status <operator-name>` - Check specific operator status
- `/olm:list` - List all operators
- `/olm:uninstall <operator-name>` - Clean uninstall with orphan cleanup
- `/olm:approve <operator-name>` - Approve pending InstallPlans