Name
olm:diagnose
Synopsis
/olm:diagnose [operator-name] [namespace] [--fix] [--cluster]
Description
The olm:diagnose command diagnoses common OLM and operator issues, including orphaned CRDs, stuck namespaces, failed installations, and catalog source problems. It can optionally attempt to fix detected issues automatically.
This command helps you:
- Detect and clean up orphaned CRDs from deleted operators
- Fix namespaces stuck in Terminating state
- Identify and resolve failed operator installations
- Detect conflicting OperatorGroups
- Check catalog source health
- Identify resources preventing clean uninstallation
- Generate comprehensive troubleshooting reports
Implementation
The command performs the following steps:
- Parse Arguments:
  - `$1`: Operator name (optional) - Specific operator to diagnose
  - `$2`: Namespace (optional) - Specific namespace to check
  - `$3+`: Flags (optional):
    - `--fix`: Automatically attempt to fix detected issues (requires confirmation)
    - `--cluster`: Run cluster-wide diagnostics (catalog sources, global CRDs, etc.)
- Prerequisites Check:
  - Verify `oc` CLI is installed: `which oc`
  - Verify cluster access: `oc whoami`
  - Check if user has cluster-admin or sufficient privileges
  - Warn if running without `--fix` flag (dry-run mode)
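  A minimal sketch of the privilege probe, assuming `oc auth can-i` is used to test the permissions the later steps rely on (the exact set of checks is an assumption, not specified by this command):

  ```bash
  # Illustrative pre-flight check; $check word-splits into verb/resource/flags on purpose.
  for check in "get crd" "get csv --all-namespaces" "patch namespace"; do
    oc auth can-i $check >/dev/null 2>&1 || echo "Missing permission: oc auth can-i $check"
  done
  ```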
- Determine Scope:
  - Operator-specific: If operator name provided, focus on that operator
  - Namespace-specific: If namespace provided, check all operators in that namespace
  - Cluster-wide: If `--cluster` flag or no arguments, check entire cluster
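  As an illustration only (the variable names are hypothetical), the scope decision could be expressed as:

  ```bash
  # Hypothetical scope resolution from the parsed arguments.
  if [[ "$*" == *"--cluster"* || ( -z "$OPERATOR_NAME" && -z "$NAMESPACE" ) ]]; then
    SCOPE="cluster"
  elif [[ -n "$OPERATOR_NAME" ]]; then
    SCOPE="operator"    # optionally narrowed to $NAMESPACE when both are given
  else
    SCOPE="namespace"
  fi
  ```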
- Scan for Orphaned CRDs:
  - Get all CRDs in the cluster:
    ```bash
    oc get crd -o json
    ```
  - For each CRD, check if there's a corresponding operator:
    - Look for CSVs that own this CRD
    - Look for active Subscriptions related to this CRD
  - Identify orphaned CRDs (no owning operator found):
    ```bash
    # Find CRDs without active operators
    # This is a simplified check - actual implementation should verify operator ownership
    oc get crd -o json | jq -r '.items[] | select(.metadata.annotations["operators.coreos.com/owner"] // "" | length == 0) | .metadata.name'
    ```
  - Check if CRs exist for orphaned CRDs:
    ```bash
    oc get <crd-kind> --all-namespaces --ignore-not-found
    ```
  - Report findings:
    ```
    ⚠️ Orphaned CRDs Detected

    The following CRDs have no active operator:
      - certificates.cert-manager.io (3 CR instances in 2 namespaces)
      - issuers.cert-manager.io (5 CR instances in 3 namespaces)

    These CRDs may be leftovers from uninstalled operators.

    [If --fix flag:]
    Do you want to delete these CRDs and their CRs? (yes/no)
    WARNING: This will delete all custom resources of these types!
    ```
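  As a sketch of the ownership verification mentioned above (an assumption about how it could be done, not the command's defined logic), each CSV lists the CRDs it owns under `.spec.customresourcedefinitions.owned`:

  ```bash
  # Does any CSV in the cluster claim ownership of this CRD?
  CRD_NAME="certificates.cert-manager.io"   # example value
  oc get csv --all-namespaces -o json \
    | jq -r --arg crd "$CRD_NAME" \
        '.items[] | select(.spec.customresourcedefinitions.owned[]?.name == $crd)
         | "\(.metadata.namespace)/\(.metadata.name)"'
  ```
  An empty result for a CRD that still has CR instances is a strong orphan signal.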
- Check for Stuck Namespaces:
  - Get all namespaces in Terminating state:
    ```bash
    oc get namespaces -o json | jq -r '.items[] | select(.status.phase=="Terminating") | .metadata.name'
    ```
  - For each stuck namespace:
    - Check remaining resources:
      ```bash
      oc api-resources --verbs=list --namespaced -o name | \
        xargs -n 1 oc get --show-kind --ignore-not-found -n {namespace}
      ```
    - Check namespace finalizers:
      ```bash
      oc get namespace {namespace} -o jsonpath='{.metadata.finalizers}'
      ```
    - Identify blocking resources
  - Report findings:
    ```
    ❌ Stuck Namespace Detected

    Namespace: {namespace}
    State: Terminating (stuck for {duration})

    Blocking resources:
      - CustomResourceDefinition: {crd-name} (finalizer: {finalizer})
      - ServiceAccount: {sa-name} (token secret)

    Finalizers on namespace:
      - kubernetes

    [If --fix flag:]
    Attempted fixes:
      1. Delete remaining resources
      2. Remove finalizers from CRs
      3. Patch namespace to remove finalizers (CAUTION)

    WARNING: Force-deleting namespace can cause cluster instability.
    ```
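  A sketch of the finalizer-related fixes referenced above, assuming the blocking object is a custom resource with a dangling finalizer (placeholders in angle brackets; the same caution as in the report applies):

  ```bash
  # Strip finalizers from a stuck custom resource.
  oc patch <crd-kind>/<cr-name> -n <namespace> --type=merge -p '{"metadata":{"finalizers":[]}}'

  # Last resort: finalize the namespace through the API (can orphan resources).
  oc get namespace <namespace> -o json \
    | jq '.spec.finalizers = []' \
    | oc replace --raw "/api/v1/namespaces/<namespace>/finalize" -f -
  ```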
- Scan for Failed Operator Installations:
  - Get all CSVs not in "Succeeded" phase:
    ```bash
    oc get csv --all-namespaces -o json | \
      jq -r '.items[] | select(.status.phase != "Succeeded") | "\(.metadata.namespace)/\(.metadata.name): \(.status.phase)"'
    ```
  - For each failed CSV:
    - Get failure reason: `.status.reason`
    - Get failure message: `.status.message`
    - Check related InstallPlan status
    - Check deployment status
    - Check recent events
  - Report findings:
    ```
    ❌ Failed Operator Installation

    Operator: {operator-name}
    Namespace: {namespace}
    CSV: {csv-name}
    Phase: Failed
    Reason: {reason}
    Message: {message}

    Related InstallPlan: {installplan-name} (Phase: {phase})

    Recent Events:
      - {timestamp} Warning: {event-message}

    Troubleshooting suggestions:
      - Check operator logs: oc logs -n {namespace} deployment/{deployment}
      - Check image pull issues: oc describe pod -n {namespace}
      - Verify catalog source health
      - Check RBAC permissions
    ```
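  For illustration, the per-CSV detail gathering could look like the following (the namespace and CSV name are example values, not output of this command):

  ```bash
  NS="cert-manager-operator"; CSV="cert-manager.v1.14.0"   # example values
  oc get csv "$CSV" -n "$NS" -o jsonpath='{.status.reason}{"\n"}{.status.message}{"\n"}'
  oc get installplan -n "$NS"
  oc get events -n "$NS" --sort-by=.lastTimestamp | tail -n 20
  ```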
- Check for Conflicting OperatorGroups:
  - Get all OperatorGroups per namespace:
    ```bash
    oc get operatorgroup --all-namespaces -o json
    ```
  - Identify namespaces with multiple OperatorGroups (conflict):
    ```bash
    oc get operatorgroup --all-namespaces -o json | \
      jq -r '.items | group_by(.metadata.namespace) | .[] | select(length > 1) | .[0].metadata.namespace'
    ```
  - Check for OperatorGroups with overlapping target namespaces
  - Report findings:
    ```
    ⚠️ Conflicting OperatorGroups Detected

    Namespace: {namespace}
    OperatorGroups: {count}
      - {og-1} (targets: {target-namespaces-1})
      - {og-2} (targets: {target-namespaces-2})

    Multiple OperatorGroups in a namespace can cause conflicts.
    Only one OperatorGroup should exist per namespace.

    [If --fix flag:]
    Keep which OperatorGroup? (1/2)
    ```
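  The overlap check is not spelled out above; one plausible way to surface each OperatorGroup's targets for comparison (an assumption, not defined behavior of this command):

  ```bash
  # List every OperatorGroup with its target namespaces so overlaps stand out.
  oc get operatorgroup --all-namespaces -o json \
    | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name): targets=\(.spec.targetNamespaces // ["<all>"] | join(","))"'
  ```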
- Verify Catalog Source Health (if `--cluster` flag):
  - Get all CatalogSources:
    ```bash
    oc get catalogsource -n openshift-marketplace -o json
    ```
  - For each catalog:
    - Check status: `.status.connectionState.lastObservedState`
    - Check pod status
    - Check last update time
    - Verify grpc connection
  - Report findings:
    ```
    🔍 Catalog Source Health Check

    ✓ redhat-operators: READY (last updated: 2h ago)
    ✓ certified-operators: READY (last updated: 3h ago)
    ✓ community-operators: READY (last updated: 1h ago)
    ❌ custom-catalog: CONNECTION_FAILED (pod: CrashLoopBackOff)

    [If issues found:]
    Unhealthy Catalog: custom-catalog
    Pod: custom-catalog-abc123 (Status: CrashLoopBackOff)

    To troubleshoot:
      oc logs -n openshift-marketplace custom-catalog-abc123
      oc describe catalogsource custom-catalog -n openshift-marketplace
    ```
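  A compact way to produce the per-catalog state shown above could be (illustrative; the `olm.catalogSource` pod label is set by OLM on catalog pods):

  ```bash
  # Summarize each CatalogSource's last observed connection state.
  oc get catalogsource -n openshift-marketplace -o json \
    | jq -r '.items[] | "\(.metadata.name): \(.status.connectionState.lastObservedState // "UNKNOWN")"'

  # Cross-check the pods backing each catalog.
  oc get pods -n openshift-marketplace -l olm.catalogSource
  ```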
- Check for Subscription/CSV Mismatches:
  - Get all Subscriptions:
    ```bash
    oc get subscription --all-namespaces -o json
    ```
  - For each Subscription:
    - Compare `installedCSV` with `currentCSV`
    - Check if CSV exists
    - Verify CSV phase
  - Report findings:
    ```
    ⚠️ Subscription/CSV Mismatch

    Operator: {operator-name}
    Namespace: {namespace}
    Installed CSV: {installed-csv}
    Current CSV: {current-csv}

    CSV {installed-csv} not found in namespace.
    This may indicate a failed installation or upgrade.

    Suggested fix:
      oc delete subscription {operator-name} -n {namespace}
      /olm:install {operator-name} {namespace}
    ```
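  The comparison itself could be scripted along these lines (a sketch; the field paths follow the Subscription status API):

  ```bash
  # Print Subscriptions whose installedCSV differs from currentCSV.
  oc get subscription --all-namespaces -o json \
    | jq -r '.items[]
        | select(.status.installedCSV != .status.currentCSV)
        | "\(.metadata.namespace)/\(.metadata.name): installed=\(.status.installedCSV // "none") current=\(.status.currentCSV // "none")"'
  ```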
- Check for Pending Manual Approvals:
  - Find all unapproved InstallPlans:
    ```bash
    oc get installplan --all-namespaces -o json | \
      jq -r '.items[] | select(.spec.approved==false)'
    ```
  - Report findings:
    ```
    ℹ️ Pending Manual Approvals

    The following operators have pending InstallPlans requiring approval:

    - Operator: openshift-cert-manager-operator
      Namespace: cert-manager-operator
      InstallPlan: install-abc123
      Target Version: v1.14.0
      To approve: /olm:approve openshift-cert-manager-operator cert-manager-operator

    - Operator: external-secrets-operator
      Namespace: eso-operator
      InstallPlan: install-def456
      Target Version: v0.11.0
      To approve: /olm:approve external-secrets-operator eso-operator
    ```
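  Approval is normally done with `/olm:approve`, but the underlying operation is expected to be a patch on the InstallPlan (shown here as an assumption about that command's mechanics, using the example names from the report above):

  ```bash
  oc patch installplan install-abc123 -n cert-manager-operator \
    --type merge -p '{"spec":{"approved":true}}'
  ```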
- Generate Comprehensive Report:
  ```
  ═══════════════════════════════════════════════════════════
  OLM HEALTH CHECK REPORT
  ═══════════════════════════════════════════════════════════
  Scan Scope: [Operator-specific | Namespace | Cluster-wide]
  Scan Time: {timestamp}

  ✓ HEALTHY CHECKS: {count}
    - Catalog sources operational
    - No conflicting OperatorGroups
    - All CSVs in Succeeded phase

  ⚠️ WARNINGS: {count}
    - {warning-count} orphaned CRDs detected
    - {warning-count} pending manual approvals

  ❌ ERRORS: {count}
    - {error-count} stuck namespaces
    - {error-count} failed operator installations
    - {error-count} unhealthy catalog sources

  ═══════════════════════════════════════════════════════════
  DETAILED FINDINGS
  ═══════════════════════════════════════════════════════════
  [Details for each finding...]

  ═══════════════════════════════════════════════════════════
  RECOMMENDATIONS
  ═══════════════════════════════════════════════════════════
  1. Clean up orphaned CRDs: /olm:diagnose --fix
  2. Fix stuck namespace: /olm:diagnose {namespace} --fix
  3. Approve pending upgrades: /olm:approve {operator-name}

  For more details on troubleshooting, see:
  https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html/operators/administrator-tasks#olm-troubleshooting-operator-issues
  ```
- Auto-Fix Issues (if `--fix` flag):
  - For each detected issue, ask for confirmation
  - Attempt fixes based on issue type:
    - Orphaned CRDs: Delete CRs first, then CRDs
    - Stuck namespaces: Delete remaining resources, remove finalizers
    - Failed installations: Restart by deleting and recreating
    - Conflicting OperatorGroups: Remove unwanted OperatorGroup
    - Unhealthy catalogs: Restart catalog pod
  - Display results of each fix attempt
  - Generate final summary
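  As a sketch of two of the fix types above (not the command's exact implementation; placeholders in angle brackets):

  ```bash
  # Orphaned CRD cleanup: delete remaining CRs first, then the CRD itself.
  oc delete <crd-kind> --all --all-namespaces
  oc delete crd <crd-name>

  # Unhealthy catalog: delete the catalog pod so OLM recreates it.
  oc delete pod -n openshift-marketplace -l olm.catalogSource=<catalog-name>
  ```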
Return Value
- Success: Report generated with findings
- Issues Found: Detailed report with warnings and errors
- Fixed: Issues resolved (if `--fix` flag used)
- Format: Structured report showing:
  - Summary of health checks
  - Detailed findings for each issue
  - Recommendations and next steps
  - Links to documentation
Examples
- Check specific operator:
  ```
  /olm:diagnose openshift-cert-manager-operator
  ```
- Cluster-wide health check:
  ```
  /olm:diagnose --cluster
  ```
- Diagnose and fix issues:
  ```
  /olm:diagnose openshift-cert-manager-operator cert-manager-operator --fix
  ```
- Full cluster scan with auto-fix:
  ```
  /olm:diagnose --cluster --fix
  ```
Arguments
- `$1` (operator-name): Name of specific operator to diagnose (optional)
  - If not provided, checks all operators (or cluster-wide with `--cluster`)
  - Example: "openshift-cert-manager-operator"
- `$2` (namespace): Specific namespace to check (optional)
  - If not provided with operator-name, searches all namespaces
  - Example: "cert-manager-operator"
- `$3+` (flags): Optional flags
  - `--fix`: Attempt to automatically fix detected issues
    - Prompts for confirmation before each fix
    - Use with caution in production environments
  - `--cluster`: Run cluster-wide diagnostics
    - Checks catalog sources
    - Scans for orphaned CRDs across all namespaces
    - Identifies global issues
Troubleshooting
- Permission denied:
  ```bash
  # Check required permissions
  oc auth can-i get crd
  oc auth can-i get csv --all-namespaces
  oc auth can-i patch namespace
  ```
- Unable to fix stuck namespace:
  - Some resources may require manual intervention
  - Check API service availability:
    ```bash
    oc get apiservice
    ```
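  Unavailable aggregated API services are a common blocker for namespace deletion; a quick way to spot them (illustrative):

  ```bash
  # List APIServices that are not reporting Available=True.
  oc get apiservice -o json \
    | jq -r '.items[] | select(any(.status.conditions[]?; .type=="Available" and .status!="True")) | .metadata.name'
  ```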
- CRDs won't delete:
  ```bash
  # Check for remaining CRs
  oc get <crd-kind> --all-namespaces

  # Check for finalizers
  oc get crd <crd-name> -o jsonpath='{.metadata.finalizers}'
  ```
- Catalog source issues persist:
  ```bash
  # Restart catalog pod
  oc delete pod -n openshift-marketplace <catalog-pod>

  # Check catalog source definition
  oc get catalogsource <catalog-name> -n openshift-marketplace -o yaml
  ```
Related Commands
- `/olm:status <operator-name>` - Check specific operator status
- `/olm:list` - List all operators
- `/olm:uninstall <operator-name>` - Clean uninstall with orphan cleanup
- `/olm:approve <operator-name>` - Approve pending InstallPlans