Initial commit

2025-11-30 08:46:06 +08:00
commit cf9da06850
16 changed files with 3315 additions and 0 deletions
--- a/commands/analyze.md
+++ b/commands/analyze.md
@@ -0,0 +1,262 @@
+---
+description: Quick analysis of must-gather data - runs all analysis scripts and provides comprehensive cluster diagnostics
+argument-hint: [must-gather-path] [component]
+---
+
+## Name
+must-gather:analyze
+
+## Synopsis
+```
+/must-gather:analyze [must-gather-path] [component]
+```
+
+## Description
+
+The `analyze` command performs comprehensive analysis of OpenShift must-gather diagnostic data. It runs specialized Python analysis scripts to extract and summarize cluster health information across multiple components.
+
+The command can analyze:
+- Cluster version and update status
+- Cluster operator health (degraded, progressing, unavailable)
+- Node conditions and resource status
+- Pod failures, restarts, and crash loops
+- Network configuration and OVN health
+- OVN databases - logical topology, ACLs, pods
+- Kubernetes events (warnings and errors)
+- etcd cluster health and quorum status
+- Persistent volume and claim status
+- Prometheus alerts
+
+You can request analysis of the entire cluster or focus on a specific component.
+
+## Prerequisites
+
+**Required Directory Structure:**
+
+Must-gather data typically has this structure:
+```
+must-gather/
+└── registry-ci-openshift-org-origin-...-sha256-<hash>/
+    ├── cluster-scoped-resources/
+    ├── namespaces/
+    └── ...
+```
+
+The actual must-gather directory is the subdirectory with the hash name, not the parent directory.
+
+**Required Scripts:**
+
+Analysis scripts are bundled with this plugin at:
+```
+<plugin-root>/skills/must-gather-analyzer/scripts/
+├── analyze_clusterversion.py
+├── analyze_clusteroperators.py
+├── analyze_nodes.py
+├── analyze_pods.py
+├── analyze_network.py
+├── analyze_ovn_dbs.py
+├── analyze_events.py
+├── analyze_etcd.py
+└── analyze_pvs.py
+```
+
+Where `<plugin-root>` is the directory where this plugin is installed (typically `~/.cursor/commands/ai-helpers/plugins/must-gather/` or similar).
+
+## Error Handling
+
+**CRITICAL: Script-Only Analysis**
+
+- **NEVER** attempt to analyze must-gather data directly using bash commands, grep, or manual file reading
+- **ONLY** use the provided Python scripts in `plugins/must-gather/skills/must-gather-analyzer/scripts/`
+- If scripts are missing or not found:
+  1. Stop immediately
+  2. Inform the user that the analysis scripts are not available
+  3. Ask the user to ensure the scripts are installed at the correct path
+  4. Do NOT attempt alternative approaches
+
+**Script Availability Check:**
+
+Before running any analysis:
+
+1. Locate the scripts directory by searching for a known script:
+   ```bash
+   SCRIPT_PATH=$(find ~ -name "analyze_clusteroperators.py" -path "*/must-gather/skills/must-gather-analyzer/scripts/*" 2>/dev/null | head -1)
+   
+   if [ -z "$SCRIPT_PATH" ]; then
+       echo "ERROR: Must-gather analysis scripts not found."
+       echo "Please ensure the must-gather plugin from ai-helpers is properly installed."
+       exit 1
+   fi
+   
+   # All scripts are in the same directory, so just get the directory
+   SCRIPTS_DIR=$(dirname "$SCRIPT_PATH")
+   ```
+
+2. If scripts cannot be found, STOP and report to the user:
+   ```
+   The must-gather analysis scripts could not be located. Please ensure the must-gather plugin from openshift-eng/ai-helpers is properly installed in your Claude Code plugins directory.
+   ```
+
+## Implementation
+
+The command performs the following steps:
+
+1. **Validate Must-Gather Path**:
+   - If path not provided as argument, ask the user
+   - Check if path contains `cluster-scoped-resources/` and `namespaces/` directories
+   - If user provides root directory, automatically find the correct subdirectory
+   - Verify the path exists and is readable
+
+2. **Determine Analysis Scope**:
+
+   **STEP 1: Check for SPECIFIC component keywords**
+
+   If the user mentions a specific component, run ONLY that script:
+   - "pods", "pod status", "containers", "crashloop", "failing pods" → `analyze_pods.py` ONLY
+   - "etcd", "etcd health", "quorum" → `analyze_etcd.py` ONLY
+   - "network", "networking", "ovn", "connectivity" → `analyze_network.py` ONLY
+   - "ovn databases", "ovn-dbs", "ovn db", "logical switches", "acls" → `analyze_ovn_dbs.py` ONLY
+   - "nodes", "node status", "node conditions" → `analyze_nodes.py` ONLY
+   - "operators", "cluster operators", "degraded" → `analyze_clusteroperators.py` ONLY
+   - "version", "cluster version", "update", "upgrade" → `analyze_clusterversion.py` ONLY
+   - "events", "warnings", "errors" → `analyze_events.py` ONLY
+   - "storage", "pv", "pvc", "volumes", "persistent" → `analyze_pvs.py` ONLY
+   - "alerts", "prometheus", "monitoring" → `analyze_prometheus.py` ONLY
+
+   **STEP 2: No specific component mentioned**
+
+   If generic request like "analyze must-gather", "/must-gather:analyze", or "check the cluster", run ALL scripts in this order:
+   1. ClusterVersion (`analyze_clusterversion.py`)
+   2. Cluster Operators (`analyze_clusteroperators.py`)
+   3. Nodes (`analyze_nodes.py`)
+   4. Pods - problems only (`analyze_pods.py --problems-only`)
+   5. Network (`analyze_network.py`)
+   6. Events - warnings only (`analyze_events.py --type Warning --count 50`)
+   7. etcd (`analyze_etcd.py`)
+   8. Storage (`analyze_pvs.py`)
+   9. Monitoring (`analyze_prometheus.py`)
+
+3. **Locate Plugin Scripts**:
+   - Use the script availability check from the Error Handling section to find the plugin root
+   - Store the scripts directory path in `$SCRIPTS_DIR`
+
+4. **Execute Analysis Scripts**:
+   ```bash
+   python3 "$SCRIPTS_DIR/<script>.py" <must-gather-path>
+   ```
+   
+   Example:
+   ```bash
+   python3 "$SCRIPTS_DIR/analyze_clusteroperators.py" ./must-gather.local.123/quay-io-...
+   ```
+
+5. **Synthesize Results**: Generate findings and recommendations based on script output
+
+## Return Value
+
+The command outputs structured analysis results to stdout:
+
+**For Component-Specific Analysis:**
+- Script output for the requested component only
+- Focused findings and recommendations
+
+**For Full Analysis:**
+- Organized sections for each component
+- Executive summary of overall cluster health
+- Prioritized list of critical issues
+- Actionable recommendations
+- Suggested log files to review
+
+## Output Structure
+
+```
+================================================================================
+MUST-GATHER ANALYSIS SUMMARY
+================================================================================
+
+[Script outputs organized by component]
+
+CLUSTER VERSION:
+[output from analyze_clusterversion.py]
+
+CLUSTER OPERATORS:
+[output from analyze_clusteroperators.py]
+
+NODES:
+[output from analyze_nodes.py]
+
+PROBLEMATIC PODS:
+[output from analyze_pods.py --problems-only]
+
+NETWORK STATUS:
+[output from analyze_network.py]
+
+WARNING EVENTS (Last 50):
+[output from analyze_events.py --type Warning --count 50]
+
+ETCD CLUSTER HEALTH:
+[output from analyze_etcd.py]
+
+STORAGE (PVs/PVCs):
+[output from analyze_pvs.py]
+
+MONITORING (Alerts):
+[output from analyze_prometheus.py]
+
+================================================================================
+FINDINGS AND RECOMMENDATIONS
+================================================================================
+
+Critical Issues:
+- [Critical problems requiring immediate attention]
+
+Warnings:
+- [Potential issues or degraded components]
+
+Recommendations:
+- [Specific next steps for investigation]
+
+Logs to Review:
+- [Specific log files to examine based on findings]
+```
+
+## Examples
+
+1. **Full cluster analysis**:
+   ```
+   /must-gather:analyze ./must-gather/registry-ci-openshift-org-origin-4-20-...-sha256-abc123/
+   ```
+   Runs all analysis scripts and provides comprehensive cluster diagnostics.
+
+2. **Analyze pod issues only**:
+   ```
+   /must-gather:analyze ./must-gather/registry-ci-openshift-org-origin-4-20-...-sha256-abc123/ analyze the pod statuses
+   ```
+   Runs only `analyze_pods.py` to focus on pod-related issues.
+
+3. **Check etcd health**:
+   ```
+   /must-gather:analyze check etcd health
+   ```
+   Asks for must-gather path, then runs only `analyze_etcd.py`.
+
+4. **Network troubleshooting**:
+   ```
+   /must-gather:analyze ./must-gather/registry-ci-openshift-org-origin-4-20-...-sha256-abc123/ show me network issues
+   ```
+   Runs only `analyze_network.py` for network-specific analysis.
+
+## Notes
+
+- **Must-Gather Path**: Always use the subdirectory containing `cluster-scoped-resources/` and `namespaces/`, not the parent directory
+- **Script Dependencies**: Analysis scripts must be executable and have required Python dependencies installed
+- **Error Handling**: If scripts are not found or must-gather path is invalid, clear error messages are displayed
+- **Cross-Referencing**: The analysis attempts to correlate issues across components (e.g., degraded operator → failing pods)
+- **Pattern Detection**: Identifies patterns like multiple pod failures on the same node
+- **Actionable Output**: Focuses on insights and recommendations rather than raw data dumps
+- **Priority**: Issues are prioritized by severity (Critical > Warning > Info)
+
+## Arguments
+
+- **$1** (must-gather-path): Optional. Path to the must-gather directory (the subdirectory with the hash name). If not provided, the user will be asked.
+- **$2+** (component): Optional. If keywords for a specific component are detected, only that component's analysis script will run. Otherwise, all scripts run.