Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:46:06 +08:00
commit cf9da06850
16 changed files with 3315 additions and 0 deletions

262
commands/analyze.md Normal file
View File

@@ -0,0 +1,262 @@
---
description: Quick analysis of must-gather data - runs all analysis scripts and provides comprehensive cluster diagnostics
argument-hint: [must-gather-path] [component]
---
## Name
must-gather:analyze
## Synopsis
```
/must-gather:analyze [must-gather-path] [component]
```
## Description
The `analyze` command performs comprehensive analysis of OpenShift must-gather diagnostic data. It runs specialized Python analysis scripts to extract and summarize cluster health information across multiple components.
The command can analyze:
- Cluster version and update status
- Cluster operator health (degraded, progressing, unavailable)
- Node conditions and resource status
- Pod failures, restarts, and crash loops
- Network configuration and OVN health
- OVN databases - logical topology, ACLs, pods
- Kubernetes events (warnings and errors)
- etcd cluster health and quorum status
- Persistent volume and claim status
- Prometheus alerts
You can request analysis of the entire cluster or focus on a specific component.
## Prerequisites
**Required Directory Structure:**
Must-gather data typically has this structure:
```
must-gather/
└── registry-ci-openshift-org-origin-...-sha256-<hash>/
├── cluster-scoped-resources/
├── namespaces/
└── ...
```
The actual must-gather directory is the subdirectory with the hash name, not the parent directory.
**Required Scripts:**
Analysis scripts are bundled with this plugin at:
```
<plugin-root>/skills/must-gather-analyzer/scripts/
├── analyze_clusterversion.py
├── analyze_clusteroperators.py
├── analyze_nodes.py
├── analyze_pods.py
├── analyze_network.py
├── analyze_ovn_dbs.py
├── analyze_events.py
├── analyze_etcd.py
└── analyze_pvs.py
```
Where `<plugin-root>` is the directory where this plugin is installed (typically `~/.cursor/commands/ai-helpers/plugins/must-gather/` or similar).
## Error Handling
**CRITICAL: Script-Only Analysis**
- **NEVER** attempt to analyze must-gather data directly using bash commands, grep, or manual file reading
- **ONLY** use the provided Python scripts in `plugins/must-gather/skills/must-gather-analyzer/scripts/`
- If scripts are missing or not found:
1. Stop immediately
2. Inform the user that the analysis scripts are not available
3. Ask the user to ensure the scripts are installed at the correct path
4. Do NOT attempt alternative approaches
**Script Availability Check:**
Before running any analysis:
1. Locate the scripts directory by searching for a known script:
```bash
SCRIPT_PATH=$(find ~ -name "analyze_clusteroperators.py" -path "*/must-gather/skills/must-gather-analyzer/scripts/*" 2>/dev/null | head -1)
if [ -z "$SCRIPT_PATH" ]; then
echo "ERROR: Must-gather analysis scripts not found."
echo "Please ensure the must-gather plugin from ai-helpers is properly installed."
exit 1
fi
# All scripts are in the same directory, so just get the directory
SCRIPTS_DIR=$(dirname "$SCRIPT_PATH")
```
2. If scripts cannot be found, STOP and report to the user:
```
The must-gather analysis scripts could not be located. Please ensure the must-gather plugin from openshift-eng/ai-helpers is properly installed in your Claude Code plugins directory.
```
## Implementation
The command performs the following steps:
1. **Validate Must-Gather Path**:
- If path not provided as argument, ask the user
- Check if path contains `cluster-scoped-resources/` and `namespaces/` directories
- If user provides root directory, automatically find the correct subdirectory
- Verify the path exists and is readable
2. **Determine Analysis Scope**:
**STEP 1: Check for SPECIFIC component keywords**
If the user mentions a specific component, run ONLY that script:
- "pods", "pod status", "containers", "crashloop", "failing pods" → `analyze_pods.py` ONLY
- "etcd", "etcd health", "quorum" → `analyze_etcd.py` ONLY
- "network", "networking", "ovn", "connectivity" → `analyze_network.py` ONLY
- "ovn databases", "ovn-dbs", "ovn db", "logical switches", "acls" → `analyze_ovn_dbs.py` ONLY
- "nodes", "node status", "node conditions" → `analyze_nodes.py` ONLY
- "operators", "cluster operators", "degraded" → `analyze_clusteroperators.py` ONLY
- "version", "cluster version", "update", "upgrade" → `analyze_clusterversion.py` ONLY
- "events", "warnings", "errors" → `analyze_events.py` ONLY
- "storage", "pv", "pvc", "volumes", "persistent" → `analyze_pvs.py` ONLY
- "alerts", "prometheus", "monitoring" → `analyze_prometheus.py` ONLY
**STEP 2: No specific component mentioned**
If generic request like "analyze must-gather", "/must-gather:analyze", or "check the cluster", run ALL scripts in this order:
1. ClusterVersion (`analyze_clusterversion.py`)
2. Cluster Operators (`analyze_clusteroperators.py`)
3. Nodes (`analyze_nodes.py`)
4. Pods - problems only (`analyze_pods.py --problems-only`)
5. Network (`analyze_network.py`)
6. Events - warnings only (`analyze_events.py --type Warning --count 50`)
7. etcd (`analyze_etcd.py`)
8. Storage (`analyze_pvs.py`)
9. Monitoring (`analyze_prometheus.py`)
3. **Locate Plugin Scripts**:
- Use the script availability check from the Error Handling section to find the plugin root
- Store the scripts directory path in `$SCRIPTS_DIR`
4. **Execute Analysis Scripts**:
```bash
python3 "$SCRIPTS_DIR/<script>.py" <must-gather-path>
```
Example:
```bash
python3 "$SCRIPTS_DIR/analyze_clusteroperators.py" ./must-gather.local.123/quay-io-...
```
5. **Synthesize Results**: Generate findings and recommendations based on script output
## Return Value
The command outputs structured analysis results to stdout:
**For Component-Specific Analysis:**
- Script output for the requested component only
- Focused findings and recommendations
**For Full Analysis:**
- Organized sections for each component
- Executive summary of overall cluster health
- Prioritized list of critical issues
- Actionable recommendations
- Suggested log files to review
## Output Structure
```
================================================================================
MUST-GATHER ANALYSIS SUMMARY
================================================================================
[Script outputs organized by component]
CLUSTER VERSION:
[output from analyze_clusterversion.py]
CLUSTER OPERATORS:
[output from analyze_clusteroperators.py]
NODES:
[output from analyze_nodes.py]
PROBLEMATIC PODS:
[output from analyze_pods.py --problems-only]
NETWORK STATUS:
[output from analyze_network.py]
WARNING EVENTS (Last 50):
[output from analyze_events.py --type Warning --count 50]
ETCD CLUSTER HEALTH:
[output from analyze_etcd.py]
STORAGE (PVs/PVCs):
[output from analyze_pvs.py]
MONITORING (Alerts):
[output from analyze_prometheus.py]
================================================================================
FINDINGS AND RECOMMENDATIONS
================================================================================
Critical Issues:
- [Critical problems requiring immediate attention]
Warnings:
- [Potential issues or degraded components]
Recommendations:
- [Specific next steps for investigation]
Logs to Review:
- [Specific log files to examine based on findings]
```
## Examples
1. **Full cluster analysis**:
```
/must-gather:analyze ./must-gather/registry-ci-openshift-org-origin-4-20-...-sha256-abc123/
```
Runs all analysis scripts and provides comprehensive cluster diagnostics.
2. **Analyze pod issues only**:
```
/must-gather:analyze ./must-gather/registry-ci-openshift-org-origin-4-20-...-sha256-abc123/ analyze the pod statuses
```
Runs only `analyze_pods.py` to focus on pod-related issues.
3. **Check etcd health**:
```
/must-gather:analyze check etcd health
```
Asks for must-gather path, then runs only `analyze_etcd.py`.
4. **Network troubleshooting**:
```
/must-gather:analyze ./must-gather/registry-ci-openshift-org-origin-4-20-...-sha256-abc123/ show me network issues
```
Runs only `analyze_network.py` for network-specific analysis.
## Notes
- **Must-Gather Path**: Always use the subdirectory containing `cluster-scoped-resources/` and `namespaces/`, not the parent directory
- **Script Dependencies**: Analysis scripts must be executable and have required Python dependencies installed
- **Error Handling**: If scripts are not found or must-gather path is invalid, clear error messages are displayed
- **Cross-Referencing**: The analysis attempts to correlate issues across components (e.g., degraded operator → failing pods)
- **Pattern Detection**: Identifies patterns like multiple pod failures on the same node
- **Actionable Output**: Focuses on insights and recommendations rather than raw data dumps
- **Priority**: Issues are prioritized by severity (Critical > Warning > Info)
## Arguments
- **$1** (must-gather-path): Optional. Path to the must-gather directory (the subdirectory with the hash name). If not provided, the user will be asked.
- **$2+** (component): Optional. If keywords for a specific component are detected, only that component's analysis script will run. Otherwise, all scripts run.

266
commands/ovn-dbs.md Normal file
View File

@@ -0,0 +1,266 @@
---
description: Analyze OVN databases from a must-gather using ovsdb-tool
argument-hint: [must-gather-path]
---
## Name
must-gather:ovn-dbs
## Synopsis
```
/must-gather:ovn-dbs [must-gather-path] [--node <node-name>] [--query <json>]
```
## Description
The `ovn-dbs` command analyzes OVN Northbound and Southbound databases collected from clusters. It uses `ovsdb-tool` to query the binary database files (`.db`) collected per-node, providing detailed information about the logical network topology, pods, ACLs, and routers on each node.
The command automatically maps ovnkube pods to their corresponding nodes by reading pod specifications from the must-gather data.
**Two modes of operation:**
1. **Standard Analysis** (default): Runs pre-built analysis showing switches, ports, ACLs, and routers
2. **Query Mode** (`--query`): Run custom OVSDB JSON queries for specific data extraction
**What it analyzes:**
- **Per-zone logical network topology**
- **Logical Switches** and their ports
- **Pod Logical Switch Ports** with namespace, pod name, and IP addresses
- **Access Control Lists (ACLs)** with priorities, directions, and match rules
- **Logical Routers** and their ports
**Important:** This command only works with must-gathers from clusters, where each node/zone has its own database files.
## Prerequisites
The must-gather should contain:
```
network_logs/
└── ovnk_database_store.tar.gz
```
**Required Tools:**
- `ovsdb-tool` must be installed (from openvswitch package)
- Check with: `which ovsdb-tool`
- Install: `sudo dnf install openvswitch` or `sudo apt install openvswitch-common`
**Analysis Script:**
The script is bundled with this plugin:
```
<plugin-root>/skills/must-gather-analyzer/scripts/analyze_ovn_dbs.py
```
Where `<plugin-root>` is the directory where this plugin is installed (typically `~/.cursor/commands/ai-helpers/plugins/must-gather/` or similar).
Claude will automatically locate it by searching for the script in the plugin installation directory, regardless of your current working directory.
## Implementation
The command performs the following steps:
1. **Locate Analysis Script**:
```bash
SCRIPT_PATH=$(find ~ -name "analyze_ovn_dbs.py" -path "*/must-gather/skills/must-gather-analyzer/scripts/*" 2>/dev/null | head -1)
if [ -z "$SCRIPT_PATH" ]; then
echo "ERROR: analyze_ovn_dbs.py script not found."
echo "Please ensure the must-gather plugin from ai-helpers is properly installed."
exit 1
fi
SCRIPTS_DIR=$(dirname "$SCRIPT_PATH")
```
2. **Extract Database Tarball**:
- Locate `network_logs/ovnk_database_store.tar.gz`
- Extract if not already extracted
- Find all `*_nbdb` and `*_sbdb` files
3. **Query Each Zone's Database**:
For each zone (node), query the Northbound database using `ovsdb-tool query`:
```bash
ovsdb-tool query <zone>_nbdb '["OVN_Northbound", {"op":"select", "table":"<table>", "where":[], "columns":[...]}]'
```
4. **Analyze and Display**:
- **Logical Switches**: Names and port counts
- **Logical Switch Ports**: Filter for pods (external_ids.pod=true), show namespace, pod name, and IP
- **ACLs**: Priority, direction, match rules, and actions
- **Logical Routers**: Names and port counts
5. **Present Zone Summary**:
- Total counts per zone
- Detailed breakdowns
- Sorted and formatted output
## Return Value
The command outputs structured analysis for each node:
```
Found 6 node(s)
================================================================================
Node: ip-10-0-26-145.us-east-2.compute.internal
Pod: ovnkube-node-79cbh
================================================================================
Logical Switches: 4
Logical Switch Ports: 55
ACLs: 7
Logical Routers: 2
LOGICAL SWITCHES (4):
NAME PORTS
--------------------------------------------------------------------------------
transit_switch 6
ip-10-0-1-10.us-east-2.compute.internal 7
ext_ip-10-0-1-10.us-east-2.compute.internal 2
join 2
POD LOGICAL SWITCH PORTS (5):
NAMESPACE POD IP
------------------------------------------------------------------------------------------------------------------------
openshift-dns dns-default-abc123 10.128.0.5
openshift-monitoring prometheus-k8s-0 10.128.0.10
openshift-etcd etcd-master-0 10.128.0.3
...
ACCESS CONTROL LISTS (7):
PRIORITY DIRECTION ACTION MATCH
------------------------------------------------------------------------------------------------------------------------
1012 from-lport allow inport == @a4743249366342378346 && (ip4.mcast ...
1011 to-lport drop (ip4.mcast || mldv1 || mldv2 || ...
1001 to-lport allow-related ip4.src==10.128.0.2
...
LOGICAL ROUTERS (2):
NAME PORTS
--------------------------------------------------------------------------------
ovn_cluster_router 3
GR_ip-10-0-1-10.us-east-2.compute.internal 2
```
## Examples
1. **Analyze all nodes in a must-gather**:
```
/must-gather:ovn-dbs ./must-gather/registry-ci-openshift-org-origin-4-20-...-sha256-abc123/
```
Shows logical network topology for all nodes.
2. **Analyze specific node**:
```
/must-gather:ovn-dbs ./must-gather/.../ --node ip-10-0-26-145
```
Shows OVN database information only for the specified node (supports partial name matching).
3. **Analyze master node**:
```
/must-gather:ovn-dbs ./must-gather/.../ --node master-0
```
Filter to a specific master node using partial name matching.
4. **Interactive usage without path**:
```
/must-gather:ovn-dbs
```
The command will ask for the must-gather path.
5. **Check if pod exists in OVN**:
```
/must-gather:ovn-dbs ./must-gather/.../
```
Then search the output for the pod name to see which node it's on and its IP allocation.
6. **Investigate ACL rules on a specific node**:
```
/must-gather:ovn-dbs ./must-gather/.../ --node worker-1
```
Review the ACL section for a specific node to understand traffic filtering rules.
7. **Run custom OVSDB query** (Query Mode):
```
/must-gather:ovn-dbs ./must-gather/.../ --query '["OVN_Northbound", {"op":"select", "table":"ACL", "where":[["priority", ">", 1000]], "columns":["priority","match","action"]}]'
```
Query ACLs with priority > 1000 across all nodes. Claude can construct the JSON query for any OVSDB table.
8. **Query specific node with custom query**:
```
/must-gather:ovn-dbs ./must-gather/.../ --node master-0 --query '["OVN_Northbound", {"op":"select", "table":"Logical_Switch", "where":[], "columns":["name","ports"]}]'
```
List all logical switches with their ports on master-0.
9. **Query specific table** (Claude constructs JSON):
Just ask Claude to query a specific OVSDB table and it will construct the appropriate JSON query. For example:
- "Show all Logical_Router_Static_Route entries"
- "Find ACLs with action 'drop'"
- "List Logical_Switch_Port entries where external_ids contains 'openshift-etcd'"
## Error Handling
**Missing ovsdb-tool:**
```
Error: ovsdb-tool not found. Please install openvswitch package.
```
Solution: Install openvswitch: `sudo dnf install openvswitch`
**Missing database tarball:**
```
Error: Database tarball not found: network_logs/ovnk_database_store.tar.gz
```
Solution: Ensure this is a must-gather from an OVN cluster.
**Node not found:**
```
Error: No databases found for node matching 'master-5'
Available nodes:
- ip-10-0-77-117.us-east-2.compute.internal
- ip-10-0-26-145.us-east-2.compute.internal
- ip-10-0-1-194.us-east-2.compute.internal
```
Solution: Use one of the listed node names or a partial match.
## Notes
- **Binary Database Format**: Uses `ovsdb-tool` to read OVSDB binary files directly
- **Per-Node Analysis**: Each node in IC mode has its own database (one NB and one SB per zone)
- **Node Mapping**: Automatically correlates ovnkube pods to nodes by reading pod specs from must-gather
- **Pod Discovery**: Pods are identified by `external_ids` with `pod=true`
- **IP Extraction**: Pod IPs are parsed from the `addresses` field (format: "MAC IP")
- **ACL Priorities**: Higher priority ACLs are processed first (shown at top)
- **Node Filtering**: Supports partial name matching for convenience (e.g., "--node master" matches all masters)
- **Query Mode**: Accepts raw OVSDB JSON queries in the format `["OVN_Northbound", {"op":"select", "table":"...", ...}]`
- **Claude Query Construction**: Claude can automatically construct OVSDB JSON queries based on natural language requests
- **Performance**: Querying large databases may take a few seconds per node
## Use Cases
1. **Verify Pod Network Configuration**:
- Check if pods are registered in OVN
- Verify IP address assignments
- Confirm logical switch port creation
2. **Troubleshoot Connectivity Issues**:
- Review ACL rules blocking traffic
- Check if pods are in correct logical switches
- Verify router configurations
3. **Understand Topology**:
- See how zones are interconnected via transit_switch
- Review gateway router configurations
- Understand logical network structure
4. **Audit Network Policies**:
- See ACL rules generated from NetworkPolicies
- Identify overly permissive or restrictive rules
- Check rule priorities and match conditions
## Arguments
- **$1** (must-gather-path): Optional. Path to the must-gather directory containing network_logs/. If not provided, user will be prompted.
- **--node, -n** (node-name): Optional. Filter analysis to a specific node. Supports partial name matching (e.g., "master-0", "ip-10-0-26-145"). If no match is found, displays list of available nodes.
- **--query, -q** (json-query): Optional. Run a raw OVSDB JSON query instead of standard analysis. Claude can construct the JSON query based on OVSDB transaction format. When provided, outputs raw JSON results instead of formatted analysis.