commands/analyze.md

---
description: Quick analysis of must-gather data - runs all analysis scripts and provides comprehensive cluster diagnostics
argument-hint: [must-gather-path] [component]
---

## Name
must-gather:analyze

## Synopsis
```
/must-gather:analyze [must-gather-path] [component]
```

## Description

The `analyze` command performs comprehensive analysis of OpenShift must-gather diagnostic data. It runs specialized Python analysis scripts to extract and summarize cluster health information across multiple components.

The command can analyze:
- Cluster version and update status
- Cluster operator health (degraded, progressing, unavailable)
- Node conditions and resource status
- Pod failures, restarts, and crash loops
- Network configuration and OVN health
- OVN databases - logical topology, ACLs, pods
- Kubernetes events (warnings and errors)
- etcd cluster health and quorum status
- Persistent volume and claim status
- Prometheus alerts

You can request analysis of the entire cluster or focus on a specific component.

## Prerequisites

**Required Directory Structure:**

Must-gather data typically has this structure:
```
must-gather/
└── registry-ci-openshift-org-origin-...-sha256-<hash>/
    ├── cluster-scoped-resources/
    ├── namespaces/
    └── ...
```

The actual must-gather directory is the subdirectory with the hash name, not the parent directory.
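
The following Python sketch shows one way to resolve a user-supplied path to this hash-named subdirectory. It is a hypothetical helper and is not part of the bundled scripts. The function name `find_must_gather_root` and the rule of picking the first match when several image directories exist are both assumptions.

```python
from pathlib import Path

# Directories that identify the real must-gather root (from Prerequisites above).
REQUIRED_DIRS = ("cluster-scoped-resources", "namespaces")


def is_must_gather_root(path: Path) -> bool:
    """Return True if `path` directly contains every required directory."""
    return all((path / name).is_dir() for name in REQUIRED_DIRS)


def find_must_gather_root(path: str) -> Path:
    """Resolve either the hash-named subdirectory or its parent to the subdirectory.

    Raises FileNotFoundError when no suitable directory exists.
    """
    base = Path(path).expanduser()
    if not base.is_dir():
        raise FileNotFoundError(f"Not a directory: {base}")
    if is_must_gather_root(base):
        return base
    # Look one level down for the hash-named subdirectory.
    candidates = sorted(p for p in base.iterdir() if p.is_dir() and is_must_gather_root(p))
    if not candidates:
        raise FileNotFoundError(
            f"No subdirectory of {base} contains {' and '.join(REQUIRED_DIRS)}/"
        )
    # Hypothetical policy: with several gathered images, take the first in sorted order.
    return candidates[0]
```

If the user passes `./must-gather/`, the helper returns the `registry-ci-...-sha256-<hash>/` child. If the user passes that child directly, the helper returns it unchanged.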

**Required Scripts:**

Analysis scripts are bundled with this plugin at:
```
<plugin-root>/skills/must-gather-analyzer/scripts/
├── analyze_clusterversion.py
├── analyze_clusteroperators.py
├── analyze_nodes.py
├── analyze_pods.py
├── analyze_network.py
├── analyze_ovn_dbs.py
├── analyze_events.py
├── analyze_etcd.py
├── analyze_pvs.py
└── analyze_prometheus.py
```

Here, `<plugin-root>` is the directory where this plugin is installed (typically `~/.cursor/commands/ai-helpers/plugins/must-gather/` or similar).

## Error Handling

**CRITICAL: Script-Only Analysis**

- **NEVER** attempt to analyze must-gather data directly using bash commands, grep, or manual file reading
- **ONLY** use the provided Python scripts in `plugins/must-gather/skills/must-gather-analyzer/scripts/`
- If scripts are missing or not found:
  1. Stop immediately
  2. Inform the user that the analysis scripts are not available
  3. Ask the user to ensure the scripts are installed at the correct path
  4. Do NOT attempt alternative approaches

**Script Availability Check:**

Before running any analysis:

1. Locate the scripts directory by searching for a known script:
   ```bash
   SCRIPT_PATH=$(find ~ -name "analyze_clusteroperators.py" -path "*/must-gather/skills/must-gather-analyzer/scripts/*" 2>/dev/null | head -1)

   if [ -z "$SCRIPT_PATH" ]; then
       echo "ERROR: Must-gather analysis scripts not found."
       echo "Please ensure the must-gather plugin from ai-helpers is properly installed."
       exit 1
   fi

   # All scripts are in the same directory, so just get the directory
   SCRIPTS_DIR=$(dirname "$SCRIPT_PATH")
   ```

2. If scripts cannot be found, STOP and report to the user:
   ```
   The must-gather analysis scripts could not be located. Please ensure the must-gather plugin from openshift-eng/ai-helpers is properly installed in your Claude Code plugins directory.
   ```
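
The bash check above confirms that one known script exists. The sketch below goes further and verifies that every script this command may call is present in the located directory, so a partial installation is reported up front instead of failing mid-run. The script names come from this document. The helper names `missing_scripts` and `check_scripts` are hypothetical.

```python
from pathlib import Path

# Every script referenced by this command (Required Scripts and Implementation).
EXPECTED_SCRIPTS = (
    "analyze_clusterversion.py",
    "analyze_clusteroperators.py",
    "analyze_nodes.py",
    "analyze_pods.py",
    "analyze_network.py",
    "analyze_ovn_dbs.py",
    "analyze_events.py",
    "analyze_etcd.py",
    "analyze_pvs.py",
    "analyze_prometheus.py",
)


def missing_scripts(scripts_dir: str) -> list:
    """Return the expected script names that are not regular files in scripts_dir."""
    base = Path(scripts_dir)
    return [name for name in EXPECTED_SCRIPTS if not (base / name).is_file()]


def check_scripts(scripts_dir: str) -> None:
    """Stop with a user-facing message instead of falling back to manual analysis."""
    missing = missing_scripts(scripts_dir)
    if missing:
        raise SystemExit(
            "The must-gather analysis scripts could not be located "
            f"(missing: {', '.join(missing)}). Please ensure the must-gather plugin "
            "from openshift-eng/ai-helpers is properly installed."
        )
```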

## Implementation

The command performs the following steps:

1. **Validate Must-Gather Path**:
   - If the path is not provided as an argument, ask the user for it
   - Check that the path contains `cluster-scoped-resources/` and `namespaces/` directories
   - If the user provides the root directory, automatically find the correct subdirectory
   - Verify that the path exists and is readable

2. **Determine Analysis Scope**:

   **STEP 1: Check for SPECIFIC component keywords**

   If the user mentions a specific component, run ONLY that script:
   - "pods", "pod status", "containers", "crashloop", "failing pods" → `analyze_pods.py` ONLY
   - "etcd", "etcd health", "quorum" → `analyze_etcd.py` ONLY
   - "network", "networking", "ovn", "connectivity" → `analyze_network.py` ONLY
   - "ovn databases", "ovn-dbs", "ovn db", "logical switches", "acls" → `analyze_ovn_dbs.py` ONLY
   - "nodes", "node status", "node conditions" → `analyze_nodes.py` ONLY
   - "operators", "cluster operators", "degraded" → `analyze_clusteroperators.py` ONLY
   - "version", "cluster version", "update", "upgrade" → `analyze_clusterversion.py` ONLY
   - "events", "warnings", "errors" → `analyze_events.py` ONLY
   - "storage", "pv", "pvc", "volumes", "persistent" → `analyze_pvs.py` ONLY
   - "alerts", "prometheus", "monitoring" → `analyze_prometheus.py` ONLY

   **STEP 2: No specific component mentioned**

   If the request is generic, such as "analyze must-gather", "/must-gather:analyze", or "check the cluster", run ALL scripts in this order:
   1. ClusterVersion (`analyze_clusterversion.py`)
   2. Cluster Operators (`analyze_clusteroperators.py`)
   3. Nodes (`analyze_nodes.py`)
   4. Pods - problems only (`analyze_pods.py --problems-only`)
   5. Network (`analyze_network.py`)
   6. Events - warnings only (`analyze_events.py --type Warning --count 50`)
   7. etcd (`analyze_etcd.py`)
   8. Storage (`analyze_pvs.py`)
   9. Monitoring (`analyze_prometheus.py`)

3. **Locate Plugin Scripts**:
   - Use the script availability check from the Error Handling section to find the plugin root
   - Store the scripts directory path in `$SCRIPTS_DIR`

4. **Execute Analysis Scripts**:
   ```bash
   python3 "$SCRIPTS_DIR/<script>.py" <must-gather-path>
   ```

   Example:
   ```bash
   python3 "$SCRIPTS_DIR/analyze_clusteroperators.py" ./must-gather.local.123/quay-io-...
   ```

5. **Synthesize Results**: Generate findings and recommendations based on the script output
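
STEP 1 and STEP 2 together act as a keyword router. The sketch below shows one hypothetical way to express it. The keyword lists and the full-run order are copied from the steps above. The matching strategy is an assumption: matching is case-insensitive, each keyword is anchored at a word start, and the first rule that matches wins. Rule order matters because "ovn databases" also contains the generic network keyword "ovn", so the OVN database rule is checked first.

```python
import re

# (keywords, script) pairs from STEP 1; the OVN database rule is moved ahead of
# the network rule so that "ovn databases" is not captured by the bare "ovn".
COMPONENT_RULES = [
    (("ovn databases", "ovn-dbs", "ovn db", "logical switches", "acls"), "analyze_ovn_dbs.py"),
    (("pods", "pod status", "containers", "crashloop", "failing pods"), "analyze_pods.py"),
    (("etcd", "etcd health", "quorum"), "analyze_etcd.py"),
    (("network", "networking", "ovn", "connectivity"), "analyze_network.py"),
    (("nodes", "node status", "node conditions"), "analyze_nodes.py"),
    (("operators", "cluster operators", "degraded"), "analyze_clusteroperators.py"),
    (("version", "cluster version", "update", "upgrade"), "analyze_clusterversion.py"),
    (("events", "warnings", "errors"), "analyze_events.py"),
    (("storage", "pv", "pvc", "volumes", "persistent"), "analyze_pvs.py"),
    (("alerts", "prometheus", "monitoring"), "analyze_prometheus.py"),
]

# STEP 2 order for a generic request, including the listed extra arguments.
FULL_RUN = [
    ["analyze_clusterversion.py"],
    ["analyze_clusteroperators.py"],
    ["analyze_nodes.py"],
    ["analyze_pods.py", "--problems-only"],
    ["analyze_network.py"],
    ["analyze_events.py", "--type", "Warning", "--count", "50"],
    ["analyze_etcd.py"],
    ["analyze_pvs.py"],
    ["analyze_prometheus.py"],
]


def _mentions(text: str, keyword: str) -> bool:
    # Anchor only the start of the keyword: "pv" must not match "ipv4",
    # while "pod status" should still match "pod statuses".
    return re.search(r"\b" + re.escape(keyword), text) is not None


def select_scripts(request: str) -> list:
    """Return the script invocations (script name plus arguments) for a request."""
    text = request.lower()
    for keywords, script in COMPONENT_RULES:
        if any(_mentions(text, kw) for kw in keywords):
            return [[script]]
    return FULL_RUN
```

With these rules, the requests from the Examples section route as documented. "analyze the pod statuses" selects `analyze_pods.py`, "check etcd health" selects `analyze_etcd.py`, and a bare "analyze must-gather" falls through to the full STEP 2 run.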

## Return Value

The command outputs structured analysis results to stdout:

**For Component-Specific Analysis:**
- Script output for the requested component only
- Focused findings and recommendations

**For Full Analysis:**
- Organized sections for each component
- Executive summary of overall cluster health
- Prioritized list of critical issues
- Actionable recommendations
- Suggested log files to review

## Output Structure

```
================================================================================
MUST-GATHER ANALYSIS SUMMARY
================================================================================

[Script outputs organized by component]

CLUSTER VERSION:
[output from analyze_clusterversion.py]

CLUSTER OPERATORS:
[output from analyze_clusteroperators.py]

NODES:
[output from analyze_nodes.py]

PROBLEMATIC PODS:
[output from analyze_pods.py --problems-only]

NETWORK STATUS:
[output from analyze_network.py]

WARNING EVENTS (Last 50):
[output from analyze_events.py --type Warning --count 50]

ETCD CLUSTER HEALTH:
[output from analyze_etcd.py]

STORAGE (PVs/PVCs):
[output from analyze_pvs.py]

MONITORING (Alerts):
[output from analyze_prometheus.py]

================================================================================
FINDINGS AND RECOMMENDATIONS
================================================================================

Critical Issues:
- [Critical problems requiring immediate attention]

Warnings:
- [Potential issues or degraded components]

Recommendations:
- [Specific next steps for investigation]

Logs to Review:
- [Specific log files to examine based on findings]
```
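
The script-output half of this layout can be assembled mechanically. The sketch below makes three assumptions that are not documented here: each script takes the must-gather path as its first positional argument, options such as `--problems-only` follow that argument, and the current interpreter (`sys.executable`) stands in for `python3`. `run_script` and `build_report` are hypothetical names.

```python
import subprocess
import sys
from pathlib import Path

RULE = "=" * 80

# (heading, script, extra arguments) in the order of the Output Structure.
SECTIONS = [
    ("CLUSTER VERSION", "analyze_clusterversion.py", []),
    ("CLUSTER OPERATORS", "analyze_clusteroperators.py", []),
    ("NODES", "analyze_nodes.py", []),
    ("PROBLEMATIC PODS", "analyze_pods.py", ["--problems-only"]),
    ("NETWORK STATUS", "analyze_network.py", []),
    ("WARNING EVENTS (Last 50)", "analyze_events.py", ["--type", "Warning", "--count", "50"]),
    ("ETCD CLUSTER HEALTH", "analyze_etcd.py", []),
    ("STORAGE (PVs/PVCs)", "analyze_pvs.py", []),
    ("MONITORING (Alerts)", "analyze_prometheus.py", []),
]


def run_script(scripts_dir: str, script: str, must_gather: str, extra: list) -> str:
    """Run one analysis script; return its stdout, plus stderr if it fails."""
    # Assumption: path first, then the script's options.
    cmd = [sys.executable, str(Path(scripts_dir) / script), must_gather, *extra]
    result = subprocess.run(cmd, capture_output=True, text=True)
    output = result.stdout.rstrip()
    if result.returncode != 0:
        output += f"\n[{script} exited with code {result.returncode}]\n{result.stderr.rstrip()}"
    return output


def build_report(scripts_dir: str, must_gather: str) -> str:
    """Place each script's output under its heading from the Output Structure."""
    lines = [RULE, "MUST-GATHER ANALYSIS SUMMARY", RULE, ""]
    for heading, script, extra in SECTIONS:
        lines += [f"{heading}:", run_script(scripts_dir, script, must_gather, extra), ""]
    # FINDINGS AND RECOMMENDATIONS is written after reading these outputs,
    # so it is intentionally not generated here.
    return "\n".join(lines)
```

A failing script is reported inline rather than aborting the whole run, so the remaining sections still appear in the summary.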

## Examples

1. **Full cluster analysis**:
   ```
   /must-gather:analyze ./must-gather/registry-ci-openshift-org-origin-4-20-...-sha256-abc123/
   ```
   Runs all analysis scripts and provides comprehensive cluster diagnostics.

2. **Analyze pod issues only**:
   ```
   /must-gather:analyze ./must-gather/registry-ci-openshift-org-origin-4-20-...-sha256-abc123/ analyze the pod statuses
   ```
   Runs only `analyze_pods.py` to focus on pod-related issues.

3. **Check etcd health**:
   ```
   /must-gather:analyze check etcd health
   ```
   Asks for the must-gather path, then runs only `analyze_etcd.py`.

4. **Network troubleshooting**:
   ```
   /must-gather:analyze ./must-gather/registry-ci-openshift-org-origin-4-20-...-sha256-abc123/ show me network issues
   ```
   Runs only `analyze_network.py` for network-specific analysis.

## Notes

- **Must-Gather Path**: The analysis always runs against the subdirectory containing `cluster-scoped-resources/` and `namespaces/`, not the parent directory (if the parent is given, the correct subdirectory is located automatically)
- **Script Dependencies**: The analysis scripts are run with `python3`, so Python 3 and the scripts' required dependencies must be installed
- **Error Handling**: If the scripts are not found or the must-gather path is invalid, a clear error message is displayed
- **Cross-Referencing**: The analysis attempts to correlate issues across components (e.g., degraded operator → failing pods)
- **Pattern Detection**: Identifies patterns like multiple pod failures on the same node
- **Actionable Output**: Focuses on insights and recommendations rather than raw data dumps
- **Priority**: Issues are prioritized by severity (Critical > Warning > Info)
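
The "multiple pod failures on the same node" pattern amounts to grouping failing pods by node. The sketch below illustrates this on a hypothetical record shape (`namespace`, `name`, `node`, `status`), because the actual output format of `analyze_pods.py` is not specified here. The set of failure states and the threshold of two are also assumptions.

```python
from collections import defaultdict

# Hypothetical statuses treated as failures; adjust to what the pod analysis reports.
FAILED_STATES = {"CrashLoopBackOff", "Error", "Failed", "ImagePullBackOff"}


def failures_by_node(pods, threshold=2):
    """Group failing pods by node and keep nodes with at least `threshold` failures."""
    groups = defaultdict(list)
    for pod in pods:
        if pod["status"] in FAILED_STATES and pod.get("node"):
            groups[pod["node"]].append(f'{pod["namespace"]}/{pod["name"]}')
    # Most-affected nodes first, so a likely node-level problem is reported at the top.
    return dict(
        sorted(
            ((node, names) for node, names in groups.items() if len(names) >= threshold),
            key=lambda item: len(item[1]),
            reverse=True,
        )
    )
```

Pods without a node assignment (for example, unscheduled ones) are skipped, since they cannot point at a node-level cause.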

## Arguments

- **$1** (must-gather-path): Optional. Path to the must-gather directory (the subdirectory with the hash name). If not provided, the user will be asked.
- **$2+** (component): Optional. If keywords for a specific component are detected, only that component's analysis script will run. Otherwise, all scripts run.
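
Because `$1` is optional, a request such as `/must-gather:analyze check etcd health` (Example 3) has no path at all. The sketch below shows one way to separate the two arguments. The heuristic is an assumption, not part of the command specification: the first argument counts as the path only if it names an existing directory. `split_arguments` is a hypothetical name.

```python
import os


def split_arguments(args):
    """Split raw command arguments into (must_gather_path or None, component_text).

    Assumed heuristic: the first argument is the path only if it is an existing
    directory; otherwise every argument is component text and the path is requested.
    """
    if args and os.path.isdir(os.path.expanduser(args[0])):
        return args[0], " ".join(args[1:])
    return None, " ".join(args)
```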

commands/ovn-dbs.md

---
description: Analyze OVN databases from a must-gather using ovsdb-tool
argument-hint: [must-gather-path] [--node <node-name>] [--query <json>]
---

## Name
must-gather:ovn-dbs

## Synopsis
```
/must-gather:ovn-dbs [must-gather-path] [--node <node-name>] [--query <json>]
```

## Description

The `ovn-dbs` command analyzes OVN Northbound and Southbound databases collected from OVN-Kubernetes clusters. It uses `ovsdb-tool` to query the database files (`*_nbdb` and `*_sbdb`) collected per node, providing detailed information about the logical network topology, pods, ACLs, and routers on each node.

The command automatically maps ovnkube pods to their corresponding nodes by reading pod specifications from the must-gather data.

**Two modes of operation:**
1. **Standard Analysis** (default): Runs a pre-built analysis showing switches, ports, ACLs, and routers
2. **Query Mode** (`--query`): Runs custom OVSDB JSON queries for specific data extraction

**What it analyzes:**
- **Per-zone logical network topology**
- **Logical Switches** and their ports
- **Pod Logical Switch Ports** with namespace, pod name, and IP addresses
- **Access Control Lists (ACLs)** with priorities, directions, and match rules
- **Logical Routers** and their ports

**Important:** This command only works with must-gathers from OVN-Kubernetes clusters running in interconnect (IC) mode, where each node/zone has its own database files.

## Prerequisites

The must-gather should contain:
```
network_logs/
└── ovnk_database_store.tar.gz
```

**Required Tools:**

- `ovsdb-tool` must be installed (from the openvswitch package)
  - Check with: `which ovsdb-tool`
  - Install: `sudo dnf install openvswitch` or `sudo apt install openvswitch-common`

**Analysis Script:**

The script is bundled with this plugin:
```
<plugin-root>/skills/must-gather-analyzer/scripts/analyze_ovn_dbs.py
```

Here, `<plugin-root>` is the directory where this plugin is installed (typically `~/.cursor/commands/ai-helpers/plugins/must-gather/` or similar).

Claude locates the script automatically by searching your home directory for it, so it is found regardless of your current working directory.

## Implementation

The command performs the following steps:

1. **Locate Analysis Script**:
   ```bash
   SCRIPT_PATH=$(find ~ -name "analyze_ovn_dbs.py" -path "*/must-gather/skills/must-gather-analyzer/scripts/*" 2>/dev/null | head -1)

   if [ -z "$SCRIPT_PATH" ]; then
       echo "ERROR: analyze_ovn_dbs.py script not found."
       echo "Please ensure the must-gather plugin from ai-helpers is properly installed."
       exit 1
   fi

   SCRIPTS_DIR=$(dirname "$SCRIPT_PATH")
   ```

2. **Extract Database Tarball**:
   - Locate `network_logs/ovnk_database_store.tar.gz`
   - Extract if not already extracted
   - Find all `*_nbdb` and `*_sbdb` files

3. **Query Each Zone's Database**:
   For each zone (node), query the Northbound database using `ovsdb-tool query`:

   ```bash
   ovsdb-tool query <zone>_nbdb '["OVN_Northbound", {"op":"select", "table":"<table>", "where":[], "columns":[...]}]'
   ```

4. **Analyze and Display**:
   - **Logical Switches**: Names and port counts
   - **Logical Switch Ports**: Filter for pods (`external_ids.pod=true`), show namespace, pod name, and IP
   - **ACLs**: Priority, direction, match rules, and actions
   - **Logical Routers**: Names and port counts

5. **Present Zone Summary**:
   - Total counts per zone
   - Detailed breakdowns
   - Sorted and formatted output
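
Step 2 (extract the tarball once, then find the per-zone database files) can be sketched with the standard library. This is not the bundled script's code. The extraction directory name and the grouping of files by their file-name prefix are assumptions. The prefix may be a pod name that still has to be mapped to a node, as described above.

```python
import tarfile
from pathlib import Path

TARBALL = Path("network_logs") / "ovnk_database_store.tar.gz"


def extract_databases(must_gather: str) -> Path:
    """Extract the OVN database tarball next to itself unless already extracted.

    Assumption: the destination directory name is chosen here for illustration.
    """
    tarball = Path(must_gather) / TARBALL
    if not tarball.is_file():
        raise FileNotFoundError(f"Database tarball not found: {TARBALL}")
    dest = tarball.parent / "ovnk_database_store"
    if not dest.is_dir():
        with tarfile.open(tarball, "r:gz") as tar:
            # Use the safe "data" filter where available (Python 3.12+ and backports).
            if hasattr(tarfile, "data_filter"):
                tar.extractall(dest, filter="data")
            else:
                tar.extractall(dest)
    return dest


def find_databases(root: Path) -> dict:
    """Map each file-name prefix to its databases, e.g. {"<prefix>": {"nbdb": Path}}."""
    zones = {}
    for path in sorted(root.rglob("*_[ns]bdb")):
        if path.is_file():
            prefix, _, kind = path.name.rpartition("_")
            zones.setdefault(prefix, {})[kind] = path
    return zones
```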

## Return Value

The command outputs structured analysis for each node:

```
Found 6 node(s)

================================================================================
Node: ip-10-0-26-145.us-east-2.compute.internal
Pod: ovnkube-node-79cbh
================================================================================
Logical Switches: 4
Logical Switch Ports: 55
ACLs: 7
Logical Routers: 2

LOGICAL SWITCHES (4):
NAME                                          PORTS
--------------------------------------------------------------------------------
transit_switch                                6
ip-10-0-1-10.us-east-2.compute.internal       7
ext_ip-10-0-1-10.us-east-2.compute.internal   2
join                                          2

POD LOGICAL SWITCH PORTS (5):
NAMESPACE               POD                     IP
------------------------------------------------------------------------------------------------------------------------
openshift-dns           dns-default-abc123      10.128.0.5
openshift-monitoring    prometheus-k8s-0        10.128.0.10
openshift-etcd          etcd-master-0           10.128.0.3
...

ACCESS CONTROL LISTS (7):
PRIORITY  DIRECTION   ACTION         MATCH
------------------------------------------------------------------------------------------------------------------------
1012      from-lport  allow          inport == @a4743249366342378346 && (ip4.mcast ...
1011      to-lport    drop           (ip4.mcast || mldv1 || mldv2 || ...
1001      to-lport    allow-related  ip4.src==10.128.0.2
...

LOGICAL ROUTERS (2):
NAME                                          PORTS
--------------------------------------------------------------------------------
ovn_cluster_router                            3
GR_ip-10-0-1-10.us-east-2.compute.internal    2
```

## Examples

1. **Analyze all nodes in a must-gather**:
   ```
   /must-gather:ovn-dbs ./must-gather/registry-ci-openshift-org-origin-4-20-...-sha256-abc123/
   ```
   Shows logical network topology for all nodes.

2. **Analyze specific node**:
   ```
   /must-gather:ovn-dbs ./must-gather/.../ --node ip-10-0-26-145
   ```
   Shows OVN database information only for the specified node (supports partial name matching).

3. **Analyze master node**:
   ```
   /must-gather:ovn-dbs ./must-gather/.../ --node master-0
   ```
   Filter to a specific master node using partial name matching.

4. **Interactive usage without path**:
   ```
   /must-gather:ovn-dbs
   ```
   The command will ask for the must-gather path.

5. **Check if pod exists in OVN**:
   ```
   /must-gather:ovn-dbs ./must-gather/.../
   ```
   Then search the output for the pod name to see which node it is on and which IP it was allocated.

6. **Investigate ACL rules on a specific node**:
   ```
   /must-gather:ovn-dbs ./must-gather/.../ --node worker-1
   ```
   Review the ACL section for a specific node to understand traffic filtering rules.

7. **Run custom OVSDB query** (Query Mode):
   ```
   /must-gather:ovn-dbs ./must-gather/.../ --query '["OVN_Northbound", {"op":"select", "table":"ACL", "where":[["priority", ">", 1000]], "columns":["priority","match","action"]}]'
   ```
   Query ACLs with priority > 1000 across all nodes. Claude can construct the JSON query for any OVSDB table.

8. **Query specific node with custom query**:
   ```
   /must-gather:ovn-dbs ./must-gather/.../ --node master-0 --query '["OVN_Northbound", {"op":"select", "table":"Logical_Switch", "where":[], "columns":["name","ports"]}]'
   ```
   List all logical switches with their ports on master-0.

9. **Query specific table** (Claude constructs JSON):
   Just ask Claude to query a specific OVSDB table and it will construct the appropriate JSON query. For example:
   - "Show all Logical_Router_Static_Route entries"
   - "Find ACLs with action 'drop'"
   - "List Logical_Switch_Port entries where external_ids contains 'openshift-etcd'"
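
The query strings in Examples 7 and 8 are OVSDB `transact` parameter arrays: a database name followed by operations, with each `where` condition written as `[column, function, value]`. The sketch below builds such a select with `json.dumps` and runs it through `ovsdb-tool query <db> <transaction>`. `select_query` and `run_query` are hypothetical helper names, and the sketch assumes `ovsdb-tool` is on `PATH`.

```python
import json
import shutil
import subprocess


def select_query(table, columns=None, where=None, database="OVN_Northbound"):
    """Build an OVSDB "select" transaction for `ovsdb-tool query`.

    `where` is a list of [column, function, value] conditions, for example
    [["priority", ">", 1000]]; an empty list selects every row. Omitting
    `columns` returns all columns.
    """
    op = {"op": "select", "table": table, "where": where or []}
    if columns:
        op["columns"] = list(columns)
    return json.dumps([database, op])


def run_query(db_file, transaction):
    """Run a transaction against a database file and return the parsed JSON result."""
    if shutil.which("ovsdb-tool") is None:
        raise RuntimeError("ovsdb-tool not found. Please install openvswitch package.")
    result = subprocess.run(
        ["ovsdb-tool", "query", str(db_file), transaction],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)
```

The result is a list with one entry per operation. For a single select, the matching rows are in `run_query("<zone>_nbdb", select_query("ACL"))[0]["rows"]`.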

## Error Handling

**Missing ovsdb-tool:**
```
Error: ovsdb-tool not found. Please install openvswitch package.
```
Solution: Install openvswitch: `sudo dnf install openvswitch` (or `sudo apt install openvswitch-common` on Debian/Ubuntu)

**Missing database tarball:**
```
Error: Database tarball not found: network_logs/ovnk_database_store.tar.gz
```
Solution: Ensure this is a must-gather from an OVN cluster.

**Node not found:**
```
Error: No databases found for node matching 'master-5'

Available nodes:
- ip-10-0-77-117.us-east-2.compute.internal
- ip-10-0-26-145.us-east-2.compute.internal
- ip-10-0-1-194.us-east-2.compute.internal
```
Solution: Use one of the listed node names or a partial match.
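
The "Node not found" behaviour follows from how partial matching is applied to the `--node` value. The sketch below assumes "partial" means a case-insensitive substring match, which is consistent with the Notes example where `--node master` matches all masters. When nothing matches, it produces the same message layout as above. `match_nodes` is a hypothetical name.

```python
def match_nodes(query, nodes):
    """Return node names containing `query` (case-insensitive substring match).

    Raises LookupError listing the available nodes when nothing matches.
    """
    needle = query.lower()
    matches = [node for node in nodes if needle in node.lower()]
    if not matches:
        listing = "\n".join(f"- {node}" for node in nodes)
        raise LookupError(
            f"No databases found for node matching '{query}'\n\nAvailable nodes:\n{listing}"
        )
    return matches
```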

## Notes

- **Binary Database Format**: Uses `ovsdb-tool` to read OVSDB binary files directly
- **Per-Node Analysis**: In interconnect (IC) mode, each node has its own databases (one NB and one SB per zone)
- **Node Mapping**: Automatically correlates ovnkube pods to nodes by reading pod specs from the must-gather
- **Pod Discovery**: Pods are identified by `external_ids` with `pod=true`
- **IP Extraction**: Pod IPs are parsed from the `addresses` field (format: "MAC IP")
- **ACL Priorities**: Higher-priority ACLs are processed first (shown at the top)
- **Node Filtering**: Supports partial name matching for convenience (e.g., "--node master" matches all masters)
- **Query Mode**: Accepts raw OVSDB JSON queries in the format `["OVN_Northbound", {"op":"select", "table":"...", ...}]`
- **Claude Query Construction**: Claude can automatically construct OVSDB JSON queries based on natural language requests
- **Performance**: Querying large databases may take a few seconds per node
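
Pod discovery and IP extraction depend on how OVSDB encodes columns in JSON (RFC 7047). A set is either `["set", [...]]` or, when it has exactly one element, a bare atom. A map is `["map", [[key, value], ...]]`. The sketch below decodes `Logical_Switch_Port` rows from a select. It assumes two things about OVN-Kubernetes that this document does not state: `external_ids` carries a `namespace` key, and pod ports are named `<namespace>_<pod>`.

```python
def ovsdb_set(value):
    """Decode an OVSDB set: ["set", [atoms...]], or a bare atom for one element."""
    if isinstance(value, list) and len(value) == 2 and value[0] == "set":
        return value[1]
    return [value]


def ovsdb_map(value):
    """Decode an OVSDB map, encoded as ["map", [[key, value], ...]]."""
    if isinstance(value, list) and len(value) == 2 and value[0] == "map":
        return dict(value[1])
    return {}


def pod_ports(rows):
    """Yield (namespace, pod, ips) for Logical_Switch_Port rows marked as pods."""
    for row in rows:
        ids = ovsdb_map(row.get("external_ids", ["map", []]))
        if ids.get("pod") != "true":
            continue  # router, gateway, or other non-pod port
        ips = []
        for entry in ovsdb_set(row.get("addresses", ["set", []])):
            # Each address string is "MAC IP [IP ...]"; keep everything after the MAC.
            ips.extend(entry.split()[1:])
        namespace = ids.get("namespace", "")
        name = row.get("name", "")
        # Assumption: pod ports are named "<namespace>_<pod>".
        prefix = namespace + "_"
        pod = name[len(prefix):] if namespace and name.startswith(prefix) else name
        yield namespace, pod, ips
```

Keeping every token after the MAC also covers dual-stack ports, whose address string carries both an IPv4 and an IPv6 address.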

## Use Cases

1. **Verify Pod Network Configuration**:
   - Check if pods are registered in OVN
   - Verify IP address assignments
   - Confirm logical switch port creation

2. **Troubleshoot Connectivity Issues**:
   - Review ACL rules blocking traffic
   - Check if pods are in the correct logical switches
   - Verify router configurations

3. **Understand Topology**:
   - See how zones are interconnected via `transit_switch`
   - Review gateway router configurations
   - Understand the logical network structure

4. **Audit Network Policies**:
   - See ACL rules generated from NetworkPolicies
   - Identify overly permissive or restrictive rules
   - Check rule priorities and match conditions

## Arguments

- **$1** (must-gather-path): Optional. Path to the must-gather directory containing `network_logs/`. If not provided, the user will be prompted.
- **--node, -n** (node-name): Optional. Filter analysis to a specific node. Supports partial name matching (e.g., "master-0", "ip-10-0-26-145"). If no match is found, displays a list of available nodes.
- **--query, -q** (json-query): Optional. Run a raw OVSDB JSON query instead of the standard analysis. Claude can construct the JSON query based on the OVSDB transaction format. When provided, outputs raw JSON results instead of formatted analysis.