---
name: Must-Gather Analyzer
description: |
  Analyze OpenShift must-gather diagnostic data including cluster operators, pods, nodes,
  and network components. Use this skill when the user asks about cluster health, operator status,
  pod issues, node conditions, or wants diagnostic insights from must-gather data.

  Triggers: "analyze must-gather", "check cluster health", "operator status", "pod issues",
  "node status", "failing pods", "degraded operators", "cluster problems", "crashlooping",
  "network issues", "etcd health", "analyze clusteroperators", "analyze pods", "analyze nodes"
---

# Must-Gather Analyzer Skill

Comprehensive analysis of OpenShift must-gather diagnostic data with helper scripts that parse YAML and display output in `oc`-like format.

## Overview

This skill provides analysis for:
- **ClusterVersion**: Current version, update status, and capabilities
- **Cluster Operators**: Status, degradation, and availability
- **Pods**: Health, restarts, crashes, and failures across namespaces
- **Nodes**: Conditions, capacity, and readiness
- **Network**: OVN/SDN diagnostics and connectivity
- **Events**: Warning and error events across namespaces
- **etcd**: Cluster health, member status, and quorum
- **Storage**: PersistentVolume and PersistentVolumeClaim status
- **Monitoring**: Prometheus alerts (pending and firing)

## Must-Gather Directory Structure

**Important**: Must-gather data is contained in a subdirectory with a long hash name:
```
must-gather/
└── registry-ci-openshift-org-origin-...-sha256-<hash>/
    ├── cluster-scoped-resources/
    │   ├── config.openshift.io/clusteroperators/
    │   └── core/nodes/
    ├── namespaces/
    │   └── <namespace>/
    │       └── pods/
    │           └── <pod-name>/
    │               └── <pod-name>.yaml
    └── network_logs/
```

The analysis scripts expect the path to the **subdirectory** (the one with the hash), not the root must-gather folder.
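
If only the root folder is provided, the payload directory can be located programmatically. Below is a minimal sketch (not one of the bundled scripts) that picks the child directory containing `cluster-scoped-resources/`:

```python
from pathlib import Path

def resolve_must_gather_root(path: str) -> Path:
    """Return the directory that actually holds the must-gather payload.

    Accepts either the hashed subdirectory itself or its parent folder.
    """
    root = Path(path)
    # Already the payload directory?
    if (root / "cluster-scoped-resources").is_dir():
        return root
    # Otherwise look one level down for the hashed image directory.
    for child in root.iterdir():
        if child.is_dir() and (child / "cluster-scoped-resources").is_dir():
            return child
    raise FileNotFoundError(f"No must-gather payload found under {root}")

# Example: resolve_must_gather_root("must-gather/") -> must-gather/registry-ci-...-sha256-<hash>/
```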

## Instructions

### 1. Get Must-Gather Path
Ask the user for the must-gather directory path if not already provided.
- If they provide the root directory, look for the subdirectory with the hash name
- The correct path contains `cluster-scoped-resources/` and `namespaces/` directories

### 2. Choose Analysis Type

Based on the user's request, run the appropriate helper script:

#### ClusterVersion Analysis
```bash
./scripts/analyze_clusterversion.py <must-gather-path>
```

Shows cluster version information similar to `oc get clusterversion`:
- Current version and update status
- Progressing state
- Available updates
- Version conditions
- Enabled capabilities
- Update history
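
As an illustration of where this data lives (the bundled script already handles it), a minimal sketch that reads the ClusterVersion resource with PyYAML, assuming the layout listed under Helper Scripts Reference:

```python
import yaml  # PyYAML
from pathlib import Path

def clusterversion_summary(mg_path: str) -> dict:
    """Pull the fields displayed by the analysis script out of version.yaml."""
    cv_file = (Path(mg_path) / "cluster-scoped-resources" / "config.openshift.io"
               / "clusterversions" / "version.yaml")
    status = yaml.safe_load(cv_file.read_text()).get("status", {})
    return {
        "desired_version": status.get("desired", {}).get("version"),
        "conditions": {c["type"]: c["status"] for c in status.get("conditions", [])},
        "capabilities": status.get("capabilities", {}).get("enabledCapabilities", []),
        "history": [h.get("version") for h in status.get("history", [])],
    }
```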

#### Cluster Operators Analysis
```bash
./scripts/analyze_clusteroperators.py <must-gather-path>
```

Shows cluster operator status similar to `oc get clusteroperators`:
- Available, Progressing, Degraded conditions
- Version information
- Time since condition change
- Detailed messages for operators with issues
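
For reference, a rough sketch (using PyYAML) of how the per-operator conditions can be read from the clusteroperators directory; the real script also adds the version and timing columns:

```python
import yaml
from pathlib import Path

def operator_conditions(mg_path: str):
    """Yield (name, available, progressing, degraded) for each ClusterOperator YAML."""
    co_dir = Path(mg_path) / "cluster-scoped-resources" / "config.openshift.io" / "clusteroperators"
    for co_file in sorted(co_dir.glob("*.yaml")):
        co = yaml.safe_load(co_file.read_text())
        conds = {c["type"]: c["status"] for c in co.get("status", {}).get("conditions", [])}
        yield co["metadata"]["name"], conds.get("Available"), conds.get("Progressing"), conds.get("Degraded")

# Example: flag operators that are degraded or unavailable
# for name, available, progressing, degraded in operator_conditions(path):
#     if degraded == "True" or available == "False":
#         print(name, available, progressing, degraded)
```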

#### Pods Analysis
```bash
# All namespaces
./scripts/analyze_pods.py <must-gather-path>

# Specific namespace
./scripts/analyze_pods.py <must-gather-path> --namespace <namespace>

# Show only problematic pods
./scripts/analyze_pods.py <must-gather-path> --problems-only
```

Shows pod status similar to `oc get pods -A`:
- Ready/Total containers
- Status (Running, Pending, CrashLoopBackOff, etc.)
- Restart counts
- Age
- Categorized issues (crashlooping, pending, failed)
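
A simplified sketch of the underlying idea, reading each pod's YAML dump with PyYAML (the actual script also derives statuses such as CrashLoopBackOff from container waiting reasons and categorizes the issues):

```python
import yaml
from pathlib import Path

def pod_rows(mg_path: str, namespace=None):
    """Yield (namespace, name, ready, total, phase, restarts) for each collected pod."""
    for pod_file in Path(mg_path).glob("namespaces/*/pods/*/*.yaml"):
        pod = yaml.safe_load(pod_file.read_text())
        if not pod or pod.get("kind") != "Pod":
            continue  # skip any non-Pod YAML in the pod directories
        meta, status = pod["metadata"], pod.get("status", {})
        if namespace and meta["namespace"] != namespace:
            continue
        containers = status.get("containerStatuses", [])
        yield (
            meta["namespace"],
            meta["name"],
            sum(1 for c in containers if c.get("ready")),
            len(containers),
            status.get("phase", "Unknown"),
            sum(c.get("restartCount", 0) for c in containers),
        )
```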

#### Nodes Analysis
```bash
./scripts/analyze_nodes.py <must-gather-path>

# Show only nodes with issues
./scripts/analyze_nodes.py <must-gather-path> --problems-only
```

Shows node status similar to `oc get nodes`:
- Ready status
- Roles (master, worker)
- Age
- Kubernetes version
- Node conditions (DiskPressure, MemoryPressure, etc.)
- Capacity and allocatable resources
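
A minimal sketch of how Ready and pressure conditions can be derived from the node YAMLs (roles come from `node-role.kubernetes.io/*` labels); the bundled script additionally reports age, capacity, and allocatable resources:

```python
import yaml
from pathlib import Path

def node_rows(mg_path: str):
    """Yield (name, ready, roles, kubelet_version, active_pressure_conditions) per node."""
    nodes_dir = Path(mg_path) / "cluster-scoped-resources" / "core" / "nodes"
    for node_file in sorted(nodes_dir.glob("*.yaml")):
        node = yaml.safe_load(node_file.read_text())
        labels = node["metadata"].get("labels", {})
        roles = [k.split("/", 1)[1] for k in labels if k.startswith("node-role.kubernetes.io/")]
        conds = {c["type"]: c["status"] for c in node.get("status", {}).get("conditions", [])}
        pressure = [t for t, s in conds.items() if t.endswith("Pressure") and s == "True"]
        yield (
            node["metadata"]["name"],
            conds.get("Ready") == "True",
            ",".join(roles) or "<none>",
            node.get("status", {}).get("nodeInfo", {}).get("kubeletVersion"),
            pressure,
        )
```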

#### Network Analysis
```bash
./scripts/analyze_network.py <must-gather-path>
```

Shows network health:
- Network type (OVN-Kubernetes, OpenShift SDN)
- Network operator status
- OVN pod health
- PodNetworkConnectivityCheck results
- Network-related issues
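
One of the checks such a script can perform is sketched below: counting healthy OVN pods from their per-pod YAML dumps. The `openshift-ovn-kubernetes` namespace is the one referenced in the Common Scenarios section; clusters running OpenShift SDN keep their network pods in a different namespace.

```python
import yaml
from pathlib import Path

def ovn_pod_health(mg_path: str, namespace: str = "openshift-ovn-kubernetes"):
    """Return (running, total) for the pods collected in the OVN namespace."""
    total = running = 0
    for pod_file in Path(mg_path).glob(f"namespaces/{namespace}/pods/*/*.yaml"):
        pod = yaml.safe_load(pod_file.read_text())
        if not pod or pod.get("kind") != "Pod":
            continue
        total += 1
        if pod.get("status", {}).get("phase") == "Running":
            running += 1
    return running, total
```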

#### Events Analysis
```bash
# Recent events (last 100)
./scripts/analyze_events.py <must-gather-path>

# Warning events only
./scripts/analyze_events.py <must-gather-path> --type Warning

# Events in specific namespace
./scripts/analyze_events.py <must-gather-path> --namespace openshift-etcd

# Show last 50 events
./scripts/analyze_events.py <must-gather-path> --count 50
```

Shows cluster events:
- Event type (Warning, Normal)
- Last seen timestamp
- Reason and message
- Affected object
- Event count
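
A rough sketch of the Warning-only case, reading the per-namespace `events.yaml` files with PyYAML (the bundled script also reports the affected object and occurrence count):

```python
import yaml
from pathlib import Path

def warning_events(mg_path: str, namespace=None, count: int = 100):
    """Collect Warning events across namespaces and return the most recent ones."""
    pattern = f"namespaces/{namespace}/core/events.yaml" if namespace else "namespaces/*/core/events.yaml"
    events = []
    for ev_file in Path(mg_path).glob(pattern):
        doc = yaml.safe_load(ev_file.read_text()) or {}
        for ev in doc.get("items", []):
            if ev.get("type") != "Warning":
                continue
            events.append((
                str(ev.get("lastTimestamp") or ev.get("eventTime") or ""),
                ev.get("metadata", {}).get("namespace", ""),
                ev.get("reason", ""),
                (ev.get("message") or "").split("\n")[0],
            ))
    # RFC 3339 timestamps sort correctly as strings
    return sorted(events)[-count:]
```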

#### etcd Analysis
```bash
./scripts/analyze_etcd.py <must-gather-path>
```

Shows etcd cluster health:
- Member health status
- Member list with IDs and URLs
- Endpoint status (leader, version, DB size)
- Quorum status
- Cluster summary
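
For orientation, a minimal sketch that reads the JSON dumps under `etcd_info/`. It assumes the files mirror the `etcdctl ... -w json` output shape (a list of endpoint health entries and a `members` array); treat that shape as an assumption to verify against your data:

```python
import json
from pathlib import Path

def etcd_overview(mg_path: str) -> dict:
    """Summarize member health and a simple quorum estimate from etcd_info/."""
    etcd_dir = Path(mg_path) / "etcd_info"
    health = json.loads((etcd_dir / "endpoint_health.json").read_text())
    members = json.loads((etcd_dir / "member_list.json").read_text()).get("members", [])
    healthy = [e["endpoint"] for e in health if e.get("health")]
    unhealthy = [e["endpoint"] for e in health if not e.get("health")]
    return {
        "members": [m.get("name") for m in members],
        "healthy_endpoints": healthy,
        "unhealthy_endpoints": unhealthy,
        # quorum requires a strict majority of members to be healthy
        "has_quorum": len(healthy) > len(members) // 2 if members else False,
    }
```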

#### Storage Analysis
```bash
# All PVs and PVCs
./scripts/analyze_pvs.py <must-gather-path>

# PVCs in specific namespace
./scripts/analyze_pvs.py <must-gather-path> --namespace openshift-monitoring
```

Shows storage resources:
- PersistentVolumes (capacity, status, claims)
- PersistentVolumeClaims (binding, capacity)
- Storage classes
- Pending/unbound volumes
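
A minimal sketch of the pending/unbound check, using the file locations listed under Helper Scripts Reference (per-PV YAMLs plus a per-namespace PVC list):

```python
import yaml
from pathlib import Path

def non_bound_storage(mg_path: str):
    """Return PVs and PVCs whose status.phase is anything other than Bound."""
    root = Path(mg_path)
    pvs, pvcs = [], []
    for pv_file in root.glob("cluster-scoped-resources/core/persistentvolumes/*.yaml"):
        pv = yaml.safe_load(pv_file.read_text())
        phase = pv.get("status", {}).get("phase")
        if phase != "Bound":
            pvs.append((pv["metadata"]["name"], phase))
    for pvc_file in root.glob("namespaces/*/core/persistentvolumeclaims.yaml"):
        doc = yaml.safe_load(pvc_file.read_text()) or {}
        for pvc in doc.get("items", []):
            phase = pvc.get("status", {}).get("phase")
            if phase != "Bound":
                pvcs.append((pvc["metadata"]["namespace"], pvc["metadata"]["name"], phase))
    return pvs, pvcs
```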

#### Monitoring Analysis
```bash
# All alerts
./scripts/analyze_prometheus.py <must-gather-path>

# Alerts in specific namespace
./scripts/analyze_prometheus.py <must-gather-path> --namespace openshift-monitoring
```

Shows monitoring information:
- Alerts (state, namespace, name, active since, labels)
- Totals of pending and firing alerts
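
The location of the exported alert data varies between must-gather versions, so the sketch below deliberately leaves the loading step out and only shows how firing/pending totals can be derived once the alert list (assumed to be in the Prometheus API's JSON shape) is in hand:

```python
from collections import Counter

def alert_totals(alerts):
    """Count alerts by state and list (state, namespace, alertname) rows.

    Each alert is assumed to be a dict with 'state' and 'labels' keys,
    e.g. labels.alertname and labels.namespace.
    """
    totals = Counter(a.get("state", "unknown") for a in alerts)
    rows = sorted(
        (a.get("state", ""), a.get("labels", {}).get("namespace", ""), a.get("labels", {}).get("alertname", ""))
        for a in alerts
    )
    return totals, rows

# Example:
# totals, rows = alert_totals(loaded_alerts)
# print(f"{totals['firing']} firing / {totals['pending']} pending")
```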

### 3. Interpret and Report

After running the scripts:
1. Review the summary statistics
2. Focus on items flagged with issues
3. Provide actionable insights and next steps
4. Suggest log analysis for specific components if needed
5. Cross-reference issues (e.g., degraded operator → failing pods → node issues)

## Output Format

All scripts provide:
- **Summary Section**: High-level statistics with emoji indicators
- **Table View**: `oc`-like formatted output
- **Issues Section**: Detailed breakdown of problems

Example summary format:
```
================================================================================
SUMMARY: 25/28 operators healthy
⚠️ 3 operators with issues
🔄 1 progressing
❌ 2 degraded
================================================================================
```

## Helper Scripts Reference

### scripts/analyze_clusterversion.py
Parses: `cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml`
Output: ClusterVersion table with detailed version info, conditions, and capabilities

### scripts/analyze_clusteroperators.py
Parses: `cluster-scoped-resources/config.openshift.io/clusteroperators/`
Output: ClusterOperator status table with conditions

### scripts/analyze_pods.py
Parses: `namespaces/*/pods/*/*.yaml` (individual pod directories)
Output: Pod status table with issues categorized

### scripts/analyze_nodes.py
Parses: `cluster-scoped-resources/core/nodes/`
Output: Node status table with conditions and capacity

### scripts/analyze_network.py
Parses: `network_logs/`, network operator, OVN resources
Output: Network health summary and diagnostics

### scripts/analyze_events.py
Parses: `namespaces/*/core/events.yaml`
Output: Event table sorted by last occurrence

### scripts/analyze_etcd.py
Parses: `etcd_info/` (endpoint_health.json, member_list.json, endpoint_status.json)
Output: etcd cluster health and member status

### scripts/analyze_pvs.py
Parses: `cluster-scoped-resources/core/persistentvolumes/`, `namespaces/*/core/persistentvolumeclaims.yaml`
Output: PV and PVC status tables

## Tips for Analysis

1. **Start with Cluster Operators**: They often reveal system-wide issues
2. **Check Timing**: Look at "SINCE" columns to understand when issues started
3. **Follow Dependencies**: Degraded operator → check its namespace pods → check hosting nodes
4. **Look for Patterns**: Multiple pods failing on the same node suggest a node issue
5. **Cross-reference**: Use multiple scripts together for a complete picture

## Common Scenarios

### "Why is my cluster degraded?"
1. Run `analyze_clusteroperators.py` - identify degraded operators
2. Run `analyze_pods.py --namespace <operator-namespace>` - check operator pods
3. Run `analyze_nodes.py` - verify node health

### "Pods keep crashing"
1. Run `analyze_pods.py --problems-only` - find crashlooping pods
2. Check which nodes they're on
3. Run `analyze_nodes.py` - verify node conditions
4. Suggest checking pod logs in must-gather data

### "Network connectivity issues"
1. Run `analyze_network.py` - check network health
2. Run `analyze_pods.py --namespace openshift-ovn-kubernetes`
3. Check PodNetworkConnectivityCheck results

## Next Steps After Analysis

Based on findings, suggest:
- Examining specific pod logs in `namespaces/<ns>/pods/<pod>/<container>/logs/` (a small helper sketch follows this list)
- Reviewing events in `namespaces/<ns>/core/events.yaml`
- Checking audit logs in `audit_logs/`
- Analyzing metrics data if available
- Looking at host service logs in `host_service_logs/`
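
A minimal sketch for scanning a pod's collected log files for a keyword, based on the `namespaces/<ns>/pods/<pod>/.../logs/` layout mentioned above (the exact nesting and file names under `logs/` can vary, so the sketch simply walks every file found inside a `logs/` directory):

```python
from pathlib import Path

def grep_pod_logs(mg_path: str, namespace: str, pod: str, needle: str = "error"):
    """Print lines containing `needle` from every log file gathered for a pod."""
    pod_dir = Path(mg_path) / "namespaces" / namespace / "pods" / pod
    for log_file in sorted(pod_dir.rglob("logs/*")):
        if not log_file.is_file():
            continue
        for line in log_file.read_text(errors="replace").splitlines():
            if needle.lower() in line.lower():
                print(f"{log_file.relative_to(pod_dir)}: {line}")
```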