Initial commit

2025-11-30 08:47:38 +08:00
commit 18faa0569e
47 changed files with 7969 additions and 0 deletions
--- a/skills/proxmox/references/troubleshooting.md
+++ b/skills/proxmox/references/troubleshooting.md
@@ -0,0 +1,197 @@
+# Proxmox Troubleshooting Reference
+
+## Common Errors
+
+| Error | Cause | Solution |
+|-------|-------|----------|
+| VM won't start | Lock, storage, resources | `qm unlock`, check storage, verify resources |
+| Migration failed | No shared storage, resources | Verify shared storage, check target capacity |
+| Cluster issues | Quorum, network, time | `pvecm status`, check NTP, network |
+| Storage unavailable | Mount failed, network | Check mount, network access |
+| High load | Resource contention | Identify bottleneck, rebalance VMs |
+| Network issues | Bridge, VLAN, firewall | `brctl show`, check tags, firewall rules |
+| Backup failed | Disk space, VM state | Check space, storage access |
+| Template not found | Not downloaded | `pveam update`, then `pveam download <storage> <template>` |
+| API errors | Auth, permissions | Check token, user permissions |
+
+## Diagnostic Commands
+
+### Cluster Health
+
+```bash
+pvecm status                     # Quorum and node status
+pvecm nodes                      # List cluster members
+systemctl status pve-cluster     # Cluster service
+systemctl status corosync        # Corosync service
+```
+
+### Node Health
+
+```bash
+pveversion -v                    # Proxmox version info
+uptime                           # Load and uptime
+free -h                          # Memory usage
+df -h                            # Disk space
+top -bn1 | head -20              # Process overview
+```
+
+### VM Diagnostics
+
+```bash
+qm status <vmid>                 # VM state
+qm config <vmid>                 # VM configuration
+qm showcmd <vmid>                # QEMU command line
+qm unlock <vmid>                 # Clear locks
+qm monitor <vmid>                # QEMU monitor access
+```
+
+### Container Diagnostics
+
+```bash
+pct status <ctid>                # Container state
+pct config <ctid>                # Container configuration
+pct enter <ctid>                 # Enter container shell
+pct unlock <ctid>                # Clear locks
+```
+
+### Storage Diagnostics
+
+```bash
+pvesm status                     # Storage status
+df -h                            # Disk space
+mount | grep -E 'nfs|ceph'       # Mounted storage
+zpool status                     # ZFS pool status (if using ZFS)
+ceph -s                          # Ceph status (if using Ceph)
+```
+
+### Network Diagnostics
+
+```bash
+brctl show                       # Bridge configuration
+ip link                          # Network interfaces
+ip addr                          # IP addresses
+ip route                         # Routing table
+bridge vlan show                 # VLAN configuration
+```
+
+### Log Files
+
+```bash
+# Cluster logs
+journalctl -u pve-cluster
+journalctl -u corosync
+
+# VM/Container logs
+journalctl | grep <vmid>
+tail -f /var/log/pve/tasks/*
+
+# Firewall logs
+journalctl -u pve-firewall
+
+# Web interface logs
+journalctl -u pveproxy
+```
+
+## Troubleshooting Workflows
+
+### VM Won't Start
+
+1. Check for a stale lock: `qm config <vmid> | grep '^lock'`; clear it with `qm unlock <vmid>` once no task is still running
+2. Verify storage: `pvesm status`
+3. Check resources: `free -h`, `df -h`
+4. Review config: `qm config <vmid>`
+5. Check logs: `journalctl | grep <vmid>`
+6. Try a manual start and read the error it prints: `qm start <vmid>`
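+
+The lock check in step 1 can be done without changing anything. The sketch below assumes a lock appears as a `lock: <type>` line in `qm config` output; `lock_reason` is a hypothetical helper name, not a Proxmox command.
+
+```bash
+#!/bin/bash
+# Hypothetical helper: read `qm config` output on stdin and print the lock
+# type (for example "backup"). Exits non-zero when no lock line is present.
+lock_reason() {
+  awk -F': ' '$1 == "lock" { print $2; found = 1 } END { exit !found }'
+}
+
+# Usage on a Proxmox node (<vmid> is a placeholder):
+#   qm config <vmid> | lock_reason || echo "no lock set"
+```
+
+Reading the lock type first shows which operation set it (backup, migrate, snapshot, ...) before `qm unlock` clears it.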
+
+### Migration Failure
+
+1. Verify shared storage: `pvesm status`
+2. Check target resources: `pvesh get /nodes/<target>/status`
+3. Verify network: `ping <target-node>`
+4. Check version match: `pveversion` on both nodes
+5. Review migration logs
+
+### Cluster Quorum Lost
+
+1. Check status: `pvecm status`
+2. Identify online nodes
+3. If the majority is lost, temporarily lower the expected votes to the number of nodes still online: `pvecm expected <n>`
+4. Recover remaining nodes
+5. Rejoin lost nodes when available
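+
+Whether the cluster is quorate comes down to simple vote arithmetic: a strict majority of the total votes must be present. A minimal sketch, assuming one vote per node and no QDevice; `quorum_needed` is a hypothetical helper name.
+
+```bash
+#!/bin/bash
+# Hypothetical helper: votes required for quorum given the total vote count.
+# Strict majority: floor(total / 2) + 1.
+quorum_needed() {
+  local total_votes=$1
+  echo $(( total_votes / 2 + 1 ))
+}
+
+quorum_needed 3   # prints 2: a 3-node cluster tolerates one node down
+quorum_needed 4   # prints 3: a 4-node cluster also tolerates only one
+```
+
+If fewer nodes than this are online, `pvecm expected <n>` is the manual override from step 3.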
+
+### Storage Mount Failed
+
+1. Check network: `ping <storage-server>`
+2. Verify mount: `mount | grep <storage>`
+3. Try manual mount
+4. Check permissions on storage server
+5. Review `/var/log/syslog`, or `journalctl -b` on hosts without rsyslog
+
+### High CPU/Memory Usage
+
+1. Identify culprit: `top`, `htop`
+2. Check VM resources: `qm monitor <vmid>` → `info balloon`
+3. Review resource allocation across cluster
+4. Consider migration or resource limits
+
+## Recovery Procedures
+
+### Remove Failed Node
+
+```bash
+# On a healthy node, once the failed node is powered off for good
+pvecm delnode <failed-node>
+
+# Clean up node-specific configs
+rm -rf /etc/pve/nodes/<failed-node>
+```
+
+### Force Stop Locked VM
+
+```bash
+# Remove lock
+qm unlock <vmid>
+
+# If still stuck, read the QEMU PID from its pidfile and kill the process
+cat /var/run/qemu-server/<vmid>.pid
+kill <pid>
+
+# Force cleanup
+qm stop <vmid> --skiplock
+```
+
+### Recover from Corrupt Config
+
+```bash
+# Backup current config
+cp /etc/pve/qemu-server/<vmid>.conf /root/<vmid>.conf.bak
+
+# Edit config manually
+nano /etc/pve/qemu-server/<vmid>.conf
+
+# Or restore from backup (add --force to overwrite the existing VM)
+qmrestore <backup> <vmid>
+```
+
+## Health Check Script
+
+```bash
+#!/bin/bash
+echo "=== Cluster Status ==="
+pvecm status
+
+echo -e "\n=== Node Resources ==="
+for node in $(pvecm nodes | awk '$1 ~ /^[0-9]+$/ {print $3}'); do
+  echo "--- $node ---"
+  pvesh get "/nodes/$node/status" --output-format yaml | grep -E '^(cpu|memory):'
+done
+
+echo -e "\n=== Storage Status ==="
+pvesm status
+
+echo -e "\n=== Running VMs ==="
+qm list | grep running
+
+echo -e "\n=== Running Containers ==="
+pct list | grep running
+```