Files
gh-poindexter12-waypoint-te…/skills/proxmox/references/troubleshooting.md
2025-11-30 08:47:38 +08:00

4.9 KiB

Proxmox Troubleshooting Reference

Common Errors

Error Cause Solution
VM won't start Lock, storage, resources qm unlock, check storage, verify resources
Migration failed No shared storage, resources Verify shared storage, check target capacity
Cluster issues Quorum, network, time pvecm status, check NTP, network
Storage unavailable Mount failed, network Check mount, network access
High load Resource contention Identify bottleneck, rebalance VMs
Network issues Bridge, VLAN, firewall brctl show, check tags, firewall rules
Backup failed Disk space, VM state Check space, storage access
Template not found Not downloaded Download from Proxmox repo
API errors Auth, permissions Check token, user permissions

Diagnostic Commands

Cluster Health

pvecm status                     # Quorum and node status
pvecm nodes                      # List cluster members
systemctl status pve-cluster     # Cluster service
systemctl status corosync        # Corosync service

Node Health

pveversion -v                    # Proxmox version info
uptime                           # Load and uptime
free -h                          # Memory usage
df -h                            # Disk space
top -bn1 | head -20              # Process overview

VM Diagnostics

qm status <vmid>                 # VM state
qm config <vmid>                 # VM configuration
qm showcmd <vmid>                # QEMU command line
qm unlock <vmid>                 # Clear locks
qm monitor <vmid>                # QEMU monitor access

Container Diagnostics

pct status <ctid>                # Container state
pct config <ctid>                # Container configuration
pct enter <ctid>                 # Enter container shell
pct unlock <ctid>                # Clear locks

Storage Diagnostics

pvesm status                     # Storage status
df -h                            # Disk space
mount | grep -E 'nfs|ceph'       # Mounted storage
zpool status                     # ZFS pool status (if using ZFS)
ceph -s                          # Ceph status (if using Ceph)

Network Diagnostics

brctl show                       # Bridge configuration
ip link                          # Network interfaces
ip addr                          # IP addresses
ip route                         # Routing table
bridge vlan show                 # VLAN configuration

Log Files

# Cluster logs
journalctl -u pve-cluster
journalctl -u corosync

# VM/Container logs
journalctl | grep <vmid>
tail -f /var/log/pve/tasks/*

# Firewall logs
journalctl -u pve-firewall

# Web interface logs
journalctl -u pveproxy

Troubleshooting Workflows

VM Won't Start

  1. Check for locks: qm unlock <vmid>
  2. Verify storage: pvesm status
  3. Check resources: free -h, df -h
  4. Review config: qm config <vmid>
  5. Check logs: journalctl | grep <vmid>
  6. Try manual start: qm start <vmid> --debug

Migration Failure

  1. Verify shared storage: pvesm status
  2. Check target resources: pvesh get /nodes/<target>/status
  3. Verify network: ping <target-node>
  4. Check version match: pveversion on both nodes
  5. Review migration logs

Cluster Quorum Lost

  1. Check status: pvecm status
  2. Identify online nodes
  3. If majority lost, set expected: pvecm expected <n>
  4. Recover remaining nodes
  5. Rejoin lost nodes when available

Storage Mount Failed

  1. Check network: ping <storage-server>
  2. Verify mount: mount | grep <storage>
  3. Try manual mount
  4. Check permissions on storage server
  5. Review /var/log/syslog

High CPU/Memory Usage

  1. Identify culprit: top, htop
  2. Check VM resources: qm monitor <vmid>info balloon
  3. Review resource allocation across cluster
  4. Consider migration or resource limits

Recovery Procedures

Remove Failed Node

# On healthy node
pvecm delnode <failed-node>

# Clean up node-specific configs
rm -rf /etc/pve/nodes/<failed-node>

Force Stop Locked VM

# Remove lock
qm unlock <vmid>

# If still stuck, find and kill QEMU process
ps aux | grep <vmid>
kill <pid>

# Force cleanup
qm stop <vmid> --skiplock

Recover from Corrupt Config

# Backup current config
cp /etc/pve/qemu-server/<vmid>.conf /root/<vmid>.conf.bak

# Edit config manually
nano /etc/pve/qemu-server/<vmid>.conf

# Or restore from backup
qmrestore <backup> <vmid>

Health Check Script

#!/bin/bash
echo "=== Cluster Status ==="
pvecm status

echo -e "\n=== Node Resources ==="
for node in $(pvecm nodes | awk 'NR>1 {print $3}'); do
  echo "--- $node ---"
  pvesh get /nodes/$node/status --output-format yaml | grep -E '^(cpu|memory):'
done

echo -e "\n=== Storage Status ==="
pvesm status

echo -e "\n=== Running VMs ==="
qm list | grep running

echo -e "\n=== Running Containers ==="
pct list | grep running