zhongwei/gh-poindexter12-waypoint-technologies

Zhongwei Li 18faa0569e Initial commit

2025-11-30 08:47:38 +08:00

Proxmox Troubleshooting Reference

Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| VM won't start | Lock, storage, resources | `qm unlock`, check storage, verify resources |
| Migration failed | No shared storage, resources | Verify shared storage, check target capacity |
| Cluster issues | Quorum, network, time | `pvecm status`, check NTP, network |
| Storage unavailable | Mount failed, network | Check mount, network access |
| High load | Resource contention | Identify bottleneck, rebalance VMs |
| Network issues | Bridge, VLAN, firewall | `brctl show`, check tags, firewall rules |
| Backup failed | Disk space, VM state | Check space, storage access |
| Template not found | Not downloaded | Download from Proxmox repo |
| API errors | Auth, permissions | Check token, user permissions |

Diagnostic Commands

Cluster Health

pvecm status                     # Quorum and node status
pvecm nodes                      # List cluster members
systemctl status pve-cluster     # Cluster service
systemctl status corosync        # Corosync service
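
The quorum check above can be scripted. This is a minimal sketch, assuming `pvecm status` prints a `Quorate:` line whose value is `Yes` or `No`. The function name `is_quorate` is hypothetical; it reads the status text from stdin so it can also be tried against saved output.

```shell
#!/bin/bash
# is_quorate: hypothetical helper, not part of Proxmox.
# Reads `pvecm status` output on stdin.
# Exit status 0 if a "Quorate: Yes" line is present, 1 otherwise.
is_quorate() {
  awk '$1 == "Quorate:" { found = ($2 == "Yes") }
       END { exit (found ? 0 : 1) }'
}
```

On a cluster node it could be used as: pvecm status | is_quorate && echo "cluster is quorate"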

Node Health

pveversion -v                    # Proxmox version info
uptime                           # Load and uptime
free -h                          # Memory usage
df -h                            # Disk space
top -bn1 | head -20              # Process overview
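
To turn the disk space check into a pass/fail test, the sketch below filters `df -P` output for filesystems at or above a usage threshold. The name `full_filesystems` and the default of 90 percent are assumptions for illustration, and mount points containing spaces are not handled.

```shell
#!/bin/bash
# full_filesystems: hypothetical helper, not part of Proxmox.
# Reads `df -P` output on stdin and prints the mount point of every
# filesystem whose Capacity column is >= the given percentage.
# $1: threshold in percent (assumed default: 90)
full_filesystems() {
  local threshold="${1:-90}"
  awk -v limit="$threshold" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)                 # "95%" -> "95"
      if (use + 0 >= limit + 0) print $6
    }'
}
```

Example: df -P | full_filesystems 85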

VM Diagnostics

qm status <vmid>                 # VM state
qm config <vmid>                 # VM configuration
qm showcmd <vmid>                # QEMU command line
qm unlock <vmid>                 # Clear locks
qm monitor <vmid>                # QEMU monitor access

Container Diagnostics

pct status <ctid>                # Container state
pct config <ctid>                # Container configuration
pct enter <ctid>                 # Enter container shell
pct unlock <ctid>                # Clear locks

Storage Diagnostics

pvesm status                     # Storage status
df -h                            # Disk space
mount | grep -E 'nfs|ceph'       # Mounted storage
zpool status                     # ZFS pool status (if using ZFS)
ceph -s                          # Ceph status (if using Ceph)
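
A quick way to spot problem storage is to filter the `pvesm status` table for anything not reported as active. This sketch assumes the usual column order (Name, Type, Status, ...); the function name `inactive_storages` is hypothetical.

```shell
#!/bin/bash
# inactive_storages: hypothetical helper, not part of Proxmox.
# Reads `pvesm status` output on stdin (assumed columns: Name Type Status ...)
# and prints "name (status)" for every storage whose status is not "active".
inactive_storages() {
  awk 'NR > 1 && $3 != "active" { print $1 " (" $3 ")" }'
}
```

Example: pvesm status | inactive_storages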

Network Diagnostics

brctl show                       # Bridge configuration (needs bridge-utils; else: ip -br link show type bridge)
ip link                          # Network interfaces
ip addr                          # IP addresses
ip route                         # Routing table
bridge vlan show                 # VLAN configuration

Log Files

# Cluster logs
journalctl -u pve-cluster
journalctl -u corosync

# VM/Container logs
journalctl | grep <vmid>
tail -n 20 /var/log/pve/tasks/index   # Recent tasks; per-task logs sit in the subdirectories

# Firewall logs
journalctl -u pve-firewall

# Web interface logs
journalctl -u pveproxy

Troubleshooting Workflows

VM Won't Start

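1. Check for locks: qm config <vmid> | grep lock (clear a stale lock with qm unlock <vmid>)
2. Verify storage: pvesm status
3. Check resources: free -h, df -h
4. Review config: qm config <vmid>
5. Check logs: journalctl | grep <vmid>
6. Try a manual start and read the error it prints: qm start <vmid> (qm showcmd <vmid> shows the QEMU command line it runs)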
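
The lock check in the first step can be sketched as a small filter over the VM configuration. It assumes a locked VM carries a `lock: <reason>` line in its `qm config` output; the function name `vm_lock` is hypothetical.

```shell
#!/bin/bash
# vm_lock: hypothetical helper, not part of Proxmox.
# Reads `qm config <vmid>` output on stdin.
# Prints the lock reason (for example "backup"), or "none" if no lock line exists.
vm_lock() {
  awk -F': ' '
    $1 == "lock" { print $2; found = 1 }
    END { if (!found) print "none" }'
}
```

Example: qm config <vmid> | vm_lock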

Migration Failure

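1. Verify shared storage: pvesm status
2. Check target resources: pvesh get /nodes/<target>/status
3. Verify network: ping <target-node>
4. Check version match: run pveversion on both nodes and compare
5. Review the migration task log (task viewer in the web interface) and journalctl on both nodes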
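
The version comparison can be automated once both version strings are collected. The sketch below assumes `pveversion` prints a string of the form `pve-manager/<release>/<hash> (running kernel: ...)` and compares only the release field. Both function names are hypothetical.

```shell
#!/bin/bash
# pve_release: hypothetical helper; reads one `pveversion` line on stdin
# and prints its second "/"-separated field (the release).
pve_release() {
  cut -d/ -f2
}

# same_release: hypothetical helper; succeeds if two `pveversion` strings
# carry the same, non-empty release field.
same_release() {
  local a b
  a=$(printf '%s\n' "$1" | pve_release)
  b=$(printf '%s\n' "$2" | pve_release)
  [ -n "$a" ] && [ "$a" = "$b" ]
}
```

Example: same_release "$(pveversion)" "$(ssh root@<target-node> pveversion)" || echo "version mismatch"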

Cluster Quorum Lost

1. Check status: pvecm status
2. Identify online nodes
3. If the majority is lost and the missing nodes are confirmed down, lower the expected votes: pvecm expected <n>
4. Recover remaining nodes
5. Rejoin lost nodes when available
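
Whether the majority is lost is simple arithmetic: a cluster is quorate when more than half of the expected votes are present. A minimal sketch, assuming one vote per node and no QDevice; the function name `quorum_needed` is hypothetical.

```shell
#!/bin/bash
# quorum_needed: hypothetical helper, not part of Proxmox.
# Prints the minimum number of votes for a majority: floor(total / 2) + 1.
# $1: total expected votes (assumed: one vote per node, no QDevice)
quorum_needed() {
  echo $(( $1 / 2 + 1 ))
}
```

For example, quorum_needed 3 prints 2 and quorum_needed 4 prints 3, so a four-node cluster tolerates only one failed node, the same as a three-node cluster.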

Storage Mount Failed

1. Check network: ping <storage-server>
2. Verify mount: mount | grep <storage>
3. Try manual mount
4. Check permissions on storage server
5. Review /var/log/syslog (or journalctl -b where rsyslog is not installed)

High CPU/Memory Usage

1. Identify culprit: top, htop
2. Check VM resources: qm monitor <vmid> → info balloon
3. Review resource allocation across cluster
4. Consider migration or resource limits

Recovery Procedures

Remove Failed Node

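# On a healthy node, once the failed node is powered off for good
pvecm delnode <failed-node>

# Clean up node-specific configs
# (this also deletes that node's guest configs; copy out any you still need first)
rm -rf /etc/pve/nodes/<failed-node>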

Force Stop Locked VM

# Remove lock
qm unlock <vmid>

# If still stuck, find and kill QEMU process
ps aux | grep <vmid>
kill <pid>

# Force cleanup
qm stop <vmid> --skiplock

Recover from Corrupt Config

# Backup current config
cp /etc/pve/qemu-server/<vmid>.conf /root/<vmid>.conf.bak

# Edit config manually
nano /etc/pve/qemu-server/<vmid>.conf

# Or restore from backup
qmrestore <backup> <vmid>

Health Check Script

#!/bin/bash
echo "=== Cluster Status ==="
pvecm status

echo -e "\n=== Node Resources ==="
# Node rows have a numeric vote count in column 2; header lines do not
for node in $(pvecm nodes | awk '$2 ~ /^[0-9]+$/ {print $3}'); do
  echo "--- $node ---"
  pvesh get "/nodes/$node/status" --output-format yaml | grep -E '^(cpu|memory):'
done

echo -e "\n=== Storage Status ==="
pvesm status

echo -e "\n=== Running VMs ==="
qm list | grep running

echo -e "\n=== Running Containers ==="
pct list | grep running
