Initial commit

2025-11-30 08:47:38 +08:00
commit 18faa0569e
47 changed files with 7969 additions and 0 deletions
--- a/skills/proxmox/references/clustering.md
+++ b/skills/proxmox/references/clustering.md
@@ -0,0 +1,181 @@
+# Proxmox Clustering Reference
+
+## Cluster Benefits
+
+- Centralized web management
+- Live VM migration between nodes
+- High availability (HA) with automatic failover
+- Shared configuration
+
+## Cluster Requirements
+
+| Requirement | Details |
+|-------------|---------|
+| Version | Same major/minor Proxmox version |
+| Time | NTP synchronized |
+| Network | Low-latency cluster network |
+| Names | Unique node hostnames |
+| Storage | Shared storage for HA |
+
+## Cluster Commands
+
+```bash
+# Check cluster status
+pvecm status
+
+# List cluster nodes
+pvecm nodes
+
+# Add node to cluster (run on new node)
+pvecm add <existing-node>
+
+# Remove node (run on remaining node)
+pvecm delnode <node-name>
+
+# Expected votes (split-brain recovery)
+pvecm expected <votes>
+```
+
+## Quorum
+
+The cluster needs a majority of votes online to operate: floor(N/2) + 1 for N nodes with one vote each.
+
+| Nodes | Quorum | Can Lose |
+|-------|--------|----------|
+| 2 | 2 | 0 (use QDevice) |
+| 3 | 2 | 1 |
+| 4 | 3 | 1 |
+| 5 | 3 | 2 |
+
+### QDevice
+
+External quorum device for even-node clusters:
+
+- Prevents split-brain in 2-node clusters
+- Runs on separate machine
+- Provides tie-breaking vote
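+
+A minimal setup sketch, assuming the external host is Debian-based and reachable from every node as `<qdevice-ip>` (a placeholder, not taken from this reference):
+
+```bash
+# On the external host: install the QNet daemon
+apt install corosync-qnetd
+
+# On every cluster node: install the QDevice client
+apt install corosync-qdevice
+
+# On one cluster node: register the QDevice, then confirm the extra vote
+pvecm qdevice setup <qdevice-ip>
+pvecm status
+```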
+
+## High Availability (HA)
+
+Automatic VM restart on healthy node if host fails.
+
+### Requirements
+
+- Shared storage (Ceph, NFS, iSCSI)
+- Fencing enabled (watchdog)
+- HA group configured (optional)
+- VM added to HA
+
+### HA States
+
+| State | Description |
+|-------|-------------|
+| started | VM running, managed by HA |
+| stopped | VM stopped intentionally |
+| migrate | Migration in progress |
+| relocate | Moving to different node |
+| error | Problem detected |
+
+### HA Configuration
+
+1. Enable fencing (watchdog device)
+2. Create HA group (optional)
+3. Add VM to HA: Datacenter → HA → Add
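+
+The same steps can be sketched from the CLI with `ha-manager`, assuming a release that still uses HA groups. The VM ID `100`, the group name `prefer-node1`, and the node names are placeholders:
+
+```bash
+# Create an HA group restricted to two nodes (optional)
+ha-manager groupadd prefer-node1 --nodes node1,node2
+
+# Put VM 100 under HA control and request that it be kept running
+ha-manager add vm:100 --state started --group prefer-node1
+
+# Show the HA manager's view of all resources
+ha-manager status
+```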
+
+### Fencing
+
+Prevents split-brain by forcing failed node to stop:
+
+```bash
+# Check that the watchdog multiplexer used by HA is running
+systemctl status watchdog-mux
+
+# Watchdog module selection (softdog is used if none is set):
+# /etc/default/pve-ha-manager
+```
+
+## Live Migration
+
+Move a running VM between nodes with minimal downtime.
+
+### Requirements
+
+- Shared storage OR local-to-local migration
+- Same CPU architecture and a VM CPU type that both nodes support
+- Network connectivity
+- Sufficient resources on target
+
+### Migration Types
+
+| Type | Downtime | Requirements |
+|------|----------|--------------|
+| Live | Minimal | Shared storage |
+| Offline | Full | Any storage |
+| Local storage | Moderate | Copies disk |
+
+### Migration Command
+
+```bash
+# Live migrate (VM keeps running)
+qm migrate <vmid> <target-node> --online
+
+# Offline migrate (VM must be stopped)
+qm migrate <vmid> <target-node>
+
+# Live migrate and copy local disks to the target
+qm migrate <vmid> <target-node> --online --with-local-disks
+```
+
+## Cluster Network
+
+### Corosync Network
+
+Cluster communication (UDP, ports 5405-5412 by default):
+
+- Low-latency required
+- Dedicated VLAN recommended
+- Redundant links for HA
+
+### Configuration
+
+```
+# /etc/pve/corosync.conf
+nodelist {
+  node {
+    name: node1
+    ring0_addr: 192.168.10.1
+  }
+  node {
+    name: node2
+    ring0_addr: 192.168.10.2
+  }
+}
+```
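+
+Redundant links are expressed as a second address per node. A sketch of one node entry, assuming a second cluster network on 192.168.20.0/24 (hypothetical) and illustrative `nodeid` and `quorum_votes` values:
+
+```
+node {
+  name: node1
+  nodeid: 1
+  quorum_votes: 1
+  ring0_addr: 192.168.10.1
+  ring1_addr: 192.168.20.1
+}
+```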
+
+## Troubleshooting
+
+### Quorum Lost
+
+```bash
+# Check status
+pvecm status
+
+# Force expected votes (DANGEROUS)
+pvecm expected 1
+
+# Then bring the missing nodes back online so normal quorum returns
+```
+
+### Node Won't Join
+
+- Check network connectivity
+- Verify time sync
+- Check Proxmox versions match
+- Review logs: `journalctl -u corosync -u pve-cluster`
+
+### Split Brain Recovery
+
+1. Identify authoritative node
+2. Stop cluster services on other nodes
+3. Set expected votes
+4. Restart and rejoin nodes