# Proxmox Clustering Reference

## Cluster Benefits

- Centralized web management
- Live VM migration between nodes
- High availability (HA) with automatic failover
- Shared configuration

## Cluster Requirements

| Requirement | Details |
|-------------|---------|
| Version | Same major/minor Proxmox version |
| Time | NTP synchronized |
| Network | Low-latency cluster network |
| Names | Unique node hostnames |
| Storage | Shared storage for HA |

## Cluster Commands

```bash
# Check cluster status
pvecm status

# List cluster nodes
pvecm nodes

# Join a cluster (run on the new node; the argument is an existing member)
pvecm add <existing-node>

# Remove a node (run on a remaining node, after the removed node is powered off)
pvecm delnode <node-name>

# Override expected votes (to regain quorum after node loss; use with care)
pvecm expected <votes>
```

## Quorum

A cluster needs a majority of votes (more than half) to operate. With one vote per node, quorum is `floor(N/2) + 1`.

| Nodes | Quorum | Can Lose |
|-------|--------|----------|
| 2 | 2 | 0 (use QDevice) |
| 3 | 2 | 1 |
| 4 | 3 | 1 |
| 5 | 3 | 2 |

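The table follows from simple integer arithmetic. A minimal sketch, assuming one vote per node and no QDevice; the function names `quorum_needed` and `tolerable_failures` are illustrative and not part of any Proxmox tool:

```bash
#!/usr/bin/env bash

# Votes needed for quorum with N single-vote nodes: floor(N/2) + 1.
quorum_needed() {
    local nodes=$1
    echo $(( nodes / 2 + 1 ))
}

# Nodes that may fail while the remaining ones still hold quorum.
tolerable_failures() {
    local nodes=$1
    echo $(( nodes - (nodes / 2 + 1) ))
}

# Print one line per cluster size listed in the table.
for n in 2 3 4 5; do
    printf '%s nodes: quorum=%s, can lose=%s\n' \
        "$n" "$(quorum_needed "$n")" "$(tolerable_failures "$n")"
done
```

Even node counts tolerate no more failures than the next smaller odd count, which is why odd cluster sizes (or a QDevice) are usually preferred.
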
### QDevice

An external quorum device, mainly for clusters with an even number of nodes:

- Prevents split-brain in 2-node clusters
- Runs on a separate machine outside the cluster
- Provides a tie-breaking vote

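Registering a QDevice can be sketched as follows. This assumes the `corosync-qnetd` package is already installed on the external host and `corosync-qdevice` on every cluster node; `run`, `setup_qdevice`, and the `DRY_RUN` variable are hypothetical helpers wrapped around the `pvecm qdevice setup` command:

```bash
#!/usr/bin/env bash

# run: execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# setup_qdevice <qnetd-address>
# Run on one cluster node; the address is the external qnetd host.
setup_qdevice() {
    local qnetd_addr=${1:-}
    if [ -z "$qnetd_addr" ]; then
        echo "usage: setup_qdevice <qnetd-address>" >&2
        return 1
    fi
    run pvecm qdevice setup "$qnetd_addr"
}

# Example (prints the command instead of running it):
#   DRY_RUN=1 setup_qdevice <qnetd-address>
```

Afterwards, `pvecm status` should list the additional vote.
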
## High Availability (HA)

Automatically restarts a VM on a healthy node when its host fails.

### Requirements

- Shared storage (Ceph, NFS, iSCSI)
- Fencing enabled (watchdog)
- HA group configured (optional)
- VM added to HA

### HA States

| State | Description |
|-------|-------------|
| started | VM running, managed by HA |
| stopped | VM stopped intentionally, still managed by HA |
| migrate | Live migration to another node in progress |
| relocate | Offline move (stop, move, start) to another node |
| error | Problem detected; manual intervention needed |

### HA Configuration

1. Enable fencing (watchdog device)
2. Create HA group (optional)
3. Add VM to HA: Datacenter → HA → Add

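From the command line, steps 2 and 3 might look like the sketch below. It assumes a release where `ha-manager` still supports HA groups; `run`, `ha_create_group`, `ha_enable_vm`, and `DRY_RUN` are hypothetical helper names, and the VM ID, group, and node names are placeholders:

```bash
#!/usr/bin/env bash

# run: execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# ha_create_group <group> <comma-separated-nodes>
ha_create_group() {
    local group=$1 nodes=$2
    run ha-manager groupadd "$group" --nodes "$nodes"
}

# ha_enable_vm <vmid> [group]
# Registers the VM as an HA resource that should be kept running.
ha_enable_vm() {
    local vmid=$1 group=${2:-}
    if [ -n "$group" ]; then
        run ha-manager add "vm:${vmid}" --state started --group "$group"
    else
        run ha-manager add "vm:${vmid}" --state started
    fi
}

# Example (prints the commands instead of running them):
#   DRY_RUN=1 ha_create_group prod node1,node2
#   DRY_RUN=1 ha_enable_vm 100 prod
```

`ha-manager status` then shows the resource and its current state.
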
### Fencing

Prevents split-brain by making sure a failed node is really stopped before its VMs are recovered elsewhere:

```bash
# Check the watchdog multiplexer used by the HA stack
systemctl status watchdog-mux

# Watchdog module selection (softdog is used if none is set)
cat /etc/default/pve-ha-manager
```

## Live Migration

Move a running VM between nodes with minimal downtime.

### Requirements

- Shared storage, or local disks copied during migration
- Compatible CPUs on source and target (same architecture)
- Network connectivity between the nodes
- Sufficient resources on the target

### Migration Types

| Type | Downtime | Requirements |
|------|----------|--------------|
| Live | Minimal | VM running, shared storage |
| Offline | Full | VM stopped, any storage |
| Live with local disks | Minimal, longer transfer | Disks are copied over the network |

### Migration Command

```bash
# Live migrate (VM running)
qm migrate <vmid> <target-node> --online

# Offline migrate (VM stopped)
qm migrate <vmid> <target-node>

# Live migrate with local disks
qm migrate <vmid> <target-node> --online --with-local-disks
```

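Which flags apply depends on whether the VM is running and where its disks live. A small sketch that only builds the command line and does not run it; `migrate_cmd` is a hypothetical helper, and it assumes a stopped VM's local disks are copied without an extra flag:

```bash
#!/usr/bin/env bash

# migrate_cmd <vmid> <target-node> <running|stopped> <shared|local>
# Prints the qm command line matching the VM state and storage type.
migrate_cmd() {
    local vmid=$1 target=$2 state=$3 storage=$4
    local cmd="qm migrate ${vmid} ${target}"

    if [ "$state" = "running" ]; then
        # A running VM needs --online to be migrated live.
        cmd+=" --online"
        if [ "$storage" = "local" ]; then
            # Live migration of local disks must be requested explicitly.
            cmd+=" --with-local-disks"
        fi
    fi
    echo "$cmd"
}

migrate_cmd 100 node2 running shared
migrate_cmd 100 node2 running local
migrate_cmd 100 node2 stopped shared
```

The VM ID `100` and node name `node2` are placeholders.
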
## Cluster Network

### Corosync Network

Corosync carries cluster communication over UDP (default port 5405):

- Low latency required
- Dedicated VLAN recommended
- Redundant links recommended for HA

### Configuration

```
# /etc/pve/corosync.conf (nodelist excerpt)
nodelist {
  node {
    name: node1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.10.1
  }
  node {
    name: node2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.10.2
  }
}
```

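To see which address each node uses for link 0, the nodelist can be read with `awk`. A minimal sketch, assuming `name` appears before `ring0_addr` inside each `node` block as in the excerpt above; `list_ring0` is a hypothetical helper:

```bash
#!/usr/bin/env bash

# list_ring0 <corosync.conf>
# Prints "name address" for every node entry in the file.
list_ring0() {
    awk '
        $1 == "name:"       { name = $2 }        # remember the node name
        $1 == "ring0_addr:" { print name, $2 }   # emit name and link-0 address
    ' "$1"
}

# Example:
#   list_ring0 /etc/pve/corosync.conf
```

The output can be fed into a loop that pings each address to check the cluster network.
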
## Troubleshooting

### Quorum Lost

```bash
# Check status
pvecm status

# Force expected votes to 1 (DANGEROUS: only if the other nodes are really offline)
pvecm expected 1

# Then: recover the remaining nodes and let them rejoin
```

### Node Won't Join

- Check network connectivity to the existing nodes
- Verify time sync
- Check that Proxmox versions match
- Review the logs: `journalctl -u corosync -u pve-cluster`

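The version check can be scripted. A minimal sketch, assuming version strings of the form `major.minor.patch` (for example the second `/`-separated field of `pveversion` output); `major_minor` and `same_release` are hypothetical helpers:

```bash
#!/usr/bin/env bash

# major_minor <version>: keep only the first two dot-separated fields.
major_minor() {
    echo "$1" | cut -d. -f1,2
}

# same_release <version-a> <version-b>
# Succeeds when both versions share the same major.minor release.
same_release() {
    [ "$(major_minor "$1")" = "$(major_minor "$2")" ]
}

# Example: compare the local version with one reported by another node.
#   local_ver=$(pveversion | cut -d/ -f2)
#   same_release "$local_ver" "<other-node-version>" || echo "version mismatch"
```

The comparison ignores the patch level, matching the "same major/minor" requirement.
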
### Split Brain Recovery

1. Identify the authoritative node
2. Stop cluster services on the other nodes
3. Set expected votes on the authoritative node (`pvecm expected <votes>`)
4. Restart the other nodes and let them rejoin