CEPH Storage Deployment Workflow

Complete guide to deploying CEPH storage on a Proxmox VE cluster with automated OSD creation, pool configuration, and health verification.

Overview

This workflow automates CEPH deployment with:

  • CEPH package installation
  • Cluster initialization with proper network configuration
  • Monitor and manager creation across all nodes
  • Automated OSD creation with partition support
  • Pool configuration with replication and compression
  • Comprehensive health verification

Prerequisites

Before deploying CEPH:

  1. Cluster must be formed:

    • Proxmox cluster already initialized and healthy
    • All nodes showing quorum
    • See Cluster Formation first
  2. Network requirements:

    • Dedicated CEPH public network (192.168.5.0/24 for Matrix)
    • Dedicated CEPH private/cluster network (192.168.7.0/24 for Matrix)
    • MTU 9000 (jumbo frames) configured on CEPH networks
    • Bridges configured: vmbr1 (public), vmbr2 (private)
  3. Storage requirements:

    • Dedicated disks for OSDs (not boot disks)
    • All OSD disks should be the same type (SSD/NVMe)
    • Matrix: 2× 4TB Samsung 990 PRO NVMe per node = 24TB raw
  4. System requirements:

    • Minimum 3 nodes for production (replication factor 3)
    • At least 4GB RAM per OSD
    • Fast network (10GbE recommended for CEPH networks)
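
These prerequisites can be spot-checked by hand before any automation runs; a quick sketch, using the Matrix node and device names as examples:

# Cluster is quorate
pvecm status | grep -i quorate

# Jumbo frames are active on the CEPH bridges
ip link show vmbr1 | grep 'mtu 9000'
ip link show vmbr2 | grep 'mtu 9000'

# OSD candidate disks exist and carry no partitions or signatures
lsblk /dev/nvme1n1 /dev/nvme2n1
wipefs /dev/nvme1n1    # read-only listing; prints nothing on a clean disk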

Phase 1: Install CEPH Packages

Step 1: Install CEPH

# roles/proxmox_ceph/tasks/install.yml
---
- name: Check if CEPH is already installed
  ansible.builtin.stat:
    path: /etc/pve/ceph.conf
  register: ceph_conf_check

- name: Check CEPH packages
  ansible.builtin.command:
    cmd: dpkg -l ceph-common
  register: ceph_package_check
  failed_when: false
  changed_when: false

- name: Install CEPH packages via pveceph
  ansible.builtin.command:
    cmd: "pveceph install --repository {{ ceph_repository }}"
  when: ceph_package_check.rc != 0
  register: ceph_install
  changed_when: "'installed' in ceph_install.stdout | default('')"

- name: Verify CEPH installation
  ansible.builtin.command:
    cmd: ceph --version
  register: ceph_version
  changed_when: false
  failed_when: ceph_version.rc != 0

- name: Display CEPH version
  ansible.builtin.debug:
    msg: "Installed CEPH version: {{ ceph_version.stdout }}"

Phase 2: Initialize CEPH Cluster

Step 2: Initialize CEPH (First Node Only)

# roles/proxmox_ceph/tasks/init.yml
---
# /etc/pve/ceph.conf lives on the shared pmxcfs, so every node sees it once the
# cluster has been initialized (`ceph status` fails until monitors exist, so it
# is not a reliable idempotence check here)
- name: Check if CEPH cluster is initialized
  ansible.builtin.stat:
    path: /etc/pve/ceph.conf
  register: ceph_conf_stat

- name: Set CEPH initialization facts
  ansible.builtin.set_fact:
    ceph_initialized: "{{ ceph_conf_stat.stat.exists }}"
    is_ceph_first_node: "{{ inventory_hostname == groups[cluster_group | default('matrix_cluster')][0] }}"
    is_ceph_last_node: "{{ inventory_hostname == groups[cluster_group | default('matrix_cluster')][-1] }}"

- name: Initialize CEPH cluster on first node
  ansible.builtin.command:
    cmd: >
      pveceph init
      --network {{ ceph_network }}
      --cluster-network {{ ceph_cluster_network }}
  when:
    - is_ceph_first_node
    - not ceph_initialized
  register: ceph_init
  changed_when: ceph_init.rc == 0

- name: Wait for CEPH cluster to initialize
  ansible.builtin.pause:
    seconds: 15
  when: ceph_init.changed

- name: Verify CEPH initialization
  ansible.builtin.stat:
    path: /etc/pve/ceph.conf
  register: ceph_init_verify
  when:
    - is_ceph_first_node
  failed_when:
    - not (ceph_init_verify.stat.exists | default(false))
  # `ceph status` cannot succeed yet: the cluster has no monitors until Phase 3

- name: Display CEPH network configuration
  ansible.builtin.command:
    cmd: grep -E 'public_network|cluster_network' /etc/pve/ceph.conf
  register: ceph_net_display
  changed_when: false
  when:
    - is_ceph_first_node
    - ceph_init.changed or ansible_verbosity > 0

- name: Show CEPH network configuration
  ansible.builtin.debug:
    var: ceph_net_display.stdout_lines
  when:
    - is_ceph_first_node
    - ceph_net_display is not skipped

Phase 3: Create Monitors and Managers

Step 3: Create CEPH Monitors

# roles/proxmox_ceph/tasks/monitors.yml
---
- name: Check existing CEPH monitors
  ansible.builtin.command:
    cmd: ceph mon dump --format json
  register: mon_dump
  delegate_to: "{{ groups[cluster_group | default('matrix_cluster')][0] }}"
  run_once: true
  failed_when: false
  changed_when: false

- name: Parse monitor list
  ansible.builtin.set_fact:
    existing_monitors: "{{ (mon_dump.stdout | from_json).mons | map(attribute='name') | list }}"
  when: mon_dump.rc == 0

- name: Set monitor facts
  ansible.builtin.set_fact:
    has_monitor: "{{ inventory_hostname_short in existing_monitors | default([]) }}"

- name: Create CEPH monitor on first node
  ansible.builtin.command:
    cmd: pveceph mon create
  when:
    - is_ceph_first_node
    - not has_monitor
  register: mon_create_first
  changed_when: mon_create_first.rc == 0

- name: Wait for first monitor to stabilize
  ansible.builtin.pause:
    seconds: 10
  when: mon_create_first.changed

- name: Create CEPH monitors on other nodes
  ansible.builtin.command:
    cmd: pveceph mon create
  when:
    - not is_ceph_first_node
    - not has_monitor
  register: mon_create_others
  changed_when: mon_create_others.rc == 0

# With serial: 1, the full monitor set only exists once the last node has run
- name: Verify monitor quorum
  ansible.builtin.command:
    cmd: ceph quorum_status --format json
  register: quorum_status
  changed_when: false
  delegate_to: "{{ groups[cluster_group | default('matrix_cluster')][0] }}"
  run_once: true
  when: is_ceph_last_node

- name: Check monitor quorum size
  ansible.builtin.assert:
    that:
      - (quorum_status.stdout | from_json).quorum | length >= ((groups[cluster_group | default('matrix_cluster')] | length // 2) + 1)
    fail_msg: "Monitor quorum not established"
  run_once: true
  when: is_ceph_last_node
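
The same quorum information is available interactively from any node; a quick manual check (jq is assumed to be installed):

ceph mon stat
ceph quorum_status --format json | jq '.quorum_names'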

Step 4: Create CEPH Managers

# roles/proxmox_ceph/tasks/managers.yml
---
- name: Check existing CEPH managers
  ansible.builtin.command:
    cmd: ceph mgr dump --format json
  register: mgr_dump
  delegate_to: "{{ groups[cluster_group | default('matrix_cluster')][0] }}"
  run_once: true
  failed_when: false
  changed_when: false

- name: Parse manager list
  ansible.builtin.set_fact:
    existing_managers: "{{ [(mgr_dump.stdout | from_json).active_name] + ((mgr_dump.stdout | from_json).standbys | map(attribute='name') | list) }}"
  when: mgr_dump.rc == 0

- name: Initialize empty manager list if check failed
  ansible.builtin.set_fact:
    existing_managers: []
  when: mgr_dump.rc != 0

- name: Set manager facts
  ansible.builtin.set_fact:
    has_manager: "{{ inventory_hostname_short in (existing_managers | default([])) }}"

- name: Create CEPH manager
  ansible.builtin.command:
    cmd: pveceph mgr create
  when: not has_manager
  register: mgr_create
  changed_when: mgr_create.rc == 0

- name: Wait for managers to stabilize
  ansible.builtin.pause:
    seconds: 5
  when: mgr_create.changed

# The dashboard module requires the ceph-mgr-dashboard package on every manager node
- name: Enable CEPH dashboard module
  ansible.builtin.command:
    cmd: ceph mgr module enable dashboard
  delegate_to: "{{ groups[cluster_group | default('matrix_cluster')][0] }}"
  run_once: true
  register: dashboard_enable
  changed_when: "'already enabled' not in dashboard_enable.stderr"
  failed_when:
    - dashboard_enable.rc != 0
    - "'already enabled' not in dashboard_enable.stderr"

- name: Enable Prometheus module
  ansible.builtin.command:
    cmd: ceph mgr module enable prometheus
  delegate_to: "{{ groups[cluster_group | default('matrix_cluster')][0] }}"
  run_once: true
  register: prometheus_enable
  changed_when: "'already enabled' not in prometheus_enable.stderr"
  failed_when:
    - prometheus_enable.rc != 0
    - "'already enabled' not in prometheus_enable.stderr"

Phase 4: Create OSDs

Step 5: Prepare and Create OSDs

# roles/proxmox_ceph/tasks/osd_create.yml
---
- name: Get list of existing OSDs
  ansible.builtin.command:
    cmd: ceph osd ls
  register: existing_osds
  changed_when: false
  failed_when: false
  delegate_to: "{{ groups[cluster_group | default('matrix_cluster')][0] }}"
  run_once: true

- name: Check OSD devices availability
  ansible.builtin.command:
    cmd: "lsblk -ndo NAME,SIZE,TYPE {{ item.device }}"
  register: device_check
  failed_when: device_check.rc != 0
  changed_when: false
  loop: "{{ ceph_osds[inventory_hostname_short] | default([]) }}"
  loop_control:
    label: "{{ item.device }}"

- name: Display device information
  ansible.builtin.debug:
    msg: "Device {{ item.item.device }}: {{ item.stdout }}"
  loop: "{{ device_check.results }}"
  loop_control:
    label: "{{ item.item.device }}"
  when: ansible_verbosity > 0

- name: Wipe existing partitions on OSD devices
  ansible.builtin.command:
    cmd: "wipefs -a {{ item.device }}"
  when:
    - ceph_wipe_disks | default(false)
  loop: "{{ ceph_osds[inventory_hostname_short] | default([]) }}"
  loop_control:
    label: "{{ item.device }}"
  register: wipe_result
  changed_when: wipe_result.rc == 0

- name: Create OSDs from whole devices (no partitioning)
  ansible.builtin.command:
    cmd: >
      pveceph osd create {{ item.device }}
      {% if item.crush_device_class is defined and item.crush_device_class %}--crush-device-class {{ item.crush_device_class }}{% endif %}
      {% if item.db_device is defined and item.db_device %}--db_dev {{ item.db_device }}{% endif %}
      {% if item.wal_device is defined and item.wal_device %}--wal_dev {{ item.wal_device }}{% endif %}
  when:
    - item.partitions | default(1) == 1
  loop: "{{ ceph_osds[inventory_hostname_short] | default([]) }}"
  loop_control:
    label: "{{ item.device }}"
  register: osd_create_whole
  changed_when: "'successfully created' in osd_create_whole.stdout | default('')"
  failed_when:
    - osd_create_whole.rc != 0
    - "'already in use' not in osd_create_whole.stderr | default('')"
    - "'ceph-volume' not in osd_create_whole.stderr | default('')"

# --osds-per-device splits one device into several OSDs via ceph-volume lvm batch;
# it requires Proxmox VE 8+ (pveceph osd create has no --size option)
- name: Create multiple OSDs per device
  ansible.builtin.command:
    cmd: >
      pveceph osd create {{ item.device }}
      --osds-per-device {{ item.partitions }}
      {% if item.crush_device_class is defined and item.crush_device_class %}--crush-device-class {{ item.crush_device_class }}{% endif %}
      {% if item.db_device is defined and item.db_device %}--db_dev {{ item.db_device }}{% endif %}
      {% if item.wal_device is defined and item.wal_device %}--wal_dev {{ item.wal_device }}{% endif %}
  when:
    - item.partitions | default(1) > 1
  loop: "{{ ceph_osds[inventory_hostname_short] | default([]) }}"
  loop_control:
    label: "{{ item.device }} ({{ item.partitions }} OSDs)"
  register: osd_create_partition
  changed_when: "'successfully created' in osd_create_partition.stdout | default('')"
  failed_when:
    - osd_create_partition.rc != 0
    - "'already in use' not in osd_create_partition.stderr | default('')"

# The play runs serial: 1, so only the last node can see every OSD; gate the
# cluster-wide wait there. The expected count flattens the per-node OSD lists
# and defaults partitions to 1.
- name: Wait for all OSDs to come up
  ansible.builtin.command:
    cmd: ceph osd tree --format json
  register: osd_tree
  changed_when: false
  delegate_to: "{{ groups[cluster_group | default('matrix_cluster')][0] }}"
  run_once: true
  when: is_ceph_last_node
  until: >
    (osd_tree.stdout | from_json).nodes
    | selectattr('type', 'equalto', 'osd')
    | selectattr('status', 'equalto', 'up')
    | list | length >= expected_osd_count | int
  retries: 20
  delay: 10
  vars:
    expected_osd_count: "{{ ceph_osds.values() | list | flatten | map(attribute='partitions', default=1) | sum }}"
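
For reference, the tasks above boil down to roughly the following manual commands per node (device names are the Matrix examples; --osds-per-device needs Proxmox VE 8+):

pveceph osd create /dev/nvme1n1 --osds-per-device 2 --crush-device-class nvme
pveceph osd create /dev/nvme2n1 --osds-per-device 2 --crush-device-class nvme

# Inspect what ceph-volume produced
ceph-volume lvm list
ceph osd tree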

Phase 5: Create and Configure Pools

Step 6: Create CEPH Pools

# roles/proxmox_ceph/tasks/pools.yml
---
- name: Get existing CEPH pools
  ansible.builtin.command:
    cmd: ceph osd pool ls
  register: existing_pools
  changed_when: false

- name: Create CEPH pools
  ansible.builtin.command:
    cmd: >
      ceph osd pool create {{ item.name }}
      {{ item.pg_num }}
      {{ item.pgp_num | default(item.pg_num) }}
  when: item.name not in existing_pools.stdout_lines
  loop: "{{ ceph_pools }}"
  loop_control:
    label: "{{ item.name }}"
  register: pool_create
  changed_when: pool_create.rc == 0

- name: Set pool replication size
  ansible.builtin.command:
    cmd: "ceph osd pool set {{ item.name }} size {{ item.size }}"
  loop: "{{ ceph_pools }}"
  loop_control:
    label: "{{ item.name }}"
  register: pool_size
  changed_when: "'set pool' in pool_size.stdout"

- name: Set pool minimum replication size
  ansible.builtin.command:
    cmd: "ceph osd pool set {{ item.name }} min_size {{ item.min_size }}"
  loop: "{{ ceph_pools }}"
  loop_control:
    label: "{{ item.name }}"
  register: pool_min_size
  changed_when: "'set pool' in pool_min_size.stdout"

- name: Set pool application
  ansible.builtin.command:
    cmd: "ceph osd pool application enable {{ item.name }} {{ item.application }}"
  when: item.application is defined
  loop: "{{ ceph_pools }}"
  loop_control:
    label: "{{ item.name }}"
  register: pool_app
  changed_when: "'enabled application' in pool_app.stdout"
  failed_when:
    - pool_app.rc != 0
    - "'already enabled' not in pool_app.stderr"

- name: Enable compression on pools
  ansible.builtin.command:
    cmd: "ceph osd pool set {{ item.name }} compression_mode aggressive"
  when: item.compression | default(false)
  loop: "{{ ceph_pools }}"
  loop_control:
    label: "{{ item.name }}"
  register: pool_compression
  changed_when: "'set pool' in pool_compression.stdout"

- name: Set compression algorithm
  ansible.builtin.command:
    cmd: "ceph osd pool set {{ item.name }} compression_algorithm {{ item.compression_algorithm | default('zstd') }}"
  when: item.compression | default(false)
  loop: "{{ ceph_pools }}"
  loop_control:
    label: "{{ item.name }}"
  register: pool_compression_algo
  changed_when: "'set pool' in pool_compression_algo.stdout"

Phase 6: Verify CEPH Health

Step 7: Health Verification

# roles/proxmox_ceph/tasks/verify.yml
---
- name: Wait for CEPH to stabilize
  ansible.builtin.pause:
    seconds: 30

- name: Check CEPH cluster health
  ansible.builtin.command:
    cmd: ceph health
  register: ceph_health
  changed_when: false
  delegate_to: "{{ groups[cluster_group | default('matrix_cluster')][0] }}"
  run_once: true

- name: Get CEPH status
  ansible.builtin.command:
    cmd: ceph status --format json
  register: ceph_status
  changed_when: false
  delegate_to: "{{ groups[cluster_group | default('matrix_cluster')][0] }}"
  run_once: true

- name: Parse CEPH status
  ansible.builtin.set_fact:
    ceph_status_data: "{{ ceph_status.stdout | from_json }}"

- name: Calculate expected OSD count
  ansible.builtin.set_fact:
    expected_osd_count: "{{ ceph_osds.values() | list | flatten | map(attribute='partitions', default=1) | sum }}"
  run_once: true

# The cluster-wide assertions only hold on the last serial batch; earlier
# batches see a partial cluster
- name: Verify OSD count
  ansible.builtin.assert:
    that:
      - ceph_status_data.osdmap.num_osds | int == expected_osd_count | int
    fail_msg: "Expected {{ expected_osd_count }} OSDs but found {{ ceph_status_data.osdmap.num_osds }}"
  run_once: true
  when: is_ceph_last_node

- name: Verify all OSDs are up
  ansible.builtin.assert:
    that:
      - ceph_status_data.osdmap.num_up_osds == ceph_status_data.osdmap.num_osds
    fail_msg: "Not all OSDs are up: {{ ceph_status_data.osdmap.num_up_osds }}/{{ ceph_status_data.osdmap.num_osds }}"
  run_once: true
  when: is_ceph_last_node

- name: Verify all OSDs are in
  ansible.builtin.assert:
    that:
      - ceph_status_data.osdmap.num_in_osds == ceph_status_data.osdmap.num_osds
    fail_msg: "Not all OSDs are in cluster: {{ ceph_status_data.osdmap.num_in_osds }}/{{ ceph_status_data.osdmap.num_osds }}"
  run_once: true
  when: is_ceph_last_node

- name: Wait for PGs to become active+clean
  ansible.builtin.command:
    cmd: ceph pg stat --format json
  register: pg_stat
  changed_when: false
  delegate_to: "{{ groups[cluster_group | default('matrix_cluster')][0] }}"
  run_once: true
  when: is_ceph_last_node
  until: >
    (pg_stat.stdout | from_json).num_pg_by_state
    | selectattr('name', 'equalto', 'active+clean')
    | map(attribute='num')
    | sum == (pg_stat.stdout | from_json).num_pgs
  retries: 60
  delay: 10

- name: Display CEPH cluster summary
  ansible.builtin.debug:
    msg: |
      CEPH Cluster Health: {{ ceph_health.stdout }}
      Total OSDs: {{ ceph_status_data.osdmap.num_osds }}
      OSDs Up: {{ ceph_status_data.osdmap.num_up_osds }}
      OSDs In: {{ ceph_status_data.osdmap.num_in_osds }}
      PGs: {{ ceph_status_data.pgmap.num_pgs }}
      Data: {{ ceph_status_data.pgmap.bytes_used | default(0) | human_readable }}
      Available: {{ ceph_status_data.pgmap.bytes_avail | default(0) | human_readable }}
  delegate_to: "{{ groups[cluster_group | default('matrix_cluster')][0] }}"
  run_once: true
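
While the PG wait loop runs, progress can be followed live from any node:

ceph -s    # health, OSD counts, PG states at a glance
ceph -w    # stream cluster events until interrupted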

Matrix Cluster Configuration Example

# group_vars/matrix_cluster.yml (CEPH section)
---
# CEPH configuration
ceph_enabled: true
ceph_repository: "no-subscription"  # or "enterprise" with subscription
ceph_network: "192.168.5.0/24"          # vmbr1 - Public network
ceph_cluster_network: "192.168.7.0/24"  # vmbr2 - Private network

# OSD configuration (4 OSDs per node = 12 total)
ceph_osds:
  foxtrot:
    - device: /dev/nvme1n1
      partitions: 2  # Create 2 OSDs per 4TB NVMe (via --osds-per-device)
      db_device: null
      wal_device: null
      crush_device_class: nvme
    - device: /dev/nvme2n1
      partitions: 2
      db_device: null
      wal_device: null
      crush_device_class: nvme

  golf:
    - device: /dev/nvme1n1
      partitions: 2
      crush_device_class: nvme
    - device: /dev/nvme2n1
      partitions: 2
      crush_device_class: nvme

  hotel:
    - device: /dev/nvme1n1
      partitions: 2
      crush_device_class: nvme
    - device: /dev/nvme2n1
      partitions: 2
      crush_device_class: nvme

# Pool configuration
ceph_pools:
  - name: vm_ssd
    pg_num: 128
    pgp_num: 128
    size: 3           # Replicate across 3 nodes
    min_size: 2       # Minimum 2 replicas required
    application: rbd
    compression: false

  - name: vm_containers
    pg_num: 64
    pgp_num: 64
    size: 3
    min_size: 2
    application: rbd
    compression: true
    compression_algorithm: zstd

# Safety flags
ceph_wipe_disks: false  # Set to true for fresh deployment (DESTRUCTIVE!)
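
The pg_num values follow the usual rule of thumb of roughly 100 PGs per OSD after replication:

# Target: ~100 PGs per OSD across all pools
#   PG budget ≈ (num_osds × 100) / replication_size
#   Matrix:   (12 × 100) / 3 = 400
# vm_ssd (128) + vm_containers (64) = 192 PGs, well within budget,
# leaving headroom for future pools or the PG autoscaler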

Complete Playbook Example

# playbooks/ceph-deploy.yml
---
- name: Deploy CEPH Storage on Proxmox Cluster
  hosts: "{{ cluster_group | default('matrix_cluster') }}"
  become: true
  serial: 1  # Deploy one node at a time; cluster-wide checks in the role gate on the last node

  pre_tasks:
    - name: Verify cluster is healthy
      ansible.builtin.command:
        cmd: pvecm status
      register: cluster_check
      changed_when: false
      # pvecm pads its output into columns, so match with a whitespace-tolerant regex
      failed_when: cluster_check.stdout is not search('Quorate:\s+Yes')

    - name: Verify CEPH networks MTU
      ansible.builtin.command:
        cmd: "ip link show {{ item }}"
      register: mtu_check
      changed_when: false
      failed_when: "'mtu 9000' not in mtu_check.stdout"
      loop:
        - vmbr1  # CEPH public
        - vmbr2  # CEPH private

    - name: Display CEPH configuration
      ansible.builtin.debug:
        msg: |
          Deploying CEPH to cluster: {{ cluster_name }}
          Public network: {{ ceph_network }}
          Cluster network: {{ ceph_cluster_network }}
          Expected OSDs: {{ ceph_osds.values() | list | flatten | map(attribute='partitions', default=1) | sum }}
      run_once: true

  roles:
    - role: proxmox_ceph

  post_tasks:
    - name: Display CEPH OSD tree
      ansible.builtin.command:
        cmd: ceph osd tree
      register: osd_tree_final
      changed_when: false
      delegate_to: "{{ groups[cluster_group | default('matrix_cluster')][0] }}"
      run_once: true

    - name: Show OSD tree
      ansible.builtin.debug:
        var: osd_tree_final.stdout_lines
      run_once: true

    - name: Display pool information
      ansible.builtin.command:
        cmd: ceph osd pool ls detail
      register: pool_info
      changed_when: false
      delegate_to: "{{ groups[cluster_group | default('matrix_cluster')][0] }}"
      run_once: true

    - name: Show pool details
      ansible.builtin.debug:
        var: pool_info.stdout_lines
      run_once: true

Usage

Deploy CEPH to Matrix Cluster

# Check syntax
ansible-playbook playbooks/ceph-deploy.yml --syntax-check

# Deploy CEPH
ansible-playbook playbooks/ceph-deploy.yml --limit matrix_cluster

# Verify CEPH status
ansible -i inventory/proxmox.yml foxtrot -m shell -a "ceph status"
ansible -i inventory/proxmox.yml foxtrot -m shell -a "ceph osd tree"
ansible -i inventory/proxmox.yml foxtrot -m shell -a "ceph df"

Add mise Tasks

# .mise.toml
[tasks."ceph:deploy"]
description = "Deploy CEPH storage on cluster"
run = """
cd ansible
uv run ansible-playbook playbooks/ceph-deploy.yml
"""

[tasks."ceph:status"]
description = "Show CEPH cluster status"
run = """
ansible -i ansible/inventory/proxmox.yml foxtrot -m shell -a "ceph -s"
"""

[tasks."ceph:health"]
description = "Show CEPH health detail"
run = """
ansible -i ansible/inventory/proxmox.yml foxtrot -m shell -a "ceph health detail"
"""

Troubleshooting

OSDs Won't Create

Symptoms:

  • pveceph osd create fails with "already in use" error

Solutions:

  1. Check if disk has existing partitions: lsblk /dev/nvme1n1
  2. Wipe disk: wipefs -a /dev/nvme1n1 (DESTRUCTIVE!)
  3. Set ceph_wipe_disks: true in group_vars
  4. Check for existing LVM: pvdisplay, lvdisplay
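
Putting those steps together, a typical cleanup for a disk that previously held an OSD (destructive; the device path is the Matrix example):

# Remove LVM state left behind by a previous ceph-volume run
ceph-volume lvm zap /dev/nvme1n1 --destroy

# Clear any remaining signatures and confirm the disk is clean
wipefs -a /dev/nvme1n1
lsblk /dev/nvme1n1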

PGs Stuck in Creating

Symptoms:

  • PGs stay in "creating" state for extended period

Solutions:

  1. Check OSD status: ceph osd tree
  2. Verify all OSDs are up and in: ceph osd stat
  3. Check mon/mgr status: ceph mon stat, ceph mgr stat
  4. Review logs: journalctl -u ceph-osd@*.service -n 100
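
For deeper inspection, CEPH can list exactly which PGs are stuck and why:

ceph pg dump_stuck inactive
ceph pg dump_stuck unclean
ceph health detail    # names the affected PGs and OSDs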

Poor CEPH Performance

Symptoms:

  • Slow VM disk I/O

Solutions:

  1. Verify MTU 9000: ip link show vmbr1 | grep mtu
  2. Test network throughput: iperf3 between nodes
  3. Check OSD utilization: ceph osd df
  4. Verify SSD/NVMe is being used: ceph osd tree
  5. Check for rebalancing: ceph -s (look for "recovery")
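
A quick throughput test across the CEPH private network (the address is illustrative; use a node's 192.168.7.0/24 IP):

# On the first node
iperf3 -s

# On a second node, four parallel streams toward the first
iperf3 -c 192.168.7.10 -P 4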

References

  • ProxSpray analysis: docs/proxspray-analysis.md (lines 1431-1562)
  • Proxmox VE CEPH documentation
  • CEPH deployment best practices
  • Ansible CEPH automation pattern