CEPH Storage Automation Patterns

Best practices for automating CEPH cluster deployment in Proxmox VE environments.

Pattern: Declarative CEPH OSD Configuration

Problem: ProxSpray leaves OSD creation as a manual step, defeating the purpose of automation.

Solution: Fully automate OSD creation with declarative configuration that specifies devices and partitioning.

Configuration Model

# group_vars/matrix_cluster.yml
---
# CEPH network configuration
ceph_enabled: true
ceph_network: "192.168.5.0/24"          # Public network (vmbr1)
ceph_cluster_network: "192.168.7.0/24"  # Private network (vmbr2)

# OSD configuration per node (4 OSDs per node = 12 total)
ceph_osds:
  foxtrot:
    - device: /dev/nvme1n1
      partitions: 2  # Create 2 OSDs per 4TB NVMe
      db_device: null
      wal_device: null
      crush_device_class: nvme
    - device: /dev/nvme2n1
      partitions: 2
      db_device: null
      wal_device: null
      crush_device_class: nvme

  golf:
    - device: /dev/nvme1n1
      partitions: 2
      crush_device_class: nvme
    - device: /dev/nvme2n1
      partitions: 2
      crush_device_class: nvme

  hotel:
    - device: /dev/nvme1n1
      partitions: 2
      crush_device_class: nvme
    - device: /dev/nvme2n1
      partitions: 2
      crush_device_class: nvme

# Pool configuration
ceph_pools:
  - name: vm_ssd
    pg_num: 128
    pgp_num: 128
    size: 3           # Replicate across 3 nodes
    min_size: 2       # Minimum 2 replicas required
    application: rbd
    crush_rule: replicated_rule
    compression: false

  - name: vm_containers
    pg_num: 64
    pgp_num: 64
    size: 3
    min_size: 2
    application: rbd
    crush_rule: replicated_rule
    compression: true
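
Because the entire deployment is driven by this data structure, it can pay off to validate it before any CEPH command runs. The following preflight sketch is illustrative and not part of the original role; the file name and assertion wording are assumptions:

# roles/proxmox_ceph/tasks/preflight.yml (illustrative sketch)
---
- name: Validate ceph_osds hostnames against the cluster group
  ansible.builtin.assert:
    that:
      - item.key in (groups[cluster_group] | map('extract', hostvars, 'inventory_hostname_short') | list)
    fail_msg: "ceph_osds entry '{{ item.key }}' does not match any host in {{ cluster_group }}"
  loop: "{{ ceph_osds | dict2items }}"
  loop_control:
    label: "{{ item.key }}"
  run_once: true

- name: Validate OSD device definitions
  ansible.builtin.assert:
    that:
      - osd.device is match('^/dev/')
      - (osd.partitions | default(1) | int) >= 1
    fail_msg: "Invalid OSD definition: {{ osd }}"
  loop: "{{ ceph_osds | dict2items | map(attribute='value') | flatten }}"
  loop_control:
    loop_var: osd
    label: "{{ osd.device }}"
  run_once: true

Running these asserts once per play causes typos in hostnames or device paths to fail fast, before any destructive operation.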

Pattern: Idempotent CEPH Installation

Problem: CEPH installation commands fail when CEPH is already installed.

Solution: Check CEPH status before attempting installation.

Implementation

# roles/proxmox_ceph/tasks/install.yml
---
- name: Check if CEPH is already installed
  ansible.builtin.stat:
    path: /etc/pve/ceph.conf
  register: ceph_conf_check

- name: Check CEPH packages
  ansible.builtin.command:
    cmd: dpkg -l ceph-common
  register: ceph_package_check
  failed_when: false
  changed_when: false

- name: Install CEPH packages
  ansible.builtin.command:
    cmd: "pveceph install --repository no-subscription"
  when:
    - ceph_package_check.rc != 0
  register: ceph_install
  changed_when: "'installed' in ceph_install.stdout"

- name: Verify CEPH installation
  ansible.builtin.command:
    cmd: ceph --version
  register: ceph_version
  changed_when: false
  failed_when: ceph_version.rc != 0
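
pveceph also assumes the node has already joined a Proxmox cluster (it relies on corosync and pmxcfs). Making that requirement explicit avoids a confusing failure later; a minimal preflight sketch, assumed to run before the install tasks:

- name: Verify node is part of a Proxmox cluster before installing CEPH
  ansible.builtin.command:
    cmd: pvecm status
  register: pvecm_status
  changed_when: false
  failed_when: pvecm_status.rc != 0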

Pattern: CEPH Cluster Initialization

Problem: A CEPH cluster can only be initialized once, so the initialization tasks must be idempotent.

Solution: Check for existing cluster configuration before initialization.

Implementation

# roles/proxmox_ceph/tasks/init.yml
---
- name: Check if CEPH cluster is initialized
  ansible.builtin.command:
    cmd: ceph status
  register: ceph_status_check
  failed_when: false
  changed_when: false

- name: Set CEPH initialization facts
  ansible.builtin.set_fact:
    ceph_initialized: "{{ ceph_status_check.rc == 0 }}"
    is_ceph_first_node: "{{ inventory_hostname == groups[cluster_group][0] }}"

- name: Initialize CEPH cluster on first node
  ansible.builtin.command:
    cmd: "pveceph init --network {{ ceph_network }} --cluster-network {{ ceph_cluster_network }}"
  when:
    - is_ceph_first_node | default(false)
    - not ceph_initialized
  register: ceph_init
  changed_when: ceph_init.rc == 0

- name: Wait for CEPH cluster to initialize
  ansible.builtin.pause:
    seconds: 15
  when: ceph_init.changed
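
The fixed pause works, but since pveceph init writes /etc/pve/ceph.conf into pmxcfs, polling for that file is a slightly more deterministic wait. A sketch of that alternative (retry counts are arbitrary):

- name: Wait for CEPH configuration to appear in pmxcfs
  ansible.builtin.stat:
    path: /etc/pve/ceph.conf
  register: ceph_conf_after_init
  until: ceph_conf_after_init.stat.exists
  retries: 12
  delay: 5
  when: ceph_init.changed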

Pattern: CEPH Monitor Creation

Problem: Monitors must be created in a specific order and verified for quorum.

Solution: Create monitors with proper ordering and quorum verification.

Implementation

# roles/proxmox_ceph/tasks/monitors.yml
---
- name: Check existing CEPH monitors
  ansible.builtin.command:
    cmd: ceph mon dump
  register: mon_dump
  delegate_to: "{{ groups[cluster_group][0] }}"
  run_once: true
  failed_when: false
  changed_when: false

- name: Set monitor facts
  ansible.builtin.set_fact:
    has_monitor: "{{ inventory_hostname in mon_dump.stdout }}"
  when: mon_dump.rc == 0

- name: Set local is_ceph_first_node fact
  ansible.builtin.set_fact:
    is_ceph_first_node: "{{ inventory_hostname == groups[cluster_group][0] }}"

- name: Create CEPH monitor on first node
  ansible.builtin.command:
    cmd: pveceph mon create
  when:
    - is_ceph_first_node | default(false)
    - not has_monitor | default(false)
  register: mon_create_first
  changed_when: mon_create_first.rc == 0

- name: Wait for first monitor to stabilize
  ansible.builtin.pause:
    seconds: 10
  when: mon_create_first.changed

- name: Create CEPH monitors on other nodes
  ansible.builtin.command:
    cmd: pveceph mon create
  when:
    - not (is_ceph_first_node | default(false))
    - not has_monitor | default(false)
  register: mon_create_others
  changed_when: mon_create_others.rc == 0

- name: Verify monitor quorum
  ansible.builtin.command:
    cmd: ceph quorum_status
  register: quorum_status
  changed_when: false
  delegate_to: "{{ groups[cluster_group][0] }}"
  run_once: true
  vars:
    expected_mons: "{{ ceph_mon_count | default(3) }}"
  failed_when: ((quorum_status.stdout | from_json).quorum | length) < (expected_mons | int)
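
Freshly created monitors can take a few seconds to join quorum, so a retrying variant of the quorum check avoids spurious failures on the first pass. A sketch (retry counts are arbitrary):

- name: Wait for monitor quorum to reach expected size
  ansible.builtin.command:
    cmd: ceph quorum_status
  register: quorum_wait
  changed_when: false
  delegate_to: "{{ groups[cluster_group][0] }}"
  run_once: true
  until: >-
    quorum_wait.rc == 0 and
    ((quorum_wait.stdout | from_json).quorum | length) >= (ceph_mon_count | default(3) | int)
  retries: 12
  delay: 10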

Pattern: CEPH Manager Creation

Problem: Managers provide the web interface and monitoring; they should run on all nodes for HA.

Solution: Create managers on all nodes with proper verification.

Implementation

# roles/proxmox_ceph/tasks/managers.yml
---
- name: Check existing CEPH managers
  ansible.builtin.command:
    cmd: ceph mgr dump
  register: mgr_dump
  delegate_to: "{{ groups[cluster_group][0] }}"
  run_once: true
  failed_when: false
  changed_when: false

- name: Set manager facts
  ansible.builtin.set_fact:
    has_manager: "{{ inventory_hostname in mgr_dump.stdout }}"
  when: mgr_dump.rc == 0

- name: Create CEPH manager
  ansible.builtin.command:
    cmd: pveceph mgr create
  when: not has_manager | default(false)
  register: mgr_create
  changed_when: mgr_create.rc == 0

- name: Enable CEPH dashboard module
  ansible.builtin.command:
    cmd: ceph mgr module enable dashboard
  delegate_to: "{{ groups[cluster_group][0] }}"
  run_once: true
  register: dashboard_enable
  changed_when: "'already enabled' not in dashboard_enable.stderr"
  failed_when:
    - dashboard_enable.rc != 0
    - "'already enabled' not in dashboard_enable.stderr"

Pattern: Automated OSD Creation with Partitioning

Problem: Manual OSD creation is error-prone and doesn't support partitioning large drives.

Solution: Automate partition creation and OSD deployment.

Implementation

# roles/proxmox_ceph/tasks/osd_create.yml
---
- name: Get list of existing OSDs
  ansible.builtin.command:
    cmd: pveceph osd ls
  register: existing_osds
  changed_when: false
  failed_when: false

- name: Probe existing CEPH volumes
  ansible.builtin.command:
    cmd: ceph-volume lvm list --format json
  register: ceph_volume_probe
  changed_when: false
  failed_when: false

- name: Check OSD devices availability
  ansible.builtin.command:
    cmd: "lsblk -ndo NAME,TYPE {{ item.device }}"
  register: device_check
  failed_when: device_check.rc != 0
  changed_when: false
  loop: "{{ ceph_osds[inventory_hostname_short] | default([]) }}"
  loop_control:
    label: "{{ item.device }}"

- name: Wipe existing partitions on OSD devices
  ansible.builtin.command:
    cmd: "wipefs -a {{ item.device }}"
  when:
    - ceph_volume_probe.rc == 0
    - ceph_volume_probe.stdout | from_json | dict2items | selectattr('value.0.devices', 'defined') | map(attribute='value.0.devices') | flatten | select('match', '^' + item.device) | list | length == 0
    - ceph_wipe_disks | default(false)
  loop: "{{ ceph_osds[inventory_hostname_short] | default([]) }}"
  loop_control:
    label: "{{ item.device }}"
  register: wipe_result
  changed_when: wipe_result.rc == 0

- name: Build list of partitions to create
  ansible.builtin.set_fact:
    osd_partitions: >-
      {% set result = [] -%}
      {% for osd in ceph_osds[inventory_hostname_short] | default([]) -%}
        {% if (osd.partitions | default(1) | int) > 1 -%}
          {% for part_num in range(1, (osd.partitions | int) + 1) -%}
            {% set _ = result.append({
              'device': osd.device,
              'partition_num': part_num,
              'total_partitions': osd.partitions,
              'db_device': osd.get('db_device'),
              'wal_device': osd.get('wal_device')
            }) -%}
          {% endfor -%}
        {% endif -%}
      {% endfor -%}
      {{ result }}

- name: Create partitions for multiple OSDs per device
  community.general.parted:
    device: "{{ item.device }}"
    number: "{{ item.partition_num }}"
    state: present
    part_start: "{{ ((item.partition_num - 1) * (100 / item.total_partitions)) }}%"
    part_end: "{{ (item.partition_num * (100 / item.total_partitions)) }}%"
    label: gpt
  loop: "{{ osd_partitions }}"
  loop_control:
    label: "{{ item.device }}{{ 'p' if item.device.startswith('/dev/nvme') else '' }}{{ item.partition_num }}"

- name: Create OSDs from whole devices
  ansible.builtin.command:
    cmd: >
      pveceph osd create {{ item.device }}
      {% if item.db_device %}--db_dev {{ item.db_device }}{% endif %}
      {% if item.wal_device %}--wal_dev {{ item.wal_device }}{% endif %}
  when:
    - item.partitions | default(1) == 1
    - ceph_volume_probe.rc == 0
    - ceph_volume_probe.stdout | from_json | dict2items | selectattr('value.0.devices', 'defined') | map(attribute='value.0.devices') | flatten | select('match', '^' + item.device + '$') | list | length == 0
  loop: "{{ ceph_osds[inventory_hostname_short] | default([]) }}"
  loop_control:
    label: "{{ item.device }}"
  register: osd_create_whole
  changed_when: "'successfully created' in osd_create_whole.stdout"
  failed_when:
    - osd_create_whole.rc != 0
    - "'already in use' not in osd_create_whole.stderr"

- name: Create OSDs from partitions
  ansible.builtin.command:
    cmd: >
      pveceph osd create {{ item.device }}{{ 'p' if item.device.startswith('/dev/nvme') else '' }}{{ item.partition_num }}
      {% if item.db_device %}--db_dev {{ item.db_device }}{% endif %}
      {% if item.wal_device %}--wal_dev {{ item.wal_device }}{% endif %}
  when:
    - ceph_volume_probe.rc == 0
    - ceph_volume_probe.stdout | from_json | dict2items | selectattr('value.0.devices', 'defined') | map(attribute='value.0.devices') | flatten | select('match', '^' + item.device + ('p' if item.device.startswith('/dev/nvme') else '') + (item.partition_num | string) + '$') | list | length == 0
  loop: "{{ osd_partitions }}"
  loop_control:
    label: "{{ item.device }}{{ 'p' if item.device.startswith('/dev/nvme') else '' }}{{ item.partition_num }}"
  register: osd_create_partition
  changed_when: "'successfully created' in osd_create_partition.stdout"
  failed_when:
    - osd_create_partition.rc != 0
    - "'already in use' not in osd_create_partition.stderr"

- name: Wait for OSDs to come up
  ansible.builtin.command:
    cmd: ceph osd tree
  register: osd_tree
  changed_when: false
  delegate_to: "{{ groups[cluster_group][0] }}"
  run_once: true
  until: "'up' in osd_tree.stdout"
  retries: 10
  delay: 5
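
Note that crush_device_class is declared in the configuration model but not consumed by the tasks above. Recent pveceph releases also accept a --crush-device-class option on osd create (verify against your PVE version); alternatively, the class can be corrected after creation with plain ceph commands. A hedged sketch for a single OSD, where osd_id is a hypothetical placeholder for the OSD number to re-class:

- name: Remove any auto-assigned CRUSH device class (osd_id is a placeholder)
  ansible.builtin.command:
    cmd: "ceph osd crush rm-device-class osd.{{ osd_id }}"
  delegate_to: "{{ groups[cluster_group][0] }}"
  register: rm_class
  changed_when: rm_class.rc == 0
  failed_when: false  # harmless if no class was assigned yet

- name: Set the declared CRUSH device class (osd_id is a placeholder)
  ansible.builtin.command:
    cmd: "ceph osd crush set-device-class nvme osd.{{ osd_id }}"
  delegate_to: "{{ groups[cluster_group][0] }}"
  register: set_class
  changed_when: set_class.rc == 0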

Pattern: CEPH Pool Creation

Problem: Pools must be created with proper PG counts, replication, and application tags.

Solution: Declarative pool configuration with validation.

Implementation

# roles/proxmox_ceph/tasks/pools.yml
---
- name: Get existing CEPH pools
  ansible.builtin.command:
    cmd: ceph osd pool ls
  register: existing_pools
  changed_when: false

- name: Create CEPH pools
  ansible.builtin.command:
    cmd: >
      ceph osd pool create {{ item.name }}
      {{ item.pg_num }}
      {{ item.pgp_num | default(item.pg_num) }}
      replicated
      {{ item.crush_rule | default('replicated_rule') }}
  when: item.name not in existing_pools.stdout_lines
  loop: "{{ ceph_pools }}"
  loop_control:
    label: "{{ item.name }}"
  register: pool_create
  changed_when: pool_create.rc == 0

- name: Get current pool replication size
  ansible.builtin.command:
    cmd: "ceph osd pool get {{ item.name }} size -f json"
  loop: "{{ ceph_pools }}"
  loop_control:
    label: "{{ item.name }}"
  register: pool_size_current
  changed_when: false

- name: Set pool replication size
  ansible.builtin.command:
    cmd: "ceph osd pool set {{ item.name }} size {{ item.size }}"
  when: (pool_size_current.results[loop_index].stdout | from_json).size != item.size
  loop: "{{ ceph_pools }}"
  loop_control:
    label: "{{ item.name }}"
    index_var: loop_index

- name: Get current pool minimum replication size
  ansible.builtin.command:
    cmd: "ceph osd pool get {{ item.name }} min_size -f json"
  loop: "{{ ceph_pools }}"
  loop_control:
    label: "{{ item.name }}"
  register: pool_min_size_current
  changed_when: false

- name: Set pool minimum replication size
  ansible.builtin.command:
    cmd: "ceph osd pool set {{ item.name }} min_size {{ item.min_size }}"
  when: (pool_min_size_current.results[loop_index].stdout | from_json).min_size != item.min_size
  loop: "{{ ceph_pools }}"
  loop_control:
    label: "{{ item.name }}"
    index_var: loop_index

- name: Get current pool applications
  ansible.builtin.command:
    cmd: "ceph osd pool application get {{ item.name }} -f json"
  when: item.application is defined
  loop: "{{ ceph_pools }}"
  loop_control:
    label: "{{ item.name }}"
  register: pool_app_current
  changed_when: false
  failed_when: false

- name: Set pool application
  ansible.builtin.command:
    cmd: "ceph osd pool application enable {{ item.name }} {{ item.application }}"
  when:
    - item.application is defined
    - pool_app_current.results[loop_index].rc == 0
    - item.application not in (pool_app_current.results[loop_index].stdout | from_json | default({}))
  loop: "{{ ceph_pools }}"
  loop_control:
    label: "{{ item.name }}"
    index_var: loop_index

- name: Get current pool compression mode
  ansible.builtin.command:
    cmd: "ceph osd pool get {{ item.name }} compression_mode -f json"
  when: item.compression | default(false)
  loop: "{{ ceph_pools }}"
  loop_control:
    label: "{{ item.name }}"
  register: pool_compression_current
  changed_when: false

- name: Enable compression on pools
  ansible.builtin.command:
    cmd: "ceph osd pool set {{ item.name }} compression_mode aggressive"
  when:
    - item.compression | default(false)
    - (pool_compression_current.results[loop_index].stdout | from_json).compression_mode != 'aggressive'
  loop: "{{ ceph_pools }}"
  loop_control:
    label: "{{ item.name }}"
    index_var: loop_index
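
Creating the pools makes them available to CEPH, but Proxmox still needs a storage definition before VMs or containers can use them. A minimal sketch using pvesm; the storage IDs mirror the pool names and the content types (images, rootdir) are an assumption to adjust per workload:

- name: Check existing Proxmox storage definitions
  ansible.builtin.command:
    cmd: pvesm status
  register: pvesm_status
  changed_when: false

- name: Register CEPH pools as Proxmox RBD storage
  ansible.builtin.command:
    cmd: "pvesm add rbd {{ item.name }} --pool {{ item.name }} --content images,rootdir"
  when: item.name not in pvesm_status.stdout
  loop: "{{ ceph_pools }}"
  loop_control:
    label: "{{ item.name }}"
  register: pvesm_add
  changed_when: pvesm_add.rc == 0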

Pattern: CEPH Health Verification

Problem: A CEPH deployment may appear to succeed while the cluster still has health issues.

Solution: Comprehensive health checks after deployment.

Implementation

# roles/proxmox_ceph/tasks/verify.yml
---
- name: Check CEPH cluster health
  ansible.builtin.command:
    cmd: ceph health
  register: ceph_health
  changed_when: false
  delegate_to: "{{ groups[cluster_group][0] }}"
  run_once: true

- name: Get CEPH status
  ansible.builtin.command:
    cmd: ceph status -f json
  register: ceph_status
  changed_when: false
  delegate_to: "{{ groups[cluster_group][0] }}"
  run_once: true

- name: Verify expected OSD count
  ansible.builtin.set_fact:
    expected_osd_count: >-
      {{
        ceph_osds
        | dict2items
        | map(attribute='value')
        | sum(start=[])
        | map(attribute='partitions', default=1)
        | map('int')
        | sum
      }}
  delegate_to: "{{ groups[cluster_group][0] }}"
  run_once: true

- name: Check OSD count matches expected
  ansible.builtin.assert:
    that:
      - "(ceph_status.stdout | from_json).osdmap.num_osds == (expected_osd_count | int)"
    fail_msg: >-
      Expected {{ expected_osd_count }} OSDs but found
      {{ (ceph_status.stdout | from_json).osdmap.num_osds }}
  delegate_to: "{{ groups[cluster_group][0] }}"
  run_once: true

- name: Check all OSDs are up
  ansible.builtin.command:
    cmd: ceph osd tree
  register: osd_tree
  changed_when: false
  failed_when: "'down' in osd_tree.stdout"
  delegate_to: "{{ groups[cluster_group][0] }}"
  run_once: true

- name: Verify PG status
  ansible.builtin.command:
    cmd: ceph pg stat
  register: pg_stat
  changed_when: false
  failed_when: "'active+clean' not in pg_stat.stdout"
  delegate_to: "{{ groups[cluster_group][0] }}"
  run_once: true
  retries: 30
  delay: 10
  until: "'active+clean' in pg_stat.stdout"

- name: Display CEPH status
  ansible.builtin.debug:
    msg: |
      CEPH Cluster Health: {{ ceph_health.stdout }}
      {{ ceph_status.stdout | from_json | to_nice_json }}
  delegate_to: "{{ groups[cluster_group][0] }}"
  run_once: true
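
The health output above is only displayed; if the role should hard-fail on an unhealthy cluster, an explicit assertion can be added. A sketch that tolerates HEALTH_WARN (common right after OSD creation while PGs peer) but fails on HEALTH_ERR:

- name: Re-check cluster health after verification
  ansible.builtin.command:
    cmd: ceph health detail
  register: ceph_health_final
  changed_when: false
  failed_when: false
  delegate_to: "{{ groups[cluster_group][0] }}"
  run_once: true

- name: Fail the play on HEALTH_ERR
  ansible.builtin.assert:
    that:
      - "'HEALTH_ERR' not in ceph_health_final.stdout"
    fail_msg: "CEPH reports HEALTH_ERR: {{ ceph_health_final.stdout }}"
    success_msg: "CEPH health: {{ ceph_health_final.stdout_lines | first | default('unknown') }}"
  delegate_to: "{{ groups[cluster_group][0] }}"
  run_once: true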

Anti-Pattern: Manual OSD Creation

Don't Do This (from ProxSpray):

- name: Create OSD on available disks (manual step required)
  ansible.builtin.debug:
    msg: |
      To create OSDs, run manually:
      pveceph osd create /dev/sda
      pveceph osd create /dev/sdb

Problems:

  • Defeats purpose of automation
  • Error-prone manual process
  • No consistency across nodes
  • Difficult to scale

Do This Instead: Use the declarative OSD configuration pattern shown above.

Complete Role Example

# roles/proxmox_ceph/tasks/main.yml
---
- name: Install CEPH packages
  ansible.builtin.include_tasks: install.yml

- name: Initialize CEPH cluster (first node only)
  ansible.builtin.include_tasks: init.yml
  when: inventory_hostname == groups[cluster_group][0]

- name: Create CEPH monitors
  ansible.builtin.include_tasks: monitors.yml

- name: Create CEPH managers
  ansible.builtin.include_tasks: managers.yml

- name: Create OSDs
  ansible.builtin.include_tasks: osd_create.yml
  when: ceph_osds[inventory_hostname_short] is defined

- name: Create CEPH pools
  ansible.builtin.include_tasks: pools.yml
  when: inventory_hostname == groups[cluster_group][0]

- name: Verify CEPH health
  ansible.builtin.include_tasks: verify.yml
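
Several variables referenced in the tasks (cluster_group, ceph_wipe_disks, ceph_mon_count, ceph_enabled) benefit from explicit role defaults so missing overrides fail predictably. A sketch of what defaults/main.yml might look like; the values are illustrative and mirror the Matrix cluster example:

# roles/proxmox_ceph/defaults/main.yml (illustrative defaults)
---
ceph_enabled: true
cluster_group: matrix_cluster
ceph_mon_count: 3
ceph_wipe_disks: false          # destructive; only enable for fresh deployments
ceph_network: "192.168.5.0/24"
ceph_cluster_network: "192.168.7.0/24"
ceph_osds: {}
ceph_pools: []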

Testing

# Syntax check
ansible-playbook --syntax-check playbooks/ceph-deploy.yml

# Check mode (limited - CEPH commands don't support check mode well)
ansible-playbook playbooks/ceph-deploy.yml --check --diff

# Deploy CEPH to Matrix cluster
ansible-playbook playbooks/ceph-deploy.yml --limit matrix_cluster

# Verify CEPH status
ansible -i inventory/proxmox.yml foxtrot -m shell -a "ceph status"
ansible -i inventory/proxmox.yml foxtrot -m shell -a "ceph osd tree"
ansible -i inventory/proxmox.yml foxtrot -m shell -a "ceph health detail"

Matrix Cluster Example

# playbooks/ceph-deploy.yml
---
- name: Deploy CEPH Storage on Matrix Cluster
  hosts: matrix_cluster
  become: true
  serial: 1  # Deploy one node at a time

  pre_tasks:
    - name: Verify network MTU
      ansible.builtin.command:
        cmd: "ip link show vmbr1"
      register: mtu_check
      changed_when: false
      failed_when: "'mtu 9000' not in mtu_check.stdout"

  roles:
    - role: proxmox_ceph
      vars:
        cluster_group: matrix_cluster
        ceph_wipe_disks: false  # Set to true for fresh deployment
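
The MTU pre_task checks the bridge, but it can also help to confirm each node actually has an address inside the CEPH public and cluster networks before deployment. A sketch of an additional pre_task for the playbook above, assuming the ansible.utils collection is installed for the ipaddr filter:

- name: Verify node has addresses on the CEPH networks
  ansible.builtin.assert:
    that:
      - ansible_all_ipv4_addresses | ansible.utils.ipaddr(ceph_network) | length > 0
      - ansible_all_ipv4_addresses | ansible.utils.ipaddr(ceph_cluster_network) | length > 0
    fail_msg: >-
      {{ inventory_hostname }} has no address in {{ ceph_network }}
      or {{ ceph_cluster_network }}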

References

  • ProxSpray analysis: docs/proxspray-analysis.md (lines 333-488)
  • Proxmox VE CEPH documentation
  • CEPH configuration reference
  • OSD deployment best practices