Error Handling Patterns

Overview

Proper error handling keeps Ansible playbooks robust and idempotent and makes failures easy to diagnose. This guide covers patterns from the Virgo-Core repository.

Core Concepts

changed_when

Controls when Ansible reports a task as "changed". Critical for idempotency with command and shell modules.

Syntax:

changed_when: <boolean expression>

failed_when

Controls when Ansible considers a task failed. Allows graceful handling of expected errors.

Syntax:

failed_when: <boolean expression>

register

Captures task output for later inspection and conditional logic.

Syntax:

register: variable_name
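
As a minimal sketch of how the three keywords fit together (the uname -r command and the kernel_version variable name are illustrative, not taken from the repository), a read-only command can register its output, never report "changed", and fail only on a non-zero return code:

```yaml
- name: Read kernel version (read-only check)
  ansible.builtin.command: uname -r
  register: kernel_version              # Holds rc, stdout, stderr, and more
  changed_when: false                   # Reading never changes the host
  failed_when: kernel_version.rc != 0   # Same as the default, stated explicitly

- name: Show captured output
  ansible.builtin.debug:
    msg: "Kernel: {{ kernel_version.stdout }}"
```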

Pattern 1: Idempotent Command Execution

Problem

By default, the command and shell modules report "changed" on every successful run, even if nothing actually changed.

Solution

Use changed_when to detect actual changes:

Example from repository:

- name: Create Proxmox API token
  ansible.builtin.command: >
    pveum user token add {{ system_username }}@{{ proxmox_user_realm }}
    {{ proxmox_token_name }}
  register: token_result
  changed_when: "'already exists' not in token_result.stderr"
  failed_when:
    - token_result.rc != 0
    - "'already exists' not in token_result.stderr"
  no_log: true

Explanation:

  1. register: token_result - Captures command output
  2. changed_when: "'already exists' not in token_result.stderr" - Only report "changed" if token didn't already exist
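  3. failed_when - List entries are combined with AND, so the task fails only if the return code is non-zero and the error is something other than "already exists" (an expected scenario)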
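
The list form of failed_when combines its entries with a logical AND. A sketch of the same task with the condition written as one explicit expression, which should behave identically (variable names are those of the example above):

```yaml
- name: Create Proxmox API token
  ansible.builtin.command: >
    pveum user token add {{ system_username }}@{{ proxmox_user_realm }}
    {{ proxmox_token_name }}
  register: token_result
  changed_when: "'already exists' not in token_result.stderr"
  # Fail only when BOTH hold: non-zero exit AND an error other than "already exists"
  failed_when: >-
    token_result.rc != 0 and
    'already exists' not in token_result.stderr
  no_log: true
```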

Pattern 2: Check Before Create

Problem

Creating resources that may already exist causes unnecessary errors.

Solution

Check for existence first, create conditionally:

Example:

- name: Check if VM template exists
  ansible.builtin.shell: |
    set -o pipefail
    qm list | awk '{print $1}' | grep -q "^{{ template_id }}$"
  args:
    executable: /bin/bash
  register: template_exists
  changed_when: false  # Checking doesn't change anything
  failed_when: false   # Don't fail if template not found

- name: Create VM template
  ansible.builtin.command: >
    qm create {{ template_id }}
    --name {{ template_name }}
    --memory 2048
    --cores 2
  when: template_exists.rc != 0  # Only create if check failed (doesn't exist)
  register: create_result

Key points:

  • changed_when: false - Read-only operation
  • failed_when: false - Expected that template might not exist
  • when: template_exists.rc != 0 - Conditional creation
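
When the resource leaves a known file behind, the check and the create can sometimes be folded into one task with the command module's creates argument. The sketch below assumes that Proxmox stores the configuration of the VM at /etc/pve/qemu-server/<vmid>.conf on the node that owns it; verify that path on your cluster before relying on it:

```yaml
- name: Create VM template unless its config file already exists
  ansible.builtin.command:
    cmd: >-
      qm create {{ template_id }}
      --name {{ template_name }}
      --memory 2048
      --cores 2
    # Assumed path; the command is not run (task reports "ok") when this file exists
    creates: "/etc/pve/qemu-server/{{ template_id }}.conf"
```

This trades the explicit check task for a file-existence test, so it only fits resources whose presence maps cleanly to a single path.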

Pattern 3: Verify After Create

Problem

Resource creation appears to succeed but may have failed silently.

Solution

Verify resource exists after creation:

Example:

- name: Create VM
  ansible.builtin.command: >
    qm create {{ vmid }}
    --name {{ vm_name }}
    --memory 4096
  register: create_result

- name: Verify VM was created
  ansible.builtin.shell: |
    set -o pipefail
    qm list | awk '{print $1}' | grep -q "^{{ vmid }}$"  # Exact ID match; a bare grep would also match e.g. 1010 for 101
  args:
    executable: /bin/bash
  register: verify_result
  changed_when: false
  failed_when: verify_result.rc != 0

Pattern 4: Graceful Failure Handling

Problem

Task failures may be expected in certain scenarios.

Solution

Use failed_when with specific conditions:

Example:

- name: Try to stop service
  ansible.builtin.systemd:
    name: myservice
    state: stopped
  register: stop_result
  # Allow failure if the service doesn't exist; the systemd module reports
  # a missing unit as "Could not find the requested service ..."
  failed_when:
    - stop_result.failed
    - "'Could not find' not in stop_result.msg"

Multiple failure conditions:

- name: Run migration
  ansible.builtin.command: /usr/bin/migrate-database
  register: migrate_result
  failed_when:
    - migrate_result.rc != 0
    - "'already applied' not in migrate_result.stdout"
    - "'no changes' not in migrate_result.stdout"
  # Success if: rc=0, OR "already applied", OR "no changes"
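
Because list entries are ANDed, a task that should fail when any one of several conditions holds needs an explicit or. A small sketch, assuming (purely for illustration) that the migration tool prints the word ERROR to stdout on a partial failure:

```yaml
- name: Run migration (fail on bad exit code OR error text)
  ansible.builtin.command: /usr/bin/migrate-database
  register: migrate_result
  changed_when: "'no changes' not in migrate_result.stdout"
  # Assumed output format: "ERROR" appears in stdout when a step fails
  failed_when: >-
    migrate_result.rc != 0 or
    'ERROR' in migrate_result.stdout
```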

Pattern 5: Block with Rescue

Problem

Need to handle failures and perform cleanup.

Solution

Use block/rescue/always:

Example:

- name: Deploy application
  block:
    - name: Stop application
      ansible.builtin.systemd:
        name: myapp
        state: stopped

    - name: Deploy new version
      ansible.builtin.copy:
        src: myapp-v2.0
        dest: /usr/bin/myapp

    - name: Start application
      ansible.builtin.systemd:
        name: myapp
        state: started

  rescue:
    - name: Rollback to previous version
      ansible.builtin.copy:
        src: myapp-backup
        dest: /usr/bin/myapp

    - name: Start application (rollback)
      ansible.builtin.systemd:
        name: myapp
        state: started

    - name: Report failure
      ansible.builtin.fail:
        msg: "Deployment failed, rolled back to previous version"

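  always:
    # ansible.builtin.file does not expand wildcards, so find the matches first
    - name: Find temp files
      ansible.builtin.find:
        paths: /tmp
        patterns: "deploy-*"
        file_type: any
      register: deploy_temp_files

    - name: Cleanup temp files
      ansible.builtin.file:
        path: "{{ item.path }}"
        state: absent
      loop: "{{ deploy_temp_files.files }}"
      loop_control:
        label: "{{ item.path }}"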

Explanation:

  • block: - Main tasks
  • rescue: - Runs if any task in block fails
  • always: - Runs regardless of success/failure
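
Inside rescue, Ansible exposes the special variables ansible_failed_task and ansible_failed_result, which describe the task that triggered the rescue. A brief sketch of using them to report what went wrong (the script path is a hypothetical placeholder):

```yaml
- name: Deploy with failure details
  block:
    - name: Run deployment step
      ansible.builtin.command: /usr/local/bin/deploy-step   # Hypothetical script
      changed_when: true

  rescue:
    - name: Show which task failed and why
      ansible.builtin.debug:
        msg: >-
          Task '{{ ansible_failed_task.name }}' failed:
          {{ ansible_failed_result.msg | default(ansible_failed_result.stderr | default('no details')) }}
```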

Pattern 6: Retry with Until

Problem

Transient failures need retries before giving up.

Solution

Use until, retries, delay:

Example:

- name: Wait for service to be ready
  ansible.builtin.uri:
    url: http://localhost:8080/health
    status_code: 200
  register: health_check
  until: health_check.status == 200
  retries: 30
  delay: 10
  # Retry every 10 seconds, up to 30 times (5 minutes total)

With command:

- name: Wait for guest agent to report network interfaces
  ansible.builtin.command: qm agent {{ vmid }} network-get-interfaces
  register: vm_network
  until: vm_network.rc == 0
  retries: 12
  delay: 5
  changed_when: false
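
For the common case of waiting on a TCP port, ansible.builtin.wait_for is an alternative to an until loop. A sketch, assuming a variable vm_ip (hypothetical, not defined in this guide) holds the address of the new VM:

```yaml
- name: Wait for SSH to accept connections on the new VM
  ansible.builtin.wait_for:
    host: "{{ vm_ip }}"   # Hypothetical variable
    port: 22
    delay: 5              # Wait 5 seconds before the first check
    timeout: 300          # Give up (and fail) after 5 minutes
  delegate_to: localhost  # Probe from the controller, not from the target
```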

Pattern 7: Conditional Failure Messages

Problem

Generic failure messages don't help with troubleshooting.

Solution

Use ansible.builtin.fail with conditional messages:

Example:

- name: Check prerequisites
  ansible.builtin.command: which docker
  register: docker_check
  changed_when: false
  failed_when: false

- name: Fail if Docker not installed
  ansible.builtin.fail:
    msg: |
      Docker is not installed on {{ inventory_hostname }}
      Please install Docker before running this playbook.
      Installation: sudo apt install docker.io
  when: docker_check.rc != 0

- name: Check Docker version
  ansible.builtin.command: docker --version
  register: docker_version
  changed_when: false

- name: Validate Docker version
  ansible.builtin.fail:
    msg: |
      Docker version is too old: {{ docker_version.stdout }}
      Minimum required version: 20.10
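  # docker --version prints "Docker version X.Y.Z, build ...", so extract the number first
  when: (docker_version.stdout | regex_search('[0-9]+([.][0-9]+)+')) is version('20.10', '<')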

Pattern 8: Assert for Validation

Problem

Need to validate multiple conditions with clear error messages.

Solution

Use ansible.builtin.assert:

Example from repository:

- name: Validate required variables
  ansible.builtin.assert:
    that:
      - secret_name is defined and secret_name|trim|length > 0
      - secret_var_name is defined and secret_var_name|trim|length > 0
    fail_msg: "secret_name and secret_var_name must be provided and non-empty"
    success_msg: "All required variables present"
    quiet: true
  no_log: true

Multiple assertions:

- name: Validate VM configuration
  ansible.builtin.assert:
    that:
      - vm_memory >= 2048
      - vm_cores >= 2
      - vm_disk_size >= 20
      - vm_name is match('^[a-z0-9-]+$')
    fail_msg: |
      Invalid VM configuration:
      - Memory must be >= 2048 MB (got: {{ vm_memory }})
      - Cores must be >= 2 (got: {{ vm_cores }})
      - Disk must be >= 20 GB (got: {{ vm_disk_size }})
      - Name must be lowercase alphanumeric with hyphens (got: {{ vm_name }})
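
A single assert with a combined fail_msg prints every line of the message even when only one check fails. One option, sketched here with the same variables, is to split the checks into separate tasks so each failure carries only its own message:

```yaml
- name: Validate VM memory
  ansible.builtin.assert:
    that:
      - vm_memory >= 2048
    fail_msg: "Memory must be >= 2048 MB (got: {{ vm_memory }})"

- name: Validate VM name
  ansible.builtin.assert:
    that:
      - vm_name is match('^[a-z0-9-]+$')
    fail_msg: "Name must be lowercase alphanumeric with hyphens (got: {{ vm_name }})"
```

The trade-off is more tasks in the play output, and the play stops at the first failing check instead of evaluating them all.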

Pattern 9: Ignore Errors Temporarily

Problem

Task may fail but playbook should continue.

Solution

Use ignore_errors (sparingly!):

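Example:

Note that ansible.builtin.file with state: absent already succeeds when the path is missing, so it needs neither approach. The examples below use a raw rm command, which does fail on a missing file.

- name: Try to remove old backup
  ansible.builtin.command: rm /backup/old-backup.tar.gz
  ignore_errors: true  # Also hides real failures such as permission errors
  register: cleanup_result
  changed_when: cleanup_result.rc == 0

- name: Report cleanup result
  ansible.builtin.debug:
    msg: "Cleanup {{ 'successful' if not cleanup_result.failed else 'skipped (file not found)' }}"

Better approach with failed_when:

- name: Remove old backup
  ansible.builtin.command: rm /backup/old-backup.tar.gz
  register: cleanup_result
  changed_when: cleanup_result.rc == 0
  failed_when:
    - cleanup_result.rc != 0
    - "'No such file or directory' not in cleanup_result.stderr"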

Pattern 10: Task Delegation

Problem

Need to run task locally or on a different host.

Solution

Use delegate_to:

Example:

- name: Check API endpoint from controller
  ansible.builtin.uri:
    url: "https://{{ inventory_hostname }}:8006/api2/json/version"
    validate_certs: false
  delegate_to: localhost
  register: api_check
  failed_when: api_check.status != 200
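
Delegation is often combined with run_once so that a controller-side step executes a single time rather than once per host in the play. A sketch (the task and the deploy_started variable name are illustrative):

```yaml
- name: Record deployment start time once, on the controller
  ansible.builtin.command: date -u +%Y-%m-%dT%H:%M:%SZ
  delegate_to: localhost
  run_once: true
  register: deploy_started
  changed_when: false

- name: Show start time on every host
  ansible.builtin.debug:
    msg: "Deployment started at {{ deploy_started.stdout }}"
```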

Complete Example: Robust VM Creation

Combining multiple patterns:

---
- name: Create Proxmox VM with robust error handling
  hosts: proxmox_nodes
  gather_facts: false

  vars:
    vmid: 101
    vm_name: docker-01-nexus

  tasks:
    - name: Validate VM configuration
      ansible.builtin.assert:
        that:
          - vmid is defined and vmid >= 100
          - vm_name is match('^[a-z0-9-]+$')
        fail_msg: "Invalid VM configuration"

    - name: Check if VM already exists
      ansible.builtin.shell: |
        set -o pipefail
        qm list | awk '{print $1}' | grep -q "^{{ vmid }}$"
      args:
        executable: /bin/bash
      register: vm_exists
      changed_when: false
      failed_when: false

    - name: Create VM
      block:
        - name: Clone template
          ansible.builtin.command: >
            qm clone 9000 {{ vmid }}
            --name {{ vm_name }}
            --full
            --storage local-lvm
          when: vm_exists.rc != 0
          register: clone_result
          changed_when: true

        - name: Wait for clone to complete
          ansible.builtin.pause:
            seconds: 5
          when: clone_result is changed

        - name: Verify VM exists
          ansible.builtin.shell: |
            set -o pipefail
            qm list | awk '{print $1}' | grep -q "^{{ vmid }}$"  # Exact ID match, not a substring
          args:
            executable: /bin/bash
          register: verify_vm
          changed_when: false
          failed_when: verify_vm.rc != 0
          retries: 3
          delay: 5
          until: verify_vm.rc == 0

        - name: Configure VM
          ansible.builtin.command: >
            qm set {{ vmid }}
            --memory 4096
            --cores 4
            --ipconfig0 ip=192.168.1.100/24,gw=192.168.1.1
          register: config_result
          changed_when: true

        - name: Start VM
          ansible.builtin.command: qm start {{ vmid }}
          register: start_result
          changed_when: true

      rescue:
        - name: Cleanup failed VM
          ansible.builtin.command: qm destroy {{ vmid }}
          when: vm_exists.rc != 0  # Only destroy if we created it
          ignore_errors: true

        - name: Report failure
          ansible.builtin.fail:
            msg: |
              Failed to create VM {{ vmid }}
              Clone result: {{ clone_result.stderr | default('N/A') }}
              Config result: {{ config_result.stderr | default('N/A') }}
              Start result: {{ start_result.stderr | default('N/A') }}

    - name: Report success
      ansible.builtin.debug:
        msg: "VM {{ vmid }} ({{ vm_name }}) created successfully"
      when: vm_exists.rc != 0

Best Practices Summary

  1. Use changed_when: false for checks - Read-only operations don't change state
  2. Use failed_when for expected errors - Don't fail on "already exists" scenarios
  3. Always register command output - Needed for changed_when and failed_when
  4. Use set -o pipefail in shell - Catch errors in pipes (set -euo pipefail also aborts on any failed command or unset variable)
  5. Validate inputs with assert - Clear failure messages for bad config
  6. Use blocks for complex operations - Enable rollback with rescue
  7. Add retries for transient failures - Network calls, service startup
  8. Verify critical operations - Check resource exists after creation
  9. Use no_log with secrets - Never log sensitive data
  10. Provide clear error messages - Help troubleshooting with context

Anti-Patterns to Avoid

Bad: Silent Failures

- name: Important task
  ansible.builtin.command: critical-operation
  ignore_errors: true  # Hides failures!

Bad: No Error Context

- name: Deploy
  ansible.builtin.command: deploy.sh
  # No register, no error handling, no context

Bad: Always Changed

- name: Check if exists
  ansible.builtin.command: check-resource
  # Missing: changed_when: false

Good: Explicit Error Handling

- name: Critical operation
  ansible.builtin.command: critical-operation
  register: result
  changed_when: "'created' in result.stdout"
  failed_when:
    - result.rc != 0
    - "'already exists' not in result.stderr"

- name: Verify operation
  ansible.builtin.command: verify-operation
  changed_when: false
  failed_when: false
  register: verify

- name: Report result
  ansible.builtin.fail:
    msg: "Operation failed: {{ result.stderr }}"
  when: verify.rc != 0

Further Reading