Zhongwei Li b9da7b3a23 Initial commit

2025-11-30 08:46:08 +08:00

name	description
Node Tuning Helper Scripts	Generate tuned manifests and evaluate node tuning snapshots

Node Tuning Helper Scripts

Detailed instructions for invoking the helper utilities that back /node-tuning commands:

generate_tuned_profile.py renders Tuned manifests (tuned.openshift.io/v1).
analyze_node_tuning.py inspects live nodes or sosreports for tuning gaps.

When to Use These Scripts

Translate structured command inputs into Tuned manifests for the Node Tuning Operator.
Iterate on generated YAML outside the assistant or integrate the generator into automation.
Analyze CPU isolation, IRQ affinity, huge pages, sysctl values, and networking counters from live clusters or archived sosreports.

Prerequisites

Python 3.8 or newer (python3 --version).
Repository checkout so the scripts under plugins/node-tuning/skills/scripts/ are accessible.
Optional: oc CLI when validating or applying manifests.
Optional: Extracted sosreport directory when running the analysis script offline.
Optional (remote analysis): oc CLI access plus a valid KUBECONFIG when capturing /proc and /sys data or a sosreport via oc debug node/<name>. The sosreport workflow pulls the registry.redhat.io/rhel9/support-tools image (override with --toolbox-image or TOOLBOX_IMAGE) and requires registry access. HTTP(S) proxy environment variables from the host are forwarded automatically when present; a proxy is not required.

Script: `generate_tuned_profile.py`

Implementation Steps

Collect Inputs
- --profile-name: Tuned resource name.
- --summary: [main] section summary.
- Repeatable options: --include, --main-option, --variable, --sysctl, --section (SECTION:KEY=VALUE).
- Target selectors: --machine-config-label key=value, --match-label key[=value].
- Optional: --priority (default 20), --namespace, --output, --dry-run.
- Use --list-nodes/--node-selector to inspect nodes and --label-node NODE:KEY[=VALUE] (plus --overwrite-labels) to tag machines. Add --skip-manifest when you only want to inspect or label nodes without rendering a manifest.
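
The option formats listed above (KEY=VALUE, SECTION:KEY=VALUE, and key[=value]) can be illustrated with a minimal parsing sketch. This is not the script's actual implementation; the function names are hypothetical, and splitting on the first `=` or `:` is an assumption chosen so that values may themselves contain `=`.

```python
from typing import Optional, Tuple


def parse_key_value(raw: str) -> Tuple[str, str]:
    """Split 'KEY=VALUE' on the first '=' so the value may contain '='."""
    key, sep, value = raw.partition("=")
    if not sep or not key.strip():
        raise ValueError(f"expected KEY=VALUE, got {raw!r}")
    return key.strip(), value.strip()


def parse_section(raw: str) -> Tuple[str, str, str]:
    """Split 'SECTION:KEY=VALUE' into (section, key, value)."""
    section, sep, rest = raw.partition(":")
    if not sep or not section.strip():
        raise ValueError(f"expected SECTION:KEY=VALUE, got {raw!r}")
    key, value = parse_key_value(rest)
    return section.strip(), key, value


def parse_match_label(raw: str) -> Tuple[str, Optional[str]]:
    """Split 'key[=value]'; a bare key yields None as the value."""
    key, sep, value = raw.partition("=")
    if not key.strip():
        raise ValueError(f"expected key[=value], got {raw!r}")
    return key.strip(), (value.strip() if sep else None)


if __name__ == "__main__":
    print(parse_key_value("net.core.netdev_max_backlog=16384"))
    print(parse_section('bootloader:cmdline_openshift_node_hugepages=hugepagesz=2M hugepages=50'))
    print(parse_match_label("tuned.openshift.io/custom-net"))
```

Splitting on the first separator only keeps bootloader command-line fragments such as `hugepagesz=2M hugepages=50` intact as a single value.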

Inspect or Label Nodes (optional)

# List all worker nodes
python3 plugins/node-tuning/skills/scripts/generate_tuned_profile.py --list-nodes --node-selector "node-role.kubernetes.io/worker" --skip-manifest

# Label a specific node for the worker-hp pool
python3 plugins/node-tuning/skills/scripts/generate_tuned_profile.py \
  --label-node ip-10-0-1-23.ec2.internal:node-role.kubernetes.io/worker-hp= \
  --overwrite-labels \
  --skip-manifest

Render the Manifest

python3 plugins/node-tuning/skills/scripts/generate_tuned_profile.py \
  --profile-name "$PROFILE" \
  --summary "$SUMMARY" \
  --sysctl net.core.netdev_max_backlog=16384 \
  --match-label tuned.openshift.io/custom-net \
  --output ".work/node-tuning/$PROFILE/tuned.yaml"

Omit --output to write <profile-name>.yaml in the current directory.
Add --dry-run to print the manifest to stdout.
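
For orientation, here is a rough sketch of the kind of Tuned resource the render command above is meant to produce. It follows the general tuned.openshift.io/v1 layout (a profile with INI-style data plus a recommend rule); the exact output of generate_tuned_profile.py may differ. The function name is hypothetical, and the namespace default is an assumption based on where the Node Tuning Operator normally runs.

```python
import json
from typing import Dict, Optional


def build_tuned_manifest(
    profile_name: str,
    summary: str,
    sysctls: Dict[str, str],
    match_label: str,
    match_value: Optional[str] = None,
    priority: int = 20,
    # Assumption: the operator's usual namespace; the script's default may differ.
    namespace: str = "openshift-cluster-node-tuning-operator",
) -> dict:
    # Profile data is an INI-style document embedded as a single string.
    lines = ["[main]", f"summary={summary}", "", "[sysctl]"]
    lines += [f"{key}={value}" for key, value in sysctls.items()]

    # A match entry without "value" matches on label presence alone.
    match = {"label": match_label}
    if match_value is not None:
        match["value"] = match_value

    return {
        "apiVersion": "tuned.openshift.io/v1",
        "kind": "Tuned",
        "metadata": {"name": profile_name, "namespace": namespace},
        "spec": {
            "profile": [{"name": profile_name, "data": "\n".join(lines) + "\n"}],
            "recommend": [
                {"match": [match], "priority": priority, "profile": profile_name}
            ],
        },
    }


if __name__ == "__main__":
    manifest = build_tuned_manifest(
        "custom-net",
        "Custom network tuning",
        {"net.core.netdev_max_backlog": "16384"},
        "tuned.openshift.io/custom-net",
    )
    # JSON is printed here to stay within the standard library.
    print(json.dumps(manifest, indent=2))
```

Comparing the generated YAML against this shape is a quick way to confirm that the profile name, priority, and match rule ended up where expected.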

Review Output
- Inspect the generated YAML for accuracy.
- Optionally format with yq or open in an editor for readability.

Validate and Apply
- Dry-run: oc apply --dry-run=client -f <manifest> (use --dry-run=server to validate against the cluster API).
- Apply: oc apply -f <manifest>.
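
When the validate-then-apply step is driven from automation, a thin wrapper around oc can keep the two calls consistent. The sketch below is an illustration only: the function names are hypothetical, and it assumes the oc binary is on PATH and accepts the standard `--dry-run=client|server` values.

```python
import subprocess
from typing import List, Optional


def build_apply_command(
    manifest: str, dry_run: Optional[str] = None, oc_binary: str = "oc"
) -> List[str]:
    """Return the argv for 'oc apply', optionally as a dry run."""
    if dry_run not in (None, "client", "server"):
        raise ValueError(f"dry_run must be 'client' or 'server', got {dry_run!r}")
    cmd = [oc_binary, "apply", "-f", manifest]
    if dry_run:
        cmd.append(f"--dry-run={dry_run}")
    return cmd


def validate_then_apply(manifest: str) -> int:
    """Run a client-side dry run first and apply only if it succeeds."""
    check = subprocess.run(build_apply_command(manifest, "client"), check=False)
    if check.returncode != 0:
        return check.returncode
    return subprocess.run(build_apply_command(manifest), check=False).returncode


if __name__ == "__main__":
    print(" ".join(build_apply_command(".work/node-tuning/custom-net/tuned.yaml", "client")))
```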

Error Handling

Missing required options raise ValueError with descriptive messages.
The script exits non-zero when no target selectors (--machine-config-label or --match-label) are supplied.
Invalid key/value or section inputs identify the failing argument explicitly.
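
The behavior described above can be pictured roughly as follows. This is a sketch under assumptions, not the script's code: the function names and message wording are hypothetical, and the exit code of 1 is a guess (the source only says "non-zero").

```python
import sys
from typing import Sequence


def validate_inputs(
    profile_name: str,
    machine_config_labels: Sequence[str],
    match_labels: Sequence[str],
) -> None:
    """Raise ValueError naming the option that is missing or malformed."""
    if not profile_name:
        raise ValueError("--profile-name is required")
    if not machine_config_labels and not match_labels:
        raise ValueError(
            "supply at least one of --machine-config-label or --match-label"
        )
    for raw in machine_config_labels:
        key, sep, value = raw.partition("=")
        if not (key and sep and value):
            raise ValueError(f"--machine-config-label expects key=value, got {raw!r}")


def run(
    profile_name: str,
    machine_config_labels: Sequence[str],
    match_labels: Sequence[str],
) -> int:
    """Translate validation failures into a non-zero exit code."""
    try:
        validate_inputs(profile_name, machine_config_labels, match_labels)
    except ValueError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1  # Assumption: the real script may use a different non-zero code.
    return 0


if __name__ == "__main__":
    sys.exit(run("custom-net", [], ["tuned.openshift.io/custom-net"]))
```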

Examples

python3 plugins/node-tuning/skills/scripts/generate_tuned_profile.py \
  --profile-name realtime-worker \
  --summary "Realtime tuned profile" \
  --include openshift-node --include realtime \
  --variable isolated_cores=1 \
  --section 'bootloader:cmdline_ocp_realtime=+systemd.cpu_affinity=${not_isolated_cores_expanded}' \
  --machine-config-label machineconfiguration.openshift.io/role=worker-rt \
  --priority 25 \
  --output .work/node-tuning/realtime-worker/tuned.yaml

python3 plugins/node-tuning/skills/scripts/generate_tuned_profile.py \
  --profile-name openshift-node-hugepages \
  --summary "Boot time configuration for hugepages" \
  --include openshift-node \
  --section bootloader:cmdline_openshift_node_hugepages="hugepagesz=2M hugepages=50" \
  --machine-config-label machineconfiguration.openshift.io/role=worker-hp \
  --priority 30 \
  --output .work/node-tuning/openshift-node-hugepages/hugepages-tuned-boottime.yaml

Script: `analyze_node_tuning.py`

Purpose

Inspect either a live node (/proc, /sys) or an extracted sosreport snapshot for tuning signals (CPU isolation, IRQ affinity, huge pages, sysctl state, networking counters) and emit actionable recommendations.

Usage Patterns

Live node analysis

python3 plugins/node-tuning/skills/scripts/analyze_node_tuning.py --format markdown

Remote analysis via oc debug

python3 plugins/node-tuning/skills/scripts/analyze_node_tuning.py \
  --node worker-rt-0 \
  --kubeconfig ~/.kube/prod \
  --format markdown

Collect sosreport via oc debug and analyze locally

python3 plugins/node-tuning/skills/scripts/analyze_node_tuning.py \
  --node worker-rt-0 \
  --toolbox-image registry.example.com/support-tools:latest \
  --sosreport-arg "--case-id=01234567" \
  --sosreport-output .work/node-tuning/sosreports \
  --format json

Offline sosreport analysis

python3 plugins/node-tuning/skills/scripts/analyze_node_tuning.py \
  --sosreport /path/to/sosreport-2025-10-20

Automation-friendly JSON

python3 plugins/node-tuning/skills/scripts/analyze_node_tuning.py \
  --sosreport /path/to/sosreport \
  --format json --output .work/node-tuning/node-analysis.json

Implementation Steps

Select data source
- Provide --node <name> (with optional --kubeconfig / --oc-binary). By default the helper runs sosreport remotely from inside the RHCOS toolbox container (registry.redhat.io/rhel9/support-tools). Override the image with --toolbox-image, extend the sosreport command with --sosreport-arg, or disable the curated OpenShift flags via --skip-default-sosreport-flags. Pass --no-collect-sosreport to fall back to the direct /proc snapshot mode.
- Provide --sosreport <dir> for archived diagnostics; detection finds embedded proc/ and sys/.
- Omit both switches to query the live filesystem (defaults to /proc and /sys).
- Override paths with --proc-root or --sys-root when the layout differs.

Run analysis
- The script parses cpuinfo, kernel cmdline parameters (isolcpus, nohz_full, tuned.non_isolcpus), default IRQ affinities, huge page counters, sysctl values (net, vm, kernel), transparent hugepage settings, netstat/sockstat counters, and ps snapshots (when available in sosreport).

Review the report
- Markdown output groups findings by section (System Overview, CPU & Isolation, Huge Pages, Sysctl Highlights, Network Signals, IRQ Affinity, Process Snapshot) and lists recommendations.
- JSON output contains the same information in structured form for pipelines or dashboards.

Act on recommendations
- Apply Tuned profiles, MachineConfig updates, or manual sysctl/irqbalance adjustments.
- Feed actionable items back into /node-tuning:generate-tuned-profile to codify desired state.
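
To make the CPU isolation and IRQ affinity checks more concrete, the following sketch shows one way such a check could work: expand the isolcpus list from the kernel command line, then flag IRQs whose affinity list touches an isolated CPU. It is not taken from analyze_node_tuning.py. The function names are hypothetical, the affinity strings are assumed to be in the /proc/irq/<n>/smp_affinity_list format, and stride syntax in CPU lists is not handled.

```python
from typing import Dict, List, Set


def parse_cpu_list(spec: str) -> Set[int]:
    """Expand a kernel CPU list such as '2-15' or '0-1,8,10-11'."""
    cpus: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        # Skip empty items and flag tokens such as 'managed_irq' or 'domain'.
        if not part or not part[0].isdigit():
            continue
        start, sep, end = part.partition("-")
        first = int(start)
        last = int(end) if sep else first
        cpus.update(range(first, last + 1))
    return cpus


def cmdline_param(cmdline: str, name: str) -> str:
    """Return the value of 'name=value' from a kernel command line, or ''."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key == name:
            return value
    return ""


def irqs_on_isolated(affinities: Dict[int, str], isolated: Set[int]) -> List[int]:
    """List IRQs whose affinity list includes at least one isolated CPU."""
    return sorted(
        irq for irq, cpus in affinities.items() if parse_cpu_list(cpus) & isolated
    )


if __name__ == "__main__":
    cmdline = "BOOT_IMAGE=... isolcpus=2-15 tuned.non_isolcpus=0-1"
    isolated = parse_cpu_list(cmdline_param(cmdline, "isolcpus"))
    # Hypothetical affinity data keyed by IRQ number.
    print(irqs_on_isolated({30: "0-1", 31: "2", 32: "0-31"}, isolated))
```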

Error Handling

Missing proc/ or sys/ directories trigger descriptive errors.
Unreadable files are skipped gracefully and noted in observations where relevant.
Non-numeric sysctl values are flagged for manual investigation.
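
A tolerant reader along these lines would produce the behavior described above. Treat it as a sketch: the function name is hypothetical, and mapping a dotted sysctl name onto a path under <proc-root>/sys by splitting on every dot is a simplification that breaks for names containing dotted interface names.

```python
from pathlib import Path
from typing import List, Optional


def read_sysctl_int(
    proc_root: Path, name: str, observations: List[str]
) -> Optional[int]:
    """Read an integer sysctl from <proc_root>/sys, recording problems as notes."""
    path = proc_root / "sys" / Path(*name.split("."))
    try:
        raw = path.read_text().strip()
    except OSError as exc:
        # Missing or unreadable files are skipped rather than treated as fatal.
        observations.append(f"{name}: unreadable ({exc.__class__.__name__})")
        return None
    try:
        return int(raw)
    except ValueError:
        observations.append(f"{name}: non-numeric value {raw!r}, check manually")
        return None


if __name__ == "__main__":
    notes: List[str] = []
    value = read_sysctl_int(Path("/proc"), "net.core.netdev_max_backlog", notes)
    print(value, notes)
```

Collecting notes in a list instead of raising lets one damaged file in a sosreport degrade the report instead of aborting it.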

Example Output (Markdown excerpt)

# Node Tuning Analysis

## System Overview
- Hostname: worker-rt-1
- Kernel: 4.18.0-477.el8
- NUMA nodes: 2
- Kernel cmdline: `BOOT_IMAGE=... isolcpus=2-15 tuned.non_isolcpus=0-1`

## CPU & Isolation
- Logical CPUs: 32
- Physical cores: 16 across 2 socket(s)
- SMT detected: yes
- Isolated CPUs: 2-15
...

## Recommended Actions
- Configure net.core.netdev_max_backlog (>=32768) to accommodate bursty NIC traffic.
- Transparent Hugepages are not disabled (`[never]` not selected). Consider setting to `never` for latency-sensitive workloads.
- 4 IRQs overlap isolated CPUs. Relocate interrupt affinities using tuned profiles or irqbalance.
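
The Transparent Hugepages recommendation refers to the bracketed token in /sys/kernel/mm/transparent_hugepage/enabled, where the kernel marks the active mode (for example `always madvise [never]`). A minimal sketch of that check follows; the function names are hypothetical and not taken from the analysis script.

```python
def selected_thp_mode(content: str) -> str:
    """Return the bracketed mode from e.g. 'always madvise [never]', or ''."""
    for token in content.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return ""


def thp_is_disabled(content: str) -> bool:
    """True only when the kernel reports 'never' as the active mode."""
    return selected_thp_mode(content) == "never"


if __name__ == "__main__":
    print(selected_thp_mode("[always] madvise never"))
    print(thp_is_disabled("always madvise [never]"))
```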

Follow-up Automation Ideas

Persist JSON results in .work/node-tuning/<host>/analysis.json for historical tracing.
Gate upgrades by comparing recommendations across nodes.
Integrate with CI jobs that validate cluster tuning post-change.
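
As a starting point for the cross-node comparison idea, the sketch below loads per-host analysis files and reports recommendations that are not shared by every node. It rests on two assumptions that should be checked against real output: files live at <work-dir>/<host>/analysis.json, and the JSON has a top-level "recommendations" list of strings. Both function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, Set


def load_recommendations(work_dir: Path) -> Dict[str, Set[str]]:
    """Map each host directory name to its set of recommendations."""
    results: Dict[str, Set[str]] = {}
    for path in sorted(work_dir.glob("*/analysis.json")):
        data = json.loads(path.read_text())
        # Assumption: recommendations are stored as a top-level list of strings.
        results[path.parent.name] = set(data.get("recommendations", []))
    return results


def outliers(per_host: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return, per host, the recommendations not common to all hosts."""
    if not per_host:
        return {}
    common = set.intersection(*per_host.values())
    return {
        host: recs - common for host, recs in per_host.items() if recs - common
    }


if __name__ == "__main__":
    per_host = load_recommendations(Path(".work/node-tuning"))
    for host, extra in sorted(outliers(per_host).items()):
        print(host, sorted(extra))
```

A CI gate could fail when `outliers` returns a non-empty mapping, which signals that nodes expected to be identical have drifted apart.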
