// Module included in the following assemblies:
//
// * scalability_and_performance/telco_core_ref_design_specs/telco-core-rds.adoc
:_mod-docs-content-type: REFERENCE
[id="telco-core-cpu-partitioning-and-performance-tuning_{context}"]
= CPU partitioning and performance tuning
New in this release::
* Optional support for the `acpi_idle` CPUIdle driver.
* The `systemReserved` field replaces the `autoSizingReserved` field, specifying 11Gi of memory for worker nodes and 30Gi for control plane nodes.
* You can now trigger a kernel panic through a non-maskable interrupt (NMI) for system recovery and diagnostic purposes when `x86_64` architecture nodes become unresponsive.
Description::
CPU partitioning improves performance and reduces latency by separating sensitive workloads from general-purpose tasks, interrupts, and driver work queues.
The CPUs allocated to those auxiliary processes are referred to as *reserved* in the following sections.
In a system with Hyper-Threading enabled, a CPU is one hyper-thread.
Limits and requirements::
* The operating system needs a certain amount of CPU to perform all the support tasks, including kernel networking.
** A system with just user plane networking applications (DPDK) needs at least one core (2 hyper-threads when enabled) reserved for the operating system and the infrastructure components.
* In a system with Hyper-Threading enabled, core sibling threads must always be in the same pool of CPUs.
* The set of reserved and isolated cores must include all CPU cores.
* Core 0 of each NUMA node must be included in the reserved CPU set.
* Low latency workloads require special configuration to avoid being affected by interrupts, kernel scheduler, or other parts of the platform.
For more information, see "Creating a performance profile".
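+
For example, a minimal `PerformanceProfile` CPU partitioning sketch for a single NUMA node with 32 physical cores and Hyper-Threading enabled (sibling threads offset by 32) might look like the following. The profile name, CPU ranges, and node selector are illustrative and must be adapted to the actual hardware topology:
+
[source,yaml]
----
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: telco-core-performance # illustrative name
spec:
  cpu:
    # Reserved CPUs include core 0 and its sibling thread, rounded up to full cores.
    reserved: "0-3,32-35"
    # Isolated CPUs run the latency-sensitive workloads.
    # Together, the reserved and isolated sets must cover every CPU on the node.
    isolated: "4-31,36-63"
  nodeSelector:
    node-role.kubernetes.io/worker: ""
----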
Engineering considerations::
* As of OpenShift 4.19, `cgroup v1` is no longer supported and has been removed.
All workloads must now be compatible with `cgroup v2`.
For more information, see link:https://www.redhat.com/en/blog/rhel-9-changes-context-red-hat-openshift-workloads[Red Hat Enterprise Linux 9 changes in the context of Red Hat OpenShift workloads].
* The minimum reserved capacity (`systemReserved`) required can be found by following the guidance in link:https://access.redhat.com/solutions/5843241[Which amount of CPU and memory are recommended to reserve for the system in OCP 4 nodes?].
** The specific values must be custom-tuned for each cluster based on its size and application workload.
** The minimum recommended `systemReserved` memory is 11Gi for worker nodes and 30Gi for control plane nodes.
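+
The following sketch shows one way to set explicit system reserved memory for worker nodes by using a `KubeletConfig` resource. The resource name is illustrative, and the reference configuration might apply this setting through a different mechanism:
+
[source,yaml]
----
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: system-reserved-worker # illustrative name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    systemReserved:
      # Minimum recommended value for worker nodes; control plane nodes use 30Gi.
      memory: 11Gi
----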
* For schedulable control plane nodes, the recommended reserved capacity is at least 16 CPUs.
* The actual required reserved CPU capacity depends on the cluster configuration and workload attributes.
* The reserved CPU value must be rounded up to full core (2 hyper-thread) alignment.
* Changes to CPU partitioning cause the nodes contained in the relevant machine config pool to be drained and rebooted.
* The reserved CPUs reduce the pod density, because the reserved CPUs are removed from the allocatable capacity of the {product-title} node.
* The real-time workload hint should be enabled for real-time capable workloads.
** Applying the real-time `workloadHint` setting results in the `nohz_full` kernel command line parameter being applied to improve the performance of high performance applications.
When you apply the `workloadHint` setting, any isolated or burstable pods that do not have the `cpu-quota.crio.io: "disable"` annotation and a proper `runtimeClassName` value are subject to CRI-O rate limiting.
When you set the `workloadHint` parameter, be aware of the tradeoff between increased performance and the potential impact of CRI-O rate limiting.
Ensure that required pods are correctly annotated.
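+
For example, a latency-sensitive pod that must be exempt from CRI-O CPU quota rate limiting could be annotated as in the following sketch. The pod, container, and image names are illustrative, and the `runtimeClassName` value depends on the name of the performance profile (`performance-<profile-name>`):
+
[source,yaml]
----
apiVersion: v1
kind: Pod
metadata:
  name: rt-workload # illustrative name
  annotations:
    # Exempt this pod from CRI-O CPU quota rate limiting.
    cpu-quota.crio.io: "disable"
spec:
  # Runtime class created for the performance profile, for example
  # performance-telco-core-performance for a profile named telco-core-performance.
  runtimeClassName: performance-telco-core-performance
  containers:
  - name: app
    image: registry.example.com/rt-app:latest # illustrative image
    resources:
      requests:
        cpu: "4"
        memory: 2Gi
      limits:
        cpu: "4"
        memory: 2Gi
----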
* Hardware without IRQ affinity support affects isolated CPUs.
All server hardware must support IRQ affinity to ensure that pods with guaranteed CPU QoS can fully use allocated CPUs.
* OVS dynamically manages its `cpuset` entry to adapt to network traffic needs.
You do not need to reserve an additional CPU for handling high network throughput on the primary CNI.
* If workloads running on the cluster use kernel-level networking, the RX/TX queue count for the participating NICs should be set to 16 or 32 queues if the hardware permits it.
Be aware of the default queue count.
With no configuration, the default queue count is one RX/TX queue per online CPU, which can result in too many interrupts being allocated.
+
[NOTE]
====
Some drivers do not deallocate the interrupts even after reducing the queue count.
====
* The `irdma` kernel module might result in the allocation of too many interrupt vectors on systems with high core counts.
To prevent this condition, the reference configuration excludes this kernel module from loading through a kernel command-line argument in the `PerformanceProfile` resource.
Typically, core workloads do not require this kernel module.
* To enable the `acpi_idle` CPUIdle driver, for example for Intel FlexRAN workloads, add `intel_idle.max_cstate=0` to the `additionalKernelArgs` list in the `PerformanceProfile` resource.
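+
Both of the preceding kernel-level adjustments are expressed as additional kernel arguments in the `PerformanceProfile` resource, as in the following sketch. The profile name is illustrative, only the relevant fields are shown, and `module_blacklist=irdma` is one possible argument for excluding the `irdma` module; the exact argument in the reference configuration might differ:
+
[source,yaml]
----
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: telco-core-performance # illustrative name
spec:
  additionalKernelArgs:
  # Prevent the irdma kernel module from loading (illustrative argument).
  - "module_blacklist=irdma"
  # Disable the intel_idle driver so that the acpi_idle CPUIdle driver is used.
  - "intel_idle.max_cstate=0"
----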
* The `TunedPerformancePatch.yaml` file in the reference configuration sets the `kernel.panic_on_unrecovered_nmi` sysctl parameter to enable triggering a kernel panic through a BMC non-maskable interrupt (NMI) on `x86_64` architectures.
This provides a mechanism to force a kernel panic for system recovery and diagnostic purposes when nodes become unresponsive.
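+
The following `Tuned` resource sketch shows how such a sysctl setting can be expressed. The resource and profile names, the `include` target, and the `recommend` rules are illustrative; the reference `TunedPerformancePatch.yaml` file is the authoritative source:
+
[source,yaml]
----
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: performance-patch # illustrative name
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - name: performance-patch
    data: |
      [main]
      summary=Sketch of a telco core performance patch profile
      include=openshift-node-performance-telco-core-performance
      [sysctl]
      # Panic on unrecovered NMI so that a BMC-triggered NMI forces a kernel panic.
      kernel.panic_on_unrecovered_nmi=1
  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: worker
    priority: 19
    profile: performance-patch
----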