mirror of https://github.com/openshift/openshift-docs.git synced 2026-02-05 12:46:18 +01:00

Merge pull request #87549 from openshift-cherrypick-robot/cherry-pick-86820-to-enterprise-4.18

[enterprise-4.18] OCPBUGS-44632 Adding scoring strategy
This commit is contained in:
Shauna Diaz
2025-01-23 14:41:50 -05:00
committed by GitHub
3 changed files with 202 additions and 1 deletions


@@ -0,0 +1,72 @@
// Module included in the following assemblies:
//
// * scalability_and_performance/cnf-numa-aware-scheduling.adoc
:_mod-docs-content-type: CONCEPT
[id="cnf-numa-resource-scheduling-strategies_{context}"]
= NUMA resource scheduling strategies
When scheduling high-performance workloads, the secondary scheduler can employ different strategies to determine which NUMA node within a chosen worker node will handle the workload. The supported strategies in {product-title} include `LeastAllocated`, `MostAllocated`, and `BalancedAllocation`. Understanding these strategies helps optimize workload placement for performance and resource utilization.
When a high-performance workload is scheduled in a NUMA-aware cluster, the following steps occur:
. The scheduler first selects a suitable worker node based on cluster-wide criteria, such as taints, labels, or resource availability.
. After a worker node is selected, the scheduler evaluates its NUMA nodes and applies a scoring strategy to decide which NUMA node will handle the workload.
. After a workload is scheduled, the selected NUMA node's resources are updated to reflect the allocation.
The default strategy is `LeastAllocated`. This strategy assigns workloads to the NUMA node with the most available resources, that is, the least utilized NUMA node. The goal of this strategy is to spread workloads across NUMA nodes to reduce contention and avoid hotspots.
The following table summarizes the different strategies and their outcomes:
[discrete]
[id="cnf-scoringstrategy-summary_{context}"]
== Scoring strategy summary
.Scoring strategy summary
[cols="2,3,3", options="header"]
|===
|Strategy |Description |Outcome
|`LeastAllocated` |Favors NUMA nodes with the most available resources. |Spreads workloads to reduce contention and ensure headroom for high-priority tasks.
|`MostAllocated` |Favors NUMA nodes with the least available resources. |Consolidates workloads on fewer NUMA nodes, freeing others for energy efficiency.
|`BalancedAllocation` |Favors NUMA nodes with balanced CPU and memory usage. |Ensures even resource utilization, preventing skewed usage patterns.
|===
[discrete]
[id="cnf-leastallocated-example_{context}"]
== LeastAllocated strategy example
`LeastAllocated` is the default strategy. It assigns workloads to the NUMA node with the most available resources, minimizing resource contention and spreading workloads across NUMA nodes. This reduces hotspots and ensures sufficient headroom for high-priority tasks. Assume a worker node has two NUMA nodes, and the workload requires 4 vCPUs and 8 GB of memory:
.Example initial NUMA nodes state
[cols="5,2,2,2,2,2", options="header"]
|===
|NUMA node |Total CPUs |Used CPUs |Total memory (GB) |Used memory (GB) |Available resources
|NUMA 1 |16 |12 |64 |56 |4 CPUs, 8 GB memory
|NUMA 2 |16 |6 |64 |24 |10 CPUs, 40 GB memory
|===
Because NUMA 2 has more available resources compared to NUMA 1, the workload is assigned to NUMA 2.
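The comparison in this example can be sketched in a few lines. The following is a simplified, hypothetical illustration of the least-allocated idea, favoring the NUMA node with the highest fraction of resources left free after placement; it is not the secondary scheduler's actual implementation, and the function and variable names are invented for this sketch.

[source,python]
----
# Simplified LeastAllocated scoring: favor the NUMA node with the highest
# average fraction of CPU and memory still free after placing the workload.
# Hypothetical illustration only, not the actual scheduler code.
numa_nodes = {
    "NUMA 1": {"cpu_total": 16, "cpu_used": 12, "mem_total": 64, "mem_used": 56},
    "NUMA 2": {"cpu_total": 16, "cpu_used": 6, "mem_total": 64, "mem_used": 24},
}

def least_allocated_score(node, req_cpu, req_mem):
    # Fraction of each resource that remains free after the placement.
    cpu_free = (node["cpu_total"] - node["cpu_used"] - req_cpu) / node["cpu_total"]
    mem_free = (node["mem_total"] - node["mem_used"] - req_mem) / node["mem_total"]
    return (cpu_free + mem_free) / 2  # higher score = less utilized

# The workload requests 4 vCPUs and 8 GB of memory.
best = max(numa_nodes, key=lambda name: least_allocated_score(numa_nodes[name], 4, 8))
print(best)  # NUMA 2
----

With these table values, NUMA 1 would be left with no headroom at all (score 0), while NUMA 2 keeps roughly 44% of its capacity free on average, so NUMA 2 wins.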
[discrete]
[id="cnf-mostallocated-example_{context}"]
== MostAllocated strategy example
The `MostAllocated` strategy consolidates workloads by assigning them to the NUMA node with the least available resources, which is the most utilized NUMA node. This approach helps free other NUMA nodes for energy efficiency or critical workloads requiring full isolation. This example uses the "Example initial NUMA nodes state" values listed in the `LeastAllocated` section.
The workload again requires 4 vCPUs and 8 GB memory. NUMA 1 has fewer available resources compared to NUMA 2, so the scheduler assigns the workload to NUMA 1, further utilizing its resources while leaving NUMA 2 idle or minimally loaded.
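The most-allocated comparison can be sketched the same way: score each NUMA node by its average utilization after placement and pick the highest. This is a simplified, hypothetical illustration with invented names, not the scheduler's actual code.

[source,python]
----
# Simplified MostAllocated scoring: favor the NUMA node with the highest
# average CPU and memory utilization after placing the workload.
# Hypothetical illustration only, not the actual scheduler code.
numa_nodes = {
    "NUMA 1": {"cpu_total": 16, "cpu_used": 12, "mem_total": 64, "mem_used": 56},
    "NUMA 2": {"cpu_total": 16, "cpu_used": 6, "mem_total": 64, "mem_used": 24},
}

def most_allocated_score(node, req_cpu, req_mem):
    # Fraction of each resource that would be in use after the placement.
    cpu_util = (node["cpu_used"] + req_cpu) / node["cpu_total"]
    mem_util = (node["mem_used"] + req_mem) / node["mem_total"]
    return (cpu_util + mem_util) / 2  # higher score = more utilized

# The workload requests 4 vCPUs and 8 GB of memory.
best = max(numa_nodes, key=lambda name: most_allocated_score(numa_nodes[name], 4, 8))
print(best)  # NUMA 1
----

Placing the workload on NUMA 1 fills it completely (score 1.0) while NUMA 2 stays lightly loaded, which is exactly the consolidation this strategy aims for.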
[discrete]
[id="cnf-balanceallocated-example_{context}"]
== BalancedAllocation strategy example
The `BalancedAllocation` strategy assigns workloads to the NUMA node with the most balanced resource utilization across CPU and memory. The goal is to prevent imbalanced usage, such as high CPU utilization with underutilized memory. Assume a worker node has the following NUMA node states:
.Example NUMA nodes initial state for `BalancedAllocation`
[cols="2,2,2,2",options="header"]
|===
|NUMA node |CPU usage |Memory usage |`BalancedAllocation` score
|NUMA 1 |60% |55% |High (more balanced)
|NUMA 2 |80% |20% |Low (less balanced)
|===
NUMA 1 has more balanced CPU and memory utilization than NUMA 2. Therefore, with the `BalancedAllocation` strategy in place, the workload is assigned to NUMA 1.
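The balance comparison can be sketched by scoring each NUMA node on how close its CPU and memory usage fractions are to each other. This is a simplified, hypothetical illustration with invented names; the real scheduler's balanced-allocation formula is more involved, but the ranking idea is the same.

[source,python]
----
# Simplified BalancedAllocation scoring: favor the NUMA node whose CPU and
# memory usage fractions are closest to each other.
# Hypothetical illustration only, not the actual scheduler code.
usage = {
    "NUMA 1": {"cpu": 0.60, "mem": 0.55},
    "NUMA 2": {"cpu": 0.80, "mem": 0.20},
}

def balanced_score(node):
    # Smaller gap between CPU and memory usage yields a higher score.
    return 1 - abs(node["cpu"] - node["mem"])

best = max(usage, key=lambda name: balanced_score(usage[name]))
print(best)  # NUMA 1
----

NUMA 1's usage fractions differ by only 5 percentage points versus 60 for NUMA 2, so NUMA 1 scores higher and receives the workload.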


@@ -0,0 +1,121 @@
// Module included in the following assemblies:
//
// * scalability_and_performance/cnf-numa-aware-scheduling.adoc
:_mod-docs-content-type: PROCEDURE
[id="cnf-changing-where-high-performance-workloads-run_{context}"]
= Changing where high-performance workloads run
The NUMA-aware secondary scheduler is responsible for scheduling high-performance workloads on a worker node and within a NUMA node where the workloads can be optimally processed. By default, the secondary scheduler assigns workloads to the NUMA node within the chosen worker node that has the most available resources.
If you want to change where the workloads run, you can add the `scoringStrategy` setting to the `NUMAResourcesScheduler` custom resource and set its value to either `MostAllocated` or `BalancedAllocation`.
.Prerequisites
* Install the OpenShift CLI (`oc`).
* Log in as a user with `cluster-admin` privileges.
.Procedure
. Delete the currently running `NUMAResourcesScheduler` resource by using the following steps:
.. Get the active `NUMAResourcesScheduler` by running the following command:
+
[source,terminal]
----
$ oc get NUMAResourcesScheduler
----
+
.Example output
[source,terminal]
----
NAME AGE
numaresourcesscheduler 92m
----
.. Delete the secondary scheduler resource by running the following command:
+
[source,terminal]
----
$ oc delete NUMAResourcesScheduler numaresourcesscheduler
----
+
.Example output
[source,terminal]
----
numaresourcesscheduler.nodetopology.openshift.io "numaresourcesscheduler" deleted
----
. Save the following YAML in the file `nro-scheduler-mostallocated.yaml`. This example changes the `scoringStrategy` to `MostAllocated`:
+
[source,yaml]
----
apiVersion: nodetopology.openshift.io/v1
kind: NUMAResourcesScheduler
metadata:
name: numaresourcesscheduler
spec:
imageSpec: "registry.redhat.io/openshift4/noderesourcetopology-scheduler-container-rhel8:v{product-version}"
scoringStrategy:
type: "MostAllocated" <1>
----
<1> If the `scoringStrategy` configuration is omitted, the default of `LeastAllocated` applies.
. Create the updated `NUMAResourcesScheduler` resource by running the following command:
+
[source,terminal]
----
$ oc create -f nro-scheduler-mostallocated.yaml
----
+
.Example output
[source,terminal]
----
numaresourcesscheduler.nodetopology.openshift.io/numaresourcesscheduler created
----
.Verification
. Check that the NUMA-aware scheduler was successfully deployed by using the following steps:
.. Run the following command to check that the custom resource definition (CRD) is created successfully:
+
[source,terminal]
----
$ oc get crd | grep numaresourcesschedulers
----
+
.Example output
[source,terminal]
----
NAME CREATED AT
numaresourcesschedulers.nodetopology.openshift.io 2022-02-25T11:57:03Z
----
.. Check that the new custom scheduler is available by running the following command:
+
[source,terminal]
----
$ oc get numaresourcesschedulers.nodetopology.openshift.io
----
+
.Example output
[source,terminal]
----
NAME AGE
numaresourcesscheduler 3h26m
----
. Verify that the `scoringStrategy` setting has been applied correctly by running the following command to check the relevant `ConfigMap` resource for the scheduler:
+
[source,terminal]
----
$ oc get -n openshift-numaresources cm topo-aware-scheduler-config -o yaml | grep scoring -A 1
----
+
.Example output
[source,terminal]
----
scoringStrategy:
type: MostAllocated
----


@@ -14,9 +14,14 @@ The NUMA Resources Operator allows you to schedule high-performance workloads in
include::modules/cnf-about-numa-aware-scheduling.adoc[leveloffset=+1]
include::modules/cnf-numa-resource-scheduling-strategies.adoc[leveloffset=+1]
[role="_additional-resources"]
.Additional resources
* xref:../nodes/scheduling/secondary_scheduler/nodes-secondary-scheduler-configuring.adoc#secondary-scheduler-configuring[Scheduling pods using a secondary scheduler]
* xref:../scalability_and_performance/cnf-numa-aware-scheduling.adoc#cnf-changing-where-high-performance-workloads-run_numa-aware[Changing where high-performance workloads run]
[id="installing-the-numa-resources-operator_{context}"]
== Installing the NUMA Resources Operator
@@ -35,6 +40,7 @@ include::modules/cnf-deploying-the-numa-aware-scheduler.adoc[leveloffset=+2]
include::modules/cnf-configuring-single-numa-policy.adoc[leveloffset=+2]
[role="_additional-resources"]
.Additional resources
* xref:../disconnected/updating/disconnected-update.adoc#images-configuration-registry-mirror-configuring_updating-disconnected-cluster[Configuring image registry repository mirroring]
@@ -53,6 +59,8 @@ include::modules/cnf-troubleshooting-numa-aware-workloads.adoc[leveloffset=+1]
include::modules/cnf-reporting-more-exact-reource-availability.adoc[leveloffset=+2]
include::modules/cnf-scheduling-exact-based-on-reource.adoc[leveloffset=+2]
include::modules/cnf-checking-numa-aware-scheduler-logs.adoc[leveloffset=+2]
include::modules/cnf-troubleshooting-resource-topo-exporter.adoc[leveloffset=+2]