// Module included in the following assemblies:
//
// * scaling_and_performance/using-cpu-manager.adoc
// * post_installation_configuration/node-tasks.adoc

[id="seting_up_cpu_manager_{context}"]
= Setting up CPU Manager

.Procedure

. Optional: Label a node:
+
[source,terminal]
----
# oc label node perf-node.example.com cpumanager=true
----
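+
You can confirm that the label was applied by listing nodes with a label selector, for example:
+
[source,terminal]
----
# oc get nodes -l cpumanager=true
----
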
. Edit the `MachineConfigPool` of the nodes where CPU Manager should be enabled. In this example, all workers have CPU Manager enabled:
+
[source,terminal]
----
# oc edit machineconfigpool worker
----

. Add a label to the worker machine config pool:
+
[source,yaml]
----
metadata:
  creationTimestamp: 2020-xx-xxx
  generation: 3
  labels:
    custom-kubelet: cpumanager-enabled
----
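+
Alternatively, you can apply the same label without opening an editor:
+
[source,terminal]
----
# oc label machineconfigpool worker custom-kubelet=cpumanager-enabled
----
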
. Create a `KubeletConfig`, `cpumanager-kubeletconfig.yaml`, custom resource (CR). Refer to the label created in the previous step to have the correct nodes updated with the new kubelet config. See the `machineConfigPoolSelector` section:
+
[source,yaml]
----
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: cpumanager-enabled
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: cpumanager-enabled
  kubeletConfig:
    cpuManagerPolicy: static <1>
    cpuManagerReconcilePeriod: 5s <2>
----
<1> Specify a policy:
* `none`. This policy explicitly enables the existing default CPU affinity scheme, providing no affinity beyond what the scheduler does automatically.
* `static`. This policy allows pods with certain resource characteristics to be granted increased CPU affinity and exclusivity on the node.
<2> Optional. Specify the CPU Manager reconcile frequency. The default is `5s`.

. Create the dynamic kubelet config:
+
[source,terminal]
----
# oc create -f cpumanager-kubeletconfig.yaml
----
+
This adds the CPU Manager feature to the kubelet config and, if needed, the Machine Config Operator (MCO) reboots the node. Enabling CPU Manager itself does not require a reboot.
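+
You can watch the change roll out to the pool, for example:
+
[source,terminal]
----
# oc get machineconfigpool worker
----
+
When the pool has finished applying the new configuration, the `UPDATED` column shows `True`.
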
. Check for the merged kubelet config:
+
[source,terminal]
----
# oc get machineconfig 99-worker-XXXXXX-XXXXX-XXXX-XXXXX-kubelet -o json | grep ownerReference -A7
----
+
.Example output
[source,json]
----
"ownerReferences": [
    {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "KubeletConfig",
        "name": "cpumanager-enabled",
        "uid": "7ed5616d-6b72-11e9-aae1-021e1ce18878"
    }
],
----
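+
The `XXXXX` segments in the machine config name stand in for a generated value, so the exact name differs on every cluster. One way to find it is to list the kubelet machine configs:
+
[source,terminal]
----
# oc get machineconfig | grep kubelet
----
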
. Check the worker for the updated `kubelet.conf`:
+
[source,terminal]
----
# oc debug node/perf-node.example.com
sh-4.2# cat /host/etc/kubernetes/kubelet.conf | grep cpuManager
----
+
.Example output
[source,terminal]
----
cpuManagerPolicy: static <1>
cpuManagerReconcilePeriod: 5s <1>
----
<1> These settings were defined when you created the `KubeletConfig` CR.

. Create a pod that requests one or more cores. Both limits and requests must have their CPU value set to a whole integer; that integer is the number of cores that will be dedicated to this pod:
+
[source,terminal]
----
# cat cpumanager-pod.yaml
----
+
.Example output
[source,yaml]
----
apiVersion: v1
kind: Pod
metadata:
  generateName: cpumanager-
spec:
  containers:
  - name: cpumanager
    image: gcr.io/google_containers/pause-amd64:3.0
    resources:
      requests:
        cpu: 1
        memory: "1G"
      limits:
        cpu: 1
        memory: "1G"
  nodeSelector:
    cpumanager: "true"
----

. Create the pod:
+
[source,terminal]
----
# oc create -f cpumanager-pod.yaml
----

. Verify that the pod is scheduled to the node that you labeled:
+
[source,terminal]
----
# oc describe pod cpumanager
----
+
.Example output
[source,terminal]
----
Name:               cpumanager-6cqz7
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               perf-node.example.com/xxx.xx.xx.xxx
...
    Limits:
      cpu:     1
      memory:  1G
    Requests:
      cpu:     1
      memory:  1G
...
QoS Class:       Guaranteed
Node-Selectors:  cpumanager=true
----
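+
Because the CPU request is a whole integer and requests equal limits, the pod is assigned the `Guaranteed` QoS class, which is what makes it eligible for exclusive cores under the `static` policy. You can also query the class directly:
+
[source,terminal]
----
# oc get pod cpumanager-6cqz7 -o jsonpath='{.status.qosClass}'
----
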
. Verify that the `cgroups` are set up correctly. Get the process ID (PID) of the `pause` process from the cgroup hierarchy:
+
[source,terminal]
----
# ├─init.scope
│ └─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 17
└─kubepods.slice
  ├─kubepods-pod69c01f8e_6b74_11e9_ac0f_0a2b62178a22.slice
  │ ├─crio-b5437308f1ad1a7db0574c542bdf08563b865c0345c86e9585f8c0b0a655612c.scope
  │ └─32706 /pause
----
+
Pods of quality of service (QoS) tier `Guaranteed` are placed within the `kubepods.slice`. Pods of other QoS tiers end up in child `cgroups` of `kubepods`:
+
[source,terminal]
----
# cd /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-pod69c01f8e_6b74_11e9_ac0f_0a2b62178a22.slice/crio-b5437308f1ad1a7db0574c542bdf08563b865c0345c86e9585f8c0b0a655612c.scope
# for i in `ls cpuset.cpus tasks` ; do echo -n "$i "; cat $i ; done
----
+
.Example output
[source,terminal]
----
cpuset.cpus 1
tasks 32706
----

. Check the allowed CPU list for the task:
+
[source,terminal]
----
# grep ^Cpus_allowed_list /proc/32706/status
----
+
.Example output
[source,terminal]
----
Cpus_allowed_list:    1
----
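+
As an alternative check, `taskset` reports the same affinity for the PID:
+
[source,terminal]
----
# taskset -pc 32706
----
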
. Verify that another pod (in this case, the pod in the `besteffort` QoS tier) on the system cannot run on the core allocated for the `Guaranteed` pod:
+
[source,terminal]
----
# cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podc494a073_6b77_11e9_98c0_06bba5c387ea.slice/crio-c56982f57b75a2420947f0afc6cafe7534c5734efc34157525fa9abbf99e3849.scope/cpuset.cpus
0
# oc describe node perf-node.example.com
----
+
.Example output
[source,terminal]
----
...
Capacity:
 attachable-volumes-aws-ebs:  39
 cpu:                         2
 ephemeral-storage:           124768236Ki
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      8162900Ki
 pods:                        250
Allocatable:
 attachable-volumes-aws-ebs:  39
 cpu:                         1500m
 ephemeral-storage:           124768236Ki
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      7548500Ki
 pods:                        250
...
  Namespace   Name               CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------   ----               ------------  ----------  ---------------  -------------  ---
  default     cpumanager-6cqz7   1 (66%)       1 (66%)     1G (12%)         1G (12%)       29m

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource   Requests      Limits
  --------   --------      ------
  cpu        1440m (96%)   1 (66%)
----
+
This VM has two CPU cores. The `kube-reserved` setting reserves 500 millicores, meaning that half of one core is subtracted from the total capacity of the node to arrive at the `Node Allocatable` amount. You can see that `Allocatable CPU` is 1500 millicores. This means you can run one of the CPU Manager pods, since each one takes a whole core and a whole core is equivalent to 1000 millicores. If you try to schedule a second pod, the system accepts the pod, but it is never scheduled:
+
[source,terminal]
----
NAME               READY   STATUS    RESTARTS   AGE
cpumanager-6cqz7   1/1     Running   0          33m
cpumanager-7qc2t   0/1     Pending   0          11s
----
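+
To see why the second pod stays `Pending`, you can inspect its events; the scheduler records a `FailedScheduling` event that cites insufficient CPU:
+
[source,terminal]
----
# oc describe pod cpumanager-7qc2t
----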